JetpackCompose.app's Dispatch: Context #3

💌 In today's Context: The ceiling of AI delegation isn't your prompting skill. It's your tests, your linters, your CI pipeline. That's the bottleneck not many people are talking about.

🔥 Meet the most powerful AI in Android Studio 🔥

Code 10x faster. Tell Firebender to create full screens, ship features, or fix bugs, and watch it do the work for you. It's been battle-tested by the best Android teams at companies like Instacart, Tinder, and Adobe.

GM Friends. This is JetpackCompose.app's Dispatch: Context, back with another installment on navigating the AI shift. In the previous issue of Context, I laid out the 40/30/30 framework: three distinct modes of working with AI tools. The response was great, and the obvious follow-up I expect some of you to ask is: "Okay, I buy the framework. But when I try to delegate real work, the results are... mid. What am I doing wrong?"

It's Not the Model. It's Not the Prompt. It's Your Infrastructure.

After watching thousands of engineers use AI agents across teams at Databricks, and before that at Airbnb, I've come to this conclusion:

The ceiling of what you can delegate to AI is exactly equal to the quality of your verification infrastructure.

Not your prompting skill. Not your choice of model. Not whether you're using Claude or GPT or Gemini. The bottleneck is your tests, your linters, your type system, your CI pipeline. That's it.

I know that sounds reductive. It's not. Let me explain.

The Feedback Loop That Changes Everything

When an AI agent works on your codebase, there's a loop happening, whether you realize it or not:

  1. Agent writes code

  2. Verification catches issues (tests fail, linter flags, types don't match)

  3. Agent reads the feedback and fixes

  4. Verification passes

  5. You review and ship

Each cycle can take seconds. Seconds! An agent can iterate 10, 20, 30 times before you even finish your coffee. But here's where it breaks down: if any link in that chain is weak, the whole loop degrades.
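In an Android project, one iteration of that loop can literally be a short script of binary checks. Here's a minimal sketch, assuming a Gradle project with a hypothetical :feature-profile module, Detekt wired up, and Paparazzi screenshot tests; swap in your own module and task names:

```shell
#!/bin/sh
# One iteration of the verification loop. Each step is a binary signal:
# exit code 0 means pass; anything else sends the agent back to fix the code.
set -e
./gradlew :feature-profile:test            # module unit tests (seconds, not minutes)
./gradlew :feature-profile:detekt          # style and convention checks
./gradlew :feature-profile:verifyPaparazzi # screenshot diffs against golden images
echo "all checks passed"
```

Point the agent at this script and the feedback from every step arrives as something it can parse and act on.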

Flaky tests? The agent can't tell if it broke something or if the test is just being moody. No lint rules? The agent generates code that works but violates every convention your team has established. No type safety? The agent produces something that compiles but blows up at runtime in ways that are painful to debug.

When the verification loop is solid, you have an autonomous agent. When it's broken, you have an expensive autocomplete that generates plausible-looking code you now have to manually verify line by line. The difference isn't the agent. It's the loop.

This is exactly what I think explains the METR study that went viral a few months ago, the one showing experienced developers were actually 19% slower when using AI tools, despite believing they were 20% faster. That sounds damning until you ask: what did their verification infrastructure look like? Because I'd bet good money those developers were spending their "saved" time manually reviewing AI output that a good test suite would have validated automatically.

Are you tracking agent views on your docs?

AI agents already outnumber human visitors to your docs, and now you can track them.

The Great Irony

For years, teams cut corners on testing. "We need to move fast." "We'll add tests later." "Manual QA is fine for now." You've heard it. You've probably said it. I definitely have.

Those teams? They're now the ones who can't leverage AI effectively.

The investment they skipped (comprehensive test suites, strict linting, robust CI pipelines) is the exact infrastructure that determines AI ROI. It was always important. But it just went from "best practice we all agree on but quietly ignore" to "force multiplier."

Meanwhile, the teams that were "slow" because they insisted on good test coverage? They're the ones casually delegating 40% of their work to background agents and having it come back correct. The "slow" teams are now the fast ones. I love the irony.

Two teams, same tools, same models, same access. Team A has a mature test suite that runs in under 2 minutes and a thorough AGENTS.md. Team B has spotty coverage and tribal knowledge. Six months in, the delta in how much they can confidently delegate is enormous. And it's widening every month.
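If you're wondering what "a thorough AGENTS.md" looks like in practice, here's a minimal sketch. Every command and convention below is illustrative, not prescriptive; the point is that it tells the agent how to verify its own work:

```markdown
# AGENTS.md

## How to verify your work
- Unit tests for the module you touched: ./gradlew :<module>:test (target: under 30s)
- Style: ./gradlew detekt (zero findings expected)
- UI changes: ./gradlew verifyPaparazzi, then inspect any reported diffs

## Conventions that are enforced
- ViewModels never import from data.local (Konsist architecture test)
- Functions stay under 30 lines (Detekt LongMethod)
```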

Tests Aren't for You Anymore

This is the "mindset shift" that I think matters most, and the one I see fewest people making.

For most of our careers, we wrote tests for ourselves. For our future selves, for our teammates, for that person who'd maintain this code two years from now. Tests were a safety net for humans.

That's still true. But there's a new primary consumer of your test suite: the agent.

You're writing tests so that a machine can know if it succeeded. That changes what "good test coverage" means. You stop optimizing for vanity metrics like code coverage percentage and start optimizing for signal density and fast feedback.

Think about it practically:

  • A test that takes 8 minutes to run? Useless for an agent iterating rapidly. That's 8 minutes of dead time per cycle. The agent needs feedback in seconds, not minutes.

  • A test that's flaky 5% of the time? Actively harmful. The agent can't distinguish "I broke something" from "the test is being unreliable." It'll either waste cycles fixing something that isn't broken, or learn to ignore failures. Neither is good.

  • A test that validates behavior but not architecture? Insufficient. The code might work, but it might work in a way that violates your team's patterns. The agent doesn't know the difference unless your linters and architecture tests tell it.

Here's the framework I've been using: every piece of verification infrastructure should answer a yes/no question the agent can act on. Did the tests pass? Did the linter clear? Did the build succeed? Binary signals. Fast feedback. No ambiguity.
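To make that concrete, here's a toy Kotlin sketch of what "binary signals the agent can act on" means. The names are mine, not from any framework; in real life each lambda would shell out to Gradle:

```kotlin
// Each verification step answers one yes/no question and carries a name
// the agent can act on. Purely illustrative.
data class Check(val name: String, val run: () -> Boolean)

// Returns the name of the first failing check, or null if everything passed.
fun firstFailure(checks: List<Check>): String? =
    checks.firstOrNull { !it.run() }?.name

fun main() {
    val checks = listOf(
        Check("tests") { true },  // stand-ins for real gradle invocations
        Check("lint") { false },
        Check("build") { true },
    )
    // The agent gets one unambiguous answer: which signal failed, if any.
    println(firstFailure(checks) ?: "all green") // prints "lint"
}
```

No ambiguity means the agent never has to guess which stage to fix next.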

The Android Playbook

Alright, let's get specific. You're an Android engineer. You have an existing codebase. How do you actually retrofit verification loops that agents can use?

Here's what I'd prioritize, in order:

1. Module-Level Test Suites That Run in Seconds

This is the single highest-leverage thing you can do. Break your tests down so an agent can run ./gradlew :feature-profile:test and get feedback in 10-15 seconds, not ./gradlew test, which takes 6 minutes across your entire project.

Agents are smart enough to figure out which module they're working in. Give them a fast feedback loop for that module and they'll iterate like a machine (because, well, they are one).
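If your module tests aren't fast yet, a couple of standard Gradle knobs help. A sketch for a feature module's build.gradle.kts (the numbers are illustrative; tune them to your hardware):

```kotlin
// build.gradle.kts (inside a feature module)
tasks.withType<Test>().configureEach {
    // Surface the first failure immediately instead of finishing the full run,
    // so the agent can start fixing sooner.
    failFast = true
    // Fan unit tests out across cores to keep the loop in seconds.
    maxParallelForks = (Runtime.getRuntime().availableProcessors() / 2).coerceAtLeast(1)
}
```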

2. Detekt + Ktlint as Style Guardrails

Your team has conventions. Naming patterns, import ordering, function length limits, whatever. Write them down as lint rules, not wiki pages. An agent can read a Detekt output that flags LongMethod on profileViewModel.loadUser(). It cannot read a Confluence page that says "we prefer functions under 30 lines."

If you're not using Detekt yet, start with the default ruleset and customize from there. The investment is small, especially in an AI-first world where generating code is cheap. The payoff is that every agent session respects your team's style automatically.
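Wiring that up is a few lines of Gradle. A sketch, assuming the Detekt Gradle plugin is applied and your team's rules live in config/detekt.yml:

```kotlin
// build.gradle.kts
detekt {
    // Start from Detekt's defaults, then layer your team's rules on top.
    buildUponDefaultConfig = true
    config.setFrom(files("$rootDir/config/detekt.yml"))
}
```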

3. Compose Preview Screenshots as Visual Regression

This one is underrated. Compose Previews aren't just a development convenience; they're a verification mechanism.

Why does this matter for agents? Because an agent can generate Compose UI that compiles, passes unit tests, and looks completely wrong. Wrong spacing. Wrong colors. Wrong layout behavior at different screen sizes. Screenshot tests catch this. The agent sees "screenshot diff detected," looks at what changed, and fixes it. Without this, you're the one eyeballing every UI change. That doesn't scale. The assumption here is that your infrastructure is "agent accessible" and the agent has a mechanism to view these screenshots. If it doesn't, here's a great idea for your Q2 roadmap.

Having a mechanism to easily visualize the latest changes is a game-changing feature that makes your agents a lot more effective.
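One way to get screenshot coverage without an emulator is Paparazzi, which renders Compose on the JVM. A minimal sketch, where ProfileCard is a hypothetical composable from your app:

```kotlin
import app.cash.paparazzi.DeviceConfig
import app.cash.paparazzi.Paparazzi
import org.junit.Rule
import org.junit.Test

class ProfileCardScreenshotTest {
    @get:Rule
    val paparazzi = Paparazzi(deviceConfig = DeviceConfig.PIXEL_5)

    @Test
    fun profileCard_default() {
        // Renders on the JVM; the verify task diffs this against the recorded
        // golden image and fails on any visual regression.
        paparazzi.snapshot {
            ProfileCard(name = "Ada") // hypothetical composable
        }
    }
}
```

Record goldens once, then the diff failure message becomes another binary signal in the agent's loop.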

4. Architecture Tests

I've been involved with Android for 16 years at this point. I've only stumbled on these libraries as I was writing this newsletter, so I imagine this might be new information for a lot of you as well.

Tools like ArchUnit (or Konsist for Kotlin-specific checks) let you encode architectural rules as tests: "ViewModels should never import from the data.local package directly," "Use cases must go through the Repository layer," etc. Here's an example of what this test looks like in action:

import com.lemonappdev.konsist.api.Konsist
import com.lemonappdev.konsist.api.architecture.KoArchitectureCreator.assertArchitecture
import com.lemonappdev.konsist.api.architecture.Layer
import org.junit.jupiter.api.Test

@Test
fun `clean architecture layers have correct dependencies`() {
    Konsist
        .scopeFromProduction()
        .assertArchitecture {
            // Define layers by their root packages
            val domain = Layer("Domain", "com.myapp.domain..")
            val presentation = Layer("Presentation", "com.myapp.presentation..")
            val data = Layer("Data", "com.myapp.data..")

            // Define architecture assertions
            domain.dependsOnNothing()
            presentation.dependsOn(domain)
            data.dependsOn(domain)
        }
}

These are the rules that live in senior engineers' heads. The ones that an agent will violate not because it's dumb, but because it's locally optimizing without understanding the global architecture. Even if you write these rules down as best practices in your AGENTS.md file, agents will still forget them in long sessions with a lot of tokens in context.

I like to use this line in many conversations I'm having these days:

Policy and best practices without any enforcement is just fiction.

Vinay Gaba (lol)

Architecture tests make the implicit explicit โ€” and that's exactly what an agent needs.
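Beyond layer definitions, Konsist can also encode one-off rules like the "ViewModels should never import from data.local" example above. A sketch with hypothetical package names:

```kotlin
import com.lemonappdev.konsist.api.Konsist
import com.lemonappdev.konsist.api.ext.list.withNameEndingWith
import com.lemonappdev.konsist.api.verify.assertFalse
import org.junit.jupiter.api.Test

class ViewModelRulesTest {
    @Test
    fun `viewmodels never import local data sources directly`() {
        Konsist
            .scopeFromProduction()
            .classes()
            .withNameEndingWith("ViewModel")
            .assertFalse { viewModel ->
                // Fail if any ViewModel's file imports from the local data layer
                viewModel.containingFile.imports.any {
                    it.name.startsWith("com.myapp.data.local")
                }
            }
    }
}
```

When this fails, the agent gets the violating class name in the report instead of a vague "that's not how we do things here."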

The Compound Effect

Here's what happens when you stack these together:

> An agent writes a new feature. 
> Module-level tests run in 12 seconds: two failures. 
> The agent reads the test output, fixes the logic. Tests pass.
> Detekt runs: one style violation. The agent fixes it.
> Compose screenshot test shows a spacing regression. The agent adjusts the padding. 
> Architecture test flags a direct database access from the ViewModel. The agent routes it through the Repository.

Total time: maybe 90 seconds. Total human involvement: zero.

Now compare that to the same agent without this infrastructure.

> Agent generates the feature. It... looks right? 
> You pull it up in Android Studio. 
> Eyeball the code. 
> Run it on an emulator. 
> Click around. 
> Hmm, the spacing looks off. 
> You dig into the ViewModel. Wait, is it accessing the database directly? 
> You check the team conventions doc. Yeah, that's wrong. 
> You either fix it yourself or prompt the agent again with context it should have had.

Total time: 15 minutes. You just became the verification loop. And you don't scale.

The Uncomfortable ROI Conversation

I want to connect this to something broader. You've probably seen the stat floating around: 95% of enterprise AI pilots have not shown measurable financial return. And in surveys, 66% of developers cite "almost right, but not quite" as their top frustration with AI coding tools.

I don't think these numbers mean AI tools don't work. I think they mean most teams deployed agents without building the infrastructure those agents need to verify their own work. They bought the car but didn't build the road.

The teams seeing real ROI aren't the ones with the best prompts or the fanciest models. They're the ones who invested in the boring stuff: tests, linters, CI, documentation. The unsexy infrastructure that turns an agent from "impressive demo" to "reliable teammate."

What This Means for You

If you take one thing from this issue, let it be this: your next hour is better spent improving your test suite than learning a new AI tool or crafting better prompts.

Seriously. Go look at your project right now. Ask yourself:

  • Can an agent run your tests and get feedback in under 30 seconds?

  • Are your tests stable enough that a failure means something actually broke?

  • Do your lint rules encode your team's conventions, or are they the defaults you never customized?

  • If an agent generated a Compose screen, would any automated check catch visual regressions?

If the answer to most of these is "no", that's your bottleneck. Not the model. Not the prompt. The loop.

๐Ÿค The Honor Code

I've been creating content for the Android community for over a decade now. Talks, articles, open-source projects like Showkase, this newsletter, JetpackCompose.app, and more. I've never charged a penny for any of it.

But here's the deal.

If you've read more than one article that helped you. If you've had more than one "aha" moment from something I wrote or shared. If Dispatch has ever made you smarter at your job or helped you learn about something new.

Then I expect you to share this.

Not asking. Expecting. That's the honor code 🤝

Tweet it. Bluesky it. Send it to your team's Slack. Forward this email to that one Android friend who needs to read this. Whatever works.

👂 Let me hear it!

What did you think of this email?

On that note, here's hoping that your bugs are minor and your compilations are error-free,

Vinay Gaba

AI @ Databricks | Google Developer Expert for Android | ex-Airbnb, Snap, Spotify, Deloitte

Vinay Gaba is a Google Developer Expert for Android and serves as an Engineering Leader at Databricks. Previously, he was the Tech Lead Manager at Airbnb, where he spearheaded the UI Tooling team, whose mission was to enhance developer productivity through cutting-edge tools leveraging LLMs and Generative AI. Vinay has deep expertise in LLMs and GenAI, Android, UI Infrastructure, Developer Tooling, Design Systems and Figma Plugins. Prior to these roles, he worked at Snapchat, Spotify, and Deloitte. Vinay holds a Master's degree in Computer Science from Columbia University.
