The autonomous AI agent that launched as "the first AI software engineer" now asks you to approve its plan before it starts working.
Devin, from Cognition Labs, launched in March 2024 with a striking demo: a fully autonomous agent that plans, codes, tests, and ships software with minimal supervision. The framing was bold: the first AI software engineer. The valuation followed: $2 billion at launch, doubling to $4 billion by March 2025.
Then the benchmarks came in. On SWE-bench, the standard test for real GitHub issue resolution, Devin scored 13.86% end-to-end (vendor-reported, on their chosen subset). Independent real-world testing has tracked closer to 14-15%. A meaningful jump over prior baselines (GPT-4 hit 1.96% on the same test). Also: an ~85% failure rate on tasks the product was marketed to handle.
In April 2025, Cognition shipped Devin 2.0. The headline feature was called Interactive Planning: the agent now drafts a step-by-step plan, you approve or adjust it, then it executes. The price also dropped from $500/month to $20/month.
The company that defined the autonomous-agent category quietly rebuilt its product around a human checkpoint.
This is the pattern I want to talk about. Not because Devin is uniquely flawed (it isn't, the team is shipping real work), but because the correction happening inside the highest-profile agent company is the correction the whole industry needs. We've been building agents as autonomous executors when we should be building them as decision surfaces.
The autonomy fantasy
The pitch for autonomous agents has always been seductive. Describe a goal in natural language, let the agent plan and execute, come back when it's done. No prompts to refine, no tools to chain, no glue code. The agent figures it out.
This pitch works in two settings: demos and toy problems. The demo runs in a curated environment with a well-scoped task and the model's training distribution on its side. The toy problem ("organize my calendar," "summarize these emails") has loose enough success criteria that anything reasonable looks correct.
Production work is neither. Production has long-tail edge cases, unstated constraints, real consequences for being wrong, and humans accountable for outcomes. In production, the question isn't "can the agent complete the task?" The real question is: can the agent complete the task reliably enough that the human doesn't have to verify the work, which would defeat the point of the automation?
For most tasks worth automating, the honest answer is: not yet, and possibly not for a while.
What the autonomous loop misses
Spend time building decision-intelligence products and you start seeing the same shape across industries. A claims adjuster, a credit analyst, a clinician, a planner: they're not executing tasks. They're making decisions. The actual work is:
- Frame the question correctly
- Pull the relevant evidence
- Weigh trade-offs against constraints
- Commit to a path with reasoning that can be audited later
Autonomous agents collapse all four into a single opaque action. They execute. They don't surface. The human downstream gets a result (maybe right, maybe wrong) with no view into what alternatives were considered, what evidence was weighed, or where the system was uncertain.
This is fine if you're sorting emails. It's catastrophic if you're approving a loan, scheduling a surgery, or pushing code to production.
The decision-intelligence framing makes the alternative clear. The agent's job isn't to make the decision. The agent's job is to make the decision legible: to frame it, surface the trade-offs, show its work, and hand the choice back to a human with structure. Execute only after the human commits.
This is exactly what Devin 2.0's Interactive Planning does. The agent proposes a plan. You read it. You approve or adjust. Then it executes. The autonomy didn't disappear; it just got placed after a decision checkpoint instead of replacing one.
Three patterns that work better
If autonomous execution is the wrong default, what does the right design look like? Three patterns I keep seeing succeed where pure agents fail.
Plan-then-execute with checkpoints
The agent proposes, the human approves, the agent acts. Devin 2.0 lands here. So do Cursor's agent mode, Claude Code's plan mode, and most production-grade GitHub Copilot workflows. The checkpoint is the product feature, not a limitation.
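The checkpoint pattern fits in a few lines. A minimal sketch, assuming a hypothetical `propose_plan` stands in for the planner (an LLM call in practice) and `approve` is whatever UI the human edits the plan through:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list[str]
    approved: bool = False

def propose_plan(goal: str) -> Plan:
    # Hypothetical stand-in for the agent's planner; a model call in a real system.
    return Plan(steps=[f"analyze: {goal}", f"implement: {goal}", f"test: {goal}"])

def run_with_checkpoint(goal: str, approve) -> list[str]:
    """Agent proposes; nothing executes until the human-approval callback says yes."""
    plan = propose_plan(goal)
    plan = approve(plan)   # human edits, approves, or rejects here
    if not plan.approved:
        return []          # no approval, no side effects
    return [f"executed: {step}" for step in plan.steps]
```

The point is structural: execution is gated behind the approval callback, so the autonomy sits after the decision, not in place of it.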
Structured workflow with selective AI calls
Most "AI agents" in production are actually deterministic workflows that call the LLM at specific decision points. The system knows what tools to use and in what order; the LLM decides which branch to take or interprets a specific input. This is unsexy and it works. My compound AI system for motorcycle telemetry is built this way: Python guardrails handle safety, an ML model handles risk scoring, the LLM orchestrates and explains. No autonomous loop. Zero hallucinated brake checks.
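The shape of such a system is easy to show. A sketch under assumptions: `llm_classify` is a hypothetical stand-in for the single scoped model call, and the routing rules are invented for illustration:

```python
def llm_classify(ticket: str) -> str:
    # Hypothetical LLM call: the model's only job is to pick a branch.
    # Swap in a real client here; the surrounding workflow stays deterministic.
    return "billing" if "invoice" in ticket.lower() else "technical"

def handle_ticket(ticket: str) -> str:
    # Deterministic guardrail first: code, not the model, enforces policy.
    if "password" in ticket.lower():
        return "route:security"        # hard rule, never delegated to the LLM
    branch = llm_classify(ticket)      # the one scoped AI decision point
    # Deterministic routing after the decision.
    return {"billing": "route:finance", "technical": "route:support"}[branch]
```

No loop, no open-ended tool use: the LLM gets exactly one decision, and everything around it is auditable code.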
Decision surfaces, not action surfaces
The agent gathers evidence, weighs options against constraints, and surfaces a structured recommendation: "Here are three viable paths. Path A is fastest but assumes X. Path B is safest but costs Y. Path C is the default if we can't reach you." The human picks. The agent executes the pick. This is closer to how good human analysts work, and it's the design pattern most enterprise buyers actually want, even when they ask for "an autonomous agent."
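Concretely, the agent's deliverable is a data structure, not an action. A minimal sketch (all names and options hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Option:
    name: str
    summary: str
    assumptions: list[str] = field(default_factory=list)

@dataclass
class Decision:
    question: str
    options: list[Option]
    default: str   # the stated fallback if the human never responds

def surface_decision() -> Decision:
    # The agent's output is this structure; execution happens only after a pick.
    return Decision(
        question="How should we roll out the migration?",
        options=[
            Option("A", "fastest path", assumptions=["staging mirrors prod"]),
            Option("B", "safest path", assumptions=["2x budget approved"]),
            Option("C", "defer one sprint"),
        ],
        default="C",
    )

def execute(decision: Decision, choice: Optional[str] = None) -> str:
    picked = choice or decision.default   # human picks; agent falls back to the declared default
    return f"executing option {picked}"
```

Note that the default path is declared up front, in the structure the human reviews, rather than decided silently at execution time.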
The pattern these share: the AI does the cognitive work humans are bad at (synthesis, recall, structured comparison), and humans do the cognitive work AI is bad at (judgment, accountability, knowing when something feels wrong).
Why the industry keeps reaching for autonomy
If structured workflows work better, why does every agent demo lead with autonomy? Three reasons, none of them good.
Autonomy demos better
"Watch the AI build an entire app from a one-line prompt" is a screenshot. "Watch the AI surface three options, then execute the one the user picks" is a workflow diagram.
Autonomy gets bigger valuations
Investors are pricing the future of labor automation, not the future of better software tools. Autonomous framing matches the labor-replacement thesis. Structured workflow framing doesn't.
Autonomy is intellectually easier to scope
"The agent does the whole thing" is one product spec. "The agent surfaces these specific decisions to this specific human role at these specific moments" is dozens of product specs, each requiring real customer research.
The third reason is the one that matters for PMs. Designing decision surfaces is slower, more domain-specific, and harder to demo. It's also where the actual product value lives. Autonomous loops collapse the design problem. Decision surfaces force you to do the design work.
What this means for AI PMs
If you're building agents right now, the question to ask is not "how do we make this more autonomous?" It's: where are the decision checkpoints, who owns each one, and what does the agent need to surface for that decision to be made well?
Two practical moves.
Map the decision points before you build
Walk through the workflow with the actual human user. Find every moment they currently make a judgment call, including the ones they don't notice making. Those are your checkpoints. Build the agent around them, not over them.
Optimize for legibility, not autonomy
The agent's output should make the decision easier, not invisible. Show the alternatives considered. Show the evidence weighed. Show the constraints applied. Show uncertainty when it exists. A 70%-confidence answer flagged as 70% is more useful than a 95%-confidence answer presented as fact.
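One way to enforce this is to make confidence and alternatives part of the output type, so they cannot be dropped on the way to the user. A sketch with a hypothetical `Answer` shape and an invented 0.8 flagging threshold:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    value: str
    confidence: float        # surfaced, never hidden
    alternatives: list[str]  # what else was considered

def render(a: Answer) -> str:
    # Legible output: the uncertainty flag and alternatives travel with the answer.
    flag = "LOW-CONFIDENCE: " if a.confidence < 0.8 else ""
    alts = ", ".join(a.alternatives) or "none"
    return f"{flag}{a.value} (confidence {a.confidence:.0%}; alternatives: {alts})"
```

A 70%-confidence answer renders with the flag attached; the human sees the hedge instead of a false certainty.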
Devin 2.0's Interactive Planning isn't a retreat from agentic AI. It's the category growing up. The teams that figure this out (that build agents as decision surfaces instead of autonomous executors) will ship products that actually work in production. The teams that don't will keep launching impressive demos at 14% reliability.
The autonomy will come. It'll come slowly, narrowly, and at specific checkpoints inside larger structured systems. The agents that survive the next two years won't be the ones with the most autonomy. They'll be the ones that knew when to ask.