On April 17, 2026, OpenAI launched the Codex desktop app with the ability to directly control macOS applications. Codex can now move its own cursor, click interface elements, type into form fields, and navigate between apps without human intervention. The same release added multi-agent parallelism, persistent memory across sessions, and 90+ plugins combining skills, app integrations, and MCP servers. Anthropic's Claude Code shipped Agent Teams the same week, letting one Claude session coordinate a team of subagents working on shared projects.
Both vendors are racing each other to build agents with more autonomy, more parallelism, and more surface area. The demos are impressive. The productivity evidence is bleaker than the marketing suggests.
The most rigorous study we have on AI coding productivity is METR's randomized controlled trial of experienced open-source developers. It found that AI coding tools made experienced developers 19% slower on real tasks in codebases they knew well. The same developers believed they were 20% faster. That is a 39-point gap between perception and reality, and 69% continued using the tools after the study ended. Multi-agent tools magnify both halves of that gap: more parallel agents means more perceived momentum and more real overhead. OpenAI and Anthropic just shipped the amplifier.
The Architecture
Codex desktop control gives an AI agent keyboard and mouse access to any macOS application that does not expose an API. The pitch is reasonable: frontend iteration, application testing, and workflows locked behind GUIs. The implementation is that Codex now does what a malicious script with accessibility permissions would do, in exchange for a text prompt.
Multi-agent parallelism compounds that capability. OpenAI's subagent documentation shows the defaults: agents.max_threads is 6 concurrent threads, agents.max_depth is 1 to prevent recursive delegation, and critically, "subagents inherit your current sandbox policy." If the parent session can read your home directory, so can every child it spawns. If the parent can run shell commands, so can every child it spawns. OpenAI's own docs note that "subagent workflows consume more tokens than comparable single-agent runs." More parallelism means more cost, more tool calls, and more surface area for each model to do something the user did not intend.
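What those defaults mean in practice can be sketched in a few lines. This is an illustrative model, not OpenAI's actual API; the class and field names are made up, but the three documented behaviors are the ones modeled: a thread cap of 6, a depth cap of 1, and wholesale sandbox inheritance.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SandboxPolicy:
    # Hypothetical policy model: what a session is allowed to touch.
    readable_paths: frozenset = frozenset()
    can_run_shell: bool = False

@dataclass
class AgentSession:
    name: str
    policy: SandboxPolicy
    max_depth: int = 1     # mirrors the agents.max_depth default
    max_threads: int = 6   # mirrors the agents.max_threads default
    children: list = field(default_factory=list)

    def spawn(self, name: str) -> "AgentSession":
        if self.max_depth < 1:
            raise RuntimeError("recursive delegation blocked")
        if len(self.children) >= self.max_threads:
            raise RuntimeError("thread limit reached")
        # The documented default: the child inherits the parent's
        # sandbox policy wholesale, not a narrowed subset of it.
        child = AgentSession(name, self.policy, max_depth=self.max_depth - 1)
        self.children.append(child)
        return child

parent = AgentSession("lead", SandboxPolicy(frozenset({"~"}), can_run_shell=True))
child = parent.spawn("worker-1")
assert child.policy == parent.policy  # full inheritance, by default
```

The last assertion is the whole problem in one line: nothing about the child's task narrowed what the child is allowed to do.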
Plugins expand the surface again. Codex ships with 90+ integrations spanning skills, MCP servers, and third-party tools. This is the same MCP ecosystem where a design flaw just put 200,000 servers at risk of complete takeover, where Anthropic's response was "expected behavior" and input sanitization is the developer's responsibility. Codex desktop inherits the MCP security debt wholesale and adds a new layer on top.
Claude Code's Agent Teams follow a similar pattern. One session acts as team lead, assigns tasks, and synthesizes results. Teammates work independently in their own context windows and communicate directly with each other. The architecture is elegant. The permissions model is the same one that already failed when Anthropic shipped Claude's computer use API last year and prompt injection moved from the chat window to the mouse cursor.
Where the Assumptions Break
The productivity assumption is the first thing to crack. The METR study measured experienced developers on codebases averaging more than a million lines and 22,000 GitHub stars. These are not novices, and the codebases are not toy problems. With AI enabled, participants spent less time writing and searching and more time prompting, reviewing, and waiting. The perception gap held even after the data came out. Multi-agent parallelism does not solve that; it multiplies it. The output of six agents drafting in parallel still has to be consolidated by one human, and that human is the real bottleneck.
The security assumption is the next thing to crack. On April 16, 2026, the Cloud Security Alliance released a study finding that 53% of organizations have had AI agents exceed their intended permissions, and 47% experienced an AI-agent-related security incident in the past year. 54% of organizations already run between one and one hundred unsanctioned AI agents with unclear ownership. Only 13% feel prepared for the regulatory scrutiny they expect. This was before Codex desktop control and Claude Agent Teams expanded what a misbehaving agent could actually do.
The accountability assumption is the third thing to crack. In the last sixteen months, AI agents have already deleted production databases, leaked customer data, and misconfigured cloud infrastructure. The pattern I documented in "Ten AI Agents Destroyed Production. Zero Postmortems." has not improved. When the agent was a single model making sequential tool calls, post-incident forensics was hard but tractable. When the agent is six parallel subagents sharing a sandbox and clicking buttons in six apps at once, good luck reconstructing which one did what. The Snowflake Cortex incident already demonstrated that an AI coding agent can flip its own sandbox flag without asking. Now multiply that by a default of six concurrent agents inheriting the same sandbox.
The cost assumption fails quietly. OpenAI's own documentation tells you subagents consume more tokens. Anthropic's Agent Teams has the same property. Enterprise buyers who signed capacity agreements for single-agent use are about to discover that the defaults have shifted under them. This is the same compute dynamic I wrote about in "AI's Compute Crisis Is Real," now compounding at the architectural level. Vendors hit rate limits, developers get throttled, and the next capacity negotiation starts from a higher floor.
The Counterargument
Multi-agent is not a meaningless capability. For tasks that genuinely parallelize, six agents running in parallel can finish faster than one agent running sequentially. Linting a monorepo, generating test cases across modules, and triaging a backlog of issues are all real wins. The 69% of METR participants who kept using AI even when measurably slower were not irrational. They liked not having to context-switch, they liked the option to delegate boring work, and they valued the ergonomics even when the wall-clock time was worse.
The counterargument is that productivity is not the only thing being optimized. If the tool reduces cognitive load on boring tasks and frees experienced developers to focus on the hard ones, the wall-clock time is the wrong metric. That's fair. But the enterprise procurement story being sold right now is not "this will reduce cognitive load." It is "this will make your engineers faster." The buyers signing seven-figure contracts are being told the output will scale with the headcount that adopts it, and that assumption is the one METR's data directly contradicts.
What Would Actually Change the Math
Three changes would close the gap between what multi-agent tools are and what enterprises are buying.
Per-agent scoped permissions, not inherited blanket sandboxes. The Okta identity crisis research I covered in "AI Agents Are an Identity Crisis" was already pointing in this direction before Codex desktop shipped. The architectural fix is a per-agent identity with a scoped permission set, not a parent session that spawns children who inherit everything by default. OpenAI's subagent documentation acknowledges that custom agents can override sandbox settings individually. That capability needs to become the default, not the exception.
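The alternative is simple enough to sketch. In this hypothetical model (the names are illustrative, not any vendor's API), a child scope must be explicitly requested and must be a subset of the parent's; anything not requested is denied, and nothing is inherited silently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    paths: frozenset
    tools: frozenset

def narrow(parent: Scope, paths: set, tools: set) -> Scope:
    """Grant a child scope only if it is a subset of the parent's."""
    requested = Scope(frozenset(paths), frozenset(tools))
    if not (requested.paths <= parent.paths and requested.tools <= parent.tools):
        raise PermissionError("child scope exceeds parent scope")
    return requested  # explicit, narrowed, default-deny

parent = Scope(frozenset({"/repo", "/tmp"}), frozenset({"git", "pytest", "shell"}))
# A test-runner subagent asks only for what its task needs:
runner = narrow(parent, paths={"/repo"}, tools={"pytest"})
```

The design choice is the direction of the default: the inherited-sandbox model starts from "everything the parent has" and hopes nothing goes wrong; this one starts from nothing and makes every grant visible.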
Action auditing that survives parallelism. Every click, keystroke, and tool call should be attributable to a specific agent instance with a specific task. The GitHub Agent HQ control plane I wrote about is the closest thing to this pattern in the market today, and it still lags the runtime by months. Security teams need to be able to answer "which agent ran which command in which session" without reverse-engineering the logs.
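A minimal version of that attribution requirement, assuming the runtime tags every action at the moment it happens; the field names and helper are made up for illustration, not any vendor's schema.

```python
import time

class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, session_id, agent_id, task_id, action, detail):
        self.entries.append({
            "ts": time.time(),
            "session": session_id,
            "agent": agent_id,    # which of the parallel agents acted
            "task": task_id,      # which assigned task it was serving
            "action": action,     # e.g. "shell", "click", "keystroke"
            "detail": detail,
        })

    def who_ran(self, action, detail_substring):
        # The question security teams need answered in one query.
        return [(e["agent"], e["session"]) for e in self.entries
                if e["action"] == action and detail_substring in e["detail"]]

log = AuditLog()
log.record("sess-1", "agent-3", "task-42", "shell", "rm -rf build/")
log.record("sess-1", "agent-5", "task-17", "click", "Deploy button")
print(log.who_ran("shell", "rm -rf"))  # [('agent-3', 'sess-1')]
```

The point is not the data structure; it is that attribution gets recorded at action time, because no amount of log archaeology recovers it afterward from six interleaved agents.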
Cost visibility per agent, not per session. Subagents consume more tokens; that's in the docs. What's not in the docs is how to see that cost as it accrues, per agent, per task, with a hard stop before runaway recursion eats through a monthly budget. The YOLO-mode convenience trap that Sam Altman himself fell into inside OpenAI is now shipped as a default product configuration. The fix is not better marketing; it is a meter on the machine.
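What a meter on the machine could look like, as a sketch: assuming the runtime reports token usage per call, track spend per agent and refuse the call that would cross the limit. The names and limits are invented for illustration.

```python
class TokenMeter:
    def __init__(self, per_agent_limit: int):
        self.per_agent_limit = per_agent_limit
        self.usage: dict[str, int] = {}

    def charge(self, agent_id: str, tokens: int) -> None:
        spent = self.usage.get(agent_id, 0) + tokens
        if spent > self.per_agent_limit:
            # Hard stop: refuse the call rather than bill it.
            raise RuntimeError(f"{agent_id} over budget: {spent} tokens")
        self.usage[agent_id] = spent

    def report(self) -> dict[str, int]:
        # Cost as it accrues, per agent, not one session-level total.
        return dict(self.usage)

meter = TokenMeter(per_agent_limit=50_000)
meter.charge("agent-1", 30_000)
meter.charge("agent-1", 15_000)
print(meter.report())  # {'agent-1': 45000}
```

A per-session total hides exactly the failure mode that matters here: one subagent looping while five behave looks fine in aggregate until the invoice arrives.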
The multi-agent releases this week are not the wrong direction. They are the right direction shipped without the governance that should have accompanied them. OpenAI's own documentation tells you subagents consume more tokens and inherit sandboxes. CSA's data tells you half of enterprises have already watched AI agents exceed their permissions. METR's study tells you the productivity story is 39 points more optimistic than the reality. These are not opposing views. They are the same story from three angles.
The agents just got hands. The evidence says the operator still matters more than the tool, and the operator is a human who already believes the agent is faster than it is. That gap is where enterprise AI wins and loses in 2026.