The Federal Government Just Declined to Vet Frontier AI Models Before Release. That Diligence Did Not Disappear; It Landed on Your Vendor Questionnaire.
On the morning of May 21, 2026, President Trump postponed an executive order he was scheduled to sign that day. The reason he gave was about the race with China, not about the mechanism: "I didn't like certain aspects of it. I postponed it," he told reporters, adding, "I think it gets in the way of, you know, we're leading China, we're leading everybody, and I didn't want to do anything to get in the way of that lead." The postponement followed a round of late opposition from AI adviser David Sacks, Mark Zuckerberg, and Elon Musk, several of whom reached the president the night before or the morning of the signing.
Most of the coverage treated this as a story about palace politics: who called whom, which agency wanted to own the testing framework, whether Treasury (rather than CISA or NIST) had any business leading it. That is a real story, but it is not the one that matters to anyone buying AI. The draft order contained a section on "covered frontier models" that would have asked labs to share models with the government at least 90 days before public release, prompted by concern over the vulnerability-finding capabilities of systems like Anthropic's Mythos and OpenAI's GPT-5.5-Cyber. Under the postponed draft, Treasury, the NSA, and CISA would have jointly built a classified benchmarking process to set the thresholds that defined a covered model.
Strip away the agency turf fight and a concrete category of work remains: somebody evaluates a frontier model's dangerous capabilities before that model is deployed into systems that other people depend on. The federal government just declined to be that somebody, at least by mandate. The evaluation does not become unnecessary because the order was shelved. It becomes unowned, and unowned diligence does not evaporate; it falls to whoever is contractually exposed to the model. The same migration followed Colorado's stayed AI law, when its procurement artifacts moved to sector floors rather than disappearing. In the enterprise, the party left holding it is the buyer.
The vetting work is real, and it now lands on third-party risk by default
I spent a year at Houlihan Lokey running diligence on M&A transactions, and the structural lesson from that work applies cleanly here: when a regulator declines to certify something, the obligation to verify it does not disappear; it moves to the party with money at risk. In a deal, that is the acquirer's diligence team. In AI procurement, that is the security, procurement, and governance functions that sign off on a vendor.
Pre-deployment capability evaluation is the most upstream control in the entire AI risk stack. It asks a question no downstream control can answer: before this model touches anything, what can it actually do at the edges of its training, including the things its builder did not intend and would prefer not to discuss? A red-team exercise on your own deployment cannot reach that question, because you are testing a guardrailed production endpoint, not the underlying model's raw capability. A SOC 2 report cannot reach it either, because SOC 2 attests to control operation, not to what a model is latently able to do.
The current AI vendor questionnaire reflects exactly this gap. The questionnaires that have appeared over the past year now ask how a vendor evaluates and monitors AI outputs before delivery, and they demand roughly 90 days of documented governance evidence. What they do not contain is a single row for pre-release model capability evaluation or pre-deployment government testing. The questionnaire asks how the vendor watches the model in production. It does not ask what the model was found capable of before anyone shipped it. That row is missing because, until May 21, buyers reasonably assumed the question would be answered upstream of them.
The federal substitute exists, but it produces nothing a buyer can cite
The natural objection is that the federal mechanism already exists in voluntary form, so nothing was actually lost. The Center for AI Standards and Innovation, the renamed successor to the US AI Safety Institute, signed pre-deployment evaluation agreements with Google DeepMind, Microsoft, and xAI on May 5, 2026; Anthropic and OpenAI signed originally in August 2024 and renegotiated under America's AI Action Plan. All five major labs now have voluntary agreements in place. CAISI has completed more than 40 model evaluations, including unreleased systems. The vetting, in other words, is happening regardless of the executive order.
It is happening, but it is not citable, and for vendor diligence that distinction is what decides whether the evaluation is usable at all. The CAISI agreements are voluntary, with no legal enforcement: no company is compelled to submit a model, and no public process exists for what happens when a finding flags a model as unsafe. There is no published pass, no published fail, no remediation path, no certificate, no expiration date. A buyer cannot point a questionnaire at an artifact that is not produced. "CAISI evaluated this model" is not a control a procurement team can rely on, because the sentence has no verifiable predicate; you cannot attach it to an outcome. Compare that to a Fannie Mae governance mandate, where the regulator publishes an explicit requirement a vendor either meets or does not. The CAISI regime has the testing without the attestation, which from the buyer's chair is the half that does not help.
This is also where the ad hoc replacements make the buyer's position worse rather than better. With the standing order shelved, the labs are reportedly moving to bespoke arrangements: Anthropic's Mythos under something called "Project Glasswing," OpenAI's GPT-5.5 under a "trusted access" program. When a lab does publish an auditable access rubric, as I worked through in the case of OpenAI's frontier cyber rubric and the same-week Mythos breach, a buyer at least has a fixed disclosure to score against; a bespoke government arrangement offers no such fixed thing. A single published standard, even a voluntary one, is at least something a questionnaire can reference. A per-model, per-lab government arrangement is uncitable by construction: the buyer cannot verify the terms, cannot see the findings, and cannot compare one vendor's arrangement against another's. A standard would have given the diligence process a common denominator. The bespoke arrangements give it a folder of things it has to take on faith.
The hardest part: the most important test cannot migrate to you at all
Grant that a buyer wants to close the gap and run the evaluation themselves. Here is the wall. The evaluation that carries the most signal is guardrails-off capability probing, and that is precisely the test a buyer cannot run against a vendor's production model. CAISI's value comes from the fact that developers frequently provide models with reduced or removed safeguards so evaluators can measure raw capability across cybersecurity, biosecurity, and chemical-weapons risk domains. That is the load-bearing test, because the entire concern is what the model can do when the safety layer is stripped, not what it does with the safety layer on.
No enterprise buyer will ever be handed a de-guardrailed frontier model for diligence. The vendor will not ship it, the vendor's lawyers will not allow it, and the buyer has neither the eval harness nor the classified threat context to interpret the results. So the single most valuable input to pre-deployment risk is structurally non-transferable to the buyer's process. You can approximate everything else through documentation, contracts, and your own production red-teaming. This one input you cannot approximate at all; you can only ask the vendor to assert something about it and write the assertion into the contract. This is the same vendor-self-assessment limit I worked through in reading Anthropic's safety report as threat intelligence: when the only evidence of a model's dangerous capability comes from the party being evaluated, the document is intelligence about the vendor's own findings, not a control you can operate.
That constraint is what makes this different from ordinary vendor risk, where given enough budget a buyer can usually reproduce a control internally. Jurisdictional exposure, for instance, is something a diligence team can actually trace; I worked through that in detail on the Manus and Meta jurisdictional-diligence question, where the relevant facts sit in corporate filings and infrastructure you can inspect. Raw model capability under removed safeguards is not inspectable from outside the lab. The buyer is left holding a question that only the vendor can answer and only the vendor can verify.
What the row actually says, and what answer to expect
So here is the honest version of the row to add to the AI vendor questionnaire, written to match what a vendor can truthfully answer rather than what a buyer would prefer to verify:
- Did an independent party, governmental or otherwise, evaluate this model's dangerous-capability profile, including under reduced-safeguard conditions, prior to release? Name the party. An acceptable answer names a specific evaluator; "yes, internally" is not one.
- What artifact resulted, and will you provide it or a redacted summary under NDA? Expect, in most cases, that no pass/fail artifact exists, because CAISI does not produce one. The honest vendor answer is a description of the process and a summary of domains tested, not a certificate.
- Do you contractually attest that no evaluated capability exceeded the thresholds in your published safety policy, and will you notify us within a defined window if a post-release finding changes that? This is the part that has teeth, because it converts "trust us" into a representation you can sue on. The notification window is its own design problem, and the Glasswing disclosure-clause revision showed how much rides on the exact timing language.
Notice what that row resolves to. It is not a verifiable control; it is a contractual attestation plus a notification obligation. The buyer cannot independently confirm the underlying capability claim, because the de-guardrailed test that would confirm it is one only the lab and its chosen evaluators can run. What the buyer can do is force the vendor to put the claim in writing, attach liability to it, and commit to a disclosure timeline. That is a meaningfully different posture from an unrecorded assurance, even though it falls short of verification.
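For teams that track questionnaire rows in code rather than in a spreadsheet, the row can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a standard: the class name, field names, status labels, and scoring rule are all hypothetical. The one design choice worth copying is that no "verified" status exists, because nothing a vendor returns on this row can reach it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RowStatus(Enum):
    # Hypothetical labels. There is deliberately no VERIFIED value:
    # the buyer cannot run the reduced-safeguard test that would confirm the claim.
    NOT_ACCEPTABLE = "not_acceptable"
    ASSERTION_ONLY = "assertion_only"
    ENFORCEABLE_ATTESTATION = "enforceable_attestation"


@dataclass(frozen=True)
class PreReleaseEvalRow:
    """One vendor's answers to the three questions in the row (hypothetical schema)."""
    independent_evaluator: Optional[str]      # named party, or None if unnamed
    reduced_safeguard_testing: bool           # vendor's claim; recorded, not verifiable
    artifact_description: Optional[str]       # process summary or redacted findings
    contractual_attestation: bool             # representation written into the contract
    notification_window_hours: Optional[int]  # disclosure window, if one was agreed


# Answers that amount to "yes, internally" rather than a named evaluator.
_SELF_ASSESSMENT = {"", "internal", "internally", "yes, internally"}


def score_row(row: PreReleaseEvalRow) -> RowStatus:
    """Classify the answer by what the buyer can enforce, not by what it proves."""
    evaluator = (row.independent_evaluator or "").strip().lower()
    if evaluator in _SELF_ASSESSMENT:
        return RowStatus.NOT_ACCEPTABLE
    has_window = (
        row.notification_window_hours is not None
        and row.notification_window_hours > 0
    )
    if row.contractual_attestation and has_window:
        return RowStatus.ENFORCEABLE_ATTESTATION
    return RowStatus.ASSERTION_ONLY


# Illustrative values only; the 72-hour window is an assumption, not a recommendation.
example = PreReleaseEvalRow(
    independent_evaluator="CAISI",
    reduced_safeguard_testing=True,
    artifact_description="process description and summary of domains tested",
    contractual_attestation=True,
    notification_window_hours=72,
)
print(score_row(example).value)
```

The scoring rule ignores the artifact field on purpose. In most cases no pass/fail artifact exists, so the sketch treats whatever the vendor provides as supporting context and lets the named evaluator, the attestation, and the notification window decide the status.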
This also reframes the agency turf fight that drove the news coverage. The argument over whether Treasury, CISA, or NIST should own frontier-model testing is the same orphaned-ownership problem every enterprise now has to settle internally: pre-release model risk does not naturally belong to security, to procurement, to legal, or to the AI governance committee, so it tends to belong to none of them until something forces the assignment. The same diffusion of responsibility that left a supply-chain monoculture unexamined operates here. Somebody has to own the attestation row, decide what answer is disqualifying, and hold the vendor to the notification clause.
The politics of this are not settled, which is worth registering before assuming the federal floor stays absent. A Future of Life Institute poll of Republican voters released May 22 found 79% in favor of government testing of AI models for safety and 87% supporting government authority to block models that pose national-security threats, and more than 60 figures, including Steve Bannon, signed a "Humans First" letter urging pre-release testing. A federal mechanism may yet return. Until it does, the question of what a frontier model can do before it ships is one the buyer has to ask directly, on the questionnaire, knowing the most honest answer a vendor can give is a written promise rather than a proof. Add the row anyway, name the owner, and write the notification window into the contract; an attestation you can enforce is the strongest control available when the test that would verify it is one you are never allowed to run.