Braintrust's Breach Proves AI Eval Platforms Are Credential Warehouses. SOC 2 Audits Them Like They Are Not.

Braintrust confirmed on May 6, 2026 that an AWS account holding customer integration data was breached on May 4, and it told every customer to rotate every API key the platform held on their behalf. The customer roster Braintrust published with its February Series B includes Notion, Replit, Cloudflare, Ramp, and Dropbox, and the platform brokers calls to model endpoints across OpenAI, Anthropic, Google, AWS Bedrock, Azure OpenAI, Together AI, Fireworks, Groq, and Replicate. The product page still says the proxy "cannot see your data and does not store or log API keys" nine days after the public disclosure. The breach notice and the product-page claim are both true; they describe two different product modes that the same SOC 2 Type II scope cannot tell apart.

AI Evaluation Platforms Have Become the Most Concentrated Credential Layer in the Stack

The thesis is narrow and load-bearing: an AI evaluation and observability vendor that runs in proxy-and-store mode is custodying more sensitive third-party credentials per customer than almost any other category of SaaS in the enterprise AI stack, and procurement instruments do not yet treat that custodianship as a distinct scope. A traditional observability vendor (Datadog, Honeycomb) holds API tokens for its own ingestion endpoints. An AI eval vendor that brokers calls to provider APIs holds the customer's keys to OpenAI, Anthropic, Google, and every other lab the customer integrates with. Those are not equivalent risk surfaces, and the questionnaire that approves both vendors is the same questionnaire.

Jaime Blasco, CTO of Nudge Security, described the shape to SecurityWeek directly: "The blast radius isn't Braintrust, it's every downstream customer's AI stack, and a single SaaS compromise fans out across dozens of LLM provider accounts." That is structurally identical to the fourth-party-AI vendor taxonomy I walked through in the Vercel Context breach post, except the credential concentration is one architectural layer deeper. The eval platform sits between the customer's prompt and every model the customer wants to evaluate, and the proxy mode that makes evaluation easy makes the platform the single key custodian for every provider relationship the customer holds.

The Evidence: One IAM Compromise, Every Customer's Keys

SecurityWeek reported that Braintrust detected the breach on May 4 and notified customers on May 5, with public disclosure following on May 6. SecurityWeek also reported that three additional customers, beyond the one Braintrust initially confirmed, saw suspicious provider-usage spikes inside the disclosure window, which suggests the attacker had time to exercise at least some of the stolen credentials before rotation completed. A Braintrust spokesperson framed the customer notification as having been issued "out of an abundance of caution." That language is appropriate for the company's outward communications, and it does not change the underlying procurement question: every customer received a rotate-everything advisory, which means the platform's IAM principal structure was a single point of failure for every customer's provider relationships.

The Rescana advisory mapped the incident to MITRE ATT&CK technique T1078.004 (valid cloud accounts) and noted that Braintrust declined to specify which IAM principal was compromised or which tenancy boundary the principal sat behind. The withheld detail is the artifact procurement actually needs and does not get. The customer cannot infer credential blast radius from a SOC 2 report alone; the customer needs the IAM principal structure, the cloud tenancy boundary, and the recovery-time estimate for credentials that the vendor itself cannot rotate, because those credentials belong to a third-party provider the vendor only proxies for.

The Provider Policies Already Made the Bet Look Risky

The two largest model providers have already published policies that should have informed the procurement decision before May 4. OpenAI's Terms of Use prohibit sharing API keys with third parties, and the API key safety best-practice page explicitly warns against giving keys to other applications. Anthropic moved in February 2026 to clarify its terms prohibiting third-party tools from holding Claude credentials, formalizing the same constraint from the other side. Both providers are saying the same thing in different documents: the customer is responsible for key custody, and the customer is not authorized to delegate that custody to an intermediary.

An AI eval platform operating in proxy-and-store mode requires the customer to violate the plain reading of both policies. The customer can argue that the proxy mode is a technical convenience rather than a delegation, and the argument has some force when the proxy genuinely does not store the key. The Braintrust documentation page asserts that the proxy mode does not store keys, and that assertion may well describe the documented proxy product accurately. The breach disclosure tells the customer that some other product mode at the same vendor does store keys, and SOC 2 Type II does not distinguish which mode the customer was using.

The Counterargument: SOC 2 Type II Already Covers This

The strongest objection is that SOC 2 Type II is exactly the audit framework designed to cover third-party data handling, and that Braintrust holds a Type II report. The objection has weight. SOC 2 audits do test logical access controls, encryption at rest, key management, and incident response under the Trust Services Criteria, and a vendor that fails any of those tests will not pass the audit.

Where the objection falls short is scope granularity. The SOC 2 third-party criteria, as documented, test that the vendor has controls over the data the vendor processes; the criteria do not ask the vendor to enumerate which third-party credentials the vendor custodies on the customer's behalf, in which cloud tenancy, under which IAM principal structure. An eval vendor that brokers no provider calls (running purely in customer-bring-key SDK mode) and an eval vendor that custodies keys for one hundred providers across a single shared AWS tenancy will both pass the same SOC 2 Type II audit. The report attests to the vendor's controls; it does not attest to the scope of credentials those controls are protecting. This is the same pattern I traced in the Delve / Context / LiteLLM shared-auditor post: SOC 2 attestation is not a property of the vendor; it is a chain whose properties depend on what the auditor scoped. Here the chain is intact, and the scope is the part that does not include the credential warehouse.

What SOC 2 Cannot Distinguish, the Customer Has to Ask

The resolution is not to retire SOC 2. The resolution is to add a procurement artifact that SOC 2 does not produce and that no current AI-vendor questionnaire requires. Call it a credential blast-radius disclosure. It has five required answers from the vendor, all written into the diligence file before contract signature.

First, a complete inventory of third-party credentials the vendor custodies on the customer's behalf, by provider and credential type (API key, OAuth token, IAM role assumption). This is the asset inventory for the credential warehouse.

Second, the cloud tenancy boundary that contains those credentials: single shared account, per-customer subaccount, or per-customer separate cloud organization. The Braintrust disclosure did not specify this and the customer could not reconstruct it from the SOC 2 report.

Third, the IAM principal structure that has access to the credentials inside the boundary: single shared service principal, per-customer service principal, or per-engineer principal with conditional access. T1078.004 attacks land on whichever principal has the broadest scope.

Fourth, the credential-rotation recovery-time estimate for credentials the vendor cannot rotate itself, because the credentials are issued by a third party. Braintrust's rotate-everything advisory was correct, and the recovery time for every customer was a function of how many provider keys the customer held, not a function of any control Braintrust operated.

Fifth, the contractual notification window for compromise of the credential warehouse specifically, written separately from any general breach-notification clause. The same notification-right gap I named in the Mistral advisory post and the GTIG embargo post applies here, with one addition: the customer needs the notification fast enough to rotate provider keys before the attacker exercises them, and three customers seeing usage spikes inside the notification window is the empirical evidence that the current window is not fast enough.

This is the same procurement-instrument gap I mapped in the Five Eyes agentic-AI questionnaire post, extended to a sixth risk class the framework did not name: credential-custody concentration. It is the same questionnaire I argued was missing a row for self-hosted AI runtimes, missing a different row here. And it is closely related to the privileged-access vendor pattern I covered in the Mercor breach analysis, with the inversion that the privileged access here is to the customer's third-party provider accounts rather than to the customer's own data.
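The five answers can also be written down as a structured record, which makes an incomplete diligence file easy to spot. What follows is a minimal sketch, not a standard schema: the class names, field names, and the missing_answers helper are all hypothetical, and the enumerated options simply mirror the choices listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TenancyBoundary(Enum):
    # Options taken from the second answer; the enum itself is illustrative.
    SHARED_ACCOUNT = "single shared account"
    PER_CUSTOMER_SUBACCOUNT = "per-customer subaccount"
    PER_CUSTOMER_ORGANIZATION = "per-customer separate cloud organization"


class PrincipalStructure(Enum):
    # Options taken from the third answer; the enum itself is illustrative.
    SHARED_SERVICE_PRINCIPAL = "single shared service principal"
    PER_CUSTOMER_SERVICE_PRINCIPAL = "per-customer service principal"
    PER_ENGINEER_CONDITIONAL = "per-engineer principal with conditional access"


@dataclass
class CustodiedCredential:
    provider: str          # e.g. "OpenAI"
    credential_type: str   # "API key", "OAuth token", or "IAM role assumption"


@dataclass
class BlastRadiusDisclosure:
    # None means "the vendor has not answered". An empty credential list is a
    # real answer: the vendor states that it custodies nothing.
    credentials: Optional[List[CustodiedCredential]] = None
    tenancy_boundary: Optional[TenancyBoundary] = None
    principal_structure: Optional[PrincipalStructure] = None
    rotation_recovery_hours: Optional[float] = None
    notification_window_hours: Optional[float] = None


def missing_answers(disclosure: BlastRadiusDisclosure) -> List[str]:
    """Return the names of the five answers the vendor has left blank."""
    checks = [
        ("credential inventory", disclosure.credentials),
        ("cloud tenancy boundary", disclosure.tenancy_boundary),
        ("IAM principal structure", disclosure.principal_structure),
        ("rotation recovery-time estimate", disclosure.rotation_recovery_hours),
        ("notification window", disclosure.notification_window_hours),
    ]
    return [name for name, value in checks if value is None]


# Hypothetical vendor that answered only the first two questions.
partial = BlastRadiusDisclosure(
    credentials=[CustodiedCredential("OpenAI", "API key")],
    tenancy_boundary=TenancyBoundary.SHARED_ACCOUNT,
)
print(missing_answers(partial))
```

The record deliberately distinguishes an unanswered question from an answer of "none", because a vendor running purely in customer-bring-key mode should be able to say so explicitly.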

The Row Missing From the Questionnaire

The procurement question every enterprise AI lead can answer this week is whether the diligence file for their primary evaluation or observability vendor includes a credential blast-radius row. Pull the questionnaire. Search it for the words "third-party credentials," "cloud tenancy," "IAM principal," and "rotation recovery time." If none of the four appears, write the row before the next contract renewal: list every provider whose credentials the vendor custodies, name the cloud account ID that holds them, name the IAM principal with access, and require a contractual notification window measured in hours, not days. Braintrust's customers had to rotate every key on May 5; the customers who write the row now will be rotating fewer keys, faster, when the next eval-platform breach lands.
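The search itself takes a few lines once the questionnaire is exported as plain text. This is a rough sketch under that assumption; the function names are hypothetical, and the matching is deliberately loose (case-insensitive, with hyphens and runs of whitespace treated as a single space) so that "Third Party Credentials" still counts as a hit.

```python
import re
from typing import List

# The four phrases named in the paragraph above.
REQUIRED_TERMS = (
    "third-party credentials",
    "cloud tenancy",
    "IAM principal",
    "rotation recovery time",
)


def _normalize(text: str) -> str:
    # Lowercase, then collapse hyphens and whitespace runs into one space.
    return re.sub(r"[\s\-]+", " ", text.lower())


def missing_terms(questionnaire_text: str) -> List[str]:
    """Return the required phrases that do not appear in the questionnaire."""
    haystack = _normalize(questionnaire_text)
    return [term for term in REQUIRED_TERMS if _normalize(term) not in haystack]


def needs_blast_radius_row(questionnaire_text: str) -> bool:
    # The rule stated above: write the row if none of the four phrases appears.
    return len(missing_terms(questionnaire_text)) == len(REQUIRED_TERMS)


# Hypothetical questionnaire excerpt, for illustration only.
sample = "Vendor holds a SOC 2 Type II report. Describe encryption at rest."
print(missing_terms(sample))
print(needs_blast_radius_row(sample))
```

A phrase match only shows that the topic is mentioned; whether the row actually asks for the inventory, the account, the principal, and the notification window still needs a human read.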
