Copy Fail Made Shared-Kernel SaaS Risk Concrete. Your Vendor's SOC 2 Attestation Does Not Tell You Whether Their Node Pools Were Patched.
CVE-2026-31431, the kernel vulnerability now known as Copy Fail, lands a deterministic 4-byte write into Linux's page cache via the algif_aead AEAD module, and that cache is shared across containers and the host. Microsoft's Defender team estimates the affected population at "millions of Kubernetes clusters." Unit 42 confirmed the affected kernel range as 4.14 through 6.19.12, which covers every Linux distribution shipped since 2017, and reported a 732-byte proof-of-concept reaching 100% reliability. A public PoC on GitHub already validates cross-tenant container escape on EKS, GKE, and Alibaba ACK. CISA added Copy Fail to the Known Exploited Vulnerabilities catalog on May 1, 2026, with a federal civilian patch deadline of May 15 under BOD 22-01: the same KEV-catalog floor I walked through in the edge-device reckoning post, now landing inside the managed-Kubernetes substrate.
For a CIO buying SaaS, the relevant question is no longer whether Copy Fail is bad. It is bad. The relevant question is what your vendors did between April 29, when disclosure happened, and today. The standard SOC 2 attestation, which says only that the vendor runs on AWS or GCP, cannot answer that question, because the cloud provider's patched AMI does not patch the vendor's existing node pools by default. The diligence question that used to live in a footnote now sits at the top of the renewal checklist.
The Failure: A Container Escape That Crosses Tenant Boundaries
Copy Fail is not the usual kernel bug that a SOC 2 report can hand-wave with "we patch within 30 days." The page cache the bug corrupts is shared between containers on the same node, and, through shared base images, between containers belonging to different tenants. Wiz Research documented the shared-base-image container-escape mechanism explicitly: containers built from the same upstream image share page-cache pages on the host, which means a deterministic 4-byte write inside one container is reachable from another. On a multi-tenant Kubernetes node, that is a cross-tenant escape.
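The sharing mechanism itself is observable without touching the vulnerability. mincore(2) reports which pages of a file are already resident in the host page cache before your process has read a byte; on a container host, residency you did not cause is residency some other workload caused. Here is a minimal sketch in Python via ctypes, assuming glibc on a 64-bit Linux host; the default target path is just an illustrative shared-library location, not anything specific to Copy Fail:

```python
import ctypes
import os
import sys

libc = ctypes.CDLL("libc.so.6", use_errno=True)  # assumes glibc
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PAGE = os.sysconf("SC_PAGE_SIZE")
PROT_READ, MAP_SHARED = 0x1, 0x1
MAP_FAILED = ctypes.c_void_p(-1).value

def resident_pages(path):
    """Count pages of `path` already in the host page cache, via mincore(2)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        addr = libc.mmap(None, size, PROT_READ, MAP_SHARED, fd, 0)
    finally:
        os.close(fd)
    if addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    npages = (size + PAGE - 1) // PAGE
    vec = (ctypes.c_ubyte * npages)()
    # Mapping alone faults nothing in, so mincore() reports pages that were
    # already cached -- on a container host, potentially because another
    # container built from the same base image read the same file.
    if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec) != 0:
        raise OSError(ctypes.get_errno(), "mincore failed")
    libc.munmap(ctypes.c_void_p(addr), ctypes.c_size_t(size))
    return sum(v & 1 for v in vec), npages

# Illustrative target: any file that ships in a widely shared base image.
target = sys.argv[1] if len(sys.argv) > 1 else "/lib/x86_64-linux-gnu/libc.so.6"
resident, total = resident_pages(target)
print(f"{target}: {resident}/{total} pages already in the page cache")
```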
The public PoC operationalizes this on the three multi-tenant managed Kubernetes services most enterprise SaaS vendors actually run on. EKS, GKE, and Alibaba ACK are all confirmed exploitable in the Percivalll repository, which is to say: the Snowflake Cortex sandbox-escape pattern I wrote about in the Cortex post, where shared-kernel attack surface was still a theoretical procurement concern, is no longer theoretical. The PoC is 732 bytes. The discovery, by Theori's Taeyang Lee using AI-assisted auditing, took roughly an hour. Unit 42 names CI/CD pipelines specifically as exploit targets, because CI runners are the densest concentration of multi-tenant container workloads inside most enterprise SaaS deployments.
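A cheaper first triage question for your own workloads is whether the vulnerable interface is reachable from inside the container at all. Python exposes AF_ALG on Linux (3.6+), so a sketch like the one below, run inside a pod, answers it; this probes reachability of the algif_aead code path, it does not exercise the bug:

```python
import socket

# Probe whether the kernel's AF_ALG AEAD interface (algif_aead) is
# reachable from this workload. A successful bind() means this container
# can talk to the code path Copy Fail lives in; an OSError (e.g. ENOENT,
# EAFNOSUPPORT, or EPERM under a seccomp/LSM policy) means it cannot.
try:
    s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
    try:
        s.bind(("aead", "gcm(aes)"))
        print("algif_aead reachable from this container")
    finally:
        s.close()
except (AttributeError, OSError) as exc:
    # AttributeError covers non-Linux Pythons that lack AF_ALG entirely.
    print(f"AF_ALG AEAD not reachable here: {exc}")
```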
The Architecture: SOC 2 Attestations Stop at the Hypervisor
The architectural assumption that a SOC 2 Type II attestation is supposed to validate is that the vendor has a defined patch-management control, that the control is operating, and that the auditor sampled evidence of it during the period. What the attestation does not establish, and what no AICPA trust service criterion requires, is that the vendor's specific node pools, on a specific date, are running a specific patched kernel version against a specific KEV-listed CVE. That is the gap between the control framework and the operational state.
On the cloud-provider side, the patched-AMI release timeline is a matter of public record. The patched EKS AMI shipped in early May 2026, per the AWS Labs amazon-eks-ami issue tracker. On AKS, existing nodes are not auto-patched: customers have to trigger node-image upgrades or apply DaemonSet mitigations themselves, per Azure's own AKS issue thread. The question the SOC 2 framework is structurally unable to answer is whether your vendor pulled the new AMI as soon as it shipped, rotated their nodes, and reached patched state before the federal May 15 deadline, or whether their nodes are still running the vulnerable kernel today and the auditor will not look until the next attestation period closes. This is the same procurement-as-graph problem I described in the Delve and LiteLLM post: the attestation describes a control, not a state, and the buyer's exposure tracks the state.
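Checking the state is mechanical for anyone with cluster access, which is exactly why "we can't tell you" is informative. A sketch, assuming kubectl is installed and pointed at the right cluster; what counts as "patched" comes from your provider's advisory and release notes, not from this script:

```python
import json
import subprocess

# List every node's kernel and OS image so the fleet can be compared
# against the patched kernel line the provider shipped.
nodes = json.loads(subprocess.run(
    ["kubectl", "get", "nodes", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout)["items"]

for node in nodes:
    info = node["status"]["nodeInfo"]
    print(f"{node['metadata']['name']}: "
          f"kernel={info['kernelVersion']} image={info['osImage']}")
```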
Where the Assumptions Break: The Visible Benchmark Is Five Days
Cloudflare published its Copy Fail response timeline, and it establishes what a competent fleet-wide response looks like at scale. Disclosure landed at 16:00 UTC on April 29; an interim BPF-LSM mitigation was deployed across the entire fleet within roughly 30 hours; a patched 6.12 LTS kernel was in production on May 4, five days from disclosure to patched fleet; and a fleet-wide audit of algif_aead callers found exactly one legitimate internal service using AF_ALG across more than 330 cities. This is not a small operator's response. It is the visible bar for a security-mature infrastructure provider operating at hyperscale.
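You do not need Cloudflare's BPF-LSM tooling to start the same audit. Per node, /proc/modules tells you whether algif_aead is loaded and whether anything currently holds a reference to it. The refcount is only a proxy for live AF_ALG callers, not the full caller attribution Cloudflare did, but it is enough to separate "never used here" from "someone is using this":

```python
import pathlib

# Per-node triage: is algif_aead loaded, and does its reference count
# suggest live users? /proc/modules fields are: name, size, refcount,
# dependencies, state, address.
for line in pathlib.Path("/proc/modules").read_text().splitlines():
    name, _size, refcount, *_rest = line.split()
    if name == "algif_aead":
        print(f"algif_aead loaded, refcount={refcount}")
        break
else:
    print("algif_aead not currently loaded on this node")
```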
The gap between five days and "we will address this in our next quarterly patch window" is the gap that procurement diligence is now responsible for measuring. The same gap I named for hosting providers as mean time to ingest upstream threat intel in the cPanel post now applies one layer deeper, against a kernel CVE on a federal patch clock. Vendor disclosure maturity, which I framed in the Lovable 76-day post as a leading indicator of operational competence, is the same signal here: a vendor that cannot tell you, on a Monday morning ten days after KEV listing, whether their fleet was scanned for algif_aead callers, whether their node pools were rotated to the patched AMI, and which of their multi-tenant workloads share kernels with workloads from tenants the buyer does not control, is a vendor whose patch-SLA control did not survive contact with a real KEV-listed kernel CVE.
What Would Actually Fix This: Three Questions on the Renewal Checklist
The procurement diligence question stack changed on May 1, 2026, and it changed permanently, because Copy Fail will not be the last shared-kernel CVE that lands inside a federal patch deadline. Add three artifacts to the next renewal questionnaire and refuse to close without them.
First, ask the vendor for the output of their algif_aead fleet scan, dated after April 29, 2026, with the count of legitimate internal callers and the action taken on illegitimate ones. Cloudflare's number was one across more than 330 cities. A vendor that cannot produce a comparable artifact has not done the scan.
Second, ask the vendor for their patch SLA against KEV-listed kernel CVEs, expressed in calendar days from CISA listing to fleet-wide patched state, and ask for evidence that the SLA was met for Copy Fail specifically against the May 15 BOD 22-01 deadline. "We patch within our quarterly window" is not an SLA against a 14-day federal deadline.
Third, ask the vendor to enumerate which of their multi-tenant workloads, by product line, share kernels with workloads from tenants the buyer does not control, and what runtime isolation, beyond a kernel patch, prevents a 4-byte page-cache write in one tenant from being reachable in another. CI/CD runners, code-execution sandboxes inside AI products, and shared notebook environments are the three architectures most likely to fall in that category, and the buyer should expect a specific answer for each.
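For buyers who run their own clusters, that third question is answerable locally before it goes on a vendor questionnaire. A sketch, using namespaces as a stand-in for tenant boundaries; substitute whatever label actually encodes tenancy in your environment:

```python
import json
import subprocess
from collections import defaultdict

# Group running pods by node to see which workloads share a kernel.
# Namespace is a stand-in for the tenant boundary here; swap in the
# label that actually encodes tenancy in your environment.
pods = json.loads(subprocess.run(
    ["kubectl", "get", "pods", "-A", "-o", "json",
     "--field-selector=status.phase=Running"],
    check=True, capture_output=True, text=True,
).stdout)["items"]

tenants_by_node = defaultdict(set)
for pod in pods:
    node = pod["spec"].get("nodeName")
    if node:
        tenants_by_node[node].add(pod["metadata"]["namespace"])

for node, tenants in sorted(tenants_by_node.items()):
    if len(tenants) > 1:
        print(f"{node}: {len(tenants)} namespaces share this kernel: "
              f"{', '.join(sorted(tenants))}")
```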
A vendor that produces all three artifacts in a week is operating at the visible bar Cloudflare set. A vendor that cannot produce them at all is informing the buyer that the SOC 2 attestation on file is the ceiling of what diligence can verify, which is, after May 1, no longer enough.