OpenAI just launched ChatGPT Health, a dedicated space within ChatGPT designed to help users manage their health and wellness. The pitch is compelling: connect your medical records and fitness apps, get personalized health insights, and feel more informed about your wellbeing. According to OpenAI, over 230 million people globally ask health-related questions on ChatGPT every week. Now they want those users to share their most sensitive data.
Before you connect your Apple Health, upload lab results, or link your medical records through their various integrations, there are questions you should be asking. Questions that OpenAI's marketing materials don't fully answer.
The Promise vs. The Fine Print
OpenAI describes ChatGPT Health as having "purpose-built encryption and isolation" to keep health conversations protected. They state that health data is encrypted by default, stored separately from other chats, and not used to train their foundation models. On the surface, this sounds reassuring.
But here's what privacy advocates at the Center for Democracy and Technology point out: "When your health data is held by your doctor or your insurance company, the HIPAA privacy rules apply. The same is not true for non-HIPAA-covered entities, like developers of health apps, wearable health trackers, or AI companies."
OpenAI does not describe ChatGPT Health as HIPAA compliant, and as a consumer health app it generally isn't covered by HIPAA at all. Your data enters a regulatory gray zone the moment it leaves your healthcare provider's system.
This Isn't Theoretical
If you think privacy concerns about consumer AI are hypothetical, consider what's happened in just the past two years.
In August 2025, over 370,000 Grok conversations were indexed by Google, making them searchable by anyone. The exposed data included medical and psychological questions, business details, and at least one password. Researchers found personally identifiable information including names and addresses, data that could enable harassment or doxxing.
Just weeks earlier, ChatGPT faced the same problem. Nearly 4,500 shared conversations appeared in Google search results, including discussions about mental health, career concerns, and legal issues. Some conversations contained email addresses, names of children, home locations, resumes, business strategies, and developer credentials. OpenAI disabled the feature and worked to de-index the content, but once data hits search engine caches, it can persist indefinitely.
This wasn't even the first time. In March 2023, a Redis bug caused ChatGPT to expose other users' chat titles and first messages to unrelated users. The same bug exposed payment information for 1.2% of ChatGPT Plus subscribers.
And in November 2025, Tenable researchers discovered a "Conversation Injection" vulnerability where malicious prompts could be injected into ChatGPT sessions. These injections could even update ChatGPT's persistent memories, creating an ongoing data exfiltration risk that persisted across sessions.
The pattern is clear: when these platforms fail, they expose raw data. Every piece of information users shared ends up unprotected, unobfuscated, fully readable.
The Questions Users Should Be Asking
Working on data security for AI and healthcare at Capital One Software, I've spent significant time with CTOs, CIOs, and CISOs in the healthcare industry. The executives responsible for protecting patient data at scale ask pointed questions that consumer users rarely think to ask:
1. How is my data protected at the field level?
Encryption is often discussed as if it's binary: either your data is encrypted or it isn't. But there's a massive difference between encrypting a database and protecting individual data fields. When OpenAI says your health data is "encrypted," what does that actually mean?
- Is my Social Security number encrypted separately from my diagnosis codes?
- Are my lab values protected differently than my demographic information?
- What happens when data is decrypted for processing?
Field-level protection matters because breaches don't expose entire databases uniformly. Attackers target specific high-value fields: SSNs, medical record numbers, diagnosis codes. If your data is encrypted only at the storage layer, it's exposed the moment it's processed.
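To make the distinction concrete, here's a minimal sketch of field-level protection in Python. It's illustrative only; the field names, keys, and key handling are placeholders I've invented, not a description of how OpenAI or any particular vendor does it. The point is that each sensitive field is protected under its own key, so decrypting one field for processing, or losing one key, doesn't expose the rest of the record.

```python
# Illustrative only: field-level encryption with per-field keys.
# Field names and key handling are hypothetical, not any vendor's actual design.
from cryptography.fernet import Fernet

# In practice, these keys would live in separate key-management systems.
FIELD_KEYS = {
    "ssn": Fernet(Fernet.generate_key()),
    "diagnosis_code": Fernet(Fernet.generate_key()),
    "lab_values": Fernet(Fernet.generate_key()),
}

def encrypt_record(record: dict) -> dict:
    """Encrypt each sensitive field under its own key.

    A storage-layer breach, or the decryption of one field for processing,
    exposes at most that field rather than the whole record.
    """
    return {
        field: FIELD_KEYS[field].encrypt(value.encode()) if field in FIELD_KEYS else value
        for field, value in record.items()
    }

record = {"ssn": "123-45-6789", "diagnosis_code": "E11.9", "zip": "10001"}
protected = encrypt_record(record)  # "zip" might fall under a separate demographic policy

# Decrypting the diagnosis code for analytics never touches the SSN key.
print(FIELD_KEYS["diagnosis_code"].decrypt(protected["diagnosis_code"]).decode())
```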
2. How can OpenAI ensure my data has been de-identified?
OpenAI knows exactly who you are. You have an account, a payment method, a conversation history. When they claim health data is "de-identified" for certain purposes, the critical question is: de-identified from what?
True de-identification under HIPAA's Safe Harbor standard requires removing 18 specific identifiers. But OpenAI isn't bound by HIPAA. Their definition of de-identification may be far less rigorous. And here's the uncomfortable truth: when the company holding your data knows your identity, re-identification is always technically possible.
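To make the "de-identified from what?" question concrete, here's a deliberately simplified sketch. It strips a handful of Safe Harbor identifier categories (not all 18, and not in a compliant way); the detail that matters is the join key the platform keeps for itself.

```python
# Hypothetical sketch of Safe Harbor-style stripping. Only a subset of the 18
# identifier categories is shown; this is not a compliant implementation.
SAFE_HARBOR_FIELDS = {
    "name", "street_address", "email", "phone", "ssn",
    "medical_record_number", "ip_address", "device_id",
}

def strip_identifiers(record: dict) -> dict:
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

record = {
    "account_id": "user_8841",   # the platform still knows exactly who this is
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "diagnosis_code": "F41.1",
}

print(strip_identifiers(record))
# {'account_id': 'user_8841', 'diagnosis_code': 'F41.1'}
# The direct identifiers are gone, but as long as the holder retains a join key
# back to your account, "de-identified" only means de-identified to outsiders.
```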
3. What happens if someone gets past the security controls?
This is where the "crown jewels" analogy becomes relevant.
Think of your health data as the Crown Jewels of England. Traditional security approaches treat encryption like a moat around the Tower of London, with your database as the vault inside. Role-based access controls (RBAC) are the guards. Defense in depth adds walls, checkpoints, and surveillance.
But what happens if an attacker gets past all of that? They get the crown jewels. The actual, raw, sensitive data. Your SSN. Your HIV status. Your mental health records. Everything.
This is the fundamental limitation of perimeter-based security: it protects access to data, but once access is achieved, the data itself is fully exposed.
A Different Approach to Data Protection
Most organizations today, including consumer AI platforms, rely on the perimeter model I just described. But there's a fundamentally different approach I advocate for: replacing the crown jewels with costume jewelry through deterministic tokenization.
Here's how it works: instead of storing your actual Social Security number, the system stores a token. This token looks and behaves like an SSN but is inherently meaningless if compromised. The token maintains referential integrity (so systems can still link records), but the actual sensitive value never exists in the same place as the token.
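Here's a minimal sketch of the idea in Python, using a keyed hash to produce a stable, SSN-shaped token. This illustrates deterministic, format-preserving tokenization in general, not any vendor's production scheme, which would use vetted format-preserving algorithms and externally managed keys.

```python
# Illustrative deterministic tokenization: same input + same key -> same token,
# so joins and analytics still work, but the token carries no recoverable SSN.
# Not a production scheme; the key would be held outside this system entirely.
import hashlib
import hmac

TOKENIZATION_KEY = b"replace-with-key-material-held-outside-this-system"

def tokenize_ssn(ssn: str) -> str:
    digest = hmac.new(TOKENIZATION_KEY, ssn.encode(), hashlib.sha256).hexdigest()
    digits = str(int(digest, 16))[:9]                  # keep it SSN-shaped
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

print(tokenize_ssn("123-45-6789"))                     # a stable, SSN-shaped token
print(tokenize_ssn("123-45-6789") == tokenize_ssn("123-45-6789"))  # True: referential integrity
```

Because the mapping is deterministic, the same SSN produces the same token everywhere it appears, so records still join cleanly; a stolen table of tokens, by itself, tells an attacker nothing.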
Traditional tokenization architectures use a "vault," a lookup table mapping tokens to their original values. But this creates a concentrated target. Compromise the vault, and every token becomes reversible. The costume jewelry turns back into crown jewels.
The more robust approach is vaultless tokenization. With no lookup table in the picture, reversing a token requires multiple factors to be true simultaneously: specific metadata, cryptographic keys held by different parties, and access to systems that don't store the sensitive data themselves. There's no single table an attacker can steal to unravel all your tokens. You can't reverse a token with a database lookup; you need a constellation of independently secured components working together.
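To illustrate the "constellation of components" idea, and only to illustrate it, the sketch below derives the detokenization key on demand from two key shares held by different parties plus the token's metadata. There's no table to steal; unless every input comes together, the ciphertext stays meaningless. (Fernet stands in for a real format-preserving scheme purely to keep the example short.)

```python
# Conceptual sketch of vaultless detokenization: the key never exists at rest;
# it is derived on demand from independently held shares plus token metadata.
import base64
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_key(share_a: bytes, share_b: bytes, metadata: bytes) -> bytes:
    """Combine key shares from two parties, bound to the token's metadata."""
    hkdf = HKDF(algorithm=hashes.SHA256(), length=32, salt=None, info=metadata)
    return base64.urlsafe_b64encode(hkdf.derive(share_a + share_b))

# Shares held by different systems or parties; metadata travels with the token.
share_a = b"held-by-the-application-tier"
share_b = b"held-by-a-separate-key-service"
metadata = b"field=ssn;policy=phi;version=1"

token = Fernet(derive_key(share_a, share_b, metadata)).encrypt(b"123-45-6789")

# Detokenization only works with *all three* inputs present and correct.
print(Fernet(derive_key(share_a, share_b, metadata)).decrypt(token))
# With a missing share or tampered metadata, decryption simply fails:
# Fernet(derive_key(share_a, b"wrong", metadata)).decrypt(token)  -> InvalidToken
```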
If an attacker breaches a system storing tokenized data, they get costume jewelry. Data that looks real but is useless for identity theft, fraud, or blackmail.
Consider what this would have meant for the incidents I described earlier. When 370,000 Grok conversations hit Google's index, every medical question, every password, every piece of PII was fully readable. If that data had been tokenized, the indexed conversations would have contained meaningless values: format-preserving placeholders that reveal nothing about the actual users.
This is a fundamentally different data protection model than encryption alone. I explored this further in my post on building AI systems that enterprises can trust.
The AI Training Problem
Here's a concern that gets less attention: even if OpenAI doesn't use your ChatGPT Health conversations to train their foundation models (as they claim), what about the model weights themselves?
AI models can memorize training data. Research has demonstrated that large language models can regurgitate verbatim content from their training sets. If sensitive health data ever touches model training, even temporarily or "anonymously," there's a risk that information could be extracted through carefully crafted prompts.
Tokenization addresses this at the architectural level. When AI systems are trained on tokenized data instead of raw sensitive information, the model weights never contain actual PII or PHI. Even if an attacker extracts memorized content from the model, they get tokens. Not Social Security numbers, not diagnosis codes, not prescription histories.
This keeps raw sensitive data out of places where it could be compromised and extracted later.
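As a rough sketch of what that looks like in a pipeline, here's a hypothetical sanitization pass that swaps detected identifiers for deterministic tokens before text can enter a training or fine-tuning corpus. The regex and token format are simplified placeholders; real pipelines use far more robust PII/PHI detection.

```python
# Hypothetical pre-training sanitization pass: detected identifiers are replaced
# with deterministic tokens before the text reaches a training corpus, so any
# memorization the model does is memorization of tokens, not of real PII/PHI.
import hashlib
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
SALT = b"key-material-held-outside-the-training-pipeline"

def tokenize_match(match: re.Match) -> str:
    digest = hashlib.sha256(SALT + match.group().encode()).hexdigest()[:8]
    return f"[SSN_{digest}]"   # stable placeholder: same SSN -> same token

def sanitize(text: str) -> str:
    return SSN_PATTERN.sub(tokenize_match, text)

print(sanitize("Patient SSN 123-45-6789, diagnosed with E11.9."))
# -> "Patient SSN [SSN_...], diagnosed with E11.9."
```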
The Quantum Computing Threat
There's another dimension to this that security professionals increasingly discuss: the "Harvest Now, Decrypt Later" threat.
Encrypted data stolen today can be stored until quantum computers become capable of breaking current encryption standards. Nation-states and sophisticated attackers are already collecting encrypted data, betting that future computing capabilities will make today's cryptography obsolete.
Traditional encryption, no matter how strong, is vulnerable to this approach. Given enough time and computing power, encrypted data can theoretically be decrypted.
Deterministic tokens don't have this vulnerability. Tokens can't be reversed back to their original form by throwing compute at them, quantum or otherwise. The mathematical relationship between a token and its original value isn't based on encryption that can be broken with sufficient processing power. Without the specific combination of metadata, keys, and systems required for detokenization, the tokens are permanently meaningless.
For health data that could be sensitive for decades (genetic information, mental health history, chronic conditions), this distinction matters enormously.
What Should You Do?
I'm not suggesting you avoid ChatGPT Health entirely. AI-powered health insights could genuinely help people better understand their conditions and communicate with their doctors. But before you connect your most sensitive data to any AI system, ask these questions:
- Read the actual privacy policy. Not the marketing summary. Read the Health Privacy Notice and understand what you're agreeing to.
- Understand what "not used for training" actually means. Does it apply to all processing? What about fine-tuning? What about aggregate analytics?
- Consider what you're sharing. Maybe connect your fitness app but think twice about uploading detailed medical records. The value of AI insights has to be weighed against the risks.
- Know your recourse. If something goes wrong, what are your options? Consumer AI tools don't offer the same protections as healthcare providers.
The healthcare executives I work with approach every new AI tool with healthy skepticism. They ask hard questions because they understand what's at stake when health data is compromised. The average consumer deserves to approach ChatGPT Health with that same rigor.
The Bigger Picture
This isn't just about OpenAI or ChatGPT Health. It's about a fundamental tension in how we're building AI systems that handle sensitive information. The shadow AI problem I've written about, where employees share sensitive data with unauthorized AI tools, is a corporate manifestation of the same issue. People want the benefits of AI, and they'll accept significant privacy risks to get them.
The question for the industry is whether we can build AI systems that deliver those benefits without requiring users to hand over their crown jewels. Approaches like tokenization suggest we can. But right now, most consumer AI tools, including ChatGPT Health, are asking users to trust perimeter security alone.
OpenAI has built a sophisticated product with more privacy protections than many competitors. But "better than average" isn't the same as "adequate for health data." Before you connect your medical records, make sure you understand what you're trusting and what you're risking.