The New York Times just ran a simple test: give four leading AI chatbots (Google's Gemini, OpenAI's ChatGPT, Anthropic's Claude, and xAI's Grok) eight fictional tax scenarios drawn from actual training materials for a tax filing service, and provide all the necessary documentation: W-2s, mortgage forms, dependent information. Everything.
The chatbots miscalculated taxes by an average of more than $2,000 per scenario. Not rounding errors. Substantive miscalculations that, on a real return, would trigger IRS penalties or leave money on the table.
This isn't a tax story. It's a trust story.
The Confidence Problem
Here's what makes AI tax errors different from human tax errors: a human accountant who's unsure will tell you they're unsure. They'll flag an edge case, recommend a second opinion, or at minimum pause before signing off on something that doesn't look right.
AI chatbots don't do that. They deliver a wrong answer with exactly the same tone, formatting, and apparent certainty as a right one. There's no hedging. No "I'm not confident about this deduction." Just a clean, authoritative response that happens to be off by thousands of dollars.
As one tech analyst quoted in the Times report put it: asking a chatbot to do your taxes is like asking it how many times the letter R appears in the word "strawberry." It will give you the probable answer, not the precise one.
The tax code doesn't grade on probability. Neither does the IRS.
The Pattern Is Bigger Than Taxes
This isn't an isolated finding. It's the latest data point in a pattern that should concern anyone deploying AI in consequential domains.
A NerdWallet analysis published days before the Times test found that chatbots performed well on basic IRS quiz questions but produced faulty and outdated information as the questions grew more nuanced. The IRS's own Taxpayer Advocate Service previously found that AI chatbots from tax preparation companies provided inaccurate or irrelevant responses nearly half the time when tested on complex questions.
And taxes are just one domain. Earlier this year, ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard for 2026. A study found ChatGPT Health missed over half of emergencies in triage testing, while roughly 40 million people were using it daily for health-related questions. I wrote about the data security dimensions of that problem when it launched.
The failure mode is consistent across domains: AI performs well on straightforward questions where the answer is well-represented in training data, then fails quietly when complexity increases. The output looks identical in both cases. The user has no signal to distinguish between a reliable answer and a hallucinated one.
Why This Keeps Happening
The Times report explains the mechanical reason: chatbots struggle to hold and process the many interconnected details of a tax return. With a W-2, a mortgage, dependents, and deductions all in play, errors accumulate as the calculations grow more complex.
But the deeper issue is architectural. Large language models are prediction engines. They generate the most statistically likely next token based on patterns in their training data. When those patterns align with correct answers, they're remarkably useful. When they don't, the model doesn't know it doesn't know. It just keeps generating tokens with the same confidence.
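That mechanism can be sketched in a few lines. This is a toy illustration (the token strings and logit values are invented for the example, not taken from any real model): the model scores candidate continuations, converts the scores to probabilities with a softmax, and emits the most likely one. Nothing in that pipeline checks whether the winning token is arithmetically correct.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over tokens."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the next token after "Your total deduction is $".
# Whether "12,400" is actually the right figure is invisible here: the model
# only sees which continuation is statistically likely given its training data.
logits = {"12,400": 4.1, "12,950": 3.8, "1,240": 1.2}
probs = softmax(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))
```

Note that a wrong answer and a right answer come out of the same machinery with the same kind of probability attached, which is why the output carries no usable confidence signal.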
This is the same fundamental limitation I explored in the AI safety implementation gap: the distance between what AI systems can do in controlled demonstrations and what they actually deliver in production, with real-world variability, is wider than most organizations acknowledge.
Tax preparation is a perfect stress test because it requires exactly what language models are worst at: precise arithmetic, multi-step reasoning across interconnected variables, and zero tolerance for error. One wrong number cascades through the entire return. There's no "close enough."
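The cascade is easy to demonstrate with a deliberately simplified model. The brackets, rates, and dollar amounts below are toy values for illustration, not the real tax code: the point is only that a single transcription slip upstream shifts every downstream figure, producing exactly the thousands-of-dollars scale of error the Times test found.

```python
def toy_tax(wages, deduction, rate_low=0.10, rate_high=0.22, bracket=45_000):
    """Toy two-bracket tax calculation (illustrative numbers only)."""
    taxable = max(wages - deduction, 0)
    if taxable <= bracket:
        return taxable * rate_low
    return bracket * rate_low + (taxable - bracket) * rate_high

correct = toy_tax(wages=80_000, deduction=13_850)
# One slip upstream -- dropping a digit from the deduction -- changes the
# taxable income, the bracket split, and the final bill, not just one line.
wrong = toy_tax(wages=80_000, deduction=1_385)
print(round(wrong - correct, 2))  # a four-figure swing from a one-digit error
```

A human preparer would likely notice a deduction of $1,385 as implausibly low; a prediction engine has no such plausibility check, so the error flows silently through the whole return.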
The Accountability Void
Patrick Runyen of Modera Wealth Management called it a "user-beware scenario," adding: "The alibi can't be that ChatGPT told me to do it; that's kind of equivalent to the dog ate my homework."
He's right. The IRS will not accept AI reliance as a defense. If you underpay based on a chatbot's calculation, the interest, penalties, and back taxes are yours.
But this accountability gap extends well beyond tax filing. In healthcare, if a patient acts on an AI triage recommendation that missed an emergency, the AI vendor's terms of service almost certainly disclaim liability. In financial planning, in legal research, in any domain where people are increasingly turning to chatbots for answers: the confidence of the output implies a level of reliability that doesn't exist, and the consequences land entirely on the person who trusted it.
This is the pattern that concerns me most. Not that AI gets things wrong; every tool has limitations. It's that the user experience of AI actively obscures those limitations. A search engine returns a list of sources and lets you evaluate them. A chatbot returns a single, confident answer and buries the uncertainty.
What This Actually Means
I'm not making a case against using AI. I use these tools extensively, including for building and shipping software. They're genuinely transformative for brainstorming, drafting, pattern recognition, and accelerating work where precision isn't the primary constraint.
But the tax test is a useful calibration exercise. When you read that four major AI platforms averaged more than $2,000 in errors on straightforward tax scenarios, it should update your priors about every other domain where you're trusting AI output without independent verification.
Three things to keep in mind:
AI confidence is not a signal of accuracy. The format of AI responses (clean paragraphs, specific numbers, authoritative tone) creates an illusion of reliability that isn't backed by the underlying mechanics. Treat every AI output in a consequential domain as a draft that requires human verification.
Complexity is where AI fails silently. Simple questions get reasonable answers. Multi-step reasoning with interconnected variables is where errors compound. If your use case involves the latter, your verification process needs to be proportionally rigorous.
The liability is always yours. No AI vendor's terms of service will protect you when their model gives you a wrong answer that costs you money, health, or compliance. The accountability framework hasn't caught up with the adoption curve, and until it does, the risk sits entirely with the user.
The $2,000 average tax error isn't a story about chatbots being bad at math. It's a reminder that AI systems still have fundamental reliability gaps in high-stakes applications, and the technology's greatest risk might be how convincingly it hides that fact.