A Washington State University study shows ChatGPT flips answers, misses false statements, and only barely beats chance when judging scientific hypotheses.

Generative AI still struggles with basic scientific reasoning. Washington State University’s Mesut Cicek and collaborators tested ChatGPT on 719 hypotheses pulled from peer-reviewed business journals, asking the model ten times per hypothesis whether the claim was true or false. The headline accuracy looked decent (76.5% in 2024, creeping up to 80% in 2025), but once the researchers accounted for random guessing, the model performed only about 60% better than chance.
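The article does not spell out how the researchers corrected for guessing. One common approach, sketched here as an assumption rather than as the study's actual method, is to rescale accuracy so that the guessing baseline maps to 0 and a perfect score maps to 1. The function name and the 50% baseline for a true/false question are illustrative choices.

```python
def chance_corrected(raw_accuracy: float, chance: float = 0.5) -> float:
    """Rescale accuracy so `chance` maps to 0.0 and perfect accuracy to 1.0.

    Assumption: a true/false question has a 50% guessing baseline.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (raw_accuracy - chance) / (1.0 - chance)


# 80% raw accuracy with a 50% baseline
score = chance_corrected(0.80)
print(f"{score:.2f}")  # 0.60
```

Under that assumed baseline, 80% raw accuracy works out to 0.60, which is consistent with the "about 60% better than chance" figure reported for 2025.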

The uncomfortable numbers

Metric	2024 (GPT-3.5)	2025 (GPT-5 mini)
Raw accuracy	76.5%	80.0%
Adjusted accuracy (chance-corrected)	~60% above chance	~60% above chance
Correctly identified false statements	16.4%
Consistency across 10 identical prompts	73%

Some hypotheses split evenly across the ten runs: five “true” answers and five “false.” That inconsistency erodes trust quickly: if you ask the same question twice and get different replies, which one should drive a decision?
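A minimal sketch of how repeated answers to one hypothesis could be scored for stability. The definition used here (share of runs that match the most common answer) is an assumption; the study may have measured consistency differently.

```python
from collections import Counter


def consistency(answers: list[str]) -> float:
    """Share of runs that agree with the most common answer (1.0 = fully stable)."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


stable = ["true"] * 10          # same verdict on every run
split = ["true", "false"] * 5   # five of each, the worst case for a yes/no question

print(consistency(stable))  # 1.0
print(consistency(split))   # 0.5
```

A score of 0.5 on a true/false question means the repeated runs carry no more information than a coin flip, so such cases are natural candidates for human review.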

Study design highlights

Hypotheses came from business journals published since 2021, covering nuanced cause-and-effect statements rather than trivia.
The team used the free ChatGPT-3.5 for the 2024 test and the then-current GPT-5 mini in 2025.
Each hypothesis was posed ten times to measure stability; answers were logged and compared against the ground-truth conclusion from the underlying study.
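Scoring against ground truth can also be broken out by label, which is how a figure like the 16.4% for false statements can sit alongside a much higher overall accuracy. The sketch below uses hypothetical toy data, not the study's data, and a hypothetical helper name.

```python
def per_label_accuracy(truth: list[bool], predicted: list[bool]) -> dict[bool, float]:
    """Accuracy computed separately for hypotheses that are actually true or false."""
    result = {}
    for label in (True, False):
        indices = [i for i, t in enumerate(truth) if t == label]
        if indices:
            hits = sum(predicted[i] == label for i in indices)
            result[label] = hits / len(indices)
    return result


# Hypothetical toy data: 7 supported hypotheses and 3 unsupported ones.
truth = [True] * 7 + [False] * 3
# A model that answers "true" every time.
predicted = [True] * 10

overall = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
print(overall)                               # 0.7
print(per_label_accuracy(truth, predicted))  # {True: 1.0, False: 0.0}
```

In this toy case the always-"true" model looks respectable overall while catching none of the false hypotheses, which is why reporting per-label results matters.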

This isn’t a cherry-picked set of trick questions; it represents the kind of ambiguous, data-backed claims executives grapple with daily.

Why it happens

Large language models are pattern engines, not logicians. They assemble plausible sentences by referencing training data rather than reasoning through causal chains. When asked to evaluate research hypotheses—statements that often hinge on nuanced assumptions—the model falls back on linguistic fluency. It “sounds” confident, but it isn’t doing hypothesis testing; it’s matching phrasings it has seen before. Without retrieval or verification, it can’t separate a plausible but false claim from an accurate one.

Implications for operators

If your org is using LLMs for knowledge management, you can’t treat outputs as facts. Cicek’s team urges leaders to build verification layers: cross-check LLM answers against primary sources, require human review for scientific or regulatory content, and teach staff to look for inconsistency (ask the same question twice and see if the answer flips).

The study also reinforces the gulf between marketing slogans about “AGI” and reality. True expert-level reasoning would not oscillate wildly or collapse on false statements. The researchers point out that artificial general intelligence capable of conceptual understanding is “further away than many expect.”

How to operationalize the findings

Deploy ensemble prompts. Run multiple prompts with varied wording and flag any hypothesis where the model disagrees with itself. Escalate those to humans.
Measure calibration. Require the model to output confidence scores and compare them to empirical correctness so you know when it’s overconfident.
Stick to retrieval-augmented workflows. Pair generative answers with citations from trusted databases; if no citation exists, treat the answer as speculative.
Log contradictions. Create audit trails showing which prompts produced divergent answers and how a human resolved them. That’s invaluable for compliance and internal learning.
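The calibration step above could be sketched as follows. This is an illustrative sketch, not a method from the study: the function names are hypothetical, the confidence values are assumed to be self-reported by the model on a 0 to 1 scale, and the sample data is made up.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions by stated confidence and compare with observed accuracy.

    Returns one (count, mean_confidence, accuracy) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    rows = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            rows.append((len(members), mean_conf, accuracy))
    return rows


def expected_calibration_error(rows):
    """Count-weighted average gap between stated confidence and accuracy."""
    total = sum(n for n, _, _ in rows)
    return sum(n * abs(conf - acc) for n, conf, acc in rows) / total


# Hypothetical data: the model claims 90% confidence but is right half the time.
confidences = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [True, True, False, False, True, False]

rows = calibration_table(confidences, correct)
ece = expected_calibration_error(rows)
print(round(ece, 2))  # 0.3
```

A large gap between stated confidence and observed accuracy is a signal to discount the model's self-reported certainty and route more answers to human review.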

Questions to ask your vendors

Do they report chance-corrected accuracy and consistency metrics, or only raw percentages?
Can they show failure cases on nuanced, domain-specific hypotheses like those in the study?
What governance exists when the model contradicts itself—does the system even detect it?

Training your teams

Don’t just roll out tools—teach people how they fail. Incorporate exercises where staff intentionally coax the model into contradictions, so they build intuition about when not to trust it. Encourage “human-in-the-loop” as a default posture, especially for anything touching compliance, medical claims, or investor communications.

Explain the limits plainly: generative AI “doesn’t have a brain,” as Cicek puts it. It memorizes text; it doesn’t understand it. That distinction should guide policy and messaging.

The road ahead

Cicek and colleagues plan to expand their hypothesis tests to other AI systems. They also recommend shared datasets of nuanced questions so model builders can benchmark reasoning progress. Until then, treat LLMs as brainstorming partners, not referees of scientific truth.

The bottom line: LLMs are great at sounding authoritative, terrible at evaluating evidence. Until that changes, any workflow that stakes decisions on their judgment without human review is rolling dice.

Source: “Study finds ChatGPT gets science wrong more often than you think,” ScienceDaily, March 17, 2026.

Aswin Sarang

Aswin Sarang is a technology professional and entrepreneur working across robotics, artificial intelligence, and automation. He focuses on building practical systems that bridge engineering, strategy, and real-world deployment, with an emphasis on clarity, scalability, and long-term value. His work spans product development, system integration, and technology consulting, helping organizations navigate complex technical decisions and translate emerging technologies into usable solutions. Known for a first-principles approach, Aswin prioritizes fundamentals over hype and execution over speculation. Beyond technology, he maintains a strong interest in human performance, learning, and personal development, bringing a multidisciplinary perspective to both his professional and creative pursuits.


ChatGPT Flunked a 719-Hypothesis Reality Check