Commonly used AI chatbots often struggle to provide correct medical advice for women’s health queries, particularly those requiring urgent attention. A recent study revealed that these models frequently fail to accurately diagnose or offer helpful guidance on critical issues across emergency medicine, gynecology, and neurology. The findings underscore a significant gap in AI’s ability to handle gender-specific medical inquiries effectively.
The Benchmark Test
Researchers from the US and Europe tested 13 large language models (LLMs), including those from OpenAI, Google, Anthropic, Mistral AI, and xAI, against a curated set of 96 medical queries designed by a team of 17 women’s health experts, pharmacists, and clinicians. The results were alarming: across the models, 60% of questions received insufficient medical advice. GPT-5 performed best, failing 47% of the time, while Mistral 8B had the highest failure rate at 73%.
This raises critical questions about the reliability of AI in healthcare, especially when women turn to these tools for self-diagnosis or decision support. The study’s lead author, Victoria-Elisabeth Gruber of Lumos AI, said the failure rate was higher than expected. “We expected gaps, but the degree of variation across models stood out,” she stated.
Why This Matters
The issue stems from how AI models are trained: they learn from historical data that carries inherent biases, including those embedded in medical knowledge itself. According to Cara Tannenbaum at the University of Montreal, this produces systematic gaps in AI’s understanding of sex- and gender-related health issues. The findings highlight the urgent need for updated, evidence-based content on healthcare websites and in professional guidelines.
Debate Over Testing Methods
Some experts, such as Jonathan H. Chen at Stanford University, argue that the 60% failure rate is misleading because the test sample was limited and the grading overly conservative. He points out that the tested scenarios, such as immediately suspecting pre-eclampsia in a postpartum woman with a headache, are framed in ways that make high failure rates likely.
Gruber acknowledges this criticism, clarifying that the benchmark was intentionally strict. “Our goal wasn’t to claim models are broadly unsafe, but to define a clinically grounded standard for evaluation,” she explained. In healthcare, even minor omissions can have serious consequences.
OpenAI’s Response
OpenAI responded by stating that ChatGPT is intended to support, not replace, medical care. The company emphasizes ongoing evaluations and improvements, including gender-specific context in its latest GPT-5.2 model, and encourages users to rely on qualified clinicians for care and treatment decisions. The other companies whose models were tested did not respond to the study’s findings.
The study is a clear warning about the limitations of current AI chatbots in women’s health. AI tools will continue to evolve, but they cannot yet replace human expertise in medical diagnosis and treatment.
