Darwin's Our Take 6.22.26: Study finds general-purpose LLMs perform better than specialized clinical AI tools

Study finds general-purpose LLMs perform better than specialized clinical AI tools
When researchers pitted two clinical artificial intelligence tools, OpenEvidence and Wolters Kluwer’s UpToDate Expert AI, against three frontier large language models — OpenAI’s GPT-5.2, Google’s Gemini 3.1 Pro Preview, and Anthropic’s Claude Opus 4.6 — the general-purpose LLMs outperformed the specialized clinical AI tools in three medical benchmarks.
The findings were published June 12 in the journal Nature Medicine.
The evaluation was carried out in three stages. The first comprised 500 MedQA questions to assess medical knowledge. The second consisted of 500 HealthBench items designed to measure agreement with expert clinicians.
The third stage incorporated 100 de-identified real clinical queries (RCQs) pulled from physician LLM queries during live clinical deployment of NYU Langone Health’s HIPAA-compliant GPT instance. This stage underwent randomized, blinded review by 12 clinicians on four dimensions: clinical correctness, completeness, safety/harm, and clarity. The researchers also included Google Search AI Overview in this evaluation.
Across all three stages, the analysis spanned multiple-choice reasoning, expert clinical judgment, and everyday clinician use, the researchers noted. Their intent was to assess the models’ performance on the metrics of accuracy, safety, and reliability.
On the MedQA questions, Gemini had the highest ranking for accuracy, at 97.4%. GPT ranked second, at 97.4%, and Claude ranked third, at 90.2%. The clinical tools ranked lower, with OpenEvidence achieving 89.6% accuracy and UpToDate achieving 88.4%.
The HealthBench evaluation was graded by a panel of LLM judges on a scale of 0-100. The difference in scores between the general-purpose LLMs and the clinical AI tools was even greater than on the MedQA benchmark. GPT scored 88.0, Gemini scored 79.3, and Claude scored 77.0 versus scores of 62.6 for OpenEvidence and 61.3 for UpToDate.
The researchers found that GPT ranked first or tied for first in all seven HealthBench categories, whereas OpenEvidence and UpToDate ranked lowest or tied for lowest in all seven categories.
The results of the RCQ evaluation split the models into two performance tiers, with the frontier LLMs forming the top tier, and the clinical AI tools — along with Google Search AI Overview — forming the lower tier. There were no significant differences among models in the same tier.
Specifically, on a 1-4 point scale, Gemini had a mean aggregate score of 3.62, GPT scored 3.54, and Claude scored 3.52. OpenEvidence had a mean aggregate score of 3.24, UpToDate scored 3.17, and Google Search AI Overview scored 3.27.
The researchers noted that the general-purpose LLMs outperformed the clinical tools on most individual questions in the RCQ evaluation, not just on average. Google Search AI Overview scored as well as or better than the clinical tools across all dimensions, they found.
The six models showed the greatest amount of variation on the dimension of clarity and the least variation on clinical correctness. Safety outcomes did not differ across the models, and none generated more harmful content or hallucinations than any of the other models.
Of note, all 12 clinicians who rated the models on the RCQ benchmark ranked them similarly, placing the general-purpose LLMs above the clinical AI tools.
The researchers said they could not determine why the clinical tools underperformed the frontier LLMs because the architecture of proprietary clinical AI tools is inaccessible.
“The observed advantages of frontier general-purpose models may reflect the accelerated development and investment in these systems,” they suggested, adding that their results “should … be interpreted as a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches.”
“In particular,” they wrote, “deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation.”
They acknowledged several limitations to their study before stating that while clinical AI tools “may carry institutional legitimacy and are likely safe for routine use,” their findings showed that these tools “are not superior to frontier models on knowledge, communication, or clinical alignment.”
“The path forward may ultimately lie with hospital-specific LLMs that leverage institutional data to mitigate external harm, along with careful use of frontier models for less-sensitive tasks. As generative LLMs become integrated into healthcare at the enterprise, individual clinician, and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks,” they concluded.
OUR TAKE: OpenEvidence called for Nature Medicine to retract the study, Becker’s Hospital Review reported, claiming in a letter that the research relies on flawed methods.
The company noted that in a separate study evaluating the use of OpenEvidence to augment decision-making for primary care physicians, researchers at Mayo Clinic rated the AI tool “high in clarity, relevance, and evidence-based support.”
Wolters Kluwer likewise raised objections to the study results. Dr. Peter Bonis, chief medical officer of Wolters Kluwer Health, said UpToDate Expert AI had been tested recently on more than 1,600 clinical queries across more than 15,000 criteria and had returned clinically aligned information for 99.9% of assessed criteria, Becker’s reported.
A different study published April 30 in the journal Science compared the OpenAI o1 series LLM against hundreds of physicians on various clinical cases and clinical tasks. The research team found that the LLM outperformed the physicians and older OpenAI models.
“Our study suggests that LLMs have eclipsed most benchmarks of clinical reasoning, motivating the urgent need for prospective trials,” the study authors wrote.
Benjamin Mazer wrote in an article in The Atlantic (titled “AI Is Taking Over Hospitals”) that one of the study’s lead authors, Adam Rodman, said at a press conference before the study was published, “I get a little queasy about how some of those results might be used.”
Rodman said that as thorough as the research team’s work may have been, it didn’t prove that the AI tool they had evaluated — or any other AI tool — was ready to become a standard part of medical science, Mazer wrote. Nonetheless, Mazer added, “most people will ignore the warning.”
“AI has already wormed its way into the U.S. healthcare system, evidence and safeguards be damned,” Mazer, who works as a pathologist at a medical center, wrote.
He noted that most generative AI products that physicians are now using meet criteria that allow the products to be classified as clinical decision support tools, which means they can be deployed without being vetted by the FDA.
Mazer called attention to an “idea kicking around in the medical literature”: to stop treating AI products as though they’re “merely standard medical devices.” Instead, “[g]iven their humanlike ability to learn new information and tailor answers to individual patients,” he wrote, maybe medical AI products should be evaluated similarly to physicians. In place of FDA approvals, perhaps chatbots should “be asked to pass a medical licensing exam and undergo a period of supervision akin to a medical residency.”
While that may seem farfetched, Mazer said Rodman told him the healthcare system is not going to “slow down and wait for the evidence to accrue.”
Meanwhile, UCLA Health just launched a research-driven center of excellence to evaluate the safety and implementation of AI in healthcare. The program is known as the Innovations and Outcomes Validation of AI (INOVAi) Center.
Johnese Spisso, president of UCLA Health and CEO of the UCLA Health System, said in a press release that the center “will help address one of the most important gaps in health care AI — knowing whether these tools are safe, effective, and useful in real-world clinical practice.”
INOVAi will support AI evaluation throughout the lifecycle, from early usability, feasibility, and workflow testing to prospective clinical trials and implementation studies, UCLA Health noted.
What else you need to know
CMS issued a proposed rule to codify the Medicare Drug Price Negotiation Program into a permanent framework, as required by law. The agency said in a press release the framework will create a more transparent and sustainable process for lowering drug costs, as well as greater long-term certainty for participating drug manufacturers. It will include policies for negotiating and renegotiating the price of high-cost, single-source drugs beginning with initial price applicability year 2029.
According to a fact sheet, the proposed rule will also codify a pair of policies that affect the Part D benefit. One of the policies pertains to the inclusion of selected drugs on Part D plan formularies, and the other has to do with how “negotiated price” is defined.
CMS has been administering the negotiation program, which was established through the Inflation Reduction Act in 2022, via agency guidance. Under the proposed rule, CMS will select up to 20 additional drugs for the upcoming fourth cycle of negotiations and for subsequent cycles of the program.
Optum Rx has tentatively settled with the Federal Trade Commission regarding allegations the pharmacy benefit manager and its group purchasing organization, Emisar Pharma Services, inflated the cost of insulin. The FTC filed a lawsuit in 2024 against the top three PBMs — UnitedHealth Group’s Optum Rx, Cigna’s Express Scripts, and CVS Caremark — and their GPOs alleging the companies had “created a broken rebate system that inflated insulin drug prices.” Although all three PBMs have denied the allegations, Express Scripts reached an agreement with the FTC in February and CVS Caremark proposed a settlement in March that has not yet been finalized. Details of Optum Rx’s proposed consent agreement have not been made public.
AdventHealth is rolling out smart technology in 13,000 patient rooms across the health system’s 57 hospitals in nine states. The technology, which includes digital whiteboards, digital door signs, and in-room video capabilities, is integrated with the patient’s electronic medical record and care plan, allowing for significantly improved communications between clinicians, patients, and family members.
Dr. Valerie McKinnis, vice president of clinical innovation at AdventHealth, told Becker’s Hospital Review: “We’re reviewing their labs [with patients] in real time, and it actually decreases that disconnection with our patients that traditional care models have created.” The Altamonte Springs, Fla.-based health system is more than halfway through the installation process and expects to have it completed by the end of the year, according to Becker’s.
Executive moves
Doug Shirley has been appointed CEO of Louisville, Ky.-based ScionHealth. Shirley, currently ScionHealth’s chief financial officer, succeeds Rob Jay, who has been CEO of the health system since it was established in 2021. Jay will transition to the role of executive chairman.
Dr. John Neil will become CEO of Scottsdale, Ariz.-based HonorHealth next March. He will succeed Todd LaPorte, who has been with the health system since 2001 and has served as CEO since 2017. Dr. Neil, who is currently executive vice president, chief physician executive, and chief strategy officer of HonorHealth, will immediately assume the role of president as part of the transition.
Amy Shlossman will be CEO of Portland, Ore.-based OHSU Health effective Aug. 31. Currently, Shlossman is president of Sinai Hospital and Grace Medical Center, both part of LifeBridge Health. She is also senior vice president of the Baltimore-based health system, the announcement noted. Prior to joining LifeBridge, Shlossman was chief operating officer at two medical centers in the Phoenix-based Banner Health system. Dr. Renee Edwards has been acting CEO of OHSU since April.
What we’re reading
Study Finds Flaw in Method for Key Oncology Value-Based Care Model. AJMC, 6.17.26
The LEAD Model And The Remaining Structural Limits To Fee-For-Service Value-Based Care. Health Affairs, 6.17.26
Blueprint for Safety: Implementing a Clinically Governed AI Digital Assistant for Patient Guidance. NEJM Catalyst, 6.17.26 (abstract available, subscription required for full article)