China's Future Doctor introduces clinical evaluation framework for medical AI

A research team led by Future Doctor, a China-based medical AI company, has proposed a standardised framework for evaluating the clinical applicability of medical AI systems, according to a study published in npj Digital Medicine, a journal within the Nature Portfolio.

The study introduces the Clinical Safety–Effectiveness Dual-Track Benchmark (CSEDB), an evaluation framework designed to assess how medical AI models perform in real-world clinical decision-making. The authors describe the framework as addressing limitations in existing medical AI assessments, which have largely focused on performance in standardised medical examinations rather than on clinical practice scenarios.

According to the researchers, this is the first benchmark for healthcare large language models developed by a Chinese team to be published in a leading international medical journal. The paper argues that the absence of clinically grounded evaluation standards has become a constraint as AI systems are increasingly applied in diagnosis and treatment settings.

The study notes that prevailing medical AI evaluation methods typically rely on fixed-answer tests, such as medical licensing examinations. While these assessments can measure knowledge recall, the authors argue that they do not adequately reflect the complexity of real clinical environments, where patient presentations vary widely and treatment decisions often involve uncertainty and risk.

The CSEDB framework was developed by Future Doctor’s research team in collaboration with 32 clinicians from 23 major medical institutions in China, including Peking Union Medical College Hospital, the Cancer Hospital of the Chinese Academy of Medical Sciences, the Chinese PLA General Hospital, and Huashan Hospital affiliated with Fudan University. The framework evaluates AI systems across two dimensions—safety and effectiveness—using 30 core indicators defined through clinical expert consensus.

Seventeen of the indicators focus on safety considerations, including recognition of critical illness, avoidance of fatal diagnostic errors, contraindicated medication checks, and drug safety. Thirteen indicators assess effectiveness, covering adherence to clinical guidelines, prioritisation in patients with multiple conditions, and optimisation of diagnostic and treatment pathways. Each indicator is weighted on a five-point scale according to clinical risk, with higher weights assigned to scenarios involving potentially life-threatening outcomes.
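
To make the scoring scheme concrete, the sketch below shows one way a risk-weighted, dual-track score of this kind could be aggregated. The indicator names, weights, and grades are illustrative assumptions for exposition only, not the published CSEDB methodology.

```python
# Illustrative sketch of a dual-track, risk-weighted scoring scheme.
# Indicator names, weights, and grades are hypothetical; the published
# CSEDB methodology may differ in both indicators and aggregation.

SAFETY_INDICATORS = {
    # indicator: clinical-risk weight on a 1-5 scale (5 = life-threatening)
    "critical_illness_recognition": 5,
    "fatal_diagnostic_error_avoidance": 5,
    "contraindicated_medication_check": 4,
}

EFFECTIVENESS_INDICATORS = {
    "guideline_adherence": 3,
    "comorbidity_prioritisation": 4,
    "pathway_optimisation": 3,
}

def track_score(grades: dict[str, float], weights: dict[str, int]) -> float:
    """Weighted average of per-indicator grades (each in [0, 1]),
    so higher-risk indicators contribute more to the track score."""
    total_weight = sum(weights[name] for name in grades)
    weighted = sum(grades[name] * weights[name] for name in grades)
    return weighted / total_weight

# Example: hypothetical expert-assigned grades for one model.
safety_grades = {
    "critical_illness_recognition": 0.95,
    "fatal_diagnostic_error_avoidance": 0.90,
    "contraindicated_medication_check": 0.88,
}
effectiveness_grades = {
    "guideline_adherence": 0.86,
    "comorbidity_prioritisation": 0.84,
    "pathway_optimisation": 0.88,
}

print(f"safety: {track_score(safety_grades, SAFETY_INDICATORS):.3f}")
print(f"effectiveness: {track_score(effectiveness_grades, EFFECTIVENESS_INDICATORS):.3f}")
```

Under this kind of aggregation, a model that mishandles a high-weight safety indicator is penalised more heavily than one that slips on a low-risk effectiveness indicator, which is consistent with the study's emphasis on clinical risk.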

The evaluation methodology departs from traditional static question-and-answer formats. The benchmark comprises 2,069 open-ended clinical scenarios spanning 26 medical specialties, designed to simulate the complexity of real diagnostic and treatment decisions.

Using the CSEDB framework, the researchers evaluated several widely used global AI models, including DeepSeek-R1, OpenAI o3, Gemini 2.5, Qwen3-235B, and Claude 3.7. The study reports that MedGPT, developed by Future Doctor, achieved the highest overall scores across both safety and effectiveness dimensions, with a safety score of 0.912 and an effectiveness score of 0.861.

The authors note that MedGPT was the only model in the comparison whose safety score exceeded its effectiveness score, which they interpret as indicating a more cautious clinical decision profile relative to other systems evaluated.

According to the study, MedGPT’s development has focused on incorporating clinical expert consensus into its system design, with an emphasis on physician-like reasoning processes. The researchers report that earlier trials involving real patients showed a high level of diagnostic agreement between the system and hospital physicians, and that the platform is currently used by more than 10,000 doctors, generating ongoing clinical feedback that is used to refine the model.

The authors conclude that frameworks such as CSEDB could support more consistent evaluation of medical AI systems and inform future regulatory, clinical, and development efforts as the use of AI in healthcare continues to expand.

Author

  • Matthew Brady

    Matt is an award-winning storyteller, writer, and communicator currently based in Riyadh.

    A native Englishman, his career has led him to diverse locations including China, Hong Kong, Iraq, Malaysia, Saudi Arabia, and the UAE.

    In addition to founding HealthTechAsia, Matt is a co-founder of the non-profit Pul Alliance for Digital Health and Equity.

    In a former life, he oversaw editorial coverage for Arab Health, Asia Health, Africa Health, and other key events.

He won a Medical Travel Media Award, organised by the Malaysia Healthcare Travel Council, in 2021, and a Guardian Student Media Award in 2000.
