Ping An Technology, in collaboration with Ping An Good Doctor and Peking University Medical, has achieved the highest score on HealthBench Hard, a medical AI evaluation framework published by OpenAI, with its latest medical large language model.
The model, Ping An Medical LLM 3.5, recorded a score of 57.27 on the benchmark, ahead of Baichuan (44.4), Meta (42.8), and OpenAI (42.0). HealthBench Hard is a subset of OpenAI’s HealthBench framework, which was constructed with input from 262 physicians across 60 countries and 26 medical specialities and comprises 5,000 multi-turn clinical dialogue scenarios and 48,562 physician-defined evaluation criteria.
Ping An says the model is designed to replicate clinical reasoning rather than to produce standardised question-and-answer responses. Development drew on real-world data from Ping An’s own healthcare operations, spanning screening, disease management, treatment, and rehabilitation pathways.
The company’s AI-MDT Pro system, deployed at Peking University Medical and Ping An Health institutions, uses the model to support multidisciplinary tumour consultations. Internal figures cited by Ping An indicate an 85% adoption rate for AI-generated treatment recommendations, with agreement between the AI and senior specialists exceeding 92.5% in breast cancer cases.
Ping An separately reported that its financial large language model, PingAnGPT-Qwen3-32B, ranked first on the CNFinBench public leaderboard in March 2026, ahead of DeepSeek-R1, GPT-4o, and Claude Sonnet 4.