
Nature News · Feb 18, 2026 · Collected from RSS
Main

Rare diseases, defined as conditions affecting fewer than 1 in 2,000 people, collectively impact more than 300 million people worldwide, with more than 7,000 distinct disorders identified to date, approximately 80% of which are genetic in origin [1,2,3,9]. Despite their cumulative burden, rare diseases remain notoriously difficult to diagnose due to their clinical heterogeneity, low individual prevalence and limited clinician familiarity [1,2,4,6,10,11,12,13,14,15]. Patients often experience a prolonged ‘diagnostic odyssey’ averaging more than 5 years, marked by repeated referrals, misdiagnoses and unnecessary interventions, all of which contribute to delayed treatment and adverse outcomes [4,5]. These challenges highlight the urgent need for scalable, accurate and interpretable diagnostic tools, an area where recent advances in multi-agent systems offer transformative potential.

Developing artificial intelligence (AI) systems for rare disease diagnosis presents several inherent challenges: (1) multi-disciplinary knowledge (rare diseases often manifest with complex, heterogeneous and multisystem symptoms, requiring diagnostic models to possess multi-disciplinary medical knowledge and the ability to interpret diverse patient phenotypes [16,17]), (2) limited cases (the scarcity of cases for individual rare diseases limits the availability of training data, making it difficult to develop robust models and increasing the risk of overfitting and catastrophic forgetting), (3) dynamic knowledge updates (the rare disease knowledge landscape is rapidly evolving, with approximately 260 to 280 rare genetic diseases discovered per year, according to the International Rare Diseases Research Consortium (IRDiRC)) [18] and (4) transparency and traceability (clinical deployment demands interpretability; diagnostic suggestions must be accompanied by transparent, traceable reasoning to support clinician trust and accountability).
The rapidly evolving knowledge landscape, in particular, demands AI systems that are not only updatable but also capable of integrating new knowledge efficiently.

Recent advances in agentic large language model (LLM) systems have opened new avenues for rare disease diagnosis [6,7,19,20,21,22,23,24,25,26]. These systems orchestrate several specialized tools and sub-agents [23,24], enabling seamless integration of external knowledge bases, case repositories and multi-modal analytical components [8,21]. Unlike conventional supervised learning approaches, these systems are typically training-free and excel in few-shot and zero-shot scenarios, an essential capability for rare disease applications where annotated data are scarce. Their modular and interpretable architectures further facilitate transparent, auditable and clinically actionable diagnostic workflows.

Here we present DeepRare, an agentic LLM-based system designed specifically for rare disease differential diagnosis decision support. DeepRare is capable of processing heterogeneous patient inputs, including free-text clinical descriptions, structured Human Phenotype Ontology (HPO) terms and genomic testing results. Based on the input, the system generates a ranked list of candidate diagnoses, each supported by a transparent chain of reasoning that directly references verifiable medical evidence, enhancing interpretability and supporting clinician trust in AI-assisted decisions. Inspired by the Model Context Protocol (MCP) [8], DeepRare uses a three-tier architecture: a central LLM-powered host with memory coordinates the process; specialized agent servers handle phenotype and genotype analysis, normalization and knowledge retrieval; and the outer tier integrates curated and Web-scale medical resources.
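As a rough illustration of the three-tier layout described above, the following Python sketch wires a host with a memory to two agent servers, each wrapping a toy source. Every class, function and source name here is hypothetical and chosen only for illustration; the sketch leaves out the LLM entirely and should not be read as DeepRare's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout; nothing here comes from the DeepRare codebase.

# Outer tier: a source maps a query string to a list of evidence snippets.
Source = Callable[[str], List[str]]


@dataclass
class AgentServer:
    """Middle tier: a specialized server that manages its own set of sources."""
    name: str
    sources: Dict[str, Source]

    def gather(self, query: str) -> List[str]:
        evidence = []
        for source_name, lookup in self.sources.items():
            for snippet in lookup(query):
                # Tag each snippet with its origin so it stays traceable.
                evidence.append(f"[{self.name}/{source_name}] {snippet}")
        return evidence


@dataclass
class Host:
    """Inner tier: coordinates the servers and keeps collected evidence in memory."""
    servers: List[AgentServer]
    memory: List[str] = field(default_factory=list)

    def collect(self, query: str) -> List[str]:
        for server in self.servers:
            self.memory.extend(server.gather(query))
        return self.memory


# Toy stand-ins for real resources (assumptions, not actual data sources).
def toy_case_bank(query: str) -> List[str]:
    return [f"similar case mentioning {query}"]


def toy_literature(query: str) -> List[str]:
    return [f"article about {query}"]


host = Host(servers=[
    AgentServer("phenotype", {"case_bank": toy_case_bank}),
    AgentServer("knowledge", {"literature": toy_literature}),
])

# An HPO-style identifier used purely as a toy query.
evidence = host.collect("HP:0001250")
```

Tagging each snippet with the server and source that produced it is one simple way to keep evidence traceable; in a real system the host would pass this memory to an LLM rather than return it directly.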
To improve robustness, DeepRare further uses a self-reflective loop that iteratively reassesses hypotheses, reducing over-diagnosis and mitigating LLM hallucinations.

We evaluated DeepRare on 6,401 clinical cases collected from seven public datasets and two in-house datasets, sourced from diverse populations across Asia, North America and Europe. The two in-house datasets, from Xinhua Hospital (Shanghai) and The Affiliated Children’s Hospital of Xiangya School of Medicine (Hunan), contain 330 cases that include both phenotypes and whole-exome sequencing (WES) data. All diagnoses in this cohort have been validated rigorously by genetic testing, providing a high-quality standard for assessing diagnostic performance. DeepRare consistently achieves superior diagnostic accuracy across all eight datasets of 2,919 rare diseases spanning 14 medical specialties.

In HPO-based evaluations, compared with a further 15 methods, including traditional bioinformatics tools, LLMs and agentic systems, DeepRare achieved average scores of 57.18% and 65.25% at Recall@1 and Recall@3, respectively, surpassing the second-best method (Reasoning LLM) by substantial margins of 23.79% and 18.65%. Recall@K measures whether the correct diagnosis appears within the top-K predictions. For instance, Recall@1 indicates the percentage of cases in which the correct diagnosis is the top-ranked prediction, while Recall@3 measures cases in which the correct diagnosis appears anywhere in the top three predictions. In multi-modal input scenarios, DeepRare achieved a Recall@1 of 69.1%, outperforming Exomiser’s 55.9% on the Xinhua whole-exome cases. Furthermore, we engaged ten rare disease physicians to manually verify the traceable reasoning chains generated by the system across 180 cases.
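The Recall@K metric defined above can be sketched in a few lines of Python. This is a minimal illustration that assumes a prediction counts as correct only on an exact string match with the reference diagnosis; the matching criterion used in the actual evaluation may differ, and the disease lists below are invented toy data, not results from the study.

```python
from typing import Sequence


def recall_at_k(ranked_predictions: Sequence[Sequence[str]],
                truths: Sequence[str],
                k: int) -> float:
    """Fraction of cases whose correct diagnosis appears in the top-k predictions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(truths):
        raise ValueError("one ranked list is required per case")
    if not truths:
        raise ValueError("at least one case is required")
    # Assumption: exact string equality decides whether a prediction is a hit.
    hits = sum(truth in preds[:k]
               for preds, truth in zip(ranked_predictions, truths))
    return hits / len(truths)


# Toy ranked outputs for four made-up cases (best guess first).
cases = [
    ["Marfan syndrome", "Loeys-Dietz syndrome", "Ehlers-Danlos syndrome"],
    ["Gaucher disease", "Fabry disease", "Pompe disease"],
    ["Rett syndrome", "Angelman syndrome", "Pitt-Hopkins syndrome"],
    ["Wilson disease", "Hemochromatosis", "Menkes disease"],
]
truths = ["Marfan syndrome", "Fabry disease",
          "Pitt-Hopkins syndrome", "Alport syndrome"]

recall_at_1 = recall_at_k(cases, truths, 1)  # 1 of 4 cases -> 0.25
recall_at_3 = recall_at_k(cases, truths, 3)  # 3 of 4 cases -> 0.75
```

In the toy data, only the first case has the correct diagnosis ranked first, while three of the four cases contain it somewhere in the top three, which is why Recall@3 can never be lower than Recall@1.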
DeepRare demonstrates high reliability in evidence factuality, achieving 95.4% agreement with clinical experts, thereby confirming that its intermediate reasoning steps are both medically valid and traceable to authoritative sources. To facilitate clinical adoption, we have deployed DeepRare as a user-friendly Web application that serves as a diagnostic copilot for rare disease physicians. Finally, we discuss the robustness of our agentic framework by evaluating different underlying LLMs and analysing the contribution of each module, demonstrating the superiority of our system design.

The following sections present the results of our study, beginning with an overview of the proposed DeepRare framework and the evaluation settings, followed by a detailed analysis of the main findings.

System overview

DeepRare is an LLM-powered agentic system designed for rare disease diagnosis. It features a three-tier architecture inspired by the MCP [8] that synergistically integrates reasoning-enhanced LLMs with a broad range of clinical knowledge sources, as shown in Fig. 1a,b. The system comprises (1) a central host, powered by LLMs (locally implemented DeepSeek-V3 [27] by default) and equipped with a memory bank, which orchestrates the entire diagnostic workflow by synthesizing collected evidence; (2) several specialized agent servers, each managing a local set of tools to perform various rare disease-related analytical tasks and interact with distinct resource environments; and (3) heterogeneous Web-scale medical sources, which provide essential and traceable diagnostic evidence, such as research articles, clinical guidelines and existing patient cases.

Fig. 1: DeepRare: an agentic framework for rare disease prioritization.
a, System workflow: multi-modal patient data (HPO terms, genomic variants) are processed through a tiered MCP-inspired architecture, generating a ranked top-K diagnosis list with evidence-supported reasoning chains.
b, Knowledge architecture: sunburst visualization depicting hierarchical integration of diagnostic tools and biomedical knowledge sources within DeepRare.
c, Performance benchmarking: comparative evaluation across diagnostic APIs, general-purpose LLMs, reasoning-enhanced LLMs, medically tuned LLMs and agentic systems. Illustrations in panel a were created using BioRender (https://biorender.com).

Upon receiving a clinical case (provided as free-text phenotypic descriptions, structured HPO terms, raw variant call format (VCF) files or any combination thereof), the central host decomposes the diagnostic task systematically. It first orchestrates the agent servers to retrieve relevant evidence and references from external data sources, tailored to the patient’s information. The host then synthesizes this evidence to generate preliminary diagnostic hypotheses, followed by a self-reflection phase in which it conducts additional searches to rigorously validate or refute these hypotheses. If no hypothesized disease meets the self-reflection criteria, the system revisits earlier steps iteratively to acquire further patient-specific evidence, repeating this diagnostic loop until a satisfactory resolution is achieved. Ultimately, DeepRare outputs a ranked list of potential rare diseases, each accompanied by a transparent reasoning chain that links each inference step directly to trusted medical evidence. Further details on the system workflow are provided in Methods.

Evaluation settings

To evaluate the performance of DeepRare, we consider three baseline approaches: (1) Specialized rare disease diagnosis tools. We consider bioinformatics tools that are directly designed for rare disease diagnosis. Two HPO-based analysis tools, PhenoBrain [28] and PubCaseFinder [29], are considered here. (2) Latest LLMs. We compare different LLMs, including general LLMs, reasoning-enhanced LLMs and medical LLMs.
General LLMs denote the most commonly used LLMs without extra reasoning enhancement or domain alignment, including GPT-4o [30], DeepSeek-V3 [31], Gemini-2.0-flash [32] and Claude-3.7-Sonnet [33]. Reasoning LLMs denote the latest generation LLMs enhanced with an explicit reasoning chain, including o3mini [34], DeepSeek-R1 [31], Gemini-2.0-FT [32] and Claude-3.7-Sonnet-thinking [33]. Medical LLMs refer to LLMs developed specifically for the medical domain, with Baichuan-14B [35] and MMedS-Llama 3 [36] serving as notable representatives. All these LLMs are adapted to rare disease diagnosis, leveraging well-designed Prompt 1 (detailed descriptions of all prompts mentioned later in the text are provided in the Supplementary M