NIH’s “All of Us” Milestone: Closing the Data Gap in AI Drug Repurposing
On June 30, 2026, the National Institutes of Health announced that its All of Us Research Program has become the world’s largest integrated genomic and electronic health record database, with data from more than 747,000 participants now available to researchers. NIH’s All of Us Research Program is now the largest integrated genomics and health database in the world (NIH). This is more than a programmatic milestone. It represents a direct answer to one of the most persistent barriers facing AI in drug repurposing: the lack of large, diverse, and integrated datasets that combine genomic depth with real clinical data.
I discussed this problem in my recent post reviewing the use of AI and machine learning in drug repurposing, drawing on the work of Fu et al. in the Annual Review of Medicine (AI). AI and Drug Repurposing: Old Drugs, New Tricks, and the IP Questions (Post). One key challenge is that existing multiomics and clinical data arise from heterogeneous patient samples across different laboratories and health care systems, making harmonization extremely difficult. AI at 391. Limited data sharing between biopharmaceutical companies and academic institutions compounds the problem, as proprietary datasets cannot be accessed by the broader research community due to intellectual property concerns. Post at 2. The result is that the growing mass of genetic and multiomics data has not been effectively explored for drug repurposing due to a lack of accurate, integrated approaches. AI at 382.
The All of Us release speaks directly to these barriers. The dataset now includes more than 535,000 whole genome sequences linked to nearly 482,000 electronic health records, a combination of genomic depth and clinical breadth unmatched by any other research program in the world. NIH at 1. It encompasses more than 1.3 billion genetic variants, 553,000 genotyping arrays, and 96,000 structural variant records, alongside 747,000 survey responses capturing social circumstances, behaviors, and environments. Id. For the first time, the dataset also includes proteomics and RNA sequencing data, moving the program into the multiomics era. NIH at 2.
Two additional features make this resource particularly relevant for AI. First, more than 86% of participants come from communities historically underrepresented in biomedical research, spanning all 50 states and more than 98% of U.S. three-digit ZIP codes. NIH at 1-2. That diversity addresses another concern raised in the article: that real-world data is challenged by confounding factors such as sex, race, and socioeconomic status, as well as a lack of detailed clinical, biomarker, and genetic information. AI at 387. Second, All of Us data is available to registered researchers at no cost, giving scientists at rural universities the same access as those at major research institutions. NIH at 2.
The program has already fueled more than 1,400 peer-reviewed publications, including work identifying existing medications that may help prevent Alzheimer’s disease. Id. As NIH Director Jay Bhattacharya noted, “To tailor treatments to individuals, you actually need very large populations to uncover the patterns that connect genetics, lifestyle, and the environment to health outcomes.” NIH at 1.
That observation captures why this matters for AI and drug repurposing. The field already has algorithms and computational power. What it has been missing is a unified, diverse, openly accessible data foundation connecting genomes to clinical realities.