Machine learning in the cloud
A new age of human data research has arrived, says computer scientist Paul Harris, PhD, professor of Biomedical Informatics and director of the Office of Research Informatics. “Just about every health center will by now have collected 10 to 20 years of EHR data, and we have figured out policies and procedures allowing sharing for big clinically derived data assets. Given these large data assets and relatively inexpensive storage and computer costs, we’re currently in a golden age of being able to enable machine learning research at scale. You’re going to see more and more of it and wonderful discoveries made over the next decade.”
Vanderbilt University Medical Center maintains a research data warehouse holding the electronic health records (EHRs) of some 3.2 million patients, available for study by qualified researchers. By next year the warehouse will have moved off its on-campus data appliance and onto Microsoft and Google cloud computing services, a change that brings important advantages. “The cloud gives you easy fee-based ways to scale up storage capacity and CPUs running on your data,” Harris said.
The move to the cloud is influenced by VUMC’s experience as a key partner in the federal government’s massive All of Us Research Program. Launched in May 2018, the program is a historic effort to gather data from 1 million or more people to accelerate research and improve health; to date, it has enrolled well over 500,000 participants. All of Us data resides in the Google cloud, and that’s where any computational analysis of it occurs. Qualified researchers access the data through the program’s Researcher Hub and Workbench, a product of the All of Us Data and Research Center, which is led by VUMC in collaboration with the Broad Institute of MIT and Harvard and Verily Life Sciences (a subsidiary of Alphabet Inc.).
“Our model in making data resources available in All of Us is bringing researchers to the data, rather than sending data to researchers,” said Harris, who serves as principal investigator for the Data and Research Center. “This secure, cloud-based storage and computing environment democratizes access for machine learning, giving access to data on a highly diverse population of participants, without local requirements for data storage and computer resources.”