cc: Life Science
cc: Life Science Podcast
What Can You Do with 3.5 Million (de-identified) Health Records?

It turns out, you can do quite a lot.

Jud Schneider is the CTO of Nashville Biosciences. I reached out to him to talk about their collaboration with Amgen and Illumina to sequence 35,000 African-American genomes. How is that possible?

Nash Bio (owned by Vanderbilt University Medical Center) has about three and a half million de-identified health records. About 10% of those have consented DNA tied to those records. This is a treasure trove of data for pharma and AI companies to uncover patterns and develop new therapies.

The number of genomes in that project blew me away from the start. But among genome data sets, African Americans are underrepresented. This is an opportunity to make important discoveries for a significant part of the population as well as the population as a whole.

The genome data by itself isn’t enough. The real value is tying the genetic data to health records to find and understand patterns that associate genetic variation with disease or simply biology.

Our conversation evolved into the possibilities of such a massive data set for training AI, including imaging data, and how to get the most out of it. With the help of their clinical team, they can build very specific cohorts for training AI models. Jud says they can think of about a hundred ways to use the data, but customers can think of a thousand.

Let's just say we're looking at chest CTs. And you're looking for, I'm just making something up, but we're developing a product that looks for lung nodules, right? Well, we've got lots of chest CTs that have diagnosed certain types of lung cancers, and you can actually get very specific on the type of lung cancer.

And we've also got lots of, you know, just kind of blank chest CTs, you know, and ones with artifacts that are important, like there's a pacemaker in there. There's other types of medical devices that may be implanted or, you know, there's some different anatomies that you need to take into account.

We've just got quite a diversity of information, so you can really end up with an extremely powerful and well-trained model that, you know, is really starting from a place grounded in the actual diagnosis and not necessarily just CPT codes.




It’s not enough to have a lot of data. You need to understand the protocols under which it was generated and factors that might not be immediately apparent. It would be easy to hand over a bucket of data based on an ICD code and let the customer have at it. (I learned that ICD codes might be what I would describe as “squishy”. Sometimes an initial diagnosis might be uncertain or inaccurate.) But some thoughtful analysis can make the data much more useful. That’s the role of the clinical team, who work with the data on a regular basis.

We find ourselves in a situation all the time where we're really able to disambiguate the, uh, the nuances of the ICD coding system and the billing data to really find the patients that actually have the diagnosis and actually have the data, you know, within routine clinical care that's necessary.
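
To make the “squishy” point concrete, here is a minimal sketch of one simple way a cohort filter might go beyond a single code match. It is an illustration only, not Nash Bio's method: the records are made up, and the `build_cohort` function and its rule (the same diagnosis code family recorded on at least two distinct dates) are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, made-up records: (patient_id, icd_code, encounter_date)
records = [
    ("p1", "C34.90", date(2021, 1, 5)),
    ("p1", "C34.90", date(2021, 3, 9)),
    ("p2", "C34.90", date(2021, 2, 1)),  # single mention, possibly a rule-out
    ("p3", "C34.90", date(2021, 4, 2)),
    ("p3", "C34.90", date(2021, 4, 2)),  # same day, so it counts once
    ("p4", "J18.9", date(2021, 5, 1)),   # unrelated code
]

def build_cohort(records, code_prefix, min_distinct_dates=2):
    """Return patients whose matching codes appear on enough distinct dates.

    Assumed rule for illustration: a code seen only once may reflect an
    uncertain initial diagnosis, so require repeated documentation.
    """
    dates_by_patient = defaultdict(set)
    for patient_id, code, when in records:
        if code.startswith(code_prefix):
            dates_by_patient[patient_id].add(when)
    return sorted(
        patient_id
        for patient_id, dates in dates_by_patient.items()
        if len(dates) >= min_distinct_dates
    )

cohort = build_cohort(records, "C34")
print(cohort)  # ['p1']
```

A naive match on the code alone would return three patients here. The stricter rule keeps only the one whose diagnosis was documented on separate dates. A real clinical team would bring far more context than this, such as imaging, pathology, and notes.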

The big takeaway for me is that while we generate and capture ridiculous amounts of data every day (in healthcare and elsewhere), it’s still important to understand the limitations of how the data is gathered (ICD codes, for example) and to be thoughtful about what you are looking for in order to get the most value out of it.




Fun fact: I first interviewed Jud on Flip Turns, my podcast about people whose lives were changed by swimming.
