
Why Medical AI Needs Open Source Data to Thrive

Adam McArthur

Data in AI has been described as “the new oil”, and as “potentially the most under-valued and de-glamorised aspect of today’s AI ecosystem”. Although the main field of AI is advancing on this understanding, and industry is moving with it, Medical AI is significantly behind the curve.


One of the biggest misunderstandings right now is what “good data” means. To a Medical Professional, this means good quality scans with closed-ended, clearly defined outcomes and little ambiguity. This is completely incorrect in an AI context. By good data, the AI world means “representative data”, i.e. a dataset that shows examples from all parts of the “clinical world”. This includes things like bad scans, scans from different datacenters, scans taken on different hardware - even documented “mistakes”. These things are extremely important because today's AI systems lack agency: they struggle to understand data that does not look “mathematically” like anything they have seen before. People often underestimate what generalisations they themselves are capable of, so a good rule of thumb is to assume that AI systems are very bad at spotting patterns that even a young adolescent would find easy.
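
To make “representative” a little more concrete, here is a minimal sketch of how you might audit a dataset's metadata for coverage before training on it. Everything in it is an assumption made for illustration: the field names (`site`, `scanner`, `quality`), the example records and the 30% threshold are hypothetical, and a real audit would use whatever metadata your scans actually carry.

```python
from collections import Counter

# Hypothetical scan metadata; field names and values are illustrative only.
scans = [
    {"site": "hospital_a", "scanner": "vendor_x", "quality": "good"},
    {"site": "hospital_a", "scanner": "vendor_x", "quality": "good"},
    {"site": "hospital_a", "scanner": "vendor_x", "quality": "poor"},
    {"site": "hospital_b", "scanner": "vendor_y", "quality": "good"},
]


def coverage(records, field):
    """Return the share of records for each value of one metadata field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def underrepresented(records, field, min_share=0.3):
    """List values of a field whose share falls below an assumed threshold."""
    shares = coverage(records, field)
    return sorted(v for v, share in shares.items() if share < min_share)


site_share = coverage(scans, "site")
rare_sites = underrepresented(scans, "site")
rare_quality = underrepresented(scans, "quality")
print(site_share, rare_sites, rare_quality)
```

A check like this only tells you which parts of the “clinical world” are thin in the metadata you recorded. It cannot tell you about variation you never wrote down, which is one more reason to document datasets carefully.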


This is an extremely large problem for any individual hospital or research lab to overcome. Data from one center will never be “representative”, so the AI system built on it will likely only work in that one location, and only while the hardware and operational setup there stay the same. Many centers try to sign MOUs or data sharing agreements with each other to get around this, but even when this succeeds, it is very unlikely to scale (how do you get hundreds of teams to collaborate, and repeat this process hundreds of times, over many decades, in every country?). The only way to overcome this is no-barrier data sharing and collaboration, which almost always means open-sourcing the process. Think of trade. It has been well understood for decades that the best way to make the global economy more efficient is to reduce all types of barriers to trade. This leads to an economy where everyone benefits. The same can and should be true of Medical Data for AI systems.
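
One way to see the single-center problem in practice is to hold out each center in turn and test on data the model has never seen from that location. The sketch below only builds the splits; the record layout and the `site` key are hypothetical, and the training and scoring steps are left out because they depend entirely on your own model.

```python
def leave_one_site_out(records, site_key="site"):
    """Yield (held_out_site, train, test), holding out one site at a time."""
    sites = sorted({r[site_key] for r in records})
    for held_out in sites:
        train = [r for r in records if r[site_key] != held_out]
        test = [r for r in records if r[site_key] == held_out]
        yield held_out, train, test


# Hypothetical records; in practice each would point at a scan and its label.
records = [
    {"id": 1, "site": "centre_a"},
    {"id": 2, "site": "centre_a"},
    {"id": 3, "site": "centre_b"},
    {"id": 4, "site": "centre_c"},
]

splits = list(leave_one_site_out(records))
for held_out, train, test in splits:
    print(held_out, [r["id"] for r in train], [r["id"] for r in test])
```

With data from only one center there is nothing to hold out, so this kind of check is impossible. That is the practical gap that shared data closes.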


There are specific examples right now where Medical AI can actually benefit from open data more than the main AI field. Many computer science researchers don’t care which part of Medical AI they research, and instead pick from datasets that are freely available on the internet. This is a “data-first” approach. Why would CS researchers wait years for data on one clinical problem, when they could find data for a different problem on the internet now? This should be a great incentive to people considering open sourcing their medical data, especially if they are the first to open source data for that problem. You can release your data with the information required for the end problem you want to solve, and then not only can your lab work on it, but labs around the world can try to solve it too (this is especially useful if you do not have the capabilities to train AI models yourself!). By choosing what you release, you steer what AI researchers across the world do with your data.


There are many misconceptions about open-sourcing your data. The first concerns privacy. Research papers from many different organisations describe methodologies for protecting privacy when releasing data, which means that for most data there are ways to release it safely, in a way that is still useful to computer scientists.


The second misconception is that keeping data private and/or monetized gives its owner an “advantage”. This is similar to how people view ideas. Both of these views are slowly dying out, as consensus builds that execution matters significantly more. Many of the architectures we have today are already capable of easily solving the challenges Medical Professionals have, but we lack the data to test this. Many leaders in AI are realizing that they can speak freely about their ideas and still be the only competitive company in the space. I think the same applies to Open Source Data: without going through the legal and engineering process of validating an AI system and getting it medically approved, there is no money to be made. Some organisations are able to “go it alone”, but they often operate at a scale and with an income that hospitals cannot reach (e.g. terabytes of new data per day, millions of dollars per AI model).


There are already many large, successful open data efforts in Medical AI, ranging from cancer to genomics. These efforts allow researchers around the world to use the data as they see fit, experiment with it, and publish their findings. Most of the time, computer science research (especially for big conferences) is made open source, which means that for many datasets, over a few years you can have 10 or 20 different open source models to pick from that work in slightly different ways. The largest AI Medical conference, MICCAI, in 2024 released the “Open Data Initiative”, encouraging all papers submitted to the conference to have open source data with them. This not only means Computer Scientists have an incentive to go with a “data-first” approach, but that even if you have your own Computer Scientists, their papers might be rejected for not having open-source data.


This doesn’t mean there are no potential downsides. People can “free-ride” on your data without contributing back (depending on the licence), and if the data is not documented correctly, it could be misused. There are also real concerns about governance and responsible AI practices that need to be addressed. I view this as a necessary evil that the software engineering community has already accepted for decades (take cloud applications for example). It's also worth noting that, for now, certain types of data are too sensitive to open source, but new smart encryption systems could enable the open-sourcing of this data as well in the future.


If you are interested in open sourcing your data, try to do it in a way that helps the data custodian repeat the process for other datasets. We recommend that the first thing you think about is the licence you would like your data to be released under. A great place to start for this is the Creative Commons Licence Chooser. We also recommend getting in touch with people who have experience with open-source medicine, like Open Source Imaging in Germany and The Linux Foundation’s Public Health Arm.


Although I talk about data here, I eventually want to see other types of information open sourced too. People should focus on open-sourcing all their code to let others reproduce the results of a paper, and on open-sourcing the process of developing their code on GitHub as well. On the implementation side, people should open source how they got their software working in hospitals, the processes they went through and how much it cost. We should open source the documents written for medical device approval, so each new company or organisation doesn’t have to start this process from scratch. There are already organizations that can help with this, including the ones mentioned before (Open Source Imaging in Germany and The Linux Foundation).


We urge anyone reading this post to consider whether you can integrate Open Source ideas into your work, and whether any of the projects you are working on can open source their data. If you work for a clinical center, consider asking whether anyone has researched its ability to open source data, or whether they can start if they haven’t already (look at the Mayo Clinic and Johns Hopkins for examples). The person who comes after you with data they want to open source will thank you.

 
 
 

Datamint 2024
