Any experts on data analysis in the room who see potential in this approach??
I'm definitely not an expert in big data analysis; my company has much brighter minds than I do for that. However, I will say that the format of the data is a problem here.
Meaning, research papers are presented as full text. Two or three different papers may cover largely related topics, and they might all state some basic fact like "fluoxetine is a 5-HT2C agonist" -- but they will use different wording, phrasing, etc.
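To make that concrete, here's a toy illustration in Python (the two "paper" sentences are made up, and the naive check is deliberately dumb):

```python
# Two papers stating the same fact in different words (invented example sentences).
paper_a = "Fluoxetine acts as an agonist at the 5-HT2C receptor."
paper_b = "We observed 5-HT2C receptor agonism following fluoxetine administration."

# A naive keyword/phrase match treats these as two unrelated statements.
claim = "fluoxetine is a 5ht2c agonist"
print(claim in paper_a.lower())  # False
print(claim in paper_b.lower())  # False
```

Extracting "same fact, different phrasing" is exactly the part that simple string matching can't do, which is why the format problem matters so much.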
Fundamental to being able to do "big data" analysis is having information in a consistent, homogeneous format, so that you can use strategies like MapReduce to produce insights about the nature of that data. Human-readable linguistic text is not a consistent, homogeneous format. The problem of converting human-readable text into something that can be procedurally rendered into a homogeneous format is generally known as natural language processing, and it is a hard problem. The tools for that kind of analysis have gotten much better in the past couple of decades, but they are still nowhere near the point of being able to easily consume 10,000 different papers on the same subject and compile their results into a consistent format. My total guess is that it might be possible to do that to some significant degree in another 30-40 years... but there's always room for some huge breakthrough that completely changes the game.
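For a sense of what I mean by a homogeneous format: here's a minimal map/reduce-style sketch in Python, assuming the papers had already been boiled down to structured (drug, target, action) records -- which is, of course, exactly the hard part. All of the field names and records below are invented.

```python
from collections import Counter
from functools import reduce

# Hypothetical records: each paper already reduced to a structured finding.
records = [
    {"drug": "fluoxetine", "target": "5-HT2C", "action": "agonist"},
    {"drug": "fluoxetine", "target": "5-HT2C", "action": "agonist"},
    {"drug": "fluoxetine", "target": "SERT",   "action": "inhibitor"},
]

# Map step: emit one key per finding.
mapped = [(r["drug"], r["target"], r["action"]) for r in records]

# Reduce step: count how many papers report each finding.
# (Counter(mapped) would do the same thing in one call.)
counts = reduce(lambda acc, key: acc + Counter([key]), mapped, Counter())
print(counts.most_common())
```

Once the data looks like that, the analysis is almost trivial. Getting from free text to that point is the whole battle.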
There would also be room to use mechanical turks to do this: basically, hire a bunch of people to read papers and compile metadata about them in a common format. Google has done exactly that with certain kinds of image recognition, literally paying people to spend hours documenting and double-checking house address numbers in photographs and things like that. However, that's work that can be done by anyone with functional eyes and hands; a careful metadata analysis of medical research texts would require a much more educated workforce, and is therefore a much harder undertaking.
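If you did go the hire-a-bunch-of-people route, the "common format" itself is the easy part -- something like a fixed schema each annotator fills in per paper. A hypothetical sketch (all field names here are my invention, not any real standard):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "common format" a paid annotator might fill in for each paper.
@dataclass
class PaperAnnotation:
    doi: str
    drug: str
    target: str
    action: str                      # e.g. "agonist", "antagonist", "inhibitor"
    species: str                     # e.g. "human", "rat"
    sample_size: Optional[int] = None

example = PaperAnnotation(
    doi="10.1000/example",
    drug="fluoxetine",
    target="5-HT2C",
    action="agonist",
    species="human",
    sample_size=42,
)
print(example)
```

The expensive part isn't the schema; it's paying enough people with enough domain knowledge to fill it in accurately across millions of papers.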
Personally, the place I would most like to see effort put is in getting everyone's medical records into an identical format (which is already, more or less, underway) -- and then getting them into some kind of searchable database that anonymizes things enough for the obvious privacy concerns to go away, while retaining enough data for deep statistical analysis. If you had such a database, you could fairly trivially answer hard questions like "are there more clusters of Parkinson's disease cases in areas where compound X is more common in the water supply?" or "are people seen less frequently for ongoing tinnitus complaints when they happen to be on drug Y for other reasons?" and things of that form. It's a little maddening to me that this is not already happening, simply because the technology for it exists -- and has existed for decades. The things preventing it are privacy concerns, along with the general sluggish pace of change on the paperwork side of the medical industry.
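To give a flavor of what "fairly trivially" means here, a toy sketch against an in-memory SQLite database -- every table and column name is hypothetical, and a real records schema would be vastly messier:

```python
import sqlite3

# A toy, anonymized-records schema (all table/column names are invented for illustration).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patients      (patient_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE diagnoses     (patient_id INTEGER, icd_code TEXT, year INTEGER);
CREATE TABLE water_samples (region TEXT, compound TEXT, level REAL);
""")

# "Are Parkinson's diagnoses more common in regions where compound X is elevated?"
query = """
SELECT w.region,
       AVG(w.level)                 AS avg_compound_x,
       COUNT(DISTINCT d.patient_id) AS parkinsons_cases
FROM water_samples w
LEFT JOIN patients  p ON p.region = w.region
LEFT JOIN diagnoses d ON d.patient_id = p.patient_id AND d.icd_code = 'G20'
WHERE w.compound = 'compound_x'
GROUP BY w.region
ORDER BY avg_compound_x DESC;
"""

# With real data loaded, this prints one row per region; here the tables are empty.
for row in con.execute(query):
    print(row)
```

The query itself is undergraduate-level SQL. The hard parts are the anonymization, the data sharing agreements, and getting every provider's records into the same schema in the first place.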