Big Data and Drug Discovery: Connecting the Dots…….
How does one connect the dots? What leads up to the Eureka moment? Is it the sudden cosmic aligning of stars that leads to great wisdom? Hardly! It is usually the last straw in the stream of consciousness that breaks the back of the ignorant camel. Modern polymer and organic chemistry owes much to Kekulé and his discovery of the cyclic structure of benzene. There is a famous story about how Kekulé had a reverie about a snake swallowing its tail, which led him to propose the cyclic structure of the six-carbon ring. Well, if it was only that, then it makes a mockery of all the knowledge leading up to that point. Rather, these “Eureka” moments are the sudden bursts of clarity that propagate through the maze of facts or data.
What’s new today? In this world of instant gratification, what we don’t immediately realize is the democratization of knowledge processing and decision-making. Wearing a “fit band” allows each of us to monitor our pulse and sleep patterns and generally manage our daily health. In a way, this goes beyond becoming “self-aware” about one’s health: it means drawing on the vast body of knowledge about healthy habits and replicating them. It is the “de-romancing” of the path to self-awareness, enabled as much by gadgets that collect data (the fit band) as by the published (and accessible) body of work that helps each individual make healthy decisions. Ergo, one does not need to be an exercise guru to stay healthy!
This is a blog about Big Data and Drug Discovery! So, what has this topic got to do with drug discovery? The path from basic “data” in biology and chemistry to drugs has always depended on “experts” such as medicinal chemists, protein engineers and the like. Even in the hallowed portals of higher education, these individuals are a rarefied group who work on translating brilliant (but often abstract) basic discoveries into really useful inventions such as drugs. It is then supremely ironic that these geniuses are the bottleneck in creating more new drugs. Indeed, there aren’t enough geniuses in the world to make sense of all the information available in Wikipedia!
Enter the Big Data approach to “pathway mining”. Just as a cartographer uses all the knowledge about the terrain to create maps, the modern biologist uses genetic, genomic, proteomic and epidemiological data to recognize a pattern or path in the maze of data. What’s new is that machine learning, natural language processing and robust statistical methods now allow us to automate the processing of this so-called chaos of raw and disparate data. So, instead of depending on geniuses to make the connections all on their own, these methods allow us to organize the flood of new data into discrete patterns that the geniuses can make use of. Think of it as pre-packaged knowledge suitable for discoveries. Or, going by the cartographer’s analogy, it is a map that still must be used by explorers to go and settle new territories.
Essentially, what these new data mining approaches have done is to enable us to mine genomic and scientific data in real time, so that any new publication is “connected” with every piece of data relevant to it. So, when a gene mutation is linked to a rare genetic disease, these new methods allow us to place the gene in the context of all known molecular pathways and the “drug space” around those pathways. These are the “patterns in chaos” that allow the experts within the pharmaceutical industry to pinpoint the site of molecular intervention to cure a disease.
The study of cholesterol homeostasis led to the characterization of a mutation in PCSK9, a protein involved in LDL metabolism. Natural mutations that inactivated PCSK9 were linked to low LDL levels. Further investigation gave rise to the hypothesis that inactivating PCSK9 could be a viable way to reduce cholesterol in individuals who do not respond to statins. This is a perfect example of taking a basic genetic discovery, placing it in the context of known pathways of cholesterol metabolism and the drug space around cholesterol homeostasis, and generating a testable hypothesis for controlling LDL levels. Integrating genetic databases from NCBI with the scientific literature from PubMed and chemical drug data from sources such as DrugBank allows one to connect the dots and see the patterns much faster, because time is not wasted on assembling each piece of data by hand.
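To make the idea of “placing a gene in context” concrete, here is a minimal Python sketch. It is an illustration only: the records are a tiny hand-written toy set, not data pulled from NCBI, PubMed or DrugBank, and the function name drug_space_for_gene and the two dictionaries are hypothetical names invented for this example. It shows the basic join: start from a gene, find the pathways it belongs to, collect the other genes on those pathways, and list the drugs that already target any of them.

```python
# Illustrative toy records, hand-written for this sketch.
# A real system would load these from curated databases.
GENE_PATHWAYS = {
    "PCSK9": {"cholesterol homeostasis"},
    "HMGCR": {"cholesterol homeostasis"},
    "LDLR": {"cholesterol homeostasis"},
}
DRUG_TARGETS = {
    "statins": {"HMGCR"},  # statins inhibit HMG-CoA reductase (HMGCR)
}


def drug_space_for_gene(gene, gene_pathways=GENE_PATHWAYS, drug_targets=DRUG_TARGETS):
    """Place a gene in the context of its pathways and the drugs around them."""
    pathways = gene_pathways.get(gene, set())

    # Every gene that shares at least one pathway with the query gene.
    pathway_genes = {
        other for other, paths in gene_pathways.items() if paths & pathways
    }

    # Drugs whose known targets sit on those pathways: the "drug space".
    drugs = {
        drug: sorted(targets & pathway_genes)
        for drug, targets in drug_targets.items()
        if targets & pathway_genes
    }

    return {
        "gene": gene,
        "pathways": sorted(pathways),
        "pathway_genes": sorted(pathway_genes),
        "drugs": drugs,
    }


if __name__ == "__main__":
    context = drug_space_for_gene("PCSK9")
    print(context["pathways"])       # pathways the gene sits on
    print(context["pathway_genes"])  # its neighbours on those pathways
    print(context["drugs"])          # existing drugs acting on the same pathways
```

With the toy records above, a query for PCSK9 surfaces the shared cholesterol pathway and the fact that statins act on a neighbouring gene. That is the kind of pre-assembled context from which an expert can frame a hypothesis. The hard part in practice is the curation and scale of the underlying data, which this sketch does not attempt.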
Indeed, we are allowing the dreamers to dream more and not worry about memorizing the dictionary to write about their dream! Imagine a scenario where a new discovery is immediately placed in the context of a useful application for treating a disease. A “live system” of data that monitors such “events” can act as a sentinel, alerting researchers to a discovery’s potential utility and serving primarily as a source of “hypothesis-generating” events. All useful answers come from great questions! So, isn’t it more useful for our geniuses to spend their time framing questions rather than on the mundane task of gathering data? Isn’t it more useful to spend time on connecting the dots rather than collecting the dots? That’s the promise of applying Big Data to Drug Discovery…….