The rise of open source search in pharma R&D

The pharmaceutical and healthcare industries have led the way in finding uses for big data, machine learning (ML) and AI technologies. The Human Genome Project was, perhaps, the first widely publicised example, while projects such as IBM’s Watson Genomic initiative and the use of Google’s DeepMind at Moorfields Eye Hospital continue to push the boundaries.

The ability to rationalise, search and analyse data at scale is now helping the pharmaceutical industry to better diagnose, predict and cure diseases, find new uses for existing drugs and dramatically hasten the manufacture and trial of new medicines.

Successful application of large-scale data science technologies can deliver unparalleled results, rapid medical progress and change people’s lives.

However, the cost, complexity and failure rate of these custom-built ‘big data’ and machine learning projects have proved a significant, historical barrier to more widespread adoption. How are large pharmaceutical firms overcoming these barriers, and is there a role for open source search technologies in solving these issues?

Matlab, R and Excel

Early data projects were characterised by objective-driven analytics: standards-based analysis was applied to tightly structured, rather than complex, data. Within a framework of digital hypothesis testing, users would design and deploy an expensive machine with a particular question in mind. The data generated would then be analysed and delivered in a proprietary format, before being exported to a standard format (in most cases a spreadsheet).

Researchers would take the output of these systems and combine studies over time, manually collating the results of multiple experiments into one large data set. This ‘Small Data’ technique of analysis was arduous, increased the room for human error in the calculation and relied on useful but limited tools, such as Excel, to glean and display meaningful results.

The volume and complexity of data have grown exponentially over the years, as digitisation has permeated the healthcare and pharma industries. Within the NHS, this growing deluge of data includes patient data (including handwritten notes) spanning 15 years in some cases, matched with clinical notes, surgical and drug histories – all from a range of sources, in a variety of formats.

Consequently, the market outgrew linear tools like Excel. Suddenly, more powerful solutions were needed to analyse irregular forms of data. This is where open source languages such as R and Python came to be used to build custom data tools, and where Matlab, on the commercial front, forged its niche.

The use of Python and R grew out of the limitations of basic tooling such as Excel. In certain disciplines, open source solutions became the standard for addressing tasks such as molecular dynamics and visualisation, where advanced computation across large data sets was essential. Ironically, while open source was driving innovation in these areas, in others spreadsheets were still being used to import and analyse data.

Commercial solutions like Matlab, Spotfire and Autonomy were also deployed by early adopters to mine swathes of clinical, market and legal data to track which new drugs were most likely to make it into pharmacies, and to analyse the potential market for these drugs. But companies groaned under the weight of management contracts and update cycles, often finding they had sunk millions into technologies that couldn’t evolve. This contributed to a slowdown in the adoption of data technologies within the sector.

Cost, risk and complexity

Stagnation is driven by multiple factors. Bespoke, complex data projects have shown notoriously high rates of overrun, overspend and failure: companies that were burned once became increasingly wary.

With proprietary solutions, the cost and complexity of procurement (particularly the time it takes from initial discussion to getting copies of software) have been prohibitive, extending deployment time and delaying the overall project. In all cases, the cost of managing large implementations, upgrade cycles and Service Level Agreements can be eye-watering.

Added to all this, demand for most data technologies has typically been driven by one specific project; companies are reluctant to make heavy investments in an IT solution that may only address one problem.

To overcome these barriers, there has been a clear need for data platforms that combine scale and power with flexibility, allowing companies to reuse technology investments across multiple projects and to evolve them over time. Nowhere have these technologies evolved faster than in the world of open source. As a result, freely available open source technologies are starting to pique the interest of big pharma.

The value of open source search

To use genomic data as an example, researchers have become increasingly interested in the best methods to aggregate across millions of MHRA/FDA adverse event reports. Likewise, when thinking about drug discovery or matching genetic types to particular drug therapies, the priority for researchers has been to traverse millions of research papers to identify links. These questions cannot be answered with traditional tooling, whether open source or commercial. How, then, have some of the biggest global healthcare and pharmaceutical companies started to use open source to solve these problems?

Merck & Co. has pioneered a method of analysing genetic data at scale, to better understand the genetic impact on drug efficacy, and to hasten the availability of new treatments. Very few drugs make it to market, and so Merck’s focus is to mine petabytes of genetic data more effectively and at speed, in order to increase the probability of success.

Merck uses data analysts in the early portion of drug discovery. Working alongside chemists, they analyse genetic evidence to monitor the efficacy of drugs, and how they interact with subjects’ genetic differences. They ensure the safety and efficacy of drugs before they become available for human consumption.

As genome sequencing costs have fallen dramatically, researchers have been awash with genetic data for novel research. The existing tools and the methods of analysis failed to scale – in terms of data size and harmonisation. They also required tedious, manual input and significant expert integration.

Now, using an open-source search platform, this complex data has been better harmonised: Merck has developed a universal coordinate system for genetic variants, providing a unified backbone to help scientists uncover new insights on human genetics across a broad spectrum of diseases and to aid in the discovery of new therapies.
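To make the idea of a shared coordinate system concrete, here is a minimal Python sketch. It is an illustration only and is not Merck’s actual scheme: the key format (chromosome, position, reference allele, alternate allele), the `VariantKey` type and the `normalise_variant` helper are all assumptions made for this example, and the position used is arbitrary. A production system would also need to agree on a reference genome build and handle insertions and deletions, which this sketch ignores.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    """Hypothetical canonical identifier for a genetic variant."""
    chrom: str  # chromosome without any "chr" prefix, e.g. "7" or "X"
    pos: int    # 1-based position on an assumed shared reference genome
    ref: str    # reference allele, upper case
    alt: str    # alternate allele, upper case

    def __str__(self) -> str:
        return f"{self.chrom}:{self.pos}:{self.ref}:{self.alt}"


def normalise_variant(chrom: str, pos, ref: str, alt: str) -> VariantKey:
    """Map differently formatted source records onto one shared key."""
    c = chrom.strip()
    if c.lower().startswith("chr"):
        c = c[3:]          # "chr7" and "7" should mean the same thing
    c = c.upper()
    if c == "M":
        c = "MT"           # treat "M" and "MT" as the same mitochondrial label
    return VariantKey(c, int(pos), ref.strip().upper(), alt.strip().upper())


# Two sources describing the same (arbitrary, illustrative) variant differently
a = normalise_variant("chr7", "117559590", "a", "g")
b = normalise_variant("7", 117559590, "A", "G")
```

Once both records reduce to the same key, data sets from different studies can be joined on it and indexed under one identifier, which is the practical meaning of a unified backbone.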

Semantic search

Search and analytics have also come to the fore in finding new uses for existing drugs. As research budgets are squeezed and health organisations push for greater economies, the repurposing of existing drugs makes increasing sense to pharma companies grappling with the immense time and cost of developing new drugs, and getting them to market.

More and more of the fundamental scientific content, critical to the innovation process, is locked up inside electronic documents. How do you scan millions of publications, patents, reports and any other document type to get at the information you need most? How do you query unstructured information in an expansive and inclusive way? The answer lies in named entity recognition (NER) engines. One example being used in pharma at the moment is SciBite’s TERMite engine along with its semantic search platform, Docstore, built using the open source Elasticsearch platform.

NER engines recognise concepts within texts, such as drug names or diseases. The power and flexibility of the underlying search technology mean that drugs with multiple names, and multiple synonyms for the same entity, are handled upfront, at the indexing stage.

An example would be finding the term ‘GSK’ within some text. Does this refer to the company (GlaxoSmithKline) or the protein (Glycogen Synthase Kinase)? A good NER engine contains the domain expertise to semantically enrich the text, to identify context and add detail relevant to the life sciences space.
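The following toy sketch in Python shows the two ideas in miniature: mapping synonyms to one canonical entity, and using surrounding words to decide which ‘GSK’ is meant. It is not how TERMite or any commercial engine works; the dictionaries, cue words and the `annotate` function are assumptions invented for illustration, and a real engine would draw on curated ontologies rather than a handful of keywords.

```python
import re

# Hypothetical synonym dictionary: surface form (lower case) -> canonical entity
SYNONYMS = {
    "paracetamol": ("DRUG", "paracetamol"),
    "acetaminophen": ("DRUG", "paracetamol"),
}

# Hypothetical context cues for the two senses of the ambiguous term "GSK"
GSK_SENSES = {
    ("COMPANY", "GlaxoSmithKline"): {"company", "shares", "announced", "acquired", "plc"},
    ("PROTEIN", "Glycogen Synthase Kinase"): {"kinase", "phosphorylation", "inhibitor", "protein", "pathway"},
}


def tokenise(text):
    """Split text into simple alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def annotate(text):
    """Return (type, canonical name) pairs for the entities found in text."""
    tokens = tokenise(text)
    lowered = {t.lower() for t in tokens}
    entities = []
    for tok in tokens:
        key = tok.lower()
        if key in SYNONYMS:
            # Different names for the same drug collapse to one entity
            entities.append(SYNONYMS[key])
        elif tok == "GSK":
            # Score each sense by how many of its cue words appear in the text;
            # on a tie, the first listed sense wins
            scores = {sense: len(cues & lowered) for sense, cues in GSK_SENSES.items()}
            best = max(scores, key=scores.get)
            entities.append(best if scores[best] > 0 else ("AMBIGUOUS", "GSK"))
    return entities


business = annotate("GSK announced results for its acetaminophen study")
biology = annotate("A GSK inhibitor reduced phosphorylation")
```

In this sketch the first sentence resolves to the company and the second to the protein, while ‘acetaminophen’ is recorded under the same canonical entity as ‘paracetamol’.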

This enriched data is then indexed into the underlying search engine. Once there, an open source search platform can deliver fast results on sophisticated new semantic search queries across vast corpora of biomedical literature.
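As a rough picture of what indexing enriched data means, the Python sketch below keeps an in-memory inverted index from canonical entities to the documents that mention them. It is a deliberately simplified stand-in for a real search engine, not a description of how Elasticsearch or Docstore is implemented; the `EntityIndex` class, the entity labels and the sample documents are all hypothetical.

```python
from collections import defaultdict


class EntityIndex:
    """Toy in-memory inverted index: canonical entity -> document ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._docs = {}

    def index(self, doc_id, text, entities):
        """Store a document together with the entities an NER step found in it."""
        self._docs[doc_id] = text
        for entity in entities:
            self._postings[entity].add(doc_id)

    def search(self, *entities):
        """Return sorted ids of documents that mention ALL the given entities."""
        if not entities:
            return []
        sets = [self._postings.get(e, set()) for e in entities]
        return sorted(set.intersection(*sets))


idx = EntityIndex()
idx.index("doc1", "Acetaminophen reduced fever in the trial group.",
          ["DRUG:paracetamol", "DISEASE:fever"])
idx.index("doc2", "Paracetamol overdose remains a concern.",
          ["DRUG:paracetamol"])
idx.index("doc3", "GSK announced a new research partnership.",
          ["COMPANY:GlaxoSmithKline"])

by_drug = idx.search("DRUG:paracetamol")
by_drug_and_disease = idx.search("DRUG:paracetamol", "DISEASE:fever")
```

Because the synonyms were resolved before indexing, a single query on the canonical drug entity finds both documents, whichever name the authors used.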

Bring engineers and researchers together

These examples illustrate how open source search technology is allowing large pharmaceutical companies to be more innovative in the way they manipulate data: by making large-scale data analytics faster and less complex, and by dramatically reducing cost and risk.

What’s attracting both engineers and researchers is the low investment needed at the outset, which lends itself to more experimental and creative approaches in data projects. Flexible, reusable search platforms remove the need for preconceived hypothesis testing, allowing users to explore data freely. Engineers and researchers can trial any number of new approaches across a range of data sets, keeping what’s useful and discarding the rest.

Dan Broom is VP Northern Europe at Elastic