Finding, analysing and interpreting relevant datasets is a fundamental and time-consuming task for every discovery scientist. Researchers may want to validate a target, source datasets for an analytical report, or perhaps create a cohort of multi-omics datasets for mining. Whatever the research question, these routine tasks are beset by obstacles. For large pharma companies, this activity therefore presents a significant drain on resources and productivity.

Challenges in the workflow

To unpack some of these challenges, let’s look at the most common steps in the workflow of a discovery scientist.

First, you may want to design an experiment to test a hypothesis. Although designing an experiment in advance would improve productivity, this requires collaboration between biologists, R&D managers, lab technicians and bioinformaticians, and few tools exist to support it. Instead, study, sample and analysis requirements are often duplicated across multiple documents circulated by email; things become lost, outdated and inconsistent, and there is no ‘single point of truth’.

The next step is to source data to test the hypothesis. Access to studies with more samples, combined with healthy control datasets, helps to create a more comprehensive picture and stops the cherry-picking of data to support an opinion. Although many pharmaceutical organisations have their own private repositories, these are limited, and searching for additional omics data is difficult. Experimental data may be siloed in private repositories (often someone’s laptop) or hidden in plain sight in public repositories with poor data discoverability.

Finally, after sourcing data, you will want to analyse and interpret the results. Bioinformaticians are a scarce resource, so empowering biologists to interrogate and visualise the data themselves would accelerate and automate one of the key steps in producing a comprehensive target report.

These are just a few of the obstacles that are creating an expensive bottleneck for drug discovery.

Consistent metadata

So my first tip for doubling the productivity of your discovery scientists is to improve both metadata and data management and, most importantly, to make it a priority.

Implementing a rich metadata system with consistent validation rules and ontologies enables both proprietary and public data to be captured, described consistently and stored securely in a centralised location, ready for use in studies. Results then become easy to find and reuse in meta-analyses.
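As a minimal sketch of what such validation rules can look like in practice: each metadata field is checked against a controlled vocabulary before a sample record is accepted. The field names and terms below are illustrative assumptions, not drawn from any real ontology.

```python
# Illustrative controlled vocabularies; a production system would load
# these from curated ontologies rather than hard-code them.
ALLOWED_TERMS = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "tissue": {"liver", "blood", "brain"},
    "assay": {"RNA-seq", "ChIP-seq", "proteomics"},
}
REQUIRED_FIELDS = set(ALLOWED_TERMS)

def validate_sample(metadata: dict) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    # Every required field must be present.
    for field in sorted(REQUIRED_FIELDS - metadata.keys()):
        errors.append(f"missing required field: {field}")
    # Every supplied value must come from its controlled vocabulary.
    for field, value in metadata.items():
        allowed = ALLOWED_TERMS.get(field)
        if allowed is not None and value not in allowed:
            errors.append(f"{field}: '{value}' is not a controlled term")
    return errors

sample = {"organism": "Homo sapiens", "tissue": "liver", "assay": "RNAseq"}
print(validate_sample(sample))  # flags the free-text 'RNAseq' variant
```

Catching the free-text spelling ‘RNAseq’ at capture time, rather than during a later meta-analysis, is precisely what makes the data findable and comparable downstream.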

Migrating to a collaborative system for capturing and organising study and sample-level metadata (along with required sequencing procedures, produced raw data and secondary analyses) would offer a single point of truth and a shared understanding across the organisation of what data and analyses are required. It would also become possible to assess the quality of data before using it in an analysis.

With improved and standardised metadata, big pharma could collaborate and source data across departments and geographic locations. This would prevent resources being wasted on duplicating data, and fast-track analysis by removing the need to analyse the data itself just to establish whether it is of value.

Furthermore, the wider acceptance of a standard system would make it easier to put proprietary data into context with public data, making it possible for scientists to quickly gauge how much data is available through different sources. Greater consistency would allow proprietary and public data to be searched together. This would enable discovery scientists to mine datasets, scaling to hundreds, thousands or millions of samples on the fly.

Just as today everyone can use Google search without understanding programming languages, I see the potential to offer discovery scientists a search and analysis system for omics data that can be used without the need for sophisticated bioinformatics skills. This would give them immediate access to the data they need.

Create a modular data architecture

This leads me to my second tip on how to improve the productivity of your discovery scientists.

Many top pharma organisations have already invested heavily in their data architecture, building solutions to their omics data workflow challenges. Although every step can benefit from good data and metadata management, each has its own unique mix of challenges. This means there is no single off-the-shelf product that can solve all of these problems for every organisation. Migrating data from an existing architecture to a pre-developed ecosystem simply means exchanging one set of problems for another or, worse, ending up with more problems than you started with.

So instead of seeking a single solution, consider breaking the architecture down into functional, fully integrated modules or layers. These should be designed to be independent, allowing improvement at each key step of the workflow while keeping what already works well.

This approach would create a modular data architecture supported by a backbone of good data and metadata management.
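The modular approach above can be sketched as each workflow step sitting behind its own interface, so that any one layer can be replaced without touching the others. The class and method names here are illustrative assumptions, not a description of any particular product.

```python
from abc import ABC, abstractmethod

class MetadataStore(ABC):
    """Interface for the metadata layer; any backend can implement it."""
    @abstractmethod
    def find_studies(self, **filters) -> list[dict]: ...

class InMemoryStore(MetadataStore):
    """Toy implementation; a real one might wrap a database or an API."""
    def __init__(self, studies: list[dict]):
        self.studies = studies

    def find_studies(self, **filters) -> list[dict]:
        # Return studies whose metadata matches every requested filter.
        return [s for s in self.studies
                if all(s.get(k) == v for k, v in filters.items())]

# Downstream code depends only on the MetadataStore interface, so the
# storage backend can evolve independently of search and analysis.
store: MetadataStore = InMemoryStore([
    {"id": "S1", "organism": "Homo sapiens"},
    {"id": "S2", "organism": "Mus musculus"},
])
print(store.find_studies(organism="Homo sapiens"))
```

Because the search and analysis layers see only the interface, swapping the backing repository (private, public or federated) leaves the rest of the workflow untouched, which is the point of keeping the modules independent.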

Allow it to evolve

It is vital to create a system that is able to evolve. Genomics is a fast-evolving field; data, tools and technologies used today may become obsolete tomorrow. By resisting pressure to adopt a single approach or proprietary technology, organisations are able to future-proof their systems and take the development and evolution of their omics architecture into their own hands.

Dr Misha Kapushesky is the founder and CEO of Genestack.