The Value of Data Quality

It is widely agreed that data quality is critical to the business. Any attempt to understand a data set too large for a human to handle requires that the data be cleaned up. Whether the goal is to collaborate with other people or to set up a computer-based analysis, the data has to follow some rules.

Multiple recent publications rely on this idea for public domain data:

A manifesto for reproducible science (Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware and John P. A. Ioannidis, Nat. Hum. Behav. 2017) mentions that 85% of biomedical research efforts are wasted.

The FAIR Guiding Principles for scientific data management and stewardship (Wilkinson, M. D. et al., Sci. Data 3:160018, doi: 10.1038/sdata.2016.18, 2016) explains how metadata can help to create machine-understandable data.


Both articles state that good data quality allows a more stable understanding of biological processes, which is essential for finding new drugs. Reproducibility of experiments and the ability to query data by its business metadata are key to new discoveries.

Data science also needs good quality data. In a study published on its website, a data curation company states that its data scientists spend 80% of their time finding and formatting data before writing any fun and powerful data science code.

Figure 1 : Real time spent in data science

However, setting up a data quality policy is not simple:

  • Agreeing on a common vocabulary and policy: this becomes more and more difficult (or impossible) as the organisation grows. It is also hard to impose a standard on legacy data or on an external partner.
  • Setting up a data curation process: this is usually time consuming and is often not done at all. Only when the data is needed (or requested by an authority) will someone clean it up for a new use, provided they can still retrieve it. Nobody wants to read and annotate millions of files on a legacy drive.
  • Automated annotation algorithms using NLP alone are (so far) too simple to provide reliable enough data contextualisation. Moreover, many scientific files do not contain the information needed to qualify them (e.g. the project code or disease for scanner or NGS data files), so an automated process will never extract this information from the data without help.
  • Finally, the advantage of data quality is not immediately apparent, not even to the people who created the data. It may only become apparent to someone else two years later, so under time and money pressure many still hesitate to set up such a process.

To illustrate the benefit of annotation, here is a search on textual data that is not annotated but just indexed, as in a typical document management system:

Figure 2 : simple text based search

And here is the same search over annotated documents:

Figure 3 : search over a good quality dataset, including annotation

The difference is huge. In the first case we only match character strings, whereas in the second, thanks to annotation, we can be sure that the document is about the searched topic. Here we have highlighted that a specific disease was detected, that the result set contains 4 experiments, and that it can also be filtered by chemical entities. Annotation is much more specific than plain text search and makes scientific topics emerge from the data. Additionally, the search facets let you explore how the disease you are looking for relates to chemical entities in the dataset.
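
The contrast between the two searches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, their annotations and the function names are hypothetical, and a real search engine would do far more than this.

```python
from collections import Counter

# Hypothetical toy corpus: "text" is what a plain index sees,
# "annotations" is what a curation step would have added.
DOCUMENTS = [
    {"id": "doc1", "text": "Asthma cohort, salbutamol dose response.",
     "annotations": {"disease": ["asthma"], "chemical": ["salbutamol"]}},
    {"id": "doc2", "text": "Patient has no history of asthma. Aspirin tolerance study.",
     "annotations": {"disease": [], "chemical": ["aspirin"]}},
    {"id": "doc3", "text": "Bronchial hyperreactivity treated with budesonide.",
     "annotations": {"disease": ["asthma"], "chemical": ["budesonide"]}},
]

def text_search(docs, term):
    """Return ids of documents whose raw text contains the term."""
    term = term.lower()
    return [d["id"] for d in docs if term in d["text"].lower()]

def annotated_search(docs, field, value):
    """Return documents annotated with the given value in the given field."""
    return [d for d in docs if value in d["annotations"].get(field, [])]

def facet_counts(docs, field):
    """Count the annotation values of one field over a result set."""
    return Counter(v for d in docs for v in d["annotations"].get(field, []))

# String matching finds doc2 (which only mentions the word) and misses doc3.
plain_hits = text_search(DOCUMENTS, "asthma")

# Annotation-based search returns the documents that are about the disease,
# and the chemical facet summarises what else appears in that result set.
annotated_hits = annotated_search(DOCUMENTS, "disease", "asthma")
chemical_facet = facet_counts(annotated_hits, "chemical")
```

In this toy corpus the plain search returns a document that merely contains the word and misses one that is about the disease, while the annotated search and its facet behave as described above.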

So what is the solution?

As in many complex situations, the solution is hybrid. Thankfully, there is a relationship between the time you spend on data quality and the level of quality you obtain.

First, setting up an automated annotation process is fairly quick and cheap. You will of course need to choose software that is domain aware, specialized in the pharma or life science domain, and spend a bit of time tuning the detection parameters. At DEXSTR, we estimate the time needed to tune an algorithm for a particular dataset at around 2-3 days.
This level of annotation is enough to classify data quickly and to know what should be done with it, especially if the dataset is unknown today (like external or legacy data).
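
As a rough idea of what this first level can look like, here is a minimal sketch of dictionary-based annotation in Python. It is not the DEXSTR algorithm: the vocabulary, the function name and the `min_hits` threshold (standing in for a tunable detection parameter) are all assumptions made for the example.

```python
import re

# Hypothetical domain vocabulary: canonical term -> synonyms seen in files.
VOCABULARY = {
    "disease": {"asthma": ["asthma", "asthmatic"]},
    "chemical": {"salbutamol": ["salbutamol", "albuterol"]},
}

def annotate_text(text, vocabulary, min_hits=1):
    """Tag text with canonical terms whose synonyms occur at least min_hits times."""
    found = {}
    for category, terms in vocabulary.items():
        for canonical, synonyms in terms.items():
            # Whole-word, case-insensitive match on any synonym.
            pattern = r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits >= min_hits:
                found.setdefault(category, []).append(canonical)
    return found

sample = "Albuterol response in asthmatic patients; albuterol dosing."
loose = annotate_text(sample, VOCABULARY, min_hits=1)   # tags both terms
strict = annotate_text(sample, VOCABULARY, min_hits=2)  # keeps repeated terms only
```

Raising `min_hits` trades recall for precision, which is the kind of tuning the paragraph above refers to.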

On a second level, you can use the fact that teams usually work in a fairly structured way. Lab operators work on experiments, scientists work on projects or diseases, clinicians work on assays or patients, etc. This organisation can be leveraged to make manual annotation very fast. For example, you may enter a project code and a target once and get all your project data associated with this information. If you work in an NGS lab, you may enter a sample ID and a tissue once and get all your raw data annotated with this information. This requires a bit of discussion with the teams and the setup of a few rules, which usually takes a couple of weeks.
This level of data quality and contextualisation is enough for data sharing, collaboration and scientific understanding. The main high-level business metadata (again, project, disease, etc.) are well set up and data can be reused.
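
The "enter it once" idea can be sketched as a rule that attaches metadata to a folder and lets every file below it inherit that metadata. The folder layout, project code, target and sample values below are hypothetical, and the rule format is an assumption for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical rules: metadata entered once per folder by the team.
FOLDER_RULES = {
    "projects/PRJ-001": {"project": "PRJ-001", "target": "EGFR"},
    "ngs/run_42": {"sample_id": "S-42", "tissue": "liver"},
}

def annotate(path, rules):
    """Return the metadata inherited from every rule folder that contains path."""
    parents = PurePosixPath(path).parents
    metadata = {}
    for folder, values in rules.items():
        if PurePosixPath(folder) in parents:
            metadata.update(values)
    return metadata

files = [
    "projects/PRJ-001/assay/plate1.csv",
    "ngs/run_42/lane1.fastq",
    "misc/readme.txt",
]
annotated = {f: annotate(f, FOLDER_RULES) for f in files}
```

Files outside any rule folder simply get no metadata, which makes the remaining unclassified data easy to spot.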

Finally, creating a connector and extractor for structured databases, such as a LIMS or ELN, allows even faster and more reliable data annotation. Here, detailed business metadata can be captured, like tissue, sample used, provider, bioassay, method, instruments, etc.
This level of detail is finally enough to create a dataset ready for digital discovery!
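
A connector of this kind essentially looks up each file's key (for example a sample ID) in the structured system and copies the detailed metadata back onto the file. The sketch below uses an in-memory SQLite database as a stand-in for a LIMS; the table, columns and values are hypothetical, and a real LIMS or ELN would be reached through its own API or database schema.

```python
import sqlite3

# In-memory database standing in for a LIMS (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "sample_id TEXT PRIMARY KEY, tissue TEXT, provider TEXT, method TEXT)"
)
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        ("S-42", "liver", "Provider A", "RNA-seq"),
        ("S-43", "kidney", "Provider B", "WGS"),
    ],
)

def enrich(file_record, conn):
    """Add detailed metadata from the samples table to a file record."""
    row = conn.execute(
        "SELECT tissue, provider, method FROM samples WHERE sample_id = ?",
        (file_record["sample_id"],),
    ).fetchone()
    if row is None:
        # Unknown sample: return the record unchanged.
        return dict(file_record)
    tissue, provider, method = row
    return {**file_record, "tissue": tissue, "provider": provider, "method": method}

enriched = enrich({"path": "ngs/run_42/lane1.fastq", "sample_id": "S-42"}, conn)
unknown = enrich({"path": "ngs/run_99/lane1.fastq", "sample_id": "S-99"}, conn)
```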


Figure 4 : Get the right outcome for the right effort


Depending on your target (quickly assessing a dataset, setting up a collaborative platform or getting ready for a fully digital research process), you need to choose the right amount of effort.

If you’d like to talk to us about how we can help you make your scientific data more valuable, please follow the contact link on the top bar to arrange a call with one of our data management experts.

Erwan David, Chief Technical Officer and Co-Founder
