Traditional metrics for accuracy
When evaluating data automation systems, particularly those with optical character recognition (OCR) capabilities, accuracy rates are a key concern. But for biologics manufacturers, measuring data automation accuracy is neither as simple – nor as critical – as it may seem. Usually measured as the percentage of characters or words correctly detected by an automated system, accuracy is an important metric for certain OCR applications, such as scanning printed text documents. In these cases, OCR can often achieve close to 100% accuracy, although ink bleed-through or page skew can degrade results. But for handwritten documents – particularly those written in cursive – character recognition becomes much more complicated, and accuracy metrics become harder to define, as they become intrinsically tied to a document's legibility. If a human cannot decipher a particular character or word, then software generally cannot be expected to either.
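As a rough illustration of the traditional metric, the sketch below computes character accuracy as one minus the edit distance divided by the length of the reference text. This is one common convention, not necessarily the definition any particular vendor uses, and the function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum single-character edits between two strings."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Share of reference characters recovered, clamped to the range 0..1."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this definition, reading the hypothetical value "Batch 1024" as "Batch 1O24" scores 90%, even though the one wrong character changes the batch number entirely. That gap between a high score and a wrong value is part of why the metric says little about how usable the data is.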
Limitations of traditional accuracy metrics in biopharma
For biopharma documents, character and word accuracy generally aren’t the most useful metrics. These documents are usually not composed of paragraphs of text but instead contain structured and unstructured – and often handwritten – data that manufacturers need to search and analyze. This means effective data automation requires something different from basic character accuracy – it requires an understanding of the data. For example, it may not matter whether a system can accurately read the letters in an illegible signature, but instead whether it can accurately match that signature to others from the same person. In addition, biopharma records may make use of industry- or company-specific codewords or acronyms that would not be found in a standard dictionary, which limits the relevance of traditional accuracy metrics.
High-quality, useful data is the ultimate goal of data automation for biopharma, and extraction accuracy is just one piece of that puzzle. An effective solution delivers data that is not only accurate, but also clean, comprehensive, timely, and source-trackable – and extracted in a way that preserves its context and structure. This allows the data to be confidently used and reused to answer different ad-hoc questions – all without having to go back to the original paper records.
The critical role of data
As we’ve seen, the concept of accuracy for biopharma data automation is much more complex than in the traditional OCR sense. It is helpful to think of it as encompassing the following:
- An accurate extraction of what was actually recorded on paper. This may include words that are misspelled or fields that are missing. It may use context to understand whether something is the number zero or the letter O. It also reflects a document's legibility and thus can help companies improve legibility – a key part of the FDA’s Good Documentation Practices.
- An accurate interpretation of the intended meaning of the data within each form field. A simple example of this is understanding that “05/06/24” is a date – and knowing whether it means May 6 or June 5. In certain cases the system might even be able to surpass what is legible to humans by understanding the scope of possible values for a certain piece of data. For example, a field might contain the hard-to-read initials of a lab technician. But a system that knows the initials of all ten possible lab technicians can confidently home in on the correct one.
- An accurate representation of the structure of a document and how the fields relate to each other. This means the ability to identify fields without templates or tagging, and to automatically recognize different versions of the same form. It means interpreting field names and values within tables – as well as parsing structures with embedded tables, irregular tables, fields with multiple data types, and tables without lines between the cells. It also includes handling crossed-out and corrected entries in a GMP-compliant manner. This requires a high level of sophistication that is critical to making the extracted data useful for later analysis.
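The interpretation step described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how any particular product works: the roster of initials, the thresholds, the convention names, and both function names are hypothetical, and string similarity from Python's standard difflib module stands in for whatever matching a real system would use.

```python
from datetime import date, datetime
from difflib import SequenceMatcher

# Hypothetical roster: initials of the ten technicians allowed to sign this field.
ROSTER = ["ALB", "CMR", "DJK", "EPS", "HTW", "JMH", "KLO", "NRV", "SGF", "TBY"]

# Hypothetical date conventions a site might follow.
DATE_FORMATS = {"month_first": "%m/%d/%y", "day_first": "%d/%m/%y"}


def resolve_initials(raw_reading, roster, min_score=0.6, min_margin=0.2):
    """Map a noisy reading to one known set of initials, or None if unclear."""
    if not roster:
        return None
    scored = sorted(
        (
            (SequenceMatcher(None, raw_reading.upper(), name.upper()).ratio(), name)
            for name in roster
        ),
        reverse=True,
    )
    best_score, best_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Accept only a strong match that clearly beats the next-best candidate;
    # anything else is left unresolved rather than guessed.
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_name
    return None


def candidate_dates(raw_value):
    """Return every calendar date the raw string could mean, keyed by convention."""
    readings = {}
    for convention, fmt in DATE_FORMATS.items():
        try:
            readings[convention] = datetime.strptime(raw_value, fmt).date()
        except ValueError:
            continue  # the string is not a valid date under this convention
    return readings
```

With this roster, a reading of "JNH" resolves to "JMH", while "XYZ" returns None and would be left for a person to review. For "05/06/24", candidate_dates returns both May 6 and June 5, so choosing between them still depends on knowing the site's convention; a value such as "25/06/24" is valid only as day-first.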
These considerations illustrate how nuanced the topic of accuracy is when looking at data automation solutions for biologics manufacturers. It can’t be captured by any one metric. Instead, it is important to understand the system’s full capabilities in extracting useful, high-quality data – and how that translates into tangible business benefits.