Recognition – A new approach to automated data capture

Data Mining | Published November 16, 2017

Data is rapidly becoming a key resource that helps many organizations find unexplored areas of business as well as operational inefficiencies. The challenge, however, is that this data is largely trapped within unstructured content (90% of enterprise content is unstructured), whether it be a scanned insurance form, a received email attachment, an internal CAD diagram, or any number of other ‘dark & dirty’ unstructured formats.

To take advantage of the opportunities data presents, organizations are trying to leverage traditional data capture methods. There is a significant problem with this approach, though. While these methods succeed in capturing data, they aren’t capable of understanding what to do with the valuable insights within that data.

This is why leading organizations are instead turning to Recognition—a thorough, robust solution that elevates content in a digital age.

But Wait . . . What Exactly is Data Capture?

Data capture is a process many organizations implement to automatically identify and classify information, and then make that information available within a particular system. At its core, data capture takes your organization’s content, in any format, and converts it into something that a computer can systemize.

Of course, capture is a process that’s easier said than done. Why? Capturing data in ‘any format’ is a wonderful concept, but the concept is often a far cry from what actually happens in practice. The reality is that it doesn’t matter how fast you capture different types of content if you can’t extract data, append metadata, or integrate with other content management systems. Captured data without searchability, or “findability”, severely limits your organization’s ability to know what data you have (and where to find it).

Dealing with paper-based and digitally born content

Any form of unstructured or semi-structured data is either paper-based or digitally born, and the capture method used depends on which of the two it is. Each method, however, is missing a critical component that organizations need in order to apply data capture technology successfully. You can see evidence of this fundamental gap when we examine current data capture methods.

Paper-based content

Prior to the rise of the computer, every organization captured its data on paper. As you can imagine, the storage space and maintenance effort required made managing this type of data time-consuming and cumbersome. Even today, there is an unprecedented volume of this type of information flowing into organizations.

Capture methods for paper-based content

There are several ways to capture your organization’s paper-based content. It is important to keep in mind that each comes with specific impacts on how data is captured or extracted. The methods include:

Scan: A scanner will capture an image of the content and convert it into a digital file. The benefit of scanning paper-based content is that it becomes possible to standardize an organization’s content into one format (typically PDF).

This technology is rapidly becoming obsolete as modern organizations shift to accommodate the rise in digitally born content. In addition, a scanner’s primary function is to handle archival processes. It was never designed to recognize the content’s data elements.

Optical Character Recognition (OCR): OCR is software that can recognize text characters in both printed and electronic content. The OCR process examines each individual character within a document, analyzes those characters, and then translates them into character codes. These codes make it possible to search for, identify, and even build rules to tag and extract specific content.

Organizations can only utilize OCR once the content is scanned. In most cases, they do not take further steps to utilize the data within the content. They simply hold onto it.
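To make the idea of building rules to tag and extract specific content more concrete, here is a minimal sketch in Python. It assumes the OCR step has already produced plain text; the field names, the policy-number layout, and the sample text are all hypothetical and only serve as an illustration, not as a description of any particular OCR product.

```python
import re

# Hypothetical extraction rules: field name -> pattern with one capture group.
# Real rules would be written against your organization's own forms.
RULES = {
    "policy_number": re.compile(r"Policy\s*(?:No\.?|Number)[:\s]+([A-Z]{2}-\d{6})"),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
}


def extract_fields(ocr_text):
    """Apply each rule to the OCR output and keep the first match per field."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(ocr_text)
        if match:
            fields[name] = match.group(1)
    return fields


# Stand-in for text that an OCR engine might return for a scanned form.
sample = "Claim form\nPolicy No: AB-123456\nDate received 2017-11-16"
fields = extract_fields(sample)
print(fields)
```

The point of the sketch is that searchable text is only the starting point: the extracted fields still have to be sent somewhere useful, which is the step many organizations skip.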

Enterprise Content Management (ECM): ECM provides organizations with the methods and workflows they need to properly manage captured data. Once content is scanned and converted into a particular format, many organizations try to use ECM.

An ECM system will capture the content, but because it doesn’t know about (or understand) the data, it doesn’t know what to do with the content. Even when OCR is applied, there is nothing in place to direct the content into the right system.

Digitally born content

This type of data typically describes content that is generated using computer technology. What this means is that if your organization creates documents and spreadsheets using Microsoft Office, or communicates via email, you have and use digitally born content.

At first glance, this type of data might seem easy to manage. However, it is critical for organizations to continue to look to the future. According to research published by Harvey Spencer Associates, Inc., a research and analyst firm based in the U.S., organizations across the country will see a significant increase in demand for email capture, well into 2020. Most forms of paper-based capture, including fax, scanner and MFP, will either see a steady decline or sustain a very low level of demand.

Capture methods for digitally born content

The reality is, email in this case represents all forms of electronic content. Yet, despite this anticipated growth, current capture methods for digitally born content leave organizations storing their business-critical information in multiple repositories, with little to no visibility into where or how to find a particular piece of information. Some specific methods for capturing this type of content include:

Connectors to ECM: Connecting digitally born content to an ECM system like OpenText is one way to give organizations better access to digitally born data. Using data capture methods like OCR can then help to group and sort the data.

This capture method only pertains to a limited number of file formats, and doesn’t approach the solution from an end-to-end perspective. As a result, working with OCR alone can also be a problem.

Email Clients: Many organizations use different software tools to generate email marketing content for their clients. With some software, it is possible to capture data associated with the clients emailed, campaign details, and more.

This method of data capture is problematic because email is designed for collaboration. So, it is important to remember that on their own, these software tools do not integrate with other content management systems, ignore attachment formats, and do not perform any data extraction.

Digital Mailroom: Digital mailrooms automatically process and distribute information found within incoming mail. It is possible to scan or capture data using OCR technologies.

This content capture method often results in redundant systems where organizations have to print and re-scan or re-submit their digital assets. It doesn’t provide a complete or streamlined process.

Recognition – The future of content capture

The overarching challenge is that organizations today use paper-based and/or digitally born content capture methods to capture their content, and leave it at that. The content is successfully captured, but the organization doesn’t do anything with it.

The missing element in the way organizations currently address modern data capture is a method that is more universal in nature, one that addresses multiple input systems and multiple content formats: a solution that is truly digitally born. It is a solution that, we believe, fills a critical gap in the capture industry. Some call it Capture 2.0, or Ingestion vs. Digestion.

We like to call it Recognition.

Recognition is a way in which you can elevate your content beyond data capture methods, and requires a 4-step process:

  • Standardize: Regardless of original content format, create “one source of truth”, or one (standardized) format.
  • Analyze: Pull out data elements based on your organization’s requirements.
  • Categorize: Organize the content into groups or “buckets” based on a similarity index generated during the analysis step. Once the content is grouped and validated, metadata (data elements) is appended to officially classify the document.
  • Optimize: Add enhancements and prepare the content for various downstream processes.
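
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the actual Recognition implementation: the `Document` class, the keyword profiles, the routing table, and the threshold are all hypothetical, and the "similarity index" is stood in for by a simple share-of-keywords score.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str                      # e.g. "scan" or "email" (illustrative)
    raw: str                         # content as it arrived
    text: str = ""                   # standardized form
    elements: dict = field(default_factory=dict)
    category: str = "unclassified"
    metadata: dict = field(default_factory=dict)


# Hypothetical keyword profiles and downstream targets.
PROFILES = {
    "invoice": {"invoice", "amount", "due", "total"},
    "claim": {"claim", "policy", "insured", "incident"},
}
ROUTES = {"invoice": "accounts-payable", "claim": "claims-desk"}


def standardize(doc):
    """Step 1: bring every input into one format (here: lowercase text)."""
    doc.text = " ".join(doc.raw.split()).lower()
    return doc


def analyze(doc):
    """Step 2: pull out the data elements the organization cares about."""
    total = re.search(r"total[:\s]+\$?(\d+(?:\.\d{2})?)", doc.text)
    if total:
        doc.elements["total"] = total.group(1)
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", doc.text)
    if date:
        doc.elements["date"] = date.group(1)
    return doc


def categorize(doc, threshold=0.5):
    """Step 3: bucket by a similarity score, then append metadata."""
    tokens = set(re.findall(r"[a-z]+", doc.text))
    # Stand-in similarity: share of a profile's keywords found in the text.
    scores = {name: len(tokens & kw) / len(kw) for name, kw in PROFILES.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        doc.category = best
    doc.metadata = {"category": doc.category, **doc.elements}
    return doc


def optimize(doc):
    """Step 4: prepare the content for a downstream process."""
    doc.metadata["route_to"] = ROUTES.get(doc.category, "manual-review")
    return doc


def recognize(doc):
    for step in (standardize, analyze, categorize, optimize):
        doc = step(doc)
    return doc


result = recognize(Document("scan", "INVOICE  #42\nAmount due:  2017-12-01\nTotal: $199.00"))
print(result.metadata)
```

Content that matches no profile falls through to a manual-review route rather than being dropped, which mirrors the validation mentioned in the Categorize step.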

Instead of capturing data and dumping it somewhere, this 4-step process offers a modern capture method that utilizes your organization’s content, and its data, in an integrated way. It is a more robust method that establishes a fluid approach to content by recognizing key data and understanding what to do with it.

Recognition is how your organization will manage the incoming demand for digitally born content in a way that connects and integrates with existing frameworks and workflows. It will help you to finally see the positive impacts your content, and its data, can have on your business.