Introduction:

There was a time when data was hard to come by, because collecting and organizing it was so labour-intensive. Today, a plethora of data is available on almost any topic of interest one can think of. This poses another question: how can all this data be organized into easily searchable formats?

Data collection, extraction, and organization can be time-consuming and tedious tasks. Doing them manually is inefficient, slow, and costly. Additionally, transforming paper trails into digital formats is of little use if the digitized data is not editable or searchable. Hence, a solution that converts physical data to digital data, organizes it at the same time, and makes it more user-friendly is the need of the hour.

‘Data Capturing’ is a simple solution that helps in extracting data from physical documents and storing it in an electronic format. This helps companies with a paper-heavy routine to automate their administrative tasks by providing faster means of data extraction. There are various methods of data capture, and the choice among them depends on the kind of data being interpreted. These methods are:

  • Manual Keying
  • IDR (Intelligent Document Recognition)
  • Nearshore keying
  • OCR (Optical Character Recognition)
  • Barcode/QR recognition
  • ICR (Intelligent Character Recognition)

Among all these techniques, Optical Character Recognition attracts the most attention from researchers because of its applications in automating laborious tasks.

OCR (Optical Character Recognition) is a technology in which text from physical documents, posters, or images is scanned and converted to a digital format. It enables organizations to convert unstructured data into searchable and editable data. The recognition itself relies on image processing and pattern recognition; concepts from Natural Language Understanding (NLU) can then be applied to the extracted text.

Data Capture – How It Really Happens

The process has three stages. First comes the document analysis stage, in which the OCR system analyses the structure of the document. The intermediate stage is text recognition, in which the OCR system reads the text. The final stage places the recognized text back at the coordinates it was extracted from.

During this process, the system breaks pages into headers, tables, footers, and paragraphs, then breaks lines into words and words into isolated characters. It compares each character with a predefined set of images to identify it. After identification, it reverses the process to reassemble complete paragraphs. It also uses data scanning and decoding to process PDFs and images.
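The segmentation and comparison steps described above can be sketched on a toy example. This is only an illustration under simplifying assumptions: the line is already binarized (strings of '0' and '1'), characters are separated by at least one empty column, and the three reference glyphs in TEMPLATES are made up for the example. Real OCR engines use far more robust segmentation and matching.

```python
# Toy reference images (hypothetical 3x5 glyphs, '1' = ink, '0' = background).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def segment_columns(line):
    """Split a line image into character images at columns that contain no ink."""
    width = len(line[0])
    has_ink = [any(row[x] == "1" for row in line) for x in range(width)]
    chars, start = [], None
    # The trailing False acts as a sentinel that closes the last character.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            chars.append([row[start:x] for row in line])
            start = None
    return chars


def match(glyph):
    """Return the template name whose pixels differ least from the glyph."""
    def distance(template):
        if len(template) != len(glyph) or len(template[0]) != len(glyph[0]):
            return float("inf")  # only compare images of the same size
        return sum(a != b
                   for t_row, g_row in zip(template, glyph)
                   for a, b in zip(t_row, g_row))
    return min(TEMPLATES, key=lambda name: distance(TEMPLATES[name]))


def recognize(line):
    """Segment the line, classify each character, and join the results."""
    return "".join(match(glyph) for glyph in segment_columns(line))


# A 5-row line image containing three glyphs separated by empty columns.
LINE = [
    "10001110111",
    "10000100010",
    "10000100010",
    "10000100010",
    "11101110010",
]

print(recognize(LINE))  # prints LIT
```

Because the matcher picks the template with the fewest differing pixels rather than demanding an exact match, a glyph with a small amount of noise would still be assigned to its closest reference image.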

Google Cloud Vision API has ML models trained on large data sets, so it can tag and classify images into millions of predefined categories. It uses OCR to identify handwritten or printed text in documents, posters, signboards, or any other image containing text. Apart from Google Cloud Vision, key players in the OCR software industry are Tesseract OCR, ABBYY FineReader, Kofax OmniPage (previously Nuance), and KlearStack’s OCR.

All this makes manual data entry into applications look old, doesn’t it? However, OCR itself has come a long way over time. Many advancements now make OCR far more efficient. To understand these advancements, let us first look at the background of OCR.

The Beginning – Template Based OCR

Different vendors provide data in different structural formats, so extracting data correctly from all these sources with a single OCR layout is impossible. This led to the creation of the Template-based OCR technique, in which a set of rules is set up to extract a particular type of data. The extraction is performed by scanning fixed coordinates in the document for each field. The move from manual data extraction to Template-based OCR worked out well in terms of efficiency and time.
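A minimal sketch of coordinate-based extraction is shown below. It assumes the OCR step has already produced words with pixel positions; the Word class, the VENDOR_A_TEMPLATE regions, and the sample values are all hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge of the word, in pixels
    y: int  # top edge of the word, in pixels


# Hypothetical template for one vendor: field name -> (left, top, right, bottom).
VENDOR_A_TEMPLATE = {
    "invoice_number": (400, 40, 600, 70),
    "total": (400, 700, 600, 730),
}


def extract_fields(words, template):
    """Collect, for each field, the words whose position falls inside its region."""
    fields = {}
    for name, (left, top, right, bottom) in template.items():
        inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
        inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
        fields[name] = " ".join(w.text for w in inside)
    return fields


words = [
    Word("Invoice", 50, 45), Word("INV-1042", 410, 45),
    Word("Total", 50, 705), Word("1,250.00", 410, 705),
]
result = extract_fields(words, VENDOR_A_TEMPLATE)
print(result)  # {'invoice_number': 'INV-1042', 'total': '1,250.00'}

# A document from another vendor that prints the total higher up the page.
other_vendor = [Word("INV-77", 410, 45), Word("980.00", 410, 640)]
print(extract_fields(other_vendor, VENDOR_A_TEMPLATE))  # the 'total' field comes back empty
```

The second call shows the weakness of the approach: the template only knows where to look, so a field that has moved is silently missed.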

The output accuracy of this system suffered only when different vendors sent differently structured documents and a template created once was applied to all of them. The solution was to create a new template each time the structure of the document changed, which was time-consuming and costly. This made people look at Cognitive OCR as a viable option for data capture.

Moving Forward: ML-based and Rule-based OCR

Traditionally, OCR systems were handcrafted pipelines made up of several modules, including data pre-processing, character segmentation, feature extraction, and character recognition. Over time, more advanced techniques such as ML-based OCR and Rule-based OCR came into existence.

Character classification is the procedure that determines the category of a character. Both unsupervised and supervised machine learning techniques are used for it. Unsupervised algorithms, such as k-means clustering, do not need character labels for training. Instead, they use the tendency of samples of the same character to aggregate in feature space to cluster the data. Supervised algorithms, such as K-Nearest Neighbours (KNN), by contrast need character labels to train a classifier that decides the category of an input character.
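As a rough illustration of supervised character classification, the sketch below uses a k-nearest-neighbours vote over labelled examples. The two features (number of loops, width-to-height ratio) and all training values are invented for the example; a real system would use many more features and samples.

```python
import math
from collections import Counter

# Hypothetical labelled training data: (loop count, width/height ratio) -> character.
TRAINING = [
    ((0.0, 0.30), "1"), ((0.0, 0.35), "1"),
    ((1.0, 0.60), "0"), ((1.0, 0.65), "0"),
    ((2.0, 0.60), "8"), ((2.0, 0.55), "8"),
]


def knn_classify(features, training=TRAINING, k=3):
    """Label a feature vector by majority vote among its k closest training samples."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((1.0, 0.62)))  # prints 0
print(knn_classify((2.0, 0.58)))  # prints 8
print(knn_classify((0.0, 0.32)))  # prints 1
```

The labels are what make this supervised: without them, the same feature vectors could only be grouped into clusters, not named.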

Deep learning models, such as convolutional neural networks, are built from multiple layers, including convolution layers, rectified linear units (ReLU), and pooling layers. The advantage of deep learning is its generalization: once a classifier has been trained, it can be applied to multiple tasks, such as handwritten character classification and street signboard text recognition. However, it is computationally expensive and needs hardware such as GPUs to speed up the process.
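To make the three layer types concrete, here is a small pure-Python sketch of a single convolution, ReLU, and max-pooling pass over a 5x5 image. The image and the VERTICAL_EDGE kernel are chosen by hand for illustration; in a trained network the kernel values are learned, and the work is done by an optimized library rather than nested loops.

```python
def conv2d(image, kernel):
    """Valid cross-correlation (the 'convolution' used in deep learning), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(feature_map):
    """Rectified linear unit: replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# 5x5 image with a vertical stroke in the middle column.
IMAGE = [[0, 0, 1, 0, 0] for _ in range(5)]

# Hand-picked kernel that responds where ink appears to the right of background.
VERTICAL_EDGE = [[-1, 1],
                 [-1, 1]]

features = max_pool(relu(conv2d(IMAGE, VERTICAL_EDGE)))
print(features)  # [[2, 0], [2, 0]]
```

The pooled output keeps the strong response on the left side of the stroke while shrinking the 4x4 feature map to 2x2, which is the kind of compact representation later layers classify.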

In Rule-based OCR, an attribute rule is used to extract an individual value from the OCR text results. An attribute represents a piece of metadata that describes the document and may appear several times inside it. Validation rules improve the accuracy of the values extracted from OCR text by the attribute rules. The different attribute rules use:

  1. Character segmentation (the system determines the boundaries of individual characters by vertical and horizontal scanning),
  2. Loop detection (the system recognizes loops, as in 0 and 8),
  3. Opening detection (the system recognizes openings that face upward, downward, or sideways, as in 6 or 2),
  4. Character classification (the system classifies characters by comparing them with already stored images).
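Loop detection from the list above can be sketched with a flood fill: any background region that cannot be reached from outside the glyph is counted as a loop. This is one possible way to implement the idea, not necessarily how a given OCR product does it, and the 3x5 glyphs below are made up for the example.

```python
def count_loops(glyph):
    """Count enclosed background regions in a binary glyph ('1' = ink)."""
    h, w = len(glyph), len(glyph[0])
    # Pad with a one-pixel background border so the outside is a single region.
    grid = ([[0] * (w + 2)]
            + [[0] + [int(ch) for ch in row] + [0] for row in glyph]
            + [[0] * (w + 2)])
    seen = set()

    def flood(start):
        """Mark every background pixel 4-connected to start as seen."""
        stack = [start]
        seen.add(start)
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < h + 2 and 0 <= nc < w + 2
                        and grid[nr][nc] == 0 and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    stack.append((nr, nc))

    flood((0, 0))  # everything reachable from the corner is outside the glyph
    loops = 0
    for r in range(h + 2):
        for c in range(w + 2):
            if grid[r][c] == 0 and (r, c) not in seen:
                loops += 1  # unreachable background: a new enclosed loop
                flood((r, c))
    return loops


ZERO = ["111", "101", "101", "101", "111"]
EIGHT = ["111", "101", "111", "101", "111"]
ONE = ["010", "110", "010", "010", "111"]

print(count_loops(ZERO), count_loops(EIGHT), count_loops(ONE))  # prints 1 2 0
```

A loop count like this is a cheap feature: it separates 0 from 8 immediately, and combined with opening detection it can narrow down candidates before any image comparison is done.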

The Newest Curve – Cognitive Data Capture

Cognitive Data Capture is a technique that uses Artificial Intelligence, in particular deep-learning algorithms, in combination with OCR. It aims to process documents much as humans would. The process involves understanding, classifying, and extracting data by identifying patterns, finding clues, and ranking. Systems using Cognitive Data Capture are capable of ingesting large amounts of structured or unstructured data. These systems supply information that employees can use for effective decision-making. Like OCR, Cognitive Data Capture solutions are accurate and cost-efficient.

Another aspect of Cognitive Data Capture is sentiment analysis (in the absence of tonal indicators) and contextual understanding, which is what sets it apart from OCR. The system assigns labels such as positive, neutral, mixed, and negative to the sentences fed into it. This helps in better understanding customer sentiment, extracting key phrases from unstructured text, and identifying named entities.

Additionally, there is a difference in how handwritten text is analysed. Conventional OCR is largely restricted to printed text, which looks consistent because fonts are standardised. Intelligent Character Recognition (ICR) in Cognitive Data Capture, however, is able to decipher data from non-standard documents that contain handwritten text in varied formats. ICR helps in converting these varying characters into a machine-readable format.

Finally, let us explore the potential areas where OCR can enhance workflow efficiency.

  1. Invoice Management:

    OCR can help scan the figures on physical documents such as invoices and extract them into a digital format. For organizations that frequently receive paper-based invoices, this technology can prove invaluable. Manual data entry to digitize records would no longer be needed, and employees could use their time productively elsewhere.

  2. Cloud Storage Searchability:

    Most organizations and employees are in the habit of keeping their files in cloud storage, which allows easy access and reduces clutter in the workspace. These platforms use OCR to provide an option to turn physical documents, photos, and PDFs into editable documents within seconds.

  3. Document Management:

    Most organizations store useful documents as a combination of hard copies and soft copies. OCR helps in the easy retrieval of information when linked to an indexing and sorting tool. It can also be used for extracting business card information directly into a contact list.
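To illustrate how OCR output might be linked to an indexing tool for document retrieval, the sketch below builds a simple inverted index over already-recognized text. The file names and text contents are invented, and a production system would typically rely on a dedicated search engine rather than an in-memory dictionary.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every token to the set of documents whose OCR text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the documents that contain every token of the query, sorted by name."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


# Hypothetical OCR output keyed by scanned file name.
documents = {
    "scan_001.pdf": "Invoice INV-1042 from Acme Ltd, total 1,250.00",
    "scan_002.pdf": "Business card: Jane Doe, Acme Ltd",
    "scan_003.pdf": "Delivery note for order 1042",
}
index = build_index(documents)

print(search(index, "acme"))          # ['scan_001.pdf', 'scan_002.pdf']
print(search(index, "acme invoice"))  # ['scan_001.pdf']
```

Once scanned pages are searchable this way, finding a paper invoice becomes a keyword lookup instead of a trip to the filing cabinet.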

Conclusion:

With the rise of digitalization, organizations are investing heavily in productivity technologies. Efficient data extraction techniques help them get relevant data faster, which in turn helps them stay ahead of the competition. Data Capture techniques like OCR can help an organization in many ways, including effective use of physical storage space, time-efficient data extraction, and a lower risk of misplaced documents.