What is OCR?
Optical character recognition (OCR) is the process of converting an image of text into a machine-readable text format. For example, if you scan a form or receipt, the computer saves the scan as an image file. You cannot edit, search, or count the words in an image file using a text editor. However, you can use OCR to convert the image into a text document and store its contents as text data.

Why is OCR so important?
Most business workflows involve receiving information from print media. Paper forms, invoices, scanned legal documents, and printed contracts are all part of business processes. These large volumes of paperwork take a lot of time and space to store and manage. Although the trend is toward paperless document management, scanning documents into images creates challenges of its own. The process requires manual intervention and can be cumbersome and slow.
In addition, digitizing documents this way produces image files in which the text is hidden. Word processors cannot process text in images the same way they process text documents. OCR solves this problem by converting text images into text data that other business software can analyze. You can then use the data to run analyses, streamline operations, automate processes, and increase productivity.

How does OCR work?
Image acquisition
A scanner reads documents and converts them into binary data. The OCR software then analyzes the scanned image, classifying the light areas as background and the dark areas as text.
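The light/dark classification described above can be sketched as a simple global threshold. This is only an illustration: the function name binarize, the 0-255 grayscale input, and the cutoff of 128 are assumptions made for this example, not details of any particular OCR product.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 for text or 0 for background.

    The threshold of 128 is an assumed midpoint chosen for illustration.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


# A tiny made-up scan: low values are dark ink, high values are light paper.
scan = [
    [250, 30, 240, 245],
    [248, 25, 35, 250],
    [251, 20, 243, 249],
]

binary = binarize(scan)
for row in binary:
    # Show text pixels as '#' and background pixels as '.'
    print("".join("#" if pixel else "." for pixel in row))
```

A fixed cutoff like this tends to struggle with unevenly lit scans, which is one reason the cleaning step that follows matters.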
Preprocessing
The OCR software first cleans the image and removes errors to prepare it for reading. Some of the cleaning techniques it uses are:
Deskewing, or slightly tilting the scanned document, to fix alignment issues introduced during scanning.
Despeckling, or removing spots from the digital image and smoothing the edges of text images.
Cleaning up boxes and lines in the image.
Script recognition for multilingual OCR technology.
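As an illustration of one cleaning technique from the list above, the sketch below removes isolated specks from a binarized image. The function name despeckle and the rule used (clear any dark pixel with no dark pixel among its eight neighbors) are assumptions chosen for simplicity; real OCR software is likely to use more robust filters.

```python
def despeckle(binary):
    """Clear dark pixels (1) that have no dark pixel among their 8 neighbors."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]  # work on a copy of the image
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            dark_neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        dark_neighbors += binary[ny][nx]
            if dark_neighbors == 0:
                cleaned[y][x] = 0  # an isolated speck, treat it as noise
    return cleaned


# A vertical stroke plus one stray speck in the top-right corner.
noisy = [
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]

for row in despeckle(noisy):
    print("".join("#" if pixel else "." for pixel in row))
```

The stroke survives because each of its pixels touches another dark pixel, while the lone speck is cleared.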
Text recognition
OCR software uses two main types of algorithms for text recognition: pattern matching and feature extraction.
Pattern matching
Pattern matching isolates the image of a single character, called a glyph, and compares it with similar stored glyphs. Pattern matching works only if the stored glyph has a font and size similar to those of the input glyph. This method works well for scanned images of documents typed in a known font.
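A minimal sketch of this idea follows, assuming glyphs have already been isolated and scaled to the same 3x5 grid as the stored templates. The TEMPLATES table, the pixel-agreement score, and the function names are hypothetical, invented for this example.

```python
# Hypothetical stored glyphs: '#' is ink, '.' is background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Count the pixels on which two equally sized bitmaps agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(glyph):
    """Return the character whose stored template agrees with the glyph most."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# A scanned "T" with one noisy pixel in the fourth row.
scanned = ["###", ".#.", ".#.", "##.", ".#."]
print(recognize(scanned))  # prints T
```

Because the comparison is pixel by pixel, a glyph in a different size or font would not line up with the templates, which is the limitation described above.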
Feature extraction
Feature extraction breaks down, or decomposes, glyphs into features such as lines, closed loops, line direction, and line intersections. It then uses these features to find the best or closest match among the stored glyphs.
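The sketch below illustrates the approach with two deliberately simple features: the number of closed loops in a glyph and the number of rows that are entirely ink. These features, the REFERENCE glyphs, and the squared-distance comparison are assumptions made for this example; production systems use far richer feature sets.

```python
def count_closed_loops(glyph):
    """Count background regions fully enclosed by ink, using flood fill."""
    height, width = len(glyph) + 2, len(glyph[0]) + 2
    # Pad with a background border so all outside background is one region.
    ink = [[False] * width for _ in range(height)]
    for y, row in enumerate(glyph):
        for x, ch in enumerate(row):
            ink[y + 1][x + 1] = ch == "#"
    seen = [[False] * width for _ in range(height)]

    def fill(start_y, start_x):
        # Mark every background pixel 4-connected to the start pixel.
        stack = [(start_y, start_x)]
        seen[start_y][start_x] = True
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and not ink[ny][nx] and not seen[ny][nx]:
                    seen[ny][nx] = True
                    stack.append((ny, nx))

    fill(0, 0)  # mark the outside background first
    loops = 0
    for y in range(height):
        for x in range(width):
            if not ink[y][x] and not seen[y][x]:
                loops += 1  # background not reachable from outside
                fill(y, x)
    return loops


def extract_features(glyph):
    """Return (closed loops, number of rows that are entirely ink)."""
    full_rows = sum(1 for row in glyph if set(row) == {"#"})
    return (count_closed_loops(glyph), full_rows)


# Hypothetical reference glyphs; only their features are stored.
REFERENCE = {
    "O": ["###", "#.#", "#.#", "#.#", "###"],
    "8": ["###", "#.#", "###", "#.#", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
}
STORED_FEATURES = {ch: extract_features(g) for ch, g in REFERENCE.items()}


def classify(glyph):
    """Pick the character whose stored features are closest to the glyph's."""
    features = extract_features(glyph)
    return min(
        STORED_FEATURES,
        key=lambda ch: sum((a - b) ** 2 for a, b in zip(features, STORED_FEATURES[ch])),
    )


# A larger "O" than the stored one: pixel-by-pixel comparison would not line up.
big_o = ["####", "#..#", "#..#", "#..#", "#..#", "####"]
print(extract_features(big_o), classify(big_o))
```

Here the larger glyph still matches the stored "O" because the chosen features do not depend on glyph size.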
Post-processing
After analysis, the system converts the extracted text data into a computerized file. Some OCR systems can create annotated PDF files that contain both the before and after versions of the scanned document.


