Automated Text Extraction from Images: Using OCR System

Table of contents

  1. Research history
  2. Introduction
  3. Classification
  4. OCR System
  5. Tesseract
  6. Applications
  7. Conclusion

The popularity of digital images is increasing rapidly. Every day, many images are generated by different groups of users such as students, engineers and doctors, and the need for and usage of images vary among them. Images can be accessed based on their primitive features or on associated text, and the text present in such images can provide meaningful information. We aim to retrieve this content and summarize the visual information of images automatically. Several algorithms are required to develop an optical character recognition (OCR) system. Tesseract, originally developed by HP Labs and now maintained by Google, is currently among the most accurate open-source OCR engines. In this paper, we extract text from images using text localization, segmentation and binarization techniques.

Text extraction can be achieved by applying text detection, which identifies the image parts containing text; text localization, which determines the exact position of the text; text segmentation, which separates the text from its background; and binarization, which converts the coloured image into a binary image with black text on a white background. Character recognition is then applied to this binary image to convert it into ASCII text. The method also considers features such as colour, shape and texture to extract the relevant document. Text extraction is used in creating e-books from scanned books, searching images within collections of visual data, and similar tasks.
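As a rough, hedged illustration of this pipeline, the sketch below assumes the opencv-python and pytesseract packages plus a local Tesseract installation; the file name is hypothetical. It binarizes an image and hands the result to Tesseract for character recognition.

```python
import cv2
import pytesseract

# Load the input image (hypothetical file name) and convert it to grayscale.
image = cv2.imread("scanned_page.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarization: Otsu thresholding yields black text on a white background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Character recognition: Tesseract converts the binary image into plain text.
print(pytesseract.image_to_string(binary))
```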

Research history

For the past few years, the need for retrieving data from images and storing it for future reference has rapidly increased. Several studies have examined approaches that could be helpful in extracting data from images. These approaches include methods for the various extraction steps such as text detection, text localization and text segmentation. Additionally, various properties of an image such as colour, intensity, component connectivity, orientation and text style are used to distinguish text regions from their backgrounds and from other regions within the image. Machine recognition of handwritten text has long been a subject of pattern recognition research.

Previously, Tesseract has been used to perform user-specific training on both isolated and free-flowing text, specifically using lower-case Roman script. After being suspended for more than 10 years, Tesseract now provides the base for major commercial engines with improved accuracy. Tesseract was developed by HP but never used commercially by HP; it was later modified, improved and maintained by Google. Although Tesseract extracts data from images with relatively high accuracy, it also has some flaws: over-segmentation of some characters, and under-segmentation or rejection of cursive word segments, are a few of them. Its unusual choice of features is probably its key strength, while its use of a polygonal approximation instead of the raw outlines is its key weakness.

Various projects have been developed using Tesseract to implement real-world scenarios related to manuscripts, data extraction and archiving from images, effective manipulation of image databases, language processing and many more.

Introduction

Due to rapid development in digital technology, we have a huge collection of information stored in the form of images, resulting in the digitization of resources in various industries. Recent studies on image processing show the importance of content retrieval from images. Extracting text from images and converting it to ASCII text can be achieved using OCR systems. OCR is very useful and popular in various applications, including digital libraries, information retrieval systems, multimedia systems and geographical information systems.

OCR systems have largely reduced the keyboard interface between human and machine and help automate office work, which saves a great deal of time and human effort. The accuracy of an OCR system often depends on its text pre-processing and segmentation algorithms. The difficulty of extracting text depends on the style, size and orientation of the text, the complexity of the image background, and so on. Various methodologies are used to perform text extraction from images, such as text detection, text localization and text segmentation.

Text detection plays a vital role in determining and highlighting the regions that contain only text, which can then be fed into the optical character recognition module. Images captured for OCR often include skew and perspective distortions, partly due to human error, which also need to be removed. Skew-corrected images are then binarized, using simple yet efficient binarization techniques, before the segmentation process.

After processing the input image, we obtain a binarized image, i.e. text written in black on a white background. On this binarized image, text localization is performed, which involves separating each character from the word by scanning pixels sequentially. Sometimes components of adjacent characters touch or overlap, which creates difficulties in the segmentation task.

This problem occurs frequently due to modification of the upper and lower zones, which makes this an important stage. Tesseract is open-source software that extracts text with comparatively high accuracy when compared with other OCR systems. Tesseract does not perform its own page layout analysis, so it assumes that its input is a binary image with optionally defined polygonal text regions. Processing then follows a step-by-step pipeline: connected component analysis that recognizes the text as black-on-white and gathers outlines into blobs, organization of blobs into text lines, breaking of text lines into words according to the kind of character spacing, and so on.
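The blob-gathering step can be approximated outside Tesseract with OpenCV's connected-component analysis. The sketch below is an illustration of the idea, not Tesseract's internal code; the input file name is assumed.

```python
import cv2

# A binarized page with black text on a white background (assumed input).
binary = cv2.imread("binarized_page.png", cv2.IMREAD_GRAYSCALE)

# Connected-component labelling expects a white foreground, so invert the image.
inverted = cv2.bitwise_not(binary)

# Each label is one blob: roughly a character or a group of touching characters.
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(inverted, connectivity=8)

for i in range(1, num_labels):  # label 0 is the background
    x, y, w, h, area = stats[i]
    print(f"blob {i}: box=({x}, {y}, {w}, {h}), area={area}")
```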

Classification

Character Recognition

Character recognition is broadly classified into two categories: on-line and off-line character recognition. On-line character recognition deals with a data stream coming from a transducer while the user is writing. When the user writes on an electromagnetic, pressure-sensitive digitizing tablet (the typical hardware used for collecting such data), the successive movements of the pen are transformed into a series of electronic signals that are stored and analysed by the computer. Off-line handwriting recognition, in contrast, deals with the automatic conversion of text in images into letter codes that are usable by computers and text-processing applications.

Image Texts

Text inside images can be classified into two categories: artificial text (i.e. caption or superimposed text) and scene text (i.e. graphics text). Artificial text is overlaid on images at a later stage, such as headlines displayed on television, whereas scene text exists naturally in the image, such as text printed on a t-shirt. Text has various attributes that change its appearance, such as font style, size, orientation, colour, texture, contrast, alignment and background. All of these variations in appearance increase the difficulty of text extraction and make it more complicated.

  1. Size: Text size may vary greatly within an image, so assumptions are made according to the application domain.
  2. Inter-Character Distance: Usually, the distance between the characters in a text line is uniform.
  3. Colour: Characters in a text line usually have the same or a similar colour, a property that is very helpful for connected-component based approaches to text detection. Detection becomes more complex for images and documents that contain text strings in more than two colours (polychrome).
  4. Alignment: Usually, caption text characters appear in clusters and lie horizontally, although they sometimes appear as non-planar text because of special effects. Scene text, however, can exhibit perspective distortions such as alignment in any direction, as well as geometric distortions.

OCR System

An OCR system involves a step-by-step process to extract text from an image into an editable format. The basic steps are: digitization, pre-processing, text localization, text segmentation, classification and character recognition.

Digitization

In the initial and most important stage of an OCR system, a paper-based handwritten or printed document is digitized, i.e. converted into electronic format. Digitization is typically attained by scanning the document and processing an image file to obtain an electronic representation of the original. It can be performed by scanners such as drum scanners, flatbed scanners, sheet-fed scanners, face-up scanners and digital cameras, using scanning processes such as photocopying and microfilming. Along with the benefits of scanning comes the drawback of degrading the quality of the documents. An expensive but highly effective approach, used by the best-subsidized archives in the world, is to microfilm the documents and then scan the film with a medical or high-definition film scanner.

Pre-processing

Once the digitized image is obtained, it must be pre-processed. This pre-processing is sometimes also referred to as the text detection stage. Pre-processing boosts the performance of the OCR system by applying skew correction and by removing disturbances and noise.

  1. Skew Correction: Images captured via camera and other media often suffer from skew and perspective distortions. Such distortions occur when the axes and/or planes are not parallel at the time the image is captured. The effect of perspective distortion is spread throughout the image and is rarely visible in small parts of it. Since the segmentation process generates only a few text areas, these can be de-skewed computationally using fast and efficient skew correction techniques.
  2. Removal of disturbances and noise: In this stage of processing, disturbing elements and noise are removed from the image in a step-by-step process in which gray-scaling is performed first, followed by line and discontinuity removal, and finally removal of dots.
  • Gray-Scaling. Gray-scaling is performed on a skew-corrected text area to produce a binary image, so that distinguishing text from the background becomes easier. Gray-scaling is a method used for text binarization. Although it is an improvement over monochrome, it requires more memory for storage since each pixel takes 4 to 8 bits. Gray-scaling converts each pixel from RGB to its gray-scale equivalent by adding 30% of the red, 59% of the green and 11% of the blue value (a small sketch of this conversion and of line removal follows this list). Apart from gray-scaling, there are various other binarization methods (e.g. Bernsen's binarization method) which are constantly being studied and modified for better efficiency.
  • Line Removal. When an image contains noise in the form of horizontal and vertical fluctuations (lines running through the image), such lines must be removed to improve the efficiency of the OCR system. For this, the image is scanned progressively to detect rows and columns consisting of black pixels throughout the width or height of the image. Each detected row or column is then removed by changing its pixel values from black to white.
  • Discontinuity Removal. Line removal has no adverse effect on images without such distortions, but it does leave discontinuities in the text areas that were intersected by lines. These discontinuities make text recognition difficult, so they need to be rectified. For this, the 8-connected pixel connectivity algorithm is applied: if the diagonally, horizontally or vertically opposite pairs of neighbours of the considered pixel are black, then the considered pixel is also set to black. The algorithm is applied iteratively until all discontinuity pixels have been processed.
  • Dots Removal. In the final step of pre-processing, remaining disturbances such as unwanted black pixels that have not yet been rectified are eliminated. Here the 8-connected pixel connectivity algorithm is again used, along with the assumption that each cluster of black pixels forming noise is significantly smaller than any cluster of characters in the text. Finally, all connected groups of black pixels below the threshold value, indicating that the cluster is unwanted noise rather than text characters, are eliminated.
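The following sketch shows the weighted RGB-to-gray conversion and the line-removal scan described above. It is illustrative only; the file name is an assumption, and a real system would apply skew correction before this point.

```python
import numpy as np
import cv2

image = cv2.imread("text_region.png")  # OpenCV loads channels in BGR order
b, g, r = image[:, :, 0], image[:, :, 1], image[:, :, 2]

# Gray-scaling: roughly 30% red + 59% green + 11% blue per pixel.
gray = (0.30 * r + 0.59 * g + 0.11 * b).astype(np.uint8)

# Binarize so text pixels are black (0) and the background is white (255).
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Line removal: whiten any row or column that is black across its full extent.
for y in range(binary.shape[0]):
    if (binary[y, :] == 0).all():
        binary[y, :] = 255
for x in range(binary.shape[1]):
    if (binary[:, x] == 0).all():
        binary[:, x] = 255
```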

Text Localization

After the pre-processing of the image is finished, we are left with text against a plain background. Localization involves enhancing the text area by removing the non-text areas. Text in an image has the property that all characters appear close to one another, forming a cluster. Using this property, a morphological dilation operation can be used to cluster the text pixels together and eliminate pixels far away from the candidate text area.
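A hedged sketch of this clustering idea using OpenCV morphology follows; the kernel size and area threshold are assumptions that would be tuned per application.

```python
import cv2

binary = cv2.imread("binarized_page.png", cv2.IMREAD_GRAYSCALE)

# Invert so text is white, then dilate so nearby characters merge into one cluster.
inverted = cv2.bitwise_not(binary)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))  # wide kernel joins characters on a line
dilated = cv2.dilate(inverted, kernel, iterations=1)

# Each external contour is a candidate text region; very small ones are treated as noise.
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
text_boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
```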

Text Segmentation

In this stage, individual glyphs (basic units representing one or more characters, usually contiguous) are identified. Segmenting handwritten characters into zones (upper, middle and lower) and into individual characters is more difficult than for standard printed documents because of variations in paragraphs, words in a line, characters in a word, skew, slant, size and curves. For this process, the horizontal histogram profile of the binarized area is analysed to segment it into text lines; then, using the vertical histogram profile of each individual text line, we segment words and characters.
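A minimal sketch of the histogram-profile idea, assuming a binary image with black text on a white background; the grouping logic is simplified and the input file name is an assumption.

```python
import numpy as np
import cv2

binary = cv2.imread("binarized_page.png", cv2.IMREAD_GRAYSCALE)
text_mask = binary == 0  # True where a pixel is black (text)

# Horizontal profile: text pixels per row; empty rows separate text lines.
row_profile = text_mask.sum(axis=1)

lines, start = [], None
for y, count in enumerate(row_profile):
    if count > 0 and start is None:
        start = y                      # a text line begins
    elif count == 0 and start is not None:
        lines.append((start, y))       # a text line ends
        start = None
if start is not None:
    lines.append((start, len(row_profile)))

# Within each line, the vertical profile (text pixels per column) separates
# words and characters wherever it drops to zero.
for top, bottom in lines:
    col_profile = text_mask[top:bottom].sum(axis=0)
```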

Classification

This stage involves training the OCR engine by providing sample data related to data that is not yet present in the system but is required to extract and recognize the document provided. The trained feature values are then used for recognition of characters from the extracted text image.

Character Recognition

Recognition of characters is the final stage of text extraction from images. At this stage, we have the character arrays localized and discretely segmented from the pre-processed image. This step includes thinning and scaling the segmented characters, comparing them with the training data stored in the OCR engine, and displaying the closest match found.
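One simple way to realize this comparison, shown purely as an illustration (the template dictionary, the 32x32 normalization size and the distance measure are assumptions, not the method described above), is nearest-neighbour matching of normalized glyphs against stored templates.

```python
import numpy as np
import cv2

GLYPH_SIZE = (32, 32)  # assumed normalization size

def normalize(glyph):
    """Scale a segmented character image to a fixed size for comparison."""
    return cv2.resize(glyph, GLYPH_SIZE, interpolation=cv2.INTER_AREA).astype(np.float32)

def recognize(glyph, templates):
    """Return the label of the stored template closest to the glyph (mean squared difference)."""
    g = normalize(glyph)
    best_label, best_score = None, float("inf")
    for label, template in templates.items():  # templates: {"A": image, "B": image, ...}
        score = np.mean((g - normalize(template)) ** 2)
        if score < best_score:
            best_label, best_score = label, score
    return best_label
```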

Tesseract

History

Since the mid-1980s, a number of open-source research efforts have aimed to build OCR systems for recognizing text documents in a variety of scripts from all over the world. At present, several open-source OCR systems are available. GOCR was one of the first open-source OCR engines. Other engines such as Clara OCR and Ocrad followed in the last decade, but their performance, quality and capabilities were very limited compared with commercial OCR software. The Tesseract OCR engine, developed between 1984 and 1994 and improved in 1995 for its accuracy, was open-sourced by HP Labs in Bristol, England in 2005. Tesseract is highly portable and is focused more on accuracy than on rejection. Although Tesseract was first developed by HP, it was never used commercially by HP; at present, Google sponsors its development and maintenance. A major advantage of Tesseract is that it supports various languages.

Architecture

Like other OCR engines, Tesseract includes key functional modules such as a line and word finder, a word recognizer, a static character classifier, a linguistic analyser and an adaptive classifier, implementing the basic processes involved in an OCR system. At present, Tesseract can recognize printed text in various languages such as English, Spanish, French and Dutch, but it does not support document layout analysis or output formatting.

The architecture of Tesseract involves several stages, namely adaptive thresholding, connected component analysis, line and word finding, and a two-pass word recognizer, to detect text in a captured image and store the data in textual format for further processing. In the first stage, adaptive thresholding, Tesseract converts the captured image into a binary image. In the connected component analysis stage, the outlines of the binary image are gathered together into blobs.

Later, in the line and word finding stage, the blobs are first organized into lines by analysing the equivalent image size and fixed areas; these text lines are then segmented into words using definite and fuzzy spaces. Finally, the words are recognized in a two-pass process: each word is recognized and then passed to the adaptive classifier as training data, and the adaptive classifier uses this data to recognize the remaining text more accurately. Once the training data has been received and the learning process is complete, the final stage resolves remaining issues; for example, spelling mistakes can be corrected by generating suggestions for the misspelled words.
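Through the pytesseract wrapper (assuming it and a local Tesseract installation are available; the file name is hypothetical), the output of this line and word pipeline can be inspected directly, as in the sketch below, which lists each recognized word with its bounding box and confidence.

```python
import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("captured_document.png")
data = pytesseract.image_to_data(image, output_type=Output.DICT)

# Each index corresponds to one detected element; non-empty entries are words.
for i, word in enumerate(data["text"]):
    if word.strip():
        x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
        print(f"{word!r} at ({x}, {y}, {w}, {h}), confidence {data['conf'][i]}")
```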

Applications

OCR is one of the few technologies whose applications span the entire spectrum of industries, wherever an immediate saving of labour is needed. OCR helps handle a large amount of paperwork in multiple languages and in a variety of formats, not only making digitized storage easier but also making previously inaccessible data available at a single click. OCR also helps deliver better services in various sectors by reducing record-management time. Here are some applications of automated text extraction from images:


  1. GUI-based computer applications: Graphical interfaces on websites help users perform a wide variety of tasks using image search engines.
  2. Healthcare: In the healthcare industry, OCR technology is widely used for processing paperwork, including insurance and general health forms. Form processing tools powered by OCR extract data from forms into an electronic database that can be accessed easily whenever required.
  3. Detection of vehicle license plates: Text extraction from images helps in supervising real-time traffic. It is useful for recognizing vehicle licenses during traffic accidents, violations of traffic rules and various other scenarios.
  4. Content-based image filtering: This application includes detection of spam and filtering of pornographic, reactionary and fraudulent content.
  5. Banking: Banks use this technology to process cheques with minimal human involvement. It has been implemented nearly to perfection for printed cheques and is almost as accurate for handwritten cheques.

Conclusion

Various aspects of a common OCR engine architecture and its implementation in Tesseract have been discussed in depth. Although the accuracy of OCR systems depends strongly on the quality and nature of the text data, the analysis shows that Tesseract can be a very efficient OCR system.
