The popularity of digital images is increasing rapidly. Every day, large numbers of images are generated by many groups of users, such as students, engineers and doctors, and the need for and usage of these images vary among them. Users can access images based on their primitive features or on associated text, and text present in such images can provide meaningful information. We aim to retrieve this content and summarize the visual information in images automatically. Several algorithms are required to develop an optical character recognition (OCR) system. Tesseract, developed by HP Labs and currently owned by Google, is regarded as one of the most accurate OCR engines available. In this paper, we extract text from images using text localization, segmentation and binarization techniques.
Text extraction is achieved by applying text detection, which identifies the image parts containing text; text localization, which determines the exact position of the text; text segmentation, which separates the text from its background; and binarization, which converts a coloured image into a binary image of black text on a white background. Character recognition is then applied to this binary image to convert it into ASCII text. The method also considers features such as colour, shape and texture to extract the relevant documents. Text extraction is used in creating e-books from scanned books, searching for images within a collection of visual data, and so on.
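As a minimal illustration of this pipeline, the sketch below binarizes an image and hands it to Tesseract through the pytesseract wrapper. OpenCV, pytesseract and the placeholder file name sample.png are assumptions made for illustration, not part of the method described here.

```python
# Minimal sketch: binarize an image and pass it to Tesseract via pytesseract.
# Assumes OpenCV and pytesseract are installed and a Tesseract binary is on
# the PATH; "sample.png" is a placeholder file name.
import cv2
import pytesseract

image = cv2.imread("sample.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Otsu thresholding produces a binary image: black text on a white background.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Character recognition on the binary image returns plain (ASCII) text.
text = pytesseract.image_to_string(binary)
print(text)
```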
Over the past few years, the need to retrieve data from images and store it for future reference has increased rapidly. Several studies have examined approaches that can help extract data from images. These approaches cover the various extraction processes, such as text detection, text localization and text segmentation. Additionally, various properties of an image, such as colour, intensity, connected components, orientation and text style, are used to distinguish text regions from their backgrounds and from other regions within the image. Machine recognition of handwritten text has long been a topic of pattern recognition research.
Tesseract has previously been used to perform user-specific training on both isolated and free-flowing text, specifically using lower-case Roman script. Although its development lay dormant for more than ten years, Tesseract has since been improved and now stands alongside the major commercial engines in accuracy. Tesseract was developed by HP but never used commercially by it; it was later modified, improved and maintained by Google. Although Tesseract extracts data from images with high accuracy, it also has some flaws, such as over-segmentation of some characters and under-segmentation or rejection of cursive word segments. Its unusual choice of features is probably its key strength, while its use of polygonal approximation instead of raw outlines is its key weakness.
Various projects have been built on Tesseract to address real-world scenarios involving manuscripts, data extraction and archiving from images, effective manipulation of image databases, language processing and more.
Due to rapid development in digital technology, a huge amount of information is now stored in the form of images, driving the digitization of resources in various industries. Recent studies on image processing show the importance of retrieving content from images. Extracting text from images and converting it to ASCII text can be achieved with OCR systems. OCR is useful and popular in many applications, including digital libraries, information retrieval systems, multimedia systems and geographical information systems.
An OCR system largely removes the need for a keyboard interface between the user and the machine and helps automate office work, saving considerable time and human effort. The accuracy of an OCR system often depends on its text pre-processing and segmentation algorithms. The difficulty of text extraction depends on factors such as font style, size, orientation and the complexity of the image background. Various methodologies are used to perform text extraction from images, such as text detection, text localization and text segmentation.
Text detection plays a vital role in determining and highlighting regions that contain only text, which can then be fed into the optical character recognition module. Images captured for OCR often include skew and perspective distortions caused by human error, which must also be removed. The skew-corrected images are then binarized using simple yet efficient techniques before the segmentation process.
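A possible binarization step for a skew-corrected page is sketched below using OpenCV's adaptive thresholding; the choice of a local (rather than global) threshold, the block size and the file names are assumptions made for illustration.

```python
# Hedged sketch of binarization applied after skew correction. Adaptive
# thresholding copes better than a single global threshold when the page
# illumination is uneven.
import cv2

gray = cv2.imread("deskewed_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder name
binary = cv2.adaptiveThreshold(gray, 255,
                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # local Gaussian-weighted mean
                               cv2.THRESH_BINARY,
                               31,    # block size: odd-sized local neighbourhood
                               15)    # constant subtracted from the local mean
cv2.imwrite("binary_page.png", binary)
```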
After processing the input image, we obtain a binarized image, i.e. text written in black on a white background. On this binarized image, text localization is performed, which involves separating each character from the word by scanning pixels sequentially. Sometimes components of adjacent characters touch or overlap, which creates difficulties in the segmentation task.
This problem occurs frequently due to modification of the upper and lower zones of characters; thus, this is an important stage. Tesseract is open-source software that performs text extraction with comparatively high accuracy when compared with other OCR systems. Tesseract does not perform its own page-layout analysis, so it assumes its input is a binary image with optionally defined polygonal text regions. Processing then follows a step-by-step pipeline: connected component analysis (the components are called blobs), which recognizes the text as black-on-white; organization of blobs into text lines; breaking of text lines into words according to the kind of character spacing; and so on.
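Tesseract performs this connected component analysis internally; the sketch below merely illustrates the idea of gathering outlines into blobs, using OpenCV's connected component routine on a binary page image (the file name and the noise threshold are assumptions).

```python
# Illustrative sketch of the connected component ("blob") step, using OpenCV
# rather than Tesseract's internal implementation.
import cv2

binary = cv2.imread("binary_page.png", cv2.IMREAD_GRAYSCALE)  # black text on white
inverted = cv2.bitwise_not(binary)          # make the text pixels white

num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(inverted)
for i in range(1, num_labels):              # label 0 is the background
    x, y, w, h, area = stats[i]
    if area > 10:                           # ignore specks of noise (ad-hoc threshold)
        print(f"blob {i}: x={x}, y={y}, w={w}, h={h}, area={area}")
```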
Character Recognition
Character recognition is broadly classified into two categories: on-line and off-line character recognition. On-line character recognition deals with a data stream coming from a transducer while the user is writing. When the user writes on an electromagnetic or pressure-sensitive digitizing tablet (the typical hardware used for collecting such data), the successive movements of the pen are transformed into a series of electronic signals that are stored and analysed by the computer. Off-line handwriting recognition, in contrast, deals with the automatic conversion of text in images into letter codes usable by computers and text-processing applications.
Image Texts
Text inside images can be classified into two categories: artificial text (i.e. caption or superimposed text) and scene text (i.e. graphics text). Artificial text is overlaid on images at a later stage, such as headlines displayed on television, whereas scene text exists naturally in the images, such as text printed on t-shirts. Text has various attributes that change its appearance, such as font style, size, orientation, colour, texture, contrast, alignment and background. These variations in appearance increase the difficulty of text extraction and make it more complicated.
An OCR system follows a step-by-step process to extract text from an image into an editable format. The basic steps are digitization, pre-processing, text localization, text segmentation, classification and character recognition.
Digitization
In the initial and most important stage of an OCR system, a paper-based handwritten or printed document is digitized, i.e. converted into electronic format. Digitization is achieved by scanning the document and processing the resulting image file to obtain an electronic representation of the original. It can be performed by scanners such as drum scanners, flatbed scanners, sheet-fed scanners, face-up scanners and digital cameras, using processes such as photocopying and microfilming. Alongside their benefits, however, scanning processes carry the drawback of degrading the quality of the documents. An expensive but highly effective approach, used by the best-subsidized archives in the world, is to microfilm the documents and then use a medical or high-definition film scanner.
Pre-processing
Once the digitized image is obtained, it must be pre-processed. This pre-processing is sometimes also referred to as the text detection step. Pre-processing boosts the performance of the OCR system by applying skew correction and removing disturbances and noise.
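One common pre-processing recipe, assumed here rather than prescribed by the text, removes light noise with a median filter and estimates the skew angle from the minimum-area rectangle around the dark pixels. The angle convention of cv2.minAreaRect differs between OpenCV versions, so the folding step may need adjustment.

```python
# Sketch of pre-processing: noise removal followed by a heuristic skew correction.
import cv2
import numpy as np

gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)   # placeholder name
denoised = cv2.medianBlur(gray, 3)                            # suppress salt-and-pepper noise

# Estimate the skew angle from the minimum-area rectangle around text pixels.
_, mask = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
coords = np.column_stack(np.where(mask > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
# Fold the reported angle into a small correction around zero (heuristic; the
# reported range depends on the OpenCV version).
if angle > 45:
    angle -= 90
elif angle < -45:
    angle += 90

# Rotate the page back by the estimated skew angle.
h, w = denoised.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(denoised, M, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed_page.png", deskewed)
```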
Text Localization
After pre-processing of the image is finished, we are left with the text against a plain background. Localization enhances the text area by removing the non-text areas. Text in an image has the property that all characters appear close to one another, forming a cluster. Exploiting this property, a morphological dilation operation can be used to cluster the text pixels together and eliminate pixels far away from the candidate text area.
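The sketch below illustrates this dilation-based clustering with OpenCV: the inverted binary image is dilated with a wide kernel so that neighbouring characters merge, and bounding boxes of sufficiently large clusters are kept as candidate text regions. The kernel size and the area threshold are ad-hoc assumptions.

```python
# Sketch of morphological dilation used to cluster text pixels into candidate
# text regions.
import cv2

binary = cv2.imread("binary_page.png", cv2.IMREAD_GRAYSCALE)  # black text on white
inverted = cv2.bitwise_not(binary)                            # text pixels become white

# A wide rectangular kernel merges characters of the same line into one cluster.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
dilated = cv2.dilate(inverted, kernel, iterations=2)

contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w * h > 100:                         # drop tiny, isolated non-text regions
        print(f"candidate text region: x={x}, y={y}, w={w}, h={h}")
```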
Text Segmentation
In this stage, individual glyphs (basic units representing one or more characters, usually contiguous) are identified. Segmenting handwritten text into zones (upper, middle and lower) and into characters is more difficult than for standard printed documents because of variations in paragraphs, words in a line, characters in a word, skew, slant, size and curves. For this process, the horizontal histogram profile of the binarized area is analysed to segment the text into lines, and the vertical histogram profile of each text line is then used to segment words and characters.
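A simple projection-profile sketch of this idea follows: row sums of the ink pixels delimit text lines, and column sums within each line delimit characters or words. The helper function and file name are illustrative assumptions.

```python
# Sketch of projection-profile segmentation: horizontal profile -> text lines,
# vertical profile within each line -> characters/words.
import cv2
import numpy as np

binary = cv2.imread("binary_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder name
ink = (binary < 128).astype(np.uint8)       # 1 where text (black) pixels are

def runs(profile):
    """Return (start, end) index pairs where the profile is non-zero."""
    nonzero = profile > 0
    edges = np.diff(nonzero.astype(np.int8))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if nonzero[0]:
        starts = np.r_[0, starts]
    if nonzero[-1]:
        ends = np.r_[ends, len(profile)]
    return list(zip(starts, ends))

# Horizontal profile (sum over columns) segments the page into text lines ...
for top, bottom in runs(ink.sum(axis=1)):
    line = ink[top:bottom, :]
    # ... and the vertical profile within each line segments characters/words.
    for left, right in runs(line.sum(axis=0)):
        print(f"glyph box: rows {top}-{bottom}, cols {left}-{right}")
```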
Classification
This stage involves training the OCR engine by providing sample data for characters that are not yet present in the system but are required to extract and recognize the document provided. The trained feature values are then used to recognize characters in the extracted text image.
Character Recognition
Recognition of characters is the final stage of text extraction from images. At this point, the character arrays have been localized and discretely segmented from the pre-processed image. This step includes thinning and scaling the segmented characters, comparing them with the training data stored in the OCR engine, and displaying the closest match found.
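The sketch below combines the classification and recognition stages in a deliberately simplified form: a nearest-neighbour classifier is fitted on scikit-learn's bundled digit images, which stand in here for trained glyph samples, and the closest match is reported for a few held-out glyphs. This illustrates the idea only and is not Tesseract's internal classifier.

```python
# Simplified classification + recognition: fit a nearest-neighbour model on
# fixed-size character samples and report the closest match for unseen glyphs.
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                      # 8x8 greyscale character samples
X, y = digits.data, digits.target

# "Classification" stage: train on all but the last ten samples.
classifier = KNeighborsClassifier(n_neighbors=3).fit(X[:-10], y[:-10])

# "Recognition" stage: the last ten samples play the role of segmented glyphs
# that have already been scaled to the training size.
for glyph, truth in zip(X[-10:], y[-10:]):
    prediction = classifier.predict(glyph.reshape(1, -1))[0]
    print(f"closest match: {prediction} (actual: {truth})")
```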
History
Since the mid-1980s, a number of open-source research efforts have aimed to build OCR systems for recognizing text documents in a variety of scripts around the world. At present, several open-source OCR systems are available. GOCR was one of the first open-source OCR engines. Other engines such as Clara OCR and Ocrad followed in the last decade, but their performance, quality and capabilities were very limited compared with commercial OCR software. The Tesseract OCR engine, developed between 1984 and 1994 and improved in 1995 for accuracy, was open-sourced by HP Labs in Bristol, England, in 2005. Tesseract is highly portable and focuses more on accuracy than on rejection. Although Tesseract was first developed by HP, it was never used commercially by the company. At present, Google sponsors its development and maintenance. A major advantage of Tesseract is its support for numerous languages.
Architecture
Like other OCR engines, Tesseract includes key functional modules such as a line and word finder, a word recognizer, a static character classifier, a linguistic analyser and an adaptive classifier, implementing the basic processes involved in an OCR system. At present, Tesseract can recognize printed text in various languages, such as English, Spanish, French and Dutch, but it does not support document layout analysis or output formatting.
The architecture of Tesseract comprises several stages: adaptive thresholding, connected component analysis, line and word finding, and a two-pass word recognizer that detects text in the captured image and stores the data in textual form for further processing. In the first stage, adaptive thresholding, Tesseract converts the captured image into a binary image. In the connected component analysis stage, the outlines of the binary image are gathered together into blobs.
Later, in the line and word finding stage, the blobs are first organized into lines by analysing equivalent image size and fixed areas, and these text lines are then segmented into words using definite and fuzzy spaces. Finally, the words are recognized in a two-pass process: each word is first recognized and then passed to the adaptive classifier as training data, which allows the adaptive classifier to recognize the remaining text more accurately. Once the training data has been received and the learning process is complete, the final step resolves remaining issues, such as correcting spelling mistakes by generating suggestions for misspelled words.
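In practice, Tesseract's line and word finding can be observed through the pytesseract wrapper, which exposes word-level bounding boxes and confidence scores; the wrapper and the placeholder file name below are assumptions made for illustration.

```python
# Hedged example: inspect Tesseract's word-level output (boxes and confidences)
# via pytesseract. "page.png" is a placeholder file name.
import cv2
import pytesseract
from pytesseract import Output

image = cv2.imread("page.png")
data = pytesseract.image_to_data(image, output_type=Output.DICT)

for i, word in enumerate(data["text"]):
    if word.strip() and float(data["conf"][i]) > 0:   # skip empty/low-confidence entries
        x, y, w, h = (data["left"][i], data["top"][i],
                      data["width"][i], data["height"][i])
        print(f"'{word}' at ({x}, {y}, {w}x{h}), confidence {data['conf'][i]}")
```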
OCR is one of the few technologies whose applications spread across the entire spectrum of industries where an immediate saving of labour is needed. OCR helps reduce a large amount of paperwork in multiple languages and formats, not only by making digitized storage easier but also by making previously inaccessible data available at a click. OCR also helps deliver better services in various sectors by reducing record management time. Its applications include digital libraries, information retrieval systems, multimedia systems and geographical information systems, among others.
Various aspects of a common OCR engine architecture and its implementation in Tesseract have been discussed in depth. Although the accuracy of OCR systems depends heavily on the quality and nature of the text data, the analysis shows that Tesseract can prove to be a very efficient OCR system.