Questions tagged [ocr]

Optical Character Recognition, usually abbreviated to OCR, is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. The following topics, although some are distinct fields of application, are also commonly referred to as OCR: Handwritten Text Recognition (HTR), Optical Word Recognition (OWR), Intelligent Character Recognition (ICR), Intelligent Word Recognition (IWR).

OCR is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website.

OCR @Wikipedia

6124 questions
38
votes
6 answers

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants…
James Owers
  • 7,948
  • 10
  • 55
  • 71
37
votes
1 answer

Using Microsoft OCR Library with JS/jQuery in VS 2013

I am currently working on a Windows 8.1 application and I am using web languages and mostly jQuery (Cordova type project) as it might be used on other platforms. I need to use the Microsoft OCR Library (not Tesseract or any other ones, I know them…
37
votes
4 answers

Character recognition (OCR algorithm)

I am working on a project in which I have to develop an OCR algorithm (I have to read the text from an image and then convert it to a different language). So my first task is to get text from the image. Steps to complete the first task: Loading any image format…
TLE
  • 705
  • 1
  • 7
  • 16
36
votes
9 answers

What is the ideal font for OCR?

Does anybody have any experience with different fonts for OCR? I am generating an ID then trying to scan it with Tesseract. At the moment I am just trying different fonts by trial and error, but this seems pretty inefficient. I've tried the OCR* family of fonts, and…
Chris Lloyd
  • 12,100
  • 7
  • 36
  • 32
36
votes
6 answers

Preprocessing image for Tesseract OCR with OpenCV

I'm trying to develop an app that uses Tesseract to recognize text from documents taken by a phone's camera. I'm using OpenCV to preprocess the image for better recognition, applying a Gaussian blur and a threshold method for binarization, but the…
Mauricio
  • 839
  • 2
  • 13
  • 26
35
votes
6 answers

Recognize a number from an image

I'm trying to write an application to find the numbers inside an image and add them up. How can I identify the written number in an image? There are many boxes in the image; I need to get the numbers on the left side and sum them to give a total. How…
Hash
  • 7,726
  • 9
  • 34
  • 53
34
votes
3 answers

Is there an efficient algorithm for segmentation of handwritten text?

I want to automatically divide an image of ancient handwritten text by lines (and by words in the future). The first obvious part is preprocessing the image... I'm just using a simple digitization (based on pixel brightness). After that I store data…
Ernado
  • 641
  • 1
  • 6
  • 14
33
votes
8 answers

Is there an OCR library that outputs coordinates of words found within an image?

In my experience, OCR libraries tend to merely output the text found within an image but not where the text was found. Is there an OCR library that outputs both the words found within an image as well as the coordinates (x, y, width, height) where…
Adam Paynter
  • 46,244
  • 33
  • 149
  • 164
33
votes
9 answers

Tesseract OCR simple example

Hi, can anyone give me a simple example of testing Tesseract OCR, preferably in C#? I tried the demo found here. I downloaded the English dataset and unzipped it to the C drive, and modified the code as follows: string path =…
Will Robinson
  • 631
  • 2
  • 6
  • 11
33
votes
6 answers

Using Tesseract from Java

I'm trying to build a sample application in Java that will read an image file and just output the text extracted from the image. I found the Tesseract project which seems promising, however, it's in C++. In order to use it, should I simply run it as…
Omnipresent
  • 29,434
  • 47
  • 142
  • 186
32
votes
7 answers

How to remove all lines and borders in an image while keeping text programmatically?

I'm trying to extract text from an image using Tesseract OCR. Currently, with this original input image, the output has very poor quality (about 50%). But when I try to remove all lines and borders using Photoshop, the output improves a lot (~90%).…
wind
  • 423
  • 1
  • 4
  • 5
32
votes
5 answers

OCR with the Tesseract interface

How do you OCR a TIFF file using Tesseract's interface in C#? Currently I only know how to do it using the executable.
toh yen cheng
  • 345
  • 1
  • 4
  • 5
32
votes
2 answers

Which OCR Engine is better: Tesseract or OCRopus?

I have tried Tesseract with iPhone and assessed its accuracy to be 70% without image preprocessing. I also noticed that it might be poor in extracting digits. I have heard about OCRopus OCR engine: which is better, Tesseract or OCRopus, in terms of…
Ahmed Hussein
  • 442
  • 2
  • 6
  • 12
31
votes
10 answers

Programmatically recognize text from scans in a PDF File

I have a PDF file, which contains data that we need to import into a database. The files seem to be PDF scans of printed alphanumeric text. Looks like 10 pt. Times New Roman. Are there any tools or components that will allow me to recognize…
Rob
  • 3,026
  • 4
  • 30
  • 32
30
votes
2 answers

What OCR options exist beyond Tesseract?

I've used Tesseract a bit and its results leave much to be desired. I'm currently detecting very small images (35x15, without border, but have tried adding one with ImageMagick with no OCR advantage); they range from 2 chars to 5 and are a pretty…
ylluminate
  • 12,102
  • 17
  • 78
  • 152