Questions tagged [tesseract]

Tesseract is an OCR (Optical Character Recognition) engine originally developed at HP Labs and now available as an open source library with development sponsored by Google.

Tesseract is an open source, multilingual OCR (Optical Character Recognition) engine originally developed at HP Labs. It is now sponsored by Google and licensed under the Apache License 2.0. It currently recognizes 107 languages. Tesseract is primarily written in C++ and C. The project is hosted at https://github.com/tesseract-ocr/tesseract and its support forums are found at http://groups.google.com/group/tesseract-ocr.

4350 questions
13
votes
2 answers

Resources containing OCR benchmark test-sets for free

I want to do an OCR benchmark for scanned text (typically any scan, i.e. A4). I was able to find some NEOCR datasets here, but NEOCR is not really what I want. I would appreciate links to sources of free databases that have appropriate images and…
SuTron
  • 387
  • 5
  • 16
13
votes
2 answers

How to OCR multiple columns in a document using tesseract

I am working on a project to OCR the Sinhala language using Tesseract. My goal is to OCR documents that contain multiple columns of text and get the output file in the correct format. Is there any method to identify columns in a document using Tesseract?
Sandun Tharaka
  • 727
  • 2
  • 9
  • 29
13
votes
1 answer

Threshold image using opencv (Java)

I am working with OpenCV for my project. I need to convert the image below to a thresholded image. I tried this function: Imgproc.threshold(imgGray, imgThreshold, 0, 255, Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU); But the result was not so good,…
Bee Bee
  • 185
  • 1
  • 2
  • 11
13
votes
3 answers

Where can I find the list of available property names for the first parameter of the tesseract->SetVariable function?

After a lot of googling I was able to find only a few of them, such as the example below for tesseract's SetVariable(1st param, 2nd param): tesseract->SetVariable("tessedit_char_whitelist",…
The iOSDev
  • 5,237
  • 7
  • 41
  • 78
12
votes
6 answers

Tesseract installation in Google colaboratory

I have installed tesseract in Google Colab using the command: !pip install tesseract
But when I run the command: text = pytesseract.image_to_string(Image.open('cropped_img.png'))
I get the error below: TesseractNotFoundError: tesseract is not…
Prosenjit
  • 145
  • 1
  • 2
  • 10
12
votes
1 answer

Tesseract OCR Text Position

I am working on OCR using Tesseract. I was able to get the application working and get the output. I'm trying to extract data from an invoice. But the spacing between words in the input has to be similar in…
ab2015
  • 125
  • 1
  • 1
  • 8
12
votes
2 answers

Training tesseract 4 with images instead of font

I have some questions about making tiff/box files for Tesseract 4. The TrainingTesseract 4.00 document says: Making Box Files As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some …
M.Rahnama
  • 131
  • 1
  • 1
  • 5
12
votes
1 answer

how to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written. My idea was to use Tesseract to parse an image of names, and then use the Levenshtein algorithm to compare each line with…
Joelio
  • 4,621
  • 6
  • 44
  • 80
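
For the sign-in sheet question above, the Levenshtein matching step can be sketched roughly as follows. This is only an illustration of the idea, not an answer from the thread: the function names, the `max_ratio` cutoff of 0.4, and the sample names are all assumptions, and the OCR step itself is left out (the input is whatever text lines the OCR engine returned).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_match(ocr_line, known_names, max_ratio=0.4):
    """Return the closest known name, or None if nothing is close enough.

    max_ratio is a hypothetical cutoff: distance divided by the longer
    string's length must not exceed it.
    """
    line = ocr_line.strip().lower()
    if not line or not known_names:
        return None
    name, distance = min(
        ((n, levenshtein(line, n.lower())) for n in known_names),
        key=lambda pair: pair[1],
    )
    if distance / max(len(line), len(name)) > max_ratio:
        return None
    return name
```

A call such as `best_match("Jon Smlth", ["John Smith", "Jane Doe"])` would pick the roster entry with the smallest edit distance. The cutoff matters for the remaining 10% of names that are not on the roster: without it, every OCR line would be forced onto some known name.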
12
votes
9 answers

Can't seem to run tesseract from command line despite adding PATH

I'm trying to add tesseract to be able to install pytesseract. I use Windows 7. I add this path to my PATH environmental variable C:\Program Files (x86)\Tesseract-OCR\tesseract.exe From the command line if I run tesseract DMTX_screenshot.png out …
Moondra
  • 4,399
  • 9
  • 46
  • 104
12
votes
2 answers

Extracting fields from forms with varying structures

I am trying to extract certain fields from a balance sheet. For example, I would like to be able to tell that the value of 'Inventory' is 1,277,838 for the following balance sheet: Currently, I am using Tesseract to convert images to text. However,…
Kelvin
  • 203
  • 2
  • 3
  • 5
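
As a rough illustration of the balance sheet question above, one simple starting point is to search the OCR text line by line for a label followed by a number. This is a sketch under strong assumptions: it presumes the label and its value end up on the same line of OCR output, which forms with varying layouts will not always guarantee. The function name `extract_field` and the sample text are hypothetical.

```python
import re


def extract_field(ocr_text, label):
    """Find `label` at the start of a line and return the first number after it.

    Returns an int (or a float if the number has a decimal part),
    or None when the label is not found.
    """
    pattern = re.compile(
        rf"^\s*{re.escape(label)}\b[^\d\n]*(\d[\d,]*(?:\.\d+)?)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(ocr_text)
    if match is None:
        return None
    digits = match.group(1).replace(",", "")  # drop thousands separators
    return float(digits) if "." in digits else int(digits)


sample = "Cash 52,100\nInventory .... 1,277,838\nTotal assets 3,000,000"
inventory = extract_field(sample, "Inventory")
```

When labels and values land on different lines, positional output (word bounding boxes) is probably a better basis than plain text.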
12
votes
1 answer

tesseract didn't get the little labels

I've installed Tesseract in my Linux environment. It works when I execute something like # tesseract myPic.jpg /output But my picture has some small labels and Tesseract didn't see them. Is there an option available to set a pitch or something like that…
Paul
  • 1,290
  • 6
  • 24
  • 46
12
votes
1 answer

Detect white characters on black background using Tesseract

I'm completely new to Tesseract OCR. This problem might be simple but I can't seem to find the answer using Google. Basically, I have an image that contains two parts: the first part, which is at the top of the image, has a black background with…
Chaoran
  • 321
  • 1
  • 4
  • 15
12
votes
4 answers

How to tesseract multiple files in the same folder from command prompt?

I know how to Tesseract multiple files in the same directory using Terminal on OS X. for i in *.tif ; do tesseract $i outtext; done; Does anyone have suggestions for how to do this on the Command Prompt on a computer running Windows?
Thomas Padilla
  • 193
  • 1
  • 1
  • 7
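
One platform-independent way to approach the batch question above is to drive the `tesseract` command from a short Python script instead of a shell loop. The sketch below assumes `tesseract` is on the PATH and uses the same `tesseract <image> <outputbase>` form as the loop in the question. The function names are hypothetical. Note that the shell loop as quoted writes every result to the same `outtext` base, so this version derives a separate output base from each image name.

```python
import subprocess
from pathlib import Path


def build_tesseract_commands(folder, pattern="*.tif"):
    """Return one tesseract command (as an argument list) per matching image."""
    commands = []
    for image in sorted(Path(folder).glob(pattern)):
        # tesseract appends ".txt" itself, so pass the path without extension
        output_base = image.with_suffix("")
        commands.append(["tesseract", str(image), str(output_base)])
    return commands


def run_all(folder, pattern="*.tif"):
    """Run tesseract on every matching file; raises if any run fails."""
    for command in build_tesseract_commands(folder, pattern):
        subprocess.run(command, check=True)
```

Calling `run_all(".")` from the folder containing the images would then behave the same way on Windows, macOS and Linux.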
12
votes
2 answers

Convert hOCR to HTML table

I am looking for a tool, or an idea that can be implemented in Python, that converts an hOCR file (generated by Tesseract in my application) to an HTML table. The idea is to utilize the text location information in the hOCR file (provided in the bbox attribute) to create…
azri.dev
  • 311
  • 3
  • 8
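
The bbox idea in the question above could be sketched along these lines, using only the Python standard library. It assumes the hOCR marks words as `span` elements with class `ocrx_word` and a `title` of the form `bbox x0 y0 x1 y1; ...`. Words whose vertical centers lie within a hypothetical `tolerance` (in pixels) are treated as one table row, and each word becomes one cell; real column detection (aligning cells across rows) is not attempted here.

```python
import re
from html import escape
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((*self._bbox, text))
            self._bbox = None


def group_rows(words, tolerance=10):
    """Group words into rows by vertical center, each row sorted left to right."""
    rows = []
    for word in sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0])):
        center = (word[1] + word[3]) / 2
        if rows and abs(center - rows[-1]["center"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})
    return [sorted(row["words"], key=lambda w: w[0]) for row in rows]


def hocr_to_table(hocr, tolerance=10):
    """Render the words found in an hOCR string as a plain HTML table."""
    parser = WordCollector()
    parser.feed(hocr)
    lines = ["<table>"]
    for row in group_rows(parser.words, tolerance):
        cells = "".join(f"<td>{escape(word[4])}</td>" for word in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

Merging adjacent words into one cell, for example when the horizontal gap between them is small, would be a natural next step for real documents.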
12
votes
1 answer

Digital Numbers on Tesseract OCR

SOLUTION: I had to train my own data to try it with the OCR. It seems to work well, but I don't know why the trained data from arturaugusto does not work for me =( https://github.com/adri1992/Tesseract_sevenSegmentsLetsGoDigital.git With my…
adlagar
  • 877
  • 10
  • 31