Questions tagged [hocr]

hOCR is an open standard that defines a data format for representing OCR output. It aims to embed layout, recognition confidence, style, and other information in the recognized text itself, and it does so by encoding that data in standard HTML.

Public Specification for the hOCR Format
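
As a rough illustration of the format described above, the Python sketch below reads the layout and confidence data that hOCR stores in each element's title attribute (for example bbox and x_wconf). The sample markup and the helper names parse_title and words are hypothetical, written by hand for this example. Real hOCR files are usually full XHTML documents with a namespace, and some title properties hold quoted strings containing spaces, neither of which this minimal version handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-word hOCR fragment, written by hand for illustration.
SAMPLE = (
    '<div class="ocr_page" title="bbox 0 0 600 800">'
    '<span class="ocrx_word" id="word_1_1" '
    'title="bbox 36 92 96 116; x_wconf 95">Hello</span>'
    '</div>'
)


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def words(hocr):
    """Return one dict per ocrx_word element: text, bbox, confidence."""
    root = ET.fromstring(hocr)
    result = []
    for el in root.iter("span"):
        if el.get("class") != "ocrx_word":
            continue
        props = parse_title(el.get("title", ""))
        result.append({
            # itertext() also collects text nested in <strong>, <em>, etc.
            "text": "".join(el.itertext()),
            "bbox": [int(v) for v in props.get("bbox", [])],
            "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
        })
    return result


print(words(SAMPLE))
```

The resulting list of dicts can be passed to json.dumps if a JSON representation is needed.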

31 questions
2 votes, 0 answers

Converting hOCR-formatted text to JSON

Trying to implement a Java class to convert hOCR output from Tesseract to JSON-formatted data. At the moment we use ABBYY for our OCR service, which returns JSON-formatted data for the word locations on the OCR'd image. But Tesseract only…
MayoMan
2 votes, 0 answers

How to get the hidden text layout that Tesseract creates for PDF files?

I don't have much experience with OCR. Here's what I tried: tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf The result is a perfectly structured hidden text layout: the words are in their exact places when searching the PDF. My…
2 votes, 1 answer

Tesseract hOCR iOS

I am learning how to use the Tesseract API and I am interested in the hOCR output function. Currently I am using this code to scan the image. Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng"]; tesseract.delegate = self; [tesseract…
user3247146
1 vote, 1 answer

Converting Google Cloud Vision OCR x and y coordinates to bbox coordinates

Google Cloud Vision OCR produces the following output for a bounding-box object: vertices { x: 786 y: 967 } Desired output format for the bounding box: I want to convert these coordinates to bounding-box coordinates to write them in my…
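
One possible way to do the conversion the question describes is to take the minimum and maximum of the vertex coordinates, which yields the axis-aligned x0 y0 x1 y1 box that an hOCR bbox property expects. The sketch below assumes the vertices have already been turned into plain Python dicts. The function names are hypothetical, and only the first vertex (786, 967) comes from the question; the other three are made up for illustration. A missing coordinate is treated as 0 here on the assumption that zero values can be omitted from the response.

```python
def vertices_to_bbox(vertices):
    """Return (x0, y0, x1, y1) enclosing all vertices.

    Assumption: each vertex is a dict such as {"x": 786, "y": 967};
    a missing key is treated as 0.
    """
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def hocr_bbox(vertices):
    """Format the enclosing box as an hOCR title property."""
    return "bbox %d %d %d %d" % vertices_to_bbox(vertices)


# First vertex from the question; the remaining three are hypothetical.
example = [
    {"x": 786, "y": 967},
    {"x": 1032, "y": 967},
    {"x": 1032, "y": 1020},
    {"x": 786, "y": 1020},
]
print(hocr_bbox(example))
```

For rotated text the four vertices do not form an axis-aligned rectangle, so this box is only the smallest upright rectangle that contains them.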
1 vote, 0 answers

How to convert and save an hOCR file to a local path, and how to fix the error in the following function?

I am getting an "unexpected indent" error in the following function. def save_hocr(self, data=[]): df_hocr = data u = 0 for img in df_hocr: u = u + 1 image = base64.b64decode(str(img)) img = Image.open(io.BytesIO(image)) …
CodeDecode
1 vote, 1 answer

Parsing hOCR to JSON with Python

I am using tesseract-ocr and get the output in hOCR format. I need to store this hOCR output into the database (PostgreSQL in my case). Since I may need every piece of information (80% of it) from this hOCR individually, which would be the right…
Shankar
1 vote, 2 answers

Getting exact font size in hOCR output

I'm using Tesseract to extract text and formatting from a large volume of pages that look like this: Sample page of OCR text with different line heights (My original images are 1200 DPI; I've reduced to 600 DPI and re-encoded to keep the file-size…
1 vote, 1 answer

Extracting text by ElementTree

I am trying to run the following code to extract all the text from an XML file. Note "word_1_14": its word.text is None, so nothing is printed for it. I found that this is because the text is inside a strong tag, thus…
Jeffrey Ng
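
The behaviour described in this question can be reproduced with a small sketch. In ElementTree, an element's text attribute only holds the text that appears before its first child, so a word whose content sits entirely inside a nested strong tag reports None. Joining itertext() collects the nested text as well. The markup below is hypothetical apart from the id "word_1_14", which is taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the id "word_1_14" comes from the question.
SAMPLE = (
    '<span class="ocr_line">'
    '<span class="ocrx_word" id="word_1_13">plain</span> '
    '<span class="ocrx_word" id="word_1_14"><strong>bold</strong></span>'
    '</span>'
)

root = ET.fromstring(SAMPLE)
texts = {}
for word in root.iter("span"):
    if word.get("class") == "ocrx_word":
        # word.text misses text nested in child tags; itertext() does not.
        texts[word.get("id")] = (word.text, "".join(word.itertext()))

for word_id, (direct, full) in texts.items():
    print(word_id, repr(direct), repr(full))
```

Here word_1_13 yields the same string both ways, while word_1_14 yields None from text and "bold" from itertext().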
1 vote, 1 answer

Tesseract hOCR: How to detect upside down?

(I'll answer my own question here for general knowledge.) In Tesseract OCR, how do you detect an image that is upside down? People who have worked with Tesseract may or may not know that Tesseract can read images that are presented upside…
skiwi
0 votes, 1 answer

How to convert Tesseract output (hOCR) into a plain .txt file with FOP (generates empty output)?

The resulting output: a .txt file with empty lines. The expected output: a .txt file containing the text "Привет Мир! Это я, обычный неработающий текст или рыба". What am I doing wrong? I tried nested xsl:for-each code, which gives the same kind of…
Oleg
0 votes, 1 answer

Tesseract OCR on Windows produces scattered hOCR output instead of a clean standard format

Quick help would be highly appreciated. I am extracting text from a TIFF image with Tesseract OCR. The output I am looking for is hOCR (HTML). The content of the output is perfect, but the format looks very unorganized. But the…
Joe
0 votes, 1 answer

PDFMiner does not detect all pages

I am trying to extract text from PDFs, but I am running into an error because my script sometimes detects every page of a PDF and sometimes detects only the first page. I even included this line from a previous post on…
Sastorica
0 votes, 0 answers

BS4 search and replace 'src' and 'style' attributes

I have been trying to search and replace some attributes in an HTML file with information that I get from a second HTML file. I am using lxml from BeautifulSoup, but I am obviously doing something wrong and can't figure out what. I tried…
Giampaolo Ferradini
0 votes, 0 answers

How do I make slashes act as word separators in hOCR output (Tesseract OCR)?

Is there any way to tell Tesseract OCR to treat certain characters as word separators in the hOCR output? For example, say I have a document about the Scranton/Wilkes-Barre RailRiders, and I want the slash to be treated as a word separator. So…
Null Pointers etc.
0 votes, 0 answers

Is there a way to generate an XSL-FO from an hOCR input file?

Is there a way to create an XSL-FO that takes as input an hOCR file generated with Tesseract and produces a PDF with searchable text?
Qsebas