Questions tagged [hocr]

hOCR is an open standard that defines a data format for representing OCR output. The standard aims to embed layout, recognition confidence, style, and other information in the recognized text itself. It does so by encoding that data in standard HTML.
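As a rough illustration of what "embedding in HTML" means in practice, here is a minimal sketch that reads word text, bounding boxes, and confidence out of an hOCR fragment. The sample markup and the helper names `parse_title` and `extract_words` are made up for this example; the fragment also omits the XHTML namespace that real Tesseract output usually declares, so real files would need namespace-aware tag matching.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment, simplified for illustration (no XHTML namespace).
SAMPLE_HOCR = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 96">Hello</span>
    <span class="ocrx_word" title="bbox 130 20 300 50; x_wconf 91"><strong>world</strong></span>
  </span>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def extract_words(hocr):
    """Return one dict per ocrx_word element: text, bbox, and confidence."""
    root = ET.fromstring(hocr)
    words = []
    for el in root.iter("span"):
        if el.get("class") != "ocrx_word":
            continue
        props = parse_title(el.get("title", ""))
        words.append({
            # itertext() also collects text nested in child tags such as <strong>
            "text": "".join(el.itertext()).strip(),
            "bbox": [int(v) for v in props.get("bbox", [])],
            "confidence": float(props["x_wconf"][0]) if "x_wconf" in props else None,
        })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word)
```

The resulting list of dicts can be passed straight to `json.dumps` if JSON output is wanted.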

Public Specification for the hOCR Format

31 questions

votes

0 answers

Converting hOCR formatted text to JSON

Trying to implement a Java class to convert hOCR output from Tesseract to JSON-formatted data. At the moment we use ABBYY for our OCR service, and it returns JSON-formatted data for the word locations on the OCR'd image. But Tesseract only…

json tesseract hocr

asked Jun 30 '17 at 15:26

MayoMan


votes

0 answers

How to get the hidden text layout that tesseract creates for pdf files?

I don't have much experience with OCR. Here's what I tried: tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf The result is a perfectly structured hidden text layout: the words are in their exact places when searching the PDF. My…

pdf layout tesseract hocr

asked Mar 07 '16 at 10:24

user6028395

votes

1 answer

Tesseract hOCR iOS

I am learning how to use the Tesseract API and I am interested in the hOCR output function. Currently I am using this code to scan the image. Tesseract* tesseract = [[Tesseract alloc] initWithLanguage:@"eng"]; tesseract.delegate = self; [tesseract…

ios objective-c tesseract hocr

asked Feb 04 '14 at 02:20

user3247146

vote

1 answer

Converting Google Cloud Vision OCR X and Y Co-ordinates to bbox Co-ordinates

Google Cloud Vision OCR produces the following output for a bounding box object: vertices { x: 786 y: 967 } Desired output format for the bounding box: I want to convert these coordinates to bounding box coordinates to write them in my…

ocr google-cloud-vision hocr

asked Dec 17 '21 at 04:26

Muneeb Ahmad Khurram

vote

0 answers

How to convert and save an hOCR file in a local path? How to solve the error in the following function?

I am getting an unexpected indent error in the following function. def save_hocr(self,data = []): df_hocr =data u= 0 for img in df_hocr: u = u+1 image = base64.b64decode(str(img)) img = Image.open(io.BytesIO(image)) …

python-3.x image-processing python-tesseract hocr

asked Feb 01 '20 at 11:44

CodeDecode

vote

1 answer

Parsing hOCR to JSON with Python

I am using tesseract-ocr and get the output in hOCR format. I need to store this hOCR output into the database (PostgreSQL in my case). Since I may need every piece of information (80% of it) from this hOCR individually, which would be the right…

python postgresql parsing python-tesseract hocr

asked Jul 19 '18 at 11:16

Shankar

vote

2 answers

Getting exact font size in hocr output

I'm using Tesseract to extract text and formatting from a large volume of pages that look like this: Sample page of OCR text with different line heights (My original images are 1200 DPI; I've reduced to 600 DPI and re-encoded to keep the file-size…

tesseract hocr

asked Apr 20 '17 at 23:02

Scott Armstrong

vote

1 answer

Extracting text by ElementTree

I am trying to run the following code to extract all the text from an XML file. Note "word_1_14": its word.text turns out to be None, so it is not printed. I found that this is because the text is inside a strong tag, thus…

python elementtree hocr

asked Nov 15 '16 at 07:56
Jeffrey Ng


1
vote

1 answer

Tesseract hOCR: How to detect upside down?

(I'll answer my own question here for general knowledge) In Tesseract OCR, how do you detect an image that is upside down? People who have worked with Tesseract may, or may not, know that Tesseract can read images that are being presented upside…

image rotation ocr tesseract hocr

asked Jan 03 '14 at 19:16
skiwi


0
votes

1 answer

How to convert Tesseract software output (hocr) into plain txt file with fop (generates zero output)?

The resulting output: a txt file with empty lines. The expected output: a txt file containing the text "Привет Мир! Это я, обычный неработающий текст или рыба". What am I doing wrong? Trying a nested xsl:for-each gives the same kind of…

xml xpath xslt apache-fop hocr

asked May 28 '22 at 02:51
Oleg


0
votes

1 answer

Windows Tesseract OCR getting scattered hOCR output instead of clean standard format

Quick help would be highly appreciated. I am extracting the text from a TIFF image with Tesseract OCR. The output I am looking for is .hocr (HTML). The output is perfect in terms of content, but the format looks very disorganized. But the…

windows command-line ocr tesseract hocr

asked Feb 09 '22 at 08:40
Joe


0
votes

1 answer

PDFMiner does not detect all pages

I am trying to extract text from pdfs, but I am running into an error because my script sometimes detects every page of a pdf, and sometimes only detects the first page of a pdf. I even included this line from a previous post on…

ocr data-extraction pdfminer hocr

asked Oct 16 '20 at 20:59
Sastorica


0
votes

0 answers

BS4 search and replace 'src' and 'style' attributes

I have been trying to search and replace some attributes in an html file, with information that I get from a second html file. I am using lxml from BeautifulSoup, but I am obviously doing something wrong and can't figure out what. I tried…

python-3.x beautifulsoup lxml hocr

asked Mar 30 '20 at 02:22
Giampaolo Ferradini


0
votes

0 answers

How do I make slashes act as word separators in HOCR output (Tesseract OCR)?

Is there any way to tell Tesseract OCR to treat certain characters as word separators in the HOCR output? For example, say I have a document about the Scranton/Wilkes-Barre RailRiders, and I want the slash to be treated as a word separator. So…

tesseract hocr

asked Jul 19 '19 at 22:51
Null Pointers etc.


0
votes

0 answers

Is there a way to generate a FO with a HOCR input file?

Is there a way to create an XSL-FO that can have as input an HOCR generated with tesseract to produce the PDF with searchable text?

xsl-fo hocr

asked Jul 04 '17 at 18:56
Qsebas

