Parsing hOCR to JSON with Python

Question

I am using tesseract-ocr and get the output in hOCR format. I need to store this hOCR output into the database (PostgreSQL in my case).

Since I may need every piece of information (80% of it) from this hOCR individually, which would be the right approach? Should it be stored as XML datatype or parsed to JSON and stored? And in case of JSON, how to parse this hOCR to JSON with Python. Other related suggestions are also appreciated.

I still got to design the database for this, i am at the moment, deciding on the workflow and elements to be included to this workflow of storing the details of this hOCR format file. I am not a python programmer. So, did not try the implementations yet. — Shankar, Jul 19 '18 at 12:33

score 4 · Answer 1 · answered Jul 19 '18 at 15:37

4

hOCR appears to be a dialect of XML, so you should be able to use the xml.etree module from the stdlib to parse the hOCR code into a Python-navigable tree. Then navigate that tree to compose an object or nested dict, and then finally using the stdlib's json module to convert that dict to JSON.

answered Jul 19 '18 at 15:37

PaulMcG

62,419
16
94
130

1

Yes, hOCR is either a XHTML or HTML document, see the documentation http://kba.cloud/hocr-spec/1.2/#representation . For a start you can look at the implementation of the hocr-tools which are Python tools parsing the hOCR files by using etree and then make some different calulations: https://github.com/tmbdev/hocr-tools . – zuphilip Jul 19 '18 at 20:49

Parsing hOCR to JSON with Python

1 Answers1