I am using tesseract-ocr and getting its output in hOCR format. I need to store this hOCR output in a database (PostgreSQL in my case).
Since I will need most of the information in the hOCR (roughly 80% of it) individually, which approach is better: storing it in an XML column, or parsing it to JSON and storing that? If JSON is the way to go, how can I parse hOCR to JSON with Python? Other related suggestions are also appreciated.
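To make the JSON option concrete, here is a rough sketch of the kind of conversion I have in mind, using only the Python standard library. It assumes the words are marked with the `ocrx_word` class and that their `title` attribute carries `bbox` and `x_wconf` properties, as in typical Tesseract hOCR output. The names `parse_title`, `HocrWordParser`, `hocr_words` and `hocr_to_json`, and the output keys, are my own and not part of any library.

```python
import json
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 36 92 96 116; x_wconf 95'
    into a dict mapping each property name to its list of string values."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collect every element with class 'ocrx_word' as a plain dict."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word
        self._depth = 0       # open tags nested inside the current word

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Nested markup inside a word, e.g. <strong> or <em>.
            self._depth += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            props = parse_title(attrs.get("title") or "")
            conf = props.get("x_wconf")
            self._current = {
                "id": attrs.get("id"),
                "text": "",
                "bbox": [int(v) for v in props.get("bbox", [])],
                "conf": float(conf[0]) if conf else None,
            }
            self._depth = 0

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Closing tag of the word element itself.
        self._current["text"] = self._current["text"].strip()
        self.words.append(self._current)
        self._current = None


def hocr_words(hocr):
    """Return the words of an hOCR string as a list of dicts."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def hocr_to_json(hocr):
    """Serialize the extracted words as a JSON array."""
    return json.dumps(hocr_words(hocr))
```

The resulting string could presumably go into a `json` or `jsonb` column, but this sketch only keeps word-level data and drops the page, paragraph and line hierarchy, so I am not sure it is the right shape.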