I want to extract text from a scanned table with tesseract and put it into arrays that have the same structure as the table.
I have already used opencv to detect the table structure, and obtained the coordinates of the table joints as well as the overall table structure (stored in an np.array).
For example, for the table in this picture:
I want pytesseract to store its contents as:
my_table = [[x, y, 1, 3],
            [x, a, 2, 3],
            [x, a, 2, 3],
            [x, z, 2, 3]]
I have used commercial OCR software, and it always works in two steps: first it detects the table structure, then it recognizes the text and places it into that detected structure.
How do I accomplish the second step with pytesseract? Answers using Tesseract in other languages are great as well.
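To make the second step concrete, here is a rough sketch of what I imagine it might look like: crop each cell using the joint coordinates and OCR the crops one by one. It makes several assumptions that may not hold for my real data: the joints reduce to sorted lists of horizontal and vertical grid-line positions (the hypothetical row_edges and col_edges), the grid is regular with no merged cells, and each cell holds a single line of text. The helper names cell_boxes, tesseract_cell and ocr_table, and the margin parameter, are made up for illustration.

```python
def cell_boxes(row_edges, col_edges):
    """Turn sorted grid-line coordinates into a 2-D list of (x0, y0, x1, y1) boxes."""
    return [
        [(x0, y0, x1, y1) for x0, x1 in zip(col_edges, col_edges[1:])]
        for y0, y1 in zip(row_edges, row_edges[1:])
    ]


def tesseract_cell(cell_image):
    """OCR one cropped cell; --psm 7 asks Tesseract to treat it as a single text line."""
    import pytesseract  # imported lazily so the grid logic runs without Tesseract
    return pytesseract.image_to_string(cell_image, config="--psm 7").strip()


def ocr_table(image, row_edges, col_edges, ocr=tesseract_cell, margin=2):
    """Return a list of rows, each a list of cell strings, mirroring the table.

    `image` is assumed to be a NumPy-style array (e.g. from cv2.imread) that
    supports image[y0:y1, x0:x1] slicing. `margin` (an assumption) trims a few
    pixels so the table borders are not fed to the OCR engine.
    """
    table = []
    for row in cell_boxes(row_edges, col_edges):
        table.append([
            ocr(image[y0 + margin:y1 - margin, x0 + margin:x1 - margin])
            for x0, y0, x1, y1 in row
        ])
    return table


# Hypothetical usage, with `img` loaded via cv2.imread and the edge lists
# derived from the detected joints:
# my_table = ocr_table(img, row_edges, col_edges)
```

Passing the OCR function as a parameter is only there so the cropping logic can be checked separately from Tesseract. I do not know whether per-cell cropping like this is the recommended approach, or how it should handle merged cells.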