I am using PaddleOCR to extract text from documents such as payslips and print the result. The code currently prints the words on the correct line and in the correct order, but I am struggling to print them in roughly the same horizontal alignment as in the original document.
Alignment is particularly crucial for payslips since they often consist of multiple smaller tables. When two or more tables are adjacent to each other, the extracted text can become jumbled, making it harder to process the data accurately.
I would greatly appreciate any assistance or suggestions that can help me improve the alignment of the extracted text. Here are some specific questions I have:
Are there any techniques or methods I can use to enhance the alignment of the extracted words?
Are there any additional libraries or tools that can be integrated with PaddleOCR to improve alignment accuracy?
Are there any best practices or strategies that can be followed while using PaddleOCR to handle tables or complex document structures?
If anyone has experience or expertise in working with PaddleOCR or text extraction from documents, I kindly request your help in resolving this issue. Your insights, suggestions, or code examples would be invaluable to me.
Thank you in advance. Here is my current code:
from paddleocr import PaddleOCR

# Initialise the OCR engine with angle classification enabled
ocr = PaddleOCR(use_angle_cls=True, lang='en')
img_path = 'payslips/png/payslip67.png'
result = ocr.ocr(img_path, cls=True)

# Flatten the per-page results into one list of [box, (text, confidence)] entries
ocr_extract = []
for res in result:
    for line in res:
        ocr_extract.append(line)
print(ocr_extract)

# Sort by the top-left y-coordinate so entries are visited from top to bottom
sorted_extract = sorted(ocr_extract, key=lambda x: x[0][0][1])

# Group entries whose y-coordinates are within 10 pixels of each other into one line
lines = []
current_line = []
previous_y = None
for element in sorted_extract:
    rectangle = element[0]
    text = element[1][0]
    y = rectangle[0][1]
    if previous_y is None or abs(y - previous_y) < 10:
        current_line.append((text, rectangle))
    else:
        lines.append(current_line)
        current_line = [(text, rectangle)]
    previous_y = y
if current_line:
    lines.append(current_line)

# Order the lines from top to bottom using the smallest y-coordinate in each line
# (looking a line up by its first word's text breaks when the same text appears twice)
lines.sort(key=lambda line: min(rect[0][1] for _, rect in line))

sorted_lines = []
for line in lines:
    # Sort the words in a line from left to right (x first, then y as a tie-breaker)
    sorted_line = sorted(line, key=lambda x: (x[1][0][0], x[1][0][1]))
    sorted_lines.append([word[0] for word in sorted_line])

for line in sorted_lines:
    print(' '.join(line))
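
To make the alignment goal concrete, here is a minimal sketch of one possible approach, not a tested solution: convert each word's left x-coordinate into a character column and pad with spaces, so that words from side-by-side tables stay in separate columns. It assumes each line is a list of (text, rectangle) pairs like the ones built in the grouping loop above, with each rectangle given as four [x, y] points starting at the top-left corner. The helper names `estimate_char_width` and `render_line`, the `fallback` value, and the sample coordinates are all made up for illustration and are not part of PaddleOCR.

```python
from statistics import median


def estimate_char_width(words, fallback=10.0):
    """Estimate the pixel width of one character from the word boxes.

    Uses the median of (box width / number of characters); returns
    `fallback` when there is nothing to measure.
    """
    samples = []
    for text, box in words:
        if text:
            width = box[1][0] - box[0][0]  # top-right x minus top-left x
            samples.append(width / len(text))
    return median(samples) if samples else fallback


def render_line(words, char_width):
    """Place each word at a column proportional to its left x-coordinate."""
    out = ""
    for text, box in sorted(words, key=lambda w: w[1][0][0]):
        col = int(round(box[0][0] / char_width))
        # Never overlap the previous word: keep at least one space between words
        if out and col <= len(out):
            col = len(out) + 1
        out = out.ljust(col) + text
    return out


if __name__ == "__main__":
    # Made-up sample: two words from the same line, far apart horizontally
    sample_line = [
        ("Net", [[200, 10], [230, 10], [230, 20], [200, 20]]),
        ("Pay", [[0, 10], [30, 10], [30, 20], [0, 20]]),
    ]
    char_width = estimate_char_width(sample_line)
    print(render_line(sample_line, char_width))
```

In practice the character width would probably be estimated once over every word on the page rather than per line, so that all lines share the same column grid. This only approximates the layout, since OCR boxes and proportional fonts do not map exactly onto fixed-width columns.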