Retaining tabular structure after extracting data using OCR Pytesseract

Question

I am using Pytesseract (OCR) to extract data from an image that contains a table. I write the extracted text to a text file and want to store it in an Excel sheet, because I couldn't store it in an Excel sheet directly. The problem is that after saving the data to a text file, the tabular structure is lost. I tried converting the text to a DataFrame and looked at a few SO questions, but none of them helped. My goal is for every cell of the Excel sheet to hold a single value extracted by Tesseract. The code that saves the text and converts it to a DataFrame is:

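import pandas as pd
import PIL.Image
import pytesseract

# `config` is the Tesseract config string, defined elsewhere (not shown)
text = pytesseract.image_to_string(PIL.Image.open("jpg path"), config=config)
#print(text)

# Append the OCR output to a text file
with open("file.txt", "a+", encoding="utf-8") as file:
    file.write("text :{0}".format(text))

# Read the same file back and split each line into cells on double spaces
list_of_lists = []
with open("file.txt", "r", encoding="utf-8") as f:
    for line in f:
        inner_list = [cell.strip() for cell in line.split("  ")]
        list_of_lists.append(inner_list)

df = pd.DataFrame(list_of_lists)
print(df)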
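
For illustration, here is a minimal sketch of one way the split step could keep the columns. It is not a confirmed solution. It assumes the OCR text separates columns with a tab or with two or more spaces, and that values inside a cell contain only single spaces. The names `rows_from_ocr_text` and `write_csv` and the sample text are hypothetical, made up for this example.

```python
import csv
import re


def rows_from_ocr_text(text):
    """Split OCR text into rows of cells.

    Assumption: a tab or a run of 2+ whitespace characters marks a column gap.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cells = [cell.strip() for cell in re.split(r"\s{2,}|\t", line)]
        rows.append(cells)
    # Pad short rows so every row has the same number of columns
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


def write_csv(rows, path):
    """Write the rows to a CSV file, which Excel can open (one value per cell)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Made-up sample standing in for the string returned by image_to_string
sample = "Name   Qty   Price\nApple   3   1.50\nPear   10\n"
rows = rows_from_ocr_text(sample)
```

With pandas, `pd.DataFrame(rows).to_excel("out.xlsx", index=False, header=False)` should write the same rows to an Excel file, assuming an Excel writer engine such as openpyxl is installed. This approach cannot tell columns apart when Tesseract collapses the gap between them to a single space. Passing `-c preserve_interword_spaces=1` in the Tesseract config may help keep the wide gaps, but that is untested here.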

