Extracting text as CSV from a scanned PDF file using Tesseract

Question

I need help extracting text from a scanned PDF. I have tried PyMuPDF, Pillow, and pytesseract, but the results are not correct: some of the text is returned incorrectly. I tried increasing the sharpness and brightness, but still did not get a good result.

I have already checked many answers using OpenCV, but I am fairly new to OpenCV. Please help.

import os

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def pdf_to_text(pdf_file,text_file_name,rotate_pdf=False,adj_sharpness=False,adj_contract=False,adj_brightness=False):
    try:
        doc = fitz.open(pdf_file)
        zoom_x=2.5
        zoom_y=2.5
        mat = fitz.Matrix(zoom_x,zoom_y)
        files = []
        for n in range(doc.page_count):
            #print(f'Extracting {n} image')
            page = doc.load_page(n)
            if rotate_pdf:
                page.set_rotation(-90)
            #pix = page.get_pixmap(dpi=600)    
            pix = page.get_pixmap(alpha=False,matrix=mat,dpi=300)
            
            folder=os.path.join(os.getcwd(),"images")
            if not os.path.exists(folder):
                os.makedirs(folder)
            fname = os.path.join(folder,"page-%i.png"%n)

            pix.save(fname)
            
            im = Image.open(fname)
            
            im = adjust_sharpness(im,2.5)
            im = adjust_brightness(im,1.1)
            im = adjust_contrast(im,2.8)
            #im = im.filter(ImageFilter.SMOOTH)
            im.save(fname)
            #remove_lines(fname)
          
            
            files.append(fname)
            #if n>1:
            #    break   
        print("Extracting Images Completed")
        print("Now Extracting data from image file")
        
        for file in files:
            #file = "./images/page-0.png"
            
            text = image_to_string(file, lang_code="eng")
            
            #text = image_to_string(file, lang_code="fra+eng")
            make_textfile(text, text_file_name)
        print("Extracting and saving text files completed")
    except FileNotFoundError:
        print(f"File not available {pdf_file}")
        return None    
    
    pytesseract.image_to_string(image=Image.open(image_name))

The image:

Well, this is extracted from pdf as a sample. – Ketan Mar 16 '22 at 16:15

score 0 · Answer 1 · answered Mar 16 '22 at 19:45

To process tables in Tesseract you are likely to need to remove table lines to help the OCR engine with the segmentation of the image. However, you may try this first to see how Tesseract will perform.

text = pytesseract.image_to_data(file, lang="eng", config="--psm 6")

The --psm 6 option treats your image as a single uniform block of text, so that as little text as possible is missed. Removing the lines and binarizing the image will still lead to better results. This link would help you with the removal of lines.

Esraa Abdelmaksoud


Hi, thank you for your answer. Yes, I have applied that strategy and removed the lines to get cleaner text. I am getting better results, but not the best. The problem I am facing is that the first and last columns are getting messy. After some tweaking of the last two columns, the output looks like this:

1,2203206CXJ00,12/19/21,99212(UD),1,$85.00,$22.72,$62.28,$0.00,$0.00,$0.00,$0.00,$0.00,$0.00,$22.72,CO45
1,2203206CXJ00,12/19/21,U0004(UD),1,$250.00,$63.75,CO18,$0.00,$0.00,$0.00,$0.00,$0.00,$0.00,$63.75,CO45
2,22032057KC00,01/04/22,99212(SA),1,$85.00,$26.73,$58.27,$0.00,$0.00,$0.00,$0.00,$0.00,$0.00,$26.73,CO45,N381

– Ketan Mar 16 '22 at 20:03
Could you share a link to your output image? – Esraa Abdelmaksoud Mar 16 '22 at 20:09
