Text recognition of an image with Tesseract

Question

I would like to create a PDF file with text recognition from a scanned image.

But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read, but the font doesn't matter that much.

This Tesseract command does almost what I want, but the text is invisible.

tesseract -c textonly_pdf=1 test.tif test pdf

How can I make the text visible?
Can I create the PDF file with another command-line or Python tool?

I'm running Tesseract on Ubuntu.

Davide · Answer 1 · 2021-11-12T11:50:56.037

Here is a snippet from a Python script I wrote a year ago (on Windows) that extracts the recognized text into a DataFrame, which you can then save to CSV or other formats.

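import cv2
import pytesseract as pya
from pytesseract import Output

# Path to the Tesseract binary (Windows example; not needed if tesseract is on PATH)
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'

# OpenCV loads images as BGR; pytesseract assumes RGB, so convert first
imgcv = cv2.cvtColor(cv2.imread('foo.jpg'), cv2.COLOR_BGR2RGB)

# text_df holds the extracted text, confidence, bounding boxes and so on
text_df = pya.image_to_data(imgcv, output_type=Output.DATAFRAME)

# Keep confident words only; this also drops non-word rows, which have conf == -1
text_df = text_df[text_df.conf > 50]

# Mean confidence of the words that were kept
conf = text_df['conf'].mean()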
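
The DataFrame has one row per word, so to get readable plain text you still need to join the words back into lines. Here is a minimal sketch of that step. It assumes rows with the column names that `image_to_data` reports (`page_num`, `block_num`, `par_num`, `line_num`, `word_num`, `conf`, `text`). The function name `rows_to_lines` and the sample rows are made up for illustration, not part of pytesseract.

```python
from collections import defaultdict


def rows_to_lines(rows, min_conf=50):
    """Join word-level OCR rows into text lines (hypothetical helper).

    rows: iterable of dicts using image_to_data column names.
    Words with conf <= min_conf or empty text are skipped.
    """
    lines = defaultdict(list)
    for row in rows:
        text = str(row["text"]).strip()
        if float(row["conf"]) <= min_conf or not text:
            continue
        # Words sharing page, block, paragraph and line numbers form one line
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines[key].append((row["word_num"], text))
    # Order the lines by their position keys and the words by word_num
    return [
        " ".join(word for _, word in sorted(words))
        for _, words in sorted(lines.items())
    ]


# Made-up sample rows, only to show the expected input shape
sample_rows = [
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "word_num": 0, "conf": -1, "text": ""},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "word_num": 2, "conf": 91.0, "text": "world"},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "word_num": 1, "conf": 96.0, "text": "Hello"},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 2, "word_num": 1, "conf": 12.0, "text": "n0ise"},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 2, "word_num": 2, "conf": 88.0, "text": "again"},
]

print(rows_to_lines(sample_rows))
```

With the DataFrame from the snippet above, something like `rows_to_lines(text_df.to_dict("records"))` should give you the lines, which you can then write to a text file or pass to whatever PDF library you prefer.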
