0

I would like to create a pdf file with text recognition from a scanned image.

But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read, but the font doesn't matter that much.

This Tesseract command does almost what I want, but the text is invisible.

tesseract -c textonly_pdf=1 test.tif test pdf 
  • How can I make the text visible?
  • Can I create a pdf file with another command-line or python tool?

I'm running Tesseract in Ubuntu.

Tinke
  • 45
  • 3

1 Answers1

0

Here a snippet of code from a script I made in python (on windows) one year ago to extract the text in a dataframe (that you can then save to csv or other formats).

import cv2
import pytesseract as pya
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'
from pytesseract import Output

imgcv = cv2.imread('foo.jpg')
# in text_df you have the extracted text, confidence and so on
text_df = pya.image_to_data(imgcv , output_type='data.frame')
text_df = text_df[text_df.conf != -1]
text_df = text_df[text_df.conf > 50]
conf = text_df['conf'].mean()
Davide
  • 116
  • 7