
This code reads a folder of single-page .tif files and extracts textual data from each one.

import glob
import os
import re

import pandas as pd
import pytesseract
from PIL import Image

data = []
data1 = []
# Compile the pattern once, outside the loop.
regex1 = re.compile(r'(www(i|ı)a\s+bbb(\:)?(\s+|\s+\.)?\s+(de(s|r(:)?))?)', flags=re.IGNORECASE)

listOfPages = glob.glob(r"C:/Users/name/folder/*.tif")
for entry in listOfPages:
    if not os.path.isfile(entry):
        continue  # skip anything that is not a regular file
    data1.append(entry)
    # Tesseract's language code for English is "eng", not "en".
    text1 = pytesseract.image_to_string(Image.open(entry), lang="eng")
    text = re.sub(r'\n', ' ', text1)

    # Keep the first match of the pattern, or None if it is absent.
    var1a = regex1.search(text)
    var1 = var1a.group(1) if var1a else None

    data.append([text, var1])

df0 = pd.DataFrame(data, columns=['raw_text', 'var1'])
df01 = pd.DataFrame(data1, columns=['filename'])
df1 = pd.concat([df0, df01], axis=1)
df1 = df1.reset_index(drop=True)

How can I make it work if I eventually add multi-page .tif files to that folder? I am unable to transform the Image.open(entry) part into something like this:

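import numpy as np
from PIL import Image

def read_pages(path):
    # Load every frame of a (multi-page) TIFF into one array.
    img = Image.open(path)
    images = []
    for i in range(img.n_frames):
        img.seek(i)
        images.append(np.array(img))
    return np.array(images)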
id345678

1 Answer

  1. You can pass the np.array directly to image_to_string; pytesseract converts it itself (see https://github.com/madmaze/pytesseract/blob/master/pytesseract/pytesseract.py#L168):
    text1 = pytesseract.image_to_string(np.array(img), lang="eng")
  2. Or create the image from the array, rather than from the file, before passing it to pytesseract:
    text1 = pytesseract.image_to_string(Image.fromarray(np.array(img)), lang="eng")

Here is a full example (without the loop and the further processing):

import numpy as np
from PIL import Image
import pytesseract
tif = Image.open('multipage_tif_example.tif')
tif.seek(0)
img_page1 = np.array(tif)

# Variant 1
text1 = pytesseract.image_to_string(img_page1, lang="eng")

# Variant 2
text1 = pytesseract.image_to_string(Image.fromarray(img_page1), lang="eng")

The versions I use are:

  • Python 3.9.7
  • pytesseract==0.3.8
  • numpy==1.21.3
  • Pillow==8.4.0

The tiff is from http://www.nightprogrammer.org/development/multipage-tiff-example-download-test-image-file/
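
The example above reads only the first page. As a rough sketch of how the loop could be added back, the helper below walks every frame of each file with Pillow's ImageSequence.Iterator, so single-page and multi-page files go through the same path. The names ocr_all_pages and ocr_folder and the ocr parameter are hypothetical, introduced only for illustration; the ocr hook defaults to pytesseract.image_to_string and exists so the loop can be exercised without Tesseract installed.

```python
import glob
import os

from PIL import Image, ImageSequence


def ocr_all_pages(path, ocr=None, lang="eng"):
    """Return one OCR string per page of a single- or multi-page TIFF.

    `ocr` is a hypothetical hook: any callable taking (image, lang=...).
    It defaults to pytesseract.image_to_string.
    """
    if ocr is None:
        import pytesseract  # imported lazily so the loop can be tested without Tesseract
        ocr = pytesseract.image_to_string
    texts = []
    with Image.open(path) as tif:
        # The iterator seeks through every frame; a single-page file yields one frame.
        for page in ImageSequence.Iterator(tif):
            texts.append(ocr(page.copy(), lang=lang))
    return texts


def ocr_folder(pattern, ocr=None, lang="eng"):
    """Collect [filename, page number, text] rows for every file matching `pattern`."""
    rows = []
    for entry in sorted(glob.glob(pattern)):
        if not os.path.isfile(entry):
            continue
        pages = ocr_all_pages(entry, ocr=ocr, lang=lang)
        for page_no, text in enumerate(pages, start=1):
            rows.append([entry, page_no, text.replace("\n", " ")])
    return rows
```

The rows could then be loaded with pd.DataFrame(rows, columns=['filename', 'page', 'raw_text']), and the regex from the question applied to the raw_text column. Keeping one row per page, rather than one per file, is a design choice made here so that a match can be traced back to its page.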

Tom
  • Thanks for your suggestions. When I do `text1 = pytesseract.image_to_string(np.array(entry), lang="en")` or `text1 = pytesseract.image_to_string(Image.fromarray(np.array(entry)), lang="en")`, I get the error `TypeError: Cannot handle this data type: (1, 1),` – id345678 Oct 29 '21 at 13:00
  • I tried it myself, and with the configuration above it worked. Maybe it has something to do with your versions, or you have a broken or non-standard-conforming tif image. That is just a wild guess. – Tom Oct 29 '21 at 16:27
  • I use `python 3.9.7 pytesseract 0.3.8 numpy 1.21.2 pillow 8.3.2` On the internet, they say it has something to do with the dtype `np.uint8` – id345678 Oct 30 '21 at 08:16
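
One possible reading of the error in the comments, offered as an assumption rather than a confirmed diagnosis: in the question's loop, entry is the file path (a string), so np.array(entry) wraps the string instead of the image pixels, and Image.fromarray cannot map a string dtype to an image mode. The small check below illustrates this; try_fromarray is a hypothetical helper name and "page.tif" is a placeholder that is never opened.

```python
import numpy as np
from PIL import Image


def try_fromarray(obj):
    """Return the TypeError message raised by Image.fromarray, or None on success."""
    try:
        Image.fromarray(np.array(obj))
    except TypeError as exc:
        return str(exc)
    return None


# A path string becomes a 0-d array of unicode dtype, not pixel data.
path_error = try_fromarray("page.tif")  # placeholder name; the file is never read

# Real pixel data (here a blank 4x4 uint8 array) converts without error.
pixel_error = try_fromarray(np.zeros((4, 4), dtype=np.uint8))

print(path_error)
print(pixel_error)
```

If this is indeed the cause, opening the file first (Image.open(entry)) and converting that image, as in the answer's example, would sidestep the error.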