Problems applying a function to the content of each file in a directory in Python

Question

I am trying to read the content of several .pdf files in a directory in order to convert them to text with the tika library. However, I believe I am not reading the .pdf file objects correctly. This is what I have tried so far:

Input:

for filename in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
    with open(filename,"rb") as f:
        print(f)
        text = parser.from_file(f)

Output:

<_io.BufferedReader name='/Users/user/Downloads/pdf-files/a_pdf_file.pdf'>
AttributeError: '_io.BufferedReader' object has no attribute 'decode'

What is the most efficient way to walk through the content of the files in Python?

Thanks for the help @brianpck. I removed it and I still get the same exception: `AttributeError: '_io.TextIOWrapper' object has no attribute 'decode'`. – tumbleweed, Oct 07 '16 at 20:17

score 1 · Answer 1 · answered Oct 07 '16 at 20:23

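The tika parser's `from_file` takes a file path rather than an open file object and opens the file itself, so pass it the filename directly. It returns a dict, and the extracted text is under the 'content' key: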

import glob
import os
from tika import parser

for filename in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
    parsed = parser.from_file(filename)
    text = parsed['content']
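If you want to keep the text of every file rather than only the last one, the loop can be wrapped in a small helper. This is only a sketch under a few assumptions: `tika_extract` and `extract_directory` are hypothetical names, the tika package is assumed to be installed and able to reach a Tika server, and skipping files that raise an exception is just one possible policy.

```python
import glob
import os


def tika_extract(path):
    """Return the text Tika extracts from one file (empty string if none)."""
    # Lazy import: assumes the tika package is installed and a Tika server is reachable.
    from tika import parser
    parsed = parser.from_file(path)  # pass the path, not an open file object
    return parsed.get("content") or ""


def extract_directory(input_directory, extract=tika_extract, pattern="*.pdf"):
    """Map each matching file path to its extracted text, skipping files that fail."""
    texts = {}
    for filename in sorted(glob.glob(os.path.join(input_directory, pattern))):
        try:
            texts[filename] = extract(filename)
        except Exception as exc:  # hypothetical policy: report the failure and continue
            print(f"skipped {filename}: {exc}")
    return texts
```

Calling `extract_directory(input_directory)` would then return a dict keyed by file path. The `extract` parameter is there so that a different extractor (or a stub for testing) can be swapped in without touching the loop.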

Mureinik

Thanks for the help. Is there a faster way to do this for a large number of files? – tumbleweed Oct 07 '16 at 20:51
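One direction that might help with many files, offered as an unmeasured sketch rather than a confirmed answer: the tika Python package hands each file to a Tika server, so the calls are mostly waiting on I/O and could be overlapped with a thread pool. `extract_many` is a hypothetical helper, `extract` stands for any function that takes a path and returns text (for example one that calls `parser.from_file(path)['content']`), and the worker count of 4 is an arbitrary assumption. Whether this is actually faster depends on the server and the files.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract, max_workers=4):
    """Apply extract to every path using a thread pool; results keep the input order."""
    paths = list(paths)  # materialise once so generators can be passed in safely
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        return dict(zip(paths, pool.map(extract, paths)))
```

Note that an exception raised by `extract` for any single file propagates out of `extract_many`, so a real run over a large directory would probably want per-file error handling inside `extract`.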
