I am trying to read the content of several .pdf files in a directory so that I can convert them to text with the tika library. However, I believe I am not reading the .pdf file objects correctly. This is what I have tried so far:
Input:
import glob
import os
from tika import parser

for filename in sorted(glob.glob(os.path.join(input_directory, '*.pdf'))):
    with open(filename, "rb") as f:
        print(f)
        text = parser.from_file(f)  # raises the AttributeError shown in the output
Output:
<_io.BufferedReader name='/Users/user/Downloads/pdf-files/a_pdf_file.pdf'>
AttributeError: '_io.BufferedReader' object has no attribute 'decode'
What is the most efficient way to walk through the content of these files in Python?
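One possible direction, offered as a sketch and not a confirmed fix: the traceback suggests that, in the tika version used here, `parser.from_file` tries to treat its argument as a path or URL string, so passing the open file object fails. Passing `filename` itself avoids opening the file by hand. The helper names `extract_texts` and `tika_extract` below are hypothetical, and the code assumes the result of `parser.from_file` is a dict with a `"content"` key, as described in the tika-python README.

```python
import glob
import os


def extract_texts(input_directory, extract):
    """Return {pdf_path: text} for every .pdf in input_directory, in sorted order.

    `extract` is any callable that takes a file path and returns text.
    """
    texts = {}
    for filename in sorted(glob.glob(os.path.join(input_directory, "*.pdf"))):
        texts[filename] = extract(filename)
    return texts


def tika_extract(path):
    """Hypothetical wrapper around tika: pass the path string, not a file object."""
    # Imported here so the rest of the module works without tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # Assumption: "content" may be None for files with no extractable text.
    return parsed.get("content") or ""
```

Usage would then be `texts = extract_texts(input_directory, tika_extract)`, followed by a loop over `texts.items()`. Keeping the extractor as a parameter makes it easy to swap in another parser or a stub for testing.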