When I run the Python script below on a directory that contains a PDF file, I keep getting this error:
ShellError: The command
pdftotext "path/to/pdf/title.pdf" -
failed with exit code 1
------------- stdout -------------
------------- stderr -------------
'pdftotext' is not recognized as an internal or external command, operable program or batch file.
I have verified that pdf2text and PDFMiner are installed properly. This is my first time using textract, and it works great on all other file types (Word docs, PowerPoint docs, Excel docs, etc.). Why is the process calling pdftotext when pdf2text is the actual library?
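import os
import os.path
import textract

pdf_path = 'path/to/pdf/'

# Scan every regular file in the directory for the search string.
for fname in os.listdir(pdf_path):
    full_path = os.path.join(pdf_path, fname)
    if os.path.isfile(full_path):
        # textract.process returns the extracted text as a byte string
        text = textract.process(full_path)
        if b'string' in text:
            print(fname)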
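For what it's worth, the stderr line in the error suggests that Windows could not find an executable named pdftotext on the PATH, which is separate from whether any Python package is installed. Here is a minimal diagnostic sketch for checking that. It assumes Python 3.3+ (for shutil.which), and the helper names find_tool and describe are made up for illustration; they are not part of textract.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


def describe(name):
    """Build a one-line report saying where the executable lives, if anywhere."""
    path = find_tool(name)
    if path is None:
        return "%s: not found on PATH" % name
    return "%s: %s" % (name, path)


if __name__ == "__main__":
    # The shell error names this command, so it is the one worth checking.
    print(describe("pdftotext"))
```

If this reports "not found on PATH", the shell that textract spawns would presumably fail in the same way as the error above.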
Thanks!