Python pdftotext ShellError Using textract

Question

When I run the below Python script on a directory that contains a PDF file, I keep getting this error:

ShellError: The command pdftotext "path/to/pdf/title.pdf" - failed with exit code 1
------------- stdout -------------
------------- stderr -------------
'pdftotext' is not recognized as an internal or external command, operable program or batch file.

I have verified that pdf2text and PDFMiner are installed properly. This is my first time using textract and it works great on all other file types (Word docs, PowerPoint docs, Excel docs, etc.). Why is the process calling pdftotext when pdf2text is the actual library?

import os
import textract

pdf_path = 'path/to/pdf/'

for fname in os.listdir(pdf_path):
    full_path = os.path.join(pdf_path, fname)
    if os.path.isfile(full_path):
        # textract.process returns bytes, so search for a bytes literal
        text = textract.process(full_path)
        if b'string' in text:
            print(fname)

Thanks!

I think you haven't tried Python 3. — yunus, Feb 15 '19 at 04:39

score 2 · Answer 1 · answered Jul 02 '15 at 16:49

I just finished dealing with this issue myself. From what I understand, the confusion is that pdftotext is a command-line utility that is popular on Linux, whereas pdf2text is a wrapper for the PDFMiner package. My Windows binary for poppler and pdftotext is from an archive.org link, so I don't feel right linking to it here, but here's a link I found on the Wikipedia page for a Windows binary. From what I've been able to tell, pdftotext tends to give better output than PDFMiner. In my case, pdftotext.exe was installed and on my PATH, but I still got the same error you are receiving unless I started the Python script from the command line.
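A quick way to see whether the Python process itself can find pdftotext is to ask the standard library. This is only a minimal sketch: the helper names find_tool and describe_tool are made up for illustration, and it only checks the PATH that the running interpreter sees, which may differ from the one in your command prompt.

```python
import shutil


def find_tool(name="pdftotext"):
    """Return the full path to `name` if it is on this process's PATH, else None."""
    return shutil.which(name)


def describe_tool(name="pdftotext"):
    """Build a one-line, human-readable status message for `name`."""
    path = find_tool(name)
    if path is None:
        return name + " was not found on PATH"
    return name + " found at " + path


if __name__ == "__main__":
    # Run this from the same environment (IDE, service, cmd) that runs textract.
    print(describe_tool())
```

If this reports that pdftotext was not found when run from your IDE but finds it when run from a command prompt, the two environments are most likely using different PATH values.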

If you end up downloading it, it comes with some other nice utilities like pdftohtml and pdftops. My personal favorite, though, is pdftotext -layout whatever.pdf - (the trailing - sends the output to stdout), which prints a PDF as plain text with everything in place.
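If you want the -layout output directly in Python without going through textract, something along these lines should work. It is a sketch under assumptions: the function names build_pdftotext_command and pdf_to_text are hypothetical, and it assumes a pdftotext executable is on PATH.

```python
import subprocess


def build_pdftotext_command(pdf_file, layout=True):
    """Build the argument list for pdftotext; the trailing "-" means stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    cmd.extend([pdf_file, "-"])
    return cmd


def pdf_to_text(pdf_file, layout=True):
    """Run pdftotext and return its stdout as bytes.

    Raises FileNotFoundError if pdftotext is not on PATH, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    return subprocess.check_output(build_pdftotext_command(pdf_file, layout))
```

Calling pdf_to_text('path/to/pdf/title.pdf') would then return the extracted text as bytes, so decode it (for example with .decode('utf-8', errors='replace')) before searching it for a str.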

tl;dr Try opening a command prompt and running the program from there. If it still fails, you might try (1) installing a Windows binary (assuming you're on Windows) or (2) updating textract with

pip install textract --upgrade

Hopefully that helps!

That didn't work for me, but I have found a workaround, given below. — yunus, Feb 14 '19 at 08:50
I converted the PDF file to doc or docx, and then converted that to text, through LibreOffice. Installation, setup and implementation took a while, but it solved most of the above fuss. — yunus, Jul 13 '19 at 13:31
I have worked on that. Currently looking for an opportunity in that field. Feel free to contact me at mohammedyunus009@gmail.com — yunus, Jul 13 '19 at 14:50

score 0 · Answer 2 · answered Feb 14 '19 at 08:54

Try adding this to your code:

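import os
import subprocess

filename = 'path/to/pdf/title.pdf'  # example path; adjust to your file
# Convert with LibreOffice in headless mode (soffice must be on PATH)
subprocess.call(['soffice', '--headless',
            '--convert-to', 'odt', filename])
filename = os.path.splitext(filename)[0] + '.odt'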

You need to have LibreOffice installed for this to work.

yunus
