I wrote a small Python script to extract information from a PDF. I tested it on my local machine, which has Python 2.6.2 and pdftotext version 0.12.4.

I am trying to run it on my web hosting server (DreamHost), which has Python 2.5.2 and pdftotext version 3.02.

But when I run the script there, I get the following error at the pdftotext line (I have confirmed it with a simple throwaway script as well): "Error: Couldn't open file '-'"

def ConvertPDFToText(currentPDF):
    pdfData = currentPDF.read()

    tf = os.tmpfile()
    tf.write(pdfData)
    tf.seek(0)

    if (len(pdfData) > 0) :
        out, err = subprocess.Popen(["pdftotext", "-layout", "-", "-"], stdin = tf, stdout=subprocess.PIPE ).communicate()
        return out
    else :
        return None

Note that I am passing this function the same PDF file, and the script does have access to it: in another function I can email myself the PDF document from the same script running on the web host.

What am I doing wrong? What could differ in how subprocess, Python, or pdftotext behave between my local setup and the web host? I am guessing I will have to modify the command, so any help would be greatly appreciated.

Thanks in advance.

Chaitanya
  • Can pdftotext read from the command line directly on the web host? Can you verify this? Also, why don't you pass the name of the temporary file as an argument rather than feeding it in on standard input? – Noufal Ibrahim Jan 29 '11 at 13:33
  • 3.02 is probably the version of `xpdf`, not `pdftotext`. Usually `pdftotext` is part of _xpdf_ package. – PoltoS Jan 29 '11 at 13:42
  • @Noufal - Yes, it can read from the command line. For more context on why I am doing this, see this question: http://stackoverflow.com/questions/3745178/running-a-command-line-from-python-and-piping-arguments-from-memory – Chaitanya Jan 30 '11 at 13:04
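
To follow up on the version question raised in the comments, a minimal sketch like the one below can report which pdftotext binary is first on PATH and what banner it prints. It assumes Python 3.3+ (for `shutil.which` and `subprocess.DEVNULL`) and that the tool prints its version when given `-v`, which should hold for the xpdf and Poppler builds of pdftotext but is worth verifying on the host. `describe_tool` is a hypothetical helper name, not part of any library.

```python
import shutil
import subprocess


def describe_tool(name, version_flag="-v"):
    """Return (path, banner) for an executable on PATH, or None if not found.

    describe_tool is a hypothetical helper; version_flag is an assumption
    about how the tool reports its version.
    """
    path = shutil.which(name)
    if path is None:
        return None
    proc = subprocess.Popen(
        [path, version_flag],
        stdin=subprocess.DEVNULL,   # never wait for input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # version banners often go to stderr
    )
    banner, _ = proc.communicate()
    return path, banner.decode("utf-8", "replace").strip()
```

Calling `describe_tool("pdftotext")` on both machines and comparing the results would show whether the two hosts actually run the same build of the tool.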

3 Answers

The hint for the answer lay in Noufal's comment: pass a file name instead of using stdin. But a file created by os.tmpfile() has no name, so I had to use the tempfile module instead. The modified code is given below.

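import subprocess
import tempfile

def ConvertPDFToText(currentPDF):
    pdfData = currentPDF.read()
    if len(pdfData) == 0:
        return None

    # Pass real file names, since this pdftotext build rejects "-".
    tf = tempfile.NamedTemporaryFile()
    tf.write(pdfData)
    tf.flush()  # make sure the data is on disk before pdftotext opens the file

    outputTf = tempfile.NamedTemporaryFile()

    # pdftotext writes the text to outputTf.name; call() waits until it is done.
    subprocess.call(["pdftotext", "-layout", tf.name, outputTf.name])
    return outputTf.read()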
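
For readers on Python 3, a variant of the same temporary-file approach might look like the sketch below. It assumes Python 3.5+ (for `subprocess.run`) and a pdftotext executable on PATH. The function name `convert_pdf_to_text` and the `command` parameter are illustrative and not part of the original code. It also checks the exit status, so a failed conversion raises an exception instead of quietly returning empty output.

```python
import os
import subprocess
import tempfile


def convert_pdf_to_text(pdf_bytes, command=("pdftotext", "-layout")):
    """Run `command <input> <output>` on pdf_bytes and return the output bytes.

    Returns None for empty input. `command` is a parameter only so that the
    helper can be swapped or exercised without pdftotext installed.
    """
    if not pdf_bytes:
        return None
    # The directory and both files are removed when the block exits.
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input.pdf")
        out_path = os.path.join(workdir, "output.txt")
        with open(in_path, "wb") as handle:
            handle.write(pdf_bytes)
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(command) + [in_path, out_path], check=True)
        with open(out_path, "rb") as handle:
            return handle.read()
```

Using a temporary directory rather than two open NamedTemporaryFile objects avoids having the same file open in two processes at once, which matters on platforms that do not allow it.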

I am not sure how to give Noufal's comment the points for this answer, though. Perhaps he can cut and paste this answer?

Chaitanya
  • You have to add the imports `import os, tempfile, subprocess` to make the code above work. If currentPDF is the path of your file, replace the first line of the function with: `pdfData = file(currentPDF, 'rb').read()`. – rom Mar 23 '14 at 18:21

Can pdftotext read from the command line directly on the web host? Can you verify this? Also, why don't you pass the name of the temporary file as an argument rather than feeding it in on standard input? (Reposting here as per your suggestion.)

Noufal Ibrahim
  • Installing the `poppler-utils` package on the remote server was the solution for me. – rom Apr 08 '14 at 14:57

If you have shell access to the server, try running pdftotext without Python:

# pdftotext -layout - -

and:

# pdftotext -layout

Some versions of pdftotext may use stdin/stdout when run without any file names on the command line. Try:

    out, err = subprocess.Popen(["pdftotext", "-layout"], stdin = tf, stdout=subprocess.PIPE ).communicate()

Or use a temporary file, as suggested by Noufal Ibrahim.
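
Where a shell is not handy, the same checks can be driven from Python. The sketch below is one possible way to do that, assuming Python 3; `probe` is a hypothetical helper name. Unlike the snippet in the question, it captures stderr and the exit status as well, so the error message comes back to the script instead of only appearing on the console.

```python
import subprocess


def probe(argv, stdin_bytes=b""):
    """Run argv with stdin_bytes on stdin; return (exit_code, stdout, stderr)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() feeds stdin and drains both pipes without deadlocking.
    out, err = proc.communicate(stdin_bytes)
    return proc.returncode, out, err
```

For example, `probe(["pdftotext", "-layout", "-", "-"], pdf_bytes)` and `probe(["pdftotext", "-layout"], pdf_bytes)` correspond to the two shell commands above, where `pdf_bytes` holds the contents of a PDF file.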

David Andrei Ned
PoltoS
  • None of these work. "pdftotext -layout - -" gives me the same error message, and "pdftotext -layout" just prints the help text. The subprocess call without the "-" arguments behaves just like "pdftotext -layout" and prints the help message. – Chaitanya Jan 30 '11 at 12:59