Apache Tika on python extracts text from pdf on MacBook Pro but not Windows server

Question

As the title says, I am extracting text from multiple documents using tika in Python. For one particular PDF, the text is extracted on my development machine (MacBook Pro) but not on Windows Server 2012, where the content comes back as None.

Very confusing. At first I suspected a library mismatch, but both machines use the same jar file from Apache (1.19.1).

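try:
    # Ask Tika to also extract images embedded in the PDF
    headers = {'X-Tika-PDFextractInlineImages': 'true'}
    data = parser.from_file(pathtofile, serverEndpoint=self.TIKA_SERVER, headers=headers)
    # Keep only the first `limit` whitespace-separated tokens of the extracted text
    charstoreturn = data['content'].strip().split()[:limit]
    # Join into one line; swap double quotes for single, drop commas, straighten curly apostrophes
    charstoreturn = ' '.join(charstoreturn).replace("\n", " ").replace('"', "'").replace(",","").replace("’","'")
    return True, charstoreturn
except Exception as err:
    return False, "error {} on file: {}.\n".format(str(err), pathtofile)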

Where TIKA_SERVER is 'http://localhost:1234' and pathtofile is the file I am testing with, the one that is failing.

Error on windows: error 'NoneType' object has no attribute 'strip' on file: \testdata\test2.pdf.

Any ideas?

score 0 · Answer 1 · answered Dec 05 '18 at 13:26

The python tika wrapper is returning None for the content, so you need to dig into why that happened.

Is the tika server running? If not, why not? Do you have a suitable Java VM installed for it to use? Do you have permission to execute the jar? Does the Python code make assumptions about your Windows system that are not true (e.g. that jars are executable, or that the default VM is the correct one)?
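One quick way to answer the "is the server running" question from Python is to ask the server for its version. This is only a sketch: it assumes the server exposes the usual GET /version endpoint, and the function name tika_server_version is made up for illustration.

```python
import urllib.request


def tika_server_version(endpoint, timeout=5):
    """Return the server's version string, or None if it cannot be reached.

    Assumes the Tika server answers GET <endpoint>/version with plain text.
    """
    url = endpoint.rstrip("/") + "/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace").strip()
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError
        return None


# Example (endpoint taken from the question):
#   print(tika_server_version("http://localhost:1234"))
```

If this returns None on the Windows box, the problem is the server (Java, permissions, port) rather than the PDF. If it returns a version string, compare it with what the Mac reports.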

If the tika server is running, does tika work properly or give some other errors? If you put the PDF through a tika server that you start yourself from the same jar, does that work or give you an error? Can you debug (with a breakpoint, etc.) to see what errors, if any, come back from the web request in the python library?
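To see what actually came back instead of crashing on .strip(), you could inspect the parsed result before touching the content. A rough sketch, assuming the dict returned by parser.from_file carries 'status', 'metadata' and 'content' keys; the helper name summarize_tika_result is hypothetical.

```python
def summarize_tika_result(parsed, limit=50):
    """Return (ok, message) for a dict shaped like the tika parser output.

    Assumed keys: 'status', 'metadata', 'content'; any of them may be missing.
    """
    if not parsed:
        return False, "no response from the Tika server"
    status = parsed.get("status")
    content = parsed.get("content")
    if content is None:
        # No text came back: report what the server did send instead
        metadata = parsed.get("metadata") or {}
        return False, "no text extracted (status={}, metadata keys={})".format(
            status, sorted(metadata)
        )
    # Same truncation as in the question: first `limit` whitespace-separated tokens
    return True, " ".join(content.split()[:limit])


# Example, using the call from the question:
#   data = parser.from_file(pathtofile, serverEndpoint=TIKA_SERVER, headers=headers)
#   print(summarize_tika_result(data))
```

Running this against the failing PDF on both machines should show whether Windows gets a different status or metadata back, which narrows down where to look next.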

Works perfectly on all other documents I pass through, just one single pdf? — Hairy, Dec 05 '18 at 13:28
