As above, I am extracting text from multiple documents using Tika in Python. For one particular PDF, the text is extracted on my development machine (MacBook Pro) but not on Windows Server 2012, where the content comes back as None.
This is very confusing. At first I suspected a library mismatch, but both machines use the same jar file from Apache (1.19.1).
try:
    headers = {'X-Tika-PDFextractInlineImages': 'true',}
    data = parser.from_file(pathtofile, serverEndpoint=self.TIKA_SERVER, headers=headers)
    charstoreturn = data['content'].strip().split()[:limit]
    charstoreturn = ' '.join(charstoreturn).replace("\n", " ").replace('"', "'").replace(",","").replace("’","'")
    return True, charstoreturn
except Exception as err:
    return False, "error {} on file: {}.\n".format(str(err), pathtofile)
Here TIKA_SERVER is 'http://localhost:1234' and pathtofile is the file I am testing with, which is the one that fails.
Error on Windows: error 'NoneType' object has no attribute 'strip' on file: \testdata\test2.pdf.
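To narrow this down, a guard along the following lines would at least separate "Tika returned no text" from other failures. This is only a sketch: it assumes the parsed result is a dict whose 'content' value can be None, `words_from_parsed` is a hypothetical helper name, and the 'status' key is read defensively because I am assuming it is present in the result.

```python
def words_from_parsed(data, limit):
    """Return (ok, text_or_message) for a parsed-result dict.

    Assumes `data` looks like {'content': str or None, ...}; the
    'status' key is an assumption and is only read with .get().
    """
    data = data or {}
    content = data.get('content')
    if content is None:
        # No text came back: report it explicitly instead of
        # failing later on None.strip().
        return False, "no text extracted (status: {})".format(data.get('status'))
    # Keep the first `limit` whitespace-separated words.
    words = content.strip().split()[:limit]
    # Same character clean-up as the original snippet.
    text = ' '.join(words).replace('"', "'").replace(",", "").replace("’", "'")
    return True, text
```

It would be called with the result of the parse, e.g. `words_from_parsed(data, limit)`, so the failing PDF produces a "no text extracted" message rather than an AttributeError. It does not explain why Windows gets no content, only makes that case visible.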
Any ideas?