I am getting a subprocess.CalledProcessError (Command '['java', '-Dfile.encoding=UTF8', ...) while running the tabula Python library.

Command:

df = tabula.read_pdf(filepath, pages=5, guess=True, multiple_tables=True, stream=True, java_options="-Dfile.encoding=UTF8")

ERROR message:

  File "C:\Users\himsoni\AppData\Local\Programs\Python\Python37\lib\site-packages\tabula\io.py", line 85, in _run
    check=True,
  File "C:\Users\himsoni\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 487, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\himsoni\\AppData\\Local\\Programs\\Python\\Python37\\lib\\site-packages\\tabula\\tabula-1.0.3-jar-with-dependencies.jar', '--pages', '1', '--stream', '--guess', '--format', 'JSON', 'C:\\Users\\himsoni\\Desktop\\PDF_extraction\\black_white_format\\black_white_format\\PDF_Split_JPEGs\\blackwhite_Test.pdf']' returned non-zero exit status 1.

import tabula; tabula.environment_info()

Python version:
3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
Java version:
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) Client VM (build 25.231-b11, mixed mode, sharing)
tabula-py version: 2.0.1
platform: Windows-10-10.0.17763-SP0
uname:
uname_result(system='Windows', node='himsoni', release='10', version='10.0.17763', machine='AMD64', processor='Intel64 Family 6 Model 142 Stepping 10, GenuineIntel')
linux_distribution: ('', '', '')
mac_ver: ('', ('', '', ''), '')

Python and Java version

Python 3.7.4
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) Client VM (build 25.231-b11, mixed mode)
Does the java -h command work well? Yes
Is the java command included in PATH? Yes
OS and version: Windows 10

Code:

import tabula
# Every backslash in the Windows path is escaped (the original had a single "\P" after Desktop)
filepath = "C:\\Users\\himsoni\\Desktop\\PDF_extraction\\black_white_format\\black_white_format\\PDF_Split_JPEGs\\blackwhite.pdf"
df = tabula.read_pdf(filepath, pages=5, guess=True, multiple_tables=True, stream=True, java_options="-Dfile.encoding=UTF8")
print(df)

Expected output: the table extracted from the specified page.

James Z
user1958031
  • Any luck on this? I'm encountering the same error. – Shawn Schreier Sep 24 '20 at 13:24
  • No luck, Shawn. It was a PDF format error, not a code error. I coordinated with the developer team that maintains the Tabula library, and they told me the PDF was slightly corrupted, so tabula was unable to process it. – user1958031 Oct 13 '20 at 09:43
  • I'm getting the same error on a Raspberry Pi setup. However, when I process the same file through Mac terminal, I don't get the error. Both environments are running tabula version 2.2.0. With respect, I'm not convinced this is a PDF format error as stated by user1958031, because the same version of tabula worked for me in one environment and failed in another while processing the same file. The PDF format was the same in both scenarios. – Bryton Beesley Nov 09 '20 at 19:17
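
One way to narrow down whether the PDF or the environment is at fault is to look at what Java itself prints, since the traceback only reports the exit status. The following is a minimal sketch under that assumption: `run_and_capture` is a hypothetical helper (not part of tabula-py), and the demo command is only a stand-in for the `['java', ..., '-jar', ...]` list shown in the traceback.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command list and return (exit_code, stdout, stderr) without raising."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output to str; available on Python 3.7
    )
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in for the failing java call; replace it with the command list
    # copied from the CalledProcessError message.
    demo_cmd = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('parse error'); sys.exit(1)",
    ]
    code, out, err = run_and_capture(demo_cmd)
    print("exit status:", code)
    print("stderr:", err)
```

Running the same command list in both environments and comparing the stderr text may show whether the failure is a PDF parsing error or something environment-specific such as memory or the Java runtime.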

1 Answer

My PDF contains this font descriptor object:

17 0 obj
<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox
  [-497 -306 1120 1023] /FontFile2 18 0 R /FontName
  /AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type
  /FontDescriptor /XHeight 441>>
endobj

According to the PDF specification, the ItalicAngle value must be a number, and -17.-21823 is not a valid number representation. PDF parsers that don't repair such errors under the hood will therefore most likely fail to read your file. PDFBox, which tabula-java is built on, does fail.
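
To check other files for the same defect, a rough sketch like the one below could scan the raw bytes for `/ItalicAngle` entries whose value is not a well-formed PDF number (optional sign, digits, at most one decimal point). The function names are hypothetical, the sample is the font descriptor quoted above, and the scan only sees objects stored uncompressed; descriptors inside compressed object streams would be missed.

```python
import re

# PDF numeric syntax: optional sign, then digits with at most one decimal point
# (for example 34.5, -3.62, 4., -.002).
PDF_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")

# Capture the token that follows /ItalicAngle, up to whitespace or a PDF delimiter.
ITALIC_ANGLE = re.compile(rb"/ItalicAngle\s*([^\s/<>\[\]()]+)")

# The font descriptor from the answer, as raw bytes.
SAMPLE = (
    b"17 0 obj\n"
    b"<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox\n"
    b"  [-497 -306 1120 1023] /FontFile2 18 0 R /FontName\n"
    b"  /AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type\n"
    b"  /FontDescriptor /XHeight 441>>\n"
    b"endobj\n"
)


def is_valid_pdf_number(token):
    """Return True if the whole token is a syntactically valid PDF number."""
    return PDF_NUMBER.fullmatch(token) is not None


def find_bad_italic_angles(data):
    """Return (byte_offset, token) pairs for malformed /ItalicAngle values."""
    return [
        (match.start(1), match.group(1))
        for match in ITALIC_ANGLE.finditer(data)
        if not is_valid_pdf_number(match.group(1))
    ]


if __name__ == "__main__":
    for offset, token in find_bad_italic_angles(SAMPLE):
        print("malformed ItalicAngle at byte", offset, ":", token.decode("ascii"))
```

For a real file, the same function could be applied to the result of `open(path, "rb").read()`.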

PS: This answer was provided by the tabula-java developer team.

user1958031
  • You could try to correct this with Notepad++. Do it in a way that the offsets don't change, e.g. replace `-17.-21823` with `-17.218230`. However, it's possible that your PDF has more problems. – Tilman Hausherr Oct 13 '20 at 11:47
  • Thank you for your advice. We will use your suggestion if we face the same situation again. For such PDFs, we performed the translation manually. – user1958031 Oct 14 '20 at 07:51
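
The same-length replacement suggested in the comment above can also be scripted instead of done by hand in an editor. This is only a sketch: `patch_same_length` and `patch_file` are hypothetical helpers, the file names in the usage note are placeholders, and it assumes the bad token appears exactly once in uncompressed form. Keeping the byte count identical matters because the PDF cross-reference table stores byte offsets of objects.

```python
def patch_same_length(data, bad, good):
    """Replace one occurrence of `bad` with `good` without changing the length."""
    if len(bad) != len(good):
        raise ValueError("replacement must have the same length to keep offsets intact")
    occurrences = data.count(bad)
    if occurrences != 1:
        raise ValueError("expected exactly one occurrence, found %d" % occurrences)
    return data.replace(bad, good)


def patch_file(src_path, dst_path, bad, good):
    """Read a PDF as bytes, patch it, and write the result to a new file."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(patch_same_length(data, bad, good))


if __name__ == "__main__":
    # In-memory demonstration using the fragment quoted in the answer.
    fragment = b"/ItalicAngle -17.-21823 /StemV 80"
    patched = patch_same_length(fragment, b"-17.-21823", b"-17.218230")
    print(patched.decode("ascii"))
```

With a real file this might be called as `patch_file("input.pdf", "input_fixed.pdf", b"-17.-21823", b"-17.218230")`, writing to a new file so the original stays untouched. As the comment notes, the PDF may have further problems that this does not address.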