I installed both the tabula-py library and Java to scrape tables from PDFs. I ran the simple code below on a sample PDF I found online:

from tabula import read_pdf

path = "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf"

table = read_pdf(path,pages=1) 
print(table[0])

I got the following error(s):

Got stderr: Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=

Traceback (most recent call last):
  File "/Users/default/Desktop/Schedule Data/Extraction.py", line 21, in <module>
    tables = tabula.read_pdf('Brunswick Student Proof 1.pdf',pages = [14,20])
  File "/Users/default/Library/Python/3.9/lib/python/site-packages/tabula/io.py", line 440, in read_pdf
    raw_json: List[Any] = json.loads(output.decode(encoding))
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I have searched for a fix for both the CodeCache warning and the JSON decoder error, but the answers I found have not been very helpful. Is the issue on the Java side, in the tabula library, or both?

Abra
RIPPLR
  • JSON isn't Java, it's JavaScript and they're unrelated. This has nothing to do with Java _or_ JS. Your file can't be parsed. I'd wager that it's jsonlines – roganjosh Aug 06 '23 at 14:30
  • How are you using Java with Python? It doesn't make sense as described. – aled Aug 06 '23 at 17:21
  • Are you using Mac? I am facing exactly same issue. Or this is because of some latest version of JAVA I feel – Sahil Doshi Aug 12 '23 at 14:59
  • @SahilDoshi Yes I am using M1 mac. Tabula works perfectly fine on my Windows computer, but I run into a bunch of issues on mac. – RIPPLR Aug 12 '23 at 22:30
  • Gut feel says it should be something related to Java "Code Cache is full " error – Sahil Doshi Aug 13 '23 at 08:25
  • The Java warning is, well, just a warning emitted by the JIT compiler. It has nothing to do with the Python error. – Olivier Aug 13 '23 at 13:42
  • @SahilDoshi Are you experiencing the issue with all PDFs or only some? – Olivier Aug 13 '23 at 13:48
  • The bounty attracted a [ChatGPT](https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned) plagiariser. – Peter Mortensen Aug 21 '23 at 08:55

2 Answers

tabula-py is a wrapper around the tabula-java library (https://github.com/tabulapdf/tabula-java), so your Python code invokes a JVM. The "CodeCache is full" message you are seeing comes from that JVM, so try configuring the JVM options.

API Ref:

https://tabula-py.readthedocs.io/en/latest/tabula.html

Here is some example code; you may need to find the appropriate Java options to make it run on your machine.

import tabula

# Define the JVM options as a list of strings
jvm_options = ["-XX:ReservedCodeCacheSize=128m"]

# Pass the JVM options when calling read_pdf
tables = tabula.read_pdf('your_pdf_file.pdf', pages='all', java_options=jvm_options)

for table in tables:
    print(table)
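
Since the right cache size may differ per machine, one way to experiment is to retry with progressively larger values. The following is only a sketch: the helper name `read_with_code_cache_retry` and the list of sizes are made up for illustration, and it assumes the failure shows up as a `json.JSONDecodeError`, as in the traceback in the question. The reader function is passed in as an argument so the helper does not depend on how tabula is imported.

```python
import json


def read_with_code_cache_retry(path, read_pdf, sizes=("128m", "256m", "512m"), **kwargs):
    """Call read_pdf with increasing -XX:ReservedCodeCacheSize values.

    Hypothetical helper: `read_pdf` is expected to accept a `java_options`
    keyword (tabula.read_pdf does). The sizes are arbitrary guesses.
    """
    last_error = None
    for size in sizes:
        java_options = [f"-XX:ReservedCodeCacheSize={size}"]
        try:
            return read_pdf(path, java_options=java_options, **kwargs)
        except json.JSONDecodeError as err:
            # Same error as in the question: output could not be parsed as JSON.
            last_error = err
    raise RuntimeError("read_pdf failed for every code cache size tried") from last_error


# Usage with tabula-py (requires tabula-py and Java to be installed):
#   from tabula import read_pdf
#   tables = read_with_code_cache_retry("your_pdf_file.pdf", read_pdf, pages="all")
```

If every size fails, the problem is probably not the code cache itself, and the last `JSONDecodeError` is kept as the cause for inspection.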
divyang4481

Way too late to be of help now, but for others who might get stuck here, this is what helped me with this exact same issue:

It seems the "CodeCache is full" warning gets prepended as a log line to the JSON string that tabula-py is trying to read, which then breaks the json.loads(output) call. My solution was to strip this leading log message manually using string functions.

So, in ../site-packages/tabula/io.py, in the function read_pdf(), change the line causing the error (line 440 in your case) to:

raw_json: List[Any] = json.loads(output.decode(encoding).split('\n')[-1])
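
Note that `split('\n')[-1]` returns an empty string if the output happens to end with a newline. A slightly more defensive variant of the same idea might look like the sketch below. The function name `parse_json_after_log_lines` is hypothetical, and the approach assumes that any log lines come before the JSON payload, as described above; it has not been checked against tabula-java's actual output.

```python
import json
from typing import Any, List


def parse_json_after_log_lines(raw: bytes, encoding: str = "utf-8") -> List[Any]:
    """Skip leading non-JSON lines (e.g. JVM warnings) and parse the rest.

    Hypothetical helper: tries to parse the text starting at each line in
    turn and returns the first suffix that is valid JSON.
    """
    lines = [line for line in raw.decode(encoding).splitlines() if line.strip()]
    for start in range(len(lines)):
        candidate = "\n".join(lines[start:])
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue  # this line is still part of the log prefix
    raise ValueError("no JSON payload found in output")


sample = b'Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full.\n[{"a": 1}]\n'
print(parse_json_after_log_lines(sample))
```

Inside io.py this would replace the `json.loads(output.decode(encoding))` call, with the same caveat as the one-line change: edits to site-packages are lost when the package is reinstalled or upgraded.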
John.Ludlum