Questions tagged [tabula]

Tabula is a Java library and command line tool for extracting tables from PDF documents.

Tabula lets you extract data from PDF files into a CSV file or Microsoft Excel spreadsheet through a simple graphical user interface. It runs on Mac, Windows, and Linux.
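Because Tabula also ships as a command line tool, the PDF-to-CSV extraction described above can be scripted. The following is a minimal sketch, not an official recipe: it only builds the argument list for tabula-java and does not run it. The helper name `build_tabula_command`, the jar location `tabula.jar`, and the file names are assumptions made for illustration; the `--pages`, `--format`, `--outfile`, `--lattice`, and `--stream` options are tabula-java command line flags.

```python
from typing import List


def build_tabula_command(
    pdf_path: str,
    out_csv: str,
    pages: str = "all",
    lattice: bool = False,
    jar_path: str = "tabula.jar",  # assumption: path to a downloaded tabula-java jar
) -> List[str]:
    """Build (but do not run) a tabula-java command that writes tables to CSV."""
    cmd = [
        "java", "-jar", jar_path,
        "--pages", pages,
        "--format", "CSV",
        "--outfile", out_csv,
    ]
    # --lattice suits tables drawn with ruling lines;
    # --stream suits tables separated only by whitespace.
    cmd.append("--lattice" if lattice else "--stream")
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    print(" ".join(build_tabula_command("report.pdf", "report.csv", pages="1")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java and the jar are available.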

309 questions
1
vote
0 answers

How to resolve a problem using Tabula?

I'm a data analyst and, first of all, would like to thank you and your friends for the wonderful tool Tabula. I have been using it periodically over recent months and quite actively during the past week. Then, suddenly, the tool stopped working. I even…
Karine
1
vote
0 answers

Camelot Table Extraction Error (PdfReadWarning: incorrect startxref pointer(0) [_reader.py:938])

I am trying to extract some tables from a .pdf doc but I got an error: "PdfReadWarning: incorrect startxref pointer(0) [_reader.py:938]" The code is pretty simple because I am just testing: import camelot file = r"myPCPath\myFile.pdf" tables =…
1
vote
1 answer

Tabula font error in reading table from PDF

I saw a lot of people had similar issues, but not this one. And many of the similar issues do not have an applicable solution, unfortunately. I am getting this warning from tabula. And when I look at the result or test the length of what it…
ralbhar
1
vote
1 answer

tabula-py ignores NaN values and shifts table cell values into the wrong column

So I was experimenting a little bit with tabula for Python and got a strange result. The first column of the table always stretches over 4 rows. So for the first 4 cells, which are stretched over multiple rows, tabula just assumes NaN for the…
1
vote
1 answer

Using multirow and multicolumn in a table in Overleaf

I am trying to make a table where the first column is multiple columns (2 columns) and also multiple rows (2 rows). The error is on the first column (Aspects). How to make it…
MK Huda
1
vote
1 answer

Error in tabula-py when specifying the area parameter

I am getting an error when I specify the area in the following code: data = tb.read_pdf(pdf_file, guess=False, stream=True, pandas_options ={'header': None}, encoding="utf-8", multiple_tables =False, area = [136,10,10,10], pages ='1', columns =…
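A note on the `area` values in that excerpt: tabula-py documents `area` as (top, left, bottom, right). Read that way, `[136,10,10,10]` has a bottom edge above its top edge and a right edge equal to its left edge, so the box is empty. The excerpt does not confirm that this causes the error; it is only one thing worth checking. The sketch below uses a hypothetical helper, `check_area`, which is not part of tabula-py, to catch such boxes before calling `read_pdf`.

```python
from typing import Sequence


def check_area(area: Sequence[float]) -> None:
    """Raise ValueError unless area is a non-empty (top, left, bottom, right) box."""
    if len(area) != 4:
        raise ValueError("area needs four values: top, left, bottom, right")
    top, left, bottom, right = area
    # Coordinates grow downward and to the right, so a usable box
    # needs bottom > top and right > left.
    if bottom <= top or right <= left:
        raise ValueError(
            f"empty area: top={top}, left={left}, bottom={bottom}, right={right}"
        )


if __name__ == "__main__":
    check_area([10, 10, 136, 500])  # hypothetical values that form a valid box
    try:
        check_area([136, 10, 10, 10])  # the values quoted in the excerpt
    except ValueError as exc:
        print(exc)
```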
1
vote
0 answers

Reading Tables from PDFs in S3 bucket using Camelot or Tabula packages: s3 URL

Can Python packages that pull tables from PDFs, such as Tabula and Camelot, read a PDF directly from an S3 bucket, the way pandas can? For example, I can read a CSV file from the S3 bucket like this: df =…
1
vote
1 answer

Tabula: remove line breaks when extracting a table from a PDF

I have a table with wrapped text in a PDF file. I used tabula to extract the table from the PDF file: file1 = "path_to_pdf_file" table = tabula.read_pdf(file1,pages=1,lattice=True) table[0] However, the end result looks like this: Is there a way to…
user11666514
1
vote
2 answers

Using Tabula to pull tables out of a PDF

We have standard reports uploaded as PDFs on a daily basis. In the PDFs are some tables that we want to pull into datasets. I have tabula imported in code repositories but I can't seem to get code repositories to bring in the PDF. I receive this…
Connor
1
vote
0 answers

Tabula Java Heap Error — only 1 page to convert

I want to extract tables from a 1-page PDF (50 KB) using Tabula, but it returns this error: 2022-01-08 17:33:25.054:INFO:oejsh.ContextHandler:main: Started…
1
vote
1 answer

tabula-py can't read the file when the Python script is called by Java

I'm working on a Java-based project. The Java program runs a command to call a Python script, which uses tabula-py to read a PDF file and return the data. The Python script worked when I called it directly in a terminal…
Fong Tom
1
vote
0 answers

Python pandas df - Columns must be same length as key

I have a dataframe I created by scraping this PDF with tabula. I'm trying to create a point column using geocoder, but I keep getting a "Columns must be same length as key" error. My code and a link to the PDF are below: PDF:…
Adam
1
vote
1 answer

Error with tabula in python regarding dependency (colab and locally)

I am working on extracting data from a number of PDF documents in Python, testing in Colab. A solution that works on Colab would be great, but a local one would also do if that is not possible. There are a lot of interesting entries per page, so I chose tabula. Code works…
1
vote
0 answers

Unable to read pdf using tabula-py

I am trying to parse a pdf using tabula-py but I keep getting this error stack - CalledProcessError(1, ['java', '-Dfile.encoding=UTF8', '-jar',…
shekwo
1
vote
1 answer

Tabula-py doesn't recognise columns correctly

I am trying to recognise a PDF document using tabula. I use this code: df = tabula.read_pdf(io.BytesIO(content), pages=12,pandas_options={'header': None}, multiple_tables = True,columns=(78.39, 226.97, 280.97,370.04,461.02,550.06)) However, after…