Questions tagged [tabula-py]

tabula-py is a wrapper of tabula-java that allows you to extract tables from a PDF into a DataFrame or JSON using Python. You can also extract tables from a PDF into a CSV, TSV, or JSON file.

Install tabula-py using pip:

pip install tabula-py
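
A minimal usage sketch of the two entry points described above, assuming Java is installed and on the PATH (tabula-py shells out to tabula-java). The helper names `output_path_for` and `extract` are hypothetical and chosen for illustration only; `tabula.read_pdf` and `tabula.convert_into` are the library calls.

```python
from pathlib import Path


def output_path_for(pdf_path, output_format="csv"):
    """Hypothetical helper: swap the .pdf suffix for the output format."""
    return str(Path(pdf_path).with_suffix("." + output_format))


def extract(pdf_path):
    """Read all tables into DataFrames and also write them to a CSV file."""
    # Imported lazily so this module loads even where tabula-py is absent.
    import tabula

    # In recent tabula-py versions this returns a list of pandas DataFrames,
    # one per detected table.
    tables = tabula.read_pdf(pdf_path, pages="all")

    # convert_into writes the extracted tables straight to a file.
    out = output_path_for(pdf_path, "csv")
    tabula.convert_into(pdf_path, out, output_format="csv", pages="all")
    return tables, out
```

Passing `output_format="tsv"` or `output_format="json"` to `convert_into` should produce the other file types mentioned above.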
132 questions
0
votes
0 answers

How to extract a table from a PDF without manually tweaking the parameters?

I know the packages camelot and tabula-py and they can read tables from a PDF file. Problem is that each PDF file is different and therefore the parameter settings that work for one PDF file do not work for another PDF file. Since my preprocessing…
Ruthger Righart
  • 4,799
  • 2
  • 28
  • 33
0
votes
0 answers

Error from tabula-java: Error: Error: Header doesn't contain versioninfo

I have a script that parses PDF files. On my WSL it works perfectly, but when I deploy it on CentOS 7, I get this error. I'm using tabula-py; Python version: 3.6, Java version: 11. When I searched for the error, I found nothing. Can someone…
0
votes
0 answers

Export array of DataFrames to csv

I am trying to use tabula-py to extract data from a PDF and save it to a CSV. The PDF contains a work order. The data in the PDF is not formatted in a usable table, so I am required to use Stream mode. Through the Tabula web interface, I have created…
0
votes
0 answers

Tabula-py: any clever method to choose between lattice = False vs lattice = True?

I realised that sometimes with lattice = True, the result is better than lattice = False and vice-versa for others. Is there a clever way to select between the two options? For context: This shows that x1 is a better option than x2 but for bulk…
skw1990
  • 63
  • 6
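
One possible heuristic for the question above, offered only as a sketch: run both modes, score each result, and keep the better one. The score used here (share of non-empty cells) is an assumption, not something tabula-py provides, and it may well pick the wrong mode on some PDFs. The function names are hypothetical, and the code assumes the JSON output holds a `data` list of rows whose cells carry a `text` field.

```python
def fill_ratio(tables):
    """Share of non-empty cells across all tables in a JSON result."""
    cells = [
        cell["text"].strip()
        for table in tables
        for row in table["data"]
        for cell in row
    ]
    if not cells:
        return 0.0
    return sum(1 for text in cells if text) / len(cells)


def pick_mode(pdf_path, pages=1):
    """Try lattice and stream, return the higher-scoring mode and all scores."""
    # Imported lazily so the scoring helper can be used without tabula-py.
    import tabula

    scores = {}
    for mode in ("lattice", "stream"):
        tables = tabula.read_pdf(
            pdf_path, pages=pages, output_format="json", **{mode: True}
        )
        scores[mode] = fill_ratio(tables)
    return max(scores, key=scores.get), scores
```

For bulk processing, other signals (number of columns, number of detected tables) could be added to the score in the same way.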
0
votes
0 answers

Obtain the positions of tables in a PDF and plot the bounding boxes on the image

Using the following call, I can get the bounding boxes of the tables in my e-PDF: tabula.read_pdf(file, stream=True,guess=True,lattice=False,multiple_tables=True, output_format="json", pages=pg_num) However, I want to plot the bounding boxes detected…
skw1990
  • 63
  • 6
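
A sketch of the coordinate conversion this question needs, assuming each table object in the JSON output carries `top`, `left`, `width`, and `height` in PDF points (72 per inch) measured from the top-left corner of the page. The function name is hypothetical.

```python
def table_boxes_in_pixels(tables, dpi=72):
    """Convert table bounding boxes from PDF points to image pixels.

    Returns one (x0, y0, x1, y1) tuple per table, suitable for drawing
    on a page image rendered at the given dpi.
    """
    scale = dpi / 72.0  # PDF points are 1/72 inch
    boxes = []
    for table in tables:
        x0 = table["left"] * scale
        y0 = table["top"] * scale
        x1 = (table["left"] + table["width"]) * scale
        y1 = (table["top"] + table["height"]) * scale
        boxes.append((x0, y0, x1, y1))
    return boxes
```

The resulting tuples can be handed to any drawing routine that takes a rectangle as two corner points, provided the page image was rendered at the same `dpi`.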
0
votes
0 answers

How to return positions and data frames together in tabula.read_pdf?

How to return positions and data frames together in tabula.read_pdf? For one page, I have to run 2 lines of code (hence…
skw1990
  • 63
  • 6
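
A sketch of one way to get both in a single call: request JSON output, which (as far as the JSON layout is understood here) includes each table's position alongside its cell text, and split it afterwards. The function names are hypothetical, and the field names `top`, `left`, `width`, `height`, `data`, and `text` are assumed from tabula-java's JSON output.

```python
def split_table(table):
    """Return (bounding box, rows of cell text) for one JSON table object."""
    bbox = {key: table[key] for key in ("top", "left", "width", "height")}
    rows = [[cell["text"] for cell in row] for row in table["data"]]
    return bbox, rows


def read_positions_and_rows(pdf_path, pages=1):
    """Run tabula once and return a (bbox, rows) pair per detected table."""
    # Imported lazily so split_table can be used without tabula-py.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages=pages, output_format="json")
    return [split_table(table) for table in tables]
```

Each `rows` list can then be passed to `pandas.DataFrame(rows)` if a data frame is needed.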
0
votes
1 answer

Is there a way to read password protected PDFs with tabula-py?

I have password protected PDFs with some tables. (I have the passwords to them). Currently I'm using PDFminer.six to extract data from these PDFs to text but I want to use tabula-py instead to extract tables. Is there a way to do this?
MegaJas
  • 1
  • 1
0
votes
0 answers

tabula-py get total number of pages

I am using tabula-py to extract some text from a pdf. For my program I need to know the total number of pages. Is it possible to know this with tabula-py or do I need to use another module for this? If yes can you suggest the easiest method,…
aster94
  • 329
  • 1
  • 3
  • 13
0
votes
1 answer

Cannot read PDF Data into Sheets with Gspread-DataFrame

I want to read data from a PDF I downloaded using Tabula into Google Sheets, and when I transfer the data as it was read into Google Sheets, I get an error. I know the data I downloaded is dirty, but I wanted to clean it up in Google…
0
votes
0 answers

How to fix Unicode mapping error when using tabula-py

I am trying to extract a table from the following pdf file using tabula-py: link to pdf However, I encounter the following error: WARNING:tabula.io:Got stderr: Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode WARNING: No…
0
votes
0 answers

Collecting data from a PDF after seeing a certain keyword

I want to read the data in this table, but only the data that appears after "general information". Here is a picture of the data. I tried using tabula, but nothing I've tried has seemed to work.
0
votes
1 answer

LineBreak in a PDF table breaking tabula-py

I'm using tabula-py to extract a table from a PDF file. This kind of PDF (which I need to parse every month) has around 40 pages (but it varies). My code works just fine for the first 20 pages, which follow a nice standard. However, by page 30…
viniwata1
  • 31
  • 4
0
votes
1 answer

Gibberish table output in tabula-java for Japanese PDF but works in standalone Tabula

I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no…
Wah123
  • 1
  • 1
0
votes
0 answers

Lattice option not working for column header in tabula-py

I am using tabula-py to extract a table from a PDF, with the lattice option for parsing the file. It works well for all rows except the first one. Code: df = read_pdf("filename.pdf", pages=21, multiple_tables=True, lattice=True) Table in…
0
votes
1 answer

extracting data into columns using pdfplumber

I have a PDF with data in tabular format in 6 columns, but the columns are not separated by boundaries, so when I extract the data using pdfplumber, all the data ends up in one cell and I want it in separate cells. How could I do that? For…
arvin
  • 9
  • 4