Questions tagged [tabula-py]

tabula-py is a Python wrapper for tabula-java that extracts tables from PDFs into a pandas DataFrame or JSON. It can also convert tables in a PDF into a CSV, TSV, or JSON file.

Installing tabula-py using pip:

pip install tabula-py
132 questions
0
votes
0 answers

How to iterate .pdf conversion in Python using Tabula

I'm new to Python and I have a problem; it would be great to get a solution from all of you here. I have a 23-page PDF file and I want to convert it to a separate .csv file for each page. How could I iterate over the pages in the file name using…
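
One way to approach the per-page conversion described in this question is to call tabula-py's `convert_into` once per page number. The following is a minimal sketch under stated assumptions, not a confirmed answer: the helper name `convert_each_page`, the `page_<n>.csv` naming scheme, and the injectable `convert` argument are hypothetical and chosen for illustration; the page count (23) comes from the question.

```python
from pathlib import Path


def convert_each_page(pdf_path, n_pages, out_dir=".", convert=None):
    """Write one CSV per PDF page and return the list of output paths.

    `convert` defaults to tabula.convert_into. It is injectable (an
    assumption of this sketch) so the loop can be tried without Java.
    """
    if convert is None:
        # Imported lazily: needs tabula-py and a Java runtime installed.
        import tabula
        convert = tabula.convert_into

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for page in range(1, n_pages + 1):  # tabula page numbers start at 1
        out_path = out_dir / f"page_{page}.csv"  # hypothetical naming scheme
        convert(pdf_path, str(out_path), output_format="csv", pages=page)
        written.append(str(out_path))
    return written
```

With a hypothetical file name, `convert_each_page("report.pdf", 23, "csv_out")` would attempt to write `csv_out/page_1.csv` through `csv_out/page_23.csv`.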
0
votes
0 answers

How to calculate the values for the 'columns' parameter in tabula-py

Can someone please explain where and how to use the 'columns' parameter in tabula-py? For reference (Read giving column information): https://nbviewer.org/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb
arvin
  • 9
  • 4
0
votes
0 answers

Not able to extract table using tabula properly

I tried to extract a readable table from the PDF, but Tabula was unable to extract it properly. I have tried the stream, lattice, and guess parameters, but none worked. Any suggestions on how I can…
Pravin
  • 241
  • 2
  • 14
0
votes
1 answer

extracting all tables using tabula

Reading a PDF file with df = tabula.read_pdf(pdf_file, pages='all') displays all tables from all pages, but converting it into a pandas DataFrame with tables = pd.DataFrame(pdf_file, pages='all', lattice='True')[0]) displays only…
arvin
  • 9
  • 4
0
votes
1 answer

Why does my Tabula template not output the data from a PDF file when run through Python?

I selected the area in the Tabula app, as shown below, and created a template. The output in the web app works, but when I do it via the code below I get the error "The output file is empty". Area selection Code import tabula df =…
Don Nalaka
  • 129
  • 1
  • 11
0
votes
0 answers

Extracting text from PDF file but the data is mixing up

I have a PDF linked here. I am trying to extract text from it as a block so I can keep track of every detail, but the data is mixed with the other columns of data. I tried PyPDF2, Tabula, and tika, but none of them gave me the right result. Tabula…
0
votes
1 answer

Unable to extract tables from tabula or Camelot

I tried to extract the table below using Tabula, but it returned a null DataFrame. It worked fine for other, similar tables. I also tried Camelot, but that didn't work either. Any suggestions about how I can extract…
Pravin
  • 241
  • 2
  • 14
0
votes
1 answer

Skip errors and continue loop when url provides no file

I am using Tabula-py to download and extract tables from PDFs via a list of URLs. The URLs are created based on rules and everything is working fine except when Tabula tries to process a PDF from a link with no page/file (specifically weekends as…
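
A common pattern for this kind of situation is to wrap each call in `try`/`except` and move on when a URL fails. The sketch below assumes that a missing file surfaces as an exception raised by `tabula.read_pdf`; the exact exception type depends on how the link fails, so the sketch catches `Exception` broadly and records the message. The helper name `read_available` and the injectable `reader` argument are hypothetical.

```python
def read_available(urls, reader=None, **read_kwargs):
    """Try every URL and return (results, failures).

    results maps url -> whatever the reader returned;
    failures maps url -> a short error description.
    """
    if reader is None:
        # Imported lazily: needs tabula-py and a Java runtime installed.
        import tabula
        reader = tabula.read_pdf

    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = reader(url, **read_kwargs)
        except Exception as exc:
            # Broad on purpose in this sketch; narrow it once the
            # actual error type for a missing file is known.
            failures[url] = f"{type(exc).__name__}: {exc}"
            continue
    return results, failures
```

Keeping the failures in a dictionary, rather than silently skipping them, makes it possible to check afterwards that only the expected dates (such as weekends) were missing.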
0
votes
1 answer

Python - Extract data inside a Rectangle Box from a PDF file to CSV file

I want to extract the data inside a rectangular box in a PDF file to a CSV file with corresponding columns and rows. I tried libraries such as Camelot, PyPDF2, and Tabula, but I couldn't get the desired outcome in a CSV file. Could anyone help me…
Mech_Saran
  • 157
  • 1
  • 2
  • 9
0
votes
1 answer

Tabula.read_pdf - IndexError: list index out of range

May I know why I get an IndexError when running the code below? import tabula df = tabula.read_pdf("123.pdf", pages='all')[0] IndexError: list index out of range
Test777
  • 23
  • 4
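
In recent tabula-py releases `tabula.read_pdf` returns a list of DataFrames by default, so indexing the result with `[0]` raises this error when the list is empty, that is, when no table was detected. Whether that is the cause here cannot be confirmed from the excerpt. A defensive sketch, where the helper names `first_table` and `read_first_table` and the injectable `reader` argument are hypothetical:

```python
def first_table(tables):
    """Return the first extracted table, or None if the list is empty."""
    return tables[0] if len(tables) > 0 else None


def read_first_table(pdf_path, reader=None, **read_kwargs):
    """Read a PDF and return its first table without risking IndexError."""
    if reader is None:
        # Imported lazily: needs tabula-py and a Java runtime installed.
        import tabula
        reader = tabula.read_pdf
    return first_table(reader(pdf_path, **read_kwargs))
```

If `read_first_table("123.pdf", pages="all")` returns `None`, it may be worth retrying with other `read_pdf` options such as `lattice=True` or `stream=True`, although this will not help if the PDF is a scanned image without a text layer.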
0
votes
0 answers

UnicodeDecodeError: 'utf-8' codec can't decode

I was trying to read a PDF using the tabula Python package, but I received a UnicodeDecodeError. I tried using chardet to find the encoding, but it returned None. from tabula import read_pdf from tabulate import tabulate df =…
0
votes
1 answer

Why is the data in the PDF written in the 1st column?

I have a PDF file called Question.pdf, and its content is as follows. Question.pdf I am converting my PDF file to an xlsx file using the Python tabula module. However, it writes all the data in the first column of my Excel file. How can I delete this…
Yunus Emre
  • 25
  • 4
0
votes
0 answers

Avoiding too many pandas dataframe to array conversions

I have a python script that parses through the appendix of a pdf and compares the found data elements to a json file, in order to figure out which elements we are missing. The end result is a pandas dataframe with all the information I then need to…
JoSSte
  • 2,953
  • 6
  • 34
  • 54
0
votes
0 answers

PDF in Russian Language to CSV file Using Python

I have a PDF file written in Russian. I am trying to convert the table in the PDF to a CSV file. I am able to create the CSV file, but it is encrypted. I have used this code in Python: import tabula df = tabula.read_pdf("IPLmatch.pdf",…
0
votes
0 answers

Tabula.py doesn't print content as expected. Multiline cell is given

I am trying to read a PDF with the tabula.py read_pdf() method and pandas. It works fine, except for multiline text fields like the one given below: Multiline textfield in PDF I'm expecting the following output after writing df to list: ['Gewürzmischung Zaatar',…
noxxer
  • 1
  • 2