Questions tagged [tabula-py]

tabula-py is a Python wrapper for tabula-java that lets you extract tables from a PDF into a pandas DataFrame or JSON. It can also convert the tables in a PDF into a CSV, TSV, or JSON file.

Install tabula-py using pip:

pip install tabula-py
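
As a rough illustration of the two uses described above (tables into DataFrames, and a PDF converted straight to a CSV file), here is a minimal sketch. It assumes tabula-py 2.x and a Java runtime are available; the helper names `extract_tables` and `pdf_to_csv` and the file names are hypothetical, chosen only for this example.

```python
from typing import Any, List


def extract_tables(pdf_path: str, pages: Any = "all") -> List[Any]:
    """Return the tables found in pdf_path as a list of pandas DataFrames."""
    # Imported lazily so this module still loads on machines where
    # tabula-py (or the Java runtime it depends on) is missing.
    import tabula

    # With the default multiple_tables=True, read_pdf returns a list
    # containing one DataFrame per detected table.
    return tabula.read_pdf(pdf_path, pages=pages)


def pdf_to_csv(pdf_path: str, csv_path: str, pages: Any = "all") -> None:
    """Write the tables found in pdf_path to csv_path without building DataFrames."""
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)


# Example usage (placeholder file names):
#   tables = extract_tables("statement.pdf", pages=1)
#   pdf_to_csv("statement.pdf", "statement.csv")
```

If `java` is not on the PATH, calls like these typically fail with a `JavaNotFoundError`, which is the error several of the questions below report.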
132 questions
2
votes
2 answers

Not detecting columns

I was parsing a bank statement using tabula-py in which the columns are separated by vertical margins but the rows are not separated, so I use stream mode. But if on any page there is no entry for a column, then tabula merges them as one for…
1
vote
0 answers

Java not found for Tabula-py with chaquopy inside an Android Studio Kotlin project

I am using Chaquopy and building an Android app in Kotlin using Python scripts. I am trying to use the tabula-py module; however, I am getting a Java-not-found exception in my Logcat: com.chaquo.python.PyException: JavaNotFoundError: `java` command is…
mranderson
  • 11
  • 1
1
vote
1 answer

Error when trying to concatenate lists with pd.concat

I have a folder with 100 PDF files. Page 1 of all PDF files contains a table that I am extracting. Then I am concatenating all the tables into a dataframe and writing it as a CSV file. However, I am getting an error while concatenating. import os import…
akang
  • 566
  • 2
  • 15
1
vote
2 answers

Importing rotated text from a PDF table such as with tabula-py in python

Is there a way to import rotated text from a PDF table, such as with tabula-py in Python? I realize I can just rename the column headers in this case, but I was wondering if there is a way to set a parameter for importing rotated text. I don't see…
windyvation
  • 497
  • 3
  • 13
1
vote
2 answers

How to use tabula in AWS Lambda to read PDF table

Hello, I get the following error while trying to use tabula to read a table in a PDF. I was aware of some of the difficulties (here) of using this package with AWS Lambda and tried to zip the tabula package via an EC2 (Ubuntu 20.02) and then add it as…
ttam10
  • 35
  • 5
1
vote
1 answer

tabula-py ignores NaN values and shifts table cell values into the wrong column

So I was experimenting a little bit with tabula for Python and ran into a strange exception. The first column of the table always stretches over 4 rows. So for the first 4 cells, which are stretched over multiple rows, tabula just assumes NaN for the…
1
vote
1 answer

Merging cells in the same column of the same df - Python

I am attempting to merge two cells together. The reason for this is due to the fact that every unit under 'Chassis' should be an alphanumeric (ABCD123456) however the PO provided occasionally shifts the last number to the next row (no other data on…
WinterT
  • 13
  • 3
1
vote
1 answer

tabula.errors.JavaNotFoundError error while using tabula in Google cloud function

For my application I am using the tabula package to convert a PDF to CSV. The cloud function I have written is in Python 3.7. I have listed the package in the requirements.txt file. But I am getting this error File…
Jasmine
  • 476
  • 3
  • 22
1
vote
1 answer

tabula-py can't read file when the Python script is called by Java

I'm working on a Java-based project. The Java program runs a command to call a Python script. The Python script uses tabula-py to read a PDF file and return the data. The Python script worked when I called it directly in the terminal…
Fong Tom
  • 87
  • 5
1
vote
1 answer

How to loop in tabula-py data format in python

I want to know how to extract a particular table column from a PDF file in Python. My code so far: import tabula.io as tb from tabula.io import read_pdf dfs = tb.read_pdf(pdf_path, pages='all') print (len(dfs)) [It displays 73] I am able to access…
user1107731
  • 357
  • 1
  • 2
  • 10
1
vote
2 answers

Unable to convert multiple PDF pages of a PDF File into a CSV using tabula

I have a PDF file whose first page has a different data format, while the rest of the pages share the same tabular format. I want to convert this multi-page PDF file into a CSV file using Python Tabula. The current code is able to convert PDF to…
linux01
  • 41
  • 2
  • 7
1
vote
0 answers

Unable to convert PDF to CSV using Tabula

I am getting a blank tab when I try converting a PDF file to CSV using Tabula. I want to convert a specific page of the PDF to .csv format. I am getting the following error: Got stderr: Oct 29, 2021 3:29:30 PM…
GS13
  • 11
  • 1
1
vote
0 answers

Unable to read pdf using tabula-py

I am trying to parse a pdf using tabula-py but I keep getting this error stack - CalledProcessError(1, ['java', '-Dfile.encoding=UTF8', '-jar',…
shekwo
  • 1,411
  • 1
  • 20
  • 50
1
vote
1 answer

Combine Consecutive Rows for given index values in Pandas DataFrame

I was extracting tables from a PDF with tabula-py. In a table where some rows span more than one line, tabula-py converts a single table row into multiple rows in the DataFrame. I'm giving a sample here. Serial No. Name Type …
carl
  • 603
  • 5
  • 17
1
vote
1 answer

Convert List to DataFrame | tabula-py | read_pdf_with_template()

Problem Statement: I'm using the Tabula app user interface to select the dimensions of a table in a PDF file and save them as a tabula-template in JSON format. The DataFrame in the Tabula app interface from extracting the table after selecting the table dimensions is…
Maqsud
  • 739
  • 3
  • 12
  • 35