Questions tagged [tabula-py]

tabula-py is a Python wrapper around tabula-java that extracts tables from PDF files into a pandas DataFrame or JSON. It can also convert the tables in a PDF into a CSV, TSV, or JSON file.

Installing tabula-py using pip:

pip install tabula-py
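
A minimal usage sketch, assuming tabula-py and a Java runtime are both installed. `extract_tables` is a hypothetical helper name, and the file names in the usage comment are placeholders. `tabula.read_pdf` returns the extracted tables as DataFrames, and `tabula.convert_into` writes them to a file.

```python
def extract_tables(pdf_path, csv_path=None, pages="all"):
    """Return the tables found in pdf_path; optionally also write them to csv_path."""
    # Imported lazily so this sketch can be loaded without tabula-py installed.
    import tabula

    # With multiple_tables=True, read_pdf returns a list of pandas DataFrames,
    # one per detected table.
    tables = tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)

    if csv_path is not None:
        # convert_into writes the extracted tables straight to a file.
        tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return tables


# Example call (placeholder file names):
# tables = extract_tables("report.pdf", csv_path="report.csv")
```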
132 questions
1
vote
0 answers

Tabula-py Not reading the full data of file

I was trying to read a table from a PDF file using the tabula read_pdf() method, but it does not read the complete table: some rows are missing. I was trying the code below: tables = tabula.read_pdf(f, …
1
vote
0 answers

Python3: tabula-py imports several strings with random whitespaces

I'm not sure if this behaviour is normal, but there is some inconsistency when reading the PDF. A one-liner: pdf = tabula.read_pdf(path, pages=pages), where path points to the PDF file. When printing the pdf in the console some values like…
user13581602
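
If the stray whitespace is extra spaces, tabs, or line breaks inside the extracted cells (an assumption, since the excerpt is cut off before the examples), one possible cleanup is to collapse whitespace runs after extraction. `normalize_whitespace` is a hypothetical helper name, and the sample cells are made up for illustration.

```python
def normalize_whitespace(value):
    """Collapse runs of whitespace in strings; return other values unchanged."""
    if isinstance(value, str):
        # str.split() with no argument splits on any whitespace run
        # (spaces, tabs, carriage returns) and drops leading/trailing runs.
        return " ".join(value.split())
    return value


# Made-up cell values standing in for extracted table cells.
cells = ["  Total \r amount ", "ACME\t Corp", 42]
cleaned = [normalize_whitespace(cell) for cell in cells]
print(cleaned)
```

On a DataFrame, the same helper could be applied to a string column with `df[col].map(normalize_whitespace)`.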
1
vote
1 answer

Export PDF to csv using python (tabula)

When exporting a PDF file to csv, it returns an error: writeheader() takes 1 positional argument but 2 were given. from tabula import read_pdf from tabulate import tabulate import csv df = read_pdf("asd.pdf") print(df) with open('ddd.csv', "w",…
tody22
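
The quoted error is what `csv.DictWriter.writeheader()` raises when it is called with an argument. The excerpt is truncated, so the following is only a sketch of the usual pattern, with made-up rows: the column names go to the `DictWriter` constructor, and `writeheader()` is called with no arguments.

```python
import csv
import io

# Made-up rows standing in for the extracted table.
rows = [
    {"item": "widget", "qty": 2},
    {"item": "gadget", "qty": 5},
]

# StringIO stands in for open("ddd.csv", "w", newline="").
buffer = io.StringIO()

# Column names are passed to the constructor, not to writeheader().
writer = csv.DictWriter(buffer, fieldnames=["item", "qty"])
writer.writeheader()  # takes no arguments
writer.writerows(rows)

print(buffer.getvalue())
```

If the data is already a single DataFrame, `df.to_csv("ddd.csv", index=False)` writes it without going through the `csv` module at all.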
1
vote
0 answers

Tabula-py not extracting Rows correctly

I'm extracting PDF tables using tabula-py. It extracts all the rows but does not split them correctly. I took the sample PDF below and tried extraction with the code below: import tabula import json import pandas as pd path = "/GST_OCR input…
Nag Arjun
1
vote
0 answers

How to convert PDF to excel using tabula-py into dataframe of several tables?

I have a PDF file that contains several tables, for example: Table from PDF File. I learned that I have to use tabula-py, which uses Java (note: I'm working in a Jupyter Notebook). So I wrote this: import pandas as pd import numpy as np import…
Maria Fernanda
1
vote
1 answer

Python Converting a List into an Array

I have a list that is 5 rows by 5 columns. I am trying to convert this list into a dataframe. When I try to do so, it only grabs the first row. This failed because I had it set to 5,5: df2 =…
1
vote
1 answer

Python Tabula Library - Output File Is Empty

I am using the Tabula module in Python. I am trying to output text from a PDF. I am using this code: pdf_read = tabula.read_pdf( input_path = "Test File.pdf", pages = start_page_number, guess=False, …
1
vote
0 answers

Language PDF: How to add the example sentences to source word and add to CSV

First of all, I’m new to Python, so please bear with me. I have a PDF file with Spanish vocabulary on the left and the German translation on the right. Sometimes there are also a few example sentences to show how the sentence is used. Here’s how the…
orejoorejo
1
vote
3 answers

Exception: JavaNotFoundError When Running Tabula-py in a Python Azure function app

I am extracting data from a PDF using a blob-trigger Python Azure function app, and I am getting the following error when using tabula-py. I was able to run it locally without issues; however, when I deploy the function I am getting the following…
SantiASC
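
tabula-py runs tabula-java, so a Java runtime has to be available wherever the code is deployed, not only on the local machine. A minimal check, assuming the runtime is expected to be reachable through PATH (`find_java` is a hypothetical helper name):

```python
import shutil


def find_java():
    """Return the path of the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("java not found on PATH; tabula-py cannot start tabula-java")
else:
    print(f"java found at {java_path}")
```

Logging this at the start of the deployed function shows whether the hosting environment has Java at all before any PDF is read.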
1
vote
1 answer

How do I get which page is the table extracted from using tabula-py?

I am currently using tabula.read_pdf() to extract tables from a pdf. However, there is no information about which page each table comes from. One way is to get the total number of pages and iterate over each page by passing in the pages argument for…
Stanley Gan
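
The page-by-page approach mentioned in the question can be sketched as follows. `tables_by_page` is a hypothetical helper; it takes the reader as a callable so the example runs without a PDF, and in real use the callable would wrap `tabula.read_pdf` with a single page number.

```python
def tables_by_page(read_page, page_numbers):
    """Map each page number to the list of tables read from that page.

    read_page is any callable that takes a page number and returns a list
    of tables, e.g.
    lambda p: tabula.read_pdf(path, pages=p, multiple_tables=True)
    """
    result = {}
    for page in page_numbers:
        tables = read_page(page)
        if tables:  # skip pages where nothing was detected
            result[page] = list(tables)
    return result


# Stand-in reader so the sketch runs without a PDF: page 2 has no tables.
fake_pages = {1: ["table A"], 2: [], 3: ["table B", "table C"]}
found = tables_by_page(lambda p: fake_pages[p], range(1, 4))
print(found)
```

The trade-off is one tabula-java call per page, which can be slow on long documents.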
1
vote
2 answers

Accessing indexes in a list

I am using tabula-py to extract a table from a pdf document like this: rows = tabula.read_pdf('bank_statement.pdf', pandas_options={"header":[0, 1, 2, 3, 4, 5]}, pages='all', stream=True, lattice=True) rows This gives an output like so: [ …
shekwo
1
vote
1 answer

Tabula-py returns '...' on one specific column in df. Everything else seems to work

Expected behavior: Read PDF, extract all table data into pandas df. Actual behavior: Reads PDF fine, extracts most table data and saves it to a debugging.txt with fp.write(df). One column (names) usually only returns '...' when I view the…
stygarfield
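
One possible cause (an assumption, since the excerpt is truncated) is that the data is intact and only the printed representation of the DataFrame is shortened. In recent pandas versions, `to_string()` and `to_csv()` return the full cell contents as text; the frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame with one cell longer than the default display width.
long_name = "Alexander Bartholomew Fitzgerald-Montgomery von Hohenzollern III"
df = pd.DataFrame({"id": [1], "name": [long_name]})

# to_string() renders the whole frame as text, without the shortening
# that the console display applies to long cells.
full_text = df.to_string(index=False)

# to_csv() with no path returns the CSV as a string, also unshortened.
csv_text = df.to_csv(index=False)

print(full_text)
```

For a debugging file, `fp.write(df.to_string())` would then write the complete values.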
1
vote
1 answer

AWS Lambda OSError(30, 'Read-only file system')

I am trying to run tabula-py on AWS Lambda in a Python 3.7 environment. The code is quite straightforward: import tabula def main(event, context): try: print(event['Url']) df = tabula.read_pdf(event['Url']) …
Sukhi
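
On AWS Lambda only the temporary directory (`/tmp`) is writable, so a likely cause (an assumption, since the traceback is not shown) is that the remote PDF gets saved somewhere read-only. A sketch of one workaround: download the file into the temp directory first and pass the local path on. `download_to_tmp` and its default file name are hypothetical.

```python
import os
import tempfile
import urllib.request


def download_to_tmp(url, filename="input.pdf"):
    """Download url into the system temp directory and return the local path."""
    local_path = os.path.join(tempfile.gettempdir(), filename)
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        out.write(response.read())
    return local_path


# Inside the handler this would replace passing the URL directly:
# df = tabula.read_pdf(download_to_tmp(event['Url']))
```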
1
vote
1 answer

Python tabula-py cannot import name wrapper

Here is my code: from tabula import wrapper df = wrapper.read_pdf('singapore.pdf') But it gives the following error: ImportError: cannot import name 'wrapper' I tried it on Ubuntu and it works fine there, but on Windows I am unable to use this code,…
Muhammad Hassan
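
An ImportError like this often means Python is picking up something other than tabula-py under the name `tabula`, for example a different package installed as `tabula` or a local file called `tabula.py` (both are assumptions here, since the excerpt does not show the environment). A small diagnostic sketch, with `module_origin` as a hypothetical helper name:

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Shows which file `import tabula` would load, or None if nothing is found.
print(module_origin("tabula"))
```

If the printed path is not inside the tabula-py installation, that is the module to rename or uninstall. Calling `tabula.read_pdf(...)` directly, as other questions on this page do, also avoids depending on the `wrapper` submodule.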
1
vote
1 answer

data missing while reading pdf file using tabula and python

I have a PDF with text and several tables, and one row contains content like the below: PDF content: Id: 5647484848 Name Alex J Now I am using tabula-py for parsing the content, but the result is missing something (meaning you can see the first character or number…
Agustus