Questions tagged [tabula]

Tabula is a Java library and command line tool for extracting tables from PDF documents.

Tabula allows you to extract data from PDF files into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use graphical user interface. It works on Mac, Windows and Linux.

309 questions
4
votes
1 answer

Extract text from PDF documents and generate structured data

I can extract the text from all pages of a PDF successfully, but I am unable to turn it into structured data. Please guide me if anyone has experience with this. Code: package pdfboxreadfromfile; import java.awt.geom.Rectangle2D; import…
Leace
  • 262
  • 1
  • 7
  • 24
4
votes
2 answers

tabula vs camelot for table extraction from PDF

I need to extract tables from PDFs. These tables can be of any type: multiple headers, vertical headers, horizontal headers, etc. I have implemented the basic use cases for both and found tabula doing a bit better than camelot, but still not able to detect…
Niranjan Kumar
  • 1,438
  • 1
  • 12
  • 29
4
votes
1 answer

tabula python: Getting subprocess.CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', ERROR

I am getting subprocess.CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', error while running the tabula python library. Command: df = tabula.read_pdf(filepath, pages = 5 ,guess=True, multiple_tables= True, stream=True,…
user1958031
  • 70
  • 1
  • 8
4
votes
3 answers

How to scrape PDFs using Python; specific content only

I am trying to get data from PDFs available on the site https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en. For example, if I look at November 2019…
Camilia
  • 61
  • 1
  • 1
  • 2
4
votes
0 answers

How to switch table area coordinates in Python Camelot and Tabula-Py

I have obtained the coordinates of a table bounding box using Camelot, but I need to use tabula-py to extract the table data, as camelot is only extracting the first line in each table cell, even in lattice mode. I have noticed that when defining…
John
  • 81
  • 2
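
A possible starting point for the coordinate question above, as a minimal sketch rather than a confirmed answer. It assumes Camelot reports a table area as x1, y1, x2, y2 in PDF coordinate space (origin at the bottom-left of the page, with (x1, y1) the top-left corner and (x2, y2) the bottom-right corner), and that tabula-py's area option expects top, left, bottom, right in points measured from the top-left of the page. The function name camelot_to_tabula_area and the page_height argument are hypothetical; the page height in points has to come from the PDF itself (792 is used below only because it is the height of a US Letter page).

```python
def camelot_to_tabula_area(x1, y1, x2, y2, page_height):
    """Convert a Camelot-style area to a tabula-style area.

    Assumption: (x1, y1) is the top-left corner and (x2, y2) the
    bottom-right corner, both measured from the bottom-left of the page.
    The result is [top, left, bottom, right] measured from the top-left.
    """
    top = page_height - y1      # flip the vertical axis
    bottom = page_height - y2
    left, right = x1, x2        # the horizontal axis is unchanged
    return [top, left, bottom, right]


# Example with an assumed US Letter page (792 points high).
area = camelot_to_tabula_area(50, 700, 550, 100, page_height=792)
print(area)
```

The resulting list could then be passed as the area argument of tabula.read_pdf. Whether the extracted cells improve still depends on the PDF.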
4
votes
2 answers

How can tabula (JAR) be called from Java?

Tabula looks like a great tool for extracting tabular data from PDFs. There are plenty of examples of how to call it from the command line or use it in Python but there doesn't seem to be any documentation for use in Java. Does anyone have a…
emd
  • 75
  • 2
  • 8
4
votes
0 answers

Read special characters and fonts from PDF using Python

I have a PDF in which certain table rows contain special characters and fonts. Is there any way to read those properly? from tabula import read_pdf df = read_pdf("Tables PDF.pdf", pages = '5', lattice = True, multiple_tables = True,…
PratikSharma
  • 321
  • 2
  • 17
4
votes
0 answers

Downloading a temporary file in Heroku and then reading it

I'm trying to download a PDF from a site and then read it, all in a single Python script running on a single worker dyno in Heroku. However, my script requires that the file be temporarily stored in the ephemeral filesystem in order to be read. From the…
Kyap
  • 91
  • 1
  • 9
3
votes
1 answer

Reading Tables as string from PDF with Tabula

I am using tabula-py 2.0.4, pandas 1.17.4 on Python 3.7. I am trying to read PDF tables into a dataframe with tabula.read_pdf: from tabula import read_pdf fn = "file.pdf" print(read_pdf(fn, pages='all', multiple_tables=True)[0]) The problem is that the…
Klemz
  • 123
  • 2
  • 13
3
votes
2 answers

How to fix this error on tabula.read_pdf() function in Python

I am trying to extract tables from a PDF file using Python (PyCharm). I tried the following code: from tabula import wrapper object = wrapper.read_pdf("C:/Users/Ojasvi/Desktop/sample.pdf") However, the error I got…
Ojasvi Jain
  • 79
  • 1
  • 2
  • 5
3
votes
2 answers

Shifting part of a row in a Dataframe to the right?

The dataframe in question is reading in from a pdf file using Tabula and getting some columns in the wrong places. It looks something like this: Index Name Date Time Exp QT Comm Load Notes 0 VT1 04/16 4:00 Glen 1600 Wheat…
Edward Gorelik
  • 179
  • 1
  • 9
3
votes
2 answers

Tabula-py omitting pages from a PDF document I am trying to extract

I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely. The omissions seem to be random and don't follow any visible visual…
Sannita
  • 131
  • 1
  • 4
3
votes
1 answer

How to make page range in tabula-py?

In Python 3, I have a PDF file "Ativos_Fevereiro_2018_servidores.pdf" with 6,041 pages. I'm on a machine with Ubuntu. The file is here: https://drive.google.com/file/d/1P8kF0gUOVls6sOGed4R0C2PlVF5RFtU6/view?usp=sharing On each page there is text at…
Reinaldo Chaves
  • 965
  • 4
  • 16
  • 43
3
votes
1 answer

Tabula-py font not implemented error

The PDF file content is Chinese (characters, not pictures and so on), so it may use different fonts. My code: >>> import tabula >>> df = tabula.read_pdf('/data/proj/smartinvestment/cninfo_download_reports/pdf/601101/2016-12-29/1202969937.PDF',…
Mark
  • 31
  • 2
2
votes
0 answers

Tabula-py not extracting tables correctly

I was building an API that uses tabula to extract tables from a PDF. I built the API on a Windows machine and deployed it on Ubuntu 20. On the Windows machine the extraction was flawless, and I was able to perform all the necessary steps. However,…
abhi
  • 337
  • 1
  • 3
  • 12