Questions tagged [tabula]

Tabula is a Java library and command line tool for extracting tables from PDF documents.

Tabula lets you extract data from PDF files into a CSV or Microsoft Excel spreadsheet through a simple graphical user interface. It works on Mac, Windows, and Linux.

309 questions
2
votes
1 answer

looping through pdf files with tabulizer in python

I'm having a hard time getting a piece of code to work. I want to loop through PDF files in a folder, extract what the tabula package thinks the tables are, extract these to a dataframe, and write all the tables from a specific PDF into one csv…
CMorgan
  • 645
  • 2
  • 11
  • 33
2
votes
1 answer

Python: Error - tabula-py cannot read PDF

I cannot execute tabula-py's read_pdf function. It seems to be producing the following error message: WindowsError: [Error 2] The system cannot find the file specified With traceback: Traceback (most recent call last): File…
Riley Hun
  • 2,541
  • 5
  • 31
  • 77
2
votes
1 answer

How to rename unnamed columns in Pandas?

I have a PDF with a table in it, and I am trying to get that table into Pandas. Extracting PDF tables is notoriously difficult to get right, but I have found tabula works best. It is far and away the best I have seen, though still not perfect. I have…
lukehawk
  • 1,423
  • 3
  • 22
  • 48
2
votes
4 answers

Extracting Tables from PDFs Using Tabula

I came across a great library called Tabula and it almost did the trick. Unfortunately, there is a lot of useless area on the first page that I don't want Tabula to extract. According to the documentation, you can specify the page area you want to…
Riley Hun
  • 2,541
  • 5
  • 31
  • 77
1
vote
0 answers

Converting PDF Table from URL into a Pandas Dataframe?

Having issues converting PDF data into a dataframe depending on how the PDF is uploaded to the website. Hi all, Does anyone have any ideas on how to read an uploaded PDF's data into a pandas dataframe? I am having issues doing it with certain…
1
vote
0 answers

I'm getting a problem with DataFrame from pandas

I'm new to Python and I'm trying to convert the information in a PDF file to Excel. This is my code: import tabula from tabula.io import read_pdf import pandas as pd from pandas import DataFrame path = "C:/Users/Littl/OneDrive/Área de…
meaculapa
  • 11
  • 1
1
vote
0 answers

enhance Tabula for accurate text with layout extraction

I extracted all the text from a PDF using tabula and it works well, but my PDF has borderless tables, and in some rows only a single column is present, spanning the width of 3 columns, so tabula puts all the text into a single column. Let me explain with an example. I…
1
vote
0 answers

Tabula not converting all PDF pages to CSV in Python

I am trying to convert an entire directory of PDF files to CSV, but my code is only picking up the first page and I need to get all pages in a given PDF. # Convert pdf to csv path = 'mypath\*.pdf' for f in glob.glob(path): df =…
6114617
  • 79
  • 2
  • 7
1
vote
1 answer

Python Convert to CSV with encoding type

Someone helped me with a program to convert PDF files to CSV, but they didn't specify an encoding type. Here is the code: import os import glob import tabula path="/Users/username/Downloads/" for filepath in…
Kenny
  • 43
  • 5
1
vote
1 answer

Tabula-py reads column data as unicode

For a PDF from which I am trying to extract a table, tabula correctly identifies the table, but the table data is extracted as unicode rather than string data. from tabula import read_pdf df =…
zoof
  • 159
  • 8
1
vote
1 answer

Python extract text between two tables as title for the table(outside tables) from pdf with tabula

I am trying to extract tables from a PDF file. After trying multiple different packages, I found that tabula is the best one for extracting the tables from my PDF file correctly. The problem is that each table has a title above it (not…
user15410844
  • 61
  • 1
  • 7
1
vote
0 answers

tabula only reads first two rows

I am trying to scrape a table from an online PDF, but the indexing is not working properly. See table[0] and table[1]. What I want is a DataFrame with strings in all columns, so I can extract the ICD codes with regex. import tabula import pandas as…
1
vote
0 answers

Why is fresh-tabula-js not working with NestJS?

Getting errors while using fresh-tabula-js with NestJS. Express server import Tabula from "fresh-tabula-js"; const table = new Tabula("t5.pdf", { spreadsheet: true }); console.log(table.extractCsv().output); NestJS 1) code-1 import { Injectable…
1
vote
1 answer

reading pdf file using tabula

I have a PDF file with tables in it and would like to read it as a dataframe using tabula. But only the first page has a column header. When reading with tabula.read_pdf(pdf_file, pages='all', lattice = 'True'), the data comes in the desired format…
arvin
  • 9
  • 4
1
vote
0 answers

How to skip image-based pages in camelot?

I'm running a for loop over multiple PDFs with multiple pages to extract multiple tables. The problem is that when I run the for loop over multiple PDFs, if any PDF has image-based pages at page 1 or 2 and its tables start from page 2 or 3…
redox741
  • 21
  • 5