Questions tagged [python-camelot]

Camelot is a Python library that makes it easy for anyone to extract tabular data from PDF files.

Official web site

Why Camelot?

  • You are in control. Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel and HTML.
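
As a rough illustration of discarding bad tables by metric: the sketch below assumes each extracted table exposes a `parsing_report` dict with `accuracy` and `whitespace` keys, as camelot-py `Table` objects do. The helper name `keep_good_tables` and both thresholds are made up for this example, and the stand-in objects exist only so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return only the tables whose parsing report passes both thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report  # dict of per-table quality metrics
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-ins for camelot tables (hypothetical values) so the sketch runs without a PDF.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.1, "whitespace": 12.0}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 70.3}),
]
good = keep_good_tables(fake_tables)
print(len(good))
```

With a real file (the name `foo.pdf` is hypothetical), the input would come from `camelot.read_pdf("foo.pdf")`; each surviving table then offers its data as `table.df`, and the full result set can be written out with `tables.export("foo.json", f="json")`.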

See comparison with other PDF table extraction libraries and tools.

197 questions
0
votes
2 answers

Different table_areas on multiple page pdf

I would like to extract tables from a multi-page PDF. Because of the table properties, I need to pass flavor='stream' and table_areas to read_pdf for my table to be properly detected. My problem is that the position of the table is…
Oneira
  • 1,365
  • 1
  • 14
  • 28
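
One possible approach to the per-page `table_areas` problem, sketched under assumptions: call `read_pdf` once per page and pass that page's own area. `table_areas` takes strings of the form `"x1,y1,x2,y2"` (top-left, then bottom-right, in PDF coordinate space). The helper names and the coordinates below are hypothetical.

```python
def format_area(x1, y1, x2, y2):
    """Build a table_areas string: top-left (x1, y1), bottom-right (x2, y2)."""
    return f"{x1},{y1},{x2},{y2}"


def read_tables_per_page(path, areas_by_page):
    """Call camelot.read_pdf once per page, each with its own table_areas."""
    import camelot  # imported lazily; requires camelot-py to be installed

    results = {}
    for page, area in areas_by_page.items():
        results[page] = camelot.read_pdf(
            path,
            pages=str(page),
            flavor="stream",
            table_areas=[area],
        )
    return results


# Hypothetical coordinates in PDF points; replace with your own measurements.
areas_by_page = {
    1: format_area(50, 700, 550, 400),
    2: format_area(50, 780, 550, 100),
}
print(areas_by_page[1])
```

Calling `read_tables_per_page("report.pdf", areas_by_page)` (file name hypothetical) would then return one `TableList` per page.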
0
votes
1 answer

Ghostscript not installing properly - find_library('gs') returns None

I'm attempting to install camelot, but for some reason Ghostscript won't install properly, so I keep getting the error RuntimeError: Please make sure that Ghostscript is installed whenever I try to use read_pdf. When I went to check if Ghostscript…
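
A small check along the lines of what the question describes, using only the standard library: it asks `ctypes` whether a Ghostscript shared library is visible. The Windows DLL naming (`gsdll32.dll` / `gsdll64.dll`) is an assumption based on the usual Ghostscript installer layout, and `find_ghostscript` is a made-up helper name.

```python
import ctypes
import sys
from ctypes.util import find_library


def find_ghostscript():
    """Return the Ghostscript library name or path that ctypes can see, or None."""
    if sys.platform.startswith("win"):
        # Assumed Windows naming: the DLL is named after the pointer size.
        bits = ctypes.sizeof(ctypes.c_void_p) * 8
        return find_library(f"gsdll{bits}.dll")
    return find_library("gs")


location = find_ghostscript()
print(location if location else "Ghostscript library not found on the search path")
```

A `None` result usually means the library is not on the search path of the interpreter being used, which can happen even when the `gs` executable itself is installed.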
0
votes
0 answers

Ways to make Camelot faster

I have been using camelot for extracting tables from PDF pages. It works well. However, it takes around 5 minutes to extract all the tables from a PDF of 68 pages. In the future, I am going to need to extract tables from PDFs with over 1,000 pages. I…
Yaset Arfat
  • 106
  • 1
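
One idea that may help with long PDFs, offered as a sketch rather than a measured fix: split the page numbers into ranges and extract each range in its own process. The `pages` argument of `read_pdf` accepts range strings such as `"1-20"`. All function names, the chunk size, and the worker count here are assumptions, and any speedup depends on the machine and the document.

```python
from concurrent.futures import ProcessPoolExecutor


def page_ranges(total_pages, chunk_size):
    """Split 1..total_pages into 'start-end' strings for read_pdf(pages=...)."""
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges


def extract_chunk(args):
    """Worker: extract one page range and return plain DataFrames."""
    import camelot  # imported in the worker; requires camelot-py

    path, pages = args
    return [table.df for table in camelot.read_pdf(path, pages=pages)]


def extract_parallel(path, total_pages, chunk_size=10, workers=4):
    """Run extract_chunk over all page ranges in separate processes."""
    jobs = [(path, pages) for pages in page_ranges(total_pages, chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(extract_chunk, jobs))
    return [df for chunk in chunks for df in chunk]


print(page_ranges(68, 20))
```

When running `extract_parallel` from a script on Windows or macOS, the call needs to sit under an `if __name__ == "__main__":` guard so worker processes can start cleanly.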
0
votes
1 answer

Reading Tables from PDF and converting them into Pandas Dataframe

I am trying to extract tabular data from a PDF and store it as a data frame. But the tabular data is not coming out in a proper format. Below is the data frame I am getting: But I want that data frame in the format below. Please help me how should I…
0
votes
1 answer

Pandas DataFrame combine rows by column value, where Date Rows are NULL

Scenario: Parse the PDF bank statement and transform it into a clean and formatted CSV file. What I've tried: I managed to parse the PDF file (tabular format) using the camelot library but failed to produce the desired result in sense of…
0
votes
1 answer

concatenate tables from loop getting error - InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I need to concatenate tables created from a loop. They have repeats of the names in the columns but they are telling a different story, but for some reason when running this code I get an error: InvalidIndexError: Reindexing only valid with uniquely…
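
One common cause of this error is duplicate column labels in the frames being concatenated. A minimal sketch of a workaround, assuming that is what is happening here: rename repeated labels so each one is unique before combining. `make_unique` is a hypothetical helper, not a pandas function, and the column names are invented.

```python
def make_unique(names):
    """Append _1, _2, ... to repeated names so every column label is distinct."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return unique


columns = ["Name", "Amount", "Name", "Amount", "Date"]
print(make_unique(columns))
```

Applied to a DataFrame, this would look like `df.columns = make_unique(df.columns)` for each table before calling `pd.concat`. The helper does not guard against a label such as `Name_1` already being present in the input.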
0
votes
1 answer

Extract fixed size and position table from pdf files in Python

Say I have many PDF files similar to the one from here: I would like to extract the following table and save it as an Excel file: I'm able to extract the table and save the Excel file manually with the package Excalibur. After installing Excalibur with pip3, I…
ah bon
  • 9,293
  • 12
  • 65
  • 148
0
votes
2 answers

how can I construct a list of NumPy arrays from these two arrays?

I have two arrays which are column and row values in PDF coordinate space: x = array([111, 303, 405, 513]) y = array([523, 546, 569, 603]) Here's a visual: I need to convert this into a list of numpy arrays, where each array is the boundary…
Chuck
  • 1,061
  • 1
  • 20
  • 45
0
votes
0 answers

Is there a way to add new line inside tables.export()

So I did a little project to read tables from a PDF file using camelot (pip install camelot). I get the output I require but it is all in one row, so I was wondering if I could insert a new line inside the JSON file from this code import camelot file =…
0
votes
1 answer

Can camelot use pdf "primitives" to extract data?

So I spent some time trying to extract data using PyPDF2 but this ended up being unreliable across pdfs even if the pdfs looked (to the eye) like they had similar structure and are probably computer generated. The thing I liked about PyPDF2 is that…
evan54
  • 3,585
  • 5
  • 34
  • 61
0
votes
2 answers

Trying to plot a pdf table using camelot-py, but no table comes up

I am trying to plot the table to debug and find table coordinates, however the plot never appears on the screen. Camelot has built-in functions that use the matplotlib library to plot the tables. I have all the dependencies downloaded for camelot,…
0
votes
1 answer

Data frame columns contain many newlines (\n), and so do their values. How to separate them into new columns and values?

While reading the PDF table using camelot some columns are concatenated and their values too like below Date | Facture-ref\nfactureid| Description\items| Payé\nEscompte …
0
votes
0 answers

Python - text is flipped using camelot to extract table from PDF

I'm using the camelot library to read PDFs and extract tables. For most PDFs it works perfectly. But for other PDFs, the text is flipped. Does anyone know what causes it and how to fix it? Here's the link to the…
Elia Weiss
  • 8,324
  • 13
  • 70
  • 110
0
votes
2 answers

Python: AttributeError: module 'camelot' has no attribute 'read_pdf'

Facing the issue below; can anyone help, please? Getting the below while trying to extract table data from PDFs: import camelot # PDF file to extract tables from file = input_folder+file_name tables = camelot.read_pdf(file) # number of tables…
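
A quick diagnostic sketch for this kind of AttributeError, assuming the cause is that Python is picking up something other than the intended library: it reports where the name `camelot` would be imported from, without importing it. `describe_module` is a hypothetical helper.

```python
import importlib.util


def describe_module(name):
    """Report where a module would be imported from, without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None}
    return {"found": True, "origin": spec.origin}


info = describe_module("camelot")
print(info)
```

If `origin` points at a local file named `camelot.py`, or at an installation of the unrelated `camelot` package from PyPI, that would explain the missing `read_pdf`. The table-extraction library is published as `camelot-py`.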
0
votes
1 answer

How to parse table in PDF for non-english language

I was using Camelot and tabula for parsing a pdf file with Cyrillic symbols inside. But in the output CSV file, I got the messed-up font with no sign of Russian language. What can help me to parse the pdf table in a non-English language? import…
Egorsky
  • 179
  • 1
  • 11