Questions tagged [python-camelot]

Camelot is a Python library that makes it easy for anyone to extract tabular data from PDF files.

Official web site

Why Camelot?

  • You are in control. Unlike other libraries and tools which either give a nice output or fail miserably (with no in-between), Camelot gives you the power to tweak table extraction. (This is important since everything in the real world, including PDF table extraction, is fuzzy.)
  • Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
  • Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
  • Export to multiple formats, including JSON, Excel and HTML.
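A minimal sketch of the "discard bad tables based on metrics" idea from the list above. It assumes that each Camelot table exposes a `parsing_report` dict with `accuracy` and `whitespace` percentages. The helper names `keep_good_tables` and `extract_good_tables` and the threshold values are hypothetical, chosen only for illustration, and the stand-in objects at the bottom replace a real PDF so the filter can be tried without Camelot installed.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Return the tables whose parsing report passes both thresholds.

    The default thresholds are illustrative assumptions, not library defaults.
    """
    kept = []
    for table in tables:
        report = table.parsing_report  # dict with "accuracy" and "whitespace"
        if (report["accuracy"] >= min_accuracy
                and report["whitespace"] <= max_whitespace):
            kept.append(table)
    return kept


def extract_good_tables(pdf_path, pages="1", flavor="lattice"):
    """Read tables from a PDF with Camelot and drop the low-quality ones."""
    # Imported lazily so the filter above can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return keep_good_tables(tables)


# Stand-in objects that mimic only the `parsing_report` attribute of a table.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "whitespace": 12.0}),
    SimpleNamespace(parsing_report={"accuracy": 61.5, "whitespace": 48.0}),
]
good = keep_good_tables(fake_tables)
print(len(good))
```

With a real PDF, each table kept by `extract_good_tables` can then be read as a pandas DataFrame through its `df` attribute.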

See comparison with other PDF table extraction libraries and tools.

197 questions
4
votes
1 answer

Headers are not getting extracted from PDF while extracting the table data from PDF using camelot

I am using camelot for table data extraction; however, the headers are not getting extracted from the PDF. I am attaching the target PDF link below; the target tables are on pages 3 and 4, which need to…
Abhishek Bisht
  • 138
  • 1
  • 10
3
votes
1 answer

How can I stop camelot-py from splitting multi-line text in a single cell into multiple cells?

I am trying to build an app which reads arbitrary PDFs and extracts tables from them and I am using Camelot for extracting the tables. This is working fine for tables in which cells have single line values. However, for tables having cells with…
Rohit Gavval
  • 227
  • 1
  • 13
3
votes
1 answer

Not able to import camelot in Python 3.7 (Anaconda) on macOS Catalina

My environment specs: python --version: Python 3.7.6; anaconda --version: anaconda Command line client (version 1.7.2); sw_vers: ProductName: Mac OS X, ProductVersion: 10.15.2, BuildVersion: 19C57. I installed camelot from conda-forge using…
Ronnie Day
  • 121
  • 2
  • 13
3
votes
1 answer

Camelot Pdf Extraction FAIL parsing

I'm having a problem with the Camelot library. I'm extracting data from a PDF, and my code runs "ok" for the previous 23 pages, but in this case it fails to parse the text/table ending. I suppose the problem is that the string is so long that it reaches the table border. Also…
Wonka
  • 1,548
  • 1
  • 13
  • 20
3
votes
2 answers

How to extract table name along with table using camelot from pdf files using python?

I am trying to extract tables and the table names from a pdf file using camelot in python. Although I know how to extract tables (which is pretty straightforward) using camelot, I am struggling to find any help on how to extract the table name. The…
Vijay
  • 57
  • 2
  • 6
2
votes
1 answer

extract borderless table with pdfplumber

I am trying to extract borderless tables from a PDF document. I have tried a few combinations of the table_settings parameter; however, pdfplumber cannot recognize the borderless tables correctly. The PDF file can be downloaded from the link. Here is …
go sgenq
  • 313
  • 3
  • 13
2
votes
1 answer

Ghostscript not detected when using camelot with Pipenv

I'm trying to use camelot to read tables from a pdf, but when I execute tables = camelot.read_pdf('foo.pdf') I get the following error: RuntimeError: Please make sure that Ghostscript is installed. I have installed Ghostscript and python-ghostscript…
Daniel
  • 51
  • 1
  • 6
2
votes
0 answers

'numpy.core._multiarray_umath' in Eclipse IDE

I am running Eclipse IDE 4.20.0 with a PyDev Interpreter on Windows 10. I am trying to get Camelot to run within my script but continue to get the error: "Original error was: No module named 'numpy.core._multiarray_umath'" For each, I have…
2
votes
2 answers

Tables not detected with tabula and camelot

I tried to extract tables from PDFs that I think are not in a proper format. The tables in these PDFs have a table layout but are not properly enclosed with vertical borders. I'll attach the sample PDF and the output from both libraries. When I tried to…
Anshul Joshi
  • 55
  • 1
  • 7
2
votes
0 answers

Extracting PDF tables with camelot-py (lattice): split_text does not work

When extracting a table using camelot, the text of two columns that is close together is merged into one, even though all lines are detected correctly. I am using the lattice flavor, as the table in the PDF has lines. I set split_text = True but it…
Tomper
  • 78
  • 7
2
votes
1 answer

PDF table to pandas data frame using camelot

I'm trying to create a simple way to get data from a PDF into a pandas data frame. Something like this: import camelot; import pandas as pd; pdf = camelot.read_pdf("file1.pdf"); print(pdf[0].df). The point is that I'm trying with two different files:…
2
votes
1 answer

Camelot Cannot extract entire table

I'm using Camelot to extract table information from a PDF that I have converted from scanned to searchable using ocrmypdf (500 dpi). Camelot seems to be able to identify the table and extract most of the data within it, but it seems to be unable…
2
votes
1 answer

Camelot PDF failing to strip text

I have this PDF and I'm trying to work on its very first table. The issue happens when the name of the employer (EMPREGADOR) spans two lines. I'm using the following command to try to strip the data correctly: tables =…
André Luís
  • 141
  • 2
  • 7
2
votes
1 answer

Python Camelot / Ghostscript "wrong architecture" error

I have encountered an error that takes me beyond my debugging capabilities. Camelot's usage of Ghostscript seems to have found an executable of the wrong architecture. Steps taken: brew install Ghostscript; checked to see if Ghostscript's executable…
Jkiefn1
  • 91
  • 3
  • 16
2
votes
1 answer

How to read table spread across multiple pages, using tabula_py or camelot

I am using tabula_py to read tables in a PDF. Some are big, and in many cases a table spans more than one page. The issue is that tabula_py treats each page as a new table instead of reading it as one large table. I have the same issue with Camelot.
Sharon
  • 51
  • 3