How to extract data from multiple PDFs in the same directory using python-camelot?

Question

I'm trying to extract data from multiple tables in multiple PDFs and save it in CSV format. I did my research and found that python-camelot is a good tool for this. I tried it and it works perfectly fine on a single PDF. However, I have over 50 PDFs in the same format, so I decided to iterate over all the files using a for loop, but it did not work: I get an error saying the files are not found in the directory. Can you please help? Here is the code:

import tkinter 
import camelot
import os

directory = 'C:\\Users\\Alr\\Desktop\\test\\'
files = [ filename for filename in os.listdir(directory)]
for i in range (len(files)):
    tables = camelot.read_pdf(files[i], pages='5,6,7')
    tables.export(files[i], f='csv', compress=True) # json, excel, html, sqlite
    tables.to_csv(files[i]+'.csv')

`files` never gets set to any value. Did you forget to *read* the folder? – Jongware, Mar 11 '20 at 21:40
@usr2564301 Thank you for your reply. I forgot to include it; I just updated the code. – Ahmad B, Mar 11 '20 at 22:30
Now the issue is clear – a common mistake, alas. `os.listdir` returns *the names of the files*, which means that the *path* is not included. Just prepend `directory` to the file name in `read_pdf` and you're set. – Jongware, Mar 12 '20 at 00:02
@usr2564301 Thank you so much. I think you are right; now it's working after I added the path to the name. However, I have a problem with exporting: I use the filename as the name for the CSV file, but it includes the ".pdf" in the name, and now the code is throwing an error. Is there any method to take out the .pdf from the name and just use the file name? – Ahmad B, Mar 12 '20 at 11:17

score 2 · Answer 1 · answered Mar 12 '20 at 08:34

As suggested in the comments, the problem is that os.listdir returns only filenames and not complete paths.

You can try this:

import glob

import camelot

# glob returns full paths, so each match can be passed straight to read_pdf
pattern = 'C:\\Users\\Alr\\Desktop\\test\\*.pdf'

for pdf_filepath in glob.glob(pattern):
    # name the output after the source PDF, swapping the extension
    csv_filepath = pdf_filepath.replace('.pdf', '.csv')
    tables = camelot.read_pdf(pdf_filepath, pages='5,6,7')

    # export() handles the whole list of tables; f can also be json, excel, html, sqlite
    tables.export(csv_filepath, f='csv', compress=True)
    # to_csv() belongs to a single table rather than the list, e.g.:
    # tables[0].to_csv(csv_filepath)
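
If you would rather keep `os.listdir` as in the question, the fix described in the comments is to join each bare file name back onto the directory. Below is a minimal sketch of that idea; the helper name `list_pdf_paths` is made up for illustration and is not part of camelot or the original code.

```python
import os


def list_pdf_paths(directory):
    """Return full paths of the PDF files in `directory` (hypothetical helper)."""
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir yields bare names such as 'report.pdf', so join each one
        # with the directory to get a path that works from any working directory
        if name.lower().endswith('.pdf'):
            paths.append(os.path.join(directory, name))
    return paths
```

Each returned path could then be passed to `camelot.read_pdf` in place of `files[i]`. The extension check also skips any non-PDF files that happen to sit in the same folder.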

Stefano Fiorucci - anakin87


Thank you, yes, I think adding the path is working now. However, I have a problem with 'tables.to_csv(files[i]+'.csv')'. I'm using files[i] to name the CSV file every time I extract a table from a PDF file. As you might know, files[i] will include the file name + .pdf, so is there a way to remove .pdf from the name before I export it? Right now it's giving an error because it exports .pdf.csv together. – Ahmad B Mar 12 '20 at 11:18
You can replace '.pdf' with '.csv' ---> tables.to_csv(files[i].replace('.pdf', '.csv')) – Stefano Fiorucci - anakin87 Mar 12 '20 at 11:26
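
One caveat with `str.replace` is that it rewrites every occurrence of '.pdf' in the path, not just the extension. As a possible alternative, here is a small sketch using the standard library's `pathlib`; the helper name `csv_path_for` is an assumption made for this example and does not come from the thread.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return `pdf_path` with its final extension swapped for .csv (hypothetical helper)."""
    # with_suffix replaces only the last extension, so a '.pdf' that appears
    # earlier in the name or in a folder name is left untouched
    return Path(pdf_path).with_suffix('.csv')
```

The result is a `Path` object; wrapping it in `str(...)` before handing it to an export call keeps it a plain string, as in the code above.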
