
I'm trying to extract tables from a PDF that contains a lot of media source names. The desired output is a single, comprehensive CSV file with one column listing all the sources.

I'm writing a simple Python script to extract the table data from the PDF. So far I get one CSV per table, which I then merge with pandas' concat function. The result is messy: there is redundant punctuation and a lot of extra spaces in the file.

Can somebody help me reach a better result?

Code:

from camelot import read_pdf
import glob
import os
import pandas as pd
import numpy as np

# Get all the tables within the file
all_tables = read_pdf("/Users/zago/code/pdftext/pdftextvenv/mimesiweb.pdf", pages = 'all')

# Show the total number of tables in the file
print("Total number of table: {}".format(all_tables.n))
 
# print all the tables in the file
for t in range(all_tables.n):
    print("Table n°{}".format(t))
    print((all_tables[t].df).head())

#convert to excel or csv 
#all_tables.export('table.xlsx', f="excel")
all_tables.export('table.csv', f="csv")

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',') for f in all_filenames ])
#export to csv
combined_csv.to_csv("combined_csv_tables.csv", index=False, encoding="utf-8")

The starting-point PDF, the result for a single CSV, and the combined CSV are shown in the attached screenshots.

Thanks

signorz

3 Answers


Select only the first column before concatenating and then save.

Just use this line of code:

combined_csv = pd.concat([pd.read_csv(f,encoding = 'utf-8', sep=',').iloc[:,0] for f in all_filenames ])

Output:

In [25]: combined_csv
Out[25]:
0                 Interni.it
1                  Intima.fr
2              Intimo Retail
3     Intimoda Magazine - En
4          Intorno Tirano.it
               ...
47          Alessandria Oggi
48               Aleteia.org
49              Alibi Online
50               Alimentando
51       All About Italy.net
Length: 2824, dtype: object
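If you then want to write that combined column back out to a single file, as in your original script, something like this sketch should do it (the strip and drop_duplicates calls are optional cleanup, and the output filename just reuses yours):

import glob
import pandas as pd

all_filenames = glob.glob('*.csv')

# Keep only the first column of each per-table CSV and stack them into one Series
combined = pd.concat(
    [pd.read_csv(f, encoding='utf-8').iloc[:, 0] for f in all_filenames],
    ignore_index=True,
)

# Optional cleanup: trim whitespace, drop empty rows and duplicates
combined = combined.astype(str).str.strip()
combined = combined[combined != ''].drop_duplicates()

# Write a single-column CSV with no index and no header row
combined.to_csv('combined_csv_tables.csv', index=False, header=False, encoding='utf-8')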

The final CSV output is shown in the attached screenshot.

seralouk
  • May I ask why, after line 407, the CSV appears as follows: `Ipe.com,"" IPS - Inter Press Service,"" Ipsoa.it,"" Irene's Closet,"" Irpinia News,"" Irpinia Oggi,"" Irpinia Report,"" Iskrae.eu,"" ` – signorz Feb 14 '23 at 15:31

There are some oddities to beware of when using the CSV format.

Most PDF pages store their text as just one column running from page edge to page edge, unless the page is tagged as having multi-column areas. (That is one reason data extractors are needed to split the text into tabular form.)

In this case a single column of text is no different from a standard CSV, except for ONE entry that itself contains a comma (no other entry needs a comma in CSV output), so if the above PDF is imported into or exported from Excel it will look like a single column (see the attached screenshots).

Thus the only commands needed are to export with pdftotext, wrap each line in quotes, and rename the output to .csv. HOWEVER, see the comments further down:

pdftotext -layout -nopgbrk -enc UTF-8 mimesiweb.pdf
for /f "tokens=*" %t in (mimesiweb.txt) do echo "%t" >>mimesiweb.csv

This will correctly generate the desired output, which opens as expected in Excel from the command line (see screenshot).
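If you would rather do the quoting step in Python instead of the Windows for loop, a rough equivalent (assuming the pdftotext output is named mimesiweb.txt) would be to let the csv module quote every line:

import csv

# Wrap each text line in quotes so the one entry containing a comma stays in a single column
with open('mimesiweb.txt', encoding='utf-8') as src, \
        open('mimesiweb.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for line in src:
        line = line.strip()
        if line:  # skip blank lines
            writer.writerow([line])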

We correctly output UTF-8 characters in the resulting .csv, but my old Excel always loses the UTF-8 on import (e.g. the è in Accènto gets mangled), even if I use CHCP 65001 (Unicode) before invoking it.
Exported as UTF8.csv the file reads with its accents intact in Notepad, no issue, but re-imported into Excel the accented characters are lost! Newer Excel may fare better (see screenshot).

So that is a failing in Excel; without a better Excel import, the workaround for me would be to simply cut and paste the 2880 lines of text so that the accents are preserved. The alternative is to import the text into LibreOffice, which supports UTF-8 (see screenshot).

However, remember to either UNCHECK the comma separator OR quote "each, line" as I did earlier for the entry that already contains a comma, otherwise a second column is generated. :-)
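On the Excel/UTF-8 point above, one workaround worth trying (I have not tested it against that older Excel) is to save the file with a UTF-8 byte-order mark, which newer Excel versions use to detect the encoding on import. In Python that is just the utf-8-sig codec; the mimesiweb_bom.csv name below is only an example:

# Re-save the CSV with a BOM so Excel can detect UTF-8 on import
with open('mimesiweb.csv', encoding='utf-8') as src:
    data = src.read()

with open('mimesiweb_bom.csv', 'w', encoding='utf-8-sig') as dst:
    dst.write(data)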


K J

I've found pdfplumber simpler to use for such tasks.

import pdfplumber

pdf = pdfplumber.open("Downloads/mimesiweb.pdf")
rows = []
for page in pdf.pages:
    rows.extend(page.extract_text().splitlines())
>>> len(rows)
2881
>>> rows[:3]
['WEB', 'Ultimo aggiornamento:  03 06 2020', '01net.it']
>>> rows[-3:]
['Zoneombratv.it', 'Zoom 24', 'ZOOMsud.it']
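From there, getting to the single-column CSV the question asks for is one more loop; a sketch (sources.csv is just an example name, and rows[2:] skips the 'WEB' and 'Ultimo aggiornamento' header lines shown above):

import csv

# Write one media source per row, skipping the two header lines
with open('sources.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for name in rows[2:]:
        writer.writerow([name])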
jqurious