2

I'm looking for packages in Python to convert tables from PDFs to CSVs. I've attached an image of such a table below, while the original PDF can be downloaded from here.

I've tried using Tabula which did not seem to be able to recreate the formatting of the table, and had difficulty distinguishing between cells (understandable, given the source has minimal bordering). The options which were effective used OCR (www.convertio.co and www.extracttable.com were the ones I tried). However, both of these are paid services without a subscription model, so would be financially unrealistic for me due to the volume of files I'm hoping to process.

If anyone has suggestions on any viable options, that would be extremely helpful.

Example Image

tmako
  • 349
  • 2
  • 9

2 Answers2

1

I was able to successfully read in the table from the page you shared with tabula. Ideally, you would want to do this with tabula as it would also be more performant, since you are extracting text versus processing an image. I use tabula in some project and can extract data from thousands of .pdf pages in seconds (they must be text).

Have you entered coordinates for the columns? This is usually the best way to do this if the columns are consistent across multiple pages (they often are). This is what the code and output would look like for one page. I am only able to download one page from that link and not multiple or the whole book, so my code will work on a .pdf with one page.

For multiple pages it is the same code as below but you do a double pd.concat. One concat is to concat a list of series into a dataframe (as I have done below). Not shown in my code would be another concat to bring together all of the pages.

from tabula import read_pdf
import pandas as pd
file = 'Documents/uc1-31210012879134-20-1629833207.pdf'
df = pd.concat(read_pdf(file, columns=([152.0, 193.0, 235.0, 272.0, 308.0, 352.0, 388.0, 460.0, 491.0, 527.0]),
                       pandas_options=({'header': False}), 
              pages='all',multiple_tables=True, stream=True, guess=False)).dropna()
df
Got stderr: Aug 24, 2021 1:04:28 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed

Out[1]: 
                Unnamed: 0         ..       Unnamed: 1 Unnamed: 2  \
0   Table 4. —Number of OA  SDIben. e      ficiarie.s.  with .ben   
6        State 72andcounty      Total          workers   husbands   
8       California , total  3,738,579        2,393,724    303,722   
9                  Alameda    157,320           99,635     11,865   
10                Alpine .        100               60          5   
11              Amador . .      6,780            4,630        665   
12                   Butte     39,645           25,920      3,450   
13             Calaveras .      8,030            5,325        765   
15                  Colusa      2,675            1,610        245   
16            Contra Costa    107,695           67,615      9,865   
17               Del Norte      4,455            2,635        345   
18                Eldorado     21,340           14,325      1,900   
19                  Fresno     90,325           54,095      7,390   
20                   Glenn      4,530            2,790        390   
21                Humboldt     20,635           11,810      1,710   
22                Imperial     17,925            9,810      1,555   
23                    Inyo      4,400            3,060        335   
24                    Kern     76,030           43,000    5,860 7   
25                   Kings      9,770            5,440        890   
26                    Lake     14,955            9,925      1,110   
27                Lassen ,      3,985            2,375        340   
28             Los Angeles  94175,420  806142,40980090   70,665 0   
29                  Madera     15,225            8,945      1,455   
30                   Marin     32,080           22,170      2,930   
32               Mendocino     14,745            8,945      1,260   
33                  Merced     21,915           12,780      1,905   
36                Monterey     42,820           26,845      3,815   
37                    Napa     20,600           13,625      1,815   
38                  Nevada     16,730           11,400      1,685   
39                  Orange    255,760          170,195     21,380   
40                  Placer     27,215           17,660      2,435   
41                  Plumas      4,445            3,000        385   
42               Riverside    190,995          129,380     14,980   
43              Sacramento    138,230           83,465     11,360   
44              San Benito      4,450            2,940        360   
45          San Bernardino    165,735          100,005     12,315   
46               San Diego    326,395          213,615     31,015   
47           San Francisco   1059,495           73,615    6,675 0   
48             San Joaquin     68,560           40,985      5,310   
49         San Luis Obispo     37,625           25,170      3,495   
50               San Mateo     90,130           61,700     7,4251   
51           Santa Barbara     53,330           35,145      5,050   
52             Santa Clara    145,405           94,760     12,140   
53              Santa Cruz    30,260_           19,695      2,610   
54                  Shasta     29,965           17,985      2,555   
56                Siskiyou     10,050            6,250        925   
57                  Solano     35,270           20,825      2,920   
58                  Sonoma     62,965           41,070      5,405   
59              Stanislaus     53,520           31,900      4,155   
63                  Tulare     47,655           27,325      3,820   
65                 Ventura     77,365           49,505      6,695   
66                    Yolo     17,205           10,520      1,560   
67                    Yuba      9,265            5,135        720   

       Unnamed: 3   Unnamed: 4 Unnamed: 5                   Unnamed: 6  \
0   efi..t.s in c  urrent -pay   ment sta  .tus , b.y..... type o..f b   
6        Children     widowers   Children          workers husbands Ch   
8          41,245      442,674    160,627               295,295 18,844   
9           1,685       19,030      6,720                   13,935 675   
10              0           15          5                         15 0   
11             40          745        190                       385 30   
12            305        4,315      1,195                    3,210 260   
13             45          800        240                       610 50   
15             30          360        105                       230 25   
16            875       13,905      4,085                    8,300 465   
17             25          510        155                       550 55   
18            150        2,010        790                    1,570 115   
19          1,335       11,385      5,130                    7,745 620   
20             55          595        200                       315 45   
21            155        2,685        920                    2,405 210   
22            640        2,100      1,250                    1,760 195   
23             15          505        130                       265 20   
24           0915        9,435      4,470                 8,470 65 750   
25            135        1,390        600                       870 90   
26            110        1,490        395                    1,400 115   
27             30          445        175                       400 45   
28         10,900      111,100     43,920             74,425 0 4,21025   
29            200        1,830        750                    1,390 140   
30            255        3,895        730                     1,760 50   
32            160        1,815        680                     1,375 90   
33            275        2,870      1,225                  1,930 85185   
36            580        5,350      1,825                    3,065 245   
37            255        2,435        815                        1,270   
38            120        1,660        440                     1,055 80   
39          2,445       32,600      9,325                   15,280 795   
40            210        3,100      1,010                    1,995 155   
41             30          430        130                       360 25   
42          1,725       19,935      7,440                   12,665 945   
43          1,415       16,535      6,460                   13,970 850   
44             75          490        220                       255 25   
45          2,005       20,145     10,125                 14,805 1,225   
46          3,630       37,775     11,625                 21,615 1,325   
47          1,200       12,470      3,040               10,93055 5 245   
48          1,040        8,165      3,695                    6,620 505   
49            290        4,125      1,000                    2,645 200   
50           7655       10,580      2,660                  5,580 15205   
51            595        6,255      1,750                    3,360 230   
52          1,635       17,210      6,230                   10,320 500   
53            280        3,760      1,090                    2,120 140   
54            240        3,545      1,170                    3,175 300   
56             65        1,125        320                       955 75   
57            380        4,415      1,985                    3,230 290   
58            735        7,780      2,320                    4,325 220   
59            585        6,575      2,640                    5,450 450   
63          1,010        5,730      3,010                    4,590 445   
65            960        9,110      3,310                    5,555 435   
66            175        2,080        760                     1,560 95   
67            110        1,075        485                    1,200 105   

   Unnamed: 7          .        ..1  
0    enefit ,  .by sex o  f ... ...  
6      ildren        Men      Women  
8      82,448  1,154,514  1,658,066  
9       3,775     46,670     71,480  
10          0         30         40  
11         95      2,400      2,915  
12        990     12,980     17,280  
13        195      2,760      3,205  
15         70        850      1,110  
16      2,585     33,040     48,935  
17        180      1,395      1,690  
18        480      7,205      8,655  
19      2,625     26,910     37,875  
20        140      1,455      1,905  
21        740      6,210      8,160  
22        615      5,695      5,600  
23         70      1,515      1,940  
24      3,130     21,815     29,155  
25        355      2,830      3,985  
26        410      5,095      6,020  
27        175      1.255      1,520  
28   1157,805    289,635    428,460  
29        515      4,780      5,880  
30        290     10,655     15,770  
32        420      4,620      6,025  
33        745      6,655      8,725  
36      1,095     13,230     18,665  
37        330      6,855      9,500  
38        290      5,810      7,135  
39      3,740     79,175    122,680  
40        650      8,700     11,895  
41         85      1,540      1,745  
42      3,925     63,005     83.820  
43      4,175     39,815     59.235  
44         85      1,515      1,810  
45      5,110     47,785     68,420  
46      5,795    105,260    147,030  
47    151,320     33,800     51,465  
48      2,240     20,580     27,970  
49        700     12,545     16,840  
50      1,215     29,260     43,220  
51        945     17,130     24,825  
52      2,610     44,620     66,405  
53        565      9,335     14,240  
54        995      9,215     12,020  
56        335      3,265      3,960  
57      1,225     10,090     14,555  
58      1,110     19,690     29,635  
59      1,765     15,600     22,310  
63      1,725     13,800     18,705  
65      1,795     23,720     34,655  
66        455      5,325      7,400  
67        435      2,720      3,315  
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • I did attempt to input the column locations, but unfortunately even with that, Tabula seems to have difficulty discerning between cells. For example, see Los Angeles county in the output you provided compared to the original data. – tmako Aug 24 '21 at 20:28
  • yeah, there might be something wrong with that file. I copy / pasted into Notepad/Text Editor, and the line for Los Angeles has messed up underlying text. You might have to try a solution that processes the image rather than the messy underlying text. There is an image on the left side of the page that starts at the line for los angeles, which may have created issues for the underlying text when the file was processes to a .pdf. – David Erickson Aug 24 '21 at 21:20
  • Yup, unfortunately many of these tables are very old, so formatting issues are to be expected. Thank you anyway though. – tmako Aug 24 '21 at 21:58
1

Options and comparison between Python libraries available for PDF/Image table reconstruction:

  • pdfminer.six provides the foundation for pdfplumber. It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging.
  • pymupdf is substantially faster than pdfminer.six (and thus also pdfplumber) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools.
  • camelot, tabula-py, and pdftables all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract.
  • PyPDF2 and its successor libraries appear no longer to be maintained.
  • Pdfplumber - Have used this library to extract table in a pdf which had no bounding box of rectangle separating from table from normal text

pdf =pdfplumber.open(filepath)

for i in range(int(len(pdf.pages))):
     df = pd.DataFrame()
     table = pdf.pages[i].extract_table(table_settings={"vertical_strategy": "text","horizontal_strategy": "text"})
     df = pd.DataFrame(table,columns=table)
     df.to_csv(outfile, mode="a", index = False)



        

    

    
   
dataninsight
  • 1,069
  • 6
  • 13