I was able to successfully read in the table from the page you shared with tabula
. Ideally, you would want to do this with tabula
as it would also be more performant, since you are extracting text versus processing an image. I use tabula
in some project and can extract data from thousands of .pdf
pages in seconds (they must be text).
Have you entered coordinates for the columns? This is usually the best way to do this if the columns are consistent across multiple pages (they often are). This is what the code and output would look like for one page
. I am only able to download one page from that link and not multiple or the whole book, so my code will work on a .pdf
with one page.
For multiple pages it is the same code as below but you do a double pd.concat
. One concat
is to concat
a list of series
into a dataframe (as I have done below). Not shown in my code would be another concat
to bring together all of the pages.
from tabula import read_pdf
import pandas as pd
file = 'Documents/uc1-31210012879134-20-1629833207.pdf'
df = pd.concat(read_pdf(file, columns=([152.0, 193.0, 235.0, 272.0, 308.0, 352.0, 388.0, 460.0, 491.0, 527.0]),
pandas_options=({'header': False}),
pages='all',multiple_tables=True, stream=True, guess=False)).dropna()
df
Got stderr: Aug 24, 2021 1:04:28 PM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Out[1]:
Unnamed: 0 .. Unnamed: 1 Unnamed: 2 \
0 Table 4. —Number of OA SDIben. e ficiarie.s. with .ben
6 State 72andcounty Total workers husbands
8 California , total 3,738,579 2,393,724 303,722
9 Alameda 157,320 99,635 11,865
10 Alpine . 100 60 5
11 Amador . . 6,780 4,630 665
12 Butte 39,645 25,920 3,450
13 Calaveras . 8,030 5,325 765
15 Colusa 2,675 1,610 245
16 Contra Costa 107,695 67,615 9,865
17 Del Norte 4,455 2,635 345
18 Eldorado 21,340 14,325 1,900
19 Fresno 90,325 54,095 7,390
20 Glenn 4,530 2,790 390
21 Humboldt 20,635 11,810 1,710
22 Imperial 17,925 9,810 1,555
23 Inyo 4,400 3,060 335
24 Kern 76,030 43,000 5,860 7
25 Kings 9,770 5,440 890
26 Lake 14,955 9,925 1,110
27 Lassen , 3,985 2,375 340
28 Los Angeles 94175,420 806142,40980090 70,665 0
29 Madera 15,225 8,945 1,455
30 Marin 32,080 22,170 2,930
32 Mendocino 14,745 8,945 1,260
33 Merced 21,915 12,780 1,905
36 Monterey 42,820 26,845 3,815
37 Napa 20,600 13,625 1,815
38 Nevada 16,730 11,400 1,685
39 Orange 255,760 170,195 21,380
40 Placer 27,215 17,660 2,435
41 Plumas 4,445 3,000 385
42 Riverside 190,995 129,380 14,980
43 Sacramento 138,230 83,465 11,360
44 San Benito 4,450 2,940 360
45 San Bernardino 165,735 100,005 12,315
46 San Diego 326,395 213,615 31,015
47 San Francisco 1059,495 73,615 6,675 0
48 San Joaquin 68,560 40,985 5,310
49 San Luis Obispo 37,625 25,170 3,495
50 San Mateo 90,130 61,700 7,4251
51 Santa Barbara 53,330 35,145 5,050
52 Santa Clara 145,405 94,760 12,140
53 Santa Cruz 30,260_ 19,695 2,610
54 Shasta 29,965 17,985 2,555
56 Siskiyou 10,050 6,250 925
57 Solano 35,270 20,825 2,920
58 Sonoma 62,965 41,070 5,405
59 Stanislaus 53,520 31,900 4,155
63 Tulare 47,655 27,325 3,820
65 Ventura 77,365 49,505 6,695
66 Yolo 17,205 10,520 1,560
67 Yuba 9,265 5,135 720
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 \
0 efi..t.s in c urrent -pay ment sta .tus , b.y..... type o..f b
6 Children widowers Children workers husbands Ch
8 41,245 442,674 160,627 295,295 18,844
9 1,685 19,030 6,720 13,935 675
10 0 15 5 15 0
11 40 745 190 385 30
12 305 4,315 1,195 3,210 260
13 45 800 240 610 50
15 30 360 105 230 25
16 875 13,905 4,085 8,300 465
17 25 510 155 550 55
18 150 2,010 790 1,570 115
19 1,335 11,385 5,130 7,745 620
20 55 595 200 315 45
21 155 2,685 920 2,405 210
22 640 2,100 1,250 1,760 195
23 15 505 130 265 20
24 0915 9,435 4,470 8,470 65 750
25 135 1,390 600 870 90
26 110 1,490 395 1,400 115
27 30 445 175 400 45
28 10,900 111,100 43,920 74,425 0 4,21025
29 200 1,830 750 1,390 140
30 255 3,895 730 1,760 50
32 160 1,815 680 1,375 90
33 275 2,870 1,225 1,930 85185
36 580 5,350 1,825 3,065 245
37 255 2,435 815 1,270
38 120 1,660 440 1,055 80
39 2,445 32,600 9,325 15,280 795
40 210 3,100 1,010 1,995 155
41 30 430 130 360 25
42 1,725 19,935 7,440 12,665 945
43 1,415 16,535 6,460 13,970 850
44 75 490 220 255 25
45 2,005 20,145 10,125 14,805 1,225
46 3,630 37,775 11,625 21,615 1,325
47 1,200 12,470 3,040 10,93055 5 245
48 1,040 8,165 3,695 6,620 505
49 290 4,125 1,000 2,645 200
50 7655 10,580 2,660 5,580 15205
51 595 6,255 1,750 3,360 230
52 1,635 17,210 6,230 10,320 500
53 280 3,760 1,090 2,120 140
54 240 3,545 1,170 3,175 300
56 65 1,125 320 955 75
57 380 4,415 1,985 3,230 290
58 735 7,780 2,320 4,325 220
59 585 6,575 2,640 5,450 450
63 1,010 5,730 3,010 4,590 445
65 960 9,110 3,310 5,555 435
66 175 2,080 760 1,560 95
67 110 1,075 485 1,200 105
Unnamed: 7 . ..1
0 enefit , .by sex o f ... ...
6 ildren Men Women
8 82,448 1,154,514 1,658,066
9 3,775 46,670 71,480
10 0 30 40
11 95 2,400 2,915
12 990 12,980 17,280
13 195 2,760 3,205
15 70 850 1,110
16 2,585 33,040 48,935
17 180 1,395 1,690
18 480 7,205 8,655
19 2,625 26,910 37,875
20 140 1,455 1,905
21 740 6,210 8,160
22 615 5,695 5,600
23 70 1,515 1,940
24 3,130 21,815 29,155
25 355 2,830 3,985
26 410 5,095 6,020
27 175 1.255 1,520
28 1157,805 289,635 428,460
29 515 4,780 5,880
30 290 10,655 15,770
32 420 4,620 6,025
33 745 6,655 8,725
36 1,095 13,230 18,665
37 330 6,855 9,500
38 290 5,810 7,135
39 3,740 79,175 122,680
40 650 8,700 11,895
41 85 1,540 1,745
42 3,925 63,005 83.820
43 4,175 39,815 59.235
44 85 1,515 1,810
45 5,110 47,785 68,420
46 5,795 105,260 147,030
47 151,320 33,800 51,465
48 2,240 20,580 27,970
49 700 12,545 16,840
50 1,215 29,260 43,220
51 945 17,130 24,825
52 2,610 44,620 66,405
53 565 9,335 14,240
54 995 9,215 12,020
56 335 3,265 3,960
57 1,225 10,090 14,555
58 1,110 19,690 29,635
59 1,765 15,600 22,310
63 1,725 13,800 18,705
65 1,795 23,720 34,655
66 455 5,325 7,400
67 435 2,720 3,315