Tabulize function in R

Question

I want to extract the table of page 112 in this pdf document:

http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040

# report 2017 
url_location <-"http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=432759CA-0A73-57F6-04C67EF7EE506040"
out <- extract_tables(url_location, pages = 112)

I have tried using these tutorials (link1,link2) about 'tabulize' package but I largely failed. There are some difficult aspects which I am not very experienced how to handle in R.

Can someone suggest something and help me with that ?

Installation

devtools::install_github("ropensci/tabulizer")

# load package
library(tabulizer)

There is no way to know what your problem is, can you be a little more specific? For starters, you could simply download the pdf and scrape your table manually using the `extract_areas()` function — tifu, Dec 13 '17 at 12:56

hrbrmstr · Accepted Answer · 2017-12-13T18:56:41.837

Java deps — while getting easier to deal with — aren't necessary when the tables are this clean. Just a bit of string wrangling will get you what you need:

library(pdftools)
library(stringi)
library(tidyverse)

# read it with pdftools
book <- pdf_text("global-wealth-databook.pdf")

# go to the page
lines <- stri_split_lines(book[[113]])[[1]] 

# remove footer
lines <- discard(lines, stri_detect_fixed, "Credit Suisse")

# find line before start of table
start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1

# find line after table
end <- last(which(lines == ""))-1

# smuch into something read.[table|csv] can read
tab <- paste0(stri_replace_all_regex(lines[start:end], "[[:space:]][[:space:]]+", "\t"), collapse="\n")

#read it
read.csv(text=tab, header=FALSE, sep="\t", stringsAsFactors = FALSE)
##                   V1      V2      V3      V4    V5   V6    V7    V8    V9  V10
## 1              Egypt  56,036   3,168     324  98.1  1.7   0.2   0.0 100.0 91.7
## 2        El Salvador   3,957  14,443   6,906  66.0 32.8   1.2   0.0 100.0 65.7
## 3  Equatorial Guinea     670   8,044   2,616  87.0 12.2   0.7   0.1 100.0 77.3
## 4            Eritrea   2,401   3,607   2,036  94.5  5.4   0.1 100.0  57.1   NA
## 5            Estonia   1,040  43,158  27,522  22.5 72.2   5.1   0.2 100.0 56.4
## 6           Ethiopia  49,168     153     103 100.0  0.0 100.0  43.4    NA   NA
## 7               Fiji     568   6,309   3,059  85.0 14.6   0.4   0.0 100.0 68.2
## 8            Finland   4,312 159,098  57,850  30.8 33.8  33.5   1.9 100.0 76.7
## 9             France  49,239 263,399 119,720  25.3 21.4  49.3   4.0 100.0 70.2
## 10             Gabon   1,098  15,168   7,367  62.0 36.5   1.5   0.0 100.0 68.4
## 11            Gambia     904     898     347  99.2  0.7   0.0 100.0  72.4   NA
## 12           Georgia   2,950  19,430   9,874  50.7 47.6   1.6   0.1 100.0 66.8
## 13           Germany  67,244 203,946  47,091  29.5 33.7  33.9   2.9 100.0 79.1
## 14             Ghana  14,574     809     411  99.5  0.5   0.0 100.0  66.1   NA
## 15            Greece   9,020 111,684  54,665  20.7 52.9  25.4   1.0 100.0 67.7
## 16           Grenada      70  17,523   4,625  74.0 24.3   1.5   0.2 100.0 81.5
## 17            Guinea   5,896     814     374  99.4  0.6   0.0 100.0  69.7   NA
## 18     Guinea-Bissau     884     477     243  99.8  0.2 100.0  65.6    NA   NA
## 19            Guyana     467   5,345   2,510  89.0 10.7   0.3   0.0 100.0 67.2
## 20             Haiti   6,172   2,879     894  96.2  3.6   0.2   0.0 100.0 76.9
## 21         Hong Kong   6,172 193,248  46,079  26.3 50.9  20.9   1.9 100.0 85.1
## 22           Hungary   7,846  39,813  30,111  11.8 83.4   4.8   0.0 100.0 45.3
## 23           Iceland     245 587,649 444,999  13.0 72.0  15.0 100.0  46.7   NA
## 24             India 834,608   5,976   1,295  92.3  7.2   0.5   0.0 100.0 83.0
## 25         Indonesia 167,559  11,001   1,914  81.9 17.0   1.1   0.1 100.0 83.7
## 26              Iran  56,306   3,831   1,856  94.1  5.7   0.2   0.0 100.0 67.3
## 27           Ireland   3,434 248,466  84,592  31.2 22.7  42.3   3.6 100.0 81.3
## 28            Israel   5,315 198,406  78,244  22.3 38.7  36.7   2.3 100.0 74.2
## 29             Italy  48,544 223,572 124,636  21.3 22.0  54.1   2.7 100.0 66.0
## 30           Jamaica   1,962   9,485   3,717  79.0 20.2   0.8   0.0 100.0 74.3
## 31             Japan 105,228 225,057 123,724   7.9 35.7  53.9   2.6 100.0 60.9
## 32            Jordan   5,212  13,099   6,014  65.7 33.1   1.2   0.0 100.0 76.1
## 33        Kazakhstan  12,011   4,441     334  97.6  2.1   0.3   0.0 100.0 92.6
## 34             Kenya  23,732   1,809     662  97.4  2.5   0.1   0.0 100.0 77.2
## 35             Korea  41,007 160,609  67,934  20.0 40.5  37.8   1.7 100.0 70.0
## 36            Kuwait   2,996  97,304  37,788  30.3 48.3  20.4   1.0 100.0 76.9
## 37        Kyrgyzstan   3,611   4,689   2,472  92.7  7.0   0.2   0.0 100.0 62.9
## 38              Laos   3,849   5,662   1,382  94.6  4.7   0.7   0.0 100.0 84.9
## 39            Latvia   1,577  27,631  17,828  29.0 68.6   2.2   0.1 100.0 53.6
## 40           Lebanon   4,085  24,161   6,452  69.0 28.5   2.3   0.2 100.0 82.0
## 41           Lesotho   1,184   3,163     945  95.9  3.8   0.3   0.0 100.0 79.8
## 42           Liberia   2,211   2,193     959  97.3  2.6   0.1   0.0 100.0 71.6
## 43             Libya   4,007  45,103  24,510  29.6 61.1   9.2   0.2 100.0 59.9
## 44         Lithuania   2,316  27,507  17,931  27.3 70.4   2.1   0.1 100.0 51.6
## 45        Luxembourg     450 313,687 167,664  17.0 20.0  58.8   4.2 100.0 68.1
## 46         Macedonia   1,607   9,044   5,698  77.0 22.5   0.5   0.0 100.0 56.4

UPDATE

This is more generic but you'll still have to do some manual cleanup. I think you would even if you used Tabula.

library(pdftools)
library(stringi)
library(tidyverse)

# read it with pdftools
book <- pdf_text("~/Downloads/global-wealth-databook.pdf")

transcribe_page <- function(book, pg) {

  # go to the page
  lines <- stri_split_lines(book[[pg]])[[1]] 

  # remove footer
  lines <- discard(lines, stri_detect_fixed, "Credit Suisse")

  # find line before start of table
  start <- last(which(stri_detect_regex(lines, "^[[:space:]]+")))+1

  # find line after table
  end <- last(which(lines == ""))-1

  # get the target rows
  rows <- lines[start:end]

  # map out where data values are
  stri_replace_first_regex(rows, "([[:alpha:]]) ([[:alpha:]])", "$1_$2") %>% 
    stri_replace_all_regex("[^[:blank:]]", "X") %>% 
    map(~rle(strsplit(.x, "")[[1]])) -> pos

  # compute the number of data fields
  nfields <- ceiling(max(map_int(pos, ~length(.x$lengths))) / 2)

  # do our best to get them into columns
  data_frame(rec = rows) %>% 
    separate(rec, into=sprintf("X%s", 1:nfields), sep="[[:space:]]{2,}", fill="left") %>% 
    print(n=length(rows))

}

transcribe_page(book, 112)
transcribe_page(book, 113)
transcribe_page(book, 114)
transcribe_page(book, 115)

Take a look at the outputs for ^^. They aren't in terrible shape and some of the cleanup can be programmatic.

Can I ask you whether there is an easy looping to do, since I want to do the exact same thing for pages 112 to 115 ?? — msh855, Dec 13 '17 at 13:26
One more thing. I believe some columns (rows) are not properly read. For example, if the original pdf file has no values in some columns, the code gets confused. Look for instance, at the original pdf file and row 6, in the last column there are NAs wheres the table in the pdf file does not. — msh855, Dec 13 '17 at 13:52
looping shld be somewhat straightforward and I had to step away just as I realized it might not have gotten the cols 100%. Lemme poke a bit more at it. — hrbrmstr, Dec 13 '17 at 17:11
your help was indeed very valuable and saved a lot of time. I did 90% of my work. However, I really struggle to understand why in this case http://publications.credit-suisse.com/tasks/render/file/index.cfm?fileid=88DC07AD-83E8-EB92-9D5C3EAA87A97A77 , the pdf file at the pages 83:86 where the tables are is not read properly. — msh855, Dec 15 '17 at 18:43

Tabulize function in R

1 Answers1