Misaggregation of data output from tabulizer

Question

I am very new to R -- but have now spent several days cobbling together (thank you stack exchange community) the code I need, in order to accomplish what I am trying to do: from start to finish, I am using the Tabulizer package to process pdf tables into R. I am struggling with a misaggregation of numbers (off the pdf) and how to effectively split them into appropriate columns.

This is a subset of the relevant columns from the Tabulizer output:

'data.frame':   26 obs. of  3 variables:
 $ X1: Factor w/ 29 levels "Region A","Region B",..: 22 2 3 4 5 6 7 8 9 10 ...
 $ X2: Factor w/ 29 levels "1 192 36,7","102 4,6",..: 21 20 4 1 23 10 18 14 26 17 ...
 $ X3: Factor w/ 29 levels "1 144 48,2","1 800 55, 8",..: 24 6 7 2 23 11 22 17 26 21 ...

More specifically: "1 800 55, 8" is a mistaken misaggregation of "1800" and "55.8", and this occurs throughout the column (which should have been two).

I am using gsub (in a loop) to substitute the decimal comma into a decimal dot, and to remove unwanted spaces around said decimal.

for (i in 2:ncol(Tab)) { #some of the tables have multiple columns
  Tab[[i]] <- gsub(",", "\\.", Tab[[i]]) 
  Tab[[i]] <- gsub("\\. ", "\\.", Tab[[i]])
  Tab[[i]] <- gsub(" \\. ", "\\.", Tab[[i]])
}

That gets me to "1 800 55.8" which is better but not great; the first space may or may not be there, or there may be two of them, depending on the magnitude of the number - the space was a stand-in for the thousand separator in the original document.

I would like to convert the column to the appropriate two new columns. I've tried using stringi::word to isolate the "last word" and am trying to match|extract that output (55.8) from the original column, leaving "1 800", which can then be trimmed, but I don't know how to. Also not married to that solution -- open to learning other ways. Much appreciated.

score 2 · Answer 1 · answered Nov 06 '18 at 20:13

You can try to use the columns argument in extract_tables to specify where the columns get separated. The way I use it is by using first :

tabulizer::locate_areas(FilePath)

This will allow you to get the location of your table. Then use the right and left coordinates to kind of guess where the columns get separated and use that when extracting the table

tabulizer::extract_tables(FilePath, pages = 1, guess = FALSE  
                            , columns = list(c(300))))

You can also specify the location of the table by adding "area" argument :

tabulizer::extract_tables(FilePath, pages = 1, guess = FALSE  
                         , area = list(c(265, 33, 460,400))
                            , columns = list(c(300))))

Misaggregation of data output from tabulizer

1 Answers1