I am very new to R -- but have now spent several days cobbling together (thank you stack exchange community) the code I need, in order to accomplish what I am trying to do: from start to finish, I am using the Tabulizer package to process pdf tables into R. I am struggling with a misaggregation of numbers (off the pdf) and how to effectively split them into appropriate columns.
This is a subset of the relevant columns from the Tabulizer output:
'data.frame': 26 obs. of 3 variables:
$ X1: Factor w/ 29 levels "Region A","Region B",..: 22 2 3 4 5 6 7 8 9 10 ...
$ X2: Factor w/ 29 levels "1 192 36,7","102 4,6",..: 21 20 4 1 23 10 18 14 26 17 ...
$ X3: Factor w/ 29 levels "1 144 48,2","1 800 55, 8",..: 24 6 7 2 23 11 22 17 26 21 ...
More specifically: "1 800 55, 8" is a mistaken misaggregation of "1800" and "55.8", and this occurs throughout the column (which should have been two).
I am using gsub (in a loop) to substitute the decimal comma into a decimal dot, and to remove unwanted spaces around said decimal.
for (i in 2:ncol(Tab)) { #some of the tables have multiple columns
Tab[[i]] <- gsub(",", "\\.", Tab[[i]])
Tab[[i]] <- gsub("\\. ", "\\.", Tab[[i]])
Tab[[i]] <- gsub(" \\. ", "\\.", Tab[[i]])
}
That gets me to "1 800 55.8" which is better but not great; the first space may or may not be there, or there may be two of them, depending on the magnitude of the number - the space was a stand-in for the thousand separator in the original document.
I would like to convert the column to the appropriate two new columns. I've tried using stringi::word to isolate the "last word" and am trying to match|extract that output (55.8) from the original column, leaving "1 800", which can then be trimmed, but I don't know how to. Also not married to that solution -- open to learning other ways. Much appreciated.