I am looking to create a lookup table from data where entries in a column (user_entry
) are in different formats and may contain more than one instance per row.
# create example dataframe.
id <- c(1111,1112,1113,1114)
user_entry <- c("999/1001","1002;1003","999/1004\n999/1005","9991006 9991007")
df <- data.frame(id,user_entry)
> df
id user_entry
1 1111 999/1001
2 1112 1002;1003
3 1113 999/1004\n999/1005
4 1114 9991006 9991007
I am only interested in the 4 digit code which may or may not be preceded by a 3 digit location code and/or a separator character such as "/" or a space. There may be more than one 4 digit code in each entry and I would like to list each of these separately in the final lookup table (see lookup
below).
The code below does what I am looking for, but is really inelegant with loops inside loops and a dataframe growing inside. Is there a neater way to do this?
library(dplyr);library(stringr)
# use stringr package to extract only digits
df <- df %>%
mutate(entries = str_extract_all(user_entry,"[[:digit:]]+")) %>%
select(-user_entry)
# initialise lookup dataframe
lookup <- df[FALSE,]
for (record in 1:nrow(df)){
entries <- df$entries[[record]]
for (element in 1:length(entries)){
# only interested in 4 digit codes
if (nchar(entries[element])>3){
# remove 3 digit code if it is still attached
lookup_entry <- gsub('.*?(\\d{4})$','\\1',entries[element])
lookup <- rbind(lookup,data.frame(id=df$id[[record]],entries=lookup_entry))
}
}
}
> lookup
id entries
1 1111 1001
2 1112 1002
3 1112 1003
4 1113 1004
5 1113 1005
6 1114 1006
7 1114 1007