Extracting multiple strings from poorly defined user input data

Question

I am looking to create a lookup table from data where entries in a column (user_entry) are in different formats and may contain more than one instance per row.

# create example dataframe.
id <- c(1111,1112,1113,1114)
user_entry <- c("999/1001","1002;1003","999/1004\n999/1005","9991006 9991007")
df <- data.frame(id,user_entry)

> df
    id         user_entry
1 1111           999/1001
2 1112          1002;1003
3 1113 999/1004\n999/1005
4 1114    9991006 9991007

I am only interested in the 4 digit code which may or may not be preceded by a 3 digit location code and/or a separator character such as "/" or a space. There may be more than one 4 digit code in each entry and I would like to list each of these separately in the final lookup table (see lookup below).

The code below does what I am looking for, but is really inelegant with loops inside loops and a dataframe growing inside. Is there a neater way to do this?

library(dplyr);library(stringr)

# use stringr package to extract only digits
df <- df %>% 
mutate(entries = str_extract_all(user_entry,"[[:digit:]]+")) %>%
select(-user_entry)

# initialise lookup dataframe
lookup <- df[FALSE,]
for (record in 1:nrow(df)){   
  entries <- df$entries[[record]]    
  for (element in 1:length(entries)){
    # only interested in 4 digit codes
    if (nchar(entries[element])>3){
      # remove 3 digit code if it is still attached
      lookup_entry <- gsub('.*?(\\d{4})$','\\1',entries[element])
      lookup <- rbind(lookup,data.frame(id=df$id[[record]],entries=lookup_entry))
    }
  }
}

> lookup
    id entries
1 1111    1001
2 1112    1002
3 1112    1003
4 1113    1004
5 1113    1005
6 1114    1006
7 1114    1007

Maybe you can just extract the last 4 digits from each digit sequence? [`str_extract_all(user_entry,"\\d{4}\\b")`](https://regex101.com/r/Hm20nm/1)? — Wiktor Stribiżew, Apr 18 '17 at 15:33

score 2 · Accepted Answer · answered Apr 18 '17 at 15:49

2

Using base R,

matches <- regmatches(user_entry, gregexpr("(\\d{4})\\b", user_entry))

data.frame(
  id = rep(id, lengths(matches)),
  entries = unlist(matches),
  stringsAsFactors = FALSE
)
#     id entries
# 1 1111    1001
# 2 1112    1002
# 3 1112    1003
# 4 1113    1004
# 5 1113    1005
# 6 1114    1006
# 7 1114    1007

answered Apr 18 '17 at 15:49

r2evans

141,215
6
77
149

This makes the assumption that if we will always have the 999 before the 4 digit numbers. I don't know if that will always hold true. If the last entry was 1007999 the regex would return 7999. – Kristofersen Apr 18 '17 at 15:52
Other than that it is a much cleaner solution than mine. Figured I'd post that though for OPs benefit. I'm not sure how exactly he wants to handle the 999s – Kristofersen Apr 18 '17 at 15:53
The patterns suggest that (regardless of "999") the 4-digit code of interest is always bounded on the right, which is sufficient given the examples. The risk of reducing a question for SO to "minimal/reproducible" is over-simplification, where not enough variety is provided. \*shrug\* – r2evans Apr 18 '17 at 15:54
good call, he actually mentions that its preceded by the 3 digit code too. I missed that the first time reading it. – Kristofersen Apr 18 '17 at 15:56

score 0 · Answer 2 · answered Apr 18 '17 at 16:26

Not very elegant, but I think it should work in your case:

    library("tidyverse")
df1 <- df %>%
  separate_rows(user_entry, sep = '(/|;|\\n|\\s)')

extract <- str_extract(df1$user_entry,"(?=\\d{3})\\d{4}$")
df1$extract <- extract
df2 <- df1[!is.na(df1$extract),]
df2


> df2
     id user_entry extract
 #1111       1001    1001
 #1112       1002    1002
 #1112       1003    1003
 #1113       1004    1004
 #1113       1005    1005
 #1114    9991006    1006
 #1114    9991007    1007

Extracting multiple strings from poorly defined user input data

2 Answers2