I have data that looks a bit like this
col1 col2
1 "1042AZ"
2 "9523 pa"
3 "dog"
4 "New York"
5 "20000 (usa)"
6 "Outside the country"
7 "1052"
I want to keep everything that
- is only 4 numbers
- is only 4 numbers and two letters with any combination of spaces
I currently have this code:
df$col2 <- gsub('\\s+', '', df$col2)
df$col2 <- toupper(df$col2)
#Delete all rows that does not start with 4 numbers and make PC4 column
df <- df %>%
mutate(col3 = str_extract(col2, "^[0-9]{4,}"),
col4 = str_extract(col2, "[A-Z].*$"),
across(c(col2,col3,col4), ~ifelse(grepl("^[0-9]{4}", col2), .x, "")))
I want this result:
col1 col2 col3 col4
1 "1042AZ" 1042 "AZ"
2 "9523PA" 9523 "PA"
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
7 "1052" 1052 NA
Problem is that the number in line 5 stays after running my code.