
I have data that looks a bit like this:

col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"

I want to keep every value that

  • consists of exactly 4 digits, or
  • consists of exactly 4 digits and two letters, with any combination of spaces

I currently have this code:

df$col2 <- gsub('\\s+', '', df$col2)
df$col2 <- toupper(df$col2)
# Delete all rows that do not start with 4 numbers and make the PC4 column
df <- df %>% 
  mutate(col3 = str_extract(col2, "^[0-9]{4,}"), 
         col4 = str_extract(col2, "[A-Z].*$"),
         across(c(col2,col3,col4), ~ifelse(grepl("^[0-9]{4}", col2), .x, "")))

I want this result:

col1    col2       col3   col4
1       "1042AZ"   1042   "AZ"
2       "9523PA"   9523   "PA"
3       NA         NA     NA
4       NA         NA     NA
5       NA         NA     NA
6       NA         NA     NA
7       "1052"     1052   NA

The problem is that the number in row 5 ("20000 (usa)") is still kept after running my code.
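
For illustration, the check from the last line of the code can be run on its own. This is a minimal sketch in base R, assuming the whitespace has already been removed and the text upper-cased; the variable names are only illustrative:

```r
# Row 5 after stripping whitespace and upper-casing
x <- "20000(USA)"

# "^[0-9]{4}" only requires the string to START with four digits,
# so a five-digit number passes the check as well
starts_with_four <- grepl("^[0-9]{4}", x)

# Anchoring the end too rejects it, because "0(USA)" follows the first four digits
exactly_four <- grepl("^[0-9]{4}$", x)

print(starts_with_four)
print(exactly_four)
```

The first check is the one used inside `across()`, which would explain why row 5 survives.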

Victor Nielsen

1 Answer


Following your code, you can set the values to NA whenever col3 does not have exactly 4 characters:

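library(dplyr)
library(stringr)

df %>% 
  mutate(col2 = gsub('\\s+', '', toupper(col2)),   # strip whitespace, upper-case
         col3 = str_extract(col2, "^[0-9]{4,}"),    # leading run of 4 or more digits
         col4 = str_extract(col2, "[A-Z].*$"),      # from the first letter to the end
         # keep a row only if the leading digit run is exactly 4 characters long
         across(c(col2,col3,col4), ~ ifelse(nchar(col3) == 4, .x, NA)))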

  col1   col2 col3 col4
1    1 1042AZ 1042   AZ
2    2 9523PA 9523   PA
3    3   <NA> <NA> <NA>
4    4   <NA> <NA> <NA>
5    5   <NA> <NA> <NA>
6    6   <NA> <NA> <NA>
7    7   1052 1052 <NA>
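
Note that the length check above accepts any value that starts with exactly four digits, so something like "1042ABC" would still be kept. If the whole value has to be four digits optionally followed by exactly two letters, one option is to anchor the pattern at both ends. The following is a minimal sketch in base R (so it runs without dplyr or stringr); the names `clean`, `pattern`, `ok` and `suffix` are illustrative and not part of the original code:

```r
# Sample data from the question
df <- data.frame(
  col1 = 1:7,
  col2 = c("1042AZ", "9523 pa", "dog", "New York",
           "20000 (usa)", "Outside the country", "1052"),
  stringsAsFactors = FALSE
)

# Normalise: drop all whitespace, then upper-case
clean <- toupper(gsub("\\s+", "", df$col2))

# The whole string must be 4 digits, optionally followed by exactly 2 letters
pattern <- "^[0-9]{4}([A-Z]{2})?$"
ok <- grepl(pattern, clean)

# Rows that fail the pattern become NA in every derived column
df$col2 <- ifelse(ok, clean, NA)
df$col3 <- ifelse(ok, substr(clean, 1, 4), NA)

# Characters 5-6 are the letters; an empty string means there were none
suffix <- substr(clean, 5, 6)
df$col4 <- ifelse(ok & nzchar(suffix), suffix, NA)

df
```

Here col3 stays a character column, as in the output above; `as.integer(df$col3)` would convert it if a numeric column is wanted.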

data

df <- read.table(header = TRUE, text = 'col1   col2
1      "1042AZ"
2      "9523 pa"
3      "dog"
4      "New York"
5      "20000 (usa)"
6      "Outside the country"
7      "1052"')
Maël
  • But I still have the problem of 20000 (usa) being saved as 2000. And what about row 7, column 4? How did that get an NA? – Victor Nielsen Mar 04 '22 at 11:29
  • Not if you use the code I provided. Values are set to NAs if they are not 4 characters, including 20000 which is 5. – Maël Mar 04 '22 at 11:40
  • Ah yes, my bad @Maël. However I do get an error that says: Error in initialize(...) : attempt to use zero-length variable name – Victor Nielsen Mar 04 '22 at 12:31
  • Not sure where that comes from. I added the data I used so that you can completely reproduce my answer. – Maël Mar 04 '22 at 14:30
  • Just writing for others to see. I solved it by deleting the code and rewriting it from scratch. I believe there was a hidden symbol somewhere as there was no difference between the deleted and new one. – Victor Nielsen Mar 04 '22 at 15:52