I have a dataset with a variable that I need to anonymise by recoding it into a new variable. There are 20,000 entries, some of which are duplicated, so my data looks something like this:

DCD97568
DCD23547
DCD27656
DCD27656
DCD87590

The end product I want is a new variable that looks like this:

DCD00001
DCD00002
DCD00003
DCD00003
DCD00004

Thanks!

Update:

I need to deal with some NA entries in the original variable and I want these to be NA in the new variable so this

DCD14579
DCD21548
NA
DCD79131
DCD79131
DCD12313

would become

DCD00001
DCD00002
NA
DCD00003
DCD00003
DCD00004
Nottles82

2 Answers

3

We can do this with sprintf and match:

df1$Col1 <- sprintf("DCD%05d", match(df1$Col1, unique(df1$Col1)))
df1$Col1
#[1] "DCD00001" "DCD00002" "DCD00003" "DCD00003" "DCD00004"

Another option is factor:

with(df1, sprintf("DCD%05d", as.integer(factor(Col1, levels = unique(Col1)))))

data

df1 <- structure(list(Col1 = c("DCD97568", "DCD23547", "DCD27656", "DCD27656", 
"DCD87590")), .Names = "Col1", class = "data.frame",
 row.names = c(NA, -5L))
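
For the NA case raised in the question's update, here is a minimal sketch rather than a definitive fix. It assumes the ids sit in a character vector (the name `x` is hypothetical, and the values are the update's example). Two things go wrong without special handling: `match` treats NA as an ordinary value and gives it a sequence number, and `sprintf` formats a missing integer as the text "NA". So the lookup table is built from the non-missing values only, and the NA positions are restored explicitly afterwards:

```r
# Example values taken from the question's update; `x` is an illustrative name
x <- c("DCD14579", "DCD21548", NA, "DCD79131", "DCD79131", "DCD12313")

# Number the distinct non-missing ids in order of first appearance;
# NA has no entry in the lookup table, so match() returns NA for it
ids <- match(x, unique(x[!is.na(x)]))

# Zero-pad to five digits, keeping NA as a real missing value
new <- ifelse(is.na(ids), NA_character_, sprintf("DCD%05d", ids))
new
```

With a data frame, the same two lines should apply with `df1$Col1` in place of `x`.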
akrun
  • 1
    Nice idea using `sprintf`. I initially thought of the more convoluted: `df %>% mutate(V2 = paste0("DCD", formatC(group_indices_(., .dots = "V1"), width = 5, format = "d", flag = "0")))` – Steven Beaupré May 03 '17 at 16:47
  • I've encountered a problem when the original variable has an NA entry. Both the sprintf and rleid options produce a value in the sequence where the original is NA. Is there a way to incorporate `!is.na` or similar? – Nottles82 May 09 '17 at 10:04
  • @Nottles82 Please update your post with a new example and expected output – akrun May 09 '17 at 10:05
1

Using rleid from data.table (thanks for the comments). The assumption here is that duplicates are adjacent, i.e. the data is already in sequence; otherwise sort the data first:

x <- c("DCD97568",
       "DCD23547",
       "DCD27656",
       "DCD27656",
       "DCD87590")

new <- paste0("DCD000",data.table::rleid(x))

> new
[1] "DCD0001" "DCD0002" "DCD0003" "DCD0003"
[5] "DCD0004"
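
The hard-coded "DCD000" prefix above gives four-digit ids and stops lining up once the id reaches 10. A hedged sketch of one way around that formats the run id with `sprintf("%05d")` instead. To avoid any package dependency, the run ids here are rebuilt with base R `rle`, which is a swapped-in technique and not the `rleid` call used above. The names `r` and `run_id` are illustrative:

```r
x <- c("DCD97568", "DCD23547", "DCD27656", "DCD27656", "DCD87590")

# rle() finds runs of identical adjacent values; expanding the run index
# by each run's length gives one id per original element
r <- rle(x)
run_id <- rep(seq_along(r$lengths), times = r$lengths)

# Pad to five digits so every id has the same width
new <- sprintf("DCD%05d", run_id)
new
```

As with `rleid`, this numbers runs rather than distinct values, so a value that reappears later in unsorted data would get a second id.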
PKumar
  • 1
    Another option is `rle` from `base R` although it is a bit unclear whether the OP meant for run-length or not – akrun May 03 '17 at 16:39
  • 1
    @akrun Thanks for your comments, I understood but I thought for this scenario it fits and your comments are always welcome. Always admired your answers. – PKumar May 03 '17 at 16:41
  • This looks a bit fragile when the `id` scales up. I think it won't pad the data in OP's format. – Steven Beaupré May 03 '17 at 16:43
  • If the data is not in sequence, the table should be sorted for `rleid` to work correctly. – Andrew Lavers May 03 '17 at 16:55