
I have a data frame that looks like this:

[image: the input data frame, with columns at and an]

I need to take the first character of column at, followed by the whole value of column an, and then append a counter that increments for repeated values in column an. The counter must always be three digits long. The end result is this:

[image: the same data frame with the new column ap]

Nothing dramatic here. I was able to do this with the following code (prepare to be impressed):

library(stringr) 
tk <- ""
for (i in 1:nrow(df)){
  if (tk == df$an[i]){
    counter <- counter + 1
  } else {
    tk <- df$an[i]
    counter <- 1
  }
  df$ap[i] <- counter
}

df$ap <- paste0(substr(df$at, 1, 1), df$an, str_pad(df$ap, 3, pad="0"))

I'm really not satisfied with this debacle. It doesn't seem very "R", and I'd very much like it never to see the light of day. How can I make this more "R"?

I appreciate the advice.

Florian
DieselBlue
  • Could you post the `dput(DF)` output for the five-row example? – Frank Jul 21 '17 at 19:52
  • All of these answers are great in helping me understand true R better. I will learn the techniques for each of these. But who gets the coveted 'answer'? I'm inclined to just go with the most upvotes because they all are great... and dplyr wins. – DieselBlue Jul 21 '17 at 20:10

4 Answers

library(stringr)
library(dplyr)

# Number the rows within each value of an, then zero-pad the counter to width 3
df1 <- df %>%
  group_by(an) %>%
  mutate(ap = paste0(substr(at, 1, 1), an, str_pad(row_number(), 3, pad = "0"))) %>%
  ungroup()

     at     an         ap
1   NDA 023356 N023356001
2  ANDA 023357 A023357001
3  ANDA 023357 A023357002
4   NDA 023357 N023357003
5  ANDA 023398 A023398001
CPak

The rleid and rowid functions from data.table can be useful here:

# using df from @Florian's answer
library(data.table)
setDT(df)

df[, v := paste0(
  substr(at, 1, 1), 
  an, 
  sprintf("%03d", rowid(rleid(an)))
)]

#      at     an          v
# 1:  NDA 023356 N023356001
# 2: ANDA 023357 A023357001
# 3: ANDA 023357 A023357002
# 4:  NDA 023357 N023357003
# 5: ANDA 023398 A023398001

How it works:

  • sprintf from base effectively does the job of stringr::str_pad in the OP.
  • rleid groups runs of repeating values together.
  • rowid makes a counter within each group.
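
To make the last two points concrete, here is a minimal sketch on a made-up vector (x is hypothetical, not the OP's data) in which the same value appears in two separate runs. It contrasts the run-based counter with a plain per-value counter. On the five-row example above the two would agree, because the repeats of 023357 are contiguous.

```r
library(data.table)

# Hypothetical vector in which "a" appears in two separate runs
x <- c("a", "a", "b", "a")

run_id      <- rleid(x)       # one id per run of consecutive equal values
run_counter <- rowid(run_id)  # counter that restarts with every run
val_counter <- rowid(x)       # counter per distinct value, ignoring runs

print(data.frame(x, run_id, run_counter, val_counter))
```

Here run_counter restarts at 1 for the final "a", while val_counter continues at 3.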
Frank

In base R, you can use sprintf to pad with zeros and ave to get the counts, like this:

df$ap <- paste0(substr(df$at, 1, 1), df$an,
                sprintf("%03.0f", as.numeric(ave(df$an, df$an, FUN=seq_along))))

Here, ave performs the grouped calculation and seq_along numbers the rows within each group.

This returns

df
    at     an         ap
1  NDA 023356 N023356001
2 ANDA 023357 A023357001
3 ANDA 023357 A023357002
4  NDA 023357 N023357003
5 ANDA 023398 A023398001
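
As a rough illustration of what the ave call contributes, and of how it differs from counting within runs, here is a small base R sketch. The vector an below is made up and deliberately unsorted (it is not the OP's data), and the rle-based line is an alternative idiom shown for comparison, not part of the answer above.

```r
# Hypothetical, unsorted vector: "023357" shows up in two separate runs
an <- c("023356", "023357", "023357", "023398", "023357")

# Counter per distinct value, as with ave(): the last "023357" continues at 3
by_value <- as.integer(ave(an, an, FUN = seq_along))

# Counter per run of consecutive equal values: the last "023357" restarts at 1
by_run <- sequence(rle(an)$lengths)

print(data.frame(an, by_value, by_run))
```

If the data is sorted by an, as in the example, both counters give the same result.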
lmo
  • OP mentions "a counter on the end that increments for repeats in column an" and also groups by repeats with their loop, but your approach just works with values for grouping, not repeats of values. Probably their data is sorted and what I'm saying here doesn't actually matter for them, though. – Frank Jul 21 '17 at 20:06
  • @Frank Thanks for the heads up. I didn't get the added complexity on the first read of the post and it isn't in the example, but I'll take a second look this weekend. – lmo Jul 21 '17 at 20:17

This works:

library(stringr)
df <- data.frame(at = c("NDA", "ANDA", "ANDA", "NDA", "ANDA"),
                 an = c("023356", "023357", "023357", "023357", "023398"),
                 stringsAsFactors = FALSE)

# ave() numbers rows within each value of an; str_pad() zero-pads to width 3
df$ap <- paste0(substr(df$at, 1, 1), df$an,
                str_pad(ave(df$an, df$an, FUN = seq_along), width = 3, pad = "0"))

Output:

    at     an         ap
1  NDA 023356 N023356001
2 ANDA 023357 A023357001
3 ANDA 023357 A023357002
4  NDA 023357 N023357003
5 ANDA 023398 A023398001

Hope this helps!

Florian