I have data that looks like this:

df<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L, 
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG", 
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2  ", 
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV", 
"VCMCVVDDNR", "YATTA"), class = "factor")), class = "data.frame", row.names = c(NA, 
-13L))

I am trying to count letter frequencies. There are 20 possible letters which I want to count in each row.

For example,

  1. The first row starts with `sp|`, so character frequencies are not calculated and the result is the original string.
  2. The second row doesn't start with `sp|`, so it shows the character frequencies:
MGSSN  2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

which means there are 2 S, 1 M, 1 G, and 1 N, and the remaining letters have a count of 0.

The character frequencies are sorted in descending order.

The final output would look like the following:

output<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L, 
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG", 
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2  ", 
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV", 
"VCMCVVDDNR", "YATTA"), class = "factor"), Col2 = structure(c(8L, 
2L, 3L, 2L, 2L, 2L, 2L, 1L, 7L, 5L, 6L, 5L, 4L), .Label = c("1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0", 
"2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0", 
"2,2,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0", 
"3,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2  ", 
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1"), class = "factor")), class = "data.frame", row.names = c(NA, 
-13L))
Heikki
Learner
  • It's not clear what criteria you're using. Could you elaborate more? – NelsonGon Apr 19 '19 at 17:08
  • Try `df %>% mutate(col = case_when(!str_detect(col, "^sp" ) ~ str_count(col, LETTERS) %>% str_c(collapse=", "), TRUE ~ as.character(col)))` – akrun Apr 19 '19 at 17:08

1 Answer

We can use `str_count` from stringr:

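library(stringr)
# TRUE for rows that do not start with "sp"
i1 <- !grepl("^sp", df$col)
# count each letter, sort the counts in descending order, and collapse to a string
df$col2[i1] <- sapply(as.character(df$col[i1]), function(x)
     paste(sort(str_count(x, LETTERS), decreasing = TRUE), collapse=", "))
# copy the "sp" rows unchanged; as.character() avoids inserting factor codes
df$col2[!i1] <- as.character(df$col[!i1])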
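
Note that `LETTERS` has 26 entries, so `str_count(x, LETTERS)` returns 26 counts per row, while the question asks for 20. Below is a minimal base R sketch of one way to get exactly 20 counts. It assumes the "20 possible letters" are the one-letter codes of the 20 standard amino acids, which the question does not state; `aa_letters`, `letter_counts` and `add_counts` are hypothetical names introduced only for this illustration.

```r
# Assumption: the "20 possible letters" are the one-letter codes of the
# 20 standard amino acids; swap in another alphabet if that is wrong.
aa_letters <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Hypothetical helper: sorted letter counts for one string, e.g. "2,1,1,...".
letter_counts <- function(x, alphabet = aa_letters) {
  chars <- strsplit(x, "")[[1]]
  # factor() with fixed levels keeps a zero count for every absent letter;
  # characters outside the alphabet become NA and are dropped by table()
  counts <- table(factor(chars, levels = alphabet))
  paste(sort(as.integer(counts), decreasing = TRUE), collapse = ",")
}

# Hypothetical wrapper: rows matching `skip` are copied over unchanged,
# all other rows get their sorted letter counts.
add_counts <- function(dat, col = "col", skip = "^sp\\|") {
  values <- as.character(dat[[col]])   # also handles factor columns
  is_seq <- !grepl(skip, values)
  dat$col2 <- values
  dat$col2[is_seq] <- vapply(values[is_seq], letter_counts, character(1),
                             USE.NAMES = FALSE)
  dat
}

example <- data.frame(col = c("sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1",
                              "MGSSN", "S"))
add_counts(example)$col2
```

With the question's data this would be called as `add_counts(df)`. Using `factor()` with fixed levels is a design choice that guarantees one count per letter of the alphabet, so every non-`sp|` row yields the same number of values.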

Or, instead of collapsing the counts into a string, they can be kept in a list column:

library(tidyverse)
df %>%
    mutate(col = as.character(col),
           col2 = map(col, ~ if (str_detect(.x, "^sp")) .x
                             else str_count(.x, LETTERS) %>%
                                 sort(decreasing = TRUE)))
akrun
  • Is it possible, with the first solution, to get the `sp|` line in the second column instead of the numbers? – Learner Apr 19 '19 at 20:31
  • @Learner Did you mean the characters after the `sp|` in the second column? – akrun Apr 20 '19 at 03:16
  • Yes, that `sp|` line – Learner Apr 20 '19 at 14:53
  • @Learner Can you update the expected output in your post? Just to check what you wanted – akrun Apr 20 '19 at 14:57
  • I have shown the output above. Do you see it? I just want the complete `sp|` line in the new column that we create – Learner Apr 21 '19 at 23:44
  • @Learner In the first solution, just change the last line to `df$col2[!i1] <- as.character(df$col[!i1])` (if I understand you correctly) – akrun Apr 22 '19 at 03:32
  • I have one more request. Is there any way to turn the first option into a function? I tried but could not figure out how. Thanks a bunch – Learner Apr 22 '19 at 21:34
  • @Learner That is easy: `f1 <- function(dat, colN, pat){i1 <- !grepl(pat, dat[[colN]]);dat$col2[i1] <- sapply(as.character(dat[[colN]][i1]), function(x) paste(sort(str_count(x, LETTERS), decreasing = TRUE), collapse= ", ")); dat$col2[!i1] <- as.character(dat[[colN]][!i1]); dat}` and then call it as `f1(df, "col", "^sp")` – akrun Apr 23 '19 at 02:37