I have data that looks like this:
df<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L,
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG",
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV",
"VCMCVVDDNR", "YATTA"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
I am trying to count letter frequencies. There are 20 possible letters which I want to count in each row.
For example,
- the first row starts with `sp|`, so character frequencies are not calculated and the result is the original string
- the second row doesn't start with `sp|`, so it will show character frequencies:
MGSSN 2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
which means there are 2 S, 1 M, 1 G, and 1 N, and the other 16 letters have a count of 0.
The character frequencies are sorted in descending order.
The final output would look like the following:
output<-structure(list(col = structure(c(9L, 2L, 13L, 11L, 5L, 7L, 10L,
6L, 8L, 3L, 12L, 4L, 1L), .Label = c("HHRGGVCTS", "MGSSN", "MVKTTYYDVG",
"RRHYNGAYDD", "RTSTN", "S", "SNCWC", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "THYDT", "TVHAV",
"VCMCVVDDNR", "YATTA"), class = "factor"), Col2 = structure(c(8L,
2L, 3L, 2L, 2L, 2L, 2L, 1L, 7L, 5L, 6L, 5L, 4L), .Label = c("1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
"2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0",
"2,2,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0", "2,2,2,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0",
"3,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0", "sp|P31689|DNJA1_HUMAN DnaJ homolog GN=DNAJA1 PE=1 SV=2 ",
"sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1"), class = "factor")), class = "data.frame", row.names = c(NA,
-13L))
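To make the rule above concrete, here is a minimal base R sketch of one possible approach. It assumes the 20 letters are the standard one-letter amino-acid codes (the question does not list them), and `count_letters` and `example` are hypothetical names introduced only for illustration.

```r
# Assumption: the "20 possible letters" are the standard amino-acid codes
aa <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
        "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

# Hypothetical helper: returns the input unchanged when it starts with "sp|",
# otherwise the 20 letter counts sorted in descending order, comma-separated
count_letters <- function(x) {
  if (startsWith(x, "sp|")) return(x)
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  # factor() with fixed levels yields a count for every letter, including zeros
  counts <- table(factor(chars, levels = aa))
  paste(sort(as.integer(counts), decreasing = TRUE), collapse = ",")
}

# Small illustrative data frame using three of the strings from the question
example <- data.frame(
  col = c("sp|Q9H9K5|MER34_HUMAN Endogenous PE=1 SV=1", "MGSSN", "S"),
  stringsAsFactors = TRUE
)

# as.character() is needed because the column is a factor
example$Col2 <- vapply(as.character(example$col), count_letters,
                       character(1), USE.NAMES = FALSE)
print(example)
```

The same `vapply()` call should work on the full data as `df$Col2 <- vapply(as.character(df$col), count_letters, character(1), USE.NAMES = FALSE)`. Note that any character outside the assumed 20 letters is silently ignored by this sketch.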