
I have a data frame grouped by the column "id". I want to split it into separate data frames depending on a substring in the column "alteration": one for groups where some row contains "intermediate", one for groups where some row contains "high", and one for groups where neither substring occurs.

Here is a sample data frame:

df <- data.frame(id= c(1, 1, 2, 2, 3, 3),
  disease = c("brain", "brain", "neck", "neck","breast", "breast"),
  status = c("yes", "yes","no","no","yes","yes"),
  gene = c("P53","TMB","ATM","TMB","RAF","NFKB"),
  alteration = c("TP53Y","TMB-intermediate","TATMY","TMB-high","TRAFY","TNFKBY"))

which results in this data frame:

id  disease status  gene    alteration
1   brain   yes     P53     TP53Y
1   brain   yes     TMB     TMB-intermediate
2   neck    no      ATM     TATMY
2   neck    no      TMB     TMB-high
3   breast  yes     RAF     TRAFY
3   breast  yes     NFKB    TNFKBY

Expected output should be three data frames:

dfIntermed

id  disease status  gene    alteration
1   brain   yes     P53     TP53Y
1   brain   yes     TMB     TMB-intermediate

dfHigh

id  disease status  gene    alteration
2   neck    no      ATM     TATMY
2   neck    no      TMB     TMB-high

dfNo (this data frame contains no information about TMB within group)

id  disease status  gene    alteration
3   breast  yes     RAF     TRAFY
3   breast  yes     NFKB    TNFKBY

EDIT

Another post suggests the use of split(). When I split the data frame using the code:

out <- split(df, f = df$alteration )
out[[1]]

I get back six data frames, one per distinct value of alteration, but I can't work out how to match substrings in the f argument. Is it possible to grep for 'high' or 'intermediate' within split?

EDIT II

I can split in combination with grepl, but this returns only the single matching rows and not the whole group:

outB <- split(df, list(df$id, grepl("high", df$alteration)))
outB[[2]]
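
One possible way to keep whole groups together is to first work out which ids contain each substring, label every row by its group's category, and only then call split(). The following is a minimal sketch in base R, not necessarily the approach from the other post. The names ids_high, ids_intermed, and category are made up for illustration, and it assumes that a group containing both substrings should count as "high".

```r
df <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  disease = c("brain", "brain", "neck", "neck", "breast", "breast"),
  status = c("yes", "yes", "no", "no", "yes", "yes"),
  gene = c("P53", "TMB", "ATM", "TMB", "RAF", "NFKB"),
  alteration = c("TP53Y", "TMB-intermediate", "TATMY", "TMB-high", "TRAFY", "TNFKBY")
)

# ids that have at least one row matching each substring
ids_high     <- unique(df$id[grepl("high", df$alteration)])
ids_intermed <- unique(df$id[grepl("intermediate", df$alteration)])

# Label every row by the category of its whole group.
# Assumption: "high" takes precedence if a group matches both substrings.
category <- ifelse(df$id %in% ids_high, "high",
                   ifelse(df$id %in% ids_intermed, "intermediate", "none"))

# Split on the per-group label so rows with the same id stay together
out <- split(df, category)

dfHigh     <- out[["high"]]
dfIntermed <- out[["intermediate"]]
dfNo       <- out[["none"]]
```

If no group falls into a category, the corresponding element of out does not exist and the extracted data frame is NULL, so that case may need separate handling.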

EDIT III

Issue resolved in another post


Zipfer
  • Look at `split()` function. – tmfmnk Aug 20 '19 at 11:42
  • Hi tmfmnk, do you know how to grep for 'high' and 'intermediate' within split? Actually, this question is not a duplicate as marked by Sotos. – Zipfer Aug 20 '19 at 12:07
  • I think the question is not a duplicate of the linked question and is not solved by `split` alone: Zipfer wants to split the data frame so that observations with the same ID stay together, but the information that determines the split is present in only one observation per ID. – shs Aug 20 '19 at 12:10
  • You can try `split(df, grepl("high", df$alteration))`. – tmfmnk Aug 20 '19 at 12:10
  • @tmfmnk thanks, this returns only a single row containing 'high' in alteration versus the rest of the data frame. – Zipfer Aug 20 '19 at 12:23
