0

although I searched long for solutions, e.g. Assign value to group based on condition in column

I am not able to solve the following problem and would appreciate greatly any help!

I have the following data frame (in reality, many more with thousands of rows):

df <- data.frame(ID1 = c(1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,7,7), 
             ID2 = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), 
             Percentage = c(0,10,NA,65,79,81,52,0,0,11,12,35,0,24,89,76,0,NA,59,16), 
             Group_expected_result = c(6,6,6,7,7,7,7,1,1,3,3,3,4,4,4,5,5,5,2,2))

What I want to do is to assign a group type from 1 to 7 to each group as indicated by ID1. Which group type should be assigned is dependent on the conditions of column 3, Percentage (can have values from 0-100) and is split into seven types:

Type 1 has a percentage of 0, i.e.

  • Type 1 = 0
  • Type 2 > 0 & < 10
  • Type 3 > 9 & < 20
  • Type 4 > 19 & < 30
  • Type 5 > 29 & < 40
  • Type 6 > 39 & < 50
  • Type 7 > 49

The combination of these types (above) defines the group type (G1-G7) below:

  • G1 = only T7
  • G2 = only T7 & T2-T6
  • G3 = only T2-T6
  • G4 = at least one T1, & one T2-T6, & one T7 (= all)
  • G5 = only T7 & T1
  • G6 = only T2-T6 & T1
  • G7 = only T1

The expected result is in the last column of the sample data frame, e.g. the first group consists of types T1 and T2, therefore should be group type G6.

So, the question is how to get the expected result in the last column? I hope I made the problem clear! Thanks in advance!

Community
  • 1
  • 1
litotes
  • 89
  • 1
  • 12

1 Answers1

0

Try this:

myType <- function(x) {
    if (is.na(x) || x==0) {
        return(1L)
    } else if (x < 50) {
        return(2L)
    } else {
        return(3L)
    }
}

myGroup <- function(myDf) {
    myIds <- unique(myDf$ID1)
    myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
    assignG <- vector(mode = "integer", length=nrow(myDf))
    vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)

    for (i in myIds) {
        myV <- which(myDf[,1L]==i)
        testV <- sort(unique(vT[myV]))
        assignG[myV] <- which(vapply(myGs, function(x) identical(x,testV), TRUE, USE.NAMES = FALSE))
    }

    myDf$myResult <- assignG
    myDf
}

Calling it, we obtain:

myGroup(df,7)
   ID1 ID2 Percentage Group_expected_result myResult
1    1   1          0                     6        6
2    1   2         10                     6        6
3    1   3         NA                     6        6
4    2   4         65                     7        7
5    2   5         79                     7        7
6    2   6         81                     7        7
7    2   7         52                     7        7
8    3   8          0                     1        1
9    3   9          0                     1        1
10   4  10         11                     3        3
11   4  11         12                     3        3
12   4  12         35                     3        3
13   5  13          0                     4        4
14   5  14         24                     4        4
15   5  15         89                     4        4
16   6  16         76                     5        5
17   6  17          0                     5        5
18   6  18         NA                     5        5
19   7  19         59                     2        2
20   7  20         16                     2        2

Here is a less intuitive, but more efficient solution.

myGroup2 <- function(myDf) {
    myIds <- unique(myDf$ID1)
    AltGs <- c(G1=2L, G2=7L, G3=3L, G4=9L, G5=6L, G6=5L, G7=4L)
    assignG <- vector(mode = "integer", length=nrow(myDf))
    vT <- vapply(myDf[,"Percentage"], function(x) myType(x), 1L)

    for (i in myIds) {
        myV <- which(myDf[,1L]==i)
        testV <- unique(vT[myV])
        assignG[myV] <- which(AltGs==(length(testV)+sum(testV)))
    }

    myDf$myResult <- assignG
    myDf
}

It is about twice as fast.

microbenchmark(t1=myGroup(df,7), t2=myGroup2(df,7))
Unit: microseconds
 expr     min      lq     mean   median      uq      max neval
   t1 692.117 728.4470 779.6459 748.562 819.170 1018.060   100
   t2 320.608 340.3115 390.7098 351.395 414.203 1781.195   100

You can obtain AltGs above by running the following:

myGs <- list(G1=1L, G2=2:3, G3=2L, G4=1:3, G5=c(1L,3L), G6=1:2, G7=3L)
AltGs <- vapply(myGs, function(x) length(x)+sum(x), 2L, USE.NAMES = FALSE)
Joseph Wood
  • 7,077
  • 2
  • 30
  • 65
  • Thanks very much, Joseph! The code runs perfectly on the dummy data. However, if applied to the real data, the only result is "0". I think the problem is that the ID1 is not an integer, but a 50-digit hexadecimal string, which causes the myV <- which(myDF[, 1L] ==i) to not evaluate to true. But I am not sure, because your code is out of my league. Do you have any ideas on how to solve that? Also, the line vT <- vapply(myDf[,3L] ... should be adapted to the column that has the percentages in it, so e.g. myDf[,144L], correct? And: how to adapt the myType function if NA could occur in every type? – litotes Sep 09 '16 at 19:37
  • @litotes, you are right about the line changing myDf[,3L] to myDf[,144L]. Alternatively, you could use myDf[,"Percentage"]. To correct the first part, you need to loop over the unique values in ID1. I'll update my code to be more general. – Joseph Wood Sep 09 '16 at 19:42
  • @litotes, give it a try now. Hope it helps!! – Joseph Wood Sep 09 '16 at 19:46
  • Thanks, works perfectly! I forgot to change the ID1 column name to the real one! Wonderful help!! – litotes Sep 10 '16 at 10:02