
This is what my dataframe looks like. The two rightmost columns are my desired columns: I am counting the cumulative number of unique FundTypes as of each row. The 4th column is the cumulative unique count across all ActivityTypes, and the 5th column is the cumulative unique count for rows where ActivityType == "Sale" only.

dt <- read.table(text='

Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         

John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                    

                         ', header=T, row.names = NULL)

I have tried dt[, UniqueFunds := cumsum(!duplicated(FundType) & !FundType == ""), by = Name], but it counts, for example, a & a;b & c;d as 3 unique values instead of the desired 4, because the fund codes within FundType are separated by semicolons. Kindly let me know of a solution.
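For example (a minimal illustration, not code from my real data, using the first three FundType values above): duplicated() on the raw strings sees three distinct values, whereas splitting on the semicolon gives the four distinct funds I actually want counted.

x <- c("a", "a;b", "c;d")
sum(!duplicated(x))                       # 3 distinct strings
length(unique(unlist(strsplit(x, ";"))))  # 4 distinct funds: a, b, c, d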

UPDATE: My real dataset looks more like this:

dt <- read.table(text='

    Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
    John       Email               ""           0                     0
    John       Conference          ""           0                     0
    John       Email               a            1                     0
    John       Sale                a;b          2                     2 
    John       Webinar             c;d          4                     2
    John       Sale                b            4                     2
    John       Webinar             e            5                     2
    John       Conference          b;d          5                     2
    John       Sale                b;e          5                     3
    John       Email               ""           5                     3
    John       Webinar             ""           5                     3
    Tom        Email               a            1                     0
    Tom        Sale                a;b          2                     2 
    Tom        Webinar             c;d          4                     2
    Tom        Sale                b            4                     2
    Tom        Webinar             e            5                     2
    Tom        Conference          b;d          5                     2
    Tom        Sale                b;e;f        6                     4                    

                             ', header=T, row.names = NULL)

The cumulative unique counts need to take these missing values into account.

gibbz00
  • `FundCode` is not part of your data frame; can you clarify this? – steveb Dec 29 '15 at 00:25
  • It was an error. Thank you for catching it. I fixed it. – gibbz00 Dec 29 '15 at 00:51
  • Maybe I'm missing something but for `Name == "John" & ActivityType == "Sale"` there *are* only 3 unique `FundType`s - `a;b` (row 2), `b` (row 4), and `b;e` (row 7), no? Unless you were referring to something else here: *"but it counts a;b and b;c and c;d as 3 unique values"*? – nrussell Dec 29 '15 at 01:23
  • @nrussell you are right, my example was outside the context of the dataframe I presented. It was more of a general example. I can fix it if you think what I have in the post is confusing. – gibbz00 Dec 29 '15 at 01:26
  • It would probably help to clarify that - or possibly just edit the example data by changing row 4 from `b` to `b;c` for `FundType` and including the code you used to generate the fourth and fifth columns. – nrussell Dec 29 '15 at 01:29
  • I edited it. The 4th and 5th columns are my desired columns. I don't have the code (don't know how) to generate those target columns. I tried to articulate in the post the logic by which the columns should be derived. – gibbz00 Dec 29 '15 at 01:31

3 Answers


nrussell suggested a concise solution using a custom function; let me add what I came up with. I tried to use cumsum() and duplicated() as you did, in two major operations: one for alltype and the other for saleonly. First, I created row indices for each Name. Then, I split FundType and reshaped the data into long format with cSplit() from the splitstackshape package, taking the cumulative count of non-duplicated funds within each Name. Then, I chose the last row for each index number for each Name. Finally, I kept only one column, alltype.

library(splitstackshape)
library(zoo)
library(data.table)

# Row index within each Name
setDT(dt)[, ind := 1:.N, by = "Name"]
# All types: split FundType into long format, take the cumulative count of
# non-duplicated funds within each Name, then keep the last split row per original row
cSplit(dt, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"][,
    .SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype

The second operation was for saleonly. Basically, I applied the same approach to a subset of the data containing only the Sale rows (ana). I also created a data set without the Sale rows (ana2). Then, I put the two data sets in a list (l) and bound them with rbindlist(). I reordered the result by Name and ind, took the last row for each Name and index number, handled the NAs (carrying values forward and replacing each Name's leading NAs with 0), and finally kept one column. The last step was to combine the original dt, alltype, and saleonly.

# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2 

# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)

    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email        a                     1                     0   1       1        0
 2: John         Sale      a;b                     2                     2   2       2        2
 3: John      Webinar      c;d                     4                     2   3       4        2
 4: John         Sale        b                     4                     2   4       4        2
 5: John      Webinar        e                     5                     2   5       5        2
 6: John   Conference      b;d                     5                     2   6       5        2
 7: John         Sale      b;e                     5                     3   7       5        3
 8:  Tom        Email        a                     1                     0   1       1        0
 9:  Tom         Sale      a;b                     2                     2   2       2        2
10:  Tom      Webinar      c;d                     4                     2   3       4        2
11:  Tom         Sale        b                     4                     2   4       4        2
12:  Tom      Webinar        e                     5                     2   5       5        2
13:  Tom   Conference      b;d                     5                     2   6       5        2
14:  Tom         Sale    b;e;f                     6                     4   7       6        4

EDIT

For the new data set, I tried the following. The revision is only in the alltype part; the sale part has no changes. Basically, I applied my saleonly approach to the alltype computation. First, I added indices, replaced "" with NA, and subsetted the data to the rows with non-NA FundType; this is temp. Computing alltype from temp is the same as in the previous answer. Since I also wanted the rows with NA in FundType back, I recovered them with setdiff(), combined the two data sets with rbindlist() (reusing the name temp), and from there proceeded as before: fill the NAs forward and replace each Name's leading NAs with 0. I hope this will work for your real data.

### all type

# Add row indices, replace "" with NA, and keep only the rows with non-NA FundType
setDT(dt)[, ind := 1:.N, by = "Name"][,
    FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp
# As before: long format, cumulative count of non-duplicated funds per Name
cSplit(temp, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype

# Recover the rows with NA FundType via setdiff(), bind them back to the alltype data,
# then take the last row per Name/ind, fill forward, and replace leading NAs with 0
whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
    alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
    alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype


### sale only
# Data for sale only (unchanged from the previous answer)
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2

# Combine ana and ana2, then take the last row per Name/ind and fill NAs
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)


    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email       NA                     0                     0   1       0        0
 2: John   Conference       NA                     0                     0   2       0        0
 3: John        Email        a                     1                     0   3       1        0
 4: John         Sale      a;b                     2                     2   4       2        2
 5: John      Webinar      c;d                     4                     2   5       4        2
 6: John         Sale        b                     4                     2   6       4        2
 7: John      Webinar        e                     5                     2   7       5        2
 8: John   Conference      b;d                     5                     2   8       5        2
 9: John         Sale      b;e                     5                     3   9       5        3
10: John        Email       NA                     5                     3  10       5        3
11: John      Webinar       NA                     5                     3  11       5        3
12:  Tom        Email        a                     1                     0   1       1        0
13:  Tom         Sale      a;b                     2                     2   2       2        2
14:  Tom      Webinar      c;d                     4                     2   3       4        2
15:  Tom         Sale        b                     4                     2   4       4        2
16:  Tom      Webinar        e                     5                     2   5       5        2
17:  Tom   Conference      b;d                     5                     2   6       5        2
18:  Tom         Sale    b;e;f                     6                     4   7       6        4
jazzurro
  • Thank you so much for your answer! – gibbz00 Dec 29 '15 at 03:19
  • @gibbz00 Pleasure to help you. :) – jazzurro Dec 29 '15 at 03:20
  • My original dataframe was 3M+ rows. "alltype" and "saleonly" come out to around 1 million rows. Not sure how the final cbind will work. The FundType column in the original dataframe has a lot of missing values, so for example alltype starts with 1122233 etc. rather than 00001122333. How do I modify your solution to take all the missing values into account? – gibbz00 Dec 29 '15 at 04:30
  • @gibbz00 Would you be able to create a small example capturing the nature of your real data? Then, SO users can deliver better solutions. If you can add another data set as an edit part, that will be great. – jazzurro Dec 29 '15 at 04:39
  • I just edited it. I hope it better represents the problem. – gibbz00 Dec 29 '15 at 04:44
  • @gibbz00 You may want to leave the previous data set as well since all answers are based on that. – jazzurro Dec 29 '15 at 04:47
  • @gibbz00 I did what I could for now with the new data set. I hope this will help you with the large data set. – jazzurro Dec 29 '15 at 05:34

I think this is one way to achieve what you are after. Start by adding an auxiliary index variable to maintain the input order, and keying on Name:

Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"]
setkeyv(Dt, "Name") 

For clarity, I used this function

n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl)))) 

rather than typing out the body's expression on the fly - the operation below is dense enough as it is without a bunch of nested function calls convoluting things.
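For example, a quick sanity check of n_usplit on a few semicolon-separated values (assuming character input, as fread returns):

n_usplit(c("a", "a;b", "c;d"))
# [1] 4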

And finally,

Dt[Dt, allow.cartesian = TRUE][
  gIdx <= i.gIdx, 
  .("UniqueFunds(AllTypes)" = n_usplit(FundType),
    "UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])),
  keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE]
#      Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
# 1:   John          Email          a                     1                     0
# 2:   John           Sale        a;b                     2                     2
# 3:   John        Webinar        c;d                     4                     2
# 4:   John           Sale          b                     4                     2
# 5:   John        Webinar          e                     5                     2
# 6:   John     Conference        b;d                     5                     2
# 7:   John           Sale        b;e                     5                     3
# 8:    Tom          Email          a                     1                     0
# 9:    Tom           Sale        a;b                     2                     2
# 10:   Tom        Webinar        c;d                     4                     2
# 11:   Tom           Sale          b                     4                     2
# 12:   Tom        Webinar          e                     5                     2
# 13:   Tom     Conference        b;d                     5                     2
# 14:   Tom           Sale      b;e;f                     6                     4

I feel like I could explain this more easily with SQL, but here we go:

  1. Join Dt on itself (by Name)
  2. Using the extra index column (gIdx), only consider previous (inclusive) rows in sequence - this results in a sort of cumulative effect, for lack of a better term (see the small check after this list)
  3. Calculate the UniqueFunds(...) columns - notice the extra subsetting done in the second case - n_usplit(FundType[ActivityType == "Sale"])
  4. Drop the extraneous index column (i.gIdx).
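As a small illustration of step 2 (my own check, using the Dt and n_usplit defined above): for John's third row, only the rows with gIdx <= 3 survive the join filter, so the funds seen so far come from a, a;b, and c;d, i.e. 4 distinct funds.

Dt[Name == "John" & gIdx <= 3, n_usplit(FundType)]
# [1] 4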

I'm not sure how this will scale due to the use of a cartesian join, so hopefully your real data set isn't millions of rows.


Data:

library(data.table)
##
dt <- fread('
Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                     
            ', header = TRUE)
nrussell
  • Thank you so much for your answer! – gibbz00 Dec 29 '15 at 02:34
  • I have to turn the "FundType" column from factor to character, right? This is what I had to do for the final code to work. – gibbz00 Dec 29 '15 at 03:00
  • @gibbz00 Yes, sorry I assumed they were not factors (I have this feature turned off by default in my sessions). – nrussell Dec 29 '15 at 03:02
  • Why are we keying by ["Name,i.gIdx,i.ActivityType,i.FundType"] in the final line of the code? Thanks. – gibbz00 Dec 29 '15 at 03:09
  • So that those columns are preserved in addition to the two calculated fields. – nrussell Dec 29 '15 at 03:11
  • Your solution is awesome. Unfortunately my dataframe has 3M+ rows and I am getting cannot allocate vector of size 2.4 Gb despite having 32GB of memory. Kindly let me know if you can think of a workaround – gibbz00 Dec 29 '15 at 03:34

I achieved what you're looking for as follows:

library(data.table)
library(stringr)
dt <- data.table(read.table(text='

                 Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
                 John       Email               a            1                     0
                 John       Sale                a;b          2                     2 
                 John       Webinar             c;d          4                     2
                 John       Sale                b            4                     2
                 John       Webinar             e            5                     2
                 John       Conference          b;d          5                     2
                 John       Sale                b;e          5                     3
                 Tom        Email               a            1                     0
                 Tom        Sale                a;b          2                     2 
                 Tom        Webinar             c;d          4                     2
                 Tom        Sale                b            4                     2
                 Tom        Webinar             e            5                     2
                 Tom        Conference          b;d          5                     2
                 Tom        Sale                b;e;f        6                     4                    

                 ', header=T, row.names = NULL))

dt[,UniqueFunds.AllTypes. := NULL][,UniqueFunds.SaleOnly. := NULL]

#Get the different Fund Types
vals <- unique(unlist(str_extract_all(dt$FundType,"[a-z]")))

#Construct a new set of columns indicating which fund types are present
dt[,vals:=data.table(1*t(sapply(FundType,str_detect,vals))),with=FALSE]

#Calculate UniqueFunds.AllTypes
dt[, UniqueFunds.AllTypes. := 
     rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name]

#Calculate only when ActivityType == "Sale" and use cummax to achieve desired output
dt[,UniqueFunds.SaleOnly. := 0
   ][ActivityType == "Sale", UniqueFunds.SaleOnly. := 
     rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name
   ][,UniqueFunds.SaleOnly. := cummax(UniqueFunds.SaleOnly.), by = Name
     ]

#Cleanup vals
dt[,vals := NULL, with = FALSE]
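In case the cummax step isn't obvious: each indicator column flips to 1 in the row where that fund first appears for a Name, and cummax() keeps it at 1 from then on, so rowSums() over the cummax'ed indicators is the running count of distinct funds seen so far. A tiny sketch of the idea on a single made-up indicator column:

# fund "a" appears in rows 1 and 2 only, but counts as "seen" from row 1 onward
cummax(c(1, 1, 0, 0, 0, 0, 0))
# [1] 1 1 1 1 1 1 1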
NGaffney