0

Training on some marketing analytics.

In this dataset:

head(software_data)
                               id   Usage         software_v     
                                1   2011-12-01          v12
                                2   2011-12-01          v12
                                3   2011-12-01          v12 
                                4   2011-12-01          v12 
                                5   2011-12-02          v12 
                                6   2011-12-02          v12 

How to find the Active users in a month (MAU) per software version?

Considering that active is when a person uses the software more than once in a month.

I did this:

table(software_data$software_v)

which returns the number of people per version but not according to the condition of more than once per month and not unique as it should. Also, if you insert unique in front of the previous line it doesn't give unique but the same result as without it.

smci
  • 32,567
  • 20
  • 113
  • 146
J_p
  • 435
  • 1
  • 5
  • 16
  • 4
    Are users identified by the id? If so, your example has _no_ users mentioned more than once. Not much of an example top test on. – G5W Feb 04 '18 at 19:45
  • Good point. Actually i have joined two datasets so the id appears more than once. – J_p Feb 04 '18 at 19:45
  • This is called ***aggregate***. There are lots of existing duplicates. A couple of the many related questions: [Summary data tables from wide data.frames](https://stackoverflow.com/questions/14749237/summary-data-tables-from-wide-data-frames), [Get ID by the group and then count unique value for these IDs](https://stackoverflow.com/questions/42667475/get-id-by-the-group-and-then-count-unique-value-for-these-ids) – smci Feb 04 '18 at 23:52
  • Near-duplicate of [Summary data tables from wide data.frames](https://stackoverflow.com/questions/14749237/summary-data-tables-from-wide-data-frames) – smci Feb 04 '18 at 23:56

3 Answers3

0

In such data manipulation I like to use a SQL type query that is very much adapted to this request. I would do the following by using SQLDF package.

library(sqldf)
sqldf("
      select id, strftime('%m', Usage) as month, software_v, count(strftime('%m', Usage)) as count
      from software_data
      group by id, strftime('%m', Usage), software_v
      having count > 1
      ")
stuski
  • 199
  • 1
  • 11
  • How would you plot this though so it depicts what it should? – J_p Feb 05 '18 at 09:32
  • You never asked for a plot. You asked to extract the data based on the specs. Why plotting? – stuski Feb 05 '18 at 14:32
  • You should know that it includes lots of years and by taking only the month you lose track . How would you take also the year. – J_p Feb 05 '18 at 14:49
0

If your Usage column is a date you can do

require(dplyr);require(lubridate)
df %>% 
    group_by(id, month(Usage), software_v) %>% 
    filter(n() > 1) %>% 
    ungroup() %>% 
    group_by(month(Usage), software_v) %>% 
    mutate(MAU = n())

Or you can use sql. @stusky has the right idea but doesn't actually compute MAUs. You can do

library(sqldf)
sqldf("
select      count(*) as MAU
            , month
            , software_v
from        (
            select      id
                        , strftime('%m', Usage) as month
                        , software_v
                        , count(strftime('%m', Usage)) as count
            from        df
            group by    id, 
                        strftime('%m', Usage), 
                        software_v
            )
where       count > 1
group by    month
            , software_v
      ")
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
0

EDIT: Realised it could be simpler.

2nd EDIT: Adjusted to account for month as well.

First let's grab a year-month variable from your data.

If it isn't already make sure the Usage column is of type Date.

software_data$Usage <- as.Date(software_data$Usage)

Once it's of type date we can compress this to a Year-Month Date column.

software_data$Usage_Year_Month <- format(software_data$Usage, format = "%Y-%m")

Once your dataframe is in this form, from this to the output I have at mau2 is just 3 lines of code.

mau <- ddply(software_data, c("id", "software_v", "Usage_Year_Month"), nrow)

mau <- mau[mau[,4] > 1,]

mau2 <- ddply(mau, c("software_v", "Usage_Year_Month"), nrow)

Now let me explain that.

We can use ddply (from the plyr package) to apply the nrow function to each subset of the data. So we subset on id, software_v and the variable we created Usage_Year_Month, then our function returns the number of rows in that subgroup. Then at the end we just need to filter so we only keep those rows with a value of greater than 1.

mau <- ddply(software_data, c("id", "software_v"), numcolwise(length))

mau <- mau[mau[,4] > 1,]

I've set up a mock example of your data as below (just picked an arbitrary sequence of dates for Usage).

id = round(runif(100)*5)
id = c(id, seq(6,10))
Usage = seq(as.Date("2011-12-01"), as.Date("2011-12-01")+length(id)-1, by = "+1 day")
software_v = round(runif(length(id))*3)

software_data <- data.frame(id, Usage, software_v)

software_data$Usage_Year_Month <- format(software_data$Usage, format = "%Y-%m")

Using this input the code produces the below.

Output from function

The V1 column contains the number of use cases for each unique id, software_v and Usage_Year_Month grouping. If you just want the unique ids for which you have more than 1 use case, just use unique(mau$id).

If you then want this by software version and month, let's just go one more round with ddply.

mau2 <- ddply(mau, c("software_v", "Usage_Year_Month"), nrow)

Output of function extension

In this output software_v is the unique software version, Usage_Year_Month is the matching Year and Month, and V1 holds the number of unique users who used this version more than once for that particular software version in that particular month.

LachlanO
  • 1,152
  • 8
  • 14
  • It doesn't include the month – J_p Feb 05 '18 at 09:15
  • I've adjusted the answer so that it will take care of months now too. I essentially just added an extra filter in `ddply`. I also updated the way I was checking counts to nrow instead of numcolwise(length). I know the example I used may not be completely accurate, it may or may not be likely that multiple versions of a product are floating around for any given month. Regardless, this code should do the job. If you're happy with how this is looking make sure to mark as answered :) – LachlanO Feb 05 '18 at 22:41
  • This is very confusing. i applied it and returns strange results.Also the V1 at the last image as number of unique users is misleading because v1 is called the first software version.i think it would be better if you had the dataset but i won't upload it in public domain. – J_p Feb 06 '18 at 12:17