
I'm trying some marketing analytics using R and SQL (via `sqldf`).

This is the dataset:

                           user_id install_date app_version
1 000a0efdaf94f2a5a09ab0d03f92f5bf   2014-12-25          v1
2 000a0efdaf94f2a5a09ab0d03f92f5bf   2014-12-25          v1
3 000a0efdaf94f2a5a09ab0d03f92f5bf   2014-12-25          v1
4 000a0efdaf94f2a5a09ab0d03f92f5bf   2014-12-25          v1
5 000a0efdaf94f2a5a09ab0d03f92f5bf   2014-12-25          v1
6 002a9119a4b3dfb05e0159eee40576b6   2015-12-29          v2
                   user_session_id     event_timestamp   app time_seconds
1 f3501a97f8caae8e93764ff7a0a75a76 2015-06-20 10:59:22  draw          682
2 d1fdd0d46f2aba7d216c3e1bfeabf0d8 2015-05-04 18:06:54 build         1469
3 b6b9813985db55a4ccd08f9bc8cd6b4e 2016-01-31 19:27:12 build          261
4 ce644b02c1d0ab9589ccfa5031a40c98 2016-01-31 18:44:01  draw          195
5 7692f450607a0a518d564c0a1a15b805 2015-06-18 15:39:50  draw          220
6 4403b5bc0b3641939dc17d3694403773 2016-03-17 21:45:12 build          644

link for the dataset

I want to create a plot that looks like this: [image: stacked bars of retention and churn percentages by month]

but broken out per version, like this (focus on the versions part of that graph: the plot above repeated once per version):

[image: the same retention/churn plot, one panel per app version]

Basically, it shows the percentage of retention and churn across months, per version. This is what I have done so far:

ee <- sqldf("select user_id, count(user_id) as n, strftime('%Y-%m', event_timestamp) as dt, app_version
             from w
             group by user_id, strftime('%Y-%m', event_timestamp), app_version
             having count(*) > 1
             order by n desc")

ee

                              user_id   n      dt app_version
1    fab9612cea12e2fcefab8080afa10553 238 2015-11          v2   
2    fab9612cea12e2fcefab8080afa10553 204 2015-12          v2
3    121d81e4b067e72951e76b7ed8858f4e 173 2016-01          v2
4    121d81e4b067e72951e76b7ed8858f4e 169 2016-02          v2
5    fab9612cea12e2fcefab8080afa10553  98 2015-10          v2

The above shows the unique users that used the app more than once, so they are the population that the retention-rate analysis refers to. What I am having difficulty with is summarizing each user_id's events over time (the event_timestamp column) to arrive at the retention/churn outcome shown in the first image.
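For reference, the same aggregation can also be written in dplyr instead of sqldf. A minimal sketch, using a toy stand-in for `w` (the real data frame has the same columns as the sample above):

```r
library(dplyr)

# Toy stand-in for `w`, with the column layout from the sample above.
w <- data.frame(
  user_id = c("a", "a", "a", "b"),
  event_timestamp = c("2015-06-20 10:59:22", "2015-06-21 11:00:00",
                      "2015-07-01 09:00:00", "2015-06-25 12:00:00"),
  app_version = c("v1", "v1", "v1", "v2")
)

# Same aggregation as the sqldf query: events per user per month per
# version, keeping only user/month combinations with more than one event.
ee <- w %>%
  mutate(dt = format(as.Date(event_timestamp), "%Y-%m")) %>%
  group_by(user_id, dt, app_version) %>%
  summarise(n = n(), .groups = "drop") %>%
  filter(n > 1) %>%
  arrange(desc(n))
```

With the toy data, only user "a" in 2015-06 survives the `n > 1` filter.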

Cœur
J_p
  • This is more than just marketing analytics. Are you taking a marketing class? Do you know how to calculate churn rate? What code have you tried so far? This is the second question I've seen in the last few days that uses this same dataset; I suspect that this may be a homework question. – Steven Feb 08 '18 at 16:54
  • I am preparing for an interview and I'm doing marketing analytics as preparation. Yes, I know how to calculate churn rate; the method I applied is dividing the users you lost by the total number of users that downloaded the app. – J_p Feb 08 '18 at 16:59
  • You need to summarize your data then to understand when each `user_id` started and stopped using the app. What code have you written to create those summaries? – Steven Feb 08 '18 at 17:11
  • This is the question as i wrote it but in code something like this:`vv=sqldf('select user_id, range(user_id),strftime("%Y-%m",dt) as t from ee group by t ')` – J_p Feb 08 '18 at 17:20
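As an aside on that last comment: SQLite (which sqldf uses by default) has no `range()` aggregate, so that query cannot run as written. The first and last active month per user, which is what the comment is after, can be had with `min`/`max`. A sketch, with a hypothetical toy `ee` standing in for the query result above:

```r
library(sqldf)

# Toy stand-in for the `ee` summary table built earlier.
ee <- data.frame(user_id = c("a", "a", "b"),
                 dt = c("2015-01", "2015-03", "2015-02"),
                 n = c(2, 3, 2))

# First and last active month per user, using min/max instead of the
# nonexistent range() aggregate. "%Y-%m" strings sort correctly as text.
vv <- sqldf("select user_id,
                    min(dt) as first_month,
                    max(dt) as last_month
             from ee
             group by user_id")
```

This gives one row per user, spanning their observed lifetime in the data.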

1 Answer


I don't think your question is very clear, nor do I think you know exactly what you're trying to do. You say you're trying to use R to calculate churn and retention, but you've provided no actual R code, only SQL statements that you appear to be running from inside an R environment.

If you want to know the SQL to do all of this in one step, you need to ask a different, better question. However, given that you've provided a .csv file of the data, I have ignored the SQL portion of your question and provided an R solution, including data handling.

library(dplyr)
library(zoo)
library(ggplot2)
library(reshape2)
library(scales)

df <- 
  read.csv([location of the .csv on your machine], header = TRUE) %>%
  mutate(month = format(as.Date(event_timestamp), format = "%Y-%m"))

installs <- #calculates all installs for all versions of the app by month
  df %>% 
  group_by(user_id) %>%
  slice(1) %>%
  group_by(month) %>%
  summarise(tot_installs = n())

last_use_date <- #finds the last time a user actually used any version of the app (i.e., when they "churned" away)
  df %>%
  group_by(user_id, month) %>%
  summarise(tot_uses = n()) %>%
  group_by(user_id) %>%
  filter(month == max(month)) %>%
  group_by(month) %>%
  summarise(stopped_using = n())

installs %>%
  full_join(last_use_date) %>%
  mutate(cum_sum_install = cumsum(tot_installs), 
         cum_sum_stopped = cumsum(stopped_using), 
         Churn = cum_sum_stopped/cum_sum_install, 
         Retention = 1 - Churn) %>%
  select(month, Churn, Retention) %>%
  melt(id.vars = "month") %>% # melt the data frame for easy plotting
  ggplot(aes(x = month, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(name = "", values = c("red", "blue")) +
  labs(x = "Month", y = "") +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "bottom", 
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

[image: resulting stacked bar chart of churn and retention percentages by month]

Steven
  • Interesting, but how does this work with the months? The image I gave shows a percentage for each single month, whereas your example looks like a time series. Does it mean the users from March '15 are still counted in the following months? Care to explain your graph? I appreciate the effort – J_p Feb 08 '18 at 19:33
  • In `2015-03` 60% of total users churned away. In `2015-04` ~70% of total users had churned away, etc. If you step through the code I provided, you can understand exactly how I calculated everything in this figure. – Steven Feb 08 '18 at 19:44
  • How do I include the per-version distinction in the graph that I mentioned? I tried `facet_grid(app_version ~ .)` but it isn't working. – J_p Feb 09 '18 at 10:38
  • You'll need to add version to the `group_by()` statements in order to facet by version. I suggest you understand how the analysis stream works before trying to modify, however. – Steven Feb 09 '18 at 16:15
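Following that last exchange, one way to carry `app_version` through the pipeline and facet the plot is sketched below. This is an untested adaptation of the answer's approach, not the answerer's own code; it uses a hypothetical toy `df` (in the real case, `df` with its `month` column comes from the `read.csv`/`mutate` step in the answer):

```r
library(dplyr)
library(reshape2)
library(ggplot2)
library(scales)

# Hypothetical toy stand-in for the answer's df (user_id, app_version, month).
df <- data.frame(
  user_id     = c("u1", "u1", "u2", "u2", "u3"),
  app_version = c("v1", "v1", "v1", "v2", "v2"),
  month       = c("2015-01", "2015-02", "2015-01", "2015-03", "2015-03")
)

# Installs per version per month: each user's first observed row per version.
installs_v <- df %>%
  group_by(app_version, user_id) %>%
  slice(1) %>%
  group_by(app_version, month) %>%
  summarise(tot_installs = n(), .groups = "drop")

# Last month each user was seen on each version (their "churn" month).
last_use_v <- df %>%
  group_by(app_version, user_id) %>%
  filter(month == max(month)) %>%
  distinct(app_version, user_id, month) %>%
  group_by(app_version, month) %>%
  summarise(stopped_using = n(), .groups = "drop")

# Cumulative churn/retention computed within each version separately.
per_version <- installs_v %>%
  full_join(last_use_v, by = c("app_version", "month")) %>%
  group_by(app_version) %>%
  arrange(month, .by_group = TRUE) %>%
  mutate(Churn = cumsum(coalesce(stopped_using, 0L)) /
                 cumsum(coalesce(tot_installs, 0L)),
         Retention = 1 - Churn) %>%
  ungroup() %>%
  select(app_version, month, Churn, Retention)

# Same plot as the answer, with one panel per version.
per_version %>%
  melt(id.vars = c("app_version", "month")) %>%
  ggplot(aes(x = month, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  facet_grid(app_version ~ .) +
  scale_fill_manual(name = "", values = c("red", "blue")) +
  scale_y_continuous(labels = percent) +
  labs(x = "Month", y = "") +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
```

The key change versus the answer is that `app_version` appears in every `group_by()`, so the cumulative sums reset per version, and `facet_grid(app_version ~ .)` then splits the panels.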