I want to know how to measure frequency of use.
By frequency I mean, for example, how often a user enters an online channel.
I would like an index that tells me whether a given user is a heavy user or not.

Here is a sample dataset.

library(data.table)  # data.table()
library(stringr)     # str_pad()

d <- data.table(
  timestamp = paste0('202001', str_pad(rep(1:30, each = 3), width = 2, side = 'left', pad = '0')),
  user = sample(x = LETTERS[1:5], size = 90, replace = TRUE),
  value = rnorm(90)
)

head(d[user == 'B'], 10)

#     timestamp user       value
# 1: 2020-01-01    B -0.05698572
# 2: 2020-01-01    B -0.16677841
# 3: 2020-01-03    B  0.06953150
# 4: 2020-01-04    B  0.29374589
# 5: 2020-01-05    B  0.59508578
# 6: 2020-01-06    B -0.16237362
# 7: 2020-01-07    B -0.34246076
# 8: 2020-01-07    B -0.04670312
# 9: 2020-01-08    B  1.92830277
# 10: 2020-01-08    B  2.04701468


Q1. How can I show that user B is a heavy user between 20200101 and 20200108 (ignoring the value column)?
Q2. Is there an index that describes frequency of use?
Q3. So far I have computed the distribution (mean, standard deviation) of the day differences between consecutive visits. Is that a good approach? For example:

sam <- head(d[user == 'B'], 10)

sam[, timestamp := as.Date(timestamp, format = "%Y%m%d")]
sam[, lag_timestamp := dplyr::lag(timestamp)]
sam[, diff_prev_date := timestamp - lag_timestamp]
sam

#     timestamp user       value lag_timestamp diff_prev_date
# 1: 2020-01-01    B -0.05698572          <NA>        NA days
# 2: 2020-01-01    B -0.16677841    2020-01-01         0 days
# 3: 2020-01-03    B  0.06953150    2020-01-01         2 days
# 4: 2020-01-04    B  0.29374589    2020-01-03         1 days
# 5: 2020-01-05    B  0.59508578    2020-01-04         1 days
# 6: 2020-01-06    B -0.16237362    2020-01-05         1 days
# 7: 2020-01-07    B -0.34246076    2020-01-06         1 days
# 8: 2020-01-07    B -0.04670312    2020-01-07         0 days
# 9: 2020-01-08    B  1.92830277    2020-01-07         1 days
# 10: 2020-01-08    B  2.04701468    2020-01-08         0 days

plot(density(as.numeric(sam$diff_prev_date), na.rm = TRUE), main = "")
mean(sam$diff_prev_date, na.rm = TRUE)          # Time difference of 0.7777778 days
sqrt(var(sam$diff_prev_date, na.rm = TRUE))     # 0.6666667
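The same gap calculation can be run for every user at once, which makes users comparable. The following is only a minimal sketch under assumptions: the `visits` table is hypothetical toy data (not the random sample above), and the column names `gap_days`, `n_visits`, `active_days`, `mean_gap` and `sd_gap` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: one row per visit (assumption, not the sample above)
visits <- data.table(
  user = c("A", "A", "A", "B", "B", "B", "B", "B"),
  timestamp = as.Date(c("2020-01-01", "2020-01-10", "2020-01-20",
                        "2020-01-01", "2020-01-01", "2020-01-02",
                        "2020-01-03", "2020-01-04"))
)
setorder(visits, user, timestamp)

# Gap in days to the previous visit, computed separately within each user
visits[, gap_days := as.numeric(timestamp - shift(timestamp)), by = user]

# One row per user: visit count, distinct active days, mean and sd of the gaps
gap_stats <- visits[, .(
  n_visits    = .N,
  active_days = uniqueN(timestamp),
  mean_gap    = mean(gap_days, na.rm = TRUE),
  sd_gap      = sd(gap_days, na.rm = TRUE)
), by = user]

print(gap_stats)
```

In this toy data a smaller `mean_gap` together with more `active_days` points to a heavier user. Whether the gap mean alone is a sufficient index is a separate question, since several visits on the same day pull it toward zero.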
jay.sf
Woody
  • It depends on a few factors. Do you want to measure a customer's access frequency against their own history, or compare it to the rest of the population? Do you expect multiple modes (e.g. customers that only connect at the beginning of the month)? In a business context, I would recommend asking yourself what your target is (e.g. understanding whether server load varies across the month, what behaviours customers show, ...) – Jon Nagra Jun 04 '20 at 05:40
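
To illustrate the "compare to the rest of the population" option from the comment, here is a rough sketch, assuming hypothetical toy data. The names `visits`, `window_days`, `active_share` and `pct_rank` are illustrative only and do not come from the question.

```r
library(data.table)

# Hypothetical toy data: one row per visit (assumption)
visits <- data.table(
  user = c("A", "B", "B", "B", "C", "C"),
  timestamp = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                        "2020-01-02", "2020-01-03", "2020-01-05"))
)

# Length of the observation window in days, both endpoints included
window_days <- as.numeric(max(visits$timestamp) - min(visits$timestamp)) + 1

# Per-user counts: total visits and number of distinct days with a visit
freq <- visits[, .(n_visits = .N, active_days = uniqueN(timestamp)), by = user]

# Share of the window on which the user was active
freq[, active_share := active_days / window_days]

# Position within the population: rank by active days, scaled to (0, 1]
freq[, pct_rank := rank(active_days) / .N]

print(freq)
```

A user could then be called heavy relative to the population if `pct_rank` exceeds some cut-off. Any such cut-off is a business decision, not something this sketch determines.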

0 Answers