I want to know how to measure "oftenness", that is, how frequently something happens.
By oftenness I mean, for example, how often an online user enters an online channel.
I would like an index that tells me whether a given user is a heavy user or not.
Here is a sample dataset.
library(data.table)
library(stringr)  # for str_pad()

d <- data.table(
  timestamp = paste0('202001', str_pad(rep(1:30, each = 3), width = 2, side = 'left', pad = '0')),
  user = sample(x = LETTERS[1:5], size = 90, replace = TRUE),
  value = rnorm(90)
)
head(d[user == 'B'], 10)
# timestamp user value
# 1: 2020-01-01 B -0.05698572
# 2: 2020-01-01 B -0.16677841
# 3: 2020-01-03 B 0.06953150
# 4: 2020-01-04 B 0.29374589
# 5: 2020-01-05 B 0.59508578
# 6: 2020-01-06 B -0.16237362
# 7: 2020-01-07 B -0.34246076
# 8: 2020-01-07 B -0.04670312
# 9: 2020-01-08 B 1.92830277
# 10: 2020-01-08 B 2.04701468
Q1. How can I show that user B is a heavy user between 2020-01-01 and 2020-01-08 (ignoring the value column)?
Q2. Is there an index that describes oftenness?
Q3. So far I have been looking at the distribution of the differences between consecutive dates (mean, standard deviation). Is that a good approach? For example:
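sam <- head(d[user == 'B'], 10)
# Parse the yyyymmdd strings into Date objects
sam[, timestamp := as.Date(timestamp, format = "%Y%m%d")]
# Date of the previous event (NA for the first row)
sam[, lag_timestamp := dplyr::lag(timestamp)]
# Days elapsed since the previous event
sam[, diff_prev_date := timestamp - lag_timestamp]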
sam
# timestamp user value lag_timestamp diff_prev_date
# 1: 2020-01-01 B -0.05698572 <NA> NA days
# 2: 2020-01-01 B -0.16677841 2020-01-01 0 days
# 3: 2020-01-03 B 0.06953150 2020-01-01 2 days
# 4: 2020-01-04 B 0.29374589 2020-01-03 1 days
# 5: 2020-01-05 B 0.59508578 2020-01-04 1 days
# 6: 2020-01-06 B -0.16237362 2020-01-05 1 days
# 7: 2020-01-07 B -0.34246076 2020-01-06 1 days
# 8: 2020-01-07 B -0.04670312 2020-01-07 0 days
# 9: 2020-01-08 B 1.92830277 2020-01-07 1 days
# 10: 2020-01-08 B 2.04701468 2020-01-08 0 days
plot(density(as.numeric(sam$diff_prev_date), na.rm = T), main = "")
mean(sam$diff_prev_date, na.rm = T) # Time difference of 0.7777778 days
sqrt(var(sam$diff_prev_date, na.rm = T)) # 0.6666667
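To compare users rather than look at B alone, the same idea can be computed per user together with some simple counts. The following is only a rough sketch under my own assumptions: the `events` table is a small hand-made example (not the random data above), the column names `n_events`, `active_days`, `active_share` and `mean_gap_days` are hypothetical, and the observation window is assumed to be 2020-01-01 to 2020-01-08 inclusive. Unlike the code above, the gap here is taken between distinct active days, so repeated visits on the same day do not pull the mean toward zero.

```r
library(data.table)

# Hypothetical toy data, small enough to check by hand
events <- data.table(
  timestamp = as.Date(c("2020-01-01", "2020-01-01", "2020-01-03",
                        "2020-01-04", "2020-01-02", "2020-01-08")),
  user = c("B", "B", "B", "B", "A", "A")
)

# Assumed observation window: 2020-01-01 to 2020-01-08 inclusive (8 days)
window_days <- as.integer(as.Date("2020-01-08") - as.Date("2020-01-01")) + 1L

setorder(events, user, timestamp)

freq <- events[, {
  days <- unique(timestamp)               # distinct days with at least one event
  .(
    n_events      = .N,                   # total number of events
    active_days   = length(days),         # number of distinct active days
    active_share  = length(days) / window_days,  # fraction of the window that was active
    # mean number of days between consecutive active days (NA if only one)
    mean_gap_days = if (length(days) > 1) mean(as.numeric(diff(days))) else NA_real_
  )
}, by = user]

print(freq)
```

With these numbers, a "heavy" user would show a high `active_share` and a low `mean_gap_days` relative to the other users; where to put the cutoff is still an open question for me.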