1

I want to do retention/event analytics based on buried point data that are stored in ClickHouse. Let's say I have two types of events: app_launch (buried_point_id=1) and user_register (buried_point_id=2). I want to understand: Within the 1-day time window, how many users register after launching the app. See sample buried point data below:

buried_point_id happened_at user_id
1 1613923200 123
1 1614009600 345
2 1613966400 123
2 1614009600 234
2 1614182400 345

Here's the query I want to run:

SELECT COUNT (DISTINCT t1.user_id), COUNT (DISTINCT t2.user_id)
FROM
  (SELECT user_id,
          happened_at
   FROM buried_points
   WHERE buried_point_id = 1
     AND happened_at >= 1613923200
     AND happened_at <= 1614182400
     AND ) AS t1
ASOF LEFT JOIN
  (SELECT user_id,
          happened_at
   FROM buried_points
   WHERE buried_point_id = 2
     AND happened_at >= 1613923200
     AND happened_at <= 1614182400) AS t2
ON t1.user_id = t2.user_id
AND t1.happened_at < t2.happened_at
AND t2.happened_at - t1.happened_at < 86400;

This is the expected query result:

2 (123,345), 1 (123)

However, according to ClickHouse docs, only 1 inequality can be supported:

You can use any number of equality conditions and exactly one closest match condition. For example, SELECT count() FROM table_1 ASOF LEFT JOIN table_2 ON table_1.a == table_2.b AND table_2.t <= table_1.t. Conditions supported for the closest match: >, >=, <, <=.

However I need 2 inequalities to do my work - Is there a work around to this problem?

vladimir
  • 13,428
  • 2
  • 44
  • 70
Pang Fish
  • 11
  • 1

1 Answers1

0

Consider using the specialized aggregate function sequenceMatch:

SELECT 
  user_id, 
  sequenceMatch('(?1)(?2)')(happened_at, buried_point_id = 1, buried_point_id = 2) retention
FROM (
  /* emulate test dataset */
  SELECT data.1 buried_point_id, data.2  happened_at, data.3 user_id
  FROM (
    SELECT arrayJoin(
      [(1, 1613966400, 123), 
      (1, 1613966411, 123), 
      (1, 1613966422, 123), 
      (1, 1614009600, 345), 
      (1, 1614009611, 345), 
      (2, 1613923200, 123), 
      (2, 1614009600, 234), 
      (2, 1614182400, 345)]) data)
  )
WHERE happened_at >= 1613923200 AND happened_at <= 1614182400 
GROUP BY user_id  

/*
┌─user_id─┬─retention─┐
│     123 │         0 │
│     234 │         0 │
│     345 │         1 │
└─────────┴───────────┘
*/

Calculation based on minIf:

SELECT 
  user_id, 
  minIf(happened_at, buried_point_id = 1) first_launch,
  minIf(happened_at, buried_point_id = 2) first_registration,
  first_launch != 0 and first_registration > first_launch ? 1 : 0 AS is_user_registered_after_launch
FROM (
  /* emulate test dataset */
  SELECT data.1 buried_point_id, data.2  happened_at, data.3 user_id
  FROM (
    SELECT arrayJoin(
      [(1, 1613966400, 123), 
      (1, 1614009600, 345), 
      (2, 1613923200, 123), 
      (2, 1614009600, 234), 
      (2, 1614182400, 345)]) data)
  )
WHERE happened_at >= 1613923200 AND happened_at <= 1614182400 
GROUP BY user_id 

/*
┌─user_id─┬─first_launch─┬─first_registration─┬─is_user_registered_after_launch─┐
│     123 │   1613966400 │         1613923200 │                               0 │
│     234 │            0 │         1614009600 │                               0 │
│     345 │   1614009600 │         1614182400 │                               1 │
└─────────┴──────────────┴────────────────────┴─────────────────────────────────┘
*/
vladimir
  • 13,428
  • 2
  • 44
  • 70
  • Hi vladimir, thanks for sharing the awesome function! I have one more question about it, from official document, it seems `retention` cannot define time relation as conditions. In this case, I only want user_register happens after app_launch. If I change test cases to: [(1, 1613966400, 123), (1, 1614009600, 345), (2, 1613923200, 123), (2, 1614009600, 234), (2, 1614182400, 345)] it will still return return [1,1] for user 123, do you know how can I define time relation in conditions? Thanks! – Pang Fish Mar 12 '21 at 08:11
  • Hi @PangFish, look at the updated answer. – vladimir Mar 14 '21 at 00:12
  • Hi @vladimir, thanks for the updates, I learned a lot about aggregate functions in ClickHouse. However, I also need to define time window for the second event, say 3600 sec after the first event happened, I tried `sequenceMatch('(?1)(?t<=3600)(?2)')(happened_at, buried_point_id = 1, buried_point_id = 2) retention`, it worked well! – Pang Fish Mar 15 '21 at 10:26
  • Also, I have a question about function `windowFunnel`, seems it matches my use case perfectly, but I found it has a bug after running on test dataset. It guarantees `timestamp of cond2 <= timestamp of cond1 + window`, but doesn't guarantee `timestamp of cond2 > timestamp of cond1`, so the running rest of my dataset is: (123, 1), (234, 0), (345, 1). Do you happen to know this issue, is there a way to fix it? – Pang Fish Mar 15 '21 at 10:29