2

Let's say, I have a table with the following columns:

date | event     | user_id |  unit_id  |cost |  ad_id  | spend
03-15 | impression | 2353   |   3436 | 0.15 |  NULL | NULL
03-15 | impression | 2353   |   3436 | 0.12 |  NULL | NULL
03-15 | impression | 1234   |   5678 | 0.10 |  NULL | NULL
03-15 | click | 1234   |   5678 |  NULL  |  NULL | NULL
03-15 | create_ad |     1234 | 5678 | NULL | 6789 | 10

I want to calculate how many impressions on average it takes before a user creates an id. In this particular scenario, it took one impression for user 1234 to create an ad.

I'm not sure that I can somehow use date to discriminate events (but logically all these events should happen at different moments). However, you can see that impressions have NULLs in ad_id and spend, while create_id does have a number in spend.

This one doesn't work:

select i.user_id
     , i.unit_id
     , count(i.event) impressions_n
     , count(c.event) as ads_n
  from add4ad i
   left 
  join add4ad c
     on i.user_id = c.user_id 
   and i.unit_id = c.unit_id
 where i.event in ('impression')
   and c.spend <> NULL
 group 
    by i.user_id
     , i.unit_id 

I have created a SQLFiddle with this data

ulm
  • 49
  • 5
  • 1
    Change `c.spend <> NULL` to `c.spend is not NULL` and check if you get the correct results. – forpas Dec 26 '19 at 20:20
  • It runs but the result is not correct (i.e. it doesn't return any results). – ulm Dec 26 '19 at 20:23
  • Then you have to explain better what you need and provide expected results. Also this: *...how many impressions on average it takes before a user creates an id...* can only make sense only if there is an order defined say by date which is not the case here since all the dates are equal. – forpas Dec 26 '19 at 20:25
  • Your fiddle data are different and really return no data. Your question data should return one row. – Serg Dec 26 '19 at 20:26
  • @forpas, It's fair and I have added more data to the fiddle. Now it has more days of data so some order can be defined. – ulm Dec 26 '19 at 20:32
  • @ulm . . . You need a better ordering than `date`, because all the rows have the same date. It is impossible to know what is "before" and what is "after". – Gordon Linoff Dec 26 '19 at 20:41

4 Answers4

2

I went to SQL Fiddle and ran the test via MS SQL engine.

CREATE TABLE add4ad (date date, event varchar(10), user_id int,
                   unit_id int, cost float, ad_id float, spend float);
INSERT INTO add4ad (date, Event, user_id,unit_id,cost,ad_id,spend)
VALUES
    ('2018-03-15','impression','2353','3436','0.15',NULL,NULL),
    ('2018-03-15','impression','2353','3436','0.12',NULL,NULL),
    ('2018-03-15','impression','2353','3436','0.10',NULL,NULL),
    ('2018-03-15','click','1234','5678', NULL, NULL,NULL),
    ('2018-03-15','create_ad','2353','5678', NULL, 6789,10);

My query

with e10 as (select  user_id, event, date, rowid=row_number() over (Partition by user_id order by date)
from add4ad
where event='create_ad'
),
e20 as ( -- get the first create_ad event
select user_id, date
  from e10
  where rowid=1
  )
  select a.user_id, count(1) as N
  from e20 inner join add4ad a
  on e20.user_id=a.user_id
  and a.date<=e20.date
  and a.event='impression'
  group by a.user_id
Seaport
  • 153
  • 2
  • 14
  • Looks like it's working well and solves the problem! Would you mind walking through your thinking process (given that the MS SQL syntax is more exotic than other variations). – ulm Dec 26 '19 at 20:43
  • It is about the assumptions in your original post. My query assumes that there might be more than one "create_ad" event per user and there might be "impression" events after "create_ad". Your latest post basically denies the possibility of both. – Seaport Dec 26 '19 at 21:54
  • This query is based on loosened assumptions - there is ONLY one "create_ad" event per user and there might be "impression" events after "create_ad". ` with e20 as ( select user_id, date from add4ad where event='create_ad' ) select a.user_id, count(1) as N from e20 inner join add4ad a on e20.user_id=a.user_id and a.date<=e20.date and a.event='impression' group by a.user_id ` – Seaport Dec 26 '19 at 21:58
  • I believe that once a given ad_id is created, there is no reason to have another impression associated with the same unit_id (just trying to assess the business logic of the question). – ulm Dec 26 '19 at 23:42
2

If I got it right, you need to count distinct ads

CREATE TABLE add4ad (`date` date, `event` varchar(10), `user_id` int,
                   `unit_id` int, `cost` float, `ad_id` float, `spend` float);
INSERT INTO add4ad (`date`, `Event`, `user_id`,`unit_id`,`cost`,`ad_id`,`spend`)
VALUES
    ('2018-03-15','impression','2353','3436','0.15',NULL,NULL),
    ('2018-03-15','impression','2353','3436','0.12',NULL,NULL),
    ('2018-03-15','impression','2353','3436','0.10',NULL,NULL),
    ('2018-03-15','impression','1234','5678','0.10',NULL,NULL),
    ('2018-03-15','click','1234','5678', NULL, NULL,NULL),
    ('2018-03-15','create_ad','1234','5678', NULL, 6789,10),
    ('2018-03-16','impression','8765','8871','0.10',NULL,NULL),
    ('2018-03-16','impression','8765','8871','0.10',NULL,NULL),
    ('2018-03-16','impression','8765','8871','0.2',NULL,NULL),
    ('2018-03-16','impression','8765','8871','0.23',NULL,NULL),
    ('2018-03-16','click','8765','8871', NULL, NULL,NULL),
    ('2018-03-16','create_ad','8765','8871', NULL, 6789,10);

select i.user_id, i.unit_id, count(i.event) as impressions_n,
    count(distinct c.event) as ads_n
from add4ad i
join add4ad c
   on i.user_id = c.user_id and i.unit_id = c.unit_id
where i.event in ('impression')
   and c.event in ('create_ad') and c.spend is not NULL
group by i.user_id, i.unit_id 

Returns

user_id unit_id impressions_n   ads_n
1234    5678    1   1
8765    8871    4   1

I've replaced left join with join because where as it is effectively makes your join inner If you still need left join move predicates to ON clause or handle NULLs in where.

fiddle

Serg
  • 22,285
  • 5
  • 21
  • 48
  • It works well but only for one group. In the fiddle above, I've added more data (for another day). And your code only captures one group. But the idea seems correct. – ulm Dec 26 '19 at 20:47
  • See edited version. The query returns 2 rows. What is the expected result? – Serg Dec 26 '19 at 20:54
  • the edited version works as it's supposed. Thank you and upvoting! – ulm Dec 26 '19 at 21:01
0

The issue is for checking NULLS you have to use is NULL or is not NULL. Also your data in fiddle is incorrect. It does not have impression for 1234 in fiddle.

select i.user_id, i.unit_id, count(i.event) as impressions_n,
count(c.event) as ads_n
from add4ad i
 left join add4ad c
  on i.user_id = c.user_id and i.unit_id = c.unit_id
where i.event in ('impression')
/*and c.event in ('create_ad')*/ and c.spend is not NULL
group by i.user_id, i.unit_id 
PraveenB
  • 1,270
  • 10
  • 11
  • Yes, I've corrected both: added 1234 and am using "is not NULL." However, it doesn't change the results. – ulm Dec 26 '19 at 20:34
  • its printing 1234 with 1. You have only 1 impression for user 1234. – PraveenB Dec 26 '19 at 20:56
  • Also the event id is also incorrect in your fiddle. Here is the updated fiddle with correct data : http://sqlfiddle.com/#!9/70a44d/2 – PraveenB Dec 26 '19 at 20:58
0

Seems this is the solution:

select sum(c.impressions_n) / count(1) as average_num_of_impressions from (
select count(i.event) as impressions_n 
  from add4ad i 
  join add4ad c
     on i.user_id = c.user_id and i.unit_id = c.unit_id
 where i.event in ('impression') and c.event in ('create_ad')
 group by i.user_id, i.unit_id              ) c
ulm
  • 49
  • 5