-1

Using the provided table I would like to randomly sample users per day. The number of users to be sampled is specified in the to_sample column and it is filled by another query. In this example I would like to sample 1 observation for day one and 2 observations for day two (but this will change with every execution of the query, so don't set your mind to these numbers). I would like the users assigned to different days to be different (no overlapping assignment).

drop table if exists test; 

create table test (
user_id int,
day_of_week int,
to_sample int);

insert into test values (1, 1, 1);
insert into test values (1, 2, 2);
insert into test values (2, 1, 1);
insert into test values (2, 2, 2);
insert into test values (3, 1, 1);
insert into test values (3, 2, 2);
insert into test values (4, 1, 1);
insert into test values (4, 2, 2);
insert into test values (5, 1, 1);
insert into test values (5, 2, 2);
insert into test values (6, 1, 1);
insert into test values (6, 2, 2);

The expected results would look like this:

create table results (
user_id int,
day_of_week int);

insert into results values (1, 1);
insert into results values (3, 2);
insert into results values (6, 2);

As I said, the number of users to be sampled will be different every time, as should be taken from the to_sample column in the test table. Also I will run it for 7 days, here there are 2 to keep the example simple.

EDIT:

with day_1 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      fromtest t where t.day_of_week = 1 
     ) t 
where t.seqnum <= (select distinct to_sample fromtest where day_of_week = 1)
)
, day_2 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      from test t where t.user_id not in (select distinct user_id from day_1) and t.day_of_week = 2 
     ) t 
where t.seqnum <= (select distinct to_sample from test where day_of_week = 2) 
)
select * from day_1 union all select * from day_2

I tried creating a brute solution based on some of the answers, but still there are some repeated user, even though I remove the user_id that is already used in day_1 from day_2.

user_id | day_of_week
---------+-------------
       4 |           1
       4 |           2
       1 |           2
abu
  • 737
  • 5
  • 8
  • 19
  • 1
    I don't understand what your aimed resultset should show – Eray Balkanli Feb 07 '18 at 14:05
  • As i said, I would like to sample one user for day 1, and two users for day 2, hence 3 observations in the results set. The user_id's are of course arbitrary here, they should be random. – abu Feb 07 '18 at 14:06
  • if you used insert into results values (5, 2); INSTEAD OF insert into results values (6, 2); in your question, would it still be true?? – Eray Balkanli Feb 07 '18 at 14:07
  • Yes, as long as none of the user_id's is used more than once in the results table. – abu Feb 07 '18 at 14:09
  • why would u skip 4 and 5 , any reason for that ? – Ven Feb 07 '18 at 14:10
  • As I said, I want to sample a total of 3 users, one for day one and two for day two. As long as they do not repeat in the results table, it does not matter which users are sampled. – abu Feb 07 '18 at 14:11
  • in that case why cant u use ' select distinct ' ? – Ven Feb 07 '18 at 14:16
  • Do u have a data column in test table ? – Ven Feb 07 '18 at 14:17
  • Yes, and the case is more sophisticated than the example. However, I am interested in sampling, as I can handle the rest of the data. – abu Feb 07 '18 at 14:18
  • @abu Still Unclear, if u add date column to the sample source data, we can possibly suggest u the required out put – Ven Feb 07 '18 at 15:03
  • @BHouse I have a date column, but ultimately I want to receive some samples per day of the week, not date. That is why I am using this approach. – abu Feb 07 '18 at 15:05
  • will you accept solution with `UDTFx`? – sKwa Feb 07 '18 at 17:56

2 Answers2

1

If I got you correctly, so try next: (actually its a improved solution of @BHouse)

SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample ORDER BY randomint(max(user_id) + 1)) AS RN
    FROM
        test
    GROUP BY
        user_id,
        day_of_week,
        to_sample
    ORDER BY
        to_sample
    ) AS T
WHERE
    T.RN <= T.to_sample;

Output example for provided data:

1st execution:

 user_id | day_of_week
---------+-------------
       1 |           1
       3 |           2
       2 |           2

2nd execution:

 user_id | day_of_week
---------+-------------
       1 |           1
       1 |           2
       4 |           2

3rd execution:

 user_id | day_of_week
---------+-------------
       5 |           1
       4 |           2
       2 |           2

So, some randomness is guaranteed.

UPDATE

Or try this:

 SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample) AS RN,
        randomint(42) AS RANDOM_ORDER /* <<-- here is main problem, number should be >= max(user_id) + 1 */
    FROM
        test
    ORDER BY
        to_sample,
        RANDOM_ORDER
    ) AS T
WHERE
    T.RN <= T.to_sample;

A second option is more faster, but I didn't testes it for critical cases.

sKwa
  • 889
  • 5
  • 11
  • Thanks for your input! The problem with the first one, I don’t know about the update yet, is that the user cannot repeat for different days. If he appeared in day 1, he can’t be included in day 2. Do you know how to incorporate this addition? – abu Feb 07 '18 at 19:18
  • @abu, now it make sense, update do not solves it, I need a time :( – sKwa Feb 07 '18 at 21:55
  • @abu, Im not an expert in `SQL`, so I have no idea how to do it in pure `SQL`, because it require complex flow - if user in prev set selected => so.... I can write an `UDF` for it. Sorry. – sKwa Feb 08 '18 at 00:10
  • I will try writing an R UDF for this problem then, maybe it will work. Thanks for the effort! – abu Feb 08 '18 at 09:08
0

Using random row number , you will get this required sample output

select USER_ID,day_of_week 
from 
(
select user_id,day_of_week, ROW_NUMBER() over (  order by
user_id) rn from #test where day_of_week = 1 
) x where rn = 1
union all 
select USER_ID,day_of_week 
from 
(
select user_id,day_of_week, ROW_NUMBER() over (  order by
user_id) rn from #test where day_of_week = 2
) x where rn in (3,6)
Ven
  • 2,011
  • 1
  • 13
  • 27
  • I read question+comments couple times and it still unclear for me, but I got that condition `where rn = X` or `where rn in [ARR]` isn't solves a problem. – sKwa Feb 07 '18 at 15:45
  • As far as I see there is no required randomness in the code. It always returns the same data. And as I specified, I need to load randomly sampled users in the amount specified by `to_sample` column. – abu Feb 07 '18 at 15:47
  • @sKwa lets see what OP comes back with, a sample Random Output he requires i guess – Ven Feb 07 '18 at 15:47
  • @BHouse I modified the question to state more clearly what the desired output is. Hope this makes it clear now. – abu Feb 07 '18 at 15:51