0

I'm using beeline hive here, and for the records I'm not a pro in SQL, but I make my way through.

So, my task is create a sample database for an A/B study, (and so far so good) but the issue started when I've asked to sample based on demographic distributions.

Wrapping up, my dataset is 1M users and they are unevenly distributed in the countries (currently is ~ 70 countries), my task is to randomly choose 50k users in a way that is proportional to the amount of users in each country in my database.

So, my output should be a table with 50k lines, and after that I'd do some joins with other tables to finish my data set.

For an ilustration:

  • Total users = 1M
  • Sample goal = 50k
  • Expected sample = 50K * % of Total for each country
Country # Users % of Total Expected Sample
UK 120000 12% 6000
Germany 87000 8.7% 4350
... ... ... ...
Spain 240000 24% 12000

The only way I've thought was to select a percent of total user by country in one query, do the math in excel with 1st query result, and write another query for each country using LIMIT to select the correct amount.

CREATE TABLE AB_Study as 
              SELECT user_id FROM main_base WHERE country="UK"  ORDER BY RAND() LIMIT 6000
        UNION SELECT user_id FROM main_base WHERE ddd="Germany" ORDER BY RAND() LIMIT 4350
                                              ...
        UNION SELECT user_id FROM main_base WHERE ddd="Spain"   ORDER BY RAND() LIMIT 12000

The problem that isn't working because probably is a gigantic query, and I believe that should be a super nice approach to do this, but unfortunately I don't know.

vsobz
  • 1
  • Select User_ID and Country, then order by rand() and limit 50k. The distributions should be very close to what you need. The `order by rand()` will do the work for you. – Isolated Apr 25 '23 at 15:36

0 Answers0