I'm using beeline hive here, and for the records I'm not a pro in SQL, but I make my way through.
So, my task is create a sample database for an A/B study, (and so far so good) but the issue started when I've asked to sample based on demographic distributions.
Wrapping up, my dataset is 1M users and they are unevenly distributed in the countries (currently is ~ 70 countries), my task is to randomly choose 50k users in a way that is proportional to the amount of users in each country in my database.
So, my output should be a table with 50k lines, and after that I'd do some joins with other tables to finish my data set.
For an ilustration:
- Total users = 1M
- Sample goal = 50k
- Expected sample = 50K * % of Total for each country
Country | # Users | % of Total | Expected Sample |
---|---|---|---|
UK | 120000 | 12% | 6000 |
Germany | 87000 | 8.7% | 4350 |
... | ... | ... | ... |
Spain | 240000 | 24% | 12000 |
The only way I've thought was to select a percent of total user by country in one query, do the math in excel with 1st query result, and write another query for each country using LIMIT
to select the correct amount.
CREATE TABLE AB_Study as
SELECT user_id FROM main_base WHERE country="UK" ORDER BY RAND() LIMIT 6000
UNION SELECT user_id FROM main_base WHERE ddd="Germany" ORDER BY RAND() LIMIT 4350
...
UNION SELECT user_id FROM main_base WHERE ddd="Spain" ORDER BY RAND() LIMIT 12000
The problem that isn't working because probably is a gigantic query, and I believe that should be a super nice approach to do this, but unfortunately I don't know.