0

I need to create a random sample from a table that has users and postings. Each user could have more than one posting. I need to select only 200 from a variable size of users (each day we will have a different total). I created a rand() variable and select only the items that have this rand() under

200/count(*)

But the problem is that I will might have users repeated. How can I select only 200 users from this variable total, considering the original distribution of users (ones are there more times, so I need to give them more chances to be selected)?

I was thinking of creating a loop that populates a field counting the user.... so I will have the same number for each user (right now I don't have a user id, instead of that I have a char field). But I'm not sure how to do this....

Thanks!

GabyLP
  • 3,649
  • 7
  • 45
  • 66

2 Answers2

1

You mean you have users AND their postings in one table ? What's your schema like ?

You could be trying something like that, although it's hard to guess without the actual schema :

SELECT name, count( post ) as "number of posts"
FROM user_and_posts
GROUP BY name
HAVING count( post ) >= 100  -- (or whatever your limit is)
ORDER BY rand( )
LIMIT 0, 199
Git Psuh
  • 312
  • 1
  • 3
  • 11
0

You can have an array having the index(key) as the userid and the value as the count of posts. Then you could simply sort the array descending and pick the top 200 elements. This would ensure that people with the highest number of posts almost always get selected.

Kanishk Dudeja
  • 1,201
  • 3
  • 17
  • 33