I'm working with AWS Personalize and one of the service Quotas is to have "At least 1000 records containing a min of 25 unique users with at least 2 records each", I know my raw data has those numbers but I'm trying to find a way to guarantee that those numbers will always be met, even if the query is run by someone else in the future.
The easy way out would be to just use the full dataset, but right now we are working towards a POC, so that is not really my first option. I have covered the "two records each" section by just counting the appearances, but I don't know how to guarantee the min of 25 users.
It is important to say that my data is not shuffled in any way at the time of saving.
My query
SELECT C.productid AS ITEM_ID,
A.userid AS USER_ID,
A.createdon AS "TIMESTAMP",
B.fromaddress_countryname AS "LOCATION"
FROM A AS orders
JOIN B AS sub_orders ON orders.order_id = sub_orders.order_id
JOIN C AS order_items ON orders.order_id = order_items.order_id
WHERE orders.userid IN (
SELECT orders.userid
FROM A AS ORDERS
GROUP BY orders.userid
HAVING count(*) > 2
)
LIMIT 10
I use the LIMIT to just query a subset since I'm in AWS Athena.