I need to reduce a DataFrame and export it to a Parquet file. I need to make sure that I have at most a fixed number of rows (e.g. 10,000) for each value in a column.
The dataframe I am working with looks like the following:
+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|             AVALON|
|         FORD|             RANGER|
|MERCEDES-BENZ|            C-CLASS|
|       TOYOTA|             TUNDRA|
|         FORD|EXPLORER SPORT TRAC|
|    CHEVROLET|           COLORADO|
|   MITSUBISHI|            MONTERO|
|        DODGE|      GRAND CARAVAN|
+-------------+-------------------+
I need to return at most 10,000 rows for each model. The current row counts per model are:
+--------------------+-------+
|               Model|  count|
+--------------------+-------+
|                 MDX|1658647|
|               ASTRO| 682657|
|           ENTOURAGE|  72622|
|             ES 300H|  80712|
|            6 SERIES| 145252|
|           GRAN FURY|   9719|
|RANGE ROVER EVOQU...|   4290|
|        LEGACY WAGON|   2070|
|        LEGACY SEDAN|    104|
|  DAKOTA CHASSIS CAB|      8|
|              CAMARO|2028678|
|                  XT|  10009|
|             DYNASTY| 171776|
|                 944|  43044|
|         F430 SPIDER|    506|
|FLEETWOOD SEVENTY...|      6|
|         MONTE CARLO|1040806|
|             LIBERTY|2415456|
|            ESCALADE| 798832|
| SIERRA 3500 CLASSIC|   9541|
+--------------------+-------+
This question is not the same as the other one because that question, as others have suggested below, only retrieves rows where a value is greater than other values. What I want is, for each value in df['Model'], to limit the rows for that value (model) to 10,000 if there are 10,000 or more rows (pseudo-code, obviously). In other words, if a model has more than 10,000 rows, get rid of the rest; otherwise leave all of its rows.