
I need to reduce a dataframe and export it to parquet. I need to make sure that I have, for example, 10,000 rows for each value in a column.

The dataframe I am working with looks like the following:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|             AVALON|
|         FORD|             RANGER|
|MERCEDES-BENZ|            C-CLASS|
|       TOYOTA|             TUNDRA|
|         FORD|EXPLORER SPORT TRAC|
|    CHEVROLET|           COLORADO|
|   MITSUBISHI|            MONTERO|
|        DODGE|      GRAND CARAVAN|
+-------------+-------------------+

I need to return at most 10,000 rows for each model. Here are the current per-model counts:

+--------------------+-------+
|               Model|  count|
+--------------------+-------+
|                 MDX|1658647|
|               ASTRO| 682657|
|           ENTOURAGE|  72622|
|             ES 300H|  80712|
|            6 SERIES| 145252|
|           GRAN FURY|   9719|
|RANGE ROVER EVOQU...|   4290|
|        LEGACY WAGON|   2070|
|        LEGACY SEDAN|    104|
|  DAKOTA CHASSIS CAB|      8|
|              CAMARO|2028678|
|                  XT|  10009|
|             DYNASTY| 171776|
|                 944|  43044|
|         F430 SPIDER|    506|
|FLEETWOOD SEVENTY...|      6|
|         MONTE CARLO|1040806|
|             LIBERTY|2415456|
|            ESCALADE| 798832|
| SIERRA 3500 CLASSIC|   9541|
+--------------------+-------+

This question is not the same as the suggested duplicate because, as others have noted below, that one only retrieves rows where a value ranks above other values. What I want is: for each value in df['Model'], limit the rows for that value (model) to 10,000 if there are 10,000 or more (pseudo-code, obviously). In other words, if a model has more than 10,000 rows, drop the excess; otherwise keep all of its rows.
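
In plain Python terms, the semantics look roughly like this (illustrative only; the toy `rows` list stands in for the real records, which live in a PySpark dataframe):

from collections import defaultdict

# Toy records standing in for the real dataframe.
rows = [('PONTIAC', 'GRAND AM'), ('PONTIAC', 'GRAND AM'), ('BUICK', 'CENTURY')]
CAP = 10000  # per-model cap
kept, seen = [], defaultdict(int)
for make, model in rows:
    if seen[model] < CAP:   # keep a row only while its model is under the cap
        seen[model] += 1
        kept.append((make, model))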

Ravaal
  • Will distinct() not do what you need? I don't think I understand your question. Can you provide more code and data? – Andy_101 Jan 10 '20 at 11:32
  • Does this answer your question? [Retrieve top n in each group of a DataFrame in pyspark](https://stackoverflow.com/questions/38397796/retrieve-top-n-in-each-group-of-a-dataframe-in-pyspark) – blackbishop Jan 11 '20 at 11:50
  • A window function is one way. You could also partition by make and model, then map each partition with an index and pass in a function that returns all rows where the index is less than 1000 (a sketch of this route follows these comments). – Molotch Jan 11 '20 at 18:11
  • @Ravaal please explain in more detail why the posted duplicate does not answer your question. The best way would be to create a [mcve] with a small number of models and make the threshold a low number (like 3) to show exactly what the desired output is. – pault Jan 14 '20 at 15:28
  • @pault I've tried to explain this so many times by now. I don't understand what is so hard to understand about this. If there are more than 10K rows for a value, ONLY RETURN 10K ROWS FOR THAT VALUE. If there are fewer than 10K rows, RETURN ALL ROWS. This will give me a much smaller sample that I can work with. Unfortunately I don't have the ability to just go into my instance and play with code. It is very expensive for my company... – Ravaal Jan 14 '20 at 15:37
  • How is it a duplicate? The other question only returns the rows if rank <= 2. I want every row I can get for each value; not only the values that have 10K or less rows. It IS NOT a dupe. Feel free to write some code and let's find out if it is a dupe or not. Challenge accepted? – Ravaal Jan 14 '20 at 15:45
  • Change rank to <= 10000 on the duplicate. It will return 10k rows for groups with more than 10k samples, otherwise all rows. – pault Jan 14 '20 at 15:47
  • Just as an update, that answer to the other question did not solve my problem... as I have stated numerous times already. – Ravaal Jan 14 '20 at 21:13
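
For reference, a rough sketch of the RDD-level route Molotch mentions, realized here with keyBy/groupByKey rather than mapPartitionsWithIndex since that is simpler to show. It assumes the df and spark session from the question, and it materializes each group on one executor, so it only suits groups that fit in memory:

# Cap each Model group at N rows via the RDD API (N = 3 for illustration).
N = 3
capped_rdd = (df.rdd
                .keyBy(lambda row: row['Model'])       # (model, row) pairs
                .groupByKey()                          # one iterable of rows per model
                .flatMap(lambda kv: list(kv[1])[:N]))  # keep at most N rows per model
capped_df = spark.createDataFrame(capped_rdd, df.schema)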

4 Answers


I guess you should use row_number with a window (partitionBy plus orderBy) to number the rows within each model, and then filter on that number. For example, the following shuffles each model's rows randomly and keeps at most 10,000 of them:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each model's rows in a random order, keep at most 10,000 of
# them, then drop the helper column before exporting.
window = Window.partitionBy(df['Model']).orderBy(F.rand())
df = (df.select(F.col('*'),
                F.row_number().over(window).alias('row_number'))
        .where(F.col('row_number') <= 10000)
        .drop('row_number'))
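
Since the end goal is a parquet export, the capped frame can then be written straight out; orderBy(F.rand()) makes the kept rows a random sample per model, and any deterministic column would work instead if reproducibility matters (the path below is just a placeholder):

df.write.parquet('reduced_sample.parquet')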
lvnt
  • congrats slmn! Your post was the closest. I made major edits but nonetheless you earned the bounty! – Ravaal Jan 15 '20 at 14:38

Simply do

import pyspark.sql.functions as F

# Count the rows per model, keep only the models that have fewer than
# 10,000 rows, and write the (Model, Count) pairs out to parquet.
df = df.groupBy("Model").agg(F.count(F.lit(1)).alias("Count"))
df = df.filter(df["Count"] < 10000).select("Model", "Count")

df.write.parquet("data.parquet")
LaSul
  • That won't work. I need a sample for every value in the Model column... – Ravaal Jan 10 '20 at 16:38
  • I added the `distinct()` tho, this will do what you want – LaSul Jan 13 '20 at 09:16
  • This would be ideal. – Ravaal Jan 13 '20 at 14:37
  • It didn't end up working. The dataframe went from over 300M records to only 1699. Nice idea though. – Ravaal Jan 13 '20 at 14:43
  • This is working then. This means you have 1699 unique values, so you cannot have more rows than unique values. – LaSul Jan 13 '20 at 15:07
  • It is working for that purpose, but it still isn't an answer to my question. I need "at most" 10,000 rows per unique value. Some values contain 600 rows, others have 18240 rows, I need "at most" 10,000 rows for each value... Sorry to reiterate. Does that make sense? I may end up with 600,000 rows out of the 300M that I already have which will be a much more manageable dataframe to work with. I wish I didn't have to do this but I do so I'm still stuck. No one has answered this question yet. I've exhausted the search. – Ravaal Jan 13 '20 at 16:03
  • So 10K rows as a minimum or a maximum? Because I modified my answer and this should do what you want. – LaSul Jan 13 '20 at 16:15
  • 10k rows as a maximum, without excluding the models that have more than 10k rows; just limiting those to 10k (see the sampling sketch below). – Ravaal Jan 13 '20 at 16:18
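
Picking up on that last exchange: the per-model counts from this answer can also feed PySpark's stratified sampling, sampleBy, to get an approximate cap — approximate because each row is sampled independently, so groups land near rather than exactly at 10,000 (a sketch, not part of the original answer):

# Approximate per-model cap via stratified sampling.
CAP = 10000
counts = df.groupBy('Model').count().collect()
fractions = {row['Model']: min(1.0, CAP / row['count']) for row in counts}
sampled = df.sampleBy('Model', fractions, seed=42)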

If I understand your question, you want to sample some rows (e.g. 10,000), but only from records whose model count is greater than 10,000. If so, this is the answer:

from pyspark.sql.functions import count, lit

# Count rows per (Make, Model), keep the models with more than 10,000
# rows, and write the (Model, count) pairs out to parquet.
df = df.groupBy('Make', 'Model').agg(count(lit(1)).alias('count'))
df = df.filter(df['count'] > 10000).select('Model', 'count')
df.write.parquet('output.parquet')
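
As an aside, count(lit(1)) simply counts every row in each group; the built-in groupBy().count() shorthand is equivalent and needs no imports:

# Equivalent per-group row count; the result column is named 'count'.
df_counts = df.groupBy('Make', 'Model').count()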
ggeop
  • @Ravaal I had added the limit because I supposed that you needed to 'sample', as you said, and not take all the records. OK, nice! So has the question been answered? – ggeop Jan 10 '20 at 21:46
  • I had to retract the answer. It didn't work. Please refer to my edit for the actual dataframe and clarification. – Ravaal Jan 10 '20 at 21:49
  • @Ravaal I edited my answer again, take a look and tell me if it works and if not what is the problem. – ggeop Jan 10 '20 at 22:17
  • Perfect, so @Ravaal is the question answered? – ggeop Jan 10 '20 at 22:27
  • `count()` is not defined. I tried `from pyspark.sql.functions import count` but then count needed an argument, so I passed `'Model'` as the argument to count. I then put in `df = df.filter(df['count']>10000).select('Model','count')` and followed it up by `df.count()` and got 0 results. Unfortunately we do not have an answer here. – Ravaal Jan 10 '20 at 22:47
  • I haven't tested the code, but the concept is right. I added lit(1) to count (import it from functions). – ggeop Jan 10 '20 at 23:14

I will modify the given problem slightly so that it can be visualized here, by reducing the maximum number of rows for each distinct value to 2 rows (instead of 10,000).

Sample dataframe:

df = spark.createDataFrame(
  [('PONTIAC', 'GRAND AM'), ('BUICK', 'CENTURY'), ('LEXUS', 'IS 300'), ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'), ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'), ('MERCEDES-BENZ', 'SL-CLASS'), ('PONTIAC', 'GRAND AM'), ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'), ('CHEVROLET', 'SILVERADO 1500'), ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'), ('MERCEDES-BENZ', 'C-CLASS'), ('TOYOTA', 'TUNDRA'), ('TOYOTA', 'PRIUS'), ('MITSUBISHI', 'MONTERO SPORT'), ('MERCEDES-BENZ', 'SLK-CLASS'), ('TOYOTA', 'CAMRY'), ('JEEP', 'WRANGLER'), ('CHEVROLET', 'SILVERADO 1500'), ('TOYOTA', 'AVALON'), ('FORD', 'RANGER'), ('MERCEDES-BENZ', 'C-CLASS'), ('TOYOTA', 'TUNDRA'), ('FORD', 'EXPLORER SPORT TRAC'), ('CHEVROLET', 'COLORADO'), ('MITSUBISHI', 'MONTERO'), ('DODGE', 'GRAND CARAVAN')],
  ['Make', 'Model']
)

Let's do a row count:

df.groupby('Model').count().show()

+-------------------+-----+
|              Model|count|
+-------------------+-----+
|             AVALON|    2|
|            CENTURY|    1|
|             TUNDRA|    2|
|           WRANGLER|    3|
|           GRAND AM|    3|
|EXPLORER SPORT TRAC|    1|
|            C-CLASS|    2|
|      MONTERO SPORT|    3|
|              CAMRY|    3|
|      GRAND CARAVAN|    1|
|     SILVERADO 1500|    2|
|              PRIUS|    3|
|            MONTERO|    1|
|           COLORADO|    1|
|             RANGER|    2|
|          SLK-CLASS|    3|
|           SL-CLASS|    2|
|             IS 300|    1|
+-------------------+-----+

If I understand your question correctly, you can assign a row number to each row with a partition by Model:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, desc

# Number the rows within each Model partition, ordered by Make descending.
win_1 = Window.partitionBy('Model').orderBy(desc('Make'))
df = df.withColumn('row_num', row_number().over(win_1))


And then filter the rows to where row_num <= 2:

df = df.filter(df.row_num <= 2).select('Make', 'Model')

There should be a total of 2+1+2+2+2+1+2+2+2+1+2+2+1+1+2+2+2+1 = 30 rows

Final results:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|       TOYOTA|             AVALON|
|       TOYOTA|             AVALON|
|        BUICK|            CENTURY|
|       TOYOTA|             TUNDRA|
|       TOYOTA|             TUNDRA|
|         JEEP|           WRANGLER|
|         JEEP|           WRANGLER|
|      PONTIAC|           GRAND AM|
|      PONTIAC|           GRAND AM|
|         FORD|EXPLORER SPORT TRAC|
|MERCEDES-BENZ|            C-CLASS|
|MERCEDES-BENZ|            C-CLASS|
|   MITSUBISHI|      MONTERO SPORT|
|   MITSUBISHI|      MONTERO SPORT|
|       TOYOTA|              CAMRY|
|       TOYOTA|              CAMRY|
|        DODGE|      GRAND CARAVAN|
|    CHEVROLET|     SILVERADO 1500|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|              PRIUS|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|            MONTERO|
|    CHEVROLET|           COLORADO|
|         FORD|             RANGER|
|         FORD|             RANGER|
|MERCEDES-BENZ|          SLK-CLASS|
|MERCEDES-BENZ|          SLK-CLASS|
|MERCEDES-BENZ|           SL-CLASS|
|MERCEDES-BENZ|           SL-CLASS|
|        LEXUS|             IS 300|
+-------------+-------------------+
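
To confirm that no model exceeds the cap, the counts can be re-checked (a quick verification, not part of the original answer):

# Every model should now have at most 2 rows.
df.groupby('Model').count().orderBy('count', ascending=False).show()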
Foxan Ng