
I need to select 2 columns from a fact table (attached below). The problem is that for one of the columns I need unique values, while for the other I'm happy to have duplicates, as they belong to a specific ticket id.

Fact table used:

from pyspark.sql import functions as f  # needed for f.col

df = (
    spark.table(f'nn_table_{country}.fact_table')
    .filter(f.col('date_key').between(start_date, end_date))
    .filter(f.col('is_client_plus') == 1)
    .filter(f.col('source') == 'tickets')
    .filter(f.col('subtype') == 'item_pm')
    # isin matches either id; a plain == against 'DISC0000077144 | DISC0000076895' would match nothing
    .filter(f.col('external_id').isin('DISC0000077144', 'DISC0000076895'))
    .filter(f.col('external_id').isNotNull())
    .select('customer_id', 'external_id').distinct()
    # .join(dim_promotions, 'external_id', 'left')
)

display(df)

As you can see, the select statement contains the customer_id and external_id columns, but I'm only interested in getting unique customer_id values.

.select('customer_id','external_id').distinct()

Desired output:

customer_id             external_id
77000000505097070       DISC0000077144 
77000002294023644       DISC0000077144 
77000000385346302       DISC0000076895
77000000291101490       DISC0000076895

Any idea how to do that, or whether it's even possible?

Thanks in advance!

Does this (or one of many similar posts) answer your question? [How to get distinct rows in dataframe using pyspark?](https://stackoverflow.com/questions/38649793/how-to-get-distinct-rows-in-dataframe-using-pyspark) – mazaneicha Dec 22 '20 at 15:00

1 Answer


Use dropDuplicates:

df.select('customer_id','external_id').dropDuplicates(['customer_id'])
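Note that dropDuplicates(['customer_id']) keeps one arbitrary row per customer_id: if the same customer appears under both external ids, there's no guarantee which row survives. A minimal, self-contained sketch of this, plus a deterministic alternative using groupBy (the toy data below is made up for illustration):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# toy data: the first customer appears under both external ids
data = [
    ('77000000505097070', 'DISC0000077144'),
    ('77000000505097070', 'DISC0000076895'),
    ('77000000385346302', 'DISC0000076895'),
]
demo = spark.createDataFrame(data, ['customer_id', 'external_id'])

# one row per customer_id; which external_id survives is arbitrary
demo.dropDuplicates(['customer_id']).show()

# deterministic alternative: pick e.g. the minimum external_id per customer
demo.groupBy('customer_id').agg(f.min('external_id').alias('external_id')).show()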
mck