Filter rows by distinct values in one column in PySpark

Question

Let's say I have the following table:

+--------------------+--------------------+------+------------+--------------------+
|                host|                path|status|content_size|                time|
+--------------------+--------------------+------+------------+--------------------+
|js002.cc.utsunomi...|/shuttle/resource...|   404|           0|1995-08-01 00:07:...|
|    tia1.eskimo.com |/pub/winvn/releas...|   404|           0|1995-08-01 00:28:...|
|grimnet23.idirect...|/www/software/win...|   404|           0|1995-08-01 00:50:...|
|miriworld.its.uni...|/history/history.htm|   404|           0|1995-08-01 01:04:...|
|      ras38.srv.net |/elv/DELTA/uncons...|   404|           0|1995-08-01 01:05:...|
| cs1-06.leh.ptd.net |                    |   404|           0|1995-08-01 01:17:...|
|dialip-24.athenet...|/history/apollo/a...|   404|           0|1995-08-01 01:33:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:35:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:36:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|  h96-158.ccnet.com |/history/apollo/a...|   404|           0|1995-08-01 01:37:...|
|hsccs_gatorbox07....|/pub/winvn/releas...|   404|           0|1995-08-01 01:44:...|
|www-b2.proxy.aol....|/pub/winvn/readme...|   404|           0|1995-08-01 01:48:...|
|www-b2.proxy.aol....|/pub/winvn/releas...|   404|           0|1995-08-01 01:48:...|
+--------------------+--------------------+------+------------+--------------------+

How would I filter this table so that it has only distinct paths in PySpark? The result should still contain all columns.

score 39 · Accepted Answer · answered Sep 02 '16 at 09:11

If you want to keep only rows whose value in a specific column is distinct (one row per value), call the dropDuplicates method on the DataFrame and pass it that column. For example:

dataFrame = ...
# dropDuplicates returns a new DataFrame, so keep the result
deduped = dataFrame.dropDuplicates(['path'])

where path is the name of the column to deduplicate on.

likern

Among duplicate records, how does dropDuplicates decide which record to delete? – prudhvi Indana Feb 05 '18 at 23:05
@prudhviIndana You can't tune this behaviour. If you need that, you should probably use a different query, for example one built on **filter** / **groupby** – likern Mar 20 '18 at 20:20
Not true. See here for examples of how to only keep the first occurrence in an ordered dataframe: https://stackoverflow.com/a/54738843/4166885 – Juergen Sep 23 '19 at 15:53

score 1 · Answer 2 · edited Sep 07 '22 at 08:51

As for controlling which records are kept and which are discarded: if you can work your conditions into a Window expression, you can use something like this. This is in Scala (more or less), but I imagine you can do it in PySpark, too.

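import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy('columns, 'to, 'make, 'unique).orderBy('conditionToPutRowToKeepFirst)

dataframe.withColumn("row_number", row_number().over(window)).where('row_number === 1).drop('row_number)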
