
I have a column filled with a bunch of states' initials as strings. My goal is to get the count of each state in the column.

For example: (("TX": 3), ("NJ": 2)) should be the output when there are three occurrences of "TX" and two occurrences of "NJ".

I'm fairly new to pyspark so I'm stumped with this problem. Any help would be much appreciated.

– madsthaks
  • I don't know a thing about pyspark, but if your collection of strings is iterable, you can just pass it to a [`collections.Counter`](https://docs.python.org/2/library/collections.html#collections.Counter), which exists for the express purpose of counting distinct values. – Kevin Feb 25 '17 at 02:35
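A minimal sketch of the `collections.Counter` idea from the comment above, assuming the column has first been collected to the driver as a plain Python list (only sensible for small data; the `states` list here is a stand-in for that):

from collections import Counter

# Hypothetical list of state initials; with a PySpark dataframe you could
# obtain one via [row['state'] for row in df.select('state').collect()]
states = ['TX', 'NJ', 'TX', 'CA', 'NJ']

counts = Counter(states)
print(counts.most_common())  # [('TX', 2), ('NJ', 2), ('CA', 1)]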

2 Answers


I think you're looking for the DataFrame idiom of `groupBy` and `count`.

For example, given the following dataframe, one state per row:

df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()
+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+

Grouping by the state column and counting then yields:

df.groupBy('state').count().show()
+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
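If you need that result back on the driver as a list of (state, count) tuples, like the format in the question, you can collect the same aggregation (a small sketch; note that row order after collect() is not guaranteed unless you sort first):

rows = df.groupBy('state').count().collect()
print([(row['state'], row['count']) for row in rows])
# e.g. [('TX', 2), ('NJ', 2), ('CA', 1)]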
– eddies
  • how can I get a sorted list as the output? – Rotail Oct 07 '19 at 19:19
  • add `.sort('state')` like `df.groupBy('state').count().sort('state').show()` – Paul Nov 07 '19 at 00:37
  • just to add to @Paul's comment, link to the api docs for orderBy/sort: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy (which also demonstrates how to specify ascending/descending order) – eddies Nov 07 '19 at 08:24

A more general helper along the same lines: it aggregates the counts once and returns the top n values as a pandas dataframe, with a few sorting options:

import pandas as pd
import pyspark.sql.functions as F

def value_counts(spark_df, colm, order=1, n=10):
    """
    Count top n values in the given column and show in the given order

    Parameters
    ----------
    spark_df : pyspark.sql.dataframe.DataFrame
        Data
    colm : string
        Name of the column to count values in
    order : int, default=1
        1: sort the column descending by value counts and keep nulls at top
        2: sort the column ascending by values
        3: sort the column descending by values
        4: do 2 and 3 (combine top n and bottom n after sorting the column
           by values ascending)
    n : int, default=10
        Number of top values to display

    Returns
    -------
    Value counts in a pandas dataframe
    """
    # Aggregate once; each branch only changes the ordering
    counts = spark_df.select(colm).groupBy(colm).count()

    if order == 1:
        return pd.DataFrame(counts.orderBy(F.desc_nulls_first("count")).head(n),
                            columns=["value", "count"])
    if order == 2:
        return pd.DataFrame(counts.orderBy(F.asc(colm)).head(n),
                            columns=["value", "count"])
    if order == 3:
        return pd.DataFrame(counts.orderBy(F.desc(colm)).head(n),
                            columns=["value", "count"])
    if order == 4:
        # Top n by ascending values plus top n by descending values
        return pd.concat([
            pd.DataFrame(counts.orderBy(F.asc(colm)).head(n),
                         columns=["value", "count"]),
            pd.DataFrame(counts.orderBy(F.desc(colm)).head(n),
                         columns=["value", "count"]),
        ])
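As a usage sketch against the dataframe from the first answer (the row order among tied counts is not guaranteed):

value_counts(df, 'state', order=1, n=10)
#   value  count
# 0    TX      2
# 1    NJ      2
# 2    CA      1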
– gench