
I have a column filled with a bunch of states' initials as strings. My goal is to get the count of each state in the column.

For example: (("TX": 3), ("NJ": 2)) should be the output when there are three occurrences of "TX" and two occurrences of "NJ".

I'm fairly new to pyspark so I'm stumped with this problem. Any help would be much appreciated.

– madsthaks
  • I don't know a thing about pyspark, but if your collection of strings is iterable, you can just pass it to a [`collections.Counter`](https://docs.python.org/2/library/collections.html#collections.Counter), which exists for the express purpose of counting distinct values. – Kevin Feb 25 '17 at 02:35
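A minimal sketch of the `collections.Counter` idea from the comment above, assuming the column has first been collected to the driver as a plain Python list (only sensible for small data; the `states` list here is a stand-in for that):

from collections import Counter

# Hypothetical list of state initials; with a PySpark dataframe you could
# obtain one via [row['state'] for row in df.select('state').collect()]
states = ['TX', 'NJ', 'TX', 'CA', 'NJ']

counts = Counter(states)
print(counts.most_common())  # [('TX', 2), ('NJ', 2), ('CA', 1)]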

2 Answers


I think you're looking for the DataFrame idiom of `groupBy` and `count`.

For example, given the following dataframe, one state per row:

df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
df.show()
+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+

Grouping by the state column and counting then yields:

df.groupBy('state').count().show()
+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+
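If you need that result back on the driver as a list of (state, count) tuples, like the format in the question, you can collect the same aggregation (a small sketch; note that row order after collect() is not guaranteed unless you sort first):

rows = df.groupBy('state').count().collect()
print([(row['state'], row['count']) for row in rows])
# e.g. [('TX', 2), ('NJ', 2), ('CA', 1)]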
– eddies
  • how can I get a sorted list as the output? – Rotail Oct 07 '19 at 19:19
  • add `.sort('state')` like `df.groupBy('state').count().sort('state').show()` – Paul Nov 07 '19 at 00:37
  • just to add to @Paul's comment, link to the api docs for orderBy/sort: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy (which also demonstrates how to specify ascending/descending order) – eddies Nov 07 '19 at 08:24

A more general helper along the same lines: it aggregates the counts once and returns the top n values as a pandas dataframe, with a few sorting options:

import pandas as pd
import pyspark.sql.functions as F

def value_counts(spark_df, colm, order=1, n=10):
    """
    Count top n values in the given column and show in the given order

    Parameters
    ----------
    spark_df : pyspark.sql.dataframe.DataFrame
        Data
    colm : string
        Name of the column to count values in
    order : int, default=1
        1: sort the column descending by value counts and keep nulls at top
        2: sort the column ascending by values
        3: sort the column descending by values
        4: do 2 and 3 (combine top n and bottom n after sorting the column
           by values ascending)
    n : int, default=10
        Number of top values to display

    Returns
    -------
    Value counts in a pandas dataframe
    """
    # Aggregate once; each branch only changes the ordering
    counts = spark_df.select(colm).groupBy(colm).count()

    if order == 1:
        return pd.DataFrame(counts.orderBy(F.desc_nulls_first("count")).head(n),
                            columns=["value", "count"])
    if order == 2:
        return pd.DataFrame(counts.orderBy(F.asc(colm)).head(n),
                            columns=["value", "count"])
    if order == 3:
        return pd.DataFrame(counts.orderBy(F.desc(colm)).head(n),
                            columns=["value", "count"])
    if order == 4:
        # Top n by ascending values plus top n by descending values
        return pd.concat([
            pd.DataFrame(counts.orderBy(F.asc(colm)).head(n),
                         columns=["value", "count"]),
            pd.DataFrame(counts.orderBy(F.desc(colm)).head(n),
                         columns=["value", "count"]),
        ])
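As a usage sketch against the dataframe from the first answer (the row order among tied counts is not guaranteed):

value_counts(df, 'state', order=1, n=10)
#   value  count
# 0    TX      2
# 1    NJ      2
# 2    CA      1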
– gench