
My question is similar to this thread: Partitioning by multiple columns in Spark SQL

but I'm working in PySpark rather than Scala, and I want to pass in my list of columns as a list. I want to do something like this:

column_list = ["col1","col2"]
win_spec = Window.partitionBy(column_list)

I can get the following to work:

win_spec = Window.partitionBy(col("col1"))

This also works:

col_name = "col1"
win_spec = Window.partitionBy(col(col_name))

And this also works:

win_spec = Window.partitionBy([col("col1"), col("col2")])
– prk

3 Answers


Convert the column names to column expressions with a list comprehension, `[col(x) for x in column_list]`:

from pyspark.sql.functions import col
from pyspark.sql import Window
column_list = ["col1","col2"]
win_spec = Window.partitionBy([col(x) for x in column_list])
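
For illustration, here is a minimal self-contained sketch of that pattern in use (the DataFrame and values are made up for the example; row_number needs an explicit orderBy):

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "x", 2), ("b", "y", 3)],
    ["col1", "col2", "val"]
)

column_list = ["col1", "col2"]
win_spec = Window.partitionBy([col(x) for x in column_list]).orderBy("val")

# number the rows within each (col1, col2) partition
df.withColumn("rn", row_number().over(win_spec)).show()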
– Psidom
  • Update for people coming to this answer: newer versions of PySpark allow you to pass in a list, like the answers below. See @Naguveeru's answer. – EnterPassword Mar 31 '22 at 02:33
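
As a quick illustration of that comment, on newer PySpark versions the list can be passed to partitionBy as-is (a sketch; the column names are placeholders):

from pyspark.sql import Window

column_list = ["col1", "col2"]
# newer PySpark accepts a plain list of column names directly
win_spec = Window.partitionBy(column_list)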

In PySpark >= 2.4, this also works:

from pyspark.sql import Window

column_list = ["col1","col2"]
win_spec = Window.partitionBy(*column_list)
– Naguveeru
  • Just a quick question: the `*` stands for all items in `column_list`, did I understand the code correctly? – Vanessa_C Oct 06 '22 at 15:17
  • @Vanessa_C quite late, but the `*` operator is used to unpack an iterable into a function call in Python. Instead of manually passing `"col1", "col2"` to the function call, `*column_list` does exactly that without having to spell out each element of the list. – Quan Bui Nov 28 '22 at 07:52
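
To make the unpacking concrete, here is a tiny pure-Python sketch (show_args is a made-up helper, not part of PySpark):

def show_args(*args):
    print(args)

column_list = ["col1", "col2"]
show_args("col1", "col2")   # prints ('col1', 'col2')
show_args(*column_list)     # unpacks the list: same call, same output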

Your first attempt should work.

Consider the following example:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("a", "apple", 1),
        ("a", "orange", 2),
        ("a", "orange", 3),
        ("b", "orange", 3),
        ("b", "orange", 5)
    ],
    ["name", "fruit", "value"]
)
df.show()
#+----+------+-----+
#|name| fruit|value|
#+----+------+-----+
#|   a| apple|    1|
#|   a|orange|    2|
#|   a|orange|    3|
#|   b|orange|    3|
#|   b|orange|    5|
#+----+------+-----+

Suppose you wanted to calculate a fraction of the sum for each row, grouping by the first two columns:

cols = ["name", "fruit"]
w = Window.partitionBy(cols)
df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')]).show()

#+----+------+--------+
#|name| fruit|fraction|
#+----+------+--------+
#|   a| apple|     1.0|
#|   b|orange|   0.375|
#|   b|orange|   0.625|
#|   a|orange|     0.6|
#|   a|orange|     0.4|
#+----+------+--------+
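
As a quick sanity check on that output, the fractions within each (name, fruit) partition should sum to 1.0 (continuing from the example above):

result = df.select(cols + [(f.col('value') / f.sum('value').over(w)).alias('fraction')])
result.groupBy("name", "fruit").agg(f.sum("fraction")).show()
# every group sums to 1.0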
– pault