
I have a function that creates a pyspark WindowSpec, partitioning on either a single column or a pair of columns in a list, depending on a boolean parameter. Mypy throws an error that I can't understand, because my partition_cols variable should be Union[str, List[str]], which should be acceptable to Window.partitionBy().

Example method and error:

from pyspark.sql import Window, WindowSpec


def get_window(single_column: bool) -> WindowSpec:

    partition_cols = "key" if single_column else ["key", "name"]

    return Window.partitionBy(partition_cols).orderBy("timestamp").rangeBetween(0, 10)

Then running mypy:

$ mypy tmp.py
tmp.py:8: error: Argument 1 to "partitionBy" of "Window" has incompatible type "Sequence[str]"; expected "Union[Union[Column, str], List[Union[Column, str]]]"  [arg-type]
  • Wow, it's a really bad design decision in pyspark. First, `list` is invariant, so even `list[str]` is not allowed for this function, so unpacking is the only valid option. It also checks for `isinstance(..., list)`, so you cannot pass a `tuple` or other sequences either, and maybe even `set` should be supported here. This is awkward: either deny this scenario (allow only `*str`) or support it properly; `list` is not the only sequence in Python! – STerliakov Oct 12 '22 at 14:33
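
A minimal sketch (my own illustration, not from the post) of what the comment above describes; the function names are hypothetical:

from typing import List, Union

from pyspark.sql import Column, Window, WindowSpec

cols: List[str] = ["key", "name"]


def rejected() -> WindowSpec:
    # Because list is invariant, List[str] is not a List[Union[Column, str]],
    # so mypy flags this call (suppressed here with an ignore).
    return Window.partitionBy(cols)  # type: ignore[arg-type]


def accepted_unpacked() -> WindowSpec:
    # Each unpacked element is a plain str, which matches Union[Column, str].
    return Window.partitionBy(*cols)


def accepted_annotated() -> WindowSpec:
    # Annotating the list with the exact element union also satisfies mypy.
    explicit: List[Union[Column, str]] = ["key", "name"]
    return Window.partitionBy(explicit)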

1 Answer


Something similar came up in this thread: Partitioning by multiple columns in PySpark with columns in a list

I found that this solution worked for me :)

column_list = ["col1","col2"]

win_spec = Window.partitionBy(*column_list)
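
Applied to the get_window function from the question, the same unpacking approach could look like this (a sketch reusing the question's column names, not code from the answer itself):

from typing import List

from pyspark.sql import Window, WindowSpec


def get_window(single_column: bool) -> WindowSpec:
    # Always build a list of names, then unpack it so every argument to
    # partitionBy is a plain str.
    partition_cols: List[str] = ["key"] if single_column else ["key", "name"]
    return Window.partitionBy(*partition_cols).orderBy("timestamp").rangeBetween(0, 10)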