This seems like it should be pretty simple, but I'm stumped for some reason. I have a list of PySpark columns that I would like to sort by name (including aliasing, since that is how they will be displayed/written to disk). Here are some example tests and things I've tried:
def test_col_sorting():
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # An active Spark session is needed
    spark = SparkSession.builder.getOrCreate()

    # Data to sort
    cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

    # Attempt 1
    result = sorted(cols)
    # This fails with ValueError: Cannot convert column into bool: please use
    # '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame
    # boolean expressions.

    # Attempt 2
    result = sorted(cols, key=lambda x: x.name())
    # Fails for the same reason: `name()` is an alias for `alias()` and
    # returns a Column object, not a string

    # Assertion I want to hold true:
    assert result == [f.col('a'), f.col('c'), f.col('b').alias('z')]
Is there a reasonable way to get back the string that was used to initialize the Column object (while also respecting aliasing)? If I could get that string from the object, I could use it as a sort key.
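For reference, the closest workaround I've found is parsing the Column's string representation, which in recent PySpark versions looks like `Column<'c'>` or `Column<'b AS z'>` for aliased columns. This relies on the repr format, which is not a stable public API, so I'd consider it a fragile sketch rather than a real solution:

```python
import re

def parse_column_repr(s):
    # str(col) looks like "Column<'c'>" or "Column<'b AS z'>" in recent
    # PySpark versions -- this relies on the repr format, which is NOT a
    # stable public API and may differ across versions.
    m = re.match(r"Column<'(.*)'>", s)
    expr = m.group(1) if m else s
    # Keep only the alias if one is present: "b AS z" -> "z"; plain "c" -> "c"
    return expr.split(" AS ")[-1]

def column_sort_key(col):
    # Usable as: sorted(cols, key=column_sort_key)
    return parse_column_repr(str(col))
```

Even if this works today, I'd prefer something that doesn't depend on parsing a repr string.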
Note that I am NOT looking to sort the columns of a DataFrame, as answered in this question: Python/pyspark data frame rearrange columns. These Column objects are not bound to any DataFrame. I also do not want to sort the columns based on their values.