I am trying to build the select clause for a PySpark DataFrame dynamically, and I keep getting an error saying 'cannot resolve ... given input columns: [value]'.
import csv
from pyspark.sql.functions import split

split_col = split(df[column_name], delimiter)

# Read the column names for this record type from the schema file
# and build the select expression as one comma-separated string
o_str = ''
with open(schema_file, 'r') as f:
    for row in csv.reader(f):
        if row[0] == rec_type:
            col_names = row[1:]
            for i in range(len(col_names)):
                o_str += "split_col.getItem(" + str(i) + ").alias('" + col_names[i] + "'),"

# Strip the trailing comma and project
df_out = df.select(o_str.rsplit(',', 1)[0])
The snippet above is what I am running. Here o_str.rsplit(',', 1)[0] resolves to:

split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')
When I hardcode that value in the select clause it works fine, but when I generate it dynamically as above I get the following error:
An error occurred while calling o44.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')`' given input columns: [value];;
'Project ['split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')']
+- Filter StartsWith(value#2, L)
+- Relation[value#2] text
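
From the plan it looks like the entire generated string is being treated as a single column name rather than evaluated as expressions. A minimal sketch of what I think is happening (the one-column DataFrame here is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("L|1|42",)], ["value"])

# A plain string passed to select() is looked up as a single column
# name, not evaluated as Python code:
df.select("value")  # fine: a column named 'value' exists

# The whole generated expression is taken as one (nonexistent)
# column name, so analysis fails with "cannot resolve":
df.select("split_col.getItem(0).alias('RCD_TYPE')")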
# This works:
df_out = df.select(split_col.getItem(0).alias('RCD_TYPE'), split_col.getItem(1).alias('VER'), split_col.getItem(2).alias('ID'), split_col.getItem(3).alias('OL_IND'), split_col.getItem(4).alias('PERS_ID'))

# This does not work:
df_out = df.select(o_str.rsplit(',', 1)[0])
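
I suspect select() needs actual Column objects rather than a string of source code. A sketch of the list-based version I am considering (untested; col_names is the list of names read from the schema file):

from pyspark.sql.functions import split

split_col = split(df[column_name], delimiter)

# Build real Column objects instead of a source-code string,
# then unpack the list into select()
select_cols = [split_col.getItem(i).alias(name)
               for i, name in enumerate(col_names)]
df_out = df.select(*select_cols)

Is building a list of Columns like this the right approach, or is there a way to make the string version work?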