I am trying to build the select clause for a PySpark DataFrame dynamically, and I keep getting an error saying 'cannot resolve ... given input columns: [value]'.
import csv
from pyspark.sql.functions import split

split_col = split(df[column_name], delimiter)

# Read the column names for this record type from the schema file
# and build the select expression as one comma-separated string
o_str = ''
with open(schema_file, 'r') as f:
    for row in csv.reader(f):
        if row[0] == rec_type:
            col_names = row[1:]
            for i in range(len(col_names)):
                o_str += "split_col.getItem(" + str(i) + ").alias('" + col_names[i] + "'),"

# Strip the trailing comma and project
df_out = df.select(o_str.rsplit(',', 1)[0])
The snippet above is what I am running. Here o_str.rsplit(',', 1)[0] resolves to:

split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')
When I hardcode that value in the select clause it works fine, but when I generate it dynamically as above I get the following error:
An error occurred while calling o44.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')`' given input columns: [value];;
'Project ['split_col.getItem(0).alias('RCD_TYPE'),split_col.getItem(1).alias('VER'),split_col.getItem(2).alias('ID'),split_col.getItem(3).alias('OL_IND'),split_col.getItem(4).alias('PERS_ID')']
+- Filter StartsWith(value#2, L)
+- Relation[value#2] text
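
From the plan it looks like the entire generated string is being treated as a single column name rather than evaluated as expressions. A minimal sketch of what I think is happening (the one-column DataFrame here is made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("L|1|42",)], ["value"])

# A plain string passed to select() is looked up as a single column
# name, not evaluated as Python code:
df.select("value")  # fine: a column named 'value' exists

# The whole generated expression is taken as one (nonexistent)
# column name, so analysis fails with "cannot resolve":
df.select("split_col.getItem(0).alias('RCD_TYPE')")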
# This works:
df_out = df.select(split_col.getItem(0).alias('RCD_TYPE'), split_col.getItem(1).alias('VER'), split_col.getItem(2).alias('ID'), split_col.getItem(3).alias('OL_IND'), split_col.getItem(4).alias('PERS_ID'))

# This does not work:
df_out = df.select(o_str.rsplit(',', 1)[0])
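
I suspect select() needs actual Column objects rather than a string of source code. A sketch of the list-based version I am considering (untested; col_names is the list of names read from the schema file):

from pyspark.sql.functions import split

split_col = split(df[column_name], delimiter)

# Build real Column objects instead of a source-code string,
# then unpack the list into select()
select_cols = [split_col.getItem(i).alias(name)
               for i, name in enumerate(col_names)]
df_out = df.select(*select_cols)

Is building a list of Columns like this the right approach, or is there a way to make the string version work?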