I've a df like this:
Zip | City |
---|---|
'River' | 'London' |
' ' | 'Rome' |
'River' | 'Berlin' |
'River' | 'Madrid' |
' ' | 'Munich' |
' ' | 'Paris' |
SAS code below which I've converted in Pyspark
if Zip = ''
then do Flud = City;
Flag = 'City';
end;
else do Flud = Zip;
Flag = 'Zip';
end;
So I expected output like this:
Zip | City | Flud | Flag |
---|---|---|---|
'River' | 'London' | 'River' | 'Zip' |
' ' | 'Rome' | 'Rome' | 'City' |
'River' | 'Berlin' | 'River' | 'Zip' |
'River' | 'Madrid' | 'River' | 'Zip' |
' ' | 'Munich' | 'Munich' | 'City' |
' ' | 'Paris' | 'Paris' | 'City' |
I've converted in Pyspark but is giving me error:
my code:
output=df.withColumn('Flag',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
.otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
.otherwise(col('Zip'))))
Pyspark give me this error:
AnalysisException: Column 'Flag' does not exist.
So I've tried to create the variable first
df= df.withColumn("Flag", lit(''))
and write this new code:
output=df.withColumn('Flud',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
.otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
.otherwise(col('Zip'))))
and now I've this error:
Column 'Flud' does not exist.
So I've tried to create the second variable first
df= df.withColumn("Flud", lit(''))
and write this new code:
output=df.withColumn('Flud',when((col('Zip').isNull()) & (col('Flag') == 'City'), col('Flud')==col('City'))
.otherwise(when((col('Zip').isNotNull()) & (col('Flag') == 'Zip'), col('Flud')==col('Zip'))
.otherwise(col('Zip'))))
and now I've this error:
AnalysisException: cannot resolve 'CASE WHEN ((Flud.Zip IS NOT NULL) AND (Flag = 'Zip')) THEN (Flud = Flud.Zip) ELSE Flud.Zip END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type, got CASE WHEN ... THEN boolean ELSE string END;
Someone can help me?