I'm looking to create a dynamic .withColumn, where the column rule is taken from a list depending on the file being processed.

For example: File A has a column called "Validated" that is based on a different condition from the one in File B, but the column has the same name in both files. So can we loop through all files A-Z, applying a different rule for the same column in each file?

Here I am trying to validate many DataFrames by creating an EmailAddress_Validation field on each one. Each DataFrame has a different email validation rule. The rules are stored in a list called EmailRuleList. As we loop through the datasets, the corresponding rule, EmailRuleList[i], is passed in from the list.

The code below shows the syntax. The line commented out with "#" is an example of a rule. Interestingly, if I supply the rule directly instead of taking it from the list (the commented line), the code works, except that it then obviously applies the same rule to all files.

i=0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    #EmailAddress_Validation = when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i=i+1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", EmailAddress_Validation)

Error Message: col should be Column

EmailRuleList is something like:

['when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),1).otherwise(0)',
 'when((regexp_extract(col("EmailAddress"),EmailRegEx2,0))==(col("EmailAddress")),0).otherwise(1)',
 'when((regexp_extract(col("EmailAddress"),EmailRegEx3,0))==(col("EmailAddress")),0).otherwise(1)',
 'when((regexp_extract(col("EmailAddress"),EmailRegEx4,0))==(col("EmailAddress")),0).otherwise(1)']

I have tried lots of different things but am a bit stuck.

rodders

1 Answer

  • The error is in the last line of the for loop. The when expression that you pass to .withColumn() is actually a string, because each element of EmailRuleList is a string.

  • Since withColumn expects its second argument to be a Column, it raises the error. I get a similar error when I pass a string in the same way as your code does (in withColumn()):

from pyspark.sql.functions import when,col

df.withColumn("check","when(col('gname')=='Ana','yes').otherwise('No')").show()


  • To make it work, I used the eval function, which evaluates the string as Python code and returns the resulting Column. The following code does not throw an error:
from pyspark.sql.functions import when,col

df.withColumn("check",eval("when(col('gname')=='Ana','yes').otherwise('No')")).show()
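
As a rough plain-Python illustration of why eval makes the difference (no Spark involved; the names rule_text and result are made up for this sketch), text stored in a list stays a str until it is evaluated:

```python
# A rule stored as text is only a str; Python does not run it on its own.
rule_text = "len('abc') == 3"
print(type(rule_text).__name__)   # str

# eval() parses and runs the text, returning the object it produces.
result = eval(rule_text)
print(type(result).__name__)      # bool
```

In the same way, each element of EmailRuleList is a str, and only the evaluated expression is a Column.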


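  • So, modify your code as shown below to make it work. The names used inside each rule string (when, col, regexp_extract, EmailRegEx and so on) must be defined at the point where eval is called: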
i=0
for FileProcessName in FileProcessListName:
    EmailAddress_Validation = EmailRuleList[i]
    #EmailAddress_Validation = when((regexp_extract(col("EmailAddress"),EmailRegEx,0))==(col("EmailAddress")),0).otherwise(1)
    print(EmailAddress_Validation)
    print(FileProcessName)
    i=i+1
    vars()[FileProcessName] = vars()[FileProcessName].withColumn("EmailAddress_Validation", eval(EmailAddress_Validation))
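
An alternative worth considering, since eval runs whatever text it is given and so should only see rule strings you trust: keep the rules as real Column expressions rather than strings, and keep the DataFrames in a dict rather than looking them up through vars(). The sketch below is only one possible shape; the helper add_validation_columns and the keys "FileA" and "FileB" are hypothetical names, not part of the original code.

```python
def add_validation_columns(frames, rules, column_name="EmailAddress_Validation"):
    """Apply each file's own rule to its DataFrame and return the results.

    frames: dict mapping a file name to its DataFrame
    rules:  dict mapping the same file names to Column expressions
            (real expressions, not strings, so no eval is needed)
    """
    # Fail early if a file has no rule, instead of part-way through the loop.
    missing = sorted(set(frames) - set(rules))
    if missing:
        raise KeyError(f"No rule defined for: {missing}")
    # withColumn returns a new DataFrame, so collect the results in a new dict.
    return {
        name: df.withColumn(column_name, rules[name])
        for name, df in frames.items()
    }


# Hypothetical usage with PySpark (assumes an active SparkSession and that
# FileA, FileB, EmailRegEx and EmailRegEx2 are already defined):
#
# from pyspark.sql.functions import col, regexp_extract, when
#
# rules = {
#     "FileA": when(regexp_extract(col("EmailAddress"), EmailRegEx, 0)
#                   == col("EmailAddress"), 1).otherwise(0),
#     "FileB": when(regexp_extract(col("EmailAddress"), EmailRegEx2, 0)
#                   == col("EmailAddress"), 0).otherwise(1),
# }
# frames = add_validation_columns({"FileA": FileA, "FileB": FileB}, rules)
```

Because the rules are keyed by file name, there is no index counter to keep in step with the list of files.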
Saideep Arikontham