I am trying to parse a SQL query and want to call a function for each row of a dataframe. The function is as below:

def updateParser(df):
    # update tab1 set value1 = 0.34 where id = 1111
    # identify positions
    setPos = df.select(instr(df.query, ' set ').alias('set')).collect()[0].set
    wherePos = df.select(instr(df.query, ' where ').alias('where')).collect()[0].where
    idPos = df.select(instr(df.query, ' id').alias('id')).collect()[0].id

    # identify table, fields & values, id
    df = df.withColumn('table', upper(trim(df.query.substr(7, setPos - 7))))
    df = df.withColumn('fieldValueList', upper(trim(df.query.substr(setPos + 5, wherePos - (setPos + 5) + 1))))
    df = df.withColumn('id', upper(trim(df.query.substr(idPos + 5, 10))))

    # identify the column being updated and the value
    df.show(n=5, truncate=False)

And I am calling this via:

updateDF.foreach(updateParser)

But I am getting the below error:

  File "/home/mapr/scripts/cdc.py", line 19, in updateParser
setPos = df.select(instr(df.query, ' set ').alias('set')).collect()[0].set
  File "/opt/mapr/spark/spark-1.5.2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1257, in __getattr__
raise AttributeError(item)
AttributeError: select

I am not using getattr anywhere; is it required? If I skip foreach and just run the function directly on the dataframe (see the call below), it runs fine. Could anyone please advise?
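That is, the direct call that works without error is simply:

updateParser(updateDF)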

manmeet
  • a) This is not valid Python code (at least fix indentation) b) If `updateDF` is a `DataFrame` this is not valid Spark code. – zero323 Apr 21 '16 at 04:22
  • Indentation was lost because the code was copied from the vi editor; the code itself runs fine in pyspark and has been tested both in the CLI and as a pyspark job. – manmeet Apr 21 '16 at 04:36

1 Answer

I found the issue: foreach passes each Row of the dataframe to the function, not the dataframe itself, so I cannot call df.select inside it. Instead I need to work with the Row object and its fields. That is why the AttributeError names select: it is not a valid attribute of a Row. A sketch of a Row-based version is below.
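A minimal sketch of what the Row-based version could look like, assuming the column is named query as in the question. The slice arithmetic mirrors the original instr/substr logic, keeping in mind that Python's find() returns 0-based positions where Spark's instr() is 1-based. Note that anything printed inside foreach goes to the executor logs, not the driver console, so a real job would write the results somewhere instead.

def updateParser(row):
    # row is a pyspark.sql.Row, so use plain Python string
    # methods instead of DataFrame column expressions
    # e.g. update tab1 set value1 = 0.34 where id = 1111
    query = row.query

    # identify positions (0-based, unlike instr)
    setPos = query.find(' set ')
    wherePos = query.find(' where ')
    idPos = query.find(' id')

    # identify table, fields & values, id
    table = query[6:setPos].strip().upper()
    fieldValueList = query[setPos + 5:wherePos].strip().upper()
    rowId = query[idPos + 5:idPos + 15].strip().upper()

    print(table + ' | ' + fieldValueList + ' | ' + rowId)

updateDF.foreach(updateParser)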

manmeet