I am trying to parse a SQL query and want to call a function for each row of a dataframe. The function is as below:

def updateParser(df):
    # update tab1 set value1 = 0.34 where id = 1111
    # identify positions
    setPos = df.select(instr(df.query, ' set ').alias('set')).collect()[0].set
    wherePos = df.select(instr(df.query, ' where ').alias('where')).collect()[0].where
    idPos = df.select(instr(df.query, ' id').alias('id')).collect()[0].id

    # identify table, fields & values, id
    df = df.withColumn('table', upper(trim(df.query.substr(7, setPos - 7))))
    df = df.withColumn('fieldValueList', upper(trim(df.query.substr(setPos + 5, wherePos - (setPos + 5) + 1))))
    df = df.withColumn('id', upper(trim(df.query.substr(idPos + 5, 10))))

    # identify the column being updated and the value
    df.show(n=5, truncate=False)

And I am calling this via:

updateDF.foreach(updateParser)

But I am getting the below error:

  File "/home/mapr/scripts/cdc.py", line 19, in updateParser
setPos = df.select(instr(df.query, ' set ').alias('set')).collect()[0].set
  File "/opt/mapr/spark/spark-1.5.2/python/lib/pyspark.zip/pyspark/sql/types.py", line 1257, in __getattr__
raise AttributeError(item)
AttributeError: select

I am not using getattr anywhere; is it required? If I skip foreach and just run the function directly on the dataframe (see the call below), it runs fine. Could anyone please advise?
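That is, the direct call that works without error is simply:

updateParser(updateDF)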

manmeet
  • a) This is not valid Python code (at least fix indentation) b) If `updateDF` is a `DataFrame` this is not valid Spark code. – zero323 Apr 21 '16 at 04:22
  • Indentation was lost because the code was copied from the vi editor; the code itself runs fine in pyspark and has been tested both in the CLI and as a pyspark job. – manmeet Apr 21 '16 at 04:36

1 Answer

I found the issue: foreach passes each Row of the dataframe to the function, not the dataframe itself, so I cannot call df.select inside it. Instead I need to work with the Row object and its fields. That is why the AttributeError names select: it is not a valid attribute of a Row. A sketch of a Row-based version is below.
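A minimal sketch of what the Row-based version could look like, assuming the column is named query as in the question. The slice arithmetic mirrors the original instr/substr logic, keeping in mind that Python's find() returns 0-based positions where Spark's instr() is 1-based. Note that anything printed inside foreach goes to the executor logs, not the driver console, so a real job would write the results somewhere instead.

def updateParser(row):
    # row is a pyspark.sql.Row, so use plain Python string
    # methods instead of DataFrame column expressions
    # e.g. update tab1 set value1 = 0.34 where id = 1111
    query = row.query

    # identify positions (0-based, unlike instr)
    setPos = query.find(' set ')
    wherePos = query.find(' where ')
    idPos = query.find(' id')

    # identify table, fields & values, id
    table = query[6:setPos].strip().upper()
    fieldValueList = query[setPos + 5:wherePos].strip().upper()
    rowId = query[idPos + 5:idPos + 15].strip().upper()

    print(table + ' | ' + fieldValueList + ' | ' + rowId)

updateDF.foreach(updateParser)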

manmeet