I am using Spark 1.2.0 with Python.

My problem is that, in a SQL query, if the value of a field is zero, I need to replace it with some other value.

I have tried CASE/COALESCE, which works in 1.4.0 but not in 1.2.0:

case when COALESCE("+fld+",0)=0 then "+str(numavgnumlst[0][lock])+" else "+fld+" end.

However, for 1.2.0 I tried to do the same with map:

sc = SparkContext(appName="RunModelCCATTR")
sqlContext=SQLContext(sc)
sqlstr="select ..."
nonzerodf=sqlContext.sql(sqlstr)
.....
iifdatadf=nonzerodf.map(lambda candrow:replacezeroforrow(candrow,numavgnumlst))

....
def replacezeroforrow(rowfields,avgvalfields):
   ind=0
   lent=len(rowfields)
   for rowfield in rowfields[4:lent]:
    if rowfield==0:
     rowfields[ind]=avgvalfields[ind]
    ind=ind+1
   return rowfields;

This throws the error:

TypeError: 'Row' object does not support item assignment

I am not sure what I can do to achieve this in Spark 1.2.0.

Thanks for the help, I think it is working now, except that the order of the columns seems to have changed, but that may not be an issue. Thanks again.

Edit:

The idea helped me a lot; it needed a small modification to solve the immediate problem:

def replacezeroforrow(rowfields,avgvalfields,dont_replace=[]):
    rdict = rowfields.asDict()
    return Row(dict([(k,avgvalfields[k] if v == 0 and k not in dont_replace else v ) for (k,v) in rdict.items()]))

I modified the original solution to avoid a syntax error at 'for'.

The method is called as follows:

restrictdict=[FieldSet1,FieldSet2,FieldSet3,FieldSet4,modeldepvarcat[0]]
iifdatadf=nonzerodf.map(lambda candrow: replacezeroforrow(candrow,numavgnumlst[0].asDict(),restrictdict))

However, when I now try to access iifdatadf,

frstln= iifdatadf.first()
print frstln

I get the following error:

  return "<Row(%s)>" % ", ".join(self)
TypeError: sequence item 0: expected string, dict found

I would hugely appreciate any help.

zero323
P RAY
  • It is preferred if you can post separate questions instead of combining your questions into one. That way, it helps the people answering your question and also others hunting for at least one of your questions. Thanks! – NightShadeQueen Aug 29 '15 at 13:44

1 Answer

You can use dictionaries instead of lists and simply return a new row:

def replacezeroforrow(row, avgvalfields):
    rdict = row.asDict()
    return Row(**{k: avgvalfields[k] if v == 0 and k in avgvalfields
        else v for (k, v) in rdict.items()})

Usage:

>>> r1 = Row(fld1="a", fld2=99, fld3=0, fld4=0)
>>> avgvalfields = {'fld3': 3, 'fld4': 1}
>>> replacezeroforrow(r1, avgvalfields)
Row(fld1='a', fld2=99, fld3=3, fld4=1)
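
A variant of the same idea that also honours the dont_replace list from the question's edit might look like the sketch below. It works on plain dictionaries so that it runs without Spark, and the helper name replace_zeros is hypothetical. It uses dict(...) over a generator rather than a dict comprehension, on the assumption that the syntax error at 'for' mentioned in the question comes from an interpreter that predates dict comprehensions.

```python
def replace_zeros(fields, averages, dont_replace=()):
    """Return a new dict with zero values replaced by the matching average.

    Keys listed in dont_replace, or missing from averages, are left untouched.
    The input dict is not modified.
    """
    return dict(
        (k, averages[k] if v == 0 and k in averages and k not in dont_replace else v)
        for k, v in fields.items()
    )


# Stand-ins for row.asDict() and the averages row converted to a dict.
fields = {"fld1": "a", "fld2": 99, "fld3": 0, "fld4": 0}
averages = {"fld3": 3, "fld4": 1}

new_fields = replace_zeros(fields, averages, dont_replace=["fld4"])
# In Spark the result would then be wrapped as Row(**new_fields).
```

The ** in Row(**new_fields) matters: it unpacks the dictionary into keyword arguments, one per field. Writing Row(new_fields) instead passes the whole dictionary as a single positional argument, which appears to be what leads to the "expected string, dict found" error when the row is printed.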
zero323