Filtering a DataFrame with a lambda function in Python

Question

I am trying to calculate the standard deviation over a Cloudant DataFrame. I can use either the RDD API or spark.sql. Below is my code snippet, which gives me an error.

cloudantdata.createOrReplaceTempView("washing")
from math import sqrt
n= spark.sql("SELECT Count(temperature) as tempCount from washing").first().tempCount
meanX = meanTemperature(cloudantdata,spark)
#= spark.sql("SELECT temperature as temp from washing").first().temp
tempx = cloudantdata.filter(lambda x: x[["temperature"]])
ret= tempx.rdd.map(lambda x : pow(x-meanX,2)).sum()
print(ret)

Error:

TypeError                                 Traceback (most recent call last)
<ipython-input-61-a97f833d6cc6> in <module>()
      4 meanX = meanTemperature(cloudantdata,spark)
      5 #= spark.sql("SELECT temperature as temp from washing").first().temp
 ----> 6 tempx = cloudantdata.filter(lambda x: x[["temperature"]])
      7 ret= tempx.rdd.map(lambda x : pow(x-meanX,2)).sum()
      8 print(ret)

/usr/local/src/spark21master/spark/python/pyspark/sql/dataframe.py in filter(self, condition)
   1033             jdf = self._jdf.filter(condition._jc)
   1034         else:
 -> 1035             raise TypeError("condition should be string or Column")
   1036         return DataFrame(jdf, self.sql_ctx)
   1037 

TypeError: condition should be string or Column

I don't understand. Is `temperature` the name of a column in your DataFrame? — Edgar Ramírez Mondragón, Oct 06 '18 at 04:19
I'm expecting a DataFrame (`tempx`) with only the `temperature` column, so that I can run a lambda over all the temperature values to calculate `x - meanX`, which is used to produce the standard deviation. — Mansi Gupta, Oct 06 '18 at 05:09
Then you could instead try doing `tempX = cloudantdata.select("temperature")` (see [this question](https://stackoverflow.com/questions/35495197/how-do-i-collect-a-single-column-in-spark)) — Edgar Ramírez Mondragón, Oct 06 '18 at 05:28
@EdgarR.Mondragón I tried that, but with the following code (I removed `map` and changed it to `apply`): `tempX = cloudantdata.select("temperature")` `ret= tempx.apply(lambda x : pow(x-meanX,2)).sum()` I get the error below: `----> 7 ret= tempx.apply(lambda x : pow(x-meanX,2)).sum()` `8 print(ret)` `AttributeError: 'int' object has no attribute 'apply'` — Mansi Gupta, Oct 06 '18 at 07:17
It seems that `tempX = cloudantdata.select("temperature")` is for some reason returning an `int` and not a dataframe. Try printing out `tempX`. — Edgar Ramírez Mondragón, Oct 07 '18 at 00:21
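
For reference, here is a minimal sketch of the arithmetic the snippet seems to be building toward: sum the squared deviations from the mean, divide by the count, and take the square root (the population standard deviation). It is plain Python, not Spark, and makes several assumptions: the `Row` namedtuple is only a stand-in for the rows of a single-column DataFrame, the sample values are made up, and `population_std` is a hypothetical helper name. It pulls the number out of each row (`row.temperature`) before subtracting the mean, since subtracting from the row object itself would likely fail, and it skips null values in the same way `Count(temperature)` does in SQL.

```python
from collections import namedtuple
from math import sqrt

# Stand-in for the rows of a single-column DataFrame (assumption): a Spark Row
# supports the same attribute-style access. The sample values are made up.
Row = namedtuple("Row", ["temperature"])
rows = [Row(20.0), Row(22.0), Row(24.0), Row(None), Row(26.0)]


def population_std(rows):
    """Population standard deviation of the non-null temperature values."""
    # Extract the number from each row and drop nulls before doing arithmetic.
    values = [row.temperature for row in rows if row.temperature is not None]
    n = len(values)  # raises ZeroDivisionError below if there are no values
    mean_x = sum(values) / n
    # Same idea as map(lambda x: pow(x - meanX, 2)).sum() in the question.
    squared_deviations = sum((x - mean_x) ** 2 for x in values)
    return sqrt(squared_deviations / n)


std = population_std(rows)
print(std)
```

In Spark the same extraction step would presumably look something like `cloudantdata.select("temperature").rdd.map(lambda row: row.temperature)`, though that has not been tried against this data. Separately, one possible explanation for the `AttributeError` (not confirmed in this thread) is that `tempX` and `tempx` are two different names, because Python is case-sensitive, so `tempx` may still hold an `int` from an earlier cell.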
