
I am not able to compute the sum of an RDD. I am a newbie in this field, please help.

Using Python 2.7 with Spark 2.1, I am using a SQL query to fetch a DataFrame and then converting it to an RDD with `.rdd`. I get the same error even if I use `df.select().rdd`.

This is the code:

def meanTemperature(df,spark):
    tempDF = spark.sql("SELECT TEMPERATURE FROM washing").rdd
    return tempDF.sum()

The error I am getting:

     Py4JJavaErrorTraceback (most recent call last)
<ipython-input-40-3c99bf995d59> in <module>()
----> 1 meanTemperature(cloudantdata,spark)

<ipython-input-39-cb1480c78493> in meanTemperature(df, spark)
      4 
      5 
----> 6     return tempDF.sum()

/usr/local/src/spark21master/spark/python/pyspark/rdd.py in sum(self)
   1029         6.0
   1030         """
-> 1031         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
   1032 
   1033     def count(self):

/usr/local/src/spark21master/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
    903         # zeroValue provided to each partition is unique from the one provided
    904         # to the final reduce call
--> 905         vals = self.mapPartitions(func).collect()
    906         return reduce(op, vals, zeroValue)
    907 

/usr/local/src/spark21master/spark/python/pyspark/rdd.py in collect(self)
    806         """
    807         with SCCallSiteSync(self.context) as css:
--> 808             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    809         return list(_load_from_socket(port, self._jrdd_deserializer))
    810 

/usr/local/src/spark21master/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/src/spark21master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/local/src/spark21master/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 18.0 failed 10 times, most recent failure: Lost task 2.9 in stage 18.0 (TID 848, yp-spark-lon02-env5-0105.bluemix.net, executor cd1ff543-2b85-4961-8632-26de9890cbca): com.cloudant.client.org.lightcouch.TooManyRequestsException: 429 Too Many Requests at https://49b92f6e-fb6d-4003-aa11-80280f96591d-bluemix.cloudant.com/washing/_all_docs?include_docs=true&limit=19&skip=38. Error: too_many_requests. Reason: You`ve exceeded your rate limit allowance. Please try again later..
    at com.cloudant.client.org.lightcouch.CouchDbClient.execute(CouchDbClient.java:575)
    at com.cloudant.client.api.CloudantClient.executeRequest(CloudantClient.java:388)
    at org.apache.bahir.cloudant.CloudantConfig.executeRequest(CloudantConfig.scala:73)
    at org.apache.bahir.cloudant.common.JsonStoreDataAccess.getQueryResult(JsonStoreDataAccess.scala:114)
    at org.apache.bahir.cloudant.common.JsonStoreDataAccess.getIterator(JsonStoreDataAccess.scala:62)
    at org.apache.bahir.cloudant.common.JsonStoreRDD.compute(JsonStoreRDD.scala:223)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:326)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:290)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:326)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:290)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:326)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:290)
    at 

Please help me with this error.

    You want to sum all values in the temperature column? You can try using `"SELECT SUM(TEMPERATURE) FROM washing"`. – Shaido Aug 30 '18 at 07:33
  • Yes, I want to sum all the values in the temperature column. – Aryan Soni Aug 30 '18 at 07:35
    Alternatively, you could use `tempDF.select(sum("TEMPERATURE"))` to get the sum afterwards (see the sketch after this thread). – Shaido Aug 30 '18 at 07:37
  • @Shaido These are good suggestions (the current code wouldn't work), but take a look at the exception - "Reason: You've exceeded your rate limit allowance. Please try again later". Personally I cannot say whether it is caused by Spark retries or fails right away. – Alper t. Turker Aug 30 '18 at 10:16
  • If I had exceeded the rate limit, I would not be able to access the data and perform operations on it, but I am able to perform operations like count and run SQL queries on it. – Aryan Soni Aug 30 '18 at 10:33
  • This is not a Spark exception; you are doing something you are not supposed to do with your Cloudant instance. – eliasah Aug 30 '18 at 13:35
  • Can you please elaborate? – Aryan Soni Aug 30 '18 at 13:40
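
For reference, here is a minimal sketch of the two suggestions from the comments above. It assumes a `SparkSession` named `spark` and a registered `washing` table with a `TEMPERATURE` column, as in the question:

from pyspark.sql.functions import sum as sql_sum

# Option 1: let the SQL engine do the aggregation, no RDD conversion needed.
total = spark.sql("SELECT SUM(TEMPERATURE) AS total FROM washing").first()["total"]

# Option 2: aggregate on the DataFrame directly with the column function.
df = spark.sql("SELECT TEMPERATURE FROM washing")
total = df.select(sql_sum("TEMPERATURE").alias("total")).first()["total"]

Both variants keep the aggregation inside Spark SQL, so only a single value is collected to the driver instead of every row.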

1 Answer


It was not related to exceeding the rate limit at all, even though the error suggested that. The problem was that I had not omitted null values, which I had to do in a lambda function. Also, I had to access the column of the RDD using x.temperature.
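
A minimal sketch of that fix, assuming the column in the `washing` table is named `temperature` (adjust the name to match your actual schema):

tempRDD = spark.sql("SELECT temperature FROM washing").rdd

total = (tempRDD
         .map(lambda x: x.temperature)     # pull the value out of each Row
         .filter(lambda v: v is not None)  # omit null values before summing
         .sum())

Without the filter step, `sum()` fails as soon as a partition contains a null, because `None` cannot be added to a number.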
