
I am confused by my situation. I am trying to find sequence patterns in PySpark. To start with, I have a key-value RDD like this:

p_split.take(2)

[(['A', 'B', 'C', 'D'], u'749'),
 (['O', 'K', 'A'], u'162')]
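
(For reference, an RDD with this shape can be reproduced locally like so, assuming an existing SparkContext named sc:)

p_split = sc.parallelize([(['A', 'B', 'C', 'D'], u'749'),
                          (['O', 'K', 'A'], u'162')])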

Then I generate combinations of the strings and join them:

def patterns1(text):
    output = [list(combinations(text, i)) for i in range(len(text) + 1)]
    output = output[2:-1]
    paths = []
    for item in output:
        for i in range(len(item)):
            paths.append('->'.join(item[i]))
    return paths


p_patterns = p_split.map(lambda (x,y): (patterns1(x), y))

p_patterns.take(2)

 [(['A->B',
   'A->C',
   'A->D',
   'B->C',
   'B->D',
   ...
  u'749'), .....
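
(For a shorter list the complete output is easy to check locally; with three elements only the pairs survive the output[2:-1] slice:)

patterns1(['A', 'B', 'C'])

['A->B', 'A->C', 'B->C']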

With this RDD p_patterns I cannot perform operations like count() and collect(). With p_split these operations worked successfully.

p_patterns.count()

    ---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-14-75eb19776fa7> in <module>()
----> 1 p_patterns.count()

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in count(self)
    930         3
    931         """
--> 932         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    933 
    934     def stats(self):

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in sum(self)
    921         6.0
    922         """
--> 923         return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    924 
    925     def count(self):

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in reduce(self, f)
    737             yield reduce(f, iterator, initial)
    738 
--> 739         vals = self.mapPartitions(func).collect()
    740         if vals:
    741             return reduce(f, vals)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in collect(self)
    711         """
    712         with SCCallSiteSync(self.context) as css:
--> 713             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    714         return list(_load_from_socket(port, self._jrdd_deserializer))
    715 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2252, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 282, in func
    return f(iterator)
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 932, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 932, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "<ipython-input-12-0e1339e78f5c>", line 1, in <lambda>
  File "<ipython-input-11-b71a29b24fa7>", line 7, in patterns1
MemoryError

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

What is my mistake?

2 Answers


As noted by @lanenok, it is a memory error, and given what is going on inside the patterns1 function it is not that surprising. The memory complexity of the following statement:

output = [list(combinations(text, i)) for i in range(len(text) + 1)]

is roughly O(2^N), where N is the length of the input text.
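
A quick way to see this growth (a sketch that only counts the tuples, it does not keep them):

from itertools import combinations

# Number of tuples produced for an input of length n, over all lengths 0..n.
# The total equals sum(C(n, i) for i in 0..n) == 2**n, so it doubles with every element.
for n in (5, 10, 15, 20):
    total = sum(1 for i in range(n + 1) for _ in combinations(range(n), i))
    print(n, total, 2 ** n)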

There is a second problem hidden behind this one. It doesn't make things worse than exponential complexity, but it is rather bad by itself. When you convert combinations to a list you lose all the benefits of having a lazy sequence, which could be leveraged to push the limits set by memory complexity a little bit further.
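
To illustrate the difference (a minimal sketch outside Spark):

from itertools import combinations

text = ['A', 'B', 'C', 'D']

lazy_pairs = combinations(text, 2)        # an iterator; nothing is generated yet
all_pairs = list(combinations(text, 2))   # eager; every tuple is held in memory at once

next(lazy_pairs)  # ('A', 'B') -- produced on demand, one tuple at a time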

I would recommend using generators and lazy functions (toolz rocks here) whenever you can. I've already mentioned this approach here, so please take a look. For example, patterns1 could be rewritten as follows:

from itertools import combinations
from toolz.itertoolz import concat, map

def patterns1(text):
    # Lazily chain combinations of every length from 2 to len(text)
    # and join each one, without materializing the whole list in memory.
    return map(
        lambda x: '->'.join(x),
        concat(combinations(text, i) for i in range(2, len(text) + 1)))

Obviously it won't solve the memory complexity issue, but it is a place to start when optimizing your algorithm.
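
For example, using the patterns1 defined above, the result can be consumed lazily, taking only as much as you need:

from itertools import islice

# Peek at the first few patterns without materializing the whole sequence.
list(islice(patterns1(['A', 'B', 'C', 'D']), 5))
# ['A->B', 'A->C', 'A->D', 'B->C', 'B->D']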

zero323
  • Thanks, but it really did not solve the memory problem. Does it depend on the parameters of my cluster? What should I do to solve this problem? – Татьяна Паскевич Jul 04 '15 at 19:49
  • No, in the long term it doesn't depend on any parameter of your cluster. To give you some perspective: for a text of length 59 it would be necessary to generate a list longer than the number of seconds since the beginning of the universe (a quick numeric check follows below). It is simply not feasible. Depending on your goal you should be able to find some approximate solution. I guess your goal is to perform some kind of pattern mining. If so, search for terms like closed patterns and max-patterns. That should give you some idea of where to go. – zero323 Jul 04 '15 at 20:20
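
(A quick numeric check of that comparison, assuming the universe is roughly 13.8 billion years old:)

print(2 ** 59)                      # 576460752303423488  (~5.8e17 combinations)
print(13.8e9 * 365.25 * 24 * 3600)  # ~4.35e17 seconds since the Big Bang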

As far as I can see, you have a MemoryError in IPython. At the same time your p_patterns.take(2) works, which means that your RDD is fine.

So, could it be as simple as needing to cache your RDD before using it? Like:

p_patterns = p_split.map(lambda (x,y): (patterns1(x), y)).cache()
lanenok
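
(Not part of either answer, but if a plain cache() still runs out of memory, a storage level that can spill to disk is one variant worth trying; a sketch, reusing the question's p_split and patterns1:)

from pyspark import StorageLevel

# MEMORY_AND_DISK keeps partitions in memory when they fit and spills the rest
# to disk, instead of the memory-only behaviour of cache().
p_patterns = p_split.map(lambda (x, y): (patterns1(x), y)) \
                    .persist(StorageLevel.MEMORY_AND_DISK)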