
We are getting the following error after running "pio train". It runs for about 20 minutes and then fails on Stage 26.

[ERROR] [Executor] Exception in task 0.0 in stage 1.0 (TID 3)
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
[WARN] [TaskSetManager] Lost task 2.0 in stage 1.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
  at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
  at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:80)
  at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:289)
  at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:289)
  at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:293)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

Our server has about 30 GB of memory, but about 10 GB of that is taken by HBase and Elasticsearch.

We are trying to process about 20 million records created by the Universal Recommender.

I've tried the following command to increase executor/driver memory, but it didn't help:

pio train -- --driver-memory 6g --executor-memory 8g

What options could we try to fix the issue? Is it possible to process that many events on a server with this amount of memory?

gvalmon

1 Answer


Vertical scaling can only take you so far, but if you are on AWS you could try increasing the available memory by stopping the instance and restarting it as a larger instance type.

CF looks at a lot of data. Since Spark gets its speed from in-memory calculations (by default), you will need enough memory to hold all of your data spread across all Spark workers, and in your case you have only one.
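
With roughly 20 GB free after HBase and Elasticsearch, you could try handing most of it to that single worker. A sketch of what that might look like on the command line, with illustrative numbers rather than tuned values:

pio train -- --driver-memory 4g --executor-memory 16g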

Another thing that comes to mind is that this is a Kryo error, so you might try increasing the Kryo buffer size a little; for the Universal Recommender that is configured in engine.json.
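
A minimal sketch of the relevant part of engine.json, assuming the usual sparkConf section; the property names are standard Spark settings, but the buffer values below are illustrative guesses to tune against your data, not recommendations:

  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer": "300m",
    "spark.kryoserializer.buffer.max": "512m"
  }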

There is also a Google Group for community support here: https://groups.google.com/forum/#!forum/actionml-user

pferrel