I'm using Apache Spark 2.0.2 together with Spark JobServer 0.7.0.
I know this is not a best practice, but this is a first step. My server has 52 GB of RAM and 6 CPU cores, runs CentOS 7 x64 with Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and has the following applications running with the specified memory configuration.
- JBoss AS 7 (6 GB)
- PDI Pentaho 6.0 (12 GB)
- MySQL (20 GB)
- Apache Spark 2.0.2 (8 GB)
I start it and everything works as expected, and it keeps working for several hours. I have a jar with two jobs that extend the following VIQ_SparkJob base class:
import com.typesafe.config.Config;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;
import spark.jobserver.*; // JavaSparkJob, SparkJobValidation, SparkJobValid$

public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Build (or reuse) a SparkSession on top of the context provided by the Job Server
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();

        // Register the job classes with Kryo
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);

        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
This job imports the data from some .csv files and stores it as Parquet files partitioned by tenant. There are two entities: users, which occupies 674 MB on disk as Parquet files, and user_processed, which occupies 323 MB.
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String entity = jobConfig.getString("entity");

    // Read the CSV files with a header row, letting Spark infer the column types
    Dataset<Row> ds = sparkSession.read()
            .option("header", "true")
            .option("inferSchema", true)
            .csv(csvPath); // csvPath is resolved elsewhere in the job

    // Cast tenant_id to int and append the data as Parquet, partitioned by tenant_id
    ds.withColumn("tenant_id", ds.col("tenant_id").cast("int"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy(JavaConversions.asScalaBuffer(asList("tenant_id")))
            .parquet("/value_iq/spark-warehouse/" + entity);

    return null;
}
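For reference, I launch this job through the Job Server's /jobs endpoint with a call along these lines (the appName and the job's package below are illustrative, the config keys in the body are the ones the jobs read, and the context is the one created further down):

curl -k --basic --user 'user:password' -d "tenant_id = 1, entity = users" 'https://localhost:4810/jobs?appName=viq_jobs&classPath=com.mycompany.ImportCSVFiles&context=application&sync=false'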
This one queries the Parquet files:
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String query = jobConfig.getString("query");

    // Lookup table from MySQL, restricted to the customer_type domain
    Dataset<Row> lookup_values = getDataFrameFromMySQL("value_iq", "lookup_values")
            .filter(new Column("lookup_domain").equalTo("customer_type"));

    // Parquet data of the current tenant
    Dataset<Row> users = getDataFrameFromParket(USERS + "/tenant_id=" + TENANT_ID);
    Dataset<Row> user_profiles = getDataFrameFromParket(USER_PROCESSED + "/tenant_id=" + TENANT_ID);

    lookup_values.createOrReplaceTempView("lookup_values");
    users.createOrReplaceTempView("users");
    user_profiles.createOrReplaceTempView("user_processed");

    // Creating the DE and EN views
    sparkSession
            .sql("...") // here I join the 3 datasets
            .coalesce(200)
            .createOrReplaceTempView("cube_users_v_de");

    // Run the user query, keep at most 1000 rows and return them as a JSON array
    List<String> list = sparkSession.sql(query).limit(1000).toJSON().takeAsList(1000);
    String result = "[";
    for (int i = 0; i < list.size(); i++) {
        result += (i == 0 ? "" : ",") + list.get(i);
    }
    result += "]";
    return result;
}
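I haven't pasted the two helpers used above; as a rough sketch (the JDBC URL and the credentials are placeholders, not my real settings), they read the Parquet partitions from the warehouse directory and the MySQL table via the JDBC data source:

private Dataset<Row> getDataFrameFromParket(String relativePath) {
    // Parquet written by the import job under the warehouse directory
    return sparkSession.read().parquet("/value_iq/spark-warehouse/" + relativePath);
}

private Dataset<Row> getDataFrameFromMySQL(String database, String table) {
    // Plain Spark JDBC read of one MySQL table
    return sparkSession.read()
            .format("jdbc")
            .option("url", "jdbc:mysql://localhost:3306/" + database) // placeholder URL
            .option("dbtable", table)
            .option("user", "db_user")         // placeholder
            .option("password", "db_password") // placeholder
            .load();
}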
Every day I run the first job, saving some CSV files as Parquet, and during the day I execute some queries through the second one. But after some hours it crashes with an out-of-memory error. This is the error log:
k.memory.TaskMemoryManager [] [akka://JobServer/user/context-supervisor/application_analytics] - Failed to allocate a page (8388608 bytes), try again.
WARN .netty.NettyRpcEndpointRef [] [] - Error sending message [message = Heartbeat(0,[Lscala.Tuple2;@18c54652,BlockManagerId(0, 157.97.107.42, 55223))] in 1 attempt
I have the master and one worker on this server. This is my spark-defaults.conf:
spark.debug.maxToStringFields 256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
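I don't set the driver or executor memory in this file; if I wanted to pin them here rather than via the context call, it would look roughly like the lines below (illustrative values only, the two fraction settings are simply the Spark 2.0 defaults):

spark.driver.memory            2g
spark.executor.memory          8g
spark.memory.fraction          0.6
spark.memory.storageFraction   0.5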
This is my Spark JobServer settings.sh:
DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create my context with the following curl:
curl -k --basic --user 'user:password' -d "" 'https://localhost:4810/contexts/application?num-cpu-cores=5&memory-per-node=8G'
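When the context becomes unresponsive I delete it and create it again through the same API (assuming the standard DELETE /contexts/<name> endpoint):

curl -k --basic --user 'user:password' -X DELETE 'https://localhost:4810/contexts/application'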
And the Spark driver uses 2 GB.
The created application looks like this:
ExecutorID Worker Cores Memory State Logs
0 worker-20170203084218-157.97.107.42-50199 5 8192 RUNNING stdout stderr
These are my executors:
Executor ID Address ▴ Status RDD Blocks Storage Memory Disk Used Cores
driver 157.97.107.42:55222 Active 0 0.0 B / 1018.9 MB 0.0 B 0
0 157.97.107.42:55223 Active 0 0.0 B / 4.1 GB 0.0 B 5
I have a process that checks the memory used per process, and the peak amount was 8468 MB.
There are 4 processes related to Spark; I add up their configured sizes right after this list.
- The master process. It starts with 1 GB of memory assigned; I don't know where this configuration comes from, but it seems to be enough. It uses only 0.4 GB at peak.
- The worker process. Same memory behavior as the master.
- The driver process, which has 2 GB configured.
- The context, which has 8 GB configured.
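Adding up the configured caps as a sanity check (assuming every process could actually reach its limit):

Spark:  master ~1 GB + worker ~1 GB + driver 2 GB + context 8 GB ≈ 12 GB
Others: JBoss 6 GB + PDI Pentaho 12 GB + MySQL 20 GB             = 38 GB
Total:  roughly 50 GB of the 52 GB installed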
In the following table you can see how the memory used by the driver and the context behaves. After a java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts a new context, so it stays fine.
system_user   | RAM (MB) | entry_date
--------------+----------+---------------------
spark.driver  | 2472.11  | 2017-02-07 10:10:18  // Up to here everything was fine;
spark.context | 5470.19  | 2017-02-07 10:10:18  // it had been running for more than 48 hours.
spark.driver  | 2472.11  | 2017-02-07 10:11:18  // Then I executed three big concurrent queries
spark.context |    0.00  | 2017-02-07 10:11:18  // and got java.lang.OutOfMemoryError: Java heap space
                                                // in $LOG_FOLDER/job-server-master/server_startup.log.
                                                // The context was still listed in the Job Server but unresponsive;
                                                // in Spark the application was killed.
spark.driver  | 2472.11  | 2017-02-07 10:16:18  // Here I deleted the context and created it again.
spark.context |  105.20  | 2017-02-07 10:16:18
spark.driver  | 2577.30  | 2017-02-07 10:19:18  // Here I executed the three big
spark.context | 3734.46  | 2017-02-07 10:19:18  // concurrent queries again.
spark.driver  | 2577.30  | 2017-02-07 10:20:18  // Here, after the queries were
spark.context | 5154.60  | 2017-02-07 10:20:18  // executed: no memory issue.
I have 2 questions:
1- When I check the Spark GUI, why does my driver, which has 2 GB configured, only show 1 GB, and likewise executor 0, which only shows 4.4 GB? Where does the rest of the configured memory go? Yet when I look at the system processes, the driver does use 2 GB.
2- If I have enough memory on the server, why am I running out of memory?