I'm using Apache Spark 2.0.2 together with Spark JobServer 0.7.0.
I know this is not a best practice, but this is a first step. My server has 52 GB of RAM and 6 CPU cores, runs CentOS 7 x64 with Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and has the following applications running with the specified memory configuration.
- JBoss AS 7 (6 GB)
- PDI Pentaho 6.0 (12 GB)
- MySQL (20 GB)
- Apache Spark 2.0.2 (8 GB)
I start it and everything works as expected, and it keeps working for several hours. I have a jar with two jobs that extend the following VIQ_SparkJob base class:
import com.typesafe.config.Config;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;
import spark.jobserver.*; // JavaSparkJob, SparkJobValidation, SparkJobValid$

public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Build (or reuse) a SparkSession on top of the context provided by the Job Server
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();

        // Register the job classes with Kryo
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);

        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
This job imports the data from some .csv files and stores it as Parquet files partitioned by tenant. There are two entities: users, which occupies 674 MB on disk as Parquet files, and user_processed, which occupies 323 MB.
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String entity = jobConfig.getString("entity");

    // Read the CSV files with a header row, letting Spark infer the column types
    Dataset<Row> ds = sparkSession.read()
            .option("header", "true")
            .option("inferSchema", true)
            .csv(csvPath); // csvPath is resolved elsewhere in the job

    // Cast tenant_id to int and append the data as Parquet, partitioned by tenant_id
    ds.withColumn("tenant_id", ds.col("tenant_id").cast("int"))
            .write()
            .mode(SaveMode.Append)
            .partitionBy(JavaConversions.asScalaBuffer(asList("tenant_id")))
            .parquet("/value_iq/spark-warehouse/" + entity);

    return null;
}
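For reference, I launch this job through the Job Server's /jobs endpoint with a call along these lines (the appName and the job's package below are illustrative, the config keys in the body are the ones the jobs read, and the context is the one created further down):

curl -k --basic --user 'user:password' -d "tenant_id = 1, entity = users" 'https://localhost:4810/jobs?appName=viq_jobs&classPath=com.mycompany.ImportCSVFiles&context=application&sync=false'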
This one queries the Parquet files:
@Override
public Object runJob(SparkContext jsc, Config jobConfig) {
    super.runJob(jsc, jobConfig);
    String query = jobConfig.getString("query");

    // Lookup table from MySQL, restricted to the customer_type domain
    Dataset<Row> lookup_values = getDataFrameFromMySQL("value_iq", "lookup_values")
            .filter(new Column("lookup_domain").equalTo("customer_type"));

    // Parquet data of the current tenant
    Dataset<Row> users = getDataFrameFromParket(USERS + "/tenant_id=" + TENANT_ID);
    Dataset<Row> user_profiles = getDataFrameFromParket(USER_PROCESSED + "/tenant_id=" + TENANT_ID);

    lookup_values.createOrReplaceTempView("lookup_values");
    users.createOrReplaceTempView("users");
    user_profiles.createOrReplaceTempView("user_processed");

    // Creating the DE and EN views
    sparkSession
            .sql("...") // here I join the 3 datasets
            .coalesce(200)
            .createOrReplaceTempView("cube_users_v_de");

    // Run the user query, keep at most 1000 rows and return them as a JSON array
    List<String> list = sparkSession.sql(query).limit(1000).toJSON().takeAsList(1000);
    String result = "[";
    for (int i = 0; i < list.size(); i++) {
        result += (i == 0 ? "" : ",") + list.get(i);
    }
    result += "]";
    return result;
}
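I haven't pasted the two helpers used above; as a rough sketch (the JDBC URL and the credentials are placeholders, not my real settings), they read the Parquet partitions from the warehouse directory and the MySQL table via the JDBC data source:

private Dataset<Row> getDataFrameFromParket(String relativePath) {
    // Parquet written by the import job under the warehouse directory
    return sparkSession.read().parquet("/value_iq/spark-warehouse/" + relativePath);
}

private Dataset<Row> getDataFrameFromMySQL(String database, String table) {
    // Plain Spark JDBC read of one MySQL table
    return sparkSession.read()
            .format("jdbc")
            .option("url", "jdbc:mysql://localhost:3306/" + database) // placeholder URL
            .option("dbtable", table)
            .option("user", "db_user")         // placeholder
            .option("password", "db_password") // placeholder
            .load();
}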
Every day I run the first job, saving some CSV files as Parquet, and during the day I execute some queries through the second one. But after some hours it crashes with an out-of-memory error. This is the error log:
k.memory.TaskMemoryManager [] [akka://JobServer/user/context-supervisor/application_analytics] - Failed to allocate a page (8388608 bytes), try again.
WARN .netty.NettyRpcEndpointRef [] [] - Error sending message [message = Heartbeat(0,[Lscala.Tuple2;@18c54652,BlockManagerId(0, 157.97.107.42, 55223))] in 1 attempt
I have the master and one worker on this server. This is my spark-defaults.conf:
spark.debug.maxToStringFields 256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
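I don't set the driver or executor memory in this file; if I wanted to pin them here rather than via the context call, it would look roughly like the lines below (illustrative values only, the two fraction settings are simply the Spark 2.0 defaults):

spark.driver.memory            2g
spark.executor.memory          8g
spark.memory.fraction          0.6
spark.memory.storageFraction   0.5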
This is my Spark JobServer settings.sh:
DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create my context with the following curl:
curl -k --basic --user 'user:password' -d "" 'https://localhost:4810/contexts/application?num-cpu-cores=5&memory-per-node=8G'
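When the context becomes unresponsive I delete it and create it again through the same API (assuming the standard DELETE /contexts/<name> endpoint):

curl -k --basic --user 'user:password' -X DELETE 'https://localhost:4810/contexts/application'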
And the Spark driver uses 2 GB.
The created application looks like this:
ExecutorID Worker Cores Memory State Logs
0 worker-20170203084218-157.97.107.42-50199 5 8192 RUNNING stdout stderr
These are my executors:
Executor ID Address ▴ Status RDD Blocks Storage Memory Disk Used Cores
driver 157.97.107.42:55222 Active 0 0.0 B / 1018.9 MB 0.0 B 0
0 157.97.107.42:55223 Active 0 0.0 B / 4.1 GB 0.0 B 5
I have a process that checks the memory used per process, and the peak amount was 8468 MB.
There are 4 processes related to Spark; I add up their configured sizes right after this list.
- The master process. It starts with 1 GB of memory assigned; I don't know where this configuration comes from, but it seems to be enough. It uses only 0.4 GB at peak.
- The worker process. Same memory behavior as the master.
- The driver process, which has 2 GB configured.
- The context, which has 8 GB configured.
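Adding up the configured caps as a sanity check (assuming every process could actually reach its limit):

Spark:  master ~1 GB + worker ~1 GB + driver 2 GB + context 8 GB ≈ 12 GB
Others: JBoss 6 GB + PDI Pentaho 12 GB + MySQL 20 GB             = 38 GB
Total:  roughly 50 GB of the 52 GB installed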
In the following table you can see how the memory used by the driver and the context behaves. After a java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts a new context, so it stays fine.
system_user   | RAM (MB) | entry_date
--------------+----------+---------------------
spark.driver  | 2472.11  | 2017-02-07 10:10:18  // Up to here everything was fine;
spark.context | 5470.19  | 2017-02-07 10:10:18  // it had been running for more than 48 hours.
spark.driver  | 2472.11  | 2017-02-07 10:11:18  // Then I executed three big concurrent queries
spark.context |    0.00  | 2017-02-07 10:11:18  // and got java.lang.OutOfMemoryError: Java heap space
                                                // in $LOG_FOLDER/job-server-master/server_startup.log.
                                                // The context was still listed in the Job Server but unresponsive;
                                                // in Spark the application was killed.
spark.driver  | 2472.11  | 2017-02-07 10:16:18  // Here I deleted the context and created it again.
spark.context |  105.20  | 2017-02-07 10:16:18
spark.driver  | 2577.30  | 2017-02-07 10:19:18  // Here I executed the three big
spark.context | 3734.46  | 2017-02-07 10:19:18  // concurrent queries again.
spark.driver  | 2577.30  | 2017-02-07 10:20:18  // Here, after the queries were
spark.context | 5154.60  | 2017-02-07 10:20:18  // executed: no memory issue.
I have 2 questions:
1- When I check the Spark GUI, why does my driver, which has 2 GB configured, only show 1 GB, and likewise executor 0, which only shows 4.4 GB? Where does the rest of the configured memory go? Yet when I look at the system processes, the driver does use 2 GB.
2- If I have enough memory on the server, why am I running out of memory?