
I'm using Apache Spark 2.0.2 together with Spark JobServer 0.7.0.

I know this is not a best practice, but this is a first step. My server has 52 GB of RAM and 6 CPU cores, running CentOS 7 x64 with Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and it hosts the following applications with the specified memory configuration:

  • JBoss AS 7 (6 Gb)
  • PDI Pentaho 6.0 (12 Gb)
  • MySQL (20 Gb)
  • Apache Spark 2.0.2 (8 Gb)
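
Together that is 6 + 12 + 20 + 8 = 46 GB nominally assigned to those four applications, leaving roughly 6 GB of the 52 GB for the OS and everything else.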

I start it and everything works as expected, and keeps working for several hours. I have a jar with two jobs that extend the following base class:

public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Build (or reuse) the SparkSession on top of the context provided by JobServer
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();
        // Register the classes that Kryo should serialize
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);
        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
  1. This job imports the data from some .csv files and stores it as Parquet files partitioned by tenant. There are two entities: users, which occupies 674 MB on disk as Parquet files, and user_processed, which occupies 323 MB.

     @Override
     public Object runJob(SparkContext jsc, Config jobConfig) {
         super.runJob(jsc, jobConfig); // builds the SparkSession and reads tenant_id
         String entity = jobConfig.getString("entity");
         Dataset<Row> ds = sparkSession.read()
                 .option("header", "true")
                 .option("inferSchema", true)
                 .csv(csvPath); // csvPath points to the input .csv files (set elsewhere in the job)

         // Cast the partition column and append to the partitioned Parquet layout
         ds.withColumn("tenant_id", ds.col("tenant_id").cast("int"))
                 .write()
                 .mode(SaveMode.Append)
                 .partitionBy(JavaConversions.asScalaBuffer(asList("tenant_id")))
                 .parquet("/value_iq/spark-warehouse/" + entity);
         return null;
     }
    
  2. This one queries the Parquet files:

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        super.runJob(jsc, jobConfig); // builds the SparkSession and reads tenant_id
        String query = jobConfig.getString("query");
        Dataset<Row> lookup_values = getDataFrameFromMySQL("value_iq", "lookup_values")
                .filter(new Column("lookup_domain").equalTo("customer_type"));
        Dataset<Row> users = getDataFrameFromParket(USERS + "/tenant_id=" + TENANT_ID);
        Dataset<Row> user_profiles = getDataFrameFromParket(USER_PROCESSED + "/tenant_id=" + TENANT_ID);
        lookup_values.createOrReplaceTempView("lookup_values");
        users.createOrReplaceTempView("users");
        user_profiles.createOrReplaceTempView("user_processed");
        // Creating the DE and EN views
        sparkSession
                .sql(/* here I join the 3 datasets */)
                .coalesce(200)
                .createOrReplaceTempView("cube_users_v_de");
        // Run the incoming query, keep at most 1000 rows and return them as a JSON array
        List<String> list = sparkSession.sql(query).limit(1000).toJSON().takeAsList(1000);
        String result = "[";
        for (int i = 0; i < list.size(); i++) {
            result += (i == 0 ? "" : ",") + list.get(i);
        }
        result += "]";
        return result;
    }
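
For what it's worth, the JSON assembly at the end of the second job could also be factored into a small helper that uses a StringBuilder instead of concatenating strings in a loop. This is only a sketch of the same logic (the helper name is made up), not something I have changed yet:

    // Hypothetical helper: join the JSON rows returned by takeAsList into one JSON array string
    private static String toJsonArray(List<String> rows) {
        StringBuilder result = new StringBuilder("[");
        for (int i = 0; i < rows.size(); i++) {
            if (i > 0) {
                result.append(",");
            }
            result.append(rows.get(i));
        }
        return result.append("]").toString();
    }

The job would then simply return toJsonArray(list).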
    

Every day I run the first job, saving some CSV files to Parquet, and during the day I execute some queries through the second one. But after some hours it crashes with an out-of-memory error. This is the error log:

k.memory.TaskMemoryManager [] [akka://JobServer/user/context-supervisor/application_analytics] - Failed to allocate a page (8388608 bytes), try again.
WARN  .netty.NettyRpcEndpointRef [] [] - Error sending message [message = Heartbeat(0,[Lscala.Tuple2;@18c54652,BlockManagerId(0, 157.97.107.42, 55223))] in 1 attempt

I have the master and one worker on this server. This is my spark-defaults.conf:

spark.debug.maxToStringFields  256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
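
Note that nothing in this file sets Spark memory explicitly. If I wanted to pin the memory here instead of passing it through JobServer, I understand it would look roughly like this (the values just mirror what I already pass elsewhere, so treat this as an assumption rather than my actual configuration):

spark.executor.memory 8g
spark.driver.memory 2g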

This is my Spark JobServer settings.sh:

DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8

I create my context with the following curl:

curl -k --basic --user 'user:password' -d "" 'https://localhost:4810/contexts/application?num-cpu-cores=5&memory-per-node=8G'
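
The jobs are then submitted against that context through the JobServer REST API, roughly like this (the appName, classPath, tenant_id and query values here are placeholders standing in for my uploaded jar and my real queries):

curl -k --basic --user 'user:password' -d 'tenant_id = 1, query = "SELECT * FROM cube_users_v_de"' 'https://localhost:4810/jobs?appName=my_app&classPath=my.package.UsersCube&context=application&sync=true'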

The Spark driver uses 2 GB (I assume this comes from JOBSERVER_MEMORY=2G in settings.sh).

The created application looks like this:

ExecutorID  Worker                                      Cores   Memory  State       Logs
0           worker-20170203084218-157.97.107.42-50199   5       8192    RUNNING     stdout stderr

These are my executors:

Executor ID     Address                 Status  RDD Blocks  Storage Memory      Disk Used   Cores   
driver          157.97.107.42:55222     Active  0           0.0 B / 1018.9 MB   0.0 B       0 
0               157.97.107.42:55223     Active  0           0.0 B / 4.1 GB      0.0 B       5 

I have a process that checks the memory used by each process, and the top amount was 8468 MB.

There are four processes related to Spark:

  • The master process. It starts with 1 GB of memory assigned; I don't know where this configuration comes from, but it seems to be enough, since it uses only 0.4 GB at peak.
  • The worker process. The same as the master regarding memory use.
  • The driver process, which has 2 GB configured.
  • The context, which has 8 GB configured.
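
Nominally, then, the Spark-related processes together could claim up to 1 + 1 + 2 + 8 = 12 GB, although the master and worker stay well below their 1 GB.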

In the following table you can see how the memory used by the driver and the context behaves. After a java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts another context, so it remains fine.

system_user   | RAM(Mb)  |  entry_date
--------------+----------+---------------------
spark.driver    2472.11     2017-02-07 10:10:18 //Till here everything was fine
spark.context   5470.19     2017-02-07 10:10:18 //It had been running for more than 48 hours

spark.driver    2472.11     2017-02-07 10:11:18 //Then I executed three big concurrent queries
spark.context   0.00        2017-02-07 10:11:18 //and I got java.lang.OutOfMemoryError: Java heap space
                                                //in $LOG_FOLDER/job-server-master/server_startup.log

# I checked and the context was still present in the jobserver but unresponsive.
# In Spark the application was killed.


spark.driver    2472.11     2017-02-07 10:16:18 //Here I deleted and recreated the context
spark.context   105.20      2017-02-07 10:16:18

spark.driver    2577.30     2017-02-07 10:19:18 //Here I executed the three big
spark.context   3734.46     2017-02-07 10:19:18 //concurrent queries again.

spark.driver    2577.30     2017-02-07 10:20:18 //Here, after the queries were
spark.context   5154.60     2017-02-07 10:20:18 //executed. No memory issue.

I have 2 questions:

1- Why, when I check the Spark GUI, does my driver, which has 2 GB configured, only use 1 GB, and the same with executor 0, which only uses 4.4 GB? Where does the rest of the configured memory go? Yet when I look at the processes on the system, the driver does use 2 GB.

2- If I have enough memory on the server, then why am I running out of memory?

  • When the job is not running, can you check the Spark UI and figure out what the actual amount of available memory is? You said you allocated 8 GB for Spark, but you should reserve some memory for Spark itself. Also, you have so many other processes, yet you are allocating 5 cores to the executor. – noorul Feb 05 '17 at 02:37
  • Well, I have monitored the system and the top memory used by all apps together was about 42 GB, so there were about 10 GB of free memory. I have checked the Spark UI; the worker was there but the application had status killed, so I wasn't able to see the memory status. Did you see any missing configuration? Or a better way to allocate the current memory? – José Carlos Guevara Turruelles Feb 05 '17 at 10:00
  • One doubt regarding memory: if I give the jobserver 2 GB of memory, that will be the driver memory, right? And when I start a context with, let's say, 8 GB, are those 8 GB somehow split into 6 GB for the executor and 2 GB for the driver? Because I'm monitoring each Spark process and I still don't know how the memory allocation works. – José Carlos Guevara Turruelles Feb 07 '17 at 09:30
  • At the end of the question I have added how the driver and context memory behaves before and after the crash. – José Carlos Guevara Turruelles Feb 07 '17 at 10:30

0 Answers