
I am performing a grid search with scikit-learn's GridSearchCV on Spark and Linux. For this reason, I run nohup ./spark_python_shell.sh > output.log & at my bash shell to spin up the Spark cluster, which also runs my Python script grid_search.py via spark-submit (see the script below):

    SPARK_HOME=/u/users/******/spark-2.3.0 \
    Q_CORE_LOC=/u/users/******/q-core \
    ENV=local \
    HIVE_HOME=/usr/hdp/current/hive-client \
    SPARK2_HOME=/u/users/******/spark-2.3.0 \
    HADOOP_CONF_DIR=/etc/hadoop/conf \
    HIVE_CONF_DIR=/etc/hive/conf \
    HDFS_PREFIX=hdfs:// \
    PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHONPATH \
    YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
    SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
    PYSPARK_PYTHON=/usr/bin/python2.7 \
    QQQ_LOC=/u/users/******/three-queues \
    spark-submit \
    --master yarn \
    --executor-memory 10g \
    --num-executors 8 \
    --executor-cores 10 \
    --conf spark.port.maxRetries=80 \
    --conf spark.dynamicAllocation.enabled=false \
    --conf spark.default.parallelism=6000 \
    --conf spark.sql.shuffle.partitions=6000 \
    --principal ************************ \
    --queue default \
    --name lets_get_starting \
    --keytab /u/users/******/.******.keytab \
    --driver-memory 10g \
    'grid_search.py'

This is the part of the grid_search.py Python script that connects the grid search to the Spark cluster and executes it:

    # Spark configuration
    from pyspark import SparkContext, SparkConf
    conf = SparkConf()
    sc = SparkContext(conf=conf)

    # Execute the grid search using the spark_sklearn library
    from spark_sklearn import GridSearchCV
    classifiers_grid = GridSearchCV(sc, estimator=classifier, param_grid=parameters,
                                    scoring='precision', cv=3, n_jobs=5, pre_dispatch=10)
    classifiers_grid.fit(X, y)
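
Here, classifier, parameters, X and y are defined earlier in the script; they are just an ordinary scikit-learn estimator, a parameter grid and the training data, e.g. something along these lines (illustrative placeholders, not my actual model):

    # Illustrative placeholders only; the real script defines these earlier
    from sklearn.ensemble import RandomForestClassifier

    classifier = RandomForestClassifier()
    parameters = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
    # X, y: the training features and labels, also loaded earlier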

This grid search apparently creates multiple processes on Linux, and these processes have different PIDs.

My question is the following:

How can I limit the memory usage of this grid search?

For example, how can I set its maximum memory usage to 10 GB?

Theoretically speaking, there are three different routes to follow:

  1. Limit memory usage in scikit-learn
  2. Limit memory usage in Python (see the sketch right after this list)
  3. Limit memory usage on Linux
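
Routes (2) and (3) essentially meet in Python's resource module, which sets the Linux per-process limits from inside the script. This is an untested sketch (the 10 GB figure is just my target, and I am not sure how it interacts with the forked worker processes):

    # Untested sketch: cap the virtual address space of the current process
    # before starting the grid search. Forked worker processes inherit the
    # limit, but note it applies per process, not to the job as a whole.
    import resource

    TEN_GB = 10 * 1024 ** 3  # bytes

    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (TEN_GB, hard))

One caveat with this route is that RLIMIT_AS makes allocations beyond the cap fail (a MemoryError in Python) rather than throttling them, so the grid search may simply crash instead of slowing down.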

For now, I have experimented with (1) by setting different values for n_jobs and pre_dispatch and then checking the memory usage of the relevant processes on Linux (free -h, ps aux --sort-rss etc.).
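
To make that checking less manual, this is roughly the kind of monitoring I mean (a rough sketch that assumes the third-party psutil package; psutil.Process(pid) can be pointed at the driver's PID if it is run from a separate shell):

    # Rough monitoring sketch (assumes psutil is installed): sums the
    # resident set size of a process and all of its children.
    import time
    import psutil

    def total_rss_gb(pid=None):
        parent = psutil.Process(pid)  # defaults to the current process
        procs = [parent] + parent.children(recursive=True)
        return sum(p.memory_info().rss for p in procs) / 1024.0 ** 3

    while True:
        print('total RSS: %.2f GB' % total_rss_gb())
        time.sleep(5)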

However, I think that this is pretty inefficient because you cannot specify an exact memory cap (e.g. 10 GB), and the memory usage of these processes changes constantly as time passes. As a result, I have to keep a constant eye on memory usage and then keep adjusting the values of n_jobs, pre_dispatch and so on.
