I am performing a grid search with GridSearchCV (scikit-learn) on Spark and Linux. For this reason, I am running nohup ./spark_python_shell.sh > output.log & at my bash shell to ignite the Spark cluster, and I also get my python script running (see the spark-submit --master yarn 'grid_search.py' call below):
SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
spark-submit \
--master yarn \
--executor-memory 10g \
--num-executors 8 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name lets_get_starting \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g \
'grid_search.py'
This is the part of the grid_search.py Python script which connects the grid search to the Spark cluster and executes it:
# Spark configuration
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
# Execute grid search - using spark_sklearn library
from spark_sklearn import GridSearchCV
classifiers_grid = GridSearchCV(sc, estimator=classifier, param_grid=parameters, scoring='precision', cv=3, n_jobs=5, pre_dispatch=10)
classifiers_grid.fit(X, y)
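For reference, classifier and parameters are an ordinary scikit-learn estimator and parameter grid defined earlier in grid_search.py. Their exact definitions should not matter for the question; they have roughly this shape (the estimator and grid below are placeholders, not the real ones I use):
# Placeholder definitions - the real estimator and parameter grid are
# different, but they look like this
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=42)
parameters = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, None],
}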
This grid search apparently creates multiple processes on Linux, and these processes have different PIDs.
My question is the following:
How can I limit the memory usage of this grid search?
For example, how can I set the maximum memory usage for it at 10GB?
Theoretically speaking, there are three different routes to follow:
1. Limit memory usage on Scikit-Learn
2. Limit memory usage on Python (see the sketch after this list)
3. Limit memory usage on Linux
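To make route (2) more concrete, this is the kind of thing I have in mind, sketched with Python's resource module; I do not know whether such a limit would actually be inherited by the worker processes that the grid search spawns, which is part of what I am asking:
# Sketch of route (2): cap the address space of the current process at 10 GB.
# Whether this limit propagates to the grid-search worker processes is
# exactly what I am unsure about.
import resource
ten_gb = 10 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (ten_gb, ten_gb))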
For now, I have experimented with (1) by setting different values for n_jobs and pre_dispatch and then checking the memory usage of the relevant processes on Linux (free -h, ps aux --sort-rss, etc.).
However, I think that this is pretty inefficient because you cannot specify an exact memory cap (e.g. 10 GB), and the memory usage of these processes constantly changes as time passes. As a result, I have to keep a constant eye on the memory usage and then modify the values of n_jobs and pre_dispatch, and so on.
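For completeness, this is roughly how I check the memory of the worker processes from Python instead of free -h / ps aux. It is only a sketch: it assumes psutil is installed and that the workers show grid_search.py in their command line, which may not be how they actually appear in the process table:
# Rough monitoring sketch: sum the RSS of every process whose command line
# mentions grid_search.py (an assumption about how the workers appear in ps)
import psutil

total_rss = 0
for proc in psutil.process_iter(['pid', 'cmdline', 'memory_info']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    mem = proc.info['memory_info']
    if 'grid_search.py' in cmdline and mem is not None:
        total_rss += mem.rss
        print('PID {0}: {1:.0f} MB'.format(proc.info['pid'], mem.rss / (1024.0 ** 2)))
print('Total: {0:.0f} MB'.format(total_rss / (1024.0 ** 2)))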