
I'm currently training a Top2Vec ML model on a CommonCrawl news dataset in Azure ML Studio. When I run my Python code inside an .ipynb notebook in ML Studio itself (online), the CPU is fully utilized (100% load), but when I execute the same script as a task in a job, CPU utilization (Monitoring) never exceeds 25%.

I've noticed that the section "containerInstance" in the full JSON job definition contains the resource settings for this container instance, which is configured as follows:

"containerInstance": {
    "region": null,
    "cpuCores": 2,
    "memoryGb": 3.5
}

However, I somehow cannot launch a job with more than 2 cpuCores and 3.5 GB RAM. My compute machine is a STANDARD_F4S_V2 instance with 4 vCPUs and 8 GB RAM, so I'd expect my container instance to use all available resources instead of only 50% of them.

These are the hyperparameters I use to train my model (a sketch of how they're passed to Top2Vec follows the block):

hyperparameters = {
    'min_count': 50,
    'topic_merge_delta': 0.1,
    'embedding_model': 'doc2vec',
    'embedding_batch_size': 32,
    'split_documents': False,
    'document_chunker': 'sequential',
    'chunk_length': 100,
    'chunk_overlap_ratio': 0.5,
    'chunk_len_coverage_ratio': 1,
    'speed': 'learn',
    'use_corpus_file': False,
    'keep_documents': True,
    'workers': 4,
    'verbose': True
}
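
For reference, the dictionary is passed straight into the Top2Vec constructor. A minimal sketch, where 'documents' is assumed to be my preprocessed list of news article strings (loading code not shown):

from top2vec import Top2Vec

# 'documents' is assumed to be the list of raw article strings
# extracted from the CommonCrawl news dataset (loading code omitted).
model = Top2Vec(documents=documents, **hyperparameters)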

Is there a possibility to edit the containerInstance options? I saw that I can configure "Process count per node", but that sounds like it controls how many times my script is executed in parallel.


1 Answer

I finally got to the root of the problem. It was not the Docker container instance failing to use all cores, but my Python script. The script relied on Python's threading library for parallel execution, but at the time I was unaware of the GIL (Global Interpreter Lock), which allows only one thread to hold control of the Python interpreter at a time, so my threads never ran CPU-bound work in parallel. After rewriting the script with the multiprocessing library, the Docker container instance used all available resources.
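
To illustrate the change (a minimal sketch, not my actual training code; process_chunk is a hypothetical CPU-bound function): threads share one interpreter and take turns under the GIL, while worker processes each get their own interpreter and can run on separate cores.

from multiprocessing import Pool

def process_chunk(chunk):
    # Hypothetical CPU-bound work, e.g. counting words in a batch of documents.
    return sum(len(doc.split()) for doc in chunk)

if __name__ == "__main__":
    # Eight chunks of 1000 documents each (dummy data for the sketch).
    chunks = [["some document text"] * 1000] * 8

    # threading.Thread workers would serialize on the GIL for this workload;
    # a process pool runs one worker per core instead.
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunks)
    print(sum(results))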

Nonetheless, if you want to manually define the number of CPU cores and the amount of RAM, you can use the Python script below to start your custom Azure ML job:

# Install azureml-core package first: pip install azureml-core

from azureml.core import RunConfiguration, Experiment, Workspace, ScriptRunConfig, Environment
from azureml.core.runconfig import DockerConfiguration

workspace = Workspace("<SUBSCRIPTION_ID>", "<RESOURCE_GROUP_NAME>", "<AZURE_ML_WORKSPACE_NAME>")

# 'Default' is the name of the ML experiment, change this if you need to.
experiment = Experiment(workspace, "Default") 
# Define the environment to be used.
env = Environment.get(workspace, name="top2vec-env", version="1") 
# If you have a compute cluster set up, enter the cluster name here; otherwise comment this
# line out and replace 'cluster' in the run_config.target assignment below with the name of
# your compute instance.
cluster = workspace.compute_targets['<COMPUTE_CLUSTER_NAME>']
run_config = RunConfiguration()
# Define the number of CPU cores and the amount of memory to be used by the Docker container instance. 
run_config.docker = DockerConfiguration(use_docker=True, arguments=["--cpus=16", "--memory=128g"], shm_size="64M") 
run_config.environment = env
run_config.target = cluster
run_config.command = "python main_file_of_your_python_script.py"
# Pass the required environment variables to run your script.
run_config.environment_variables = {} 
# Enter the relative or absolute path to your source directory. Everything in it will be uploaded to the computing VM.
config = ScriptRunConfig("<RELATIVE_PATH_TO_SOURCE_DIR>", run_config=run_config) 

script_run = experiment.submit(config)
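
Optionally, you can make the submitting process block until the job finishes and stream its logs, using the standard Run API on the object returned by experiment.submit:

script_run.wait_for_completion(show_output=True)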