
I'm trying to run a distributed Python job through Azure ML pipelines using the MpiStep pipeline class, referring to the example linked below - https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/pipeline-style-transfer/pipeline-style-transfer.ipynb

I implemented the same approach, but even when I change the node count parameter in the MpiStep class, the running script always reports a size (i.e. comm.Get_size()) of 1. Can you please help me with what I'm missing here? Is any specific setup required on the cluster?

Code snippets:

Pipeline code snippet:

# Mount datastore paths for the saved model, the labels, and the input images
model_dir = model_ds.path('./' + saved_model_blob + '/', data_reference_name='saved_model_path').as_mount()
label_dir = model_ds.path('./' + model_label_blob + '/', data_reference_name='model_label_blob').as_mount()

input_images = result_ds.path('./' + score_blob_name + '/', data_reference_name='Input_images').as_mount()

output_container = 'abc'
inti_container = 'xyz'

distributed_batch_score_step = MpiStep(
    name="batch_scoring",
    source_directory=SCRIPT_FOLDER,
    script_name="batch_scoring_script_mpi.py",
    arguments=["--dataset_path", input_images,
               "--model_name", model_dir,
               "--label_dir", label_dir,
               "--intermediate_data_container", inti_container,
               "--output_container", output_container],
    compute_target=gpu_cluster,
    inputs=[input_images, model_dir, label_dir],
    pip_packages=["tensorflow", "tensorflow-gpu==1.13.1", "pillow",
                  "azure-keyvault", "azure-storage-blob"],
    conda_packages=["mesa-libgl-cos6-x86_64", "mpi4py==3.0.2",
                    "opencv=3.4.2", "scikit-learn=0.21.2"],
    use_gpu=True,
    allow_reuse=False,
    node_count=nodecount_param,
    process_count_per_node=1
)
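
For context, nodecount_param above is a PipelineParameter whose value is supplied at submission time; a minimal sketch of its definition and the pipeline submission, matching the code shared in the comments below:

# Sketch based on the exchange in the comments: nodecount_param is a
# PipelineParameter, so the node count can be overridden per run at
# submission time without republishing the pipeline.
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline, PipelineParameter

nodecount_param = PipelineParameter(name="nodecount", default_value=1)

pipeline = Pipeline(workspace=ws, steps=[distributed_batch_score_step])
pipeline_run = Experiment(ws, 'test').submit(
    pipeline, pipeline_parameters={"nodecount": 2})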

Python Script code snippet:

def run(input_dataset, comm):
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Rank:", rank)
    print("Size:", size)  # always shows 1, even when the input node count is > 1
    print(MPI.Get_processor_name())

    file_names = get_file_names(args.dataset_path)
    file_names.sort()  # sorted() returns a new list; sort in place instead

    # Split the file list evenly across ranks so each process scores its own slice
    partition_size = len(file_names) // size
    print("partition_size-->", partition_size)
    partitioned_filenames = file_names[rank * partition_size: (rank + 1) * partition_size]
    print("RANK {} - is processing {} images out of the total {}".format(
        rank, len(partitioned_filenames), len(file_names)))

    # call to Function 01

    # call to Function 02

    img_names = score_df['image_name'].unique()
    output_batch = pd.DataFrame()
    for i in img_names:
        # call to Function 3
        output_batch = output_batch.append(pp_output, ignore_index=True)
        output_paths_list = comm.gather(output_batch, root=0)

    print("RANK {} - number of pre-aggregated output files {}".format(rank, len(output_batch)))

    print("saved in", currentDT + '\\' + 'data.csv')

    if rank == 0:
        print("RANK {} - number of aggregated output files {}".format(rank, len(output_paths_list)))
        print("RANK {} - end".format(rank))


if __name__ == "__main__":
    with tf.device('/GPU:0'):
        init()
        comm = MPI.COMM_WORLD
        run(args.dataset_path, comm)
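
To isolate the problem, a minimal sanity-check script (a sketch, not part of the original code) can be submitted as the MpiStep script; if it also reports a size of 1, MPI itself is not spanning the processes, independent of the scoring logic:

# check_mpi.py - hypothetical helper, not in the original pipeline: verifies
# that the MpiStep launcher actually spawns the expected number of MPI
# processes across the nodes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank {} of {} on host {}".format(
    comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))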
Lavanya
  • Hi Lavanya, can you please share the code snippet where you define `nodecount_param`, as well as how you submit the pipeline? In the notebook those lines look like this: `nodecount_param = PipelineParameter(name="nodecount", default_value=1) pipeline_run = Experiment(ws, 'style_transfer').submit(pipeline, pipeline_parameters={"style": "mosaic", "nodecount": 3})` – Santhosh Pillai Oct 03 '19 at 05:05
  • Hi Santhosh, thank you for your response. Below is the code that creates the nodecount parameter: `nodecount_param = PipelineParameter(name="nodecount", default_value=1)` Code to submit the pipeline: `pipeline = Pipeline(workspace=ws, steps=[distributed_batch_score_step])` `pipeline_run = Experiment(ws, 'test').submit(pipeline, pipeline_parameters={"nodecount": 2})` – Lavanya Oct 03 '19 at 06:31
  • Turns out the issue was due to the package version: earlier it was installed via conda with `conda_packages=["mpi4py==3.0.2"]`, and it worked after changing the install to pip - `pip_packages=["mpi4py"]` – Lavanya Oct 03 '19 at 09:16
  • Cool! You can convert your comment into an answer; it can help other community members who hit the same issue find the solution more easily, and we could archive this thread, thanks. – Hugh Lin Oct 03 '19 at 09:53

1 Answer


Turns out the issue was due to the package version: mpi4py was originally installed via conda with `conda_packages=["mpi4py==3.0.2"]`, and it worked after changing the install to pip - `pip_packages=["mpi4py"]`.
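
Concretely, this means moving mpi4py out of conda_packages and into pip_packages in the MpiStep definition; a sketch with all other arguments unchanged from the question's snippet:

distributed_batch_score_step = MpiStep(
    name="batch_scoring",
    source_directory=SCRIPT_FOLDER,
    script_name="batch_scoring_script_mpi.py",
    arguments=["--dataset_path", input_images,
               "--model_name", model_dir,
               "--label_dir", label_dir,
               "--intermediate_data_container", inti_container,
               "--output_container", output_container],
    compute_target=gpu_cluster,
    inputs=[input_images, model_dir, label_dir],
    # mpi4py is now installed via pip instead of conda
    pip_packages=["tensorflow", "tensorflow-gpu==1.13.1", "pillow",
                  "azure-keyvault", "azure-storage-blob", "mpi4py"],
    conda_packages=["mesa-libgl-cos6-x86_64", "opencv=3.4.2",
                    "scikit-learn=0.21.2"],
    use_gpu=True,
    allow_reuse=False,
    node_count=nodecount_param,
    process_count_per_node=1
)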

Lavanya
  • Great! Thanks for sharing your solution here. You could accept it as the answer so it can help other community members who hit the same issue, and we could archive this thread, thanks. – Hugh Lin Oct 07 '19 at 01:43