
I have a processing job that I run as a step in SageMaker Pipelines. I pass my Python script filename/path to the script processor and also specify command = ['python3']. My main.py file can take an argument, and locally I can call it as such => python3 main.py -f somevalue.

How can I achieve the same thing while running this file via steps in SageMaker Pipelines? I tried this => command = ['python3', "src/main.py", "-f", "somevalue"], but this doesn't work.

Is there any other way to call my script and pass the argument?

main.py


import argparse

parser = argparse.ArgumentParser()
# Note: type=bool is a common pitfall (bool("false") is True), so take the
# value as a string and compare it instead
parser.add_argument("-f", "--flag", type=str, default="false")
args = parser.parse_args()

is_enabled = args.flag.lower() == "true"

def main():
    if is_enabled:
        # do something
        pass
   
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor

my_processor = ScriptProcessor(
    image_uri=your_image_uri,  # ScriptProcessor takes an image URI, not a framework_version
    role=role,
    instance_type=your_instance_type,  # e.g. 'ml.m5.large'
    base_job_name=your_base_job_name,
    instance_count=your_instance_count,  # e.g. 1
    command=['python3']
)

my_step = ProcessingStep(
    "MyPreprocessStep",  # step name (required)
    processor=my_processor,
    code=your_script_path,
    inputs=[
        ProcessingInput(
            input_name='custom',
            source='src/main.py',
            destination="/opt/ml/processing/input/data",
            s3_data_type='S3Prefix',
            s3_input_mode="File"
        )
    ],
    ...
)
1 Answer


For sending in additional CLI arguments, you'll want the arguments parameter rather than command, because all elements of command get positioned before your script location, whereas you want them after it, e.g. python3 your-script.py --flag
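As a rough mental model, the container's command line is assembled as command + [your script] + arguments, which is why extra tokens in command land in front of the script path. A minimal sketch (this is a simplification for illustration, not SageMaker's actual code):

```python
# Simplified sketch of how the processing container assembles its entrypoint
# (assumption: illustrative only, not SageMaker's real implementation).
command = ["python3"]            # ScriptProcessor's `command`
code = "src/main.py"             # the script passed via `code`
arguments = ["-f", "somevalue"]  # the job's `arguments`

# `command` tokens come first, then the script, then `arguments`
entrypoint = command + [code] + arguments
print(" ".join(entrypoint))  # python3 src/main.py -f somevalue
```

So putting "-f" inside command would produce python3 -f src/main.py ..., which is why that attempt fails.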

To minimise the code change in your notebooks between experimenting with jobs interactively and connecting them into pipelines, I'd also suggest switching to the newer pipeline session syntax as shown in this example. The pipeline session allows you to build step definitions with the same function calls as you would usually run jobs (estimator.fit(), processor.run(), etc).

...So instead of:

my_processor = ScriptProcessor(
    command=['python3'],
    ...
)

my_step = ProcessingStep(
    "MyPreprocessStep",
    processor=my_processor,
    code=your_script_path,
    inputs=[...],
    job_arguments=["--flag"],  # named job_arguments on ProcessingStep itself
    ...
)

...you could have:

from sagemaker.workflow.pipeline_context import PipelineSession

# Swap this out to run interactively instead of building a pipeline:
session = PipelineSession()

my_processor = ScriptProcessor(
    command=['python3'],
    sagemaker_session=session,
    ...
)

my_step = ProcessingStep(
    "MyPreprocessStep",
    step_args=my_processor.run(
        code=your_script_path,
        inputs=[...],
        arguments=["--flag"],
        ...
    ),
)

...which is closer to the syntax you'd usually use when initially testing your stand-alone job outside of a pipeline (processor.run()).
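Either way, you can sanity-check locally (plain argparse, no SageMaker involved) that the token list you'll pass via arguments/job_arguments parses the way your script expects. A small sketch, using the string-comparison style of flag parsing rather than the type=bool trap:

```python
import argparse

# Local sanity check: feed argparse the same token list you would pass
# to the job via `arguments`.
parser = argparse.ArgumentParser()
# Note: type=bool is a trap (bool("false") is True), so treat the value
# as a string and compare it explicitly.
parser.add_argument("-f", "--flag", type=str, default="false")

args = parser.parse_args(["-f", "true"])
is_enabled = args.flag.lower() == "true"
print(is_enabled)  # True
```

If this prints what you expect, the same tokens should behave identically inside the processing container.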
