Following this, I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job with TensorBoard, but I cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I could do it by saving the logs to an S3 bucket and pointing a local TensorBoard instance at it, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there a command line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is that when I start the training job, nothing is created in my S3 bucket until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (on S3) during training, hopefully without modifying the API source code.
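To illustrate what I'm hoping for, a sketch (the bucket path is a placeholder): since TensorFlow can read and write `s3://` paths natively, passing an S3 URI as the model directory to the Object Detection API's `model_main.py` should make it write event files there during training, and a local TensorBoard could then point at the same path:

```sh
# Hoped-for usage (s3://my-bucket/... is a placeholder): let the training
# script write its checkpoints and event files straight to S3...
python model_main.py \
    --pipeline_config_path=/opt/ml/input/pipeline.config \
    --model_dir=s3://my-bucket/training-logs

# ...and watch the logs from a local machine while the job runs
tensorboard --logdir=s3://my-bucket/training-logs
```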
Edit
I finally made it work with the accepted solution (TensorFlow natively supports reading from and writing to S3); there are, however, additional steps to take:
- Disable network isolation in the training job configuration
- Provide credentials to the docker image to write to S3 bucket
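As a sketch of the two steps above when creating the job through boto3's `create_training_job` (the image URI, role ARN, bucket, and instance type below are placeholders; the execution role is assumed to grant read/write access to the log bucket):

```python
# Sketch: the relevant knobs when creating a SageMaker training job via
# boto3. All identifiers (image URI, role ARN, bucket) are placeholders.
def build_training_job_params(job_name, image_uri, role_arn, output_s3):
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # The role must grant S3 read/write on the log bucket so the
        # container can stream TensorBoard event files during training.
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {
            "InstanceType": "ml.p2.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
        # Network isolation must be disabled, otherwise the container
        # cannot reach S3 while the job is running.
        "EnableNetworkIsolation": False,
    }

params = build_training_job_params(
    "tf-od-training",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/tf-od:latest",
    "arn:aws:iam::123456789012:role/SageMakerRole",
    "s3://my-bucket/output",
)
# The actual call would then be:
# boto3.client("sagemaker").create_training_job(**params)
```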
The only thing is that TensorFlow continuously polls the filesystem (i.e. looking for an updated model in serving mode), and this causes useless requests to S3 that you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong: TF just writes logs, it is not polling, so it's expected behavior and the extra costs are minimal.