I am working on AWS machine learning with a MERN (MongoDB, Express, React, Node.js) stack. When I upload a data file (.csv) for machine learning, the training job fails after some time with the following TrainingFailed error:

AlgorithmError: CannotStartContainerError. Please make sure the container can be run with 'docker run train'. Please refer SageMaker documentation for details. It is possible that the Dockerfile's entrypoint is not properly defined, or missing permissions.

I have also configured the following settings in my AWS account: (screenshot omitted)

I have also granted the following permissions in my AWS account: (screenshot omitted)

I have also set all the keys in the MongoDB configuration settings. Even with all these settings and permissions in place, I cannot work out what else the machine learning process needs. Training never completes, and no model artifacts appear in the S3 bucket. It looks like this: (screenshot omitted). The SageMaker process does not start. Can anyone help me with this?

My Dockerfile, which is stored in the project folder under the name Dockerfile:

FROM ubuntu
RUN apt-get update
RUN apt-get install curl -y
RUN curl -sL https://deb.nodesource.com/setup_10.x -o nodesource_setup.sh
RUN bash nodesource_setup.sh
RUN apt install nodejs -y
WORKDIR /usr/app
COPY . /usr/app/
RUN npm install
EXPOSE 3000
ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py" ]

I have also pushed the code images for the SageMaker Linear Learner and XGBoost algorithms to Docker Hub, and created repositories in ECR on AWS. (screenshot omitted)

I also copied train.py to the opt/ml/code/train.py directory on AWS and got the output "output: /home/ec2-user/SageMaker/docker_test_folder", but I still get this error.

  • There is an issue with the Docker image. Can you post the Dockerfile? The image needs to have a file named train in the workdir (usually /opt/ml/code) that manages the training. – rok Jan 12 '21 at 08:55
  • Hi @rok, yes, I saved the Dockerfile in the project folder, and the file is named Dockerfile. Can you please tell me clearly what I need to do? I have posted the Dockerfile code; if there is any mistake in it, please tell me the corrections. Also, I can't understand what the issue with the Docker image is, because the image is already on Docker Hub. – mehul daxini Jan 12 '21 at 10:02
  • ** Dockerfile Code ** FROM ubuntu RUN apt-get update RUN apt-get install curl -y RUN curl -sL https://deb.nodesource.com/setup_10.x -o nodesource_setup.sh RUN bash nodesource_setup.sh RUN apt install nodejs -y WORKDIR /usr/app COPY . /usr/app/ RUN npm install EXPOSE 3000 ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py" ] – mehul daxini Jan 12 '21 at 10:03
  • Please edit your question and post the Dockerfile there with the correct formatting; it's unreadable this way. It also seems to me that you are not copying train.py into the container, and according to the documentation the name should be train, not train.py. – rok Jan 12 '21 at 10:11
  • Hi @rok, I edited my question and put the Dockerfile there. I still cannot understand what you mean, so I am asking again: do you mean that my Dockerfile needs to be renamed? Or what do you mean by "according to the documentation the name should be train, not train.py"? Where do I need to copy train.py? – mehul daxini Jan 12 '21 at 10:37
  • @rok Can you help me please? – mehul daxini Jan 12 '21 at 10:47

1 Answer

The error you get means that SageMaker is not able to launch your Docker image, because the entry point is not defined correctly. You can take a look at my repo. Basically, in your Dockerfile you have to install the required packages, create a folder, say /opt/ml/code, and put your training script, which must be named train, in that folder. The train file should follow some conventions that you can read here.
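
As a rough illustration of the layout described above, here is a minimal sketch of what a train script could look like in Python. It is not taken from the linked repo. It assumes the standard SageMaker container paths (/opt/ml/input/data/<channel> for input data and /opt/ml/model for artifacts), a channel named training, and an exec-form ENTRYPOINT so that the train argument SageMaker passes to docker run reaches the script. The names run_training, main, and model.json are hypothetical, and the row count is only a placeholder for real training logic.

```python
#!/usr/bin/env python3
"""Minimal sketch of a SageMaker `train` entry point (illustrative only)."""
import csv
import json
import sys
import traceback
from pathlib import Path


def run_training(prefix: Path, channel: str = "training") -> Path:
    """Read CSV files from the input channel and write a placeholder model file."""
    data_dir = prefix / "input" / "data" / channel  # SageMaker copies S3 data here
    model_dir = prefix / "model"  # everything written here is uploaded back to S3
    model_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder for real training: count the rows in every CSV of the channel.
    row_count = 0
    for csv_path in sorted(data_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            row_count += sum(1 for _ in csv.reader(handle))

    model_path = model_dir / "model.json"  # hypothetical artifact name
    model_path.write_text(json.dumps({"rows_seen": row_count}))
    return model_path


def main(prefix: Path = Path("/opt/ml")) -> int:
    """Run training; return 0 on success and 1 on failure."""
    try:
        run_training(prefix)
        return 0
    except Exception:
        # SageMaker surfaces /opt/ml/output/failure as the job's failure reason.
        output_dir = prefix / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(traceback.format_exc())
        return 1


# SageMaker starts the container as `docker run <image> train`; with an
# exec-form ENTRYPOINT, "train" arrives here as the first argument.
if __name__ == "__main__" and sys.argv[1:2] == ["train"]:
    sys.exit(main())
```

For a script like this to start, the image also has to install a Python interpreter and copy the script to the path that the ENTRYPOINT names. The Dockerfile in the question, as posted, does neither: it installs only curl and Node.js and copies the project to /usr/app.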

rok
  • Hi @rok, are all the commands you wrote in your Dockerfile useful for me? I ask because, as mentioned here, I am working with Node.js. – mehul daxini Jan 12 '21 at 12:53
  • I don't really understand what you are trying to do. Please clarify the role of the architecture (MERN) and what you want to obtain from SageMaker. This is what I do with SageMaker (look at my repo for details): I created a Docker image with TensorFlow installed. The image contains a training script (named train and written in Python) that calls the TensorFlow Object Detection API and trains a neural network. SageMaker itself downloads the training data from S3 and feeds it to the model, and at the end uploads the trained model back to S3. – rok Jan 12 '21 at 16:53
  • My code is written in Node.js and Express, from starting model training through to completing the machine learning process (see the progress bar in my question, which needs to reach 100%), and all keys are set in the MongoDB configuration. But the SageMaker process does not start and I get this error. As you suggested, the entry point is not set correctly in the Dockerfile, and you pointed me to your repository, but I am not familiar with Python, so I can't tell which code I need to take for my Dockerfile. Can you please help me? – mehul daxini Jan 13 '21 at 04:09
  • No. Sorry, but this is not a forum. Here you post very specific questions with an MWE and an explanation of what you have tried so far. Anyway, if I understand correctly, you have developed a Node.js/Express application to launch machine learning training jobs. Is that right? I don't think putting everything inside the container is a good idea. SageMaker should only be responsible for the training (see examples on the web and start using Python), so you should host your app somewhere else and communicate with SageMaker to launch training jobs. – rok Jan 13 '21 at 09:19
  • You write ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py" ] in your Dockerfile, but you don't copy any train.py to that path; there is no command that does that. Also, since you didn't post this train.py, it is not possible to know why it is not working. I cannot help more on this. – rok Jan 13 '21 at 09:21
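
The separation suggested in the comments (the app hosted elsewhere, SageMaker used only for training) can be sketched as follows, using Python as the comments recommend. This is an assumption-laden sketch, not code from the thread: build_training_job_request is a hypothetical helper, and the image URI, role ARN, and S3 URIs are values the caller must supply. The dictionary keys follow the SageMaker CreateTrainingJob request shape; the instance type, volume size, and runtime limit are arbitrary example values.

```python
from typing import Any, Dict


def build_training_job_request(
    job_name: str,
    image_uri: str,       # ECR URI of the custom training image
    role_arn: str,        # IAM role SageMaker assumes for the job
    input_s3_uri: str,    # S3 prefix holding the uploaded .csv files
    output_s3_uri: str,   # S3 prefix where model artifacts are written
    instance_type: str = "ml.m5.large",  # example value, adjust as needed
) -> Dict[str, Any]:
    """Build the keyword arguments for a CreateTrainingJob call."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                # Appears in the container as /opt/ml/input/data/training
                "ChannelName": "training",
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

With boto3 installed and credentials configured, the request could be submitted with boto3.client("sagemaker").create_training_job(**request). The AWS SDK for JavaScript exposes the same CreateTrainingJob operation, so a Node.js/Express backend could send an equivalent request without running inside the training container.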