
I'm deploying the TheBloke/Llama-2-7b-Chat-GPTQ model on SageMaker. I'm running this code in a SageMaker notebook instance and using an "ml.g4dn.xlarge" instance for the deployment. I used the same code that is shown under the "Deploy on Amazon SageMaker" button on the model's Hugging Face page.

After running the code, it processes for about 10 minutes and prints this output:

Output:

```
------------------*
```

The dashes indicate the model is deploying. After the dashes I got this error:

Error:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-08-24-06-51-13-816: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

  • "_Please check CloudWatch logs for this endpoint._" Did you do this? What did you find? – takendarkk Aug 24 '23 at 10:51
  • These are the CloudWatch logs for my deployment: `RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist`, then `ERROR shard-manager: text_generation_launcher: Shard complete standard error output: You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.`, and finally `Error: ShardCannotStart` – Faiq Aslam Aug 24 '23 at 11:35
  • This is my deployment code (its output was `------------------*` followed by the "Error hosting endpoint" message above): – Faiq Aslam Aug 24 '23 at 11:41

    ```python
    # Hub Model configuration. https://huggingface.co/models
    hub = {
        'HF_MODEL_ID': 'TheBloke/Llama-2-13B-chat-GPTQ',
        'SM_NUM_GPUS': json.dumps(1)
    }

    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
        env=hub,
        role=role,
    )

    # deploy model to SageMaker Inference
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        container_startup_health_check_timeout=300,
    )
    print("deployment done")
    ```

1 Answer


A few troubleshooting tips:

  1. Since you are trying to deploy a 13B-parameter model, use the default recommended instance type ml.g5.12xlarge from the official blog here. You might need to request a quota increase for ml.g5.12xlarge if you don't already have one. Alternatively, refer to Choosing instance types for large model inference, and the sample notebook Deploy Llama 2 13B with high performance on SageMaker using SageMaker LMI and rolling batch.

  2. Try changing the Hugging Face LLM container version from 0.9.3 to 0.8.2 and see if that works for you (a sketch combining this change with tip 1 follows this list).

  3. Follow the troubleshooting steps in Primary container did not pass ping health checks.
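Putting tips 1 and 2 together, here is a minimal sketch of the adjusted deployment. The instance type, GPU count, and container version are the changes described above; the longer startup timeout is an extra assumption to give the larger model time to load, not something from the original code:

```python
import json

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hub model configuration (same model ID as in the question)
hub = {
    'HF_MODEL_ID': 'TheBloke/Llama-2-13B-chat-GPTQ',
    'SM_NUM_GPUS': json.dumps(4),  # tip 1: ml.g5.12xlarge has 4 GPUs
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.8.2"),  # tip 2: older container
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # tip 1: recommended instance type for a 13B model
    container_startup_health_check_timeout=600,  # assumption: allow extra time for weight loading
)
```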

Alternate approaches:

For deploying large models, it is recommended that you follow Deploying uncompressed models.
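A minimal sketch of that approach, assuming the model artifacts have already been uploaded uncompressed under an S3 prefix (the bucket path here is hypothetical, and the `S3DataSource` form of `model_data` requires a recent sagemaker SDK):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Point model_data at a prefix of uncompressed artifacts instead of a model.tar.gz,
# so the endpoint skips repacking/untarring multi-GB weights at startup.
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.8.2"),
    model_data={
        "S3DataSource": {
            "S3Uri": "s3://my-bucket/llama-2-13b-chat-gptq/",  # hypothetical prefix
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    },
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```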

You can also check whether the model you are trying to deploy is already available in SageMaker JumpStart.
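If it is, deployment drops to a few lines with the JumpStart SDK. A sketch, assuming the Llama 2 13B chat model ID (verify the exact `model_id` in the JumpStart catalog; Llama 2 is gated and requires accepting the EULA):

```python
from sagemaker.jumpstart.model import JumpStartModel

# model_id is an assumption -- check the JumpStart catalog for the exact ID
model = JumpStartModel(model_id="meta-textgeneration-llama-2-13b-f")

# accept_eula=True acknowledges the Llama 2 license at deploy time
predictor = model.deploy(accept_eula=True)
```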
