I am calling the inference endpoint of jumpstart-llama2-foundational-model on AWS SageMaker, but it returns the error below:
Error raised by inference endpoint: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (424) from primary with message:

{
  "code": 424,
  "message": "prediction failure",
  "error": "Allocation larger than expected: tag 'rms_qkv', requested size: 164831232, expected max size: '100663296'"
}

For scale, the expected max size (100663296 bytes) is exactly 96 MiB, and the requested size is roughly 1.6 times that.
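For reference, the same endpoint can also be invoked directly with boto3; a minimal sketch (test code only, assuming the same endpoint_name and region I use below):

import boto3
import json

smr = boto3.client("sagemaker-runtime", region_name=region)
payload = {
    "inputs": [[{"role": "user", "content": "Hello"}]],
    "parameters": {"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.1},
}
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
    CustomAttributes="accept_eula=true",
)
print(response["Body"].read().decode("utf-8"))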
My code snippet is as follows:
# endpoint_name, region, content_handler, docs, and Chat_llama are defined elsewhere in my app
import streamlit as st
from langchain.llms import SagemakerEndpoint
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=region,
    model_kwargs={"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.1},
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=content_handler,
)
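Here content_handler is the usual LangChain LLMContentHandler wired for the Llama 2 chat payload; a sketch of what mine looks like (the keys follow the JumpStart Llama 2 examples, and details may differ from my actual class):

import json
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # JumpStart Llama 2 chat format: a batch of dialogs,
        # each dialog a list of {"role", "content"} turns
        payload = {
            "inputs": [[{"role": "user", "content": prompt}]],
            "parameters": model_kwargs,
        }
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]

content_handler = ContentHandler()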
prompt_template = PromptTemplate(
    input_variables=["chat_history", "human_input", "context"],
    template=Chat_llama().get_template(),
)
chain = load_qa_chain(llm, chain_type="stuff", memory=st.session_state["memory"], prompt=prompt_template)
chain({"input_documents": docs, "human_input": prompt}, return_only_outputs=True)
response = chain.memory.buffer
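Since the "stuff" chain concatenates every document in docs into the context variable, I wonder whether the final prompt is simply too large for the endpoint. A rough size check I can run (sketch; ~4 characters per token is only an estimate, since I'm not running the Llama tokenizer locally):

stuffed_context = "\n\n".join(d.page_content for d in docs)
full_prompt = prompt_template.format(
    chat_history=st.session_state["memory"].buffer,
    human_input=prompt,
    context=stuffed_context,
)
# crude token estimate: ~4 characters per token for English text
print(len(full_prompt), "chars, ~", len(full_prompt) // 4, "tokens")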
Could someone point me in the right direction?