I am calling the inference endpoint of jumpstart-llama2-foundational-model on AWS SageMaker, but it returns the error below:
Error raised by inference endpoint: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (424) from primary with message:

{
  "code": 424,
  "message": "prediction failure",
  "error": "Allocation larger than expected: tag 'rms_qkv', requested size: 164831232, expected max size: '100663296'"
}

For scale, the expected max size (100663296 bytes) is exactly 96 MiB, and the requested size is roughly 1.6 times that.
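For reference, the same endpoint can also be invoked directly with boto3; a minimal sketch (test code only, assuming the same endpoint_name and region I use below):

import boto3
import json

smr = boto3.client("sagemaker-runtime", region_name=region)
payload = {
    "inputs": [[{"role": "user", "content": "Hello"}]],
    "parameters": {"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.1},
}
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
    CustomAttributes="accept_eula=true",
)
print(response["Body"].read().decode("utf-8"))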
My code snippet is as follows:
# endpoint_name, region, content_handler, docs, and Chat_llama are defined elsewhere in my app
import streamlit as st
from langchain.llms import SagemakerEndpoint
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

llm = SagemakerEndpoint(
    endpoint_name=endpoint_name,
    region_name=region,
    model_kwargs={"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.1},
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=content_handler,
)
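Here content_handler is the usual LangChain LLMContentHandler wired for the Llama 2 chat payload; a sketch of what mine looks like (the keys follow the JumpStart Llama 2 examples, and details may differ from my actual class):

import json
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # JumpStart Llama 2 chat format: a batch of dialogs,
        # each dialog a list of {"role", "content"} turns
        payload = {
            "inputs": [[{"role": "user", "content": prompt}]],
            "parameters": model_kwargs,
        }
        return json.dumps(payload).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]

content_handler = ContentHandler()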
prompt_template = PromptTemplate(
    input_variables=["chat_history", "human_input", "context"],
    template=Chat_llama().get_template(),
)
chain = load_qa_chain(llm, chain_type="stuff", memory=st.session_state["memory"], prompt=prompt_template)
chain({"input_documents": docs, "human_input": prompt}, return_only_outputs=True)
response = chain.memory.buffer
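Since the "stuff" chain concatenates every document in docs into the context variable, I wonder whether the final prompt is simply too large for the endpoint. A rough size check I can run (sketch; ~4 characters per token is only an estimate, since I'm not running the Llama tokenizer locally):

stuffed_context = "\n\n".join(d.page_content for d in docs)
full_prompt = prompt_template.format(
    chat_history=st.session_state["memory"].buffer,
    human_input=prompt,
    context=stuffed_context,
)
# crude token estimate: ~4 characters per token for English text
print(len(full_prompt), "chars, ~", len(full_prompt) // 4, "tokens")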
Could someone point me in the right direction?