I'm new to AI, so apologies if I use the wrong terminology here.
I'm extracting some information from a body of text, and I have set up Llama 2 on Hugging Face via their Inference Endpoints so I can call it with curl.
The curl call works for short inputs, and I get a generated_text answer back, but longer responses come back severely truncated: I receive only a few words where I'm expecting a lot more.
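For reference, here is roughly what my call looks like (the endpoint URL, token, and prompt are placeholders):

```bash
# Minimal call to my Inference Endpoint (URL and token redacted)
curl https://my-endpoint.endpoints.huggingface.cloud \
  -X POST \
  -H "Authorization: Bearer hf_xxx" \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Extract the key facts from the following text: ..."}'
```

The response comes back as a JSON array like [{"generated_text": "..."}], and the generated_text field is where the truncation shows up.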
So I want to set max_new_tokens to a large number and temperature to 0, but I couldn't find where to do that. I don't mind whether it's set in the curl call or configured directly on the endpoint; either is fine. Does anyone know how to do this?
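To show what I mean, this is the kind of payload shape I was hoping would exist; the "parameters" object and its field names here are just my guess, not something I found in the docs:

```bash
# What I was hoping would work -- the "parameters" object is a guess on my part
curl https://my-endpoint.endpoints.huggingface.cloud \
  -X POST \
  -H "Authorization: Bearer hf_xxx" \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": "Extract the key facts from the following text: ...",
        "parameters": {
          "max_new_tokens": 2000,
          "temperature": 0
        }
      }'
```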