If we check /ph/upload in the Snapchat API documentation (last updated 23-12-2013), we can see that you can upload either a photo or a video.
Of course, this is not the latest version (although it is the most recent documentation I could find), but I am assuming nothing has changed in that regard.
That means the text is added to the photo in the mobile client app, not on the server.
In my opinion, you shouldn't base any decisions about your API architecture on Snapchat because it's unlikely you have the same use cases. In general:
- Sending the data separately (for example, the text apart from the image) is more flexible and makes the client implementation simpler.
- Rendering on the client is better for the user experience (everything is faster and the user sees the final result), and it also saves a lot of server resources (the more users you have, the more noticeable this becomes).
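
To make the first option concrete, here is a minimal sketch of a request body that keeps the caption apart from the image. Everything in it is an assumption for illustration: the field names `media` and `caption`, the normalized `x`/`y` position, and the helper functions are hypothetical and have nothing to do with Snapchat's actual API.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class Caption:
    """Hypothetical caption model: the text plus a normalized position."""
    text: str
    x: float  # horizontal position, 0.0 (left) to 1.0 (right)
    y: float  # vertical position, 0.0 (top) to 1.0 (bottom)


def build_upload_payload(image_bytes: bytes, caption: Caption) -> str:
    """Serialize the image and the caption as separate fields of one JSON body."""
    return json.dumps({
        # Raw bytes are not valid JSON, so the image is base64-encoded.
        "media": base64.b64encode(image_bytes).decode("ascii"),
        # The caption stays as structured data instead of being drawn into the image.
        "caption": asdict(caption),
    })


def parse_upload_payload(body: str) -> Tuple[bytes, Caption]:
    """Server-side counterpart: recover the untouched image and the caption."""
    data = json.loads(body)
    return base64.b64decode(data["media"]), Caption(**data["caption"])
```

With this shape, the server (or any other client) can still edit, translate, or search the caption later, which is not possible once the text has been rendered into the pixels. The trade-off is that every client then has to render the caption itself.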