The example you give is quite similar to what is being done currently across the internet for image thumbnails. It's most commonly used for animated video previews on... adult... websites. If you Google search for ".jpg" filetype:vtt you'll find some interesting examples that take the same approach as yours.
As Murray already pointed out, it's not really the proper way to do it. The metadata track option he cites is much more in line with the correct use of the VTT spec. However, it's also not broadly supported, so you could follow the VTT guidelines and still end up with a file that many players can't read.
One other option is CSS. VTT is designed to work nicely with CSS, so you could include your image as a CSS background-image. That way it's separate from the text content and (at least in theory) some players might even be able to display it properly.
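
To make that concrete, here is a rough sketch of what I mean, not something I'd promise works in any given player. The class name thumb, the cue text and the image file name are all made up for illustration, and I'm assuming the rule lives in the page's own stylesheet rather than inside the VTT file.

```css
/* Hypothetical cue in the .vtt file, with the text wrapped in a class span:
 *
 *   00:00:05.000 --> 00:00:10.000
 *   <c.thumb>Chapter 1</c>
 */

/* Page stylesheet: ::cue() targets the class span inside the cue. */
video::cue(.thumb) {
  background-image: url("thumb-chapter1.jpg"); /* assumed file name */
  background-repeat: no-repeat;
  background-size: contain;
}
```

The background properties are among the few that the spec lets you apply through ::cue, but how faithfully a player renders a background image there varies, so test it in the players you care about before relying on it.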