I don't think this is possible with native WebVTT cues. They are rendered through the ::cue pseudo-element, which is not part of the DOM, so you can't bind events to them. Styling through ::cue is also limited to a small set of CSS properties.
However, you should be able to leverage TextTrack events to accomplish something that works in a similar way. You can assign a handler to the text track's oncuechange property (the cuechange event), and then use the track's activeCues to render your own captions in a custom div. That div can then be styled and given whatever event handlers you want. If you don't want the native captions drawn as well, set the track's mode to 'hidden'; cuechange still fires in that mode.
The following grabs the first text track from your video and reads the text of the currently active cue every time a cue change occurs.
$('video')[0].textTracks[0].oncuechange = function() {
  // activeCues is empty between cues, so guard before reading the text
  var cue = this.activeCues[0];
  var currentCue = cue ? cue.text : '';
  // add current cue text to custom caption div
};
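To make the "custom caption div" part more concrete, here is a minimal sketch of how the handler could be wired up. The helper names attachCustomCaptions and activeCueText are made up for illustration, and the container is assumed to be any element you have placed over the video; adjust to your own markup.

```javascript
// Sketch only: attachCustomCaptions and activeCueText are hypothetical helper names.

// Join the text of every currently active cue (there can be more than one).
function activeCueText(track) {
  var cues = track.activeCues || []; // activeCues is null while the track is disabled
  var lines = [];
  for (var i = 0; i < cues.length; i++) {
    lines.push(cues[i].text);
  }
  return lines.join('\n');
}

// Hide the native captions and mirror the active cue text into `container`.
function attachCustomCaptions(track, container) {
  track.mode = 'hidden'; // cues stay active and cuechange still fires, but nothing is drawn
  track.oncuechange = function () {
    container.textContent = activeCueText(track);
  };
}
```

In the browser you would call something like attachCustomCaptions($('video')[0].textTracks[0], $('#captions')[0]), where #captions is an assumed id for your own div.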
You will probably need to split the cue text into words and wrap each word in its own span so you can attach events to it, add highlight classes, etc. Then you can style and interact with each piece however you'd like.
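One way the per-word wrapping could look is sketched below. It assumes the cue text is plain text (no WebVTT tags such as voice or italic spans), and the function names, the cue-word class, and the data-index attribute are all invented for this example.

```javascript
// Sketch only: escapeHtml, cueTextToSpans, "cue-word" and data-index are hypothetical names.

// Escape characters that are special in HTML so cue text cannot inject markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Wrap each whitespace-separated word of a cue in its own span.
function cueTextToSpans(text) {
  return text
    .split(/\s+/)
    .filter(function (word) { return word.length > 0; })
    .map(function (word, index) {
      return '<span class="cue-word" data-index="' + index + '">' +
        escapeHtml(word) + '</span>';
    })
    .join(' ');
}
```

Inside the cuechange handler you could then assign the result to your caption div's innerHTML and attach a single delegated click handler to the div (for example $('#captions').on('click', '.cue-word', handler), with #captions again being an assumed id) rather than binding one handler per word.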
https://developer.mozilla.org/en-US/docs/Web/API/TextTrack