This problem has been tackled with both Generative Adversarial Networks (GANs) and RNNs, with a decent degree of success.
One fairly successful approach using GANs, introduced by Facebook AI Research here, uses a multi-scale generator. Basically, you downsample the image to several scales and predict the next frame at the coarsest resolution first. Then, using the upsampled coarse prediction together with the original frame at the next higher resolution, you predict the next frame at that scale, and the process continues up to full resolution. This multi-scale aggregation helps prevent blurriness and retains details. You can find the code here.
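To make the coarse-to-fine idea concrete, here's a toy numpy sketch of the multi-scale loop. Everything in it is a stand-in: `predict_at_scale` just blends two arrays where the real model uses a learned conv-net generator, and the scale factors are arbitrary. It only illustrates the control flow of predicting coarse, upsampling, and refining.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool a 2D image by an integer factor."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(img, factor):
    """Nearest-neighbour upsampling by an integer factor."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def predict_at_scale(frame, coarse_guess):
    """Stand-in for a learned generator at one scale: refine the upsampled
    coarse prediction using the frame at this resolution.
    (A real model is a conv net; this just averages the two inputs.)"""
    return 0.5 * frame + 0.5 * coarse_guess

def multiscale_predict(frame, scales=(4, 2, 1)):
    """Coarse-to-fine next-frame prediction: start from the coarsest scale,
    then repeatedly upsample the prediction and refine it at the next
    finer scale until full resolution is reached."""
    pred = downsample(frame, scales[0])  # coarsest initial guess
    for prev_f, f in zip(scales, scales[1:]):
        pred = upsample(pred, prev_f // f)  # bring the guess to the finer scale
        finer = downsample(frame, f) if f > 1 else frame
        pred = predict_at_scale(finer, pred)
    return pred

frame = np.arange(64, dtype=float).reshape(8, 8)
out = multiscale_predict(frame)
print(out.shape)  # (8, 8)
```

The point of the structure is that each scale only has to correct the residual detail the coarser scale missed, which is what fights the blurriness of single-scale regression.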
A more recent approach uses a combination of autoencoders and GANs. Basically, a variational encoder recurrently compresses the video stream into a latent space, and separate networks predict optical flow and the next frame from that latent representation. The predicted next frame is then fused with information from the predicted flow. You can read the paper here for details.
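A rough numpy sketch of that pipeline, with every component faked: the weights are random rather than learned, the "flow" is a single global (dx, dy) shift applied with `np.roll`, and the fusion is a scalar sigmoid gate. It only shows the data flow: recurrently fold frames into a latent state, then produce a flow-warped prediction and a direct prediction and blend them.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT = 16  # latent state size (arbitrary for this sketch)

# Stand-in weight matrices for the (normally learned) networks.
W_enc  = rng.standard_normal((LATENT, 64)) * 0.1      # frame -> latent
W_rec  = rng.standard_normal((LATENT, LATENT)) * 0.1  # recurrent update
W_flow = rng.standard_normal((2, LATENT)) * 0.1       # latent -> global (dx, dy)
W_pred = rng.standard_normal((64, LATENT)) * 0.1      # latent -> direct frame
W_mask = rng.standard_normal((1, LATENT)) * 0.1       # latent -> fusion gate

def step(z, frame):
    """Recurrently fold one 8x8 frame into the latent state."""
    return np.tanh(W_rec @ z + W_enc @ frame.ravel())

def predict_next(z, last_frame):
    """Predict the next frame two ways and fuse them:
    1) warp the last frame by a predicted global flow vector,
    2) decode a frame directly from the latent state,
    then blend the two with a predicted scalar gate."""
    dx, dy = np.round(W_flow @ z).astype(int)
    warped = np.roll(last_frame, (dy, dx), axis=(0, 1))  # flow-based prediction
    direct = (W_pred @ z).reshape(8, 8)                  # direct prediction
    m = 1.0 / (1.0 + np.exp(-(W_mask @ z)[0]))           # fusion weight in (0, 1)
    return m * warped + (1 - m) * direct

video = rng.standard_normal((5, 8, 8))
z = np.zeros(LATENT)
for frame in video:          # compress the whole stream into the latent
    z = step(z, frame)
next_frame = predict_next(z, video[-1])
print(next_frame.shape)  # (8, 8)
```

In the actual paper the flow is a dense per-pixel field and the fusion mask is per-pixel as well; the scalar versions here are just the smallest thing that exercises the same fuse step.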
Another approach, from Cornell, does this without GANs. It's rather complex, but in simple terms they use stacked LSTMs and propagate error signals to higher LSTM layers, whose job is then to predict the error for the next frame. Here's the paper.
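The predictive-coding flavour of that stack can be sketched in a few lines of numpy. This is not the paper's model (which uses convolutional LSTMs and splits errors into positive/negative channels); each toy layer here just keeps a running state as its "prediction", and the error it makes is what gets passed up to the layer above. On a perfectly predictable sequence, the error flowing through the stack should shrink over time.

```python
import numpy as np

class ErrorLayer:
    """One layer of a predictive-coding stack (a very rough sketch):
    it predicts its incoming signal, updates its state toward the
    input, and sends the prediction error upward."""
    def __init__(self, size, lr=0.5):
        self.state = np.zeros(size)  # current prediction of the input
        self.lr = lr

    def forward(self, x):
        error = x - self.state                     # how wrong the prediction was
        self.state = self.state + self.lr * error  # move prediction toward input
        return np.abs(error)                       # error signal sent upward

def run_stack(video, n_layers=2):
    """Feed a (T, N) sequence through stacked error-predicting layers;
    each layer receives the error signal of the layer below, and the
    top layer's mean error is recorded per timestep."""
    layers = [ErrorLayer(video.shape[1]) for _ in range(n_layers)]
    errors = []
    for frame in video:
        signal = frame
        for layer in layers:
            signal = layer.forward(signal)
        errors.append(signal.mean())
    return errors

video = np.tile(np.linspace(0.0, 1.0, 16), (10, 1))  # static, fully predictable
errs = run_stack(video)
print(errs[0] > errs[-1])  # True: error decays on a predictable sequence
```

The design point this illustrates is that the training target for the upper layers is the lower layers' error, not the raw pixels, so unsurprising content stops propagating upward once it is well predicted.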
There are other approaches too, but these seem to be widely cited and have code available online.