Yann LeCun has a great metaphor about how we might be using our neurons to learn things: to him, learning is a (piece of) cake. More precisely, a Génoise cake where the bulk of the sponge is self-supervised learning, the icing is supervised learning, and the cherry on top is reinforcement learning.
This cake metaphor highlights the amount of input data a system must "eat" to learn new things. What it takes to sustain a glutton mind is not just one cherry but the many layers of basic sponge cake, with lots of cream. Similarly, LeCun argues, intelligence emerges mostly from trying to predict the past/present/future of what happens in the outside world, without human supervision (1). A typical self-supervised learning task is to intentionally remove parts of a message and to reconstruct the whole message from what is left, a bit like a pupil learning a poem by heart.
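The remove-and-reconstruct objective can be sketched in a few lines. This is a deliberately naive illustration (the corpus, the bigram "model", and the [MASK] convention are my own assumptions, not LeCun's example): the only supervision signal is the text itself, since the masked words are known before masking.

```python
from collections import Counter, defaultdict

# Toy self-supervised setup: learn bigram statistics from raw text,
# then use them to fill in intentionally masked-out words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# "Training": count which word most often follows each word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def reconstruct(tokens):
    """Replace each [MASK] with the most frequent successor of the previous word."""
    out = []
    for tok in tokens:
        if tok == "[MASK]" and out:
            tok = follows[out[-1]].most_common(1)[0][0]
        out.append(tok)
    return out

masked = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
print(reconstruct(masked))
```

A real system like BERT replaces the bigram table with a deep network, but the shape of the task is the same: hide part of the message, predict it back.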
Not long after LeCun presents his theory (in an ACM talk, around 46'), he draws our attention to the fact that, while self-supervised learning works well with text (as the Web-scale BERT experiment showed), there is quite some room for improvement with images and videos. His take on why that is so is that text is discrete and easy to model probabilistically, while images and video are inherently continuous. Their projection onto high-dimensional vector spaces can only be imperfectly represented as probabilities.
In the specific task of predicting future frames in a video, convolutional neural networks tend to produce an "average of all possible futures" when one would suffice. LeCun's answer to that problem is the infamous Generative Adversarial Networks (GANs), those generator/discriminator architectures that can create deep fakes.
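The generator/discriminator idea can be shown on a toy 1-D problem rather than images (everything below is an illustrative assumption, not a deep-fake model): a linear generator turns noise into samples, a logistic discriminator scores how "real" a sample looks, and the two are updated in alternation with hand-derived gradients.

```python
import math, random

random.seed(0)
REAL_MEAN = 4.0  # the "real data" distribution is N(4, 1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Generator: x = a*z + b with z ~ N(0, 1). Discriminator: d(x) = sigmoid(w*x + c).
a, b = 1.0, 0.0
w, c = 1.0, 0.0
lr = 0.01

for step in range(2000):
    x_real = random.gauss(REAL_MEAN, 1.0)
    z = random.gauss(0.0, 1.0)
    x_fake = a * z + b

    # Discriminator ascends log d(x_real) + log(1 - d(x_fake)).
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1 - d_real) - d_fake)

    # Generator ascends log d(x_fake): make fakes fool the updated discriminator.
    d_fake = sigmoid(w * x_fake + c)
    a += lr * (1 - d_fake) * w * z
    b += lr * (1 - d_fake) * w

print(round(b, 2))  # the generator's offset should have drifted toward the real mean
```

Note that the generator is never shown a real sample; it only receives gradients through the discriminator's opinion, which is exactly what lets it commit to one plausible output instead of an average.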
The success of GANs is surprising and in fact, LeCun's observation that images and videos are continuous rather than discrete doesn't help explain why GANs are more suited than other architectures for generation.
There is another big difference between text and visual content: one comes from a language expressed by cognitive agents, the other is a passive recording of the physical world. Agents can speak a language but they cannot record the world as a camera would. They can only interpret it (and use arbitrary discrete symbols to convey their interpretation of it).
Human languages betray a notion of abstraction (some words are more generic than others) that recordings don't have. In that perspective, neural networks should learn to describe possible futures rather than to depict them, with all the useful simplifications one can make wherever there is uncertainty. Besides, humans are pretty bad at producing graphical depictions of the physical world, let alone photorealistic ones.
Why are GANs so good at producing deep fakes, then? Their basic language seems too primitive to deal with uncertainties: it can generate any image as a 2D array of pixels. Yet the language they are offered through training sets is much more restricted. The training sets used in deep-fake experiments on faces are all cropped in a consistent fashion, so that no other detail is salient (backgrounds are blurry, image size is standard). Color and shape distributions are much more restricted than the HSV space of arbitrary images. Similarly, the GAN that first imitated famous painters (CycleGAN) learns only higher-level features, given that most pixels remain in place in the target picture. In other words, the network does not learn to paint (à la Monet). It learns impressionism, given background knowledge on fine arts.
Explaining the success of GANs through the lens of language also explains why and how other forms of generative networks work well. In his talk, LeCun moves on to variational autoencoders, with an example of highway cameras monitoring traffic and explaining the behavior of individual cars that enter the camera's frame. The presented network already has a model of the world at hand: cars are rectangles with straight trajectories; they can brake, accelerate, and turn left or right, one lane at a time. This is a pretty simple language (2).
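That "simple language" can be made concrete as a handful of discrete behaviour words over a minimal car state (a hypothetical encoding of my own, not the representation used in the talk):

```python
from dataclasses import dataclass

# Vocabulary of the world model: a car's behaviour at each step is one word.
ACTIONS = ("keep", "accelerate", "brake", "turn_left", "turn_right")

@dataclass
class Car:
    position: float  # metres along the road
    lane: int
    speed: float     # metres per second

def step(car: Car, action: str, dt: float = 1.0) -> Car:
    """Advance one time step according to the chosen behaviour word."""
    speed, lane = car.speed, car.lane
    if action == "accelerate":
        speed += 2.0 * dt
    elif action == "brake":
        speed = max(0.0, speed - 4.0 * dt)
    elif action == "turn_left":
        lane -= 1            # one lane at a time
    elif action == "turn_right":
        lane += 1
    return Car(car.position + speed * dt, lane, speed)

car = Car(position=0.0, lane=1, speed=10.0)
for word in ("accelerate", "keep", "turn_right", "brake"):
    car = step(car, word)
print(car)
```

Predicting a car's future then amounts to predicting a short sentence in this five-word vocabulary, which is a far easier distribution to model than a distribution over raw pixels.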
To conclude: to make generic neural networks good at predicting the future, it might be more interesting to make them speak. This means they would also have to learn from textual descriptions of images. For instance, consider that other piece of work on "real" neural networks (human brains) that tries to reconstruct images from MRIs: did the patient record the whole image? Probably not. However, they surely are able to talk about it after watching it. An alternative experiment could be to produce textual descriptions from the same MRIs (or descriptions in some abstract scenic language).
See also Ce dont on ne peut parler, il faut l'écrire by Gilles Dowek, a book on computer languages.
(1) That brings us back to the discussion on learning causation (an interpretation) from experience (observations of truth values).
(2) It is even possible to describe driving policies with psychological terms, see Braitenberg vehicles.