Watch: MIT AI Animates Still Photos
The team used a deep-learning method called "adversarial learning" that involves training two competing neural networks. One network generates video, and the other discriminates between the real and generated videos.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developing a deep-learning algorithm that, given a still image from a scene, can create a brief video that simulates the future of that scene.
Trained on 2 million unlabeled videos that include a year’s worth of footage, the algorithm generated videos that human subjects deemed to be realistic 20 percent more often than a baseline model.
Future versions of this algorithm could be used for everything from improved security tactics and safer self-driving cars. According to CSAIL PhD student Carl Vondrick, the algorithm can also help machines recognize people’s activities without expensive human annotations.
“These videos show us what computers think can happen in a scene,” says Vondrick. “If you can predict the future, you must have understood something about the present.”
The team used a deep-learning method called “adversarial learning” that involves training two competing neural networks. One network generates video, and the other discriminates between the real and generated videos. Over time, the generator learns to fool the discriminator.
From that, the model can create videos resembling scenes from beaches, train stations, hospitals, and golf courses. For example, the beach model produces beaches with crashing waves, and the golf model has people walking on grass.
Previous systems build up scenes frame by frame, which creates a large margin for error. In contrast, this work focuses on processing the entire scene at once, with the algorithm generating as many as 32 frames from scratch per second.
“Building up a scene frame-by-frame is like a big game of ‘Telephone,’ which means that the message falls apart by the time you go around the whole room,” says Vondrick. “By instead trying to predict all frames simultaneously, it’s as if you’re talking to everyone in the room at once.”
Of course, there’s a trade-off to generating all frames simultaneously: While it becomes more accurate, the computer model also becomes more complex for longer videos. Nevertheless, this complexity may be worth it for sharper predictions.
To create multiple frames, researchers taught the model to generate the foreground separate from the background, and to then place the objects in the scene to let the model learn which objects move and which objects don’t.
Vondrick stresses that the model still lacks some fairly simple common-sense principles. For example, it often doesn’t understand that objects are still there when they move, like when a train passes through a scene. The model also tends to make humans and objects look much larger in size than reality.
Another limitation is that the generated videos are just one and a half seconds long, which the team hopes to be able to increase in future work. The challenge is that this requires tracking longer dependencies to ensure that the scene still makes sense over longer time periods. One way to do this would be to add human supervision.
“It’s difficult to aggregate accurate information across long time periods in videos,” says Vondrick. “If the video has both cooking and eating activities, you have to be able to link those two together to make sense of the scene.”