AI Researchers Assemble ‘Moments in Time’
Researchers from the Massachusetts Institute of Technology and IBM's Watson unit have released a massive data set containing 1 million annotated video clips designed to spur development of new machine vision technologies. Potential applications range from self-driving car navigation to a form of video closed-captioning for the visually impaired.
The "Moments in Time" data set was released during this week's Neural Information Processing Systems Conference. The video database "can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis," the MIT-IBM researchers noted in a separate paper.
The data set also includes more than 300 verbs used to label basic human actions. The intention was to provide as wide a semantic coverage of English-language verbs as possible, the researchers noted.
Among the goals of the video database is capturing the "moments" humans encounter each day: a passing car, a child kicking a ball, a flock of birds flying overhead. The human brain is able to process these scenes with little effort. The researchers sought to describe these "moments" with key verbs such as kicking, walking or flying.
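The pairing of short clips with verb labels can be sketched as a simple list of annotated entries. A minimal illustration in Python follows; the filenames, labels, and annotation format here are invented for illustration and are not drawn from the actual data set:

```python
from collections import Counter

# Hypothetical annotation entries: each three-second clip is paired with
# a single verb label, mirroring the clip-and-action structure the
# researchers describe. Filenames and labels below are illustrative only.
annotations = [
    ("clip_0001.mp4", "kicking"),
    ("clip_0002.mp4", "walking"),
    ("clip_0003.mp4", "flying"),
    ("clip_0004.mp4", "walking"),
]

# Count how many clips carry each verb label -- a typical first step
# before training an action-recognition model on such a data set.
label_counts = Counter(label for _, label in annotations)
print(label_counts)
```

A real pipeline would, of course, load the clips themselves and feed frame sequences to a model; this sketch only shows the labeled-clip bookkeeping that such a data set implies.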
The work builds on other large data sets such as ImageNet, a collection of still images used for object recognition, and Places, a scene recognition database developed by MIT researchers.
Together, the data sets are being used to develop and train "visual understanding models" that take advantage of new deep learning capabilities. "While new algorithmic ideas have emerged over the years, this success can be largely credited to two other factors: massive labeled data sets and significant improvements in computational capacities, which allowed processing these data sets and training models with millions of parameters in reasonable time scales," noted Dan Gutfreund, a video analytics scientist with IBM Research AI.
The investigators noted that the choice of three-second video snippets was not arbitrary: three seconds is about the average short-term memory time span (although that time span seems to shorten when some of us enter a room and can't remember why we did).
The researchers hope to understand the human ability to effortlessly process scenes in which objects move and interact, and to apply that understanding to advance machine vision technology. "Automatic video understanding already plays an important role in our lives," Gutfreund explained.
"We predict that the number of applications will grow exponentially in domains such as assisting the visually impaired, elderly care, automotive, media and entertainment and many more," he added.
Among the next steps, the AI researchers said, is fine-tuning the video data set to reflect the different levels of abstraction associated with their roster of verbs. How, for example, can a machine be trained to distinguish between a "falling" tree and another falling object?