Over at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), a team of six researchers built a machine-learning system that matches sound effects to video clips. Before you get too excited, the CSAIL algorithm can’t do its audio work on any old video, and the sound effects it makes are limited. For the project, CSAIL PhD student Andrew Owens and postgrad Phillip Isola recorded videos of themselves whacking a bunch of things with drumsticks: stumps, tables, chairs, puddles, banisters, dead leaves, the dirty ground.
The team fed that initial batch of 1,000 videos through its AI algorithm. By analyzing the physical appearance of objects in the videos, the movement of each drumstick, and the resulting sounds, the computer was able to learn connections between physical objects and the sounds they make when hit. Then, by “watching” different videos of objects being whacked, tapped, and scraped by drumsticks, the system was able to calculate the appropriate pitch, volume, and aural properties of the sound that should accompany each clip.
The algorithm doesn’t make its own sounds–it only pulls from a database of tens of thousands of audio clips. Also, sound effects aren’t selected based on visual matches; as you can see around the 1:20 mark of the video above, the algorithm gets creative. It selected sound effects as varied as a rustling plastic bag and a smacked stump for a sequence in which a shrub gets a thorough drumsticking.
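One way to picture “pulling from a database” is a nearest-neighbor lookup: the system predicts what a hit should sound like, then grabs the stored clip whose properties come closest. The Python sketch below is a toy illustration of that idea, not the CSAIL code. The clip names, the three features (pitch, loudness, decay), their values, and the scaling factors are all invented for the example.

```python
import math

# Hypothetical toy database: clip name -> (pitch_hz, loudness, decay_seconds).
# Every number here is made up for illustration; none comes from the CSAIL data.
CLIP_DATABASE = {
    "smacked_stump": (180.0, 0.80, 0.15),
    "rustling_plastic_bag": (2400.0, 0.35, 0.60),
    "tapped_table": (320.0, 0.60, 0.10),
    "splashed_puddle": (900.0, 0.50, 0.40),
}


def normalize(features, scales):
    """Divide each feature by its scale so no single feature dominates."""
    return tuple(value / scale for value, scale in zip(features, scales))


def closest_clip(predicted, database, scales=(1000.0, 1.0, 1.0)):
    """Return the name of the stored clip nearest to the predicted features.

    `predicted` stands in for whatever the vision model outputs for a video;
    here it is just a (pitch_hz, loudness, decay_seconds) tuple.
    """
    target = normalize(predicted, scales)
    return min(
        database,
        key=lambda name: math.dist(normalize(database[name], scales), target),
    )


if __name__ == "__main__":
    # A low, loud, short thud lands on the stump clip in this toy database.
    print(closest_clip((200.0, 0.75, 0.20), CLIP_DATABASE))
```

Because the match is made on predicted sound properties rather than on what the object looks like, a lookup like this can return a clip recorded from a completely different object, which fits the shrub-meets-plastic-bag result described above.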
Owens says the research team used a convolutional neural network to analyze video frames and a recurrent neural network to pick the audio for them. They leaned heavily on the Caffe deep-learning framework, and the project was funded by the National Science Foundation and Shell. One of the team members works for Google Research, and Owens was part of the Microsoft Research fellowship program.
“We’re largely applying existing techniques in deep learning to a new domain,” Owens says. “Our goal isn’t to develop new deep learning methods.”
Matching realistic sounds to video has primarily been the domain of Foley artists–the post-production audio wizards who record the footsteps, creaking doors, and flying roundhouse kicks you see (and hear) in a polished Hollywood movie. A skilled Foley artist can make a sound that precisely matches the visual, fooling the viewer into thinking that the sound was captured on the set.
MIT’s bot isn’t nearly that skilled. The research team carried out an online survey in which 400 participants were shown versions of the same video with the original audio and the algorithm-generated sounds, then asked to pick which video had the real sounds. The fake audio was selected 22 percent of the time–far from perfect, but still twice as effective as a previous version of the algorithm.
According to Owens, those test results are a good sign that the computer-vision algorithm can tell what materials an object is made of, as well as the different physics of tapping, whacking, and scraping an object. Still, certain things tripped the system up. Sometimes it guessed the drumstick was striking an object when it actually wasn’t, and more people were fooled by its sound effects for leaves and dirt than its sound effects for more solid objects.
There’s a deeper reason behind the project beyond simply making fun sound effects. If perfected, Owens thinks the computer-vision tech could help robots identify the materials and physical properties of an object by analyzing the sounds it makes. “We’d like these algorithms to learn by watching this physical interaction occur and observing the response,” Owens says. “Think of it as a toy version of learning about the world the way that newborns do, by banging, stomping, and playing with things.”