Google Research engineers have developed a deep learning system that can separate voices from audio-visual data recorded in crowded environments.
The system reproduces the “cocktail party” effect, the human brain's ability to isolate and focus on one or more particular voices in a crowd.
The system needs both audio and video inputs
The system is designed to process audio and video data at the same time. Google says it trained the system on over 100,000 high-quality videos of lectures and talks hosted on YouTube.
Each talk was given by a single speaker, with minimal background noise. The researchers used this footage to train the AI to recognize sounds based on lip and mouth movement.
In the next step of training, the researchers mixed different talks together, along with non-speech background noise, to create synthetic cocktail parties that made it harder for the AI to distinguish voices.
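The mixing step can be illustrated with a minimal sketch. The article does not describe Google's actual pipeline, so everything here is an assumption made for illustration: the function name `make_synthetic_mixture`, the `noise_gain` parameter and its default value, and the choice to represent each audio track as a one-dimensional NumPy array of samples.

```python
import numpy as np


def make_synthetic_mixture(clean_tracks, noise, noise_gain=0.3):
    """Build one synthetic "cocktail party" example (hypothetical helper).

    clean_tracks: list of 1-D arrays, each a single-speaker recording.
    noise: 1-D array of non-speech background audio.
    noise_gain: assumed scaling factor for the noise; not from the article.
    Returns the mixed waveform and the trimmed clean tracks.
    """
    # Trim everything to the shortest input so the arrays can be summed.
    length = min(len(t) for t in list(clean_tracks) + [noise])
    targets = [np.asarray(t[:length], dtype=np.float32) for t in clean_tracks]
    background = np.asarray(noise[:length], dtype=np.float32)

    # The mixture is the sum of all speakers plus scaled background noise.
    mixture = np.sum(targets, axis=0) + noise_gain * background
    return mixture, targets


# Stand-in signals: random samples in place of real speech and noise.
rng = np.random.default_rng(0)
speaker_a = rng.standard_normal(16000)
speaker_b = rng.standard_normal(16000)
background_noise = rng.standard_normal(16000)

mixture, targets = make_synthetic_mixture(
    [speaker_a, speaker_b], background_noise
)
```

In a setup like this, the mixture would be the model's input and the trimmed clean tracks would be the outputs it learns to recover, which is why starting from single-speaker talks with little background noise matters.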
The result was a system that can isolate voices in environments where multiple people are talking. The only condition is that the speaker's face must be visible on screen, so the AI can match one of the voice tracks to that face and prioritize it over the rest.
Google expects to deploy this tech in its products
The system has yielded spectacular results, and the company expects to use it in a range of products in the future.
“We envision a wide range of applications for this technology,” Google said. “We are currently exploring opportunities for incorporating it into various Google products.”
For example, this tech…