Google researchers have found ways to make machine-generated speech sound more natural to humans, members of Google’s Brain and Machine Perception teams said today in a blog post that included samples of the more expressive voices. Earlier today, Google announced the beta release of its Cloud Text-to-Speech services to provide customers with the same speech synthesis used by Google Assistant. Google’s Cloud Text-to-Speech is powered by DeepMind’s WaveNet, which can also be used to generate natural-sounding voices.
Services like Cloud Text-to-Speech, along with the research methods introduced today, could bring more natural speech to devices, apps, or digital services that rely on voice control or voice computing.
The new methods for making voices sound human are presented in two recently published papers on how to mimic features such as stress and intonation in speech, which linguists refer to as prosody. Both papers document techniques that build on Tacotron 2, an AI system that uses neural networks trained to mimic human speech and that made its debut last December.
Though Tacotron sounded like a human voice to the majority of people in an initial test with 800 subjects, it’s unable to imitate things like stress or a speaker’s natural intonation. In the first study, coauthored by Tacotron co-creator Yuxuan Wang, the transfer of traits like stress level was achieved by embedding style from a recorded clip of human speech.
“This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress,…