Researchers at Amazon have developed and trained a new large language model for text-to-speech that they claim exhibits "emergent" abilities.
The 980-million-parameter model, named BASE TTS, is the largest text-to-speech model yet built.
The researchers trained models of different sizes on up to 100,000 hours of public-domain speech data to see whether they would observe the same performance leaps that appear in natural language processing models once they grow past a certain scale.
They reported that their medium-sized, 400-million-parameter model, trained on 10,000 hours of audio, showed a marked improvement in versatility and robustness on tricky test sentences.
These included compound nouns, emotions, foreign words, and punctuation, features that trip up most conventional text-to-speech systems.
While BASE TTS didn't nail them all, it made orders of magnitude fewer mistakes in stress, intonation, and pronunciation than existing models. "These sentences are designed to contain challenging phrases — none of which BASE TTS is specifically trained to generate," write the researchers.
Interestingly, the largest 980-million-parameter model, trained on 100,000 hours of audio, didn't demonstrate any additional abilities over the 400-million-parameter one.
While still experimental, BASE TTS shows that these models can reach new versatility thresholds as they scale, an encouraging development for conversational AI.
The researchers plan further study to pinpoint the model size at which new abilities emerge. The model is also designed to be lightweight and streamable, packaging emotional and prosodic data separately.
This means the audio can be transmitted in packets over low-bandwidth connections while still sounding natural.