Today we're introducing the first-ever generative text-to-voice AI model capable of synthesizing humanlike speech with remarkable voice-cloning characteristics.
The future is AI generated
Content creation and creative work are changing forever with the advent of generative ML models like GPT-3 & BLOOM (text generation), DALL·E & Stable Diffusion (image generation), and RunwayML (video generation).
Today we are introducing our first model, PlayHT1.0, an ultra-realistic text-to-speech model for the missing modality in that new set of models: Generative Audio.
We believe in a future where all content creation will be generated by AI but guided by humans, and the most creative work will depend on the human ability to articulate their desired creation to the model.
The challenges of creating human-like Text to Speech
Text-to-speech (TTS) synthesizers have advanced greatly since the introduction of neural networks.
As a result, TTS systems can now synthesize multi-language, multi-speaker, multi-style, high-quality speech.
However, despite these achievements, current TTS systems typically demand high-quality, studio-recorded, annotated audio from different speakers with different styles and emotions to meet the needs of commercial applications.
Furthermore, adding a new speaker to the model usually requires at least 30 minutes of clean, studio-recorded data with phonetic annotations.
And even then, the synthesized speech often still sounds unnatural because its prosody lacks expressiveness (tempo, rhythm, power).
Our approach moves beyond current technology by introducing a novel TTS method that synthesizes speech with a higher degree of realism, making it virtually indistinguishable from natural speech as spoken by humans.
To achieve this, we don't rely on high-quality annotated data, but on audio itself as it is naturally uttered.
A Generative Audio approach with PlayHT
Unlike most standard speech synthesis ML models and text-to-speech APIs, which are designed to trade quality and expressiveness for compute, PlayHT1.0 was designed from the ground up to generate the most expressive and emotional speech and to imitate a human voice vividly.
PlayHT1.0 employs the same concept as large language models such as DALL·E and GPT-2.
As a result our model, PlayHT1.0, can not only speak in thousands of voices, but has also learned the intricacies of human speech like emotion, tone, even laughter – all in a self-supervised manner.
Aside from the great improvement in naturalness, voice cloning can be done with less than 30 seconds of recorded audio from a single speaker, without the need for transcripts, bringing the multi-speaker, multi-style capability of TTS-based applications to another level of performance.
And because it is a large language model, it can compress hundreds of thousands of voices into a few gigabytes of knowledge, from which it can then generate an infinite number of voice variations, emotions, and styles.
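The language-model framing above can be sketched, in highly simplified form, as a two-stage pipeline: continuous audio frames are quantized into discrete tokens, a language model predicts those tokens autoregressively, and a decoder would then turn generated tokens back into a waveform. The toy quantizer and bigram model below are illustrative assumptions for exposition only, not PlayHT1.0's actual architecture.

```python
import numpy as np

# --- Stage 1: quantize continuous audio frames into discrete tokens ---
# (a toy nearest-neighbor stand-in for a learned audio codec / tokenizer)

def quantize(frames, codebook):
    """Map each frame vector to the index of its nearest codebook entry."""
    # distances: (num_frames, codebook_size)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# --- Stage 2: model the token sequence autoregressively ---
# (a toy bigram model standing in for a large transformer LM)

def bigram_probs(tokens, vocab_size):
    """Estimate token-to-token transition probabilities with add-one smoothing."""
    counts = np.ones((vocab_size, vocab_size))
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def generate(probs, start, length, rng):
    """Sample a new token sequence one token at a time."""
    seq = [int(start)]
    for _ in range(length - 1):
        seq.append(int(rng.choice(len(probs), p=probs[seq[-1]])))
    return seq

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 codes, 4-dimensional frames
frames = rng.normal(size=(200, 4))   # placeholder "audio" frames
tokens = quantize(frames, codebook)
probs = bigram_probs(tokens, vocab_size=8)
sample = generate(probs, start=tokens[0], length=50, rng=rng)
# In a real system, 'sample' would be decoded back into a waveform.
```

Training on raw token sequences like this is what makes the approach self-supervised: no transcripts or phonetic annotations are needed, only audio.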
Here's an AI-generated podcast built entirely using PlayHT.
Why Text to Speech
Speech has an instant practical use case in real world applications.
In industries such as gaming, animation, film and eLearning, voice plays a crucial role.
But creating voice has always been a challenge, whether in cost, time, or the countless rounds of back-and-forth editing.
With an AI voice model like PlayHT, we can reduce voice production costs, save time, and provide instant access to a library of voices that can narrate, explain, engage, and captivate listeners' attention like never before.
And this is the first model in a set of models and tools we are building to help unlock truly expressive human-like voice generation at scale.