Today we're introducing a new Generative Text-to-Voice AI Model trained to generate conversational speech. This model also brings, for the first time, the concept of Emotions to Generative Voice AI, allowing you to control and direct the generation of speech with a particular emotion. The model is available in closed beta and will be made accessible through our API and Studio.


Eight months ago, the PlayHT team released its first LLM for Speech Synthesis, PlayHT1.0, which at the time achieved SOTA results in speech synthesis quality and voice cloning. We showed it to the world by creating an AI podcast between Joe Rogan and Steve Jobs, inspiring a new genre of Conversational Generative Speech. For the first time, it was clear to people that AI-generated speech could achieve humanlike results in voice expressiveness and quality.

But even with all the capabilities that came from applying LLMs to Speech, PlayHT1.0 had limitations, some of which were:

  • Poor zero-shot capabilities.
  • Short maximum generation length.
  • No control over speech styles or emotions.
  • Support for English only.

These issues resulted from the model architecture, its small dataset, and limited speaker diversity.

Spurred by fantastic feedback and contributions from our users, our team got to work on a new model that not only solves the problems above but opens a new era for conversational, human-like AI voices and media creation.

Introducing PlayHT2.0

With PlayHT2.0, we increased our model size 10x and grew our training dataset to more than 1 million hours of speech across multiple languages, accents, and speaking styles.

It is a leap forward in the field of Speech Synthesis, based on an advanced neural network model akin to the transformer-based methods used by OpenAI in models like DALL·E 2, yet uniquely catered to the realm of audio.

At the heart of our system is a Large Language Model (LLM). Think of the LLM as a well-read individual who has spent over 500 years reading and absorbing countless transcriptions of audio clips, thus gaining a predictive superpower. When given a transcript and some clues about a specific speaker, this model takes a guess — a very intelligent guess — at what the corresponding audio should 'sound' like. It converts the text into simplified sound markers, commonly known as MEL tokens.

However, these MEL tokens are just the skeletal structure of the sounds: concise and code-like. That's where the crucial decoder model steps in, gently coaxing the skeletal sound markers to expand and fill out. It's a bit like turning a sparse sketch into full-blown, detailed artwork, transforming the simplified codes into sound waves that our vocoder model can understand to recreate human speech.
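The three-stage pipeline described above can be sketched conceptually. Everything below is a toy stand-in for illustration only: the real PlayHT2.0 stages are large neural networks, and none of these functions reflect its actual implementation.

```python
# Conceptual sketch of the LLM -> decoder -> vocoder pipeline.
# All "models" here are toy stand-ins, not PlayHT's real networks.

from typing import List

def lm_predict_mel_tokens(text: str, speaker_id: int) -> List[int]:
    """Stage 1: the LLM maps a transcript plus speaker clues to
    discrete MEL tokens. Toy stand-in: one pseudo-token per character."""
    return [(ord(c) + speaker_id) % 256 for c in text]

def decode_to_mel_frames(mel_tokens: List[int]) -> List[List[float]]:
    """Stage 2: the decoder expands compact tokens into dense
    mel-spectrogram frames (here, 4 floats per token)."""
    return [[t / 256.0] * 4 for t in mel_tokens]

def vocode(mel_frames: List[List[float]]) -> List[float]:
    """Stage 3: the vocoder turns mel frames into a waveform
    (here, just the flattened frame values)."""
    return [v for frame in mel_frames for v in frame]

text = "hello"
tokens = lm_predict_mel_tokens(text, speaker_id=7)
frames = decode_to_mel_frames(tokens)
waveform = vocode(frames)
print(len(tokens), len(frames), len(waveform))  # 5 5 20
```

The key structural point the sketch preserves: the tokens are compact and code-like, and each later stage expands them into a progressively denser signal.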

Conversational Capabilities

We trained PlayHT2.0 to generate humanlike conversations. Its ability to carry out a conversation naturally makes it suitable for conversational use cases like phone calls, podcasting, and audio messaging.

Generating humanlike speech requires the model to act as though it is thinking while speaking, using filler words and pauses to make the speech sound extremely realistic.

Here are a few conversations fully generated using PlayHT2.0.

Realtime Speech Generation

One of the significant issues we had with PlayHT1.0 was its compute-intensive nature and slow inference. Our team made major architectural innovations with PlayHT2.0 to help make the model more robust while reducing latency to conversational real-time levels. 

As of today, PlayHT2.0 is capable of generating speech in less than 800ms, with more optimizations to come in the near future.
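For conversational use, the metric that matters is time-to-first-audio: streaming chunks as they are generated, rather than waiting for the whole utterance, is what makes sub-second latency feel real-time. The sketch below illustrates that idea only; the generator is a toy stand-in and does not represent PlayHT's streaming API.

```python
# Conceptual sketch: stream audio chunk-by-chunk so playback can start
# long before the full utterance is generated. The "audio" here is just
# encoded text -- a toy stand-in for real model inference.

import time
from typing import Iterator

def generate_speech_chunks(text: str, chunk_chars: int = 8) -> Iterator[bytes]:
    """Yield a synthetic 'audio' chunk as soon as each slice is ready."""
    for i in range(0, len(text), chunk_chars):
        # real model inference for this chunk would happen here
        yield text[i:i + chunk_chars].encode("utf-8")

start = time.monotonic()
first_chunk_latency = None
received = b""
for chunk in generate_speech_chunks("Hi! How can I help you today?"):
    if first_chunk_latency is None:
        # this is the latency a caller actually perceives
        first_chunk_latency = time.monotonic() - start
    received += chunk

print(len(received), first_chunk_latency is not None)
```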

Instant Voice Cloning

PlayHT2.0 can replicate voices with stunning accuracy and resemblance from as little as 3 seconds of speech, in real time and without finetuning.
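Zero-shot cloning of this kind typically works by deriving a fixed-size speaker representation from the short reference clip and conditioning generation on it, so no model weights need to change. The sketch below shows only that general idea with a toy embedding function; it is an assumption for illustration, not PlayHT's actual method.

```python
# Conceptual sketch of zero-shot cloning: a short reference clip is
# reduced to a fixed-size speaker embedding, which then conditions
# generation. The embedding function is a toy stand-in.

from typing import List

EMBED_DIM = 8

def speaker_embedding(reference_audio: List[float]) -> List[float]:
    """Toy stand-in: bucket the clip into EMBED_DIM mean values."""
    n = max(1, len(reference_audio) // EMBED_DIM)
    return [sum(reference_audio[i * n:(i + 1) * n]) / n
            for i in range(EMBED_DIM)]

# ~3 seconds of synthetic "audio" at a toy 100 Hz sample rate
clip = [((i * 37) % 100) / 100.0 for i in range(300)]
emb = speaker_embedding(clip)
print(len(emb))  # 8
```

The point of the fixed-size embedding is that it is computed in one forward pass, which is why cloning can happen instantly rather than via finetuning.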

Here we clone a few voices, with different accents, from just a few seconds of audio to demonstrate the cloning capabilities of the model:

Original sample

Cloned sample

Original sample

Cloned sample

Cross-language and Accent Cloning

Because the model was trained and finetuned on such an extensive dataset, it can clone and generate voices in almost any language or accent. It can also take a voice recorded in one language and have it speak another language while preserving the original accent.

Original sample

Cloned sample

Directing Emotions

PlayHT2.0 was trained to understand emotions and speaking styles and apply them to any voice in real time; we are starting with a few basic emotions. The holy grail, however, is real control and directability of any style or emotion purely from prompts.

Prompt: happy

Prompt: sad

Prompt: fear

Prompt: disgust

Although directing emotions is still in its early stages and is expected to improve with usage and more training, the model's ability to understand emotions makes it possible to define custom emotions on the fly. For example, a user could prompt with keywords like scared, panting, or horrified, and let the model generate an emotion based on them. Such a feature opens up a whole new dimension of creative speech generation across use cases.
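One way to picture prompt-defined custom emotions is as blending: each recognized keyword maps to a point in some emotion space, and a multi-keyword prompt averages them into a single conditioning signal. The vocabulary, axes, and vectors below are illustrative assumptions, not PlayHT's actual emotion representation.

```python
# Conceptual sketch: map free-form emotion keywords onto a small
# (valence, arousal) vector and average them, so a prompt like
# "scared, panting, horrified" blends into one conditioning signal.
# All vectors here are illustrative stand-ins.

from typing import Tuple

KEYWORD_VECTORS = {
    "happy":     (0.9, 0.6),
    "sad":       (-0.8, -0.4),
    "fear":      (-0.7, 0.8),
    "scared":    (-0.7, 0.8),
    "disgust":   (-0.6, 0.3),
    "panting":   (0.0, 0.9),
    "horrified": (-0.9, 0.9),
}

def emotion_conditioning(prompt: str) -> Tuple[float, float]:
    """Average the vectors of recognized keywords; neutral if none match."""
    hits = [KEYWORD_VECTORS[w] for w in prompt.replace(",", " ").split()
            if w in KEYWORD_VECTORS]
    if not hits:
        return (0.0, 0.0)
    return tuple(sum(axis) / len(hits) for axis in zip(*hits))

print(emotion_conditioning("scared, panting, horrified"))
```

Averaging is only one possible blending rule, but it captures why unseen keyword combinations can still yield a coherent emotional style.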

PlayHT2.0 is now available in alpha through our Studio and API. Many major updates are coming over the next few weeks to improve the model's quality, speed, and capabilities so all our users can get the most out of it.