Voice AI Research

Goal: make computers talk like human.

My notes from audio (voice + speech) AI research started in 2023.

Emerging Research#

Audio-to-audio Models#

Audio-to-audio or more precisely speech-to-speech models are trained end-to-end.

It’s multimodal model = Audio Speech Recognition (ASR)/TTS + TTS + Language Model

Gazelle is an open-sourced audio-to-audio model. real-time conversational speech with large language models. Website: https://tincans.ai/ Demo: Tweet

This demo uses Gazelle, the world’s first public LLM with direct audio input. By skipping transcription decoding, we save time and can operate directly on speech - with inflection, tone, emotion, etc.

Discussions: Real-time voice chat with AI, no transcription (tincans.ai)
https://github.com/google-deepmind/deepmind-research/tree/master/perceiver - Perceiver [1] is a general architecture that works on many kinds of data, including images, video, audio, 3D point clouds, language and symbolic inputs, multimodal combinations, etc.
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation (paper)
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities (paper) by Fudan U, May 2023 - This paper is a great introduction to speech-to speech modeling and details many of the challenges and needed datasets.
https://github.com/alibaba-damo-academy/FunASR
Audiobox: Meta’s new foundation research model for audio generation (meta.com)
OpenVoice: Versatile instant voice cloning (myshell.ai)

TTS#

Classic and modern Text-To-Speech (TTS) models and technologies.

ElevenLabs
- ElevenLabs has developed AI models and tools for creating AI-generated voices with different languages, accents and emotions.
- Funding
  - ElevenLabs launches new generative voice AI products and announces 19m series A round

ElevenLabs today launches a set of new product developments, enabling anyone to create an entire audiobook on the platform in a matter of minutes, and an AI speech detection model Since launch, ElevenLabs has amassed over 1 million registered users who have generated over 10 years worth of audio content.
ElevenLabs tools can turn any text into speech using synthetic voices, cloned voices, or by creating entirely new artificial voices that can be tailored according to gender, age, and accent preferences. Through its research, ElevenLabs has been able to achieve a new level of speech quality that is almost indistinguishable from a real human with sub-1 second latency.

Voice AI startup ElevenLabs gains unicorn status after latest fundraising
Sonia Health backed by ElevenLabs - Mental Health for Every Mind
StyleTTS2 – open-source Eleven-Labs-quality Text To Speech (github.com/yl4579)
- StyleTTS 2 Setup Guide
- Vokan TTS - A StyleTTS2 fine-tune, designed for expressiveness.
HuggingFace’s Parler TTS

A female speaker with a soft-pitched voice delivering his words at a normal pace with a very clear audio and an angelic tone.
A female speaker with an angelic voice delivering her words at a normal pace with a very clear audio and an nice tone.

Any recommendation for human like voice AI model for conversation AI?
- Mozilla’s TTS - good
- TorToiSe TTS - good quality but slow. note: successor of Mozilla’s TTS
  - For offline/local TTS, Coqui TTS is quite good. It’s essentially a continuation of Mozilla’s TTS engine that Mozilla stopped working on ~2 years ago (and IIRC it’s largely the same team that worked on Mozilla TTS).
  - There is also this project, which aims to optimize and accelerate Tortoise TTS inference
- Mycroft’s mimic3 - fast latency
Coqui is a deep learning toolkit for Text-to-Speech, battle-tested in research and production.
Zero-Shot Speech Editing and Text-to-Speech in the Wild (paper) (Apr 2024) - VoiceCraft code
Does any other TTS on the market stand up to?

This was posted yesterday as an interesting tech/pricing focused view of a few

Neets vs. ElevenLabs
- Neets - Stop overpaying for quality TTS.

Where are we going? Last two miles.
Right now, there is a very uncanny valley in text-to-speech. Some speech-to-text model (usually whisper) decodes the user’s speech to text or directly transmits text to a large language model. That large language model generates a text response. This is fed to a TTS API and then streamed to the user. There is no backchanneling, just silence while compute is churned. If that is more than 500 milliseconds, the effect is very odd, almost like a very long-distance phone call. Even if it is very fast, the lack of any noise whatsoever from the putative speaker is eerie and destroys immersion. We hope to release a real-time always-on dialogue agent at some point next year. This, in our opinion, is the true last mile in text-to-speech. After that, the robots have won.

What’s the best TTS engine you’ve heard?
- Edge TTS
Reddit discussion: anyone played and experimented with StyleTTS2?

STT#

Classic and modern Speech-To-Speech (STT) models and technologies.

OpenAI’s Whisper v3, new generation open source ASR model
- Discussion: OpenAI releases Whisper v3
- Tweet: Distil-Whisper v3

Available with 🦀 WebGPU, Whisper.cpp, Transformers, Faster-Whisper and Transformers.js support!

Reddit discussion: Which STT API do I have to choose?
- “Deepgram, Whisper”
NVIDIA’s Parakeet and Canary model
- Top of HF Open ASR leaderboard:
  - Latency: Parakeet-TDT-1.1b
  - Low WER: Canary-1b
- New Model: NVIDIA’s Parakeet STT models beat whisper-large-v3
- Speech-to-Text Benchmark: 47,638 mins transcribed per $1 on RTX3070 Ti (1000-fold cost reduction than managed services)
- NVIDIA Speech and Translation AI Models Set Records for Speed and Accuracy
Deepgram
- Uncovering Voicebots: Secrets to building voice AI agents
- Introducing New Audio Intelligence Models for Sentiment, Intent, and Topic Detection
  - Product Showcase: Audio Intelligence

Voice Platform#

User Interface, Cloud service providers.

Vapi#

Vapi launch at ProductHunt
Application
- Tweet: Eugene Yan created an AI personal coach (demo)
  - Something similar: Suno Chat - The future of personal growth is here. Like chatting with your best friend, 24/7. A judgement-free zone made just for you.
Achieving PMF in voice AI
YC-backed productivity app Superpowered pivots to become a voice API platform for bots

Open Source Softwares#

Bolna - End-to-end platform for building voice first multimodal agents.
- Discussions:
  - Open source projects for connecting LLMs, TTS and STT (bolna.dev)
  - OSS voice based conversational API with <1sec latency and other nuances (bolna.dev)
BUD-E - LAION’s BUD-E: Enhancing AI voice assistants’ conversational quality, naturalness and empathy.
LocalAIVoiceChat - Local AI Voice Chat = llama_cpp with Zephyr 7B + RealtimeSTT with faster_whisper + RealtimeTTS with Coqui XTTS

Real-time Human Conversational AI Systems#

Fixie.ai - scaling real-time human conversational AI systems. CTO: Justin Uberti, ex-Google, created WebRTC and Google Duo; tech lead for Stadia, Hangouts.

Human conversations are fast, typically around 200ms between turns, and we think LLMs should be just as quick. This site, TheFastest.ai provides reliable measurements for the performance of popular models.

Applications#

Text editor
- Aqua Voice (YC W24) – Voice-driven text editor launch

Since voice-to-text has gotten so good …
Your big challenge right now is just that STT is still relatively slow for this usecase. Time will be on your side in that regard as I’m sure you know.
Voice is the future of a lot of the interactions we have with computers.
Not trying to hijack this. Great demo! But STT can be very much real-time now. Try SoundHound’s transcription service available through the Houndify platform [0] (we really don’t market this well enough). It’s lightning fast and it’s half of what powers the Dynamic Interaction demos that we’ve been putting out.

Retell AI (YC W24) – Conversational Speech API for Your LLM launch

Build AI Voice Agents that interact like humans, execute complex tasks, follow instructions, use any LLM

Precursor and alternative? SoundHound Dynamic Interaction demo. Thats’s fast!

Framework
- Livekit Agents - Build real-time multimodal AI applications 🤖🎙️📹
ChatGPT with voice is now available to all free users (twitter.com/openai)

Voice Cloning (Synthetic Voices)#

Real-Time-Voice-Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time.
Navigating the Challenges and Opportunities of Synthetic Voices by OpenAI
MockingBird
- “This repository is forked from Real-Time-Voice-Cloning which only support English.”
GPT-SoVITS - 1 min voice data can also be used to train a good TTS model! (few shot voice cloning).

Text-To-Audio#

Music Generative AI:

Tango 2

Uncategorized#

HF diarizers
- Based on pyannote-audio
  - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
Personal AI, Pi - Pi is designed to be a kind and supportive companion offering conversations, friendly advice, and concise information in a natural, flowing style.

Communities#

Are you using Voice AI?

the tech is very good, but not quite there yet I don’t think, it’s extremely close though.
While ElevenLabs seems the best, it’s a shame it lacks the ability to edit the clips a little more like some of the other tools have, for speeding up certain words, making them louder or adding in some emotion. The other tools do this far better, however they sound robotic, i’m exploring if this could be achieved with some manual editing. I’d go over quota pretty quickly. I imagine the cost will come down.
I’d like a TTS which is emotionally expressive and can be used for video game characters.

Are Voice AI Pipeline Platforms a Race to the Bottom?

“VoiceAI” platform examples are Vapi.ai (the best in my opinion), Bland.ai, Toma.so, Retell AI, Infer.so, Marr Labs, Elto.ai
Voice examples are - playht, elevenlabs, amazon polly

PlayHT2.0: State-of-the-Art Generative Voice AI Model for Conversational Speech (play.ht)

Realtime Speech Generation
Instant Voice Cloning
Cross-language and Accent Cloning
Directing Emotions

Self-proclaimed state of the art. A year ago, i would have been blown away, today, this is dramatically worse than Eleven Labs. Lower quality audio, strange cadence, pretty monotonic. It’s not what people sound like. I think it’s impressive, but i wouldn’t call it state of the art.
Looking for a 24-7 Real-Time Voice Transcription Tool

if you do decide to, start with ggml/whisper

New models and developer products (openai.com) (Nov 2023)
- As for all the surrounding stuff like detecting speech starting and stopping and listening for interruptions while talking, give my voice AI a try. It has a rough first pass at all that stuff, and it needs a lot of work but it’s a start and it’s fun to play with. Ultimately the answer is end-to-end speech-to-speech models, but you can get pretty far with what we have now in open source!
  - “How do you detect speech starting and stopping?”
    
    Using Silero VAD [What is VAD? Voice activity detection.
- A few notes on pricing:
  - ElevenLabs is $0.24 per 1K characters while OpenAI TTS HD is $0.03 per 1K characters. Elevenlabs still has voice copying but for many use-cases it’s no longer competitive.
- The new TTS is much cheaper than eleven labs and better too. I don’t know how the model works so maybe what i’m asking isn’t even feasible but i wish they gave the option of voice cloning or something similar or at least had a lot more voices for other languages. The default voices tend to make other language output have an accent.
  - I’m not sure if the tts is better than eleven labs. English audio sounded really good, but the Spanish samples I’ve generated are off a bit. It definitely sounds human, but it sounds like an English native speaker speaking Spanish. Also I’ve noticed on inputs just a few sentences long, it will sometimes repeat, drop, or replace a word. The accent part I’m okay with, but the missing words is a big issue.
- The TTS seems really nice, though still relatively expensive, and probably limited to English (?). I can’t wait until that level of TTS will become available basically for free, and/or self-hosted, with multi-language support, and ubiquitous on mobile and desktop.
  - It’s not limited to English. The model at least. Doubt the API will be too. Expensive ? Compared to what? Eleven labs costs an arm and a leg in comparison.
Hume – voice AI with emotional intelligence (hume.ai)
- I’ve been playing around with it for 15 minutes or so. It’s like having a conversation with five or six different people. It’s pretty awesome!
- This should rank higher. Absolutely mind-blowing stuff
Universal Speech Model (research.google)
Reddit discussions
- What about real time voice conversations with local LLMs? Is that a thing already or we’re not there just yet?

Tweets#

https://twitter.com/EthanSutin/status/1753182461440434400

The last demo depended on GPT-3.5 and ElevenLabs, but here's a full open-source stack running on a 16GB M1:

📝 Conversation Transcription: Whisper Medium
💬 Realtime Transcription: Whisper Small
🔊 VAD: Silero
🧠 LLM: Mistral-Instruct 7B
🗣️ TTS: StyleTTS2

https://twitter.com/zaprobest/status/1751495564192289213

A list of TTS
EmotiVoice: a Multi-Voice and Prompt-Controlled TTS Engine [INTERESTING]
VALL-E
Bark (open-sourced Suno)
OpenVoice

StyleTTS 2 - New king of the Text to Speech Arena! 👑 TTS Arena: Benchmarking TTS Models in the Wild: https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Related: https://twitter.com/reach_vb/status/1769842405988040841
https://twitter.com/DeniTechh/status/1780668616113058260

Now that voice is all the hype with AI voice products like AI Pin and AirChat, I thought it was a good time to try making my own voice chatbot! So I played around with @Vapi_AI and created a chatbot helping beginners utilise AI and learn programming!
Idea: A THERAPIST FOR DEVS GOING THROUGH CODING JOURNEY

OpenInterpreter’s 01 Light (demo)

Funding#

https://voice-ai-newsletter.krisp.ai/p/8-predictions-for-voice-ai-in-2024

With the fast advancements in on-device STT and LLM technologies, it’s apparent that this technology will become part of our daily routine beyond call centers.
We predict that the “second brain” sitting next to you and helping/coaching you during your meetings will already be a reality in 2024.
Cloud Speech-to-text will get 2x cheaper The launch of Whisper disrupted the Speech-to-text market.

Stack#

Design and systems architecture:

STT (Speech-To-Text)/ASR: Deepgram, Whisper API, Parakeet (FastConformer) / Canary by NVIDIA NeMo + Suno.ai
TTS (Text-To-Speech): StyleTTS 2, ElevenLabs, Neets.ai

Original text: GitHub Gist