Goal: make computers talk like humans.

My notes from audio (voice + speech) AI research, started in 2023.

Emerging Research#

Audio-to-audio Models#

Audio-to-audio models, or more precisely speech-to-speech models, are trained end-to-end.

It’s a multimodal model = Automatic Speech Recognition (ASR/STT) + TTS + Language Model.

  • Gazelle is an open-source audio-to-audio model for real-time conversational speech with large language models. Website: https://tincans.ai/ Demo: Tweet

This demo uses Gazelle, the world’s first public LLM with direct audio input. By skipping transcription decoding, we save time and can operate directly on speech - with inflection, tone, emotion, etc.

TTS#

Classic and modern Text-To-Speech (TTS) models and technologies.

ElevenLabs today launches a set of new product developments, enabling anyone to create an entire audiobook on the platform in a matter of minutes, plus an AI speech detection model. Since launch, ElevenLabs has amassed over 1 million registered users who have generated over 10 years’ worth of audio content.

ElevenLabs tools can turn any text into speech using synthetic voices, cloned voices, or by creating entirely new artificial voices that can be tailored by gender, age, and accent. Through its research, ElevenLabs has achieved a new level of speech quality that is almost indistinguishable from a real human, with sub-1-second latency.

A female speaker with a soft-pitched voice delivering her words at a normal pace with very clear audio and an angelic tone.

A female speaker with an angelic voice delivering her words at a normal pace with very clear audio and a nice tone.

This was posted yesterday as an interesting tech/pricing-focused view of a few …

Where are we going? Last two miles.

Right now, there is a deep uncanny valley in text-to-speech. A speech-to-text model (usually Whisper) decodes the user’s speech, or text is sent directly, to a large language model. That large language model generates a text response. This is fed to a TTS API and then streamed to the user. There is no backchanneling, just silence while compute is churned. If that silence lasts more than 500 milliseconds, the effect is very odd, almost like a very long-distance phone call. Even if it is very fast, the lack of any noise whatsoever from the putative speaker is eerie and destroys immersion. We hope to release a real-time always-on dialogue agent at some point next year. This, in our opinion, is the true last mile in text-to-speech. After that, the robots have won.
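A minimal sketch of that cascaded pipeline, with a check against the ~500 ms dead-air threshold mentioned above. The `transcribe`, `generate_reply`, and `synthesize` functions are stubs of my own (not any particular API), standing in for Whisper, an LLM, and a TTS engine:

```python
import time

# Stubs standing in for the real cascade components:
# an STT model (e.g. Whisper), an LLM, and a TTS engine.
def transcribe(audio: bytes) -> str:
    return "hello there"

def generate_reply(text: str) -> str:
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    return text.encode()

def handle_turn(audio: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS, with a latency check."""
    start = time.monotonic()
    reply_audio = synthesize(generate_reply(transcribe(audio)))
    latency_ms = (time.monotonic() - start) * 1000
    if latency_ms > 500:  # past ~500 ms the silence starts to feel odd
        print(f"warning: {latency_ms:.0f} ms of dead air before the reply")
    return reply_audio

handle_turn(b"\x00" * 32000)  # one second of silent 16-bit, 16 kHz audio
```

The point of the sketch is that the three stages run strictly in sequence, so their latencies add up; this is exactly the dead air that end-to-end speech-to-speech models try to remove.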

STT#

Classic and modern Speech-To-Text (STT) models and technologies.

Available with 🦀 WebGPU, Whisper.cpp, Transformers, Faster-Whisper and Transformers.js support!
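For reference, a minimal Faster-Whisper transcription sketch; the model size, `audio.wav` filename, and CPU/int8 settings are illustrative choices, not from the original notes (assumes `pip install faster-whisper`):

```python
from faster_whisper import WhisperModel

# "small" trades accuracy for speed; int8 keeps it CPU-friendly.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

# segments is a generator; transcription happens lazily as you iterate.
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```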

Voice Platform#

User interfaces and cloud service providers.

Vapi#

Open Source Software#

Real-time Human Conversational AI Systems#

  • Fixie.ai - scaling real-time human conversational AI systems. CTO: Justin Uberti, ex-Google, created WebRTC and Google Duo; tech lead for Stadia, Hangouts.

    Human conversations are fast, typically around 200 ms between turns, and we think LLMs should be just as quick. This site, TheFastest.ai, provides reliable measurements of the performance of popular models.
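A sketch of the kind of number TheFastest.ai tracks: time to first token (TTFT) from a streaming model. The `stream_tokens` generator below is a stand-in I made up for any streaming LLM client:

```python
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM client."""
    for token in ["Hi", " there", "!"]:
        time.sleep(0.05)  # simulated network/inference delay
        yield token

def time_to_first_token(prompt: str) -> float:
    """Return milliseconds until the first token arrives."""
    start = time.monotonic()
    for _ in stream_tokens(prompt):
        return (time.monotonic() - start) * 1000
    raise RuntimeError("no tokens received")

# Human turn-taking is ~200 ms, so that's the budget to beat.
print(f"TTFT: {time_to_first_token('hello'):.0f} ms")
```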

Applications#

Since voice-to-text has gotten so good …

Your big challenge right now is just that STT is still relatively slow for this use case. Time will be on your side in that regard, as I’m sure you know.

Voice is the future of a lot of the interactions we have with computers.

Not trying to hijack this. Great demo! But STT can be very much real-time now. Try SoundHound’s transcription service available through the Houndify platform [0] (we really don’t market this well enough). It’s lightning fast and it’s half of what powers the Dynamic Interaction demos that we’ve been putting out.

Build AI Voice Agents that interact like humans, execute complex tasks, follow instructions, and use any LLM.

Precursor and alternative? SoundHound Dynamic Interaction demo. That’s fast!

Voice Cloning (Synthetic Voices)#

Text-To-Audio#

Music Generative AI:

Uncategorized#

  • HF diarizers
    • Based on pyannote-audio
      • Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding (see the usage sketch after this list)
  • Personal AI, Pi - Pi is designed to be a kind and supportive companion offering conversations, friendly advice, and concise information in a natural, flowing style.
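A minimal pyannote-audio usage sketch for the diarization item above; the model name, token, and `meeting.wav` filename are illustrative (assumes `pip install pyannote.audio` and a Hugging Face token with access to the gated model):

```python
from pyannote.audio import Pipeline

# Gated model: requires accepting the terms on Hugging Face and a token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # your HF token
)

# "Who spoke when": returns speech turns labeled by speaker.
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```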

Communities#

The tech is very good but not quite there yet, I don’t think; it’s extremely close though.

While ElevenLabs seems the best, it’s a shame it lacks the ability to edit the clips a little more like some of the other tools have: speeding up certain words, making them louder, or adding in some emotion. The other tools do this far better, however they sound robotic; I’m exploring whether this could be achieved with some manual editing. I’d go over quota pretty quickly. I imagine the cost will come down.

I’d like a TTS which is emotionally expressive and can be used for video game characters.

“VoiceAI” platform examples: Vapi.ai (the best in my opinion), Bland.ai, Toma.so, Retell AI, Infer.so, Marr Labs, Elto.ai.

Voice examples: PlayHT, ElevenLabs, Amazon Polly.

  • Realtime Speech Generation
  • Instant Voice Cloning
  • Cross-language and Accent Cloning
  • Directing Emotions

  • Self-proclaimed state of the art. A year ago, I would have been blown away; today, this is dramatically worse than Eleven Labs: lower-quality audio, strange cadence, pretty monotonic. It’s not what people sound like. I think it’s impressive, but I wouldn’t call it state of the art.

  • Looking for a 24/7 Real-Time Voice Transcription Tool

If you do decide to, start with ggml/whisper.

  • New models and developer products (openai.com) (Nov 2023)

    • As for all the surrounding stuff like detecting speech starting and stopping and listening for interruptions while talking, give my voice AI a try. It has a rough first pass at all that stuff, and it needs a lot of work but it’s a start and it’s fun to play with. Ultimately the answer is end-to-end speech-to-speech models, but you can get pretty far with what we have now in open source!

    • A few notes on pricing:

      • ElevenLabs is $0.24 per 1K characters while OpenAI TTS HD is $0.03 per 1K characters, i.e. 8x cheaper: a 100K-character audiobook chapter runs about $24 vs. $3. ElevenLabs still has voice cloning, but for many use cases it’s no longer competitive.
    • The new TTS is much cheaper than Eleven Labs and better too. I don’t know how the model works, so maybe what I’m asking isn’t even feasible, but I wish they gave the option of voice cloning or something similar, or at least had a lot more voices for other languages. The default voices tend to give other-language output an accent.

      • I’m not sure if the TTS is better than Eleven Labs. English audio sounded really good, but the Spanish samples I’ve generated are off a bit. It definitely sounds human, but it sounds like a native English speaker speaking Spanish. Also, I’ve noticed on inputs just a few sentences long, it will sometimes repeat, drop, or replace a word. The accent part I’m okay with, but the missing words are a big issue.
    • The TTS seems really nice, though still relatively expensive, and probably limited to English (?). I can’t wait until that level of TTS will become available basically for free, and/or self-hosted, with multi-language support, and ubiquitous on mobile and desktop.

      • It’s not limited to English. The model at least; I doubt the API will be either. Expensive? Compared to what? Eleven Labs costs an arm and a leg in comparison.
  • Hume – voice AI with emotional intelligence (hume.ai)

    • I’ve been playing around with it for 15 minutes or so. It’s like having a conversation with five or six different people. It’s pretty awesome!
    • This should rank higher. Absolutely mind-blowing stuff
  • Universal Speech Model (research.google)

  • Reddit discussions

Tweets#

  • https://twitter.com/EthanSutin/status/1753182461440434400

    The last demo depended on GPT-3.5 and ElevenLabs, but here's a full open-source stack running on a 16GB M1:
    
    📝 Conversation Transcription: Whisper Medium
    💬 Realtime Transcription: Whisper Small
    🔊 VAD: Silero
    🧠 LLM: Mistral-Instruct 7B
    🗣️ TTS: StyleTTS2
    
  • https://twitter.com/zaprobest/status/1751495564192289213

A list of TTS engines:

  • EmotiVoice: a Multi-Voice and Prompt-Controlled TTS Engine [INTERESTING]
  • VALL-E
  • Bark (open-sourced Suno)
  • OpenVoice
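The VAD piece of the open-source stack in the first tweet above can be wired up roughly like this; a minimal sketch assuming Silero VAD loaded via torch.hub and a local 16 kHz `speech.wav` (both assumptions, not from the tweet):

```python
import torch

# Silero VAD ships with helper utilities via torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

# Find speech regions so the pipeline only transcribes actual speech.
wav = read_audio("speech.wav", sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

for ts in timestamps:
    print(f"speech from sample {ts['start']} to {ts['end']}")
```

In a real-time loop like the tweeted stack, VAD is what decides when the user has stopped talking and the Whisper + LLM + TTS chain should fire.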

Now that voice is all the hype with AI voice products like AI Pin and AirChat, I thought it was a good time to try making my own voice chatbot! So I played around with @Vapi_AI and created a chatbot helping beginners utilise AI and learn programming!

Idea: A THERAPIST FOR DEVS GOING THROUGH THE CODING JOURNEY

Funding#

  • https://voice-ai-newsletter.krisp.ai/p/8-predictions-for-voice-ai-in-2024

With the fast advancements in on-device STT and LLM technologies, it’s apparent that this technology will become part of our daily routine beyond call centers.

We predict that the “second brain” sitting next to you and helping/coaching you during your meetings will already be a reality in 2024.

Cloud Speech-to-text will get 2x cheaper. The launch of Whisper disrupted the Speech-to-text market.

Stack#

Design and systems architecture:

Original text: GitHub Gist