AI Voice News: This Week's Insights

🔊 Soundcheck

  • ElevenLabs' $180M funding triples valuation to $3.3B.

  • Amazon's Nova Sonic redefines AI voice interactions.

  • Cartesia secures $64M to revolutionize voice AI.

  • Dia: Open-source TTS model challenges industry giants.

Read time: 6 minutes

🔥 Hot Mic

Big moves, deep dives, and standout stories.

Voice AI startup ElevenLabs has raised $180 million in a Series C funding round, lifting its valuation to $3.3 billion, three times its valuation a year earlier. The round was co-led by Andreessen Horowitz and ICONIQ Growth, with participation from new investors including NEA, World Innovation Lab, Valor, Endeavor Catalyst Fund, and Lunate. Existing backers, including Sequoia Capital and Salesforce Ventures, also contributed.

Founded in 2022 by ex-Google engineer Piotr Dąbkowski and former Palantir employee Mati Staniszewski, ElevenLabs specializes in AI-driven voice technology. Its products, such as real-time Conversational AI and AI-powered dubbing in 32 languages, serve industries like content creation, gaming, education, and customer support. Clients include NVIDIA, Perplexity AI, and The Washington Post.

The new funding will support ElevenLabs in advancing voice AI research, enhancing developer tools, expanding globally, and strengthening AI safety measures. CEO Mati Staniszewski emphasized the goal of making digital voice interactions as natural and effortless as conversation.

ElevenLabs is also focusing on the Indian market, appointing local leadership and investing in AI talent and technology localization. Efforts include expanding coverage for Indic languages and enhancing its voice library, aligning with the company's broader global expansion strategy.

Key Points:
• ElevenLabs raises $180 million in Series C funding.
• Valuation triples to $3.3 billion within a year.
• Products include real-time Conversational AI and AI dubbing.
• Plans to expand in India with local leadership and investments.

Takeaway: ElevenLabs' substantial funding and tripled valuation underscore the growing significance of voice AI technology, positioning the company as a key player in transforming digital interactions across various industries.

Amazon has introduced Nova Sonic, a new AI voice model that unifies speech understanding and generation into a single framework. This integration aims to enhance the naturalness and accuracy of voice interactions in AI applications.

By combining these capabilities, Nova Sonic can adapt responses to the acoustic context and spoken input, resulting in more fluid and human-like dialogues. This advancement is particularly beneficial for applications like automated customer service and AI agents.

Rohit Prasad, Amazon's SVP of Artificial General Intelligence, emphasized that Nova Sonic simplifies the development of voice-powered applications, enabling them to perform tasks with higher accuracy and engagement.

The model is integrated into Amazon Bedrock and is accessible through a new bidirectional streaming API, targeting industries such as travel, education, healthcare, and entertainment.

Key Points:
• Nova Sonic unifies speech understanding and generation.
• Enhances naturalness of AI voice interactions.
• Integrated into Amazon Bedrock platform.
• Targets multiple industries including healthcare and education.

Takeaway: Amazon's Nova Sonic represents a significant leap in AI voice technology, offering developers a unified model that delivers more natural and accurate conversational experiences across various applications.

Cartesia, a leader in real-time AI-driven voice technology, has secured $64 million in a Series A funding round led by Kleiner Perkins. The investment is intended to accelerate development of Sonic 2.0, its latest voice model, designed for ultra-realistic, low-latency speech generation.

Sonic 2.0 is built on a state space architecture and is twice the size of its predecessor while running faster and more efficiently. Cartesia cites 90-millisecond latency for the full model and 40 milliseconds in real-time applications, figures it says outperform competitors in the field.

Beyond speed, Cartesia's technology excels in voice cloning, capturing subtle nuances, accents, and tonal variations. This makes it particularly useful for applications requiring precision, such as customer service, content localization, and accessibility tools.

With this funding, Cartesia plans to refine its voice AI models further, integrate new features like voice changer and infill editing, and advance streaming architectures and on-device inference, positioning itself as a key player in the evolving voice AI ecosystem.

Key Points:
• $64M Series A funding led by Kleiner Perkins.
• Sonic 2.0 achieves 90ms latency; 40ms for real-time.
• Superior voice cloning captures complex accents.
• Infrastructure boasts 99.9% uptime and on-device deployment.

Takeaway: Cartesia's substantial funding and technological advancements in Sonic 2.0 signal a significant leap forward in real-time, low-latency voice AI, poised to transform applications across various industries.

Nari Labs, a two-person startup, has unveiled Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to generate naturalistic dialogue from text prompts. Co-creator Toby Kim asserts that Dia surpasses existing proprietary offerings, including ElevenLabs Studio and Google's NotebookLM podcast feature. The model is available for download and local deployment via Hugging Face and GitHub, providing developers with an accessible, high-quality TTS solution. Dia supports advanced features like emotional tone, speaker tagging, and nonverbal audio cues, enhancing the realism and customization of generated speech.
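For developers curious what speaker tagging and nonverbal cues look like in practice, here is a minimal sketch of assembling a dialogue script. It assumes the `[S1]`/`[S2]` speaker tags and parenthesized cues such as `(laughs)` shown in Dia's public README; `Turn` and `build_script` are hypothetical helper names, and the code only formats text, it does not call the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One line of dialogue (hypothetical helper type, not part of Dia)."""
    speaker: int                # 1-based speaker index, rendered as [S1], [S2], ...
    text: str                   # what the speaker says
    cue: Optional[str] = None   # optional nonverbal cue, e.g. "laughs"


def build_script(turns: List[Turn]) -> str:
    """Join turns into one prompt string using the assumed tag format."""
    parts = []
    for turn in turns:
        if turn.speaker < 1:
            raise ValueError("speaker index must be 1 or greater")
        line = f"[S{turn.speaker}] {turn.text.strip()}"
        if turn.cue:
            # Assumption: nonverbal cues are written in parentheses after the text.
            line += f" ({turn.cue.strip()})"
        parts.append(line)
    return " ".join(parts)


script = build_script([
    Turn(1, "Did you see the Dia release?"),
    Turn(2, "I did. Two people built it.", cue="laughs"),
])
print(script)
```

The resulting string is what you would pass as the text prompt; check the model card on Hugging Face for the exact tag syntax and the list of supported cues before relying on it.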

Key Points:
• Nari Labs introduces Dia, a 1.6B parameter TTS model.
• Nari Labs claims Dia outperforms ElevenLabs Studio and Google's NotebookLM podcast feature.
• Available for download on Hugging Face and GitHub.
• Supports emotional tone, speaker tagging, and nonverbal cues.

Takeaway: Dia's open-source release democratizes access to advanced TTS technology, offering developers a powerful, customizable alternative to proprietary models.

🎙️ Mic Drop

What else is making noise in voice AI.

Groq partners with PlayAI to deliver Dialog, a 10x faster, emotionally intelligent TTS model, including what the companies call the Middle East's first Arabic voice model. (venturebeat.com)

Incept AI raises $3M to deploy robust voice AI in noisy, real-world settings like restaurant drive-thrus and phone ordering. (prweb.com)

Phonic secures investment from Lux Capital to scale its end-to-end voice AI solution for enterprises. (techcrunch.com)

Telli raises $3.6M pre-seed to expand its AI voice agents for booking and customer experience (CX) globally. (tech.eu)

pyannoteAI secures €8.1M to enhance multilingual speaker diarization and AI voice dubbing technology. (slator.com)

VoicePatrol launches real-time AI voice moderation to keep online gaming communities safer and combat toxicity. (venturebeat.com)

Hume AI debuts Octave, a TTS model designed to imbue synthetic voices with nuanced emotional expression. (testingcatalog.com)

OpenAI introduces gpt-4o-transcribe, aimed at making it fast and simple for developers to add voice to text-based apps. (venturebeat.com)

Krisp unveils new SDK for voice AI agents, enhancing realistic conversations with advanced noise cancellation and better turn-taking. (prweb.com)

Agora launches Conversational AI Engine, enabling scalable, natural voice and video engagement for developers and enterprises. (prnewswire.com)

Sesame’s open-source CSM-1B aims for lifelike interactions with deliberate speech 'imperfections' for authenticity. (the-decoder.com)

Perplexity updates iOS app with an AI voice assistant feature, bringing cross-app, conversational AI to Apple devices. (macrumors.com)

FastRTC from Hugging Face enables simple, Python-based building of real-time voice and video AI applications. (venturebeat.com)

Infinitus launches 'hallucination-free' voice AI agents, processing over 100M minutes of health conversations for a million+ patients. (prnewswire.com)