- The AI Voice Newsletter
Microsoft Launches VibeVoice‐Realtime TTS

🔊 Soundcheck
Microsoft Launches VibeVoice‑Realtime TTS
AI voice takes orders overnight in restaurants
Wispr rockets past keyboards with $25M and global enterprise push
Read time: 4 minutes
🔥 Hot Mic
Big moves, deep dives, and standout stories.
Microsoft’s freshly released VibeVoice‑Realtime‑0.5B model delivers near-instant speech from streaming text, making voice interfaces feel truly conversational. It starts talking in about 300 milliseconds, far faster than typical TTS, and keeps going reliably for roughly ten minutes of continuous speech. At 0.5B parameters, the model is lean enough for low-latency applications like live narration and agent-style interfaces.
Built on the VibeVoice framework, this variant uses only a low-rate acoustic tokenizer and a diffusion-based audio generator to reach real-time performance. Quality stays competitive, with a 2.0% word error rate and strong speaker similarity, and the model remains robust over long-form output. It pairs well with conversational LLMs, generating speech as text arrives, so it fits neatly into real-time voice agent systems.
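For readers unfamiliar with the metric: word error rate for a TTS model is typically measured by transcribing the generated audio with a speech recognizer and comparing that transcript to the input text. The sketch below shows only the comparison step, as a word-level edit distance divided by the reference length. The function name and the simple lowercase-and-split normalization are assumptions for illustration, not Microsoft's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: real evaluations usually apply heavier text
    normalization (punctuation, numerals) before comparing.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

On this scale, a 2.0% WER corresponds to about two word errors per hundred reference words.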
Key Points:
First audible speech in about 300 ms.
Handles streaming text input with chunked interleaved processing.
Produces up to ~10 minutes of coherent long-form speech.
Competitive LibriSpeech WER of 2.00% and speaker similarity of 0.695.
Built with just 0.5B parameters, so it is deployment friendly.
Integrates seamlessly alongside conversational LLMs as a microservice.
Takeaway: VibeVoice‑Realtime marks a turning point—now text‑to‑speech can feel truly live. With sub‑second start times, long‑form fluency, and compact size, it unlocks voice agents, live data narration, and real‑time interfaces previously held back by lag and instability.
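To make the "speech as text arrives" idea concrete, here is a minimal client-side sketch of feeding an LLM's token stream to a TTS engine in small chunks. It does not reproduce VibeVoice's internal chunked processing or its API: `chunk_stream`, `speak_stream`, the `max_words` threshold, and the `synthesize` callable are all hypothetical names standing in for whatever streaming TTS client you use.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_stream(tokens: Iterable[str], max_words: int = 8) -> Iterator[str]:
    """Group incoming text tokens into short chunks for synthesis.

    A chunk is emitted at a sentence end or once it reaches max_words,
    so audio can start before the full reply has been generated.
    """
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        text = "".join(buffer)
        ends_sentence = text.rstrip().endswith((".", "!", "?"))
        if ends_sentence or len(text.split()) >= max_words:
            if text.strip():
                yield text.strip()
            buffer = []
    # Flush whatever is left when the token stream ends.
    tail = "".join(buffer).strip()
    if tail:
        yield tail


def speak_stream(
    tokens: Iterable[str], synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield audio for each chunk as soon as that chunk is ready.

    `synthesize` is a placeholder for a real TTS call (an assumption).
    """
    for chunk in chunk_stream(tokens):
        yield synthesize(chunk)
```

Smaller chunks lower the time to first audio but give the synthesizer less context for prosody, so the threshold is a trade-off worth tuning per application.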
Choco teams up with OpenAI to introduce an AI voice agent that handles restaurant orders anytime in any language.
Choco and OpenAI are rolling out a voice AI agent designed for the food service industry. Built using OpenAI’s Realtime API, it can field calls, take orders, answer questions, and suggest items 24/7 in multiple languages. It’s meant to fill the persistent gap in night‑shift staffing: operators often rely on outdated answering machines or voicemail, which leads to missed orders, mistakes, and food waste. The new agent steps in to confirm stock, recommend alternatives, highlight promos, and push clean orders through. That lifts order accuracy to an estimated 95%, cuts manual order processing by about half, and reduces waste. The agent blends Choco’s deep industry know‑how with OpenAI’s real‑time voice capabilities, aiming to boost efficiency, sustainability, and growth across regions.
Key Points:
Choco Voice Agent handles calls and orders around the clock in any language
Built using OpenAI’s Realtime API for real-time conversational understanding
Addresses staffing issues by replacing manual night-shift order entry
Achieves ~95% order accuracy and ~50% less manual order processing, with reduced waste
Takeaway: Merging OpenAI’s real-time voice tech with Choco’s sector-specific platform, the Voice Agent brings tangible gains—better order accuracy, lower costs, less waste, and smoother after-hours operations—for restaurants and distributors.
Wispr raised another $25M to expand its voice-first platform into enterprise, international markets, and new APIs.
Wispr just closed a $25 million funding round led by Notable Capital, boosting its total capital to $81 million and valuing the company at $700 million. With investor confidence mounting, it’s poised to accelerate hiring, ramp up AI model development, and expand internationally.
The traction is hard to ignore: 270 Fortune 500 companies use Wispr Flow, and the company is onboarding 125 new enterprise customers weekly while growing 40% month over month. The startup is already building personalized voice models, lowering transcription errors and pushing the platform toward operating-system-level utility for professionals.
Key Points:
Raised $25M led by Notable Capital, now $81M total funding.
Valued at $700 million post-money.
270 Fortune 500 companies using Wispr Flow.
125 new enterprise clients onboarded weekly.
40% month-over-month user growth.
100× year‑over‑year user growth and 70% retention.
Android beta by year‑end; stable release expected Q1 2025.
Developing proprietary voice models with ~10% error rate.
Takeaway: Wispr is maturing fast into a voice-native productivity platform, moving beyond dictation toward becoming a foundational interface for enterprise workflows and global usage, backed by strong growth and fresh investment.
🎙️ Mic Drop
What else is making noise in voice AI.
Gradium emerges with $70M, aiming at real-time, multilingual AI voice stacks for enterprise use. (slator.com)
Eric Schmidt invests in Gradium, boosting the profile of European voice tech innovation. (sifted.eu)
A data breach at LG Uplus exposes privacy risks in AI-powered voice call platforms. (totaltele.com)
Genesys named leader by IDC for front-office conversational AI platforms, signaling vendor maturity in enterprise deployments. (businesswire.com)
Voice AI is rapidly scaling in food and retail, shifting from pilot to enterprise-grade ordering systems. (technology.org)
The insurance industry flags regulatory and legal threats from the proliferation of AI voice cloning technologies. (insurancebusinessmag.com)
Eleos introduces an AI-powered voice agent for income protection and life insurance customer workflows. (finovate.com)
A cancer survivor uses AI voice cloning to recover speech, underlining accessibility wins for assistive tech. (kffhealthnews.org)
Step-by-step instructions for integrating TTS in React Native apps with major speech synthesis providers. (vocal.media)
Demo shows PSOC Edge enabling low-power voice control for IoT devices via DEEPCRAFT voice assistant software. (embeddedcomputing.com)
The AI voice generator market is projected to grow rapidly, reaching $20.71 billion globally by 2031. (prnewswire.com)
AI voice agents are being considered as virtual hires in recruitment processes, reshaping workforce structure. (iotforall.com)