How Does It Work?
OpenAI has announced the general availability of its Realtime API, making it easier for developers and enterprises to build production-ready voice agents. Alongside this release, the company also launched gpt-realtime, its most advanced speech-to-speech model yet, promising more natural conversation, improved comprehension, and more reliable function calling.
A Smarter Voice API for Developers
First introduced in public beta last October, the Realtime API is now generally available to developers worldwide. The updated version brings stronger reliability, lower latency, and better audio quality. Unlike traditional pipelines that chain separate models for speech recognition, text generation, and speech synthesis, the Realtime API processes and generates audio through a single model over one persistent connection. This approach reduces delays, preserves nuance, and produces more natural responses.
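To make the single-model design concrete, the sketch below builds the JSON events a client would send over the API's WebSocket connection: raw audio goes in via one event, and the same model is asked to speak back via another, with no separate transcription or synthesis stage. The event names follow OpenAI's published Realtime event schema, but treat the exact payload shapes as illustrative, not authoritative.

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Stream a chunk of raw audio straight to the model -- no ASR step."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def response_request_event() -> str:
    """Ask the same model to generate spoken output directly."""
    return json.dumps({
        "type": "response.create",
        "response": {"modalities": ["audio", "text"]},
    })

# A client would send these frames over one WebSocket session,
# e.g. wss://api.openai.com/v1/realtime?model=gpt-realtime
chunk = audio_append_event(b"\x00\x01" * 160)
```

Because one model handles the whole loop, there is no hand-off point where prosody or speaker intent can be lost between components.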
New Features for Voice Agents
The Realtime API now includes support for remote MCP servers, image inputs, and phone calling through SIP (Session Initiation Protocol). These additions allow developers to integrate external tools, provide contextual data, and build more capable AI-driven voice assistants.
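As a rough illustration of the remote MCP support, a session configuration might attach an external tool server like this. The field names (`server_label`, `server_url`) mirror the MCP tool shape OpenAI uses elsewhere in its platform, and the server URL is a placeholder; check the current Realtime documentation for the exact schema.

```python
import json

# Hypothetical session configuration pointing the voice agent at a
# remote MCP server, so the model can call that server's tools itself.
session_update = {
    "type": "session.update",
    "session": {
        "tools": [
            {
                "type": "mcp",
                "server_label": "crm",                    # assumed label
                "server_url": "https://example.com/mcp",  # placeholder URL
            }
        ],
    },
}
payload = json.dumps(session_update)
```

The appeal of the MCP route is that the agent gains new capabilities by pointing at a server rather than by hand-wiring each function into the application.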
Introducing gpt-realtime
The newly launched gpt-realtime model is designed to handle real-world tasks in customer support, education, and personal assistance. The model shows significant upgrades in audio quality, comprehension, instruction following, and function calling — key elements for production-ready AI applications.
Natural-Sounding Audio Quality
OpenAI trained gpt-realtime to produce more expressive and human-like speech. It adapts intonation, emotion, and pacing, while also following instructions like “speak empathetically” or “switch to a French accent.” Two new voices — Marin and Cedar — have been added, and the existing eight voices now also sound more natural.
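Style directions like the ones quoted above are typically passed as session instructions, alongside the voice selection. The sketch below shows one plausible shape for that configuration; voice identifiers are usually lowercase in the API, but confirm the exact strings against OpenAI's docs.

```python
import json

# Sketch: steering delivery style and picking one of the new voices.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",  # one of the two newly added voices
        "instructions": (
            "Speak empathetically and at a calm pace. "
            "If the caller switches to French, respond with a French accent."
        ),
    },
}
configured = json.dumps(session_update)
```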
Higher Intelligence and Comprehension
The model can understand complex spoken inputs, capture non-verbal cues like laughter, and even switch languages mid-sentence. It also performs better at reading back alphanumeric data, such as phone numbers or codes, across multiple languages including Spanish, Chinese, Japanese, and French. On the Big Bench Audio evaluation, gpt-realtime scored 82.8% accuracy, a leap from 65.6% in December 2024.
Improved Instruction Following
Developers often rely on instruction-based prompts to guide AI speech agents. OpenAI has improved how gpt-realtime follows directions, even subtle ones. On the MultiChallenge benchmark, it scored 30.5% accuracy, compared to the previous model’s 20.6%. This makes it more effective at reading disclaimers, scripts, or instructions word-for-word when needed.
Better Function Calling
Building useful voice agents also depends on how well models can call external functions. OpenAI enhanced this ability across three areas—choosing relevant functions, calling them at the right time, and supplying correct arguments. On ComplexFuncBench, gpt-realtime achieved 66.5% accuracy, up from 49.7%. The API also supports asynchronous function calling, enabling continuous conversation while waiting for tool results.
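The asynchronous pattern can be sketched as follows: while a slow tool runs, the session stays open and the agent can keep talking; once the result arrives, the client hands it back to the model and requests a fresh spoken response. The event names (`conversation.item.create`, `function_call_output`, `response.create`) come from the Realtime event schema, though the surrounding handler logic here is illustrative.

```python
import json

def tool_result_events(call_id: str, result: dict) -> list[str]:
    """After the (possibly slow) tool finishes, return the two events a
    client sends: the function output, then a request for a new response."""
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": json.dumps(result),
            },
        }),
        json.dumps({"type": "response.create"}),
    ]

# Because nothing blocks while the tool runs, the agent can fill the
# gap conversationally ("let me look that up for you").
events = tool_result_events("call_123", {"status": "shipped"})
```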
Pricing and Availability
The Realtime API and gpt-realtime model are available starting today. OpenAI has cut prices by 20% compared to gpt-4o-realtime-preview. The new rates are $32 (₹2,656) per 1M audio input tokens (or $0.40 / ₹33 for cached input tokens) and $64 (₹5,312) per 1M audio output tokens. Developers can also manage costs better with fine-grained conversation context controls, making long sessions more affordable.
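For a back-of-the-envelope estimate, the quoted USD rates work out as follows (the example token counts below are made up for illustration):

```python
# USD per 1M audio tokens, from the pricing quoted above.
RATES = {"input": 32.0, "cached": 0.40, "output": 64.0}

def session_cost(input_tok: int, cached_tok: int, output_tok: int) -> float:
    """Estimate the USD cost of one session from its token counts."""
    return round(
        input_tok * RATES["input"] / 1e6
        + cached_tok * RATES["cached"] / 1e6
        + output_tok * RATES["output"] / 1e6,
        4,
    )

# e.g. 50k fresh input, 200k cached input, 30k output tokens:
print(session_cost(50_000, 200_000, 30_000))  # → 3.6
```

The gap between the $32 fresh-input rate and the $0.40 cached rate is why the context controls matter: the more of a long session that can be served from cache, the cheaper it gets.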