Technology

OpenAI launches three audio models for real-time voice agents

OpenAI rolled out three audio models that can transcribe, translate and act during live calls, sharpening its push to own real-time voice agents. Pricing starts at $0.017 a minute.

Marcus Williams · 2 min read
Source: thenews.com.pk

OpenAI moved further into live voice software on May 7, unveiling three audio models designed to listen, translate and act while people are still speaking. The new lineup, GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper, is aimed at making voice agents more conversational, more useful and fast enough to work inside customer service lines, travel tools and workplace software.

GPT-Realtime-2 is the flagship model in the release. OpenAI said it is the company’s first voice model with GPT-5-class reasoning, and its developer documentation says the model adds reasoning to speech-to-speech workflows. The system is built to handle harder requests, call tools, manage interruptions and hold context across longer voice sessions, a shift that points beyond simple transcription toward agents that can complete tasks in real time.

AI-generated illustration

Translation is the second pillar. GPT-Realtime-Translate translates speech from more than 70 input languages into 13 output languages, making it immediately relevant to support desks, classrooms and travel settings where live cross-language communication matters. OpenAI's documentation says translation sessions are continuous, streaming translated audio and transcript deltas as the audio arrives, a design meant to keep latency low enough for live conversation.
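Delta streaming of this kind typically means a client folds small partial-text events into a running transcript rather than waiting for a finished result. The sketch below illustrates that consumption pattern only; the event names (`translation.delta`, `translation.done`) and payload fields are assumptions for illustration, not OpenAI's documented schema.

```python
# Illustrative sketch of consuming streamed translation deltas.
# Event names and payload fields are assumptions, not OpenAI's schema.

def accumulate_transcript(events):
    """Fold a stream of delta events into a full translated transcript."""
    parts = []
    for event in events:
        if event["type"] == "translation.delta":
            parts.append(event["delta"])   # partial translated text
        elif event["type"] == "translation.done":
            break                          # segment finished
    return "".join(parts)

# Simulated event stream, as it might arrive over a live connection:
stream = [
    {"type": "translation.delta", "delta": "Hola, "},
    {"type": "translation.delta", "delta": "¿en qué puedo ayudarte?"},
    {"type": "translation.done"},
]
print(accumulate_transcript(stream))  # Hola, ¿en qué puedo ayudarte?
```

Accumulating deltas this way lets a caption or voice UI update word by word, which is what keeps perceived latency low during live conversation.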

The third model, GPT-Realtime-Whisper, is a streaming speech-to-text system that transcribes as the speaker talks. OpenAI says speech-to-text is useful for captions, call analysis, search, records and accessibility, and the live format is intended to support meeting notes, workflow updates and other tasks that depend on fast, readable output.

The release also shows how aggressively OpenAI is trying to turn voice into a production platform. The company says the Realtime API supports streaming audio, interruptions, background function calling, remote MCP servers, image inputs and SIP phone calling. OpenAI said thousands of developers had already built with the API before the latest release, and customers already testing the models include Zillow, Priceline and Deutsche Telekom.
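An agent that can "call tools" mid-conversation generally maps model-emitted tool-call events to local handler functions. A minimal sketch of that dispatch pattern follows; the event shape, tool names and handlers here are hypothetical examples, not part of OpenAI's API.

```python
# Minimal sketch of tool dispatch for a voice agent.
# The event format and the tools themselves are hypothetical examples;
# a real integration would receive such events over a live session.

def lookup_listing(city: str) -> str:
    return f"3 listings found in {city}"    # stand-in for a real query

def book_flight(destination: str) -> str:
    return f"flight to {destination} held"  # stand-in for a real booking

TOOLS = {"lookup_listing": lookup_listing, "book_flight": book_flight}

def handle_tool_call(event: dict) -> str:
    """Route a model-emitted tool call to the matching local function."""
    fn = TOOLS[event["name"]]
    return fn(**event["arguments"])

result = handle_tool_call({"name": "lookup_listing",
                           "arguments": {"city": "Seattle"}})
print(result)  # 3 listings found in Seattle
```

Running tool calls in the background while audio keeps streaming is what lets an agent answer a caller and complete a task in the same turn.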

OpenAI has previously framed customer support, personal assistance and education as key use cases for voice agents. Its earlier audio releases in 2024 emphasized strong speech-to-text performance in noisy environments and more customizable text-to-speech voices. The latest models push that work into a more competitive arena, where the prize is not just better transcription but the real-time voice layer that can sit inside phones, service centers and office tools. They also raise harder questions about reliability, latency, accuracy and consent when software can listen and respond during a live conversation.
