OpenAI has launched its Realtime API in public beta, which will allow developers to build voice assistants capable of communicating more naturally with users.
OpenAI likens the new low-latency, multimodal, speech-to-speech experiences enabled by the Realtime API to ChatGPT’s Advanced Voice Mode.
The US AI research organisation also revealed via its online newsroom that it is adding audio input and output to the Chat Completions API for use cases that do not require such rapid system responses.
OpenAI explains how the Realtime API can help developers: “From language apps and educational software to customer support experiences, developers have already been leveraging voice experiences to connect with their users.
“Now with the Realtime API and soon with audio in the Chat Completions API, developers no longer have to stitch together multiple models to power these experiences.
“Instead, you can build natural conversational experiences with a single API call.”
What’s New?
In the past, voice assistant developers had to transcribe audio with an automatic speech recognition system such as Whisper, pass the transcript to a text model for reasoning, and finally convert the model’s reply back into audio with a text-to-speech model.
This frequently led to ‘noticeable’ latency and a loss of emotion, accents, and emphasis, according to OpenAI.
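That chained approach can be sketched with the OpenAI Python SDK. This is an illustration, not a prescribed recipe: the model names (`whisper-1`, `gpt-4o`, `tts-1`) and the `reply_as_audio` helper are our own assumptions.

```python
# Sketch of the legacy three-model voice pipeline. The model names and the
# helper name are illustrative assumptions, not OpenAI's prescribed setup.

def reply_as_audio(client, audio_file):
    """Chain speech recognition -> text reasoning -> speech synthesis."""
    # Step 1: transcribe the user's speech with an ASR model.
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )
    # Step 2: apply reasoning with a text model.
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content
    # Step 3: synthesise the reply with a text-to-speech model.
    # Emotion, accent, and emphasis in the user's voice are lost by now,
    # since only plain text survived steps 1 and 2.
    return client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply_text
    )
```

Each arrow in the chain is a separate network round trip, which is where the latency OpenAI describes comes from.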
Now, the Chat Completions API lets developers handle all of those steps through a single API call, although responses are still slower than natural human conversation.
The Realtime API, however, speeds things up to a more natural level by directly streaming audio inputs and outputs.
It allows developers to create persistent WebSocket connections to send and receive messages powered by GPT-4o.
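A minimal sketch of that connection setup, assuming the beta endpoint and event names from OpenAI’s Realtime API documentation (`wss://api.openai.com/v1/realtime`, the `OpenAI-Beta: realtime=v1` header, and the `response.create` event); the helper function names are our own:

```python
import json

# Sketch of a Realtime API handshake and a response request, assuming the
# beta WebSocket endpoint and event names; helper names are illustrative.

REALTIME_URL = "wss://api.openai.com/v1/realtime"

def realtime_handshake(api_key, model="gpt-4o-realtime-preview"):
    """Build the URL and headers for a persistent WebSocket connection."""
    url = f"{REALTIME_URL}?model={model}"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",  # beta opt-in header
    }
    return url, headers

def response_event(instructions):
    """JSON event asking the server to stream back an audio+text response."""
    return json.dumps({
        "type": "response.create",
        "response": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
        },
    })
```

Once connected with any WebSocket client (for example the `websockets` package), the event string is sent over the socket and streamed audio chunks are read back on the same connection, avoiding the per-step round trips of the old pipeline.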
The Realtime API has been tested with various OpenAI partners to help improve the service and, in the process, has already surfaced some potential use cases.
For example, the fitness coaching and wellness application, Healthify, has used the Realtime API to provide natural conversations with its AI coach Ria.
Speak, a language learning application, has also deployed Realtime API to enable its role-play feature through which users can practice their conversation skills.
Privacy and Security
OpenAI reassured users when it comes to their privacy and security: “Prior to launch, we tested the Realtime API with our external red teaming network and found that the Realtime API didn’t introduce any high-risk gaps not covered by our existing mitigations.
“As with all API services, the Realtime API is subject to our Enterprise privacy commitments.
“We do not train our models on the inputs or outputs used in this service without your explicit permission.”
Realtime API began rolling out in public beta yesterday to all paid developers, with audio in the Chat Completions API due to be released in the ‘coming weeks’.
The Realtime API charges for both text and audio tokens. Text input tokens are priced at $5 per million, with text output at $20 per million. Audio input costs $100 per million tokens and audio output $200 per million. Audio in the Chat Completions API will follow the same pricing structure.
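Those per-million-token rates make session costs straightforward to estimate; a quick sketch (the function name and rate table are ours, built from the published figures):

```python
# Cost estimator for the Realtime API's published beta rates, in USD per
# one million tokens: text $5 in / $20 out, audio $100 in / $200 out.

RATES_PER_MILLION = {
    "text_in": 5.0,
    "text_out": 20.0,
    "audio_in": 100.0,
    "audio_out": 200.0,
}

def realtime_cost(text_in=0, text_out=0, audio_in=0, audio_out=0):
    """Return the USD cost for the given token counts."""
    counts = {"text_in": text_in, "text_out": text_out,
              "audio_in": audio_in, "audio_out": audio_out}
    return sum(RATES_PER_MILLION[k] * n / 1_000_000 for k, n in counts.items())
```

For example, one million audio input tokens plus one million audio output tokens would come to $300, dwarfing the text-token cost of the same volume.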
OpenAI has also made headlines recently as it is eyeing huge revenue over the next five years despite incurring losses of $5 billion.