OpenAI Realtime API

📝 Overview

Introduction

The OpenAI Realtime API provides two connection methods:

  1. WebRTC - For real-time audio/video interaction in browsers and mobile clients

  2. WebSocket - For server-to-server application integration

Use Cases

  • Real-time voice conversations
  • Audio/video conferencing
  • Real-time translation
  • Speech transcription
  • Real-time code generation
  • Server-side real-time integration

Key Features

  • Bidirectional audio streaming
  • Mixed text and audio conversations
  • Function calling support
  • Automatic Voice Activity Detection (VAD)
  • Audio transcription capabilities
  • WebSocket server-side integration

🔐 Authentication & Security

Authentication Methods

  1. Standard API Key (server-side only)
  2. Ephemeral Token (client-side use)

Ephemeral Token

  • Validity: 1 minute
  • Usage limit: Single connection
  • Generation: Created via a server-side API call, as shown below

POST https://your-newapi-server-address/v1/realtime/sessions
Content-Type: application/json
Authorization: Bearer $API_KEY

{
  "model": "gpt-4o-realtime-preview-2024-12-17",
  "voice": "verse"
}
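
For illustration, a minimal Node.js (Express) server route that mints an ephemeral token for a browser client might look like the sketch below. The /token route name and the API_KEY environment variable are assumptions for this example; the returned session object carries the ephemeral key in client_secret.value.

import express from "express";

const app = express();

// Hypothetical endpoint the browser calls to obtain a short-lived key.
// The standard API key never leaves the server. Requires Node 18+ (global fetch).
app.get("/token", async (req, res) => {
  const r = await fetch("https://your-newapi-server-address/v1/realtime/sessions", {
    method: "POST",
    headers: {
      "Authorization": "Bearer " + process.env.API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      voice: "verse",
    }),
  });
  // Forward the session object; the ephemeral key is client_secret.value
  res.json(await r.json());
});

app.listen(3000);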

Security Recommendations

  • Never expose standard API keys on the client side
  • Use HTTPS/WSS for communication
  • Implement appropriate access controls
  • Monitor for unusual activity

🔌 Connection Establishment

WebRTC Connection

  • URL: https://your-newapi-server-address/v1/realtime
  • Query parameters: model
  • Headers:
      • Authorization: Bearer EPHEMERAL_KEY
      • Content-Type: application/sdp

WebSocket Connection

  • URL: wss://your-newapi-server-address/v1/realtime
  • Query parameters: model
  • Headers:
      • Authorization: Bearer YOUR_API_KEY
      • OpenAI-Beta: realtime=v1

Connection Flow

sequenceDiagram
    participant Client
    participant Server
    participant OpenAI

    alt WebRTC Connection
        Client->>Server: Request ephemeral token
        Server->>OpenAI: Create session
        OpenAI-->>Server: Return ephemeral token
        Server-->>Client: Return ephemeral token

        Client->>OpenAI: Create WebRTC offer
        OpenAI-->>Client: Return answer

        Note over Client,OpenAI: Establish WebRTC connection

        Client->>OpenAI: Create data channel
        OpenAI-->>Client: Confirm data channel
    else WebSocket Connection
        Server->>OpenAI: Establish WebSocket connection
        OpenAI-->>Server: Confirm connection

        Note over Server,OpenAI: Begin real-time conversation
    end

Data Channel

  • Name: oai-events
  • Purpose: Event transmission
  • Format: JSON

Audio Stream

  • Input: addTrack()
  • Output: ontrack event
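
Putting the pieces above together, a minimal browser-side sketch of the WebRTC flow might look like this. It assumes a server endpoint such as the /token route sketched earlier, returning a session object whose client_secret.value is the ephemeral key; run it inside an async function or ES module.

// Fetch an ephemeral key from your own server (never ship the standard key)
const session = await (await fetch("/token")).json();
const EPHEMERAL_KEY = session.client_secret.value;

const pc = new RTCPeerConnection();

// Output: remote audio arrives via the ontrack event
const audioEl = new Audio();
audioEl.autoplay = true;
pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

// Input: add the microphone track with addTrack()
const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(mic.getTracks()[0]);

// Events travel over the "oai-events" data channel as JSON
const dc = pc.createDataChannel("oai-events");
dc.onmessage = (e) => console.log(JSON.parse(e.data));

// Exchange SDP with the realtime endpoint
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const model = "gpt-4o-realtime-preview-2024-12-17";
const sdpResponse = await fetch(
  "https://your-newapi-server-address/v1/realtime?model=" + model,
  {
    method: "POST",
    body: offer.sdp,
    headers: {
      "Authorization": "Bearer " + EPHEMERAL_KEY,
      "Content-Type": "application/sdp",
    },
  }
);
await pc.setRemoteDescription({ type: "answer", sdp: await sdpResponse.text() });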

💬 Conversation Interaction

Conversation Modes

  1. Text-only conversations
  2. Voice conversations
  3. Mixed conversations

Session Management

  • Create session
  • Update session
  • End session
  • Session configuration

Event Types

  • Text events
  • Audio events
  • Function calls
  • Status updates
  • Error events

⚙️ Configuration Options

Audio Configuration

  • Input formats
      • pcm16
      • g711_ulaw
      • g711_alaw
  • Output formats
      • pcm16
      • g711_ulaw
      • g711_alaw
  • Voice types
      • alloy
      • echo
      • shimmer

Model Configuration

  • Temperature
  • Maximum output length
  • System prompt
  • Tool configuration

VAD Configuration

  • Threshold
  • Silence duration
  • Prefix padding
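
All of the options above are applied through a single session.update client event. A representative payload, reusing the values from the Python example later in this document:

{
  "type": "session.update",
  "session": {
    "voice": "alloy",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "instructions": "You are a helpful AI assistant.",
    "temperature": 0.8,
    "max_response_output_tokens": 4096,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    }
  }
}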

💡 Request Examples

WebSocket Connection ✅

Node.js (ws module)

import WebSocket from "ws";

const url = "wss://your-newapi-server-address/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17";
const ws = new WebSocket(url, {
  headers: {
    "Authorization": "Bearer " + process.env.API_KEY,
    "OpenAI-Beta": "realtime=v1",
  },
});

ws.on("open", function open() {
  console.log("Connected to server.");
});

ws.on("message", function incoming(message) {
  console.log(JSON.parse(message.toString()));
});

Python (websocket-client)

# Requires websocket-client library:
# pip install websocket-client

import os
import json
import websocket

API_KEY = os.environ.get("API_KEY")

url = "wss://your-newapi-server-address/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
headers = [
    "Authorization: Bearer " + API_KEY,
    "OpenAI-Beta: realtime=v1"
]

def on_open(ws):
    print("Connected to server.");

def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))

ws = websocket.WebSocketApp(
    url,
    header=headers,
    on_open=on_open,
    on_message=on_message,
)

ws.run_forever()

Browser (Standard WebSocket)

/*
Note: in browser environments we recommend using WebRTC. However, in
browser-like server environments such as Deno and Cloudflare Workers,
you can also use the standard WebSocket interface.
*/

const ws = new WebSocket(
  "wss://your-newapi-server-address/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
  [
    "realtime",
    // Authentication
    "openai-insecure-api-key." + API_KEY, 
    // Optional
    "openai-organization." + OPENAI_ORG_ID,
    "openai-project." + OPENAI_PROJECT_ID,
    // Beta protocol, required
    "openai-beta.realtime-v1"
  ]
);

ws.on("open", function open() {
  console.log("Connected to server.");
});

ws.on("message", function incoming(message) {
  console.log(message.data);
});

Message Send/Receive Example

Node.js/Browser

// Receive server events
ws.on("message", function incoming(message) {
  // Parse the message payload from JSON (in the browser, read event.data instead)
  const serverEvent = JSON.parse(message.toString());
  console.log(serverEvent);
});

// Send events: build a JSON structure that conforms to the client event format
const event = {
  type: "response.create",
  response: {
    modalities: ["audio", "text"],
    instructions: "Give me a haiku about code.",
  }
};
ws.send(JSON.stringify(event));

Python

# Send client events by serializing a dictionary to JSON
def on_open(ws):
    print("Connected to server.")

    event = {
        "type": "response.create",
        "response": {
            "modalities": ["text"],
            "instructions": "Please assist the user."
        }
    }
    ws.send(json.dumps(event))

# Received messages must be parsed from JSON
def on_message(ws, message):
    data = json.loads(message)
    print("Received event:", json.dumps(data, indent=2))

WebSocket Python Audio Example

Example Documentation

This is a Python example of an OpenAI Realtime WebSocket voice conversation, supporting real-time voice input and output.

Features

  • 🎤 Real-time Voice Recording: Automatically detects voice input and sends it to the server
  • 🔊 Real-time Audio Playback: Plays the AI's voice responses
  • 📝 Text Display: Simultaneously displays the AI's text responses
  • 🎯 Automatic Voice Detection: Uses server-side VAD (Voice Activity Detection)
  • 🔄 Bidirectional Communication: Supports continuous conversation

Requirements

  • Python 3.7+
  • Microphone and speakers
  • Stable network connection

Install Dependencies

pip install -r requirements.txt
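
The contents of requirements.txt are not reproduced in this document; judging by the imports in the example code below, it would need at least the following (an assumption; version pins omitted):

websockets
pyaudio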

System Dependencies: On Linux systems, you may need to install additional audio libraries:

# Ubuntu/Debian
sudo apt-get install portaudio19-dev python3-pyaudio

# CentOS/RHEL
sudo yum install portaudio-devel

Configuration

In the openai_realtime_client.py file, ensure the following configuration is correct:

WEBSOCKET_URL = "wss://your-newapi-server-address/v1/realtime"
API_KEY = "your-api-key"  # use a placeholder; never commit a real key
MODEL = "gpt-4o-realtime-preview-2024-12-17"

Usage
  1. Run the program:

    python openai_realtime_client.py

  2. Start a conversation:
      • The program starts recording automatically after launch
      • Speak into the microphone
      • The AI responds to your voice in real time

  3. Stop the program:
      • Press Ctrl+C to stop the program

Technical Details

Audio Configuration:
  • Sample Rate: 24 kHz (required by the OpenAI Realtime API)
  • Format: PCM16
  • Channels: Mono
  • Encoding: Base64

WebSocket Message Types:
  • session.update: Session configuration
  • input_audio_buffer.append: Send audio data
  • input_audio_buffer.commit: Commit audio buffer
  • response.audio.delta: Receive audio response
  • response.text.delta: Receive text response

Voice Activity Detection (server-side VAD configuration):
  • Threshold: 0.5
  • Prefix padding: 300 ms
  • Silence duration: 500 ms

Troubleshooting

Common Issues:

  1. Audio Device Issues:

    # Check audio devices
    python -c "import pyaudio; p = pyaudio.PyAudio(); print([p.get_device_info_by_index(i) for i in range(p.get_device_count())])"

  2. Permission Issues:
      • Ensure the program has microphone access permissions
      • Linux: Check ALSA/PulseAudio configuration

  3. Network Connection Issues:
      • Check that the WebSocket URL is correct
      • Ensure the API key is valid
      • Check firewall settings

Debug Mode:

Enable verbose logging:

logging.basicConfig(level=logging.DEBUG)

Code Structure

├── openai_realtime_client.py  # Main program file
├── requirements.txt           # Python dependencies
└── README.md                  # Documentation

Main Classes and Methods:

  • OpenAIRealtimeClient: Main client class
      • connect(): Connect to WebSocket
      • start_audio_streams(): Start audio streams
      • start_recording(): Start recording
      • handle_response(): Handle responses
      • start_conversation(): Start conversation

Notes
  1. Audio Quality: Ensure use in a quiet environment for best results
  2. Network Latency: Real-time conversation is sensitive to network latency
  3. Resource Usage: Long-running sessions may consume significant CPU and memory
  4. API Limits: Be aware of OpenAI API usage limits and costs

License

This project is for learning and testing purposes only. Please comply with OpenAI's terms of use.

Example Code

#!/usr/bin/env python3
"""
OpenAI Realtime WebSocket Audio Example
Supports real-time voice conversation, including audio recording, sending, and playback
"""

import asyncio
import json
import base64
import websockets
import pyaudio
import wave
import threading
import time
from typing import Optional
import logging

# Configure logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler('websocket_debug.log', encoding='utf-8')
    ]
)
logger = logging.getLogger(__name__)

class OpenAIRealtimeClient:
    def __init__(self, 
                 websocket_url: str,
                 api_key: str,
                 model: str = "gpt-4o-realtime-preview-2024-12-17"):
        self.websocket_url = websocket_url
        self.api_key = api_key
        self.model = model
        self.websocket = None
        self.is_recording = False
        self.is_connected = False

        # Audio configuration
        self.audio_format = pyaudio.paInt16
        self.channels = 1
        self.rate = 24000  # OpenAI Realtime API required sample rate
        self.chunk = 1024
        self.audio = pyaudio.PyAudio()

        # Audio streams
        self.input_stream = None
        self.output_stream = None

    async def connect(self):
        """Connect to WebSocket server"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1"
        }

        logger.info("=" * 80)
        logger.info("🚀 Starting WebSocket connection")
        logger.info("=" * 80)
        logger.info(f"Connection URL: {self.websocket_url}")
        logger.info(f"API Key: {self.api_key[:10]}...")
        logger.info(f"Headers: {json.dumps(headers, ensure_ascii=False, indent=2)}")

        try:
            self.websocket = await websockets.connect(
                self.websocket_url,
                additional_headers=headers
            )
            self.is_connected = True
            logger.info("✅ WebSocket connection successful")

            # Send session configuration
            await self.send_session_config()

        except Exception as e:
            logger.error(f"❌ WebSocket connection failed: {e}")
            logger.error(f"Error type: {type(e).__name__}")
            raise

    async def send_session_config(self):
        """Send session configuration"""
        config = {
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful AI assistant that can engage in real-time voice conversations.",
                "voice": "alloy",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [],
                "tool_choice": "auto",
                "temperature": 0.8,
                "max_response_output_tokens": 4096
            }
        }

        config_json = json.dumps(config, ensure_ascii=False, indent=2)
        logger.info("=" * 60)
        logger.info("📤 Sending session configuration:")
        logger.info(f"Message type: {config['type']}")
        logger.info(f"Configuration content:\n{config_json}")
        logger.info("=" * 60)

        await self.websocket.send(json.dumps(config))
        logger.info("✅ Session configuration sent")

    def start_audio_streams(self):
        """Start audio input and output streams"""
        try:
            # Input stream (microphone)
            self.input_stream = self.audio.open(
                format=self.audio_format,
                channels=self.channels,
                rate=self.rate,
                input=True,
                frames_per_buffer=self.chunk
            )

            # Output stream (speakers)
            self.output_stream = self.audio.open(
                format=self.audio_format,
                channels=self.channels,
                rate=self.rate,
                output=True,
                frames_per_buffer=self.chunk
            )

            logger.info("Audio streams started")

        except Exception as e:
            logger.error(f"Failed to start audio streams: {e}")
            raise

    def stop_audio_streams(self):
        """Stop audio streams"""
        if self.input_stream:
            self.input_stream.stop_stream()
            self.input_stream.close()
            self.input_stream = None

        if self.output_stream:
            self.output_stream.stop_stream()
            self.output_stream.close()
            self.output_stream = None

        logger.info("Audio streams stopped")

    async def start_recording(self):
        """Start recording and send audio data"""
        self.is_recording = True
        logger.info("Starting recording...")

        try:
            while self.is_recording and self.is_connected:
                # Read audio data
                audio_data = self.input_stream.read(self.chunk, exception_on_overflow=False)

                # Encode audio data as base64
                audio_base64 = base64.b64encode(audio_data).decode('utf-8')

                # Send audio data
                message = {
                    "type": "input_audio_buffer.append",
                    "audio": audio_base64
                }

                # Log audio data sending (every 10 times to avoid excessive logging)
                if hasattr(self, '_audio_count'):
                    self._audio_count += 1
                else:
                    self._audio_count = 1

                if self._audio_count % 10 == 0:  # Log every 10 times
                    logger.debug(f"🎤 Sending audio data #{self._audio_count}: length={len(audio_base64)} characters")

                await self.websocket.send(json.dumps(message))

                # Brief delay to avoid excessive sending
                await asyncio.sleep(0.01)

        except Exception as e:
            logger.error(f"Error during recording: {e}")
        finally:
            logger.info("Recording stopped")

    async def stop_recording(self):
        """Stop recording"""
        self.is_recording = False

        # Send recording end signal
        if self.websocket and self.is_connected:
            message = {
                "type": "input_audio_buffer.commit"
            }

            logger.info("=" * 60)
            logger.info("📤 Sending recording end signal:")
            logger.info(f"Message type: {message['type']}")
            logger.info("=" * 60)

            await self.websocket.send(json.dumps(message))
            logger.info("✅ Recording end signal sent")

    async def handle_response(self):
        """Handle WebSocket responses"""
        try:
            async for message in self.websocket:
                data = json.loads(message)
                message_type = data.get("type", "unknown")

                # Log all received messages in detail
                logger.info("=" * 60)
                logger.info("📥 Received WebSocket message:")
                logger.info(f"Message type: {message_type}")

                # Handle different message types
                if message_type == "response.audio.delta":
                    # Handle audio response
                    audio_data = base64.b64decode(data.get("delta", ""))
                    logger.info(f"🎵 Audio data: length={len(audio_data)} bytes")
                    if audio_data and self.output_stream:
                        self.output_stream.write(audio_data)
                        logger.info("✅ Audio data played")

                elif message_type == "response.text.delta":
                    # Handle text response
                    text = data.get("delta", "")
                    logger.info(f"💬 Text delta: '{text}'")
                    if text:
                        print(f"AI: {text}", end="", flush=True)

                elif message_type == "response.text.done":
                    # Text response complete
                    logger.info("✅ Text response complete")
                    print("\n")

                elif message_type == "response.audio.done":
                    # Audio response complete
                    logger.info("✅ Audio response complete")

                elif message_type == "error":
                    # Handle errors
                    error_info = data.get('error', {})
                    logger.error("❌ Server error:")
                    logger.error(f"Error details: {json.dumps(error_info, ensure_ascii=False, indent=2)}")

                elif message_type == "session.created":
                    # Session created successfully
                    logger.info("✅ Session created")

                elif message_type == "session.updated":
                    # Session updated successfully
                    logger.info("✅ Session updated")

                elif message_type == "conversation.item.created":
                    # Conversation item created
                    logger.info("📝 Conversation item created")

                elif message_type == "conversation.item.input_audio_buffer.speech_started":
                    # Speech started
                    logger.info("🎤 Speech start detected")

                elif message_type == "conversation.item.input_audio_buffer.speech_stopped":
                    # Speech stopped
                    logger.info("🔇 Speech stop detected")

                elif message_type == "conversation.item.input_audio_buffer.committed":
                    # Audio buffer committed
                    logger.info("📤 Audio buffer committed")

                else:
                    # Other unknown message types
                    logger.info(f"❓ Unknown message type: {message_type}")

                # Log complete message content (except audio data, as it's too long)
                if message_type != "response.audio.delta":
                    logger.info(f"Complete message content:\n{json.dumps(data, ensure_ascii=False, indent=2)}")

                logger.info("=" * 60)

        except websockets.exceptions.ConnectionClosed:
            logger.info("WebSocket connection closed")
            self.is_connected = False
        except Exception as e:
            logger.error(f"Error handling response: {e}")
            self.is_connected = False

    async def start_conversation(self):
        """Start conversation"""
        try:
            # Start audio streams
            self.start_audio_streams()

            # Create tasks
            response_task = asyncio.create_task(self.handle_response())
            recording_task = asyncio.create_task(self.start_recording())

            logger.info("Conversation started, press Ctrl+C to stop")

            # Wait for tasks to complete
            await asyncio.gather(response_task, recording_task)

        except KeyboardInterrupt:
            logger.info("Stop signal received")
        except Exception as e:
            logger.error(f"Error during conversation: {e}")
        finally:
            await self.cleanup()

    async def cleanup(self):
        """Clean up resources"""
        self.is_recording = False
        self.is_connected = False

        # Stop audio streams
        self.stop_audio_streams()

        # Close WebSocket connection
        if self.websocket:
            await self.websocket.close()
            logger.info("WebSocket connection closed")

        # Terminate PyAudio
        self.audio.terminate()
        logger.info("Resource cleanup complete")

    async def run(self):
        """Run client"""
        try:
            await self.connect()
            await self.start_conversation()
        except Exception as e:
            logger.error(f"Error running client: {e}")
        finally:
            await self.cleanup()


async def main():
    """Main function"""
    # Configuration parameters
    WEBSOCKET_URL = "wss://your-newapi-server-address/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
    API_KEY = "your-api-key"  # read this from an environment variable in real code
    MODEL = "gpt-4o-realtime-preview-2024-12-17"

    # Create client
    client = OpenAIRealtimeClient(
        websocket_url=WEBSOCKET_URL,
        api_key=API_KEY,
        model=MODEL
    )

    # Run client
    await client.run()


if __name__ == "__main__":
    print("OpenAI Realtime WebSocket Audio Example")
    print("=" * 50)
    print("Features:")
    print("- Real-time voice conversation")
    print("- Automatic speech recognition")
    print("- Text and audio responses")
    print("- Press Ctrl+C to stop")
    print("=" * 50)

    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        print("\nProgram stopped")
    except Exception as e:
        print(f"Program error: {e}")

⚠️ Error Handling

Common Errors

  1. Connection errors
      • Network issues
      • Authentication failures
      • Configuration errors
  2. Audio errors
      • Device permissions
      • Unsupported formats
      • Codec issues
  3. Session errors
      • Token expiration
      • Session timeout
      • Concurrency limits

Error Recovery

  1. Automatic reconnection
  2. Session recovery
  3. Error retry
  4. Graceful degradation
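
As a sketch of automatic reconnection for the WebSocket case: the connect() factory below stands in for the connection code shown earlier, and the backoff schedule is an illustrative choice, not a prescribed one.

// Reconnect with exponential backoff. `connect` is assumed to return
// a ws-module WebSocket wired up as in the earlier examples.
function connectWithRetry(connect, attempt = 0) {
  const ws = connect();
  ws.on("open", () => {
    attempt = 0; // reset the backoff once the connection is healthy
    // Session recovery: re-send your session.update configuration here
  });
  ws.on("close", () => {
    const delay = Math.min(30000, 1000 * 2 ** attempt); // cap at 30 s
    console.log("Connection lost, retrying in " + delay + " ms");
    setTimeout(() => connectWithRetry(connect, attempt + 1), delay);
  });
}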

📝 Event Reference

Common Request Headers

All API requests must include the following headers:

| Header | Type | Description | Example Value |
|---|---|---|---|
| Authorization | String | Authentication token | Bearer $API_KEY |
| OpenAI-Beta | String | API version | realtime=v1 |

Client Events

session.update

Update the default configuration for the session.

| Parameter | Type | Required | Description | Example Value/Optional Values |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_123 |
| type | String | No | Event type | session.update |
| modalities | String array | No | Modality types the model can respond with | ["text", "audio"] |
| instructions | String | No | System instructions prepended to model calls | "Your knowledge cutoff is 2023-10..." |
| voice | String | No | Voice type used by the model | alloy, echo, shimmer |
| input_audio_format | String | No | Input audio format | pcm16, g711_ulaw, g711_alaw |
| output_audio_format | String | No | Output audio format | pcm16, g711_ulaw, g711_alaw |
| input_audio_transcription.model | String | No | Model used for transcription | whisper-1 |
| turn_detection.type | String | No | Voice detection type | server_vad |
| turn_detection.threshold | Number | No | VAD activation threshold (0.0-1.0) | 0.8 |
| turn_detection.prefix_padding_ms | Integer | No | Audio duration included before speech starts | 500 |
| turn_detection.silence_duration_ms | Integer | No | Silence duration to detect speech stop | 1000 |
| tools | Array | No | List of tools available to the model | [] |
| tool_choice | String | No | How the model chooses tools | auto/none/required |
| temperature | Number | No | Model sampling temperature | 0.8 |
| max_output_tokens | String/Integer | No | Maximum tokens per response | "inf"/4096 |

input_audio_buffer.append

Append audio data to the input audio buffer.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_456 |
| type | String | No | Event type | input_audio_buffer.append |
| audio | String | No | Base64-encoded audio data | Base64EncodedAudioData |

input_audio_buffer.commit

Commit the audio data in the buffer as a user message.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_789 |
| type | String | No | Event type | input_audio_buffer.commit |

input_audio_buffer.clear

Clear all audio data from the input audio buffer.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_012 |
| type | String | No | Event type | input_audio_buffer.clear |

conversation.item.create

Add a new conversation item to the conversation.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_345 |
| type | String | No | Event type | conversation.item.create |
| previous_item_id | String | No | New item will be inserted after this ID | null |
| item.id | String | No | Unique identifier for the conversation item | msg_001 |
| item.type | String | No | Type of conversation item | message/function_call/function_call_output |
| item.status | String | No | Status of conversation item | completed/in_progress/incomplete |
| item.role | String | No | Role of message sender | user/assistant/system |
| item.content | Array | No | Message content | [text/audio/transcript] |
| item.call_id | String | No | ID of function call | call_001 |
| item.name | String | No | Name of called function | function_name |
| item.arguments | String | No | Arguments for function call | {"param": "value"} |
| item.output | String | No | Output result of function call | {"result": "value"} |
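
For example, after the model finishes a function call (see response.function_call_arguments.done later in this reference), the client typically returns the result as a function_call_output item; the call_id and output values below are illustrative:

{
  "type": "conversation.item.create",
  "item": {
    "type": "function_call_output",
    "call_id": "call_001",
    "output": "{\"temperature\": 18, \"unit\": \"celsius\"}"
  }
}

This is normally followed by a response.create event so the model can incorporate the result into its next response.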

conversation.item.truncate

Truncate audio content in assistant messages.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_678 |
| type | String | No | Event type | conversation.item.truncate |
| item_id | String | No | ID of assistant message item to truncate | msg_002 |
| content_index | Integer | No | Index of content part to truncate | 0 |
| audio_end_ms | Integer | No | End time point for audio truncation | 1500 |

conversation.item.delete

Delete the specified conversation item from conversation history.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_901 |
| type | String | No | Event type | conversation.item.delete |
| item_id | String | No | ID of conversation item to delete | msg_003 |

response.create

Trigger response generation.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_234 |
| type | String | No | Event type | response.create |
| response.modalities | String array | No | Modality types for response | ["text", "audio"] |
| response.instructions | String | No | Instructions for the model | "Please assist the user." |
| response.voice | String | No | Voice type used by the model | alloy/echo/shimmer |
| response.output_audio_format | String | No | Output audio format | pcm16 |
| response.tools | Array | No | List of tools available to the model | ["type", "name", "description"] |
| response.tool_choice | String | No | How the model chooses tools | auto |
| response.temperature | Number | No | Sampling temperature | 0.7 |
| response.max_output_tokens | Integer/String | No | Maximum output tokens | 150/"inf" |

response.cancel

Cancel ongoing response generation.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Client-generated event identifier | event_567 |
| type | String | No | Event type | response.cancel |

Server Events

error

Event returned when an error occurs.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_890 |
| type | String | No | Event type | error |
| error.type | String | No | Error type | invalid_request_error/server_error |
| error.code | String | No | Error code | invalid_event |
| error.message | String | No | Human-readable error message | "The 'type' field is missing." |
| error.param | String | No | Parameter related to error | null |
| error.event_id | String | No | ID of related event | event_567 |

conversation.item.input_audio_transcription.completed

Returned when input audio transcription is enabled and transcription succeeds.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_2122 |
| type | String | No | Event type | conversation.item.input_audio_transcription.completed |
| item_id | String | No | ID of user message item | msg_003 |
| content_index | Integer | No | Index of content part containing audio | 0 |
| transcript | String | No | Transcribed text content | "Hello, how are you?" |

conversation.item.input_audio_transcription.failed

Returned when input audio transcription is configured but transcription request for user message fails.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_2324 |
| type | String | No | Event type | conversation.item.input_audio_transcription.failed |
| item_id | String | No | ID of user message item | msg_003 |
| content_index | Integer | No | Index of content part containing audio | 0 |
| error.type | String | No | Error type | transcription_error |
| error.code | String | No | Error code | audio_unintelligible |
| error.message | String | No | Human-readable error message | "The audio could not be transcribed." |
| error.param | String | No | Parameter related to error | null |

conversation.item.truncated

Returned when client truncates previous assistant audio message item.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_2526 |
| type | String | No | Event type | conversation.item.truncated |
| item_id | String | No | ID of truncated assistant message item | msg_004 |
| content_index | Integer | No | Index of truncated content part | 0 |
| audio_end_ms | Integer | No | Time point when audio was truncated (milliseconds) | 1500 |

conversation.item.deleted

Returned when an item in the conversation is deleted.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_2728 |
| type | String | No | Event type | conversation.item.deleted |
| item_id | String | No | ID of deleted conversation item | msg_005 |

input_audio_buffer.committed

Returned when audio buffer data is committed.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1121 |
| type | String | No | Event type | input_audio_buffer.committed |
| previous_item_id | String | No | New conversation item will be inserted after this ID | msg_001 |
| item_id | String | No | ID of user message item to be created | msg_002 |

input_audio_buffer.cleared

Returned when client clears input audio buffer.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1314 |
| type | String | No | Event type | input_audio_buffer.cleared |

input_audio_buffer.speech_started

In server voice detection mode, returned when voice input is detected.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1516 |
| type | String | No | Event type | input_audio_buffer.speech_started |
| audio_start_ms | Integer | No | Milliseconds from session start to voice detection | 1000 |
| item_id | String | No | ID of user message item to be created when voice stops | msg_003 |

input_audio_buffer.speech_stopped

In server voice detection mode, returned when voice input stops.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1718 |
| type | String | No | Event type | input_audio_buffer.speech_stopped |
| audio_end_ms | Integer | No | Milliseconds from session start to voice stop detection | 2000 |
| item_id | String | No | ID of user message item to be created | msg_003 |

response.created

Returned when a new response is created.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_2930 |
| type | String | No | Event type | response.created |
| response.id | String | No | Unique identifier for response | resp_001 |
| response.object | String | No | Object type | realtime.response |
| response.status | String | No | Status of response | in_progress |
| response.status_details | Object | No | Additional details about status | null |
| response.output | Array | No | List of output items generated by response | [] |
| response.usage | Object | No | Usage statistics for response | null |

response.done

Returned when response streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_3132 |
| type | String | No | Event type | response.done |
| response.id | String | No | Unique identifier for response | resp_001 |
| response.object | String | No | Object type | realtime.response |
| response.status | String | No | Final status of response | completed/cancelled/failed/incomplete |
| response.status_details | Object | No | Additional details about status | null |
| response.output | Array | No | List of output items generated by response | [...] |
| response.usage.total_tokens | Integer | No | Total tokens | 50 |
| response.usage.input_tokens | Integer | No | Input tokens | 20 |
| response.usage.output_tokens | Integer | No | Output tokens | 30 |

response.output_item.added

Returned when a new output item is created during response generation.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_3334 |
| type | String | No | Event type | response.output_item.added |
| response_id | String | No | ID of response the output item belongs to | resp_001 |
| output_index | Integer | No | Index of output item in response | 0 |
| item.id | String | No | Unique identifier for output item | msg_007 |
| item.object | String | No | Object type | realtime.item |
| item.type | String | No | Type of output item | message/function_call/function_call_output |
| item.status | String | No | Status of output item | in_progress/completed |
| item.role | String | No | Role associated with output item | assistant |
| item.content | Array | No | Content of output item | ["type", "text", "audio", "transcript"] |

response.output_item.done

Returned when output item streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_3536 |
| type | String | No | Event type | response.output_item.done |
| response_id | String | No | ID of response the output item belongs to | resp_001 |
| output_index | Integer | No | Index of output item in response | 0 |
| item.id | String | No | Unique identifier for output item | msg_007 |
| item.object | String | No | Object type | realtime.item |
| item.type | String | No | Type of output item | message/function_call/function_call_output |
| item.status | String | No | Final status of output item | completed/incomplete |
| item.role | String | No | Role associated with output item | assistant |
| item.content | Array | No | Content of output item | ["type", "text", "audio", "transcript"] |

response.content_part.added

Returned when a new content part is added to assistant message item during response generation.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_3738 |
| type | String | No | Event type | response.content_part.added |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item to add content part to | msg_007 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| part.type | String | No | Content type | text/audio |
| part.text | String | No | Text content | "Hello" |
| part.audio | String | No | Base64-encoded audio data | "base64_encoded_audio_data" |
| part.transcript | String | No | Transcribed text of audio | "Hello" |

response.content_part.done

Returned when content part in assistant message item streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_3940 |
| type | String | No | Event type | response.content_part.done |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item the content part belongs to | msg_007 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| part.type | String | No | Content type | text/audio |
| part.text | String | No | Text content | "Hello" |
| part.audio | String | No | Base64-encoded audio data | "base64_encoded_audio_data" |
| part.transcript | String | No | Transcribed text of audio | "Hello" |

response.text.delta

Returned when text value of "text" type content part is updated.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_4142 |
| type | String | No | Event type | response.text.delta |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_007 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| delta | String | No | Text delta update content | "Sure, I can h" |

response.text.done

Returned when "text" type content part text streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_4344 |
| type | String | No | Event type | response.text.done |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_007 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| text | String | No | Final complete text content | "Sure, I can help with that." |

response.audio_transcript.delta

Returned when transcription content of model-generated audio output is updated.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_4546 |
| type | String | No | Event type | response.audio_transcript.delta |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_008 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| delta | String | No | Transcription text delta update content | "Hello, how can I a" |

response.audio_transcript.done

Returned when transcription of model-generated audio output streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_4748 |
| type | String | No | Event type | response.audio_transcript.done |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_008 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| transcript | String | No | Final complete transcribed text of audio | "Hello, how can I assist you today?" |

response.audio.delta

Returned when model-generated audio content is updated.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_4950 |
| type | String | No | Event type | response.audio.delta |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_008 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |
| delta | String | No | Base64-encoded audio data delta | "Base64EncodedAudioDelta" |

response.audio.done

Returned when model-generated audio is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_5152 |
| type | String | No | Event type | response.audio.done |
| response_id | String | No | ID of response | resp_001 |
| item_id | String | No | ID of message item | msg_008 |
| output_index | Integer | No | Index of output item in response | 0 |
| content_index | Integer | No | Index of content part in message item content array | 0 |

Function Calling

response.function_call_arguments.delta

Returned when model-generated function call arguments are updated.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_5354 |
| type | String | No | Event type | response.function_call_arguments.delta |
| response_id | String | No | ID of response | resp_002 |
| item_id | String | No | ID of function call item | fc_001 |
| output_index | Integer | No | Index of output item in response | 0 |
| call_id | String | No | ID of function call | call_001 |
| delta | String | No | JSON-format function call arguments delta | "{\"location\": \"San\"" |

response.function_call_arguments.done

Returned when model-generated function call arguments streaming is complete.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_5556 |
| type | String | No | Event type | response.function_call_arguments.done |
| response_id | String | No | ID of response | resp_002 |
| item_id | String | No | ID of function call item | fc_001 |
| output_index | Integer | No | Index of output item in response | 0 |
| call_id | String | No | ID of function call | call_001 |
| arguments | String | No | Final complete function call arguments (JSON format) | "{\"location\": \"San Francisco\"}" |
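
A client-side handling sketch for the Node.js ws examples above; runLocalFunction is a hypothetical dispatcher you would implement yourself:

ws.on("message", async (message) => {
  const event = JSON.parse(message.toString());
  if (event.type === "response.function_call_arguments.done") {
    // Run the requested function with the parsed arguments...
    const result = await runLocalFunction(event.call_id, JSON.parse(event.arguments));
    // ...return its output as a conversation item...
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    // ...and ask the model to continue with the result
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});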

Other Status Updates

rate_limits.updated

Triggered after each "response.done" event to indicate updated rate limits.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_5758 |
| type | String | No | Event type | rate_limits.updated |
| rate_limits | Object array | No | List of rate limit information | [{"name": "requests_per_min", "limit": 60, "remaining": 45, "reset_seconds": 35}] |

conversation.created

Returned when conversation is created.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_9101 |
| type | String | No | Event type | conversation.created |
| conversation | Object | No | Conversation resource object | {"id": "conv_001", "object": "realtime.conversation"} |

conversation.item.created

Returned when conversation item is created.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1920 |
| type | String | No | Event type | conversation.item.created |
| previous_item_id | String | No | ID of previous conversation item | msg_002 |
| item | Object | No | Conversation item object | {"id": "msg_003", "object": "realtime.item", "type": "message", "status": "completed", "role": "user", "content": [{"type": "text", "text": "Hello"}]} |

session.created

Returned when session is created.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_1234 |
| type | String | No | Event type | session.created |
| session | Object | No | Session object | {"id": "sess_001", "object": "realtime.session", "model": "gpt-4", "modalities": ["text", "audio"]} |

session.updated

Returned when session is updated.

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | No | Unique identifier for server event | event_5678 |
| type | String | No | Event type | session.updated |
| session | Object | No | Updated session object | {"id": "sess_001", "object": "realtime.session", "model": "gpt-4", "modalities": ["text", "audio"]} |

Rate Limit Event Parameter Table

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| name | String | Yes | Limit name | requests_per_min |
| limit | Integer | Yes | Limit value | 60 |
| remaining | Integer | Yes | Remaining available amount | 45 |
| reset_seconds | Integer | Yes | Reset time (seconds) | 35 |

Function Call Parameter Table

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| type | String | Yes | Function type | function |
| name | String | Yes | Function name | get_weather |
| description | String | No | Function description | Get the current weather |
| parameters | Object | Yes | Function parameter definition | {"type": "object", "properties": {...}} |

Audio Format Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| sample_rate | Integer | Sample rate | 8000, 16000, 24000, 44100, 48000 |
| channels | Integer | Number of channels | 1 (mono), 2 (stereo) |
| bits_per_sample | Integer | Bits per sample | 16 (pcm16), 8 (g711) |
| encoding | String | Encoding method | pcm16, g711_ulaw, g711_alaw |

Voice Detection Parameter Table

| Parameter | Type | Description | Default Value | Range |
|---|---|---|---|---|
| threshold | Float | VAD activation threshold | 0.5 | 0.0-1.0 |
| prefix_padding_ms | Integer | Voice prefix padding (milliseconds) | 500 | 0-5000 |
| silence_duration_ms | Integer | Silence detection duration (milliseconds) | 1000 | 100-10000 |

Tool Selection Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| tool_choice | String | Tool selection method | auto, none, required |
| tools | Array | Available tools list | [{type, name, description, parameters}] |

Model Configuration Parameter Table

| Parameter | Type | Description | Range/Optional Values | Default Value |
|---|---|---|---|---|
| temperature | Float | Sampling temperature | 0.0-2.0 | 1.0 |
| max_output_tokens | Integer/String | Maximum output length | 1-4096/"inf" | "inf" |
| modalities | String array | Response modalities | ["text", "audio"] | ["text"] |
| voice | String | Voice type | alloy, echo, shimmer | alloy |

Event Common Parameter Table

| Parameter | Type | Required | Description | Example Value |
|---|---|---|---|---|
| event_id | String | Yes | Unique identifier for event | event_123 |
| type | String | Yes | Event type | session.update |
| timestamp | Integer | No | Event timestamp (milliseconds) | 1677649363000 |

Session Status Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| status | String | Session status | active, ended, error |
| error | Object | Error information | {"type": "error_type", "message": "error message"} |
| metadata | Object | Session metadata | {"client_id": "web", "session_type": "chat"} |

Conversation Item Status Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| status | String | Conversation item status | completed, in_progress, incomplete |
| role | String | Sender role | user, assistant, system |
| type | String | Conversation item type | message, function_call, function_call_output |

Content Type Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| type | String | Content type | text, audio, transcript |
| format | String | Content format | plain, markdown, html |
| encoding | String | Encoding method | utf-8, base64 |

Response Status Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| status | String | Response status | completed, cancelled, failed, incomplete |
| status_details | Object | Status details | {"reason": "user_cancelled"} |
| usage | Object | Usage statistics | {"total_tokens": 50, "input_tokens": 20, "output_tokens": 30} |

Audio Transcription Parameter Table

| Parameter | Type | Description | Example Value |
|---|---|---|---|
| enabled | Boolean | Whether transcription is enabled | true |
| model | String | Transcription model | whisper-1 |
| language | String | Transcription language | en, zh, auto |
| prompt | String | Transcription prompt | "Transcript of a conversation" |

Audio Stream Parameter Table

| Parameter | Type | Description | Optional Values |
|---|---|---|---|
| chunk_size | Integer | Audio chunk size (bytes) | 1024, 2048, 4096 |
| latency | String | Latency mode | low, balanced, high |
| compression | String | Compression method | none, opus, mp3 |

WebRTC Configuration Parameter Table

| Parameter | Type | Description | Default Value |
|---|---|---|---|
| ice_servers | Array | ICE server list | [{"urls": "stun:stun.l.google.com:19302"}] |
| audio_constraints | Object | Audio constraints | {"echoCancellation": true} |
| connection_timeout | Integer | Connection timeout (milliseconds) | 30000 |
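
Applied to the browser WebRTC sketch earlier in this document, these defaults translate to something like the following (illustrative values only):

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});
const mic = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true },
});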