Voice SDK
The Voice SDK is a Python library that provides additional features optimized for conversational AI, built on top of our Realtime API.
We use it to build our integrations, and it is also available for you to use.
- Intelligent segmentation: groups words into meaningful speech segments per speaker.
- Turn detection: automatically detects when speakers finish talking.
- Speaker management: focus on or ignore specific speakers in multi-speaker scenarios.
- Preset configurations: offers ready-to-use settings for conversations, note-taking, and captions.
- Simplified event handling: delivers clean, structured segments instead of raw word-level events.
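The last two bullets can be pictured with a small, SDK-free sketch: it groups hypothetical raw word-level events (the kind the Realtime API emits) into per-speaker segments, roughly the shape of output the Voice SDK delivers. The event and segment field names here are illustrative stand-ins, not the SDK's actual types.

```python
# Illustration only: merge consecutive word-level events from the same
# speaker into a single segment. Field names are hypothetical.

def group_words_into_segments(word_events):
    """Merge consecutive words from the same speaker into one segment."""
    segments = []
    for event in word_events:
        if segments and segments[-1]["speaker_id"] == event["speaker"]:
            segments[-1]["text"] += " " + event["word"]
        else:
            segments.append({"speaker_id": event["speaker"], "text": event["word"]})
    return segments

words = [
    {"speaker": "S1", "word": "Hello"},
    {"speaker": "S1", "word": "there"},
    {"speaker": "S2", "word": "Hi"},
]
print(group_words_into_segments(words))
# [{'speaker_id': 'S1', 'text': 'Hello there'}, {'speaker_id': 'S2', 'text': 'Hi'}]
```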
Voice SDK vs Realtime SDK
Use the Voice SDK when:
- Building conversational AI or voice agents
- You need automatic turn detection
- You want speaker-focused transcription
- You need ready-to-use presets for common scenarios
Use the Realtime SDK when:
- You need the raw stream of word-by-word transcription data
- Building custom segmentation logic
- You want fine-grained control over every event
- Processing audio files or custom workflows
Getting started
1. Create an API key
Create a Speechmatics API key in the portal to access the Voice SDK. Store your key securely as a managed secret.
2. Install dependencies
# Standard installation
pip install speechmatics-voice
# With SMART_TURN (ML-based turn detection)
pip install speechmatics-voice[smart]
3. Quickstart
Here's how to stream microphone audio to the Voice Agent and transcribe finalised segments of speech, with speaker ID:
import asyncio
import os

from speechmatics.rt import Microphone
from speechmatics.voice import VoiceAgentClient, AgentServerMessageType


async def main():
    """Stream microphone audio to the Speechmatics Voice Agent using the 'scribe' preset."""
    # Audio configuration
    SAMPLE_RATE = 16000  # Hz
    CHUNK_SIZE = 160  # Samples per read
    PRESET = "scribe"  # Configuration preset

    # Create client with preset
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset=PRESET,
    )

    # Print finalised segments of speech with speaker ID
    @client.on(AgentServerMessageType.ADD_SEGMENT)
    def on_segment(message):
        for segment in message["segments"]:
            speaker = segment["speaker_id"]
            text = segment["text"]
            print(f"{speaker}: {text}")

    # Set up microphone
    mic = Microphone(SAMPLE_RATE, CHUNK_SIZE)
    if not mic.start():
        print("Error: Microphone not available")
        return

    # Connect to the Voice Agent
    await client.connect()

    # Stream microphone audio (interruptible from the keyboard)
    try:
        while True:
            audio_chunk = await mic.read(CHUNK_SIZE)
            if not audio_chunk:
                break  # Microphone stopped producing data
            await client.send_audio(audio_chunk)
    except KeyboardInterrupt:
        pass
    finally:
        await client.disconnect()


if __name__ == "__main__":
    asyncio.run(main())
Presets - the simplest way to get started
These are purpose-built, optimized configurations, ready for use without further modification:
- fast: low latency, fast responses
- adaptive: general conversation
- smart_turn: complex conversation
- external: user handles end of turn
- scribe: note-taking
- captions: live captioning
To view all available presets:
presets = VoiceAgentConfigPreset.list_presets()
4. Custom configurations
For more control, you can also specify custom configurations or use presets as a starting point and customise with overlays:
Specify configurations in a VoiceAgentConfig object:
import os

from speechmatics.voice import VoiceAgentClient, VoiceAgentConfig, EndOfUtteranceMode

config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)

client = VoiceAgentClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), config=config)
Use presets as a starting point and customise with overlays:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Use preset with custom overrides
config = VoiceAgentConfigPreset.SCRIBE(
    VoiceAgentConfig(
        language="es",
        max_delay=0.8,
    )
)
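Conceptually, an overlay is a shallow merge: fields you set explicitly replace the preset's defaults, and everything else is inherited. A plain-Python sketch of that merge, using illustrative field values rather than the SDK's real internals:

```python
# Illustration only: overlay semantics as a shallow dict merge.

def apply_overlay(preset_defaults, overrides):
    """Fields set in the overlay replace the preset's defaults."""
    merged = dict(preset_defaults)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged

scribe_defaults = {"language": "en", "max_delay": 0.7, "enable_diarization": True}
config = apply_overlay(scribe_defaults, {"language": "es", "max_delay": 0.8})
print(config)
# {'language': 'es', 'max_delay': 0.8, 'enable_diarization': True}
```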
Note: If no configuration or preset is provided, the client will default to the external preset.
Configuration
Basic parameters
language (str, default: "en")
Language code for transcription (e.g., "en", "es", "fr").
See supported languages.
operating_point (OperatingPoint, default: ENHANCED)
Balance accuracy vs latency.
Options: STANDARD or ENHANCED.
domain (str, default: None)
Domain-specific model (e.g., "finance", "medical").
See supported languages and domains.
output_locale (str, default: None)
Output locale for formatting (e.g., "en-GB", "en-US").
See supported languages and locales.
enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers.
Turn detection
end_of_utterance_mode (EndOfUtteranceMode, default: FIXED)
Controls how turn endings are detected:
- FIXED: Uses a fixed silence threshold. Fast, but may split slow speech.
- ADAPTIVE: Adjusts the delay based on speech rate, pauses, and disfluencies. Best for natural conversation.
- SMART_TURN: Uses an ML model to detect acoustic turn-taking cues. Requires the [smart] extras.
- EXTERNAL: Manual control via client.finalize(). For custom turn logic.
end_of_utterance_silence_trigger (float, default: 0.2)
Silence duration in seconds to trigger turn end.
end_of_utterance_max_delay (float, default: 10.0)
Maximum delay before forcing turn end.
max_delay (float, default: 0.7)
Maximum transcription delay for word emission.
Defaults to 0.7 seconds, but when using turn detection we recommend 1.0s for better accuracy. Turn detection will ensure finalisation latency is not affected.
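To make the interplay of these parameters concrete, here is a minimal, SDK-free sketch of FIXED-style end-of-utterance logic: a turn ends when silence exceeds the silence trigger, or is forced once the maximum delay elapses. This illustrates the idea only; it is not the SDK's implementation.

```python
# Illustration only: simplified FIXED-mode end-of-utterance logic.
# Timestamps are in seconds.

def turn_ended(now, last_word_time, turn_start_time,
               silence_trigger=0.2, max_delay=10.0):
    """Return True when the current turn should be finalised."""
    silence = now - last_word_time
    elapsed = now - turn_start_time
    return silence >= silence_trigger or elapsed >= max_delay

# 0.3 s of silence exceeds the 0.2 s trigger -> turn ends
print(turn_ended(now=5.3, last_word_time=5.0, turn_start_time=0.0))   # True
# Speaker still talking (0.05 s since last word), turn young -> keep open
print(turn_ended(now=5.05, last_word_time=5.0, turn_start_time=0.0))  # False
# A long turn hits the 10 s max delay even though the speaker just spoke
print(turn_ended(now=10.1, last_word_time=10.05, turn_start_time=0.0))  # True
```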
Speaker configuration
speaker_sensitivity (float, default: 0.5)
Diarization sensitivity between 0.0 and 1.0.
Higher values detect more speakers.
max_speakers (int, default: None)
Limit maximum number of speakers to detect.
prefer_current_speaker (bool, default: False)
Give extra weight to current speaker for word grouping.
speaker_config (SpeakerFocusConfig, default: SpeakerFocusConfig())
Configure speaker focus/ignore rules.
from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode, VoiceAgentConfig

# Focus only on specific speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
)

# Ignore specific speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        ignore_speakers=["S3"],
        focus_mode=SpeakerFocusMode.IGNORE,
    ),
)
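The effect of focus and ignore rules can be sketched without the SDK. The filter below is a simplified, client-side stand-in for what SpeakerFocusConfig configures; the segment shape is illustrative.

```python
# Illustration only: how focus/ignore rules filter segments by speaker.

def apply_speaker_rules(segments, focus_speakers=None, ignore_speakers=None):
    """Keep only focused speakers, or drop ignored ones."""
    if focus_speakers:
        return [s for s in segments if s["speaker_id"] in focus_speakers]
    if ignore_speakers:
        return [s for s in segments if s["speaker_id"] not in ignore_speakers]
    return segments

segments = [
    {"speaker_id": "S1", "text": "Agenda first."},
    {"speaker_id": "S2", "text": "Agreed."},
    {"speaker_id": "S3", "text": "(background chatter)"},
]
print(apply_speaker_rules(segments, focus_speakers=["S1", "S2"]))
print(apply_speaker_rules(segments, ignore_speakers=["S3"]))
# Both calls keep only the S1 and S2 segments.
```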
known_speakers (list[SpeakerIdentifier], default: [])
Pre-enrolled speaker identifiers for speaker identification.
from speechmatics.voice import SpeakerIdentifier, VoiceAgentConfig

config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"]),
    ],
)
Language and vocabulary
additional_vocab (list[AdditionalVocabEntry], default: [])
Custom vocabulary for domain-specific terms.
from speechmatics.voice import AdditionalVocabEntry, VoiceAgentConfig

config = VoiceAgentConfig(
    language="en",
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
)
punctuation_overrides (dict, default: None)
Custom punctuation rules.
Audio parameters
sample_rate (int, default: 16000)
Audio sample rate in Hz.
audio_encoding (AudioEncoding, default: PCM_S16LE)
Audio encoding format.
Advanced parameters
transcription_update_preset (TranscriptionUpdatePreset, default: COMPLETE)
Controls when to emit updates: COMPLETE, COMPLETE_PLUS_TIMING, WORDS, WORDS_PLUS_TIMING, or TIMING.
speech_segment_config (SpeechSegmentConfig, default: SpeechSegmentConfig())
Fine-tune segment generation and post-processing.
smart_turn_config (SmartTurnConfig, default: None)
Configure SMART_TURN behavior (buffer length, threshold).
include_results (bool, default: False)
Include word-level timing data in segments.
include_partials (bool, default: True)
Emit partial segments. Set to False for final-only output.
Available presets
presets = VoiceAgentConfigPreset.list_presets()
# Output: ['low_latency', 'conversation_adaptive', 'conversation_smart_turn', 'scribe', 'captions']
Configuration serialization
Export and import configurations as JSON:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig
# Export preset to JSON
config_json = VoiceAgentConfigPreset.SCRIBE().to_json()
# Load from JSON
config = VoiceAgentConfig.from_json(config_json)
# Or create from JSON string
config = VoiceAgentConfig.from_json('{"language": "en", "enable_diarization": true}')
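Because the serialized form is plain JSON, it can be stored and inspected with standard tooling. For example, a stdlib-only check that a config string parses and round-trips before handing it to from_json:

```python
import json

# Illustration only: validate a config JSON string with the stdlib.
config_json = '{"language": "en", "enable_diarization": true}'
data = json.loads(config_json)  # raises ValueError on malformed JSON
print(data["language"], data["enable_diarization"])  # en True
assert json.loads(json.dumps(data)) == data  # round-trips losslessly
```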
For more information, see the Voice SDK on GitHub.