Voice SDK
The Voice SDK is a Python library that provides additional features optimized for conversational AI, built on top of our Realtime API.
We use it to build our integrations, and it is also available for you to use.
- Intelligent segmentation: groups words into meaningful speech segments per speaker.
- Turn detection: automatically detects when speakers finish talking.
- Speaker management: focus on or ignore specific speakers in multi-speaker scenarios.
- Preset configurations: offers ready-to-use settings for conversations, note-taking, and captions.
- Simplified event handling: delivers clean, structured segments instead of raw word-level events.
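The last two bullets can be pictured with a small, SDK-free sketch: it groups hypothetical raw word-level events (the kind the Realtime API emits) into per-speaker segments, roughly the shape of output the Voice SDK delivers. The event and segment field names here are illustrative stand-ins, not the SDK's actual types.

```python
# Illustration only: merge consecutive word-level events from the same
# speaker into a single segment. Field names are hypothetical.

def group_words_into_segments(word_events):
    """Merge consecutive words from the same speaker into one segment."""
    segments = []
    for event in word_events:
        if segments and segments[-1]["speaker_id"] == event["speaker"]:
            segments[-1]["text"] += " " + event["word"]
        else:
            segments.append({"speaker_id": event["speaker"], "text": event["word"]})
    return segments

words = [
    {"speaker": "S1", "word": "Hello"},
    {"speaker": "S1", "word": "there"},
    {"speaker": "S2", "word": "Hi"},
]
print(group_words_into_segments(words))
# [{'speaker_id': 'S1', 'text': 'Hello there'}, {'speaker_id': 'S2', 'text': 'Hi'}]
```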
Voice SDK vs Realtime SDK
Use the Voice SDK when:
- Building conversational AI or voice agents
- You need automatic turn detection
- You want speaker-focused transcription
- You need ready-to-use presets for common scenarios
Use the Realtime SDK when:
- You need the raw stream of word-by-word transcription data
- Building custom segmentation logic
- You want fine-grained control over every event
- Processing audio files or custom workflows
Getting started
1. Create an API key
Create a Speechmatics API key in the portal to access the Voice SDK. Store your key securely as a managed secret.
2. Install dependencies
# Standard installation
pip install speechmatics-voice
# With SMART_TURN (ML-based turn detection)
pip install speechmatics-voice[smart]
3. Quickstart
Here's how to stream microphone audio to the Voice Agent and transcribe finalised segments of speech, with speaker ID:
import asyncio
import os

from speechmatics.rt import Microphone
from speechmatics.voice import VoiceAgentClient, AgentServerMessageType


async def main():
    """Stream microphone audio to the Speechmatics Voice Agent using the 'scribe' preset."""
    # Audio configuration
    SAMPLE_RATE = 16000  # Hz
    CHUNK_SIZE = 160  # Samples per read
    PRESET = "scribe"  # Configuration preset

    # Create client with preset
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset=PRESET,
    )

    # Print finalised segments of speech with speaker ID
    @client.on(AgentServerMessageType.ADD_SEGMENT)
    def on_segment(message):
        for segment in message["segments"]:
            speaker = segment["speaker_id"]
            text = segment["text"]
            print(f"{speaker}: {text}")

    # Set up microphone
    mic = Microphone(SAMPLE_RATE, CHUNK_SIZE)
    if not mic.start():
        print("Error: Microphone not available")
        return

    # Connect to the Voice Agent
    await client.connect()

    # Stream microphone audio (interruptible from the keyboard)
    try:
        while True:
            audio_chunk = await mic.read(CHUNK_SIZE)
            if not audio_chunk:
                break  # Microphone stopped producing data
            await client.send_audio(audio_chunk)
    except KeyboardInterrupt:
        pass
    finally:
        await client.disconnect()


if __name__ == "__main__":
    asyncio.run(main())
Presets - the simplest way to get started
These are purpose-built, optimized configurations, ready for use without further modification:
- fast: low latency, fast responses
- adaptive: general conversation
- smart_turn: complex conversation
- external: user handles end of turn
- scribe: note-taking
- captions: live captioning
To view all available presets:
presets = VoiceAgentConfigPreset.list_presets()
4. Custom configurations
For more control, you can also specify custom configurations or use presets as a starting point and customise with overlays:
Specify configurations in a VoiceAgentConfig object:
import os

from speechmatics.voice import VoiceAgentClient, VoiceAgentConfig, EndOfUtteranceMode

config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)

client = VoiceAgentClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), config=config)
Use presets as a starting point and customise with overlays:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Use preset with custom overrides
config = VoiceAgentConfigPreset.SCRIBE(
    VoiceAgentConfig(
        language="es",
        max_delay=0.8,
    )
)
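Conceptually, an overlay is a shallow merge: fields you set explicitly replace the preset's defaults, and everything else is inherited. A plain-Python sketch of that merge, using illustrative field values rather than the SDK's real internals:

```python
# Illustration only: overlay semantics as a shallow dict merge.

def apply_overlay(preset_defaults, overrides):
    """Fields set in the overlay replace the preset's defaults."""
    merged = dict(preset_defaults)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged

scribe_defaults = {"language": "en", "max_delay": 0.7, "enable_diarization": True}
config = apply_overlay(scribe_defaults, {"language": "es", "max_delay": 0.8})
print(config)
# {'language': 'es', 'max_delay': 0.8, 'enable_diarization': True}
```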
Note: If no configuration or preset is provided, the client will default to the external preset.
Configuration
Basic parameters
language (str, default: "en")
Language code for transcription (e.g., "en", "es", "fr").
See supported languages.
operating_point (OperatingPoint, default: ENHANCED)
Balance accuracy vs latency.
Options: STANDARD or ENHANCED.
domain (str, default: None)
Domain-specific model (e.g., "finance", "medical").
See supported languages and domains.
output_locale (str, default: None)
Output locale for formatting (e.g., "en-GB", "en-US").
See supported languages and locales.
enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers.
Turn detection
end_of_utterance_mode (EndOfUtteranceMode, default: FIXED)
Controls how turn endings are detected:
- FIXED: Uses a fixed silence threshold. Fast, but may split slow speech.
- ADAPTIVE: Adjusts the delay based on speech rate, pauses, and disfluencies. Best for natural conversation.
- SMART_TURN: Uses an ML model to detect acoustic turn-taking cues. Requires the [smart] extras.
- EXTERNAL: Manual control via client.finalize(). For custom turn logic.
end_of_utterance_silence_trigger (float, default: 0.2)
Silence duration in seconds to trigger turn end.
end_of_utterance_max_delay (float, default: 10.0)
Maximum delay before forcing turn end.
max_delay (float, default: 0.7)
Maximum transcription delay for word emission.
Defaults to 0.7 seconds, but when using turn detection we recommend 1.0s for better accuracy. Turn detection will ensure finalisation latency is not affected.
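To make the interplay of these parameters concrete, here is a minimal, SDK-free sketch of FIXED-style end-of-utterance logic: a turn ends when silence exceeds the silence trigger, or is forced once the maximum delay elapses. This illustrates the idea only; it is not the SDK's implementation.

```python
# Illustration only: simplified FIXED-mode end-of-utterance logic.
# Timestamps are in seconds.

def turn_ended(now, last_word_time, turn_start_time,
               silence_trigger=0.2, max_delay=10.0):
    """Return True when the current turn should be finalised."""
    silence = now - last_word_time
    elapsed = now - turn_start_time
    return silence >= silence_trigger or elapsed >= max_delay

# 0.3 s of silence exceeds the 0.2 s trigger -> turn ends
print(turn_ended(now=5.3, last_word_time=5.0, turn_start_time=0.0))   # True
# Speaker still talking (0.05 s since last word), turn young -> keep open
print(turn_ended(now=5.05, last_word_time=5.0, turn_start_time=0.0))  # False
# A long turn hits the 10 s max delay even though the speaker just spoke
print(turn_ended(now=10.1, last_word_time=10.05, turn_start_time=0.0))  # True
```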
Speaker configuration
speaker_sensitivity (float, default: 0.5)
Diarization sensitivity between 0.0 and 1.0.
Higher values detect more speakers.
max_speakers (int, default: None)
Limit maximum number of speakers to detect.
prefer_current_speaker (bool, default: False)
Give extra weight to current speaker for word grouping.
speaker_config (SpeakerFocusConfig, default: SpeakerFocusConfig())
Configure speaker focus/ignore rules.
from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode, VoiceAgentConfig

# Focus only on specific speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
)

# Ignore specific speakers
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        ignore_speakers=["S3"],
        focus_mode=SpeakerFocusMode.IGNORE,
    ),
)
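The effect of focus and ignore rules can be sketched without the SDK. The filter below is a simplified, client-side stand-in for what SpeakerFocusConfig configures; the segment shape is illustrative.

```python
# Illustration only: how focus/ignore rules filter segments by speaker.

def apply_speaker_rules(segments, focus_speakers=None, ignore_speakers=None):
    """Keep only focused speakers, or drop ignored ones."""
    if focus_speakers:
        return [s for s in segments if s["speaker_id"] in focus_speakers]
    if ignore_speakers:
        return [s for s in segments if s["speaker_id"] not in ignore_speakers]
    return segments

segments = [
    {"speaker_id": "S1", "text": "Agenda first."},
    {"speaker_id": "S2", "text": "Agreed."},
    {"speaker_id": "S3", "text": "(background chatter)"},
]
print(apply_speaker_rules(segments, focus_speakers=["S1", "S2"]))
print(apply_speaker_rules(segments, ignore_speakers=["S3"]))
# Both calls keep only the S1 and S2 segments.
```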
known_speakers (list[SpeakerIdentifier], default: [])
Pre-enrolled speaker identifiers for speaker identification.
from speechmatics.voice import SpeakerIdentifier, VoiceAgentConfig

config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"]),
    ],
)
Language and vocabulary
additional_vocab (list[AdditionalVocabEntry], default: [])
Custom vocabulary for domain-specific terms.
from speechmatics.voice import AdditionalVocabEntry, VoiceAgentConfig

config = VoiceAgentConfig(
    language="en",
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
)
punctuation_overrides (dict, default: None)
Custom punctuation rules.
Audio parameters
sample_rate (int, default: 16000)
Audio sample rate in Hz.
audio_encoding (AudioEncoding, default: PCM_S16LE)
Audio encoding format.
Advanced parameters
transcription_update_preset (TranscriptionUpdatePreset, default: COMPLETE)
Controls when to emit updates: COMPLETE, COMPLETE_PLUS_TIMING, WORDS, WORDS_PLUS_TIMING, or TIMING.
speech_segment_config (SpeechSegmentConfig, default: SpeechSegmentConfig())
Fine-tune segment generation and post-processing.
smart_turn_config (SmartTurnConfig, default: None)
Configure SMART_TURN behavior (buffer length, threshold).
include_results (bool, default: False)
Include word-level timing data in segments.
include_partials (bool, default: True)
Emit partial segments. Set to False for final-only output.
Available presets
presets = VoiceAgentConfigPreset.list_presets()
# Output: ['low_latency', 'conversation_adaptive', 'conversation_smart_turn', 'scribe', 'captions']
Configuration serialization
Export and import configurations as JSON:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig
# Export preset to JSON
config_json = VoiceAgentConfigPreset.SCRIBE().to_json()
# Load from JSON
config = VoiceAgentConfig.from_json(config_json)
# Or create from JSON string
config = VoiceAgentConfig.from_json('{"language": "en", "enable_diarization": true}')
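Because the serialized form is plain JSON, it can be stored and inspected with standard tooling. For example, a stdlib-only check that a config string parses and round-trips before handing it to from_json:

```python
import json

# Illustration only: validate a config JSON string with the stdlib.
config_json = '{"language": "en", "enable_diarization": true}'
data = json.loads(config_json)  # raises ValueError on malformed JSON
print(data["language"], data["enable_diarization"])  # en True
assert json.loads(json.dumps(data)) == data  # round-trips losslessly
```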
For more information, see the Voice SDK on GitHub.