STT
This module implements Ultravox speech-to-text with a locally-loaded model.
- class pipecat.services.ultravox.stt.AudioBuffer[source]
Bases:
object
Buffer to collect audio frames before processing.
- frames
List of AudioRawFrames to process
- started_at
Timestamp when speech started
- is_processing
Flag to prevent concurrent processing
- class pipecat.services.ultravox.stt.UltravoxModel(model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b')[source]
Bases:
object
Model wrapper for the Ultravox multimodal model.
This class handles loading and running the Ultravox model for speech-to-text.
- Parameters:
model_name (str) – The name or path of the Ultravox model to load
- model_name
The name of the loaded model
- engine
The vLLM engine for model inference
- tokenizer
The tokenizer for the model
- stop_token_ids
Optional token IDs to stop generation
- format_prompt(messages)[source]
Format chat messages into a prompt for the model.
- Parameters:
messages (list) – List of message dictionaries with ‘role’ and ‘content’
- Returns:
Formatted prompt string
- Return type:
str
- async generate(messages, temperature=0.7, max_tokens=100, audio=None)[source]
Generate text from audio input using the model.
- Parameters:
messages (list) – List of message dictionaries
temperature (float) – Sampling temperature
max_tokens (int) – Maximum tokens to generate
audio (ndarray) – Audio data as numpy array
- Yields:
str – JSON chunks of the generated response
- class pipecat.services.ultravox.stt.UltravoxSTTService(*, model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b', hf_token=None, temperature=0.7, max_tokens=100, **kwargs)[source]
Bases:
AIService
Service to transcribe audio using the Ultravox multimodal model.
This service collects audio frames and processes them with Ultravox to generate text transcriptions.
- Parameters:
model_name (str) – The Ultravox model to use (ModelSize enum or string)
hf_token (str | None) – Hugging Face token for model access
temperature (float) – Sampling temperature for generation
max_tokens (int) – Maximum tokens to generate
**kwargs – Additional arguments passed to AIService
- model
The UltravoxModel instance
- buffer
Buffer to collect audio frames
- temperature
Temperature for text generation
- max_tokens
Maximum tokens to generate
- _connection_active
Flag indicating if service is active
- can_generate_metrics()[source]
Indicates whether this service can generate metrics.
- Returns:
True, as this service supports metric generation.
- Return type:
bool
- async start(frame)[source]
Handle service start.
- Parameters:
frame (StartFrame) – StartFrame that triggered this method
- async stop(frame)[source]
Handle service stop.
- Parameters:
frame (EndFrame) – EndFrame that triggered this method
- async cancel(frame)[source]
Handle service cancellation.
- Parameters:
frame (CancelFrame) – CancelFrame that triggered this method
- async process_frame(frame, direction)[source]
Process incoming frames.
This method collects audio frames and processes them when speech ends.
- Parameters:
frame (Frame) – The frame to process
direction (FrameDirection) – Direction of the frame (input/output)