STT

This module implements Ultravox speech-to-text with a locally-loaded model.

class pipecat.services.ultravox.stt.AudioBuffer[source]

Bases: object

Buffer to collect audio frames before processing.

frames: List of AudioRawFrames to process

started_at: Timestamp when speech started

is_processing: Flag to prevent concurrent processing

class pipecat.services.ultravox.stt.UltravoxModel(model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b')[source]

Bases: object

Model wrapper for the Ultravox multimodal model.

This class handles loading and running the Ultravox model for speech-to-text.

Parameters:: model_name (str) – The name or path of the Ultravox model to load

model_name: The name of the loaded model

engine: The vLLM engine for model inference

tokenizer: The tokenizer for the model

stop_token_ids: Optional token IDs to stop generation

format_prompt(messages)[source]

Format chat messages into a prompt for the model.

Parameters:: messages (list) – List of message dictionaries with ‘role’ and ‘content’
Returns:: Formatted prompt string
Return type:: str

async generate(messages, temperature=0.7, max_tokens=100, audio=None)[source]

Generate text from audio input using the model.

Parameters:

messages (list) – List of message dictionaries
temperature (float) – Sampling temperature
max_tokens (int) – Maximum tokens to generate
audio (ndarray) – Audio data as numpy array

Yields:

str – JSON chunks of the generated response

class pipecat.services.ultravox.stt.UltravoxSTTService(*, model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b', hf_token=None, temperature=0.7, max_tokens=100, **kwargs)[source]

Bases: AIService

Service to transcribe audio using the Ultravox multimodal model.

This service collects audio frames and processes them with Ultravox to generate text transcriptions.

Parameters:

model_name (str) – The Ultravox model to use (ModelSize enum or string)
hf_token (str | None) – Hugging Face token for model access
temperature (float) – Sampling temperature for generation
max_tokens (int) – Maximum tokens to generate
**kwargs – Additional arguments passed to AIService

model: The UltravoxModel instance

buffer: Buffer to collect audio frames

temperature: Temperature for text generation

max_tokens: Maximum tokens to generate

_connection_active: Flag indicating if service is active

can_generate_metrics()[source]

Indicates whether this service can generate metrics.

Returns:: True, as this service supports metric generation.
Return type:: bool

async start(frame)[source]

Handle service start.

Parameters:: frame (StartFrame) – StartFrame that triggered this method

async stop(frame)[source]

Handle service stop.

Parameters:: frame (EndFrame) – EndFrame that triggered this method

async cancel(frame)[source]

Handle service cancellation.

Parameters:: frame (CancelFrame) – CancelFrame that triggered this method

async process_frame(frame, direction)[source]

Process incoming frames.

This method collects audio frames and processes them when speech ends.

Parameters:

frame (Frame) – The frame to process
direction (FrameDirection) – Direction of the frame (input/output)