STT

This module implements Ultravox speech-to-text with a locally-loaded model.

class pipecat.services.ultravox.stt.AudioBuffer[source]

Bases: object

Buffer to collect audio frames before processing.

frames

List of AudioRawFrames to process

started_at

Timestamp when speech started

is_processing

Flag to prevent concurrent processing

class pipecat.services.ultravox.stt.UltravoxModel(model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b')[source]

Bases: object

Model wrapper for the Ultravox multimodal model.

This class handles loading and running the Ultravox model for speech-to-text.

Parameters:

model_name (str) – The name or path of the Ultravox model to load

model_name

The name of the loaded model

engine

The vLLM engine for model inference

tokenizer

The tokenizer for the model

stop_token_ids

Optional token IDs to stop generation

format_prompt(messages)[source]

Format chat messages into a prompt for the model.

Parameters:

messages (list) – List of message dictionaries with ‘role’ and ‘content’

Returns:

Formatted prompt string

Return type:

str

async generate(messages, temperature=0.7, max_tokens=100, audio=None)[source]

Generate text from audio input using the model.

Parameters:
  • messages (list) – List of message dictionaries

  • temperature (float) – Sampling temperature

  • max_tokens (int) – Maximum tokens to generate

  • audio (ndarray) – Audio data as numpy array

Yields:

str – JSON chunks of the generated response

class pipecat.services.ultravox.stt.UltravoxSTTService(*, model_name='fixie-ai/ultravox-v0_5-llama-3_1-8b', hf_token=None, temperature=0.7, max_tokens=100, **kwargs)[source]

Bases: AIService

Service to transcribe audio using the Ultravox multimodal model.

This service collects audio frames and processes them with Ultravox to generate text transcriptions.

Parameters:
  • model_name (str) – The Ultravox model to use (ModelSize enum or string)

  • hf_token (str | None) – Hugging Face token for model access

  • temperature (float) – Sampling temperature for generation

  • max_tokens (int) – Maximum tokens to generate

  • **kwargs – Additional arguments passed to AIService

model

The UltravoxModel instance

buffer

Buffer to collect audio frames

temperature

Temperature for text generation

max_tokens

Maximum tokens to generate

_connection_active

Flag indicating if service is active

can_generate_metrics()[source]

Indicates whether this service can generate metrics.

Returns:

True, as this service supports metric generation.

Return type:

bool

async start(frame)[source]

Handle service start.

Parameters:

frame (StartFrame) – StartFrame that triggered this method

async stop(frame)[source]

Handle service stop.

Parameters:

frame (EndFrame) – EndFrame that triggered this method

async cancel(frame)[source]

Handle service cancellation.

Parameters:

frame (CancelFrame) – CancelFrame that triggered this method

async process_frame(frame, direction)[source]

Process incoming frames.

This method collects audio frames and processes them when speech ends.

Parameters:
  • frame (Frame) – The frame to process

  • direction (FrameDirection) – Direction of the frame (input/output)