SttService

Base classes for Speech-to-Text services with continuous and segmented processing.

class pipecat.services.stt_service.STTService(audio_passthrough=True, sample_rate=None, **kwargs)

Bases: AIService

Base class for speech-to-text services.

Provides common functionality for STT services including audio passthrough, muting, settings management, and audio processing. Subclasses must implement the run_stt method to provide actual speech recognition.

Parameters:
  • audio_passthrough (bool) – Whether to pass audio frames downstream after processing. Defaults to True.

  • sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.

  • **kwargs – Additional arguments passed to the parent AIService.

property is_muted: bool

Check if the STT service is currently muted.

Returns:

True if the service is muted and will not process audio.

property sample_rate: int

Get the current sample rate for audio processing.

Returns:

The sample rate in Hz.

async set_model(model)

Set the speech recognition model.

Parameters:

model (str) – The name of the model to use for speech recognition.

async set_language(language)

Set the language for speech recognition.

Parameters:

language (Language) – The language to use for speech recognition.

abstractmethod async run_stt(audio)

Run speech-to-text on the provided audio data.

This method must be implemented by subclasses to provide actual speech recognition functionality.

Parameters:

audio (bytes) – Raw audio bytes to transcribe.

Yields:

Frame – Frames containing transcription results (typically TextFrame).

Return type:

AsyncGenerator[Frame, None]
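
A subclass supplies run_stt as an async generator. The sketch below shows only the shape of such an implementation and does not import pipecat: Frame, TextFrame, BaseSTT, and EchoSTT are hypothetical stand-ins, and the "transcription" is a placeholder that reports the byte count instead of calling a real recognizer.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncGenerator


@dataclass
class Frame:
    """Stand-in for pipecat's Frame (hypothetical, for illustration only)."""


@dataclass
class TextFrame(Frame):
    """Stand-in for a frame that carries transcribed text."""
    text: str = ""


class BaseSTT(ABC):
    """Stand-in for STTService: only the abstract run_stt contract is modeled."""

    @abstractmethod
    def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
        ...


class EchoSTT(BaseSTT):
    """Toy subclass; a real one would call a speech recognition backend here."""

    async def run_stt(self, audio: bytes) -> AsyncGenerator[Frame, None]:
        if not audio:
            return  # nothing to transcribe, so yield no frames
        # Placeholder result: report how many bytes were received.
        yield TextFrame(text=f"<{len(audio)} bytes of audio>")


async def collect(stt: BaseSTT, audio: bytes) -> list:
    # Drain the async generator the way a caller would consume it.
    return [frame async for frame in stt.run_stt(audio)]


frames = asyncio.run(collect(EchoSTT(), b"\x00\x01" * 160))
```

Writing run_stt as a generator lets an implementation yield zero, one, or several frames per call, which may suit backends that return partial results before a final one.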

async start(frame)

Start the STT service.

Parameters:

frame (StartFrame) – The start frame containing initialization parameters.
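
The constructor notes say that a sample_rate of None is resolved from the start frame. A minimal sketch of that fallback follows, assuming the start frame exposes an input sample rate. StartInfo, its audio_in_sample_rate field, and resolve_sample_rate are illustrative names here, not confirmed pipecat API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartInfo:
    """Hypothetical stand-in for StartFrame; the field name is an assumption."""
    audio_in_sample_rate: int = 16000


def resolve_sample_rate(configured: Optional[int], start: StartInfo) -> int:
    # An explicit constructor value takes priority; None defers to the start frame.
    return configured if configured is not None else start.audio_in_sample_rate


start = StartInfo(audio_in_sample_rate=24000)
from_frame = resolve_sample_rate(None, start)   # falls back to the start frame
explicit = resolve_sample_rate(8000, start)     # keeps the configured value
```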

async process_audio_frame(frame, direction)

Process an audio frame for speech recognition.

Parameters:
  • frame (AudioRawFrame) – The audio frame to process.

  • direction (FrameDirection) – The direction of frame processing.
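
The class description mentions both muting and audio passthrough. One plausible way the two flags could interact for a single audio frame is sketched below. PassthroughSketch is a hypothetical model, not pipecat code, and treating passthrough as independent of muting is an assumption.

```python
import asyncio
from typing import List


class PassthroughSketch:
    """Hypothetical model of the muting and passthrough flags, not pipecat code."""

    def __init__(self, audio_passthrough: bool = True):
        self.audio_passthrough = audio_passthrough
        self.is_muted = False
        self.transcribed: List[bytes] = []  # audio handed to recognition
        self.downstream: List[bytes] = []   # audio forwarded to the next processor

    async def process_audio_frame(self, audio: bytes) -> None:
        # A muted service skips recognition entirely.
        if not self.is_muted:
            self.transcribed.append(audio)
        # Passthrough is independent of muting in this sketch (an assumption).
        if self.audio_passthrough:
            self.downstream.append(audio)


async def demo() -> PassthroughSketch:
    svc = PassthroughSketch(audio_passthrough=True)
    await svc.process_audio_frame(b"a")
    svc.is_muted = True
    await svc.process_audio_frame(b"b")  # forwarded but not transcribed
    return svc


svc = asyncio.run(demo())
```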

async process_frame(frame, direction)

Process frames, handling VAD events and audio segmentation.

Parameters:
  • frame (Frame) – The frame to process.

  • direction (FrameDirection) – The direction of frame processing.

class pipecat.services.stt_service.SegmentedSTTService(*, sample_rate=None, **kwargs)

Bases: STTService

STT service that processes speech in segments using VAD events.

Uses Voice Activity Detection (VAD) events to detect speech segments and runs speech-to-text only on those segments, rather than continuously.

Requires VAD to be enabled in the pipeline to function properly. Maintains a small audio buffer to account for the delay between actual speech start and VAD detection.

Parameters:
  • sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.

  • **kwargs – Additional arguments passed to the parent STTService.

async start(frame)

Start the segmented STT service and initialize audio buffer.

Parameters:

frame (StartFrame) – The start frame containing initialization parameters.

async process_frame(frame, direction)

Process frames, handling VAD events and audio segmentation.

Parameters:
  • frame (Frame) – The frame to process.

  • direction (FrameDirection) – The direction of frame processing.

async process_audio_frame(frame, direction)

Process audio frames by buffering them for segmented transcription.

Continuously buffers audio, growing the buffer while the user is speaking and keeping only a small buffer when the user is not speaking, to account for VAD delay.

Parameters:
  • frame (AudioRawFrame) – The audio frame to process.

  • direction (FrameDirection) – The direction of frame processing.
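
The buffering behavior described above can be sketched as follows. SegmentBuffer and its method names are hypothetical. The one-second pre-roll and the 16-bit mono PCM format are assumptions chosen for illustration, not values taken from the library.

```python
class SegmentBuffer:
    """Hypothetical model of VAD-driven buffering; sizes are assumptions."""

    def __init__(self, sample_rate: int, preroll_secs: float = 1.0):
        # 16-bit mono PCM is assumed: two bytes per sample.
        self.max_idle_bytes = int(sample_rate * 2 * preroll_secs)
        self.buffer = bytearray()
        self.speaking = False

    def on_audio(self, audio: bytes) -> None:
        self.buffer += audio
        # While idle, keep only the newest bytes as pre-roll for late VAD triggers.
        if not self.speaking and len(self.buffer) > self.max_idle_bytes:
            del self.buffer[: len(self.buffer) - self.max_idle_bytes]

    def on_speech_started(self) -> None:
        self.speaking = True

    def on_speech_stopped(self) -> bytes:
        # Hand the whole segment (pre-roll plus speech) to recognition and reset.
        self.speaking = False
        segment = bytes(self.buffer)
        self.buffer.clear()
        return segment


buf = SegmentBuffer(sample_rate=4, preroll_secs=1.0)  # tiny rate keeps the demo small
buf.on_audio(b"0123456789")    # idle: trimmed to the last 8 bytes
buf.on_speech_started()
buf.on_audio(b"abcdefghijkl")  # speaking: buffer grows without trimming
segment = buf.on_speech_stopped()
```

Keeping a short pre-roll means the first syllables spoken before the VAD fires are still part of the segment that reaches recognition.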