SttService

Base classes for Speech-to-Text services with continuous and segmented processing.

class pipecat.services.stt_service.STTService(audio_passthrough=True, sample_rate=None, **kwargs)

Bases: AIService

Base class for speech-to-text services.

Provides common functionality for STT services including audio passthrough, muting, settings management, and audio processing. Subclasses must implement the run_stt method to provide actual speech recognition.

Parameters:

audio_passthrough (bool) – Whether to pass audio frames downstream after processing. Defaults to True.
sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.
**kwargs – Additional arguments passed to the parent AIService.

property is_muted: bool

Check if the STT service is currently muted.

Returns: True if the service is muted and will not process audio.

property sample_rate: int

Get the current sample rate for audio processing.

Returns: The sample rate in Hz.

async set_model(model)

Set the speech recognition model.

Parameters: model (str) – The name of the model to use for speech recognition.

async set_language(language)

Set the language for speech recognition.

Parameters: language (Language) – The language to use for speech recognition.

abstractmethod async run_stt(audio)

Run speech-to-text on the provided audio data.

This method must be implemented by subclasses to provide actual speech recognition functionality.

Parameters: audio (bytes) – Raw audio bytes to transcribe.
Yields: Frame – Frames containing transcription results (typically TextFrame).
Return type: AsyncGenerator[Frame, None]
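The async-generator contract of run_stt can be illustrated without the library itself. The following is a minimal sketch, not pipecat's implementation: SketchSTTBase, ByteCountSTT, the TextFrame dataclass, and the transcribe helper are all hypothetical stand-ins that mirror only the documented signature (bytes in, frames yielded).

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class TextFrame:
    """Stand-in for a text-carrying frame (hypothetical, not pipecat's class)."""

    text: str


class SketchSTTBase(ABC):
    """Stand-in base class exposing only the abstract run_stt hook."""

    @abstractmethod
    def run_stt(self, audio: bytes) -> AsyncGenerator[TextFrame, None]:
        """Yield frames with transcription results for the given audio."""


class ByteCountSTT(SketchSTTBase):
    """Toy subclass that 'transcribes' audio by reporting its length."""

    async def run_stt(self, audio: bytes) -> AsyncGenerator[TextFrame, None]:
        # A real service would call a recognition engine here.
        yield TextFrame(text=f"received {len(audio)} bytes")


async def transcribe(service: SketchSTTBase, audio: bytes) -> List[str]:
    # run_stt is an async generator, so results are consumed with `async for`.
    return [frame.text async for frame in service.run_stt(audio)]


print(asyncio.run(transcribe(ByteCountSTT(), b"\x00\x01" * 160)))
# prints ['received 320 bytes']
```

Because the method yields rather than returns, a subclass may emit zero, one, or several frames per call, for example interim results followed by a final one.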

async start(frame)

Start the STT service.

Parameters: frame (StartFrame) – The start frame containing initialization parameters.
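The fallback described for sample_rate (use the constructor value if given, otherwise take it from the start frame) might look roughly like the sketch below. StartFrameSketch and its audio_in_sample_rate field are assumptions made for illustration, as is the resolve_sample_rate helper; none of them is taken from the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartFrameSketch:
    """Stand-in for the start frame; the field name is an assumption."""

    audio_in_sample_rate: int


def resolve_sample_rate(configured: Optional[int], frame: StartFrameSketch) -> int:
    # An explicit constructor value wins; otherwise use the start frame's rate.
    return configured if configured is not None else frame.audio_in_sample_rate
```

Under this reading, the sample_rate property only has a meaningful value once start has run, since that is the first point at which the start frame is available.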

async process_audio_frame(frame, direction)

Process an audio frame for speech recognition.

Parameters:

frame (AudioRawFrame) – The audio frame to process.
direction (FrameDirection) – The direction of frame processing.
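One plausible way muting and audio_passthrough interact during audio processing is sketched below. This is an assumption-laden illustration rather than the library's code: MutePassthroughSketch, its attributes, and the choice to keep forwarding audio while muted are all hypothetical.

```python
import asyncio
from typing import List


class MutePassthroughSketch:
    """Hypothetical stand-in that tracks where each audio chunk ends up."""

    def __init__(self, audio_passthrough: bool = True) -> None:
        self.audio_passthrough = audio_passthrough
        self.muted = False
        self.transcribed: List[bytes] = []  # audio handed to recognition
        self.downstream: List[bytes] = []  # audio forwarded to later stages

    async def process_audio(self, audio: bytes) -> None:
        # A muted service skips recognition entirely.
        if not self.muted:
            self.transcribed.append(audio)
        # Passthrough is treated as independent of muting (an assumption).
        if self.audio_passthrough:
            self.downstream.append(audio)


async def demo() -> MutePassthroughSketch:
    svc = MutePassthroughSketch(audio_passthrough=True)
    await svc.process_audio(b"a")  # transcribed and forwarded
    svc.muted = True
    await svc.process_audio(b"b")  # forwarded only
    return svc
```

Keeping the two flags separate means a muted service can still feed audio to later pipeline stages, such as a recorder, without producing transcriptions.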

async process_frame(frame, direction)

Process frames, handling VAD events and audio segmentation.

Parameters:

frame (Frame) – The frame to process.
direction (FrameDirection) – The direction of frame processing.

class pipecat.services.stt_service.SegmentedSTTService(*, sample_rate=None, **kwargs)

Bases: STTService

STT service that processes speech in segments using VAD events.

Uses Voice Activity Detection (VAD) events to detect speech segments and runs speech-to-text only on those segments, rather than continuously.

Requires VAD to be enabled in the pipeline to function properly. Maintains a small audio buffer to account for the delay between actual speech start and VAD detection.

Parameters:

sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.
**kwargs – Additional arguments passed to the parent STTService.

async start(frame)

Start the segmented STT service and initialize the audio buffer.

Parameters: frame (StartFrame) – The start frame containing initialization parameters.

async process_frame(frame, direction)

Process frames, handling VAD events and audio segmentation.

Parameters:

frame (Frame) – The frame to process.
direction (FrameDirection) – The direction of frame processing.
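The VAD-driven segmentation can be pictured as a small state machine: audio accumulates, and a "stopped speaking" event releases the accumulated segment for transcription. The sketch below is illustrative only. SpeechStarted, SpeechStopped, Audio, and VadSegmenter are hypothetical stand-ins for the pipeline's frame types and service, and trimming of idle audio is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


class SpeechStarted:
    """Stand-in for a VAD 'user started speaking' event."""


class SpeechStopped:
    """Stand-in for a VAD 'user stopped speaking' event."""


@dataclass
class Audio:
    """Stand-in for a raw audio frame."""

    data: bytes


class VadSegmenter:
    """Collects audio and releases one segment per detected utterance."""

    def __init__(self) -> None:
        self.speaking = False
        self._chunks: List[bytes] = []

    def process(self, frame: object) -> Optional[bytes]:
        if isinstance(frame, SpeechStarted):
            self.speaking = True
        elif isinstance(frame, Audio):
            # Audio is kept even before speech starts, so the segment
            # includes sound that preceded the VAD detection.
            self._chunks.append(frame.data)
        elif isinstance(frame, SpeechStopped):
            self.speaking = False
            segment = b"".join(self._chunks)
            self._chunks.clear()
            return segment  # this is what would be passed to run_stt
        return None
```

Without VAD events in the stream, process never returns a segment, which matches the requirement that VAD be enabled in the pipeline.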

async process_audio_frame(frame, direction)

Process audio frames by buffering them for segmented transcription.

Continuously buffers audio, growing the buffer while the user is speaking and maintaining a small buffer when not speaking to account for VAD delay.

Parameters:

frame (AudioRawFrame) – The audio frame to process.
direction (FrameDirection) – The direction of frame processing.
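The buffering rule described above (grow freely during speech, keep only a short tail otherwise) can be sketched as follows. The PreRollBuffer class, the one-second default, and the 16-bit mono PCM assumption are illustrative choices and are not confirmed by this reference.

```python
class PreRollBuffer:
    """Hypothetical buffer that keeps a bounded tail of audio while idle."""

    def __init__(self, sample_rate: int, pre_roll_secs: float = 1.0) -> None:
        # 16-bit mono PCM is assumed: 2 bytes per sample.
        self.max_idle_bytes = int(sample_rate * 2 * pre_roll_secs)
        self.data = bytearray()

    def append(self, audio: bytes, user_speaking: bool) -> None:
        self.data += audio
        if not user_speaking and len(self.data) > self.max_idle_bytes:
            # Drop the oldest bytes, keeping only the most recent pre-roll.
            del self.data[: len(self.data) - self.max_idle_bytes]
```

With sample_rate=16000 and the default pre-roll, the idle buffer in this sketch is capped at 32000 bytes, so a segment sent for transcription begins up to about one second before the VAD reported speech.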