SttService
Base classes for Speech-to-Text services with continuous and segmented processing.
- class pipecat.services.stt_service.STTService(audio_passthrough=True, sample_rate=None, **kwargs)
Bases: AIService
Base class for speech-to-text services.
Provides common functionality for STT services including audio passthrough, muting, settings management, and audio processing. Subclasses must implement the run_stt method to provide actual speech recognition.
- Parameters:
audio_passthrough (bool) – Whether to pass audio frames downstream after processing. Defaults to True.
sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.
**kwargs – Additional arguments passed to the parent AIService.
- property is_muted: bool
Check if the STT service is currently muted.
- Returns:
True if the service is muted and will not process audio.
- property sample_rate: int
Get the current sample rate for audio processing.
- Returns:
The sample rate in Hz.
- async set_model(model)
Set the speech recognition model.
- Parameters:
model (str) – The name of the model to use for speech recognition.
- async set_language(language)
Set the language for speech recognition.
- Parameters:
language (Language) – The language to use for speech recognition.
- abstractmethod async run_stt(audio)
Run speech-to-text on the provided audio data.
This method must be implemented by subclasses to provide actual speech recognition functionality.
- Parameters:
audio (bytes) – Raw audio bytes to transcribe.
- Yields:
Frame – Frames containing transcription results (typically TextFrame).
- Return type:
AsyncGenerator[Frame, None]
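The async-generator contract of run_stt can be illustrated without Pipecat installed. In the minimal sketch below, TextFrame, BaseSTT, and EchoSTT are hypothetical stand-ins, not Pipecat classes; a real service would subclass STTService and call a recognition backend where the placeholder comment sits.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class TextFrame:
    """Hypothetical stand-in for a frame carrying transcribed text."""
    text: str


class BaseSTT(ABC):
    """Hypothetical stand-in mirroring the abstract run_stt contract."""

    @abstractmethod
    def run_stt(self, audio: bytes) -> AsyncGenerator[TextFrame, None]:
        """Yield frames with transcription results for the given audio."""


class EchoSTT(BaseSTT):
    """Toy subclass: reports the byte count instead of recognizing speech."""

    async def run_stt(self, audio: bytes) -> AsyncGenerator[TextFrame, None]:
        # A real implementation would send `audio` to a recognizer here.
        yield TextFrame(text=f"received {len(audio)} bytes")


async def collect(service: BaseSTT, audio: bytes) -> List[TextFrame]:
    # run_stt is an async generator, so results are consumed with `async for`.
    return [frame async for frame in service.run_stt(audio)]


frames = asyncio.run(collect(EchoSTT(), b"\x00\x01" * 160))
```

Because run_stt yields rather than returns, a subclass can emit zero, one, or several frames per call, which suits backends that produce both interim and final results.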
- async start(frame)
Start the STT service.
- Parameters:
frame (StartFrame) – The start frame containing initialization parameters.
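The fallback described for sample_rate (an explicit value wins, None defers to the start frame) might look like the sketch below. StartInfo and resolve_sample_rate are hypothetical names used for illustration, and the audio_in_sample_rate field name is an assumption rather than something this page documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartInfo:
    """Hypothetical stand-in for the start frame; the field name is an assumption."""
    audio_in_sample_rate: int


def resolve_sample_rate(configured: Optional[int], start: StartInfo) -> int:
    # An explicit constructor value takes precedence; None defers to the start frame.
    return configured if configured is not None else start.audio_in_sample_rate


start = StartInfo(audio_in_sample_rate=16000)
from_frame = resolve_sample_rate(None, start)   # falls back to the start frame
explicit = resolve_sample_rate(8000, start)     # keeps the configured value
```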
- async process_audio_frame(frame, direction)
Process an audio frame for speech recognition.
- Parameters:
frame (AudioRawFrame) – The audio frame to process.
direction (FrameDirection) – The direction of frame processing.
- async process_frame(frame, direction)
Process frames, handling VAD events and audio segmentation.
- Parameters:
frame (Frame) – The frame to process.
direction (FrameDirection) – The direction of frame processing.
- class pipecat.services.stt_service.SegmentedSTTService(*, sample_rate=None, **kwargs)
Bases: STTService
STT service that processes speech in segments using VAD events.
Uses Voice Activity Detection (VAD) events to detect speech segments and runs speech-to-text only on those segments, rather than continuously.
Requires VAD to be enabled in the pipeline to function properly. Maintains a small audio buffer to account for the delay between actual speech start and VAD detection.
- Parameters:
sample_rate (int | None) – The sample rate for audio input. If None, it is determined from the start frame.
**kwargs – Additional arguments passed to the parent STTService.
- async start(frame)
Start the segmented STT service and initialize audio buffer.
- Parameters:
frame (StartFrame) – The start frame containing initialization parameters.
- async process_frame(frame, direction)
Process frames, handling VAD events and audio segmentation.
- Parameters:
frame (Frame) – The frame to process.
direction (FrameDirection) – The direction of frame processing.
- async process_audio_frame(frame, direction)
Process audio frames by buffering them for segmented transcription.
Continuously buffers audio: the buffer grows while the user is speaking and is kept to a small pre-roll window when the user is not speaking, to account for VAD delay.
- Parameters:
frame (AudioRawFrame) – The audio frame to process.
direction (FrameDirection) – The direction of frame processing.
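The buffering behaviour described for segmented processing can be sketched in plain Python. SegmentBuffer is a hypothetical class, not Pipecat's implementation; the one-second pre-roll window and the 16-bit mono PCM format are assumptions chosen for illustration.

```python
class SegmentBuffer:
    """Hypothetical sketch of pre-roll buffering; not Pipecat's implementation."""

    def __init__(self, sample_rate: int, pre_roll_secs: float = 1.0):
        # Assumes 16-bit mono PCM: 2 bytes per sample.
        self._max_idle_bytes = int(sample_rate * pre_roll_secs) * 2
        self._buffer = bytearray()
        self._speaking = False

    def __len__(self) -> int:
        return len(self._buffer)

    def on_speech_started(self) -> None:
        # From here on the buffer is allowed to grow without trimming.
        self._speaking = True

    def on_speech_stopped(self) -> bytes:
        # Hand back the whole segment (pre-roll plus speech) and reset.
        self._speaking = False
        segment = bytes(self._buffer)
        self._buffer.clear()
        return segment

    def add_audio(self, audio: bytes) -> None:
        self._buffer += audio
        # While idle, keep only the most recent pre-roll window.
        if not self._speaking and len(self._buffer) > self._max_idle_bytes:
            del self._buffer[: len(self._buffer) - self._max_idle_bytes]


buf = SegmentBuffer(sample_rate=16000)
chunk = b"\x00\x00" * 8000      # 0.5 s of audio at 16 kHz, 16-bit mono
for _ in range(4):              # 2 s of idle audio
    buf.add_audio(chunk)
idle_size = len(buf)            # trimmed to the pre-roll window
buf.on_speech_started()
for _ in range(4):              # 2 s of speech
    buf.add_audio(chunk)
segment = buf.on_speech_stopped()
```

Keeping the pre-roll means the segment handed to the recognizer includes audio from just before the VAD fired, so the first syllable of an utterance is less likely to be clipped.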