STT
This module implements Whisper transcription with a locally-downloaded model.
- class pipecat.services.whisper.stt.Model(*values)[source]
Bases:
Enum
Enumeration of the available Whisper model options.
- Available models:
- Multilingual models:
TINY: Smallest multilingual model
BASE: Basic multilingual model
MEDIUM: Good balance for multilingual
LARGE: Best quality multilingual
DISTIL_LARGE_V2: Fast multilingual
- English-only models:
DISTIL_MEDIUM_EN: Fast English-only
- TINY = 'tiny'
- BASE = 'base'
- MEDIUM = 'medium'
- LARGE = 'large-v3'
- DISTIL_LARGE_V2 = 'Systran/faster-distil-whisper-large-v2'
- DISTIL_MEDIUM_EN = 'Systran/faster-distil-whisper-medium.en'
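Because WhisperSTTService accepts either a Model enum member or a plain string, a caller (or the service itself) ultimately needs the string identifier. A hypothetical, self-contained sketch of that normalization — this mirrors the enum values above but is not the pipecat implementation:

```python
from enum import Enum

class Model(Enum):
    """Illustrative mirror of the model identifiers listed above."""
    TINY = "tiny"
    BASE = "base"
    MEDIUM = "medium"
    LARGE = "large-v3"
    DISTIL_LARGE_V2 = "Systran/faster-distil-whisper-large-v2"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"

def resolve_model_name(model) -> str:
    """Accept either a Model member or a plain model-name string."""
    return model.value if isinstance(model, Model) else model
```

Either `resolve_model_name(Model.LARGE)` or `resolve_model_name("large-v3")` yields the same identifier, which is why the `model` parameter below is typed `str | Model`.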
- class pipecat.services.whisper.stt.MLXModel(*values)[source]
Bases:
Enum
Enumeration of the available MLX Whisper model options.
- Available models:
- Multilingual models:
TINY: Smallest multilingual model
MEDIUM: Good balance for multilingual
LARGE_V3: Best quality multilingual
LARGE_V3_TURBO: Fine-tuned, pruned Whisper large-v3; much faster, slightly lower quality
DISTIL_LARGE_V3: Fast multilingual
LARGE_V3_TURBO_Q4: LARGE_V3_TURBO quantized to Q4
- TINY = 'mlx-community/whisper-tiny'
- MEDIUM = 'mlx-community/whisper-medium-mlx'
- LARGE_V3 = 'mlx-community/whisper-large-v3-mlx'
- LARGE_V3_TURBO = 'mlx-community/whisper-large-v3-turbo'
- DISTIL_LARGE_V3 = 'mlx-community/distil-whisper-large-v3'
- LARGE_V3_TURBO_Q4 = 'mlx-community/whisper-large-v3-turbo-q4'
- pipecat.services.whisper.stt.language_to_whisper_language(language)[source]
Maps pipecat Language enum to Whisper language codes.
- Parameters:
language (Language) – A Language enum value representing the input language.
- Returns:
The corresponding Whisper language code, or None if not supported.
- Return type:
str or None
Note
Only includes languages officially supported by Whisper.
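A hypothetical sketch of what such a lookup amounts to, using plain strings in place of pipecat's Language enum and covering only a small subset of Whisper's languages:

```python
from typing import Optional

# Illustrative subset only; the real function covers every language
# officially supported by Whisper and keys off the Language enum.
_WHISPER_LANGUAGE_MAP = {
    "EN": "en",
    "ES": "es",
    "FR": "fr",
}

def language_to_whisper_language(language: str) -> Optional[str]:
    """Return the Whisper language code, or None if unsupported."""
    return _WHISPER_LANGUAGE_MAP.get(language)
```

The `None` return for unsupported languages is what lets callers fall back to Whisper's automatic language detection rather than passing an invalid code.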
- class pipecat.services.whisper.stt.WhisperSTTService(*, model=Model.DISTIL_MEDIUM_EN, device='auto', compute_type='default', no_speech_prob=0.4, language=Language.EN, **kwargs)[source]
Bases:
SegmentedSTTService
Class to transcribe audio with a locally-downloaded Whisper model.
This service uses Faster Whisper to perform speech-to-text transcription on audio segments. It supports multiple languages and various model sizes.
- Parameters:
model (str | Model) – The Whisper model to use for transcription. Can be a Model enum or string.
device (str) – The device to run inference on (‘cpu’, ‘cuda’, or ‘auto’).
compute_type (str) – The compute type for inference (‘default’, ‘int8’, ‘int8_float16’, etc.).
no_speech_prob (float) – Probability threshold for filtering out non-speech segments.
language (Language) – The default language for transcription.
**kwargs – Additional arguments passed to SegmentedSTTService.
- _device
The device used for inference.
- _compute_type
The compute type for inference.
- _no_speech_prob
Threshold for non-speech filtering.
- _model
The loaded Whisper model instance.
- _settings
Dictionary containing service settings.
- can_generate_metrics()[source]
Indicates whether this service can generate metrics.
- Returns:
True, as this service supports metric generation.
- Return type:
bool
- language_to_service_language(language)[source]
Convert from pipecat Language to Whisper language code.
- Parameters:
language (Language) – The Language enum value to convert.
- Returns:
The corresponding Whisper language code, or None if not supported.
- Return type:
str or None
- async set_language(language)[source]
Set the language for transcription.
- Parameters:
language (Language) – The Language enum value to use for transcription.
- async run_stt(audio)[source]
Transcribes the given audio using Whisper.
- Parameters:
audio (bytes) – Raw audio bytes in 16-bit PCM format.
- Yields:
Frame –
Either a TranscriptionFrame containing the transcribed text, or an ErrorFrame if transcription fails.
- Return type:
AsyncGenerator[Frame, None]
Note
The audio is expected to be 16-bit signed PCM data. The service will normalize it to float32 in the range [-1, 1].
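The normalization described in the note is a standard PCM conversion. A minimal, stdlib-only sketch (not the service's actual code, which uses NumPy-style buffers) of turning 16-bit signed little-endian PCM into floats in [-1, 1]:

```python
import struct

def pcm16_to_float32(audio: bytes) -> list[float]:
    """Normalize 16-bit signed little-endian PCM to floats in [-1, 1]."""
    count = len(audio) // 2
    samples = struct.unpack(f"<{count}h", audio)
    return [s / 32768.0 for s in samples]

# Three samples: silence (0), max positive (32767), max negative (-32768).
print(pcm16_to_float32(b"\x00\x00\xff\x7f\x00\x80"))
```

Dividing by 32768 maps the full int16 range onto [-1, 1), with the most negative sample landing exactly on -1.0.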
- class pipecat.services.whisper.stt.WhisperSTTServiceMLX(*, model=MLXModel.TINY, no_speech_prob=0.6, language=Language.EN, temperature=0.0, **kwargs)[source]
Bases:
WhisperSTTService
Subclass of WhisperSTTService with MLX Whisper model support.
This service uses MLX Whisper to perform speech-to-text transcription on audio segments. It’s optimized for Apple Silicon and supports multiple languages and quantizations.
- Parameters:
model (str | MLXModel) – The MLX Whisper model to use for transcription. Can be an MLXModel enum or string.
no_speech_prob (float) – Probability threshold for filtering out non-speech segments.
language (Language) – The default language for transcription.
temperature (float | tuple[float, ...]) – Temperature for sampling. Can be a single float or a tuple of floats.
**kwargs – Additional arguments passed to SegmentedSTTService.
- _no_speech_threshold
Threshold for non-speech filtering.
- _temperature
Temperature for sampling.
- _settings
Dictionary containing service settings.
- async run_stt(audio)[source]
Transcribes the given audio using MLX Whisper.
- Parameters:
audio (bytes) – Raw audio bytes in 16-bit PCM format.
- Yields:
Frame –
Either a TranscriptionFrame containing the transcribed text, or an ErrorFrame if transcription fails.
- Return type:
AsyncGenerator[Frame, None]
Note
The audio is expected to be 16-bit signed PCM data. MLX Whisper will handle the conversion internally.
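The reason temperature can be a tuple is Whisper's usual fallback strategy: decode at the lowest temperature first and retry at progressively higher ones when the result fails quality checks. A hypothetical sketch of that loop — `decode` and its success flag are stand-ins, not pipecat or MLX Whisper API:

```python
def transcribe_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Try decoding at each temperature in turn, keeping the first
    result whose quality check passes (or the last attempt if none do)."""
    result = None
    for t in temperatures:
        result, ok = decode(t)
        if ok:
            break
    return result

# Toy decoder: pretend decoding only succeeds once temperature reaches 0.4.
print(transcribe_with_fallback(lambda t: (f"text@{t}", t >= 0.4)))
```

Passing a single float instead of a tuple simply disables the retries, trading robustness on difficult audio for a faster single-pass decode.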