STT

This module implements Whisper transcription with a locally-downloaded model.

class pipecat.services.whisper.stt.Model(*values)[source]

Bases: Enum

Enumeration of basic Whisper model selection options.

Available models:
Multilingual models:

TINY: Smallest multilingual model
BASE: Basic multilingual model
MEDIUM: Good balance for multilingual
LARGE: Best quality multilingual
DISTIL_LARGE_V2: Fast multilingual

English-only models:

DISTIL_MEDIUM_EN: Fast English-only

TINY = 'tiny'
BASE = 'base'
MEDIUM = 'medium'
LARGE = 'large-v3'
DISTIL_LARGE_V2 = 'Systran/faster-distil-whisper-large-v2'
DISTIL_MEDIUM_EN = 'Systran/faster-distil-whisper-medium.en'
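The enum values above map friendly member names onto model identifiers, and the service accepts either an enum member or a raw string. A minimal sketch of that pattern, using a subset of the values copied from above (the `resolve_model` helper is hypothetical, not part of pipecat):

```python
from enum import Enum

# Illustrative re-creation of the selection pattern; values copied from the enum above.
class Model(Enum):
    TINY = "tiny"
    BASE = "base"
    MEDIUM = "medium"
    LARGE = "large-v3"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"

def resolve_model(model: "str | Model") -> str:
    # Accept either a Model member or a plain string model name.
    return model.value if isinstance(model, Model) else model
```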
class pipecat.services.whisper.stt.MLXModel(*values)[source]

Bases: Enum

Enumeration of MLX Whisper model selection options.

Available models:
Multilingual models:

TINY: Smallest multilingual model
MEDIUM: Good balance for multilingual
LARGE_V3: Best quality multilingual
LARGE_V3_TURBO: Fine-tuned, pruned Whisper large-v3; much faster, slightly lower quality
DISTIL_LARGE_V3: Fast multilingual
LARGE_V3_TURBO_Q4: LARGE_V3_TURBO quantized to Q4

TINY = 'mlx-community/whisper-tiny'
MEDIUM = 'mlx-community/whisper-medium-mlx'
LARGE_V3 = 'mlx-community/whisper-large-v3-mlx'
LARGE_V3_TURBO = 'mlx-community/whisper-large-v3-turbo'
DISTIL_LARGE_V3 = 'mlx-community/distil-whisper-large-v3'
LARGE_V3_TURBO_Q4 = 'mlx-community/whisper-large-v3-turbo-q4'
pipecat.services.whisper.stt.language_to_whisper_language(language)[source]

Maps pipecat Language enum to Whisper language codes.

Parameters:

language (Language) – A Language enum value representing the input language.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

Note

Only includes languages officially supported by Whisper.
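Conceptually the function is a lookup from pipecat's Language enum into Whisper's two-letter language codes, returning None when Whisper has no official support. An illustrative stdlib-only sketch (the mapping below is a tiny hypothetical subset, not pipecat's actual table):

```python
# Hypothetical subset of the Language -> Whisper code mapping.
_WHISPER_CODES = {
    "EN": "en",
    "EN_US": "en",   # regional variants collapse to the base code
    "FR": "fr",
    "ES": "es",
}

def to_whisper_language(language: str) -> "str | None":
    # Returns None when the language is not officially supported by Whisper.
    return _WHISPER_CODES.get(language)
```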

class pipecat.services.whisper.stt.WhisperSTTService(*, model=Model.DISTIL_MEDIUM_EN, device='auto', compute_type='default', no_speech_prob=0.4, language=Language.EN, **kwargs)[source]

Bases: SegmentedSTTService

Class to transcribe audio with a locally-downloaded Whisper model.

This service uses Faster Whisper to perform speech-to-text transcription on audio segments. It supports multiple languages and various model sizes.

Parameters:
  • model (str | Model) – The Whisper model to use for transcription. Can be a Model enum or string.

  • device (str) – The device to run inference on (‘cpu’, ‘cuda’, or ‘auto’).

  • compute_type (str) – The compute type for inference (‘default’, ‘int8’, ‘int8_float16’, etc.).

  • no_speech_prob (float) – Probability threshold for filtering out non-speech segments.

  • language (Language) – The default language for transcription.

  • **kwargs – Additional arguments passed to SegmentedSTTService.
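Faster Whisper reports a per-segment probability that the segment contains no speech; the no_speech_prob parameter above acts as a cutoff for discarding segments that are likely silence or noise. A hedged sketch of that filtering step (the plain-dict segments and the helper name are illustrative, not the faster-whisper Segment type):

```python
def keep_speech_segments(segments, no_speech_prob=0.4):
    # Drop any segment Whisper judges to be at or above the threshold
    # probability of containing no speech at all.
    return [s for s in segments if s["no_speech_prob"] < no_speech_prob]
```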

_device

The device used for inference.

_compute_type

The compute type for inference.

_no_speech_prob

Threshold for non-speech filtering.

_model

The loaded Whisper model instance.

_settings

Dictionary containing service settings.

can_generate_metrics()[source]

Indicates whether this service can generate metrics.

Returns:

True, as this service supports metric generation.

Return type:

bool

language_to_service_language(language)[source]

Convert from pipecat Language to Whisper language code.

Parameters:

language (Language) – The Language enum value to convert.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

async set_language(language)[source]

Set the language for transcription.

Parameters:

language (Language) – The Language enum value to use for transcription.

async run_stt(audio)[source]

Transcribes the given audio using Whisper.

Parameters:

audio (bytes) – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.

Return type:

AsyncGenerator[Frame, None]

Note

The audio is expected to be 16-bit signed PCM data. The service will normalize it to float32 in the range [-1, 1].
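The normalization described in the note divides each signed 16-bit sample by 32768 so the result lands in [-1, 1]. A stdlib-only sketch of that conversion (the service itself likely uses NumPy for this; `pcm16_to_float32` is a hypothetical name):

```python
import struct

def pcm16_to_float32(audio: bytes) -> list[float]:
    # Interpret the buffer as little-endian signed 16-bit samples
    # and scale each one into [-1.0, 1.0], as Whisper expects.
    count = len(audio) // 2
    samples = struct.unpack(f"<{count}h", audio)
    return [s / 32768.0 for s in samples]
```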

class pipecat.services.whisper.stt.WhisperSTTServiceMLX(*, model=MLXModel.TINY, no_speech_prob=0.6, language=Language.EN, temperature=0.0, **kwargs)[source]

Bases: WhisperSTTService

Subclass of WhisperSTTService with MLX Whisper model support.

This service uses MLX Whisper to perform speech-to-text transcription on audio segments. It’s optimized for Apple Silicon and supports multiple languages and quantizations.

Parameters:
  • model (str | MLXModel) – The MLX Whisper model to use for transcription. Can be an MLXModel enum or string.

  • no_speech_prob (float) – Probability threshold for filtering out non-speech segments.

  • language (Language) – The default language for transcription.

  • temperature (float) – Temperature for sampling. Can be a float or tuple of floats.

  • **kwargs – Additional arguments passed to SegmentedSTTService.
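The temperature parameter can be a single value or, as in upstream Whisper, a sequence tried in order: decoding is retried at progressively higher temperatures when an attempt fails quality checks. An illustrative sketch of that fallback loop (`decode` is a stand-in for the real decoding call, and a None result stands in for a failed quality check):

```python
def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # Try each temperature in order; return the first successful decode.
    # In real Whisper, "failure" means e.g. low average log-probability
    # or a too-high compression ratio on the decoded text.
    for t in temperatures:
        result = decode(t)
        if result is not None:
            return result
    return None
```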

_no_speech_threshold

Threshold for non-speech filtering.

_temperature

Temperature for sampling.

_settings

Dictionary containing service settings.

async run_stt(audio)[source]

Transcribes the given audio using MLX Whisper.

Parameters:

audio (bytes) – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.

Return type:

AsyncGenerator[Frame, None]

Note

The audio is expected to be 16-bit signed PCM data. MLX Whisper will handle the conversion internally.