STT

This module implements Whisper transcription with a locally-downloaded model.

class pipecat.services.whisper.stt.Model(*values)[source]

Bases: Enum

Enumeration of basic Whisper model selection options.

Available models:
Multilingual models:

TINY: Smallest multilingual model
BASE: Basic multilingual model
MEDIUM: Good balance for multilingual
LARGE: Best quality multilingual
DISTIL_LARGE_V2: Fast multilingual

English-only models:

DISTIL_MEDIUM_EN: Fast English-only

TINY = 'tiny'
BASE = 'base'
MEDIUM = 'medium'
LARGE = 'large-v3'
DISTIL_LARGE_V2 = 'Systran/faster-distil-whisper-large-v2'
DISTIL_MEDIUM_EN = 'Systran/faster-distil-whisper-medium.en'
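The enum values above map friendly member names onto model identifiers, and the service accepts either an enum member or a raw string. A minimal sketch of that pattern, using a subset of the values copied from above (the `resolve_model` helper is hypothetical, not part of pipecat):

```python
from enum import Enum

# Illustrative re-creation of the selection pattern; values copied from the enum above.
class Model(Enum):
    TINY = "tiny"
    BASE = "base"
    MEDIUM = "medium"
    LARGE = "large-v3"
    DISTIL_MEDIUM_EN = "Systran/faster-distil-whisper-medium.en"

def resolve_model(model: "str | Model") -> str:
    # Accept either a Model member or a plain string model name.
    return model.value if isinstance(model, Model) else model
```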
class pipecat.services.whisper.stt.MLXModel(*values)[source]

Bases: Enum

Enumeration of MLX Whisper model selection options.

Available models:
Multilingual models:

TINY: Smallest multilingual model
MEDIUM: Good balance for multilingual
LARGE_V3: Best quality multilingual
LARGE_V3_TURBO: Fine-tuned, pruned Whisper large-v3; much faster, slightly lower quality
DISTIL_LARGE_V3: Fast multilingual
LARGE_V3_TURBO_Q4: LARGE_V3_TURBO quantized to Q4

TINY = 'mlx-community/whisper-tiny'
MEDIUM = 'mlx-community/whisper-medium-mlx'
LARGE_V3 = 'mlx-community/whisper-large-v3-mlx'
LARGE_V3_TURBO = 'mlx-community/whisper-large-v3-turbo'
DISTIL_LARGE_V3 = 'mlx-community/distil-whisper-large-v3'
LARGE_V3_TURBO_Q4 = 'mlx-community/whisper-large-v3-turbo-q4'
pipecat.services.whisper.stt.language_to_whisper_language(language)[source]

Maps pipecat Language enum to Whisper language codes.

Parameters:

language (Language) – A Language enum value representing the input language.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

Note

Only includes languages officially supported by Whisper.
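Conceptually the function is a lookup from pipecat's Language enum into Whisper's two-letter language codes, returning None when Whisper has no official support. An illustrative stdlib-only sketch (the mapping below is a tiny hypothetical subset, not pipecat's actual table):

```python
# Hypothetical subset of the Language -> Whisper code mapping.
_WHISPER_CODES = {
    "EN": "en",
    "EN_US": "en",   # regional variants collapse to the base code
    "FR": "fr",
    "ES": "es",
}

def to_whisper_language(language: str) -> "str | None":
    # Returns None when the language is not officially supported by Whisper.
    return _WHISPER_CODES.get(language)
```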

class pipecat.services.whisper.stt.WhisperSTTService(*, model=Model.DISTIL_MEDIUM_EN, device='auto', compute_type='default', no_speech_prob=0.4, language=Language.EN, **kwargs)[source]

Bases: SegmentedSTTService

Class to transcribe audio with a locally-downloaded Whisper model.

This service uses Faster Whisper to perform speech-to-text transcription on audio segments. It supports multiple languages and various model sizes.

Parameters:
  • model (str | Model) – The Whisper model to use for transcription. Can be a Model enum or string.

  • device (str) – The device to run inference on (‘cpu’, ‘cuda’, or ‘auto’).

  • compute_type (str) – The compute type for inference (‘default’, ‘int8’, ‘int8_float16’, etc.).

  • no_speech_prob (float) – Probability threshold for filtering out non-speech segments.

  • language (Language) – The default language for transcription.

  • **kwargs – Additional arguments passed to SegmentedSTTService.
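Faster Whisper reports a per-segment probability that the segment contains no speech; the no_speech_prob parameter above acts as a cutoff for discarding segments that are likely silence or noise. A hedged sketch of that filtering step (the plain-dict segments and the helper name are illustrative, not the faster-whisper Segment type):

```python
def keep_speech_segments(segments, no_speech_prob=0.4):
    # Drop any segment Whisper judges to be at or above the threshold
    # probability of containing no speech at all.
    return [s for s in segments if s["no_speech_prob"] < no_speech_prob]
```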

_device

The device used for inference.

_compute_type

The compute type for inference.

_no_speech_prob

Threshold for non-speech filtering.

_model

The loaded Whisper model instance.

_settings

Dictionary containing service settings.

can_generate_metrics()[source]

Indicates whether this service can generate metrics.

Returns:

True, as this service supports metric generation.

Return type:

bool

language_to_service_language(language)[source]

Convert from pipecat Language to Whisper language code.

Parameters:

language (Language) – The Language enum value to convert.

Returns:

The corresponding Whisper language code, or None if not supported.

Return type:

str or None

async set_language(language)[source]

Set the language for transcription.

Parameters:

language (Language) – The Language enum value to use for transcription.

async run_stt(audio)[source]

Transcribes the given audio using Whisper.

Parameters:

audio (bytes) – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.

Return type:

AsyncGenerator[Frame, None]

Note

The audio is expected to be 16-bit signed PCM data. The service will normalize it to float32 in the range [-1, 1].
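The normalization described in the note divides each signed 16-bit sample by 32768 so the result lands in [-1, 1]. A stdlib-only sketch of that conversion (the service itself likely uses NumPy for this; `pcm16_to_float32` is a hypothetical name):

```python
import struct

def pcm16_to_float32(audio: bytes) -> list[float]:
    # Interpret the buffer as little-endian signed 16-bit samples
    # and scale each one into [-1.0, 1.0], as Whisper expects.
    count = len(audio) // 2
    samples = struct.unpack(f"<{count}h", audio)
    return [s / 32768.0 for s in samples]
```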

class pipecat.services.whisper.stt.WhisperSTTServiceMLX(*, model=MLXModel.TINY, no_speech_prob=0.6, language=Language.EN, temperature=0.0, **kwargs)[source]

Bases: WhisperSTTService

Subclass of WhisperSTTService with MLX Whisper model support.

This service uses MLX Whisper to perform speech-to-text transcription on audio segments. It’s optimized for Apple Silicon and supports multiple languages and quantizations.

Parameters:
  • model (str | MLXModel) – The MLX Whisper model to use for transcription. Can be an MLXModel enum or string.

  • no_speech_prob (float) – Probability threshold for filtering out non-speech segments.

  • language (Language) – The default language for transcription.

  • temperature (float) – Temperature for sampling. Can be a float or tuple of floats.

  • **kwargs – Additional arguments passed to SegmentedSTTService.
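The temperature parameter can be a single value or, as in upstream Whisper, a sequence tried in order: decoding is retried at progressively higher temperatures when an attempt fails quality checks. An illustrative sketch of that fallback loop (`decode` is a stand-in for the real decoding call, and a None result stands in for a failed quality check):

```python
def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    # Try each temperature in order; return the first successful decode.
    # In real Whisper, "failure" means e.g. low average log-probability
    # or a too-high compression ratio on the decoded text.
    for t in temperatures:
        result = decode(t)
        if result is not None:
            return result
    return None
```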

_no_speech_threshold

Threshold for non-speech filtering.

_temperature

Temperature for sampling.

_settings

Dictionary containing service settings.

async run_stt(audio)[source]

Transcribes the given audio using MLX Whisper.

Parameters:

audio (bytes) – Raw audio bytes in 16-bit PCM format.

Yields:

Frame – Either a TranscriptionFrame containing the transcribed text or an ErrorFrame if transcription fails.

Return type:

AsyncGenerator[Frame, None]

Note

The audio is expected to be 16-bit signed PCM data. MLX Whisper will handle the conversion internally.