TTS

pipecat.services.elevenlabs.tts.language_to_elevenlabs_language(language)[source]

Convert a pipecat Language to an ElevenLabs language code.

Parameters:

language (Language)

Return type:

str | None
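
A minimal usage sketch (that Language.EN maps to "en" is an assumption; languages ElevenLabs does not support yield None):

    from pipecat.services.elevenlabs.tts import language_to_elevenlabs_language
    from pipecat.transcriptions.language import Language

    code = language_to_elevenlabs_language(Language.EN)
    print(code)  # expected: "en"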

pipecat.services.elevenlabs.tts.output_format_from_sample_rate(sample_rate)[source]

Return the ElevenLabs output format string for the given output sample rate.

Parameters:

sample_rate (int)

Return type:

str
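
A hedged sketch (ElevenLabs PCM output formats follow a pcm_<rate> naming convention, so the exact string returned is an assumption):

    from pipecat.services.elevenlabs.tts import output_format_from_sample_rate

    fmt = output_format_from_sample_rate(24000)
    print(fmt)  # e.g. "pcm_24000"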

pipecat.services.elevenlabs.tts.build_elevenlabs_voice_settings(settings)[source]

Build a voice settings dictionary for ElevenLabs from the provided settings.

Parameters:

settings (Dict[str, Any]) – Dictionary containing voice settings parameters

Returns:

Dictionary of voice settings or None if no valid settings are provided

Return type:

Dict[str, float | bool] | None
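
A usage sketch; the keys shown mirror the InputParams voice fields documented below, and which fields count as valid settings is an assumption:

    from pipecat.services.elevenlabs.tts import build_elevenlabs_voice_settings

    voice_settings = build_elevenlabs_voice_settings(
        {"stability": 0.7, "similarity_boost": 0.8, "use_speaker_boost": True}
    )
    # Either a dict of the valid settings or None if none were provided.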

pipecat.services.elevenlabs.tts.calculate_word_times(alignment_info, cumulative_time)[source]

Calculate word timing pairs from character alignment data, offset by cumulative_time.

Parameters:
  • alignment_info (Mapping[str, Any])

  • cumulative_time (float)

Return type:

List[Tuple[str, float]]

class pipecat.services.elevenlabs.tts.ElevenLabsTTSService(*, api_key, voice_id, model='eleven_flash_v2_5', url='wss://api.elevenlabs.io', sample_rate=None, params=None, **kwargs)[source]

Bases: AudioContextWordTTSService

ElevenLabs Text-to-Speech service using WebSocket streaming with word timestamps.

Parameters:
  • api_key (str) – ElevenLabs API key

  • voice_id (str) – ID of the voice to use

  • model (str) – Model ID (default: "eleven_flash_v2_5" for low latency)

  • url (str) – WebSocket API URL

  • sample_rate (int | None) – Output sample rate

  • params (InputParams | None) – Additional parameters for voice configuration

class InputParams(*, language=None, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, speed=None, auto_mode=True, enable_ssml_parsing=None, enable_logging=None)[source]

Bases: BaseModel

Parameters:
  • language (Language | None)

  • stability (float | None)

  • similarity_boost (float | None)

  • style (float | None)

  • use_speaker_boost (bool | None)

  • speed (float | None)

  • auto_mode (bool | None)

  • enable_ssml_parsing (bool | None)

  • enable_logging (bool | None)

language: Language | None
stability: float | None
similarity_boost: float | None
style: float | None
use_speaker_boost: bool | None
speed: float | None
auto_mode: bool | None
enable_ssml_parsing: bool | None
enable_logging: bool | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic's ConfigDict.
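
A construction sketch, assuming an ELEVENLABS_API_KEY environment variable and a placeholder voice ID (substitute your own):

    import os

    from pipecat.services.elevenlabs.tts import ElevenLabsTTSService
    from pipecat.transcriptions.language import Language

    tts = ElevenLabsTTSService(
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id="your-voice-id",  # placeholder: any voice ID from your ElevenLabs account
        params=ElevenLabsTTSService.InputParams(
            language=Language.EN,
            stability=0.7,
            similarity_boost=0.8,
        ),
    )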

can_generate_metrics()[source]

Indicate that this service can generate usage metrics.

Return type:

bool

language_to_service_language(language)[source]

Convert a language to the service-specific language format.

Parameters:

language (Language) – The language to convert.

Returns:

The service-specific language identifier, or None if not supported.

Return type:

str | None

async set_model(model)[source]

Set the TTS model to use.

Parameters:

model (str) – The name of the TTS model.
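
For example, on a constructed service instance (the model ID below is illustrative; any valid ElevenLabs model ID works):

    await tts.set_model("eleven_turbo_v2_5")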

async start(frame)[source]

Start the audio context TTS service.

Parameters:

frame (StartFrame) – The start frame containing initialization parameters.

async stop(frame)[source]

Stop the audio context TTS service.

Parameters:

frame (EndFrame) – The end frame.

async cancel(frame)[source]

Cancel the audio context TTS service.

Parameters:

frame (CancelFrame) – The cancel frame.

async flush_audio()[source]

Flush any buffered audio data.

async push_frame(frame, direction=FrameDirection.DOWNSTREAM)[source]

Push a frame downstream with TTS-specific handling.

Parameters:
  • frame (Frame) – The frame to push.

  • direction (FrameDirection) – The direction to push the frame.

async run_tts(text)[source]

Run text-to-speech synthesis on the provided text.

This method must be implemented by subclasses to provide actual TTS functionality.

Parameters:

text (str) – The text to synthesize into speech.

Yields:

Frame – Audio frames containing the synthesized speech.

Return type:

AsyncGenerator[Frame, None]
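
In normal use the pipeline drives this generator; as a sketch, it can also be iterated directly (TTSAudioRawFrame is pipecat's raw audio frame type from pipecat.frames.frames):

    from pipecat.frames.frames import TTSAudioRawFrame

    async for frame in tts.run_tts("Hello there!"):
        if isinstance(frame, TTSAudioRawFrame):
            ...  # handle synthesized audio; control frames are interleaved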

class pipecat.services.elevenlabs.tts.ElevenLabsHttpTTSService(*, api_key, voice_id, aiohttp_session, model='eleven_flash_v2_5', base_url='https://api.elevenlabs.io', sample_rate=None, params=None, **kwargs)[source]

Bases: WordTTSService

ElevenLabs Text-to-Speech service using HTTP streaming with word timestamps.

Parameters:
  • api_key (str) – ElevenLabs API key

  • voice_id (str) – ID of the voice to use

  • aiohttp_session (ClientSession) – aiohttp ClientSession

  • model (str) – Model ID (default: "eleven_flash_v2_5" for low latency)

  • base_url (str) – API base URL

  • sample_rate (int | None) – Output sample rate

  • params (InputParams | None) – Additional parameters for voice configuration

class InputParams(*, language=None, optimize_streaming_latency=None, stability=None, similarity_boost=None, style=None, use_speaker_boost=None, speed=None)[source]

Bases: BaseModel

Parameters:
  • language (Language | None)

  • optimize_streaming_latency (int | None)

  • stability (float | None)

  • similarity_boost (float | None)

  • style (float | None)

  • use_speaker_boost (bool | None)

  • speed (float | None)

language: Language | None
optimize_streaming_latency: int | None
stability: float | None
similarity_boost: float | None
style: float | None
use_speaker_boost: bool | None
speed: float | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic's ConfigDict.
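
A construction sketch for the HTTP variant, assuming an ELEVENLABS_API_KEY environment variable and a caller-managed aiohttp session:

    import os

    import aiohttp

    from pipecat.services.elevenlabs.tts import ElevenLabsHttpTTSService

    async def build_tts(session: aiohttp.ClientSession) -> ElevenLabsHttpTTSService:
        return ElevenLabsHttpTTSService(
            api_key=os.environ["ELEVENLABS_API_KEY"],
            voice_id="your-voice-id",  # placeholder voice ID
            aiohttp_session=session,
            params=ElevenLabsHttpTTSService.InputParams(
                optimize_streaming_latency=3,  # illustrative; ElevenLabs accepts 0-4
            ),
        )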

language_to_service_language(language)[source]

Convert pipecat Language to ElevenLabs language code.

Parameters:

language (Language)

Return type:

str | None

can_generate_metrics()[source]

Indicate that this service can generate usage metrics.

Return type:

bool

async start(frame)[source]

Initialize the service upon receiving a StartFrame.

Parameters:

frame (StartFrame)

async push_frame(frame, direction=FrameDirection.DOWNSTREAM)[source]

Push a frame downstream with TTS-specific handling.

Parameters:
  • frame (Frame) – The frame to push.

  • direction (FrameDirection) – The direction to push the frame.

calculate_word_times(alignment_info)[source]

Calculate word timing from character alignment data.

Example input data:

    {
        "characters": [" ", "H", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"],
        "character_start_times_seconds": [0.0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        "character_end_times_seconds": [0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    }

Would produce word times (with cumulative_time=0): [("Hello", 0.1), ("world", 0.5)]

Parameters:

alignment_info (Mapping[str, Any]) – Character timing data from ElevenLabs

Returns:

List of (word, timestamp) pairs

Return type:

List[Tuple[str, float]]
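
A call sketch (tts stands for a constructed ElevenLabsHttpTTSService; the expected output assumes the service's cumulative time is 0):

    alignment = {
        "characters": ["H", "i", " ", "a", "l", "l"],
        "character_start_times_seconds": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
        "character_end_times_seconds": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
    word_times = tts.calculate_word_times(alignment)
    # expected: [("Hi", 0.0), ("all", 0.3)]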

async run_tts(text)[source]

Generate speech from text using ElevenLabs streaming API with timestamps.

Makes a request to the ElevenLabs API to generate audio and timing data. Tracks the duration of each utterance to ensure correct sequencing. Includes previous text as context for better prosody continuity.

Parameters:

text (str) – Text to convert to speech

Yields:

Audio and control frames

Return type:

AsyncGenerator[Frame, None]