TTS

Cartesia text-to-speech service implementations.

pipecat.services.cartesia.tts.language_to_cartesia_language(language)

Convert a Language enum to a Cartesia language code.

Parameters: language (Language) – The Language enum value to convert.
Returns: The corresponding Cartesia language code, or None if not supported.
Return type: str | None
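The contract described above (an enum in, a code or None out) can be sketched as follows. This is a minimal illustration, not pipecat's implementation: the `Language` enum here is a hypothetical stand-in with assumed members, the `SUPPORTED` set is an assumed subset, and the fallback from a regional variant to its base code is an assumption.

```python
from enum import Enum
from typing import Optional


class Language(str, Enum):
    """Hypothetical stand-in for pipecat's Language enum (assumed members)."""

    EN = "en"
    EN_US = "en-US"
    FR = "fr"
    XX = "xx"  # placeholder for a language the service does not support


# Assumed subset of supported codes; the real table lives in pipecat.
SUPPORTED = {"en", "fr"}


def to_service_language(language: Language) -> Optional[str]:
    """Return a base language code, or None when it is not supported."""
    base = language.value.split("-")[0].lower()
    return base if base in SUPPORTED else None


# Callers need to handle the None case, for example by falling back to English.
code = to_service_language(Language.XX) or "en"
```

Because the return type is `str | None`, callers should decide explicitly what to do when no code is available rather than passing None on to the service.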

class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key, voice_id, cartesia_version='2025-04-16', url='wss://api.cartesia.ai/tts/websocket', model='sonic-2', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, text_aggregator=None, **kwargs)

Bases: AudioContextWordTTSService

Cartesia TTS service with WebSocket streaming and word timestamps.

Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including speed and emotion controls.

Parameters:

api_key (str) – Cartesia API key for authentication.
voice_id (str) – ID of the voice to use for synthesis.
cartesia_version (str) – API version string for Cartesia service.
url (str) – WebSocket URL for Cartesia TTS API.
model (str) – TTS model to use (e.g., “sonic-2”).
sample_rate (int | None) – Audio sample rate in Hz. If None, the default sample rate is used.
encoding (str) – Audio encoding format.
container (str) – Audio container format.
params (InputParams | None) – Additional input parameters for voice customization.
text_aggregator (BaseTextAggregator | None) – Custom text aggregator for processing input text.
**kwargs – Additional arguments passed to the parent service.
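A minimal sketch of how the keyword-only constructor might be called. The environment variable names (`CARTESIA_API_KEY`, `CARTESIA_VOICE_ID`) and the helper functions are assumptions made for illustration; only the parameter names and defaults come from the signature above. The pipecat import is deferred so the snippet loads even where pipecat is not installed.

```python
import os
from typing import Any, Dict


def cartesia_tts_kwargs(api_key: str, voice_id: str) -> Dict[str, Any]:
    """Collect keyword arguments, spelling out the documented defaults."""
    return {
        "api_key": api_key,
        "voice_id": voice_id,
        "model": "sonic-2",
        "encoding": "pcm_s16le",
        "container": "raw",
        "sample_rate": None,  # None defers to the default sample rate
    }


def make_service(kwargs: Dict[str, Any]):
    """Build the service; pipecat is imported lazily, only when this is called."""
    from pipecat.services.cartesia.tts import CartesiaTTSService

    return CartesiaTTSService(**kwargs)


# Hypothetical environment variable names; substitute your own configuration.
kwargs = cartesia_tts_kwargs(
    api_key=os.getenv("CARTESIA_API_KEY", "placeholder-key"),
    voice_id=os.getenv("CARTESIA_VOICE_ID", "placeholder-voice-id"),
)
```

All arguments are keyword-only (note the leading `*` in the signature), so positional calls such as `CartesiaTTSService(key, voice)` would be rejected.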

class InputParams(*, language=Language.EN, speed='', emotion=[])

Bases: BaseModel

Input parameters for Cartesia TTS configuration.

Parameters:

language (Language | None) – Language to use for synthesis.
speed (str | float | None) – Voice speed control (string or float).
emotion (List[str] | None) – List of emotion controls (deprecated).

language: Language | None

speed: str | float | None

emotion: List[str] | None

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

can_generate_metrics()

Check if this service can generate processing metrics.

Returns: True, as the Cartesia service supports metrics generation.
Return type: bool

async set_model(model)

Set the TTS model.

Parameters: model (str) – The model name to use for synthesis.

language_to_service_language(language)

Convert a Language enum to Cartesia language format.

Parameters: language (Language) – The language to convert.
Returns: The Cartesia-specific language code, or None if not supported.
Return type: str | None

async start(frame)

Start the Cartesia TTS service.

Parameters: frame (StartFrame) – The start frame containing initialization parameters.

async stop(frame)

Stop the Cartesia TTS service.

Parameters: frame (EndFrame) – The end frame.

async cancel(frame)

Cancel the Cartesia TTS service.

Parameters: frame (CancelFrame) – The cancel frame.

async flush_audio()

Flush any pending audio and finalize the current context.

async run_tts(text)

Generate speech from text using Cartesia’s streaming API.

Parameters: text (str) – The text to synthesize into speech.
Yields: Frame – Audio frames containing the synthesized speech.
Return type: AsyncGenerator[Frame, None]
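Since run_tts is an async generator, its output is consumed with `async for` rather than awaited as a single value. The sketch below shows only that consumption pattern: `fake_run_tts` is a hypothetical stand-in that yields plain bytes instead of pipecat Frame objects, and no network call is made.

```python
import asyncio
from typing import AsyncGenerator, List


async def fake_run_tts(text: str) -> AsyncGenerator[bytes, None]:
    """Hypothetical stand-in for run_tts: yields one chunk of bytes per word."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word.encode("utf-8")


async def collect(text: str) -> List[bytes]:
    """Consume the generator chunk by chunk as results become available."""
    chunks: List[bytes] = []
    async for chunk in fake_run_tts(text):
        chunks.append(chunk)
    return chunks


chunks = asyncio.run(collect("hello streaming world"))
```

The same loop shape applies to any `AsyncGenerator[Frame, None]`: each item can be handled as soon as it is yielded, without waiting for the full utterance.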

class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key, voice_id, model='sonic-2', base_url='https://api.cartesia.ai', cartesia_version='2024-11-13', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, **kwargs)

Bases: TTSService

Cartesia HTTP-based TTS service.

Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.

Parameters:

api_key (str) – Cartesia API key for authentication.
voice_id (str) – ID of the voice to use for synthesis.
model (str) – TTS model to use (e.g., “sonic-2”).
base_url (str) – Base URL for Cartesia HTTP API.
cartesia_version (str) – API version string for Cartesia service.
sample_rate (int | None) – Audio sample rate in Hz. If None, the default sample rate is used.
encoding (str) – Audio encoding format.
container (str) – Audio container format.
params (InputParams | None) – Additional input parameters for voice customization.
**kwargs – Additional arguments passed to the parent TTSService.
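The default encoding, pcm_s16le, is signed 16-bit little-endian PCM, so each sample occupies two bytes. The following sketch works through the resulting arithmetic between byte counts and audio duration. Mono audio and the sample rates used in the usage lines are assumptions chosen for illustration; the helper names are hypothetical.

```python
BYTES_PER_SAMPLE = 2  # pcm_s16le: 16 bits per sample = 2 bytes


def pcm_s16le_duration_seconds(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Duration of a raw pcm_s16le buffer, given its size in bytes."""
    return num_bytes / (BYTES_PER_SAMPLE * channels * sample_rate)


def pcm_s16le_byte_count(seconds: float, sample_rate: int, channels: int = 1) -> int:
    """Number of bytes needed to hold the given duration of pcm_s16le audio."""
    return int(seconds * sample_rate) * BYTES_PER_SAMPLE * channels


# Illustrative values only: 48000 bytes of mono audio at an assumed 24000 Hz.
one_second = pcm_s16le_duration_seconds(48000, sample_rate=24000)
half_second_bytes = pcm_s16le_byte_count(0.5, sample_rate=16000)
```

This kind of calculation is useful for sizing buffers or estimating playback time, since a raw container carries no header that records the duration.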

class InputParams(*, language=Language.EN, speed='', emotion=<factory>)

Bases: BaseModel

Input parameters for Cartesia HTTP TTS configuration.

Parameters:

language (Language | None) – Language to use for synthesis.
speed (str | float | None) – Voice speed control (string or float).
emotion (List[str] | None) – List of emotion controls (deprecated).

language: Language | None

speed: str | float | None

emotion: List[str] | None

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

can_generate_metrics()

Check if this service can generate processing metrics.

Returns: True, as the Cartesia HTTP service supports metrics generation.
Return type: bool

language_to_service_language(language)

Convert a Language enum to Cartesia language format.

Parameters: language (Language) – The language to convert.
Returns: The Cartesia-specific language code, or None if not supported.
Return type: str | None

async start(frame)

Start the Cartesia HTTP TTS service.

Parameters: frame (StartFrame) – The start frame containing initialization parameters.

async stop(frame)

Stop the Cartesia HTTP TTS service.

Parameters: frame (EndFrame) – The end frame.

async cancel(frame)

Cancel the Cartesia HTTP TTS service.

Parameters: frame (CancelFrame) – The cancel frame.

async run_tts(text)

Generate speech from text using Cartesia’s HTTP API.

Parameters: text (str) – The text to synthesize into speech.
Yields: Frame – Audio frames containing the synthesized speech.
Return type: AsyncGenerator[Frame, None]
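With the default container='raw', the synthesized audio has no file header, so most players cannot open it directly. One possible way to make it playable is to wrap the bytes in a WAV container using Python's standard `wave` module, as sketched below. The sketch assumes mono pcm_s16le audio at 24000 Hz and uses generated silence in place of real service output; how the raw bytes are gathered from the yielded frames is not shown and would depend on the frame type.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap raw pcm_s16le samples in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample for 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Stand-in for service output: 0.1 s of silence at an assumed 24000 Hz, mono.
silence = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav_bytes(silence, sample_rate=24000)

# Read the result back to confirm the header matches the input parameters.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    frame_rate = check.getframerate()
    frame_count = check.getnframes()
```

The sample rate passed to the WAV writer must match the rate the audio was synthesized at; a mismatch would change the playback speed and pitch.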