TTS

Cartesia text-to-speech service implementations.

pipecat.services.cartesia.tts.language_to_cartesia_language(language)

Convert a Language enum to Cartesia language code.

Parameters:

language (Language) – The Language enum value to convert.

Returns:

The corresponding Cartesia language code, or None if not supported.

Return type:

str | None
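The "code or None" contract described above can be sketched as a plain dictionary lookup. This is only an illustration: the `Language` enum below is a small stand-in for pipecat's own, and the table contents and the `resolve_language` helper are hypothetical, not the library's actual mapping.

```python
from enum import Enum
from typing import Optional


class Language(Enum):
    """Stand-in for pipecat's Language enum (illustrative subset only)."""
    EN = "en"
    FR = "fr"
    TLH = "tlh"  # deliberately absent from the table below


# Hypothetical lookup table; the real set of supported languages may differ.
_SUPPORTED = {
    Language.EN: "en",
    Language.FR: "fr",
}


def to_service_language(language: Language) -> Optional[str]:
    """Return the service language code, or None if the language is unsupported."""
    return _SUPPORTED.get(language)


def resolve_language(language: Language, default: str = "en") -> str:
    """Caller-side handling of the None case: fall back to a default code."""
    code = to_service_language(language)
    return code if code is not None else default
```

Because the function may return None, callers probably want an explicit fallback like `resolve_language` rather than passing the result straight through.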

class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key, voice_id, cartesia_version='2025-04-16', url='wss://api.cartesia.ai/tts/websocket', model='sonic-2', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, text_aggregator=None, **kwargs)

Bases: AudioContextWordTTSService

Cartesia TTS service with WebSocket streaming and word timestamps.

Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including speed and emotion controls.

Parameters:
  • api_key (str) – Cartesia API key for authentication.

  • voice_id (str) – ID of the voice to use for synthesis.

  • cartesia_version (str) – API version string for the Cartesia service.

  • url (str) – WebSocket URL for Cartesia TTS API.

  • model (str) – TTS model to use (e.g., “sonic-2”).

  • sample_rate (int | None) – Audio sample rate. If None, uses default.

  • encoding (str) – Audio encoding format.

  • container (str) – Audio container format.

  • params (InputParams | None) – Additional input parameters for voice customization.

  • text_aggregator (BaseTextAggregator | None) – Custom text aggregator for processing input text.

  • **kwargs – Additional arguments passed to the parent service.
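As a minimal sketch of how the keyword-only arguments above might be assembled, the helper below collects them into a dictionary. The helper name and the `CARTESIA_API_KEY` environment variable are assumptions made for illustration; only the keyword names and defaults come from the signature above.

```python
import os
from typing import Any, Dict, Optional


def cartesia_tts_kwargs(
    voice_id: str,
    *,
    model: str = "sonic-2",
    sample_rate: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments named after the documented constructor signature."""
    api_key = os.environ.get("CARTESIA_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("CARTESIA_API_KEY is not set")
    return {
        "api_key": api_key,
        "voice_id": voice_id,
        "model": model,
        "sample_rate": sample_rate,  # None leaves the choice to the service
        "encoding": "pcm_s16le",     # documented default
        "container": "raw",          # documented default
    }


# With pipecat installed, the result could be passed on like this:
#   from pipecat.services.cartesia.tts import CartesiaTTSService
#   tts = CartesiaTTSService(**cartesia_tts_kwargs("your-voice-id"))
```

Keeping the API key out of source code and failing early when it is missing tends to give clearer errors than a rejected WebSocket connection later on.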

class InputParams(*, language=Language.EN, speed='', emotion=[])

Bases: BaseModel

Input parameters for Cartesia TTS configuration.

Parameters:
  • language (Language | None) – Language to use for synthesis.

  • speed (str | float | None) – Voice speed control (string or float).

  • emotion (List[str] | None) – List of emotion controls (deprecated).

language: Language | None
speed: str | float | None
emotion: List[str] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

can_generate_metrics()

Check if this service can generate processing metrics.

Returns:

True, as the Cartesia service supports metrics generation.

Return type:

bool

async set_model(model)

Set the TTS model.

Parameters:

model (str) – The model name to use for synthesis.

language_to_service_language(language)

Convert a Language enum to Cartesia language format.

Parameters:

language (Language) – The language to convert.

Returns:

The Cartesia-specific language code, or None if not supported.

Return type:

str | None

async start(frame)

Start the Cartesia TTS service.

Parameters:

frame (StartFrame) – The start frame containing initialization parameters.

async stop(frame)

Stop the Cartesia TTS service.

Parameters:

frame (EndFrame) – The end frame.

async cancel(frame)

Cancel the Cartesia TTS service.

Parameters:

frame (CancelFrame) – The cancel frame.

async flush_audio()

Flush any pending audio and finalize the current context.

async run_tts(text)

Generate speech from text using Cartesia’s streaming API.

Parameters:

text (str) – The text to synthesize into speech.

Yields:

Frame – Audio frames containing the synthesized speech.

Return type:

AsyncGenerator[Frame, None]
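Since `run_tts` is an async generator, its frames are consumed with `async for`. The sketch below shows that consumption pattern only. `FakeAudioFrame` and `fake_run_tts` are hypothetical stand-ins that emit silent placeholder bytes; they do not call Cartesia or reproduce pipecat's real frame classes.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List


@dataclass
class FakeAudioFrame:
    """Stand-in for an audio frame; not pipecat's Frame class."""
    audio: bytes


async def fake_run_tts(text: str) -> AsyncGenerator[FakeAudioFrame, None]:
    """Hypothetical generator yielding one chunk of silence per word."""
    for word in text.split():
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield FakeAudioFrame(audio=b"\x00\x00" * len(word))


async def collect_audio(text: str) -> bytes:
    """Drain the generator and concatenate the audio payloads."""
    chunks: List[bytes] = []
    async for frame in fake_run_tts(text):
        chunks.append(frame.audio)
    return b"".join(chunks)


if __name__ == "__main__":
    print(len(asyncio.run(collect_audio("hello world"))))
```

Processing each frame as it arrives, rather than collecting everything first, is what lets playback begin before synthesis has finished.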

class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key, voice_id, model='sonic-2', base_url='https://api.cartesia.ai', cartesia_version='2024-11-13', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, **kwargs)

Bases: TTSService

Cartesia HTTP-based TTS service.

Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.

Parameters:
  • api_key (str) – Cartesia API key for authentication.

  • voice_id (str) – ID of the voice to use for synthesis.

  • model (str) – TTS model to use (e.g., “sonic-2”).

  • base_url (str) – Base URL for Cartesia HTTP API.

  • cartesia_version (str) – API version string for the Cartesia service.

  • sample_rate (int | None) – Audio sample rate. If None, uses default.

  • encoding (str) – Audio encoding format.

  • container (str) – Audio container format.

  • params (InputParams | None) – Additional input parameters for voice customization.

  • **kwargs – Additional arguments passed to the parent TTSService.
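The defaults `encoding='pcm_s16le'` and `container='raw'` describe headerless signed 16-bit little-endian PCM. As a sketch of what that implies for a consumer, the helper below wraps such bytes in a WAV container using only the standard library. The 24000 Hz rate, the mono channel count and the test-tone generator are assumptions for illustration, not values taken from the service.

```python
import io
import math
import struct
import wave


def pcm_s16le_to_wav(pcm: bytes, sample_rate: int, channels: int = 1) -> bytes:
    """Wrap headerless 16-bit little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def sine_pcm(freq_hz: float, seconds: float, sample_rate: int) -> bytes:
    """Generate a half-amplitude test tone as signed 16-bit little-endian samples."""
    n = int(seconds * sample_rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


if __name__ == "__main__":
    # Assumed rate; use whatever rate the service was actually configured with.
    pcm = sine_pcm(440.0, 0.5, 24000)
    with open("tone.wav", "wb") as f:
        f.write(pcm_s16le_to_wav(pcm, 24000))
```

Raw PCM carries no header, so the sample rate and channel count have to be tracked alongside the bytes; a mismatch here plays audio back at the wrong speed and pitch.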

class InputParams(*, language=Language.EN, speed='', emotion=<factory>)

Bases: BaseModel

Input parameters for Cartesia HTTP TTS configuration.

Parameters:
  • language (Language | None) – Language to use for synthesis.

  • speed (str | float | None) – Voice speed control (string or float).

  • emotion (List[str] | None) – List of emotion controls (deprecated).

language: Language | None
speed: str | float | None
emotion: List[str] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
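The `emotion=<factory>` default in the signature above suggests the list is created per instance rather than shared. The sketch below illustrates that behavior with a standard-library dataclass. `ParamsSketch` is a stand-in that mirrors the documented field names; it is not the pydantic model, and the emotion value used is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ParamsSketch:
    """Stand-in mirroring the documented fields; not the real InputParams."""
    language: Optional[str] = "en"
    speed: Union[str, float, None] = ""
    # default_factory builds a fresh list for every instance
    emotion: Optional[List[str]] = field(default_factory=list)


a = ParamsSketch()
b = ParamsSketch(speed=1.2)
a.emotion.append("example-tag")  # placeholder value; mutates only a's list
```

With a factory default, appending to one instance's `emotion` list leaves every other instance untouched, which a single shared list object would not guarantee.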

can_generate_metrics()

Check if this service can generate processing metrics.

Returns:

True, as the Cartesia HTTP service supports metrics generation.

Return type:

bool

language_to_service_language(language)

Convert a Language enum to Cartesia language format.

Parameters:

language (Language) – The language to convert.

Returns:

The Cartesia-specific language code, or None if not supported.

Return type:

str | None

async start(frame)

Start the Cartesia HTTP TTS service.

Parameters:

frame (StartFrame) – The start frame containing initialization parameters.

async stop(frame)

Stop the Cartesia HTTP TTS service.

Parameters:

frame (EndFrame) – The end frame.

async cancel(frame)

Cancel the Cartesia HTTP TTS service.

Parameters:

frame (CancelFrame) – The cancel frame.

async run_tts(text)

Generate speech from text using Cartesia’s HTTP API.

Parameters:

text (str) – The text to synthesize into speech.

Yields:

Frame – Audio frames containing the synthesized speech.

Return type:

AsyncGenerator[Frame, None]