TTS
Cartesia text-to-speech service implementations.
- pipecat.services.cartesia.tts.language_to_cartesia_language(language)[source]
Convert a Language enum to Cartesia language code.
- Parameters:
language (Language) – The Language enum value to convert.
- Returns:
The corresponding Cartesia language code, or None if not supported.
- Return type:
str | None
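Because the function returns None for unsupported languages, callers usually need a fallback. The following is a minimal sketch, assuming that falling back to "en" is acceptable for your application. The helper name `cartesia_code_or_default` and its `convert` parameter are hypothetical; they are not part of pipecat and exist only so the logic can be exercised without pipecat installed.

```python
def cartesia_code_or_default(language, default="en", convert=None):
    """Return the Cartesia code for `language`, or `default` if unsupported.

    `convert` is a hypothetical injection point for testing; when omitted,
    the documented pipecat function is imported lazily and used.
    """
    if convert is None:
        from pipecat.services.cartesia.tts import (
            language_to_cartesia_language as convert,
        )
    code = convert(language)
    # The documented contract is "str | None", so only None triggers the fallback.
    return default if code is None else code
```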
- class pipecat.services.cartesia.tts.CartesiaTTSService(*, api_key, voice_id, cartesia_version='2025-04-16', url='wss://api.cartesia.ai/tts/websocket', model='sonic-2', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, text_aggregator=None, **kwargs)[source]
Bases:
AudioContextWordTTSService
Cartesia TTS service with WebSocket streaming and word timestamps.
Provides text-to-speech using Cartesia’s streaming WebSocket API. Supports word-level timestamps, audio context management, and various voice customization options including speed and emotion controls.
- Parameters:
api_key (str) – Cartesia API key for authentication.
voice_id (str) – ID of the voice to use for synthesis.
cartesia_version (str) – API version string for Cartesia service.
url (str) – WebSocket URL for Cartesia TTS API.
model (str) – TTS model to use (e.g., “sonic-2”).
sample_rate (int | None) – Audio sample rate in Hz. If None, a default rate is used.
encoding (str) – Audio encoding format.
container (str) – Audio container format.
params (InputParams | None) – Additional input parameters for voice customization.
text_aggregator (BaseTextAggregator | None) – Custom text aggregator for processing input text.
**kwargs – Additional arguments passed to the parent service.
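The keyword-only constructor above can be exercised as in the sketch below. It assumes the API key lives in an environment variable named `CARTESIA_API_KEY` (an assumption, not something this page specifies), and the helper names `cartesia_tts_kwargs` and `make_cartesia_tts` are hypothetical. The explicit values simply restate the documented defaults.

```python
import os


def cartesia_tts_kwargs(voice_id, api_key=None):
    """Collect constructor arguments, restating the documented defaults."""
    return {
        # Assumption: the key is read from CARTESIA_API_KEY when not passed in.
        "api_key": api_key or os.environ.get("CARTESIA_API_KEY", ""),
        "voice_id": voice_id,
        "model": "sonic-2",
        "encoding": "pcm_s16le",
        "container": "raw",
    }


def make_cartesia_tts(voice_id, api_key=None):
    """Build the WebSocket service; pipecat is imported only when called."""
    from pipecat.services.cartesia.tts import CartesiaTTSService

    return CartesiaTTSService(**cartesia_tts_kwargs(voice_id, api_key))
```

Keeping the argument dictionary separate from the constructor call makes it easy to log or test the configuration without opening a connection.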
- class InputParams(*, language=Language.EN, speed='', emotion=[])[source]
Bases:
BaseModel
Input parameters for Cartesia TTS configuration.
- Parameters:
language (Language | None) – Language to use for synthesis.
speed (str | float | None) – Voice speed control (string or float).
emotion (List[str] | None) – List of emotion controls (deprecated).
- language: Language | None
- speed: str | float | None
- emotion: List[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
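A sketch of passing voice customization through `InputParams` follows. Two things here are assumptions rather than facts from this page: the import path `pipecat.transcriptions.language` for the `Language` enum, and `"slow"` as an accepted string value for `speed`. The helper names are hypothetical. The deprecated `emotion` field is deliberately left unset.

```python
def input_params_fields(language, speed=""):
    """Return keyword arguments for InputParams, omitting deprecated emotion."""
    return {"language": language, "speed": speed}


def make_spanish_tts(api_key, voice_id):
    """Build a service that synthesizes Spanish at a slower pace."""
    from pipecat.services.cartesia.tts import CartesiaTTSService
    # Assumption: Language is importable from this module path.
    from pipecat.transcriptions.language import Language

    # Assumption: "slow" is a speed string the Cartesia API accepts.
    fields = input_params_fields(Language.ES, speed="slow")
    params = CartesiaTTSService.InputParams(**fields)
    return CartesiaTTSService(api_key=api_key, voice_id=voice_id, params=params)
```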
- can_generate_metrics()[source]
Check if this service can generate processing metrics.
- Returns:
True, as Cartesia service supports metrics generation.
- Return type:
bool
- async set_model(model)[source]
Set the TTS model.
- Parameters:
model (str) – The model name to use for synthesis.
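Since `set_model` is a coroutine, it has to be awaited from async code. The sketch below shows the call shape only; `StandInTTS` is a hypothetical stand-in that records the model name so the example runs without a Cartesia connection, and `switch_model` is likewise a hypothetical helper.

```python
import asyncio


async def switch_model(tts, model="sonic-2"):
    """Await set_model() on any object exposing the documented coroutine."""
    await tts.set_model(model)


class StandInTTS:
    """Hypothetical stand-in for CartesiaTTSService, used only for this demo."""

    def __init__(self):
        self.model = None

    async def set_model(self, model):
        self.model = model


if __name__ == "__main__":
    service = StandInTTS()
    asyncio.run(switch_model(service, "sonic-2"))
    print(service.model)
```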
- language_to_service_language(language)[source]
Convert a Language enum to Cartesia language format.
- Parameters:
language (Language) – The language to convert.
- Returns:
The Cartesia-specific language code, or None if not supported.
- Return type:
str | None
- async start(frame)[source]
Start the Cartesia TTS service.
- Parameters:
frame (StartFrame) – The start frame containing initialization parameters.
- async stop(frame)[source]
Stop the Cartesia TTS service.
- Parameters:
frame (EndFrame) – The end frame.
- async cancel(frame)[source]
Cancel the Cartesia TTS service.
- Parameters:
frame (CancelFrame) – The cancel frame.
- async flush_audio()[source]
Flush any pending audio and finalize the current context.
- async run_tts(text)[source]
Generate speech from text using Cartesia’s streaming API.
- Parameters:
text (str) – The text to synthesize into speech.
- Yields:
Frame – Audio frames containing the synthesized speech.
- Return type:
AsyncGenerator[Frame, None]
- class pipecat.services.cartesia.tts.CartesiaHttpTTSService(*, api_key, voice_id, model='sonic-2', base_url='https://api.cartesia.ai', cartesia_version='2024-11-13', sample_rate=None, encoding='pcm_s16le', container='raw', params=None, **kwargs)[source]
Bases:
TTSService
Cartesia HTTP-based TTS service.
Provides text-to-speech using Cartesia’s HTTP API for simpler, non-streaming synthesis. Suitable for use cases where streaming is not required and simpler integration is preferred.
- Parameters:
api_key (str) – Cartesia API key for authentication.
voice_id (str) – ID of the voice to use for synthesis.
model (str) – TTS model to use (e.g., “sonic-2”).
base_url (str) – Base URL for Cartesia HTTP API.
cartesia_version (str) – API version string for Cartesia service.
sample_rate (int | None) – Audio sample rate in Hz. If None, a default rate is used.
encoding (str) – Audio encoding format.
container (str) – Audio container format.
params (InputParams | None) – Additional input parameters for voice customization.
**kwargs – Additional arguments passed to the parent TTSService.
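Both service classes share `api_key`, `voice_id`, and `model`, so an application can choose between them with a flag. This is a rough sketch of that choice, not a pattern prescribed by pipecat; the helper names `cartesia_tts_class_name` and `make_tts` are hypothetical.

```python
import importlib


def cartesia_tts_class_name(streaming):
    """Pick the documented class name for WebSocket or HTTP synthesis."""
    return "CartesiaTTSService" if streaming else "CartesiaHttpTTSService"


def make_tts(streaming, api_key, voice_id, model="sonic-2"):
    """Instantiate the chosen service; pipecat is imported only when called."""
    module = importlib.import_module("pipecat.services.cartesia.tts")
    service_cls = getattr(module, cartesia_tts_class_name(streaming))
    # Only arguments common to both documented constructors are passed.
    return service_cls(api_key=api_key, voice_id=voice_id, model=model)
```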
- class InputParams(*, language=Language.EN, speed='', emotion=<factory>)[source]
Bases:
BaseModel
Input parameters for Cartesia HTTP TTS configuration.
- Parameters:
language (Language | None) – Language to use for synthesis.
speed (str | float | None) – Voice speed control (string or float).
emotion (List[str] | None) – List of emotion controls (deprecated).
- language: Language | None
- speed: str | float | None
- emotion: List[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- can_generate_metrics()[source]
Check if this service can generate processing metrics.
- Returns:
True, as Cartesia HTTP service supports metrics generation.
- Return type:
bool
- language_to_service_language(language)[source]
Convert a Language enum to Cartesia language format.
- Parameters:
language (Language) – The language to convert.
- Returns:
The Cartesia-specific language code, or None if not supported.
- Return type:
str | None
- async start(frame)[source]
Start the Cartesia HTTP TTS service.
- Parameters:
frame (StartFrame) – The start frame containing initialization parameters.
- async stop(frame)[source]
Stop the Cartesia HTTP TTS service.
- Parameters:
frame (EndFrame) – The end frame.
- async cancel(frame)[source]
Cancel the Cartesia HTTP TTS service.
- Parameters:
frame (CancelFrame) – The cancel frame.
- async run_tts(text)[source]
Generate speech from text using Cartesia’s HTTP API.
- Parameters:
text (str) – The text to synthesize into speech.
- Yields:
Frame – Audio frames containing the synthesized speech.
- Return type:
AsyncGenerator[Frame, None]
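`run_tts` is an async generator, so its frames are consumed with `async for`. In a pipecat pipeline the framework normally drives this method itself; the sketch below only illustrates the iteration pattern. `StandInTTS` is a hypothetical stand-in that yields strings instead of real frames, and `collect_frames` is a hypothetical helper. Calling `run_tts` directly on a real service is assumed to require that the service has already been started.

```python
import asyncio


async def collect_frames(tts, text):
    """Drain run_tts() and return every frame it yields, in order."""
    frames = []
    async for frame in tts.run_tts(text):
        frames.append(frame)
    return frames


class StandInTTS:
    """Hypothetical stand-in so the sketch runs without credentials."""

    async def run_tts(self, text):
        # Yield one placeholder "frame" per word instead of real audio.
        for word in text.split():
            yield f"frame:{word}"


if __name__ == "__main__":
    print(asyncio.run(collect_frames(StandInTTS(), "hello world")))
```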