Gemini

pipecat.services.gemini_multimodal_live.gemini.language_to_gemini_language(language)[source]

Maps a Language enum value to a Gemini Live supported language code.

Source: https://ai.google.dev/api/generate-content#MediaResolution

Returns None if the language is not supported by Gemini Live.

Parameters:

language (Language)

Return type:

str | None
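
Example (a minimal sketch; assumes the Language enum is importable from pipecat.transcriptions.language, and the fallback value is an illustrative choice):

    from pipecat.services.gemini_multimodal_live.gemini import language_to_gemini_language
    from pipecat.transcriptions.language import Language

    code = language_to_gemini_language(Language.EN_US)
    if code is None:
        # The language is not supported by Gemini Live; fall back to a default.
        code = "en-US"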

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveContext(messages=None, tools=NOT_GIVEN, tool_choice=NOT_GIVEN)[source]

Bases: OpenAILLMContext

Parameters:
  • messages (List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | None)

  • tools (List[ChatCompletionToolParam] | NotGiven | ToolsSchema)

  • tool_choice (Literal['none', 'auto', 'required'] | ChatCompletionNamedToolChoiceParam | NotGiven)

static upgrade(obj)[source]
Parameters:

obj (OpenAILLMContext)

Return type:

GeminiMultimodalLiveContext

extract_system_instructions()[source]

get_messages_for_initializing_history()[source]
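
Example (a minimal sketch of upgrading an existing OpenAI-style context; the OpenAILLMContext import path and the in-place upgrade behavior are assumptions):

    from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
    from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveContext

    context = OpenAILLMContext(
        messages=[{"role": "system", "content": "You are a helpful assistant."}]
    )
    # upgrade() returns the same context re-typed as a GeminiMultimodalLiveContext.
    gemini_context = GeminiMultimodalLiveContext.upgrade(context)
    system = gemini_context.extract_system_instructions()
    history = gemini_context.get_messages_for_initializing_history()
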
class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveUserContextAggregator(context, *, params=None, **kwargs)[source]

Bases: OpenAIUserContextAggregator

Parameters:
  • context (OpenAILLMContext)

  • params (LLMUserAggregatorParams | None)

async process_frame(frame, direction)[source]
Parameters:
  • frame (Frame)

  • direction (FrameDirection)

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveAssistantContextAggregator(context, *, params=None, **kwargs)[source]

Bases: OpenAIAssistantContextAggregator

Parameters:
  • context (OpenAILLMContext)

  • params (LLMAssistantAggregatorParams | None)

async process_frame(frame, direction)[source]
Parameters:
  • frame (Frame)

  • direction (FrameDirection)

async handle_user_image_frame(frame)[source]

Handle a user image frame from a function call request.

Marks the associated function call as completed and adds the image to the context for processing.

Parameters:

frame (UserImageRawFrame) – Frame containing the user image and request context.

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveContextAggregatorPair(_user: pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveUserContextAggregator, _assistant: pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveAssistantContextAggregator)[source]

Bases: object

Parameters:
  • _user (GeminiMultimodalLiveUserContextAggregator)

  • _assistant (GeminiMultimodalLiveAssistantContextAggregator)

user()[source]
Return type:

GeminiMultimodalLiveUserContextAggregator

assistant()[source]
Return type:

GeminiMultimodalLiveAssistantContextAggregator

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalModalities(*values)[source]

Bases: Enum

TEXT = 'TEXT'
AUDIO = 'AUDIO'

class pipecat.services.gemini_multimodal_live.gemini.GeminiMediaResolution(*values)[source]

Bases: str, Enum

Media resolution options for Gemini Multimodal Live.

UNSPECIFIED = 'MEDIA_RESOLUTION_UNSPECIFIED'
LOW = 'MEDIA_RESOLUTION_LOW'
MEDIUM = 'MEDIA_RESOLUTION_MEDIUM'
HIGH = 'MEDIA_RESOLUTION_HIGH'

class pipecat.services.gemini_multimodal_live.gemini.GeminiVADParams(*, disabled=None, start_sensitivity=None, end_sensitivity=None, prefix_padding_ms=None, silence_duration_ms=None)[source]

Bases: BaseModel

Voice Activity Detection parameters.

Parameters:
  • disabled (bool | None)

  • start_sensitivity (StartSensitivity | None)

  • end_sensitivity (EndSensitivity | None)

  • prefix_padding_ms (int | None)

  • silence_duration_ms (int | None)

disabled: bool | None
start_sensitivity: StartSensitivity | None
end_sensitivity: EndSensitivity | None
prefix_padding_ms: int | None
silence_duration_ms: int | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
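
Example (a minimal sketch; the numeric values are illustrative, and start_sensitivity / end_sensitivity are omitted because their enum types come from the Google GenAI SDK, which is not shown here):

    from pipecat.services.gemini_multimodal_live.gemini import GeminiVADParams

    vad = GeminiVADParams(
        prefix_padding_ms=300,    # audio retained before detected speech start
        silence_duration_ms=800,  # silence required before end-of-speech is declared
    )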

class pipecat.services.gemini_multimodal_live.gemini.ContextWindowCompressionParams(*, enabled=False, trigger_tokens=None)[source]

Bases: BaseModel

Parameters for context window compression.

Parameters:
  • enabled (bool)

  • trigger_tokens (int | None)

enabled: bool
trigger_tokens: int | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
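
Example (a minimal sketch; the trigger value is illustrative):

    from pipecat.services.gemini_multimodal_live.gemini import ContextWindowCompressionParams

    # Compress the context once it grows past roughly 16k tokens.
    compression = ContextWindowCompressionParams(enabled=True, trigger_tokens=16000)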

class pipecat.services.gemini_multimodal_live.gemini.InputParams(*, frequency_penalty=None, max_tokens=4096, presence_penalty=None, temperature=None, top_k=None, top_p=None, modalities=GeminiMultimodalModalities.AUDIO, language=Language.EN_US, media_resolution=GeminiMediaResolution.UNSPECIFIED, vad=None, context_window_compression=None, extra=<factory>)[source]

Bases: BaseModel

Parameters:
  • frequency_penalty (float | None)

  • max_tokens (int | None)

  • presence_penalty (float | None)

  • temperature (float | None)

  • top_k (int | None)

  • top_p (float | None)

  • modalities (GeminiMultimodalModalities | None)

  • language (Language | None)

  • media_resolution (GeminiMediaResolution | None)

  • vad (GeminiVADParams | None)

  • context_window_compression (ContextWindowCompressionParams | None)

  • extra (Dict[str, Any] | None)

frequency_penalty: float | None
max_tokens: int | None
presence_penalty: float | None
temperature: float | None
top_k: int | None
top_p: float | None
modalities: GeminiMultimodalModalities | None
language: Language | None
media_resolution: GeminiMediaResolution | None
vad: GeminiVADParams | None
context_window_compression: ContextWindowCompressionParams | None
extra: Dict[str, Any] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
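
Example (a minimal sketch combining the options above; field values are illustrative, and the Language import path is assumed):

    from pipecat.services.gemini_multimodal_live.gemini import (
        ContextWindowCompressionParams,
        GeminiMediaResolution,
        GeminiMultimodalModalities,
        GeminiVADParams,
        InputParams,
    )
    from pipecat.transcriptions.language import Language

    params = InputParams(
        temperature=0.7,
        max_tokens=2048,
        modalities=GeminiMultimodalModalities.AUDIO,
        language=Language.EN_US,
        media_resolution=GeminiMediaResolution.MEDIUM,
        vad=GeminiVADParams(silence_duration_ms=800),
        context_window_compression=ContextWindowCompressionParams(enabled=True),
    )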

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveLLMService(*, api_key, base_url='generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent', model='models/gemini-2.0-flash-live-001', voice_id='Charon', start_audio_paused=False, start_video_paused=False, system_instruction=None, tools=None, params=None, inference_on_context_initialization=True, **kwargs)[source]

Bases: LLMService

Provides access to Google’s Gemini Multimodal Live API.

This service enables real-time conversations with Gemini, supporting both text and audio modalities. It handles voice transcription, streaming audio responses, and tool usage.

Parameters:
  • api_key (str) – Google AI API key

  • base_url (str, optional) – API endpoint base URL. Defaults to “generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent”.

  • model (str, optional) – Model identifier to use. Defaults to “models/gemini-2.0-flash-live-001”.

  • voice_id (str, optional) – TTS voice identifier. Defaults to “Charon”.

  • start_audio_paused (bool, optional) – Whether to start with audio input paused. Defaults to False.

  • start_video_paused (bool, optional) – Whether to start with video input paused. Defaults to False.

  • system_instruction (str, optional) – System prompt for the model. Defaults to None.

  • tools (Union[List[dict], ToolsSchema], optional) – Tools/functions available to the model. Defaults to None.

  • params (InputParams, optional) – Configuration parameters for the model. Defaults to InputParams().

  • inference_on_context_initialization (bool, optional) – Whether to generate a response when context is first set. Defaults to True.
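
Example (a minimal sketch of constructing the service; the GOOGLE_API_KEY environment variable name and the parameter values are illustrative):

    import os

    from pipecat.services.gemini_multimodal_live.gemini import (
        GeminiMultimodalLiveLLMService,
        InputParams,
    )

    llm = GeminiMultimodalLiveLLMService(
        api_key=os.getenv("GOOGLE_API_KEY"),
        voice_id="Charon",
        system_instruction="You are a helpful voice assistant. Keep answers brief.",
        params=InputParams(temperature=0.7),
    )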

adapter_class

alias of GeminiLLMAdapter

can_generate_metrics()[source]
Return type:

bool

set_audio_input_paused(paused)[source]
Parameters:

paused (bool)

set_video_input_paused(paused)[source]
Parameters:

paused (bool)

set_model_modalities(modalities)[source]
Parameters:

modalities (GeminiMultimodalModalities)

set_language(language)[source]

Set the language for generation.

Parameters:

language (Language)
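
Example (a minimal sketch; assumes llm is an already-constructed GeminiMultimodalLiveLLMService and that Language.ES is an available enum member):

    from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalModalities
    from pipecat.transcriptions.language import Language

    llm.set_model_modalities(GeminiMultimodalModalities.TEXT)  # switch to text responses
    llm.set_language(Language.ES)                              # switch generation language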

async set_context(context)[source]

Set the context explicitly from outside the pipeline.

This is useful when initializing a conversation, because in server-side VAD mode there may be no other way to trigger the pipeline. Calling this method sends the history to the server. The inference_on_context_initialization flag controls whether the turnComplete flag is set when the history is sent; without turnComplete, the model will not generate a response. Triggering an initial response is usually what we want when setting the context at the beginning of a conversation.

Parameters:

context (OpenAILLMContext)
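
Example (a minimal sketch; seed_history is a hypothetical helper and the message content is illustrative):

    from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

    async def seed_history(llm):
        # llm is a GeminiMultimodalLiveLLMService constructed as above.
        context = OpenAILLMContext(
            messages=[
                {"role": "system", "content": "You are a concise voice assistant."},
                {"role": "user", "content": "Greet the caller."},
            ]
        )
        # With inference_on_context_initialization=True (the default), this also
        # triggers an initial model response.
        await llm.set_context(context)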

async start(frame)[source]

Start the LLM service.

Parameters:

frame (StartFrame) – The start frame.

async stop(frame)[source]

Stop the LLM service.

Parameters:

frame (EndFrame) – The end frame.

async cancel(frame)[source]

Cancel the LLM service.

Parameters:

frame (CancelFrame) – The cancel frame.

async process_frame(frame, direction)[source]

Process a frame.

Parameters:
  • frame (Frame) – The frame to process.

  • direction (FrameDirection) – The direction of frame processing.

async send_client_event(event)[source]

create_context_aggregator(context, *, user_params=LLMUserAggregatorParams(aggregation_timeout=0.5), assistant_params=LLMAssistantAggregatorParams(expect_stripped_words=True))[source]

Create an instance of GeminiMultimodalLiveContextAggregatorPair from an OpenAILLMContext. Constructor keyword arguments for both the user and assistant aggregators can be provided.

Parameters:
  • context (OpenAILLMContext) – The LLM context.

  • user_params (LLMUserAggregatorParams, optional) – User aggregator parameters.

  • assistant_params (LLMAssistantAggregatorParams, optional) – Assistant aggregator parameters.

Returns:

A pair of context aggregators, one for the user and one for the assistant, encapsulated in a GeminiMultimodalLiveContextAggregatorPair.

Return type:

GeminiMultimodalLiveContextAggregatorPair
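
Example (a minimal sketch of wiring the aggregator pair into a pipeline; transport, llm, and context are placeholders assumed to be constructed elsewhere):

    from pipecat.pipeline.pipeline import Pipeline

    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),               # audio/video input (placeholder)
            context_aggregator.user(),       # GeminiMultimodalLiveUserContextAggregator
            llm,                             # GeminiMultimodalLiveLLMService
            transport.output(),              # audio output (placeholder)
            context_aggregator.assistant(),  # GeminiMultimodalLiveAssistantContextAggregator
        ]
    )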