Gemini

pipecat.services.gemini_multimodal_live.gemini.language_to_gemini_language(language)[source]

Maps a Language enum value to a Gemini Live supported language code.

Source: https://ai.google.dev/api/generate-content#MediaResolution

Returns None if the language is not supported by Gemini Live.

Parameters:

language (Language)

Return type:

str | None
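
Example (a minimal sketch; assumes the Language enum is importable from pipecat.transcriptions.language, and the fallback value is an illustrative choice):

    from pipecat.services.gemini_multimodal_live.gemini import language_to_gemini_language
    from pipecat.transcriptions.language import Language

    code = language_to_gemini_language(Language.EN_US)
    if code is None:
        # The language is not supported by Gemini Live; fall back to a default.
        code = "en-US"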

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveContext(messages=None, tools=NOT_GIVEN, tool_choice=NOT_GIVEN)[source]

Bases: OpenAILLMContext

Parameters:
  • messages (List[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | None)

  • tools (List[ChatCompletionToolParam] | NotGiven | ToolsSchema)

  • tool_choice (Literal['none', 'auto', 'required'] | ChatCompletionNamedToolChoiceParam | NotGiven)

static upgrade(obj)[source]
Parameters:

obj (OpenAILLMContext)

Return type:

GeminiMultimodalLiveContext

extract_system_instructions()[source]

get_messages_for_initializing_history()[source]
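
Example (a minimal sketch of upgrading an existing OpenAI-style context; the OpenAILLMContext import path and the in-place upgrade behavior are assumptions):

    from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
    from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalLiveContext

    context = OpenAILLMContext(
        messages=[{"role": "system", "content": "You are a helpful assistant."}]
    )
    # upgrade() returns the same context re-typed as a GeminiMultimodalLiveContext.
    gemini_context = GeminiMultimodalLiveContext.upgrade(context)
    system = gemini_context.extract_system_instructions()
    history = gemini_context.get_messages_for_initializing_history()
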
class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveUserContextAggregator(context, *, params=None, **kwargs)[source]

Bases: OpenAIUserContextAggregator

Parameters:
  • context (OpenAILLMContext)

  • params (LLMUserAggregatorParams | None)

async process_frame(frame, direction)[source]
Parameters:
  • frame (Frame)

  • direction (FrameDirection)

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveAssistantContextAggregator(context, *, params=None, **kwargs)[source]

Bases: OpenAIAssistantContextAggregator

Parameters:
  • context (OpenAILLMContext)

  • params (LLMAssistantAggregatorParams | None)

async process_frame(frame, direction)[source]
Parameters:
  • frame (Frame)

  • direction (FrameDirection)

async handle_user_image_frame(frame)[source]

Handle a user image frame from a function call request.

Marks the associated function call as completed and adds the image to the context for processing.

Parameters:

frame (UserImageRawFrame) – Frame containing the user image and request context.

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveContextAggregatorPair(_user: pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveUserContextAggregator, _assistant: pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveAssistantContextAggregator)[source]

Bases: object

Parameters:
  • _user (GeminiMultimodalLiveUserContextAggregator)

  • _assistant (GeminiMultimodalLiveAssistantContextAggregator)

user()[source]
Return type:

GeminiMultimodalLiveUserContextAggregator

assistant()[source]
Return type:

GeminiMultimodalLiveAssistantContextAggregator

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalModalities(*values)[source]

Bases: Enum

TEXT = 'TEXT'
AUDIO = 'AUDIO'

class pipecat.services.gemini_multimodal_live.gemini.GeminiMediaResolution(*values)[source]

Bases: str, Enum

Media resolution options for Gemini Multimodal Live.

UNSPECIFIED = 'MEDIA_RESOLUTION_UNSPECIFIED'
LOW = 'MEDIA_RESOLUTION_LOW'
MEDIUM = 'MEDIA_RESOLUTION_MEDIUM'
HIGH = 'MEDIA_RESOLUTION_HIGH'

class pipecat.services.gemini_multimodal_live.gemini.GeminiVADParams(*, disabled=None, start_sensitivity=None, end_sensitivity=None, prefix_padding_ms=None, silence_duration_ms=None)[source]

Bases: BaseModel

Voice Activity Detection parameters.

Parameters:
  • disabled (bool | None)

  • start_sensitivity (StartSensitivity | None)

  • end_sensitivity (EndSensitivity | None)

  • prefix_padding_ms (int | None)

  • silence_duration_ms (int | None)

disabled: bool | None
start_sensitivity: StartSensitivity | None
end_sensitivity: EndSensitivity | None
prefix_padding_ms: int | None
silence_duration_ms: int | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
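
Example (a minimal sketch; the numeric values are illustrative, and start_sensitivity / end_sensitivity are omitted because their enum types come from the Google GenAI SDK, which is not shown here):

    from pipecat.services.gemini_multimodal_live.gemini import GeminiVADParams

    vad = GeminiVADParams(
        prefix_padding_ms=300,    # audio retained before detected speech start
        silence_duration_ms=800,  # silence required before end-of-speech is declared
    )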

class pipecat.services.gemini_multimodal_live.gemini.ContextWindowCompressionParams(*, enabled=False, trigger_tokens=None)[source]

Bases: BaseModel

Parameters for context window compression.

Parameters:
  • enabled (bool)

  • trigger_tokens (int | None)

enabled: bool
trigger_tokens: int | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
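
Example (a minimal sketch; the trigger value is illustrative):

    from pipecat.services.gemini_multimodal_live.gemini import ContextWindowCompressionParams

    # Compress the context once it grows past roughly 16k tokens.
    compression = ContextWindowCompressionParams(enabled=True, trigger_tokens=16000)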

class pipecat.services.gemini_multimodal_live.gemini.InputParams(*, frequency_penalty=None, max_tokens=4096, presence_penalty=None, temperature=None, top_k=None, top_p=None, modalities=GeminiMultimodalModalities.AUDIO, language=Language.EN_US, media_resolution=GeminiMediaResolution.UNSPECIFIED, vad=None, context_window_compression=None, extra=<factory>)[source]

Bases: BaseModel

Parameters:
  • frequency_penalty (float | None)

  • max_tokens (int | None)

  • presence_penalty (float | None)

  • temperature (float | None)

  • top_k (int | None)

  • top_p (float | None)

  • modalities (GeminiMultimodalModalities | None)

  • language (Language | None)

  • media_resolution (GeminiMediaResolution | None)

  • vad (GeminiVADParams | None)

  • context_window_compression (ContextWindowCompressionParams | None)

  • extra (Dict[str, Any] | None)

frequency_penalty: float | None
max_tokens: int | None
presence_penalty: float | None
temperature: float | None
top_k: int | None
top_p: float | None
modalities: GeminiMultimodalModalities | None
language: Language | None
media_resolution: GeminiMediaResolution | None
vad: GeminiVADParams | None
context_window_compression: ContextWindowCompressionParams | None
extra: Dict[str, Any] | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
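
Example (a minimal sketch combining the options above; field values are illustrative, and the Language import path is assumed):

    from pipecat.services.gemini_multimodal_live.gemini import (
        ContextWindowCompressionParams,
        GeminiMediaResolution,
        GeminiMultimodalModalities,
        GeminiVADParams,
        InputParams,
    )
    from pipecat.transcriptions.language import Language

    params = InputParams(
        temperature=0.7,
        max_tokens=2048,
        modalities=GeminiMultimodalModalities.AUDIO,
        language=Language.EN_US,
        media_resolution=GeminiMediaResolution.MEDIUM,
        vad=GeminiVADParams(silence_duration_ms=800),
        context_window_compression=ContextWindowCompressionParams(enabled=True),
    )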

class pipecat.services.gemini_multimodal_live.gemini.GeminiMultimodalLiveLLMService(*, api_key, base_url='generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent', model='models/gemini-2.0-flash-live-001', voice_id='Charon', start_audio_paused=False, start_video_paused=False, system_instruction=None, tools=None, params=None, inference_on_context_initialization=True, **kwargs)[source]

Bases: LLMService

Provides access to Google’s Gemini Multimodal Live API.

This service enables real-time conversations with Gemini, supporting both text and audio modalities. It handles voice transcription, streaming audio responses, and tool usage.

Parameters:
  • api_key (str) – Google AI API key

  • base_url (str, optional) – API endpoint base URL. Defaults to “generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent”.

  • model (str, optional) – Model identifier to use. Defaults to “models/gemini-2.0-flash-live-001”.

  • voice_id (str, optional) – TTS voice identifier. Defaults to “Charon”.

  • start_audio_paused (bool, optional) – Whether to start with audio input paused. Defaults to False.

  • start_video_paused (bool, optional) – Whether to start with video input paused. Defaults to False.

  • system_instruction (str, optional) – System prompt for the model. Defaults to None.

  • tools (Union[List[dict], ToolsSchema], optional) – Tools/functions available to the model. Defaults to None.

  • params (InputParams, optional) – Configuration parameters for the model. Defaults to InputParams().

  • inference_on_context_initialization (bool, optional) – Whether to generate a response when context is first set. Defaults to True.
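
Example (a minimal sketch of constructing the service; the GOOGLE_API_KEY environment variable name and the parameter values are illustrative):

    import os

    from pipecat.services.gemini_multimodal_live.gemini import (
        GeminiMultimodalLiveLLMService,
        InputParams,
    )

    llm = GeminiMultimodalLiveLLMService(
        api_key=os.getenv("GOOGLE_API_KEY"),
        voice_id="Charon",
        system_instruction="You are a helpful voice assistant. Keep answers brief.",
        params=InputParams(temperature=0.7),
    )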

adapter_class

alias of GeminiLLMAdapter

can_generate_metrics()[source]
Return type:

bool

set_audio_input_paused(paused)[source]
Parameters:

paused (bool)

set_video_input_paused(paused)[source]
Parameters:

paused (bool)

set_model_modalities(modalities)[source]
Parameters:

modalities (GeminiMultimodalModalities)

set_language(language)[source]

Set the language for generation.

Parameters:

language (Language)
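
Example (a minimal sketch; assumes llm is an already-constructed GeminiMultimodalLiveLLMService and that Language.ES is an available enum member):

    from pipecat.services.gemini_multimodal_live.gemini import GeminiMultimodalModalities
    from pipecat.transcriptions.language import Language

    llm.set_model_modalities(GeminiMultimodalModalities.TEXT)  # switch to text responses
    llm.set_language(Language.ES)                              # switch generation language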

async set_context(context)[source]

Set the context explicitly from outside the pipeline.

This is useful when initializing a conversation, because in server-side VAD mode there may be no other way to trigger the pipeline. Calling this method sends the history to the server. The inference_on_context_initialization flag controls whether the turnComplete flag is set when the history is sent; without turnComplete, the model will not generate a response. Triggering an initial response is usually what we want when setting the context at the beginning of a conversation.

Parameters:

context (OpenAILLMContext)
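
Example (a minimal sketch; seed_history is a hypothetical helper and the message content is illustrative):

    from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext

    async def seed_history(llm):
        # llm is a GeminiMultimodalLiveLLMService constructed as above.
        context = OpenAILLMContext(
            messages=[
                {"role": "system", "content": "You are a concise voice assistant."},
                {"role": "user", "content": "Greet the caller."},
            ]
        )
        # With inference_on_context_initialization=True (the default), this also
        # triggers an initial model response.
        await llm.set_context(context)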

async start(frame)[source]

Start the LLM service.

Parameters:

frame (StartFrame) – The start frame.

async stop(frame)[source]

Stop the LLM service.

Parameters:

frame (EndFrame) – The end frame.

async cancel(frame)[source]

Cancel the LLM service.

Parameters:

frame (CancelFrame) – The cancel frame.

async process_frame(frame, direction)[source]

Process a frame.

Parameters:
  • frame (Frame) – The frame to process.

  • direction (FrameDirection) – The direction of frame processing.

async send_client_event(event)[source]

create_context_aggregator(context, *, user_params=LLMUserAggregatorParams(aggregation_timeout=0.5), assistant_params=LLMAssistantAggregatorParams(expect_stripped_words=True))[source]

Create an instance of GeminiMultimodalLiveContextAggregatorPair from an OpenAILLMContext. Constructor keyword arguments for both the user and assistant aggregators can be provided.

Parameters:
  • context (OpenAILLMContext) – The LLM context.

  • user_params (LLMUserAggregatorParams, optional) – User aggregator parameters.

  • assistant_params (LLMAssistantAggregatorParams, optional) – Assistant aggregator parameters.

Returns:

A pair of context aggregators, one for the user and one for the assistant, encapsulated in a GeminiMultimodalLiveContextAggregatorPair.

Return type:

GeminiMultimodalLiveContextAggregatorPair
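
Example (a minimal sketch of wiring the aggregator pair into a pipeline; transport, llm, and context are placeholders assumed to be constructed elsewhere):

    from pipecat.pipeline.pipeline import Pipeline

    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline(
        [
            transport.input(),               # audio/video input (placeholder)
            context_aggregator.user(),       # GeminiMultimodalLiveUserContextAggregator
            llm,                             # GeminiMultimodalLiveLLMService
            transport.output(),              # audio output (placeholder)
            context_aggregator.assistant(),  # GeminiMultimodalLiveAssistantContextAggregator
        ]
    )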