Audio understanding

Audio can be sent to the model using an AudioPart.

types.ts
interface AudioPart {
  type: "audio";
  /**
   * The base64-encoded audio data.
   */
  audio_data: string;
  format: AudioFormat;
  /**
   * The sample rate of the audio. E.g. 44100, 48000.
   */
  sample_rate?: number;
  /**
   * The number of channels of the audio. E.g. 1, 2.
   */
  channels?: number;
  /**
   * The transcript of the audio.
   */
  transcript?: string;
  /**
   * ID of the audio part, if applicable
   */
  id?: string;
}

type AudioFormat =
  | "wav"
  | "mp3"
  | "linear16"
  | "flac"
  | "mulaw"
  | "alaw"
  | "aac"
  | "opus";
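
As an illustration of how the fields fit together, the following is a minimal sketch of a helper that builds an AudioPart from data that is already base64-encoded. The names `audioFormatFromFilename`, `makeAudioPart`, and the extension mapping are hypothetical and not part of the library; the types are copied locally so the snippet stands alone. Extensions such as `.ogg` are deliberately left unmapped because the container alone does not say which codec is inside.

```typescript
// Local copies of the types from types.ts so the sketch is self-contained.
type AudioFormat =
  | "wav" | "mp3" | "linear16" | "flac" | "mulaw" | "alaw" | "aac" | "opus";

interface AudioPart {
  type: "audio";
  audio_data: string;
  format: AudioFormat;
  sample_rate?: number;
  channels?: number;
  transcript?: string;
  id?: string;
}

// Hypothetical extension-to-format mapping (an assumption, not library behaviour).
const EXTENSION_TO_FORMAT = new Map<string, AudioFormat>([
  ["wav", "wav"],
  ["mp3", "mp3"],
  ["flac", "flac"],
  ["aac", "aac"],
  ["opus", "opus"],
]);

// Returns undefined when the extension is missing, unknown, or ambiguous.
function audioFormatFromFilename(filename: string): AudioFormat | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  return EXTENSION_TO_FORMAT.get(filename.slice(dot + 1).toLowerCase());
}

// Builds an AudioPart; optional fields such as sample_rate pass through unchanged.
function makeAudioPart(
  base64Data: string,
  filename: string,
  extra: Omit<AudioPart, "type" | "audio_data" | "format"> = {},
): AudioPart {
  const format = audioFormatFromFilename(filename);
  if (format === undefined) {
    throw new Error(`Cannot infer audio format from "${filename}"`);
  }
  return { type: "audio", audio_data: base64Data, format, ...extra };
}
```

A call such as `makeAudioPart(data, "interview.mp3", { channels: 1 })` would yield a part with `format: "mp3"`. For headerless encodings such as `linear16`, `mulaw`, or `alaw`, it is probably safer to set `format`, `sample_rate`, and `channels` explicitly rather than infer them.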

This enables use cases such as:

  • Transcribing audio to text
  • Summarizing spoken content
  • Analyzing sentiment in speech
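
The use cases above differ mainly in the text prompt that accompanies the audio. A rough sketch, assuming a user message made of a text part followed by an audio part: the `AudioTask`, `TASK_PROMPTS`, `UserMessage`, and `buildAudioMessage` names and the prompt wording are hypothetical, chosen only for illustration.

```typescript
// Local copies of the relevant types so the sketch is self-contained.
type AudioFormat =
  | "wav" | "mp3" | "linear16" | "flac" | "mulaw" | "alaw" | "aac" | "opus";

interface AudioPart {
  type: "audio";
  audio_data: string;
  format: AudioFormat;
  sample_rate?: number;
  channels?: number;
  transcript?: string;
  id?: string;
}

interface TextPart {
  type: "text";
  text: string;
}

// Hypothetical message shape for this sketch: a user turn with mixed parts.
interface UserMessage {
  role: "user";
  content: Array<TextPart | AudioPart>;
}

type AudioTask = "transcribe" | "summarize" | "sentiment";

// Illustrative prompts only; tune the wording for the model in use.
const TASK_PROMPTS: Record<AudioTask, string> = {
  transcribe: "Transcribe this audio to text.",
  summarize: "Summarize the main points of this recording.",
  sentiment: "Describe the speaker's sentiment and tone.",
};

// Pairs the task prompt with the audio part in a single user message.
function buildAudioMessage(task: AudioTask, audio: AudioPart): UserMessage {
  return {
    role: "user",
    content: [{ type: "text", text: TASK_PROMPTS[task] }, audio],
  };
}
```

The resulting object can be placed in the `messages` array of a generate call; only the text part changes between tasks, while the same AudioPart is reused.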

summarize-audio.ts
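import { getModel } from "./get-model.ts";

// Download the audio file and read it into memory.
const audioUrl = "https://archive.org/download/MLKDream/MLKDream.ogg";
const audioRes = await fetch(audioUrl);
if (!audioRes.ok) {
  throw new Error(`Failed to fetch audio: ${audioRes.status} ${audioRes.statusText}`);
}
const audio = await audioRes.arrayBuffer();

const model = getModel("google", "gemini-2.0-flash");

const response = await model.generate({
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "What is this speech about?",
        },
        {
          type: "audio",
          // AudioPart expects the audio bytes as a base64 string.
          audio_data: Buffer.from(audio).toString("base64"),
          format: "opus",
        },
      ],
    },
  ],
});

// Print the full response, including nested fields.
console.dir(response, { depth: null });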