Audio generation

The audio modality is represented by AudioPart objects.

types.ts
interface AudioPart {
  type: "audio";
  /**
   * The base64-encoded audio data.
   */
  audio_data: string;
  format: AudioFormat;
  /**
   * The sample rate of the audio. E.g. 44100, 48000.
   */
  sample_rate?: number;
  /**
   * The number of channels of the audio. E.g. 1, 2.
   */
  channels?: number;
  /**
   * The transcript of the audio.
   */
  transcript?: string;
  /**
   * ID of the audio part, if applicable.
   */
  id?: string;
}

type AudioFormat =
  | "wav"
  | "mp3"
  | "linear16"
  | "flac"
  | "mulaw"
  | "alaw"
  | "aac"
  | "opus";

To ensure the audio can be played correctly, the application code must consider the provided audio format, sample_rate, and channels.
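For example, raw PCM formats need the sample rate and channel count supplied explicitly, while container formats carry that metadata themselves. A minimal sketch of how playback handling might branch on these fields (the helper and import path below are hypothetical; only the AudioPart fields come from the type above):

import type { AudioPart } from "./types.ts";

// Hypothetical helper: decide how an AudioPart should be handled before playback.
function describePlayback(part: AudioPart): string {
  const bytes = Buffer.from(part.audio_data, "base64");
  if (part.format === "linear16") {
    // Raw PCM has no header, so the player must be told the sample rate
    // and channel count explicitly.
    const sampleRate = part.sample_rate ?? 24000;
    const channels = part.channels ?? 1;
    return `raw PCM: ${bytes.length} bytes, ${sampleRate} Hz, ${channels} channel(s)`;
  }
  // Container/encoded formats (wav, mp3, flac, ...) carry their own metadata
  // and should be run through a decoder before playback.
  return `${part.format}: ${bytes.length} bytes, decode before playback`;
}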

Depending on the provider, you may need to pass additional parameters to its API, such as voice or format, via the audio field.

types.ts
interface AudioOptions {
  /**
   * The desired audio format.
   */
  format?: AudioFormat;
  /**
   * The provider-specific voice ID to use for audio generation.
   */
  voice?: string;
  /**
   * The language code for the audio generation.
   */
  language?: string;
}

To generate audio, include "audio" in the modalities field of the request.

generate-audio.ts
import audioContext from "audio-context";
import decodeAudio from "audio-decode";
import play from "audio-play";
import { getModel } from "./get-model.ts";

const model = getModel("openai-chat-completion", "gpt-4o-audio-preview");

const response = await model.generate({
  modalities: ["text", "audio"],
  audio: {
    format: "mp3",
    voice: "alloy",
  },
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Is a golden retriever a good family dog?",
        },
      ],
    },
  ],
});

console.dir(response, { depth: null });

const audioPart = response.content.find((part) => part.type === "audio");
if (audioPart) {
  const audioBuffer = await decodeAudio(
    Buffer.from(audioPart.audio_data, "base64"),
  );
  const playback = play(
    audioBuffer,
    { context: audioContext } as unknown as play.Options,
    () => {
      console.log("Playback finished");
    },
  );
  playback.play();
}
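If you want to persist the generated audio rather than play it back, the base64 payload can be written straight to disk. A minimal sketch building on the audioPart found above (the output filename is illustrative):

import { writeFile } from "node:fs/promises";

// Decode the base64 payload and name the file after the returned format ("mp3" here).
if (audioPart) {
  await writeFile(
    `output.${audioPart.format}`,
    Buffer.from(audioPart.audio_data, "base64"),
  );
}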

Audio generation can also be streamed using the stream() method. In streamed responses, AudioPart will be represented as AudioPartDelta:

types.ts
interface AudioPartDelta {
  type: "audio";
  /**
   * The base64-encoded audio data.
   */
  audio_data?: string;
  format?: AudioFormat;
  /**
   * The sample rate of the audio. E.g. 44100, 48000.
   */
  sample_rate?: number;
  /**
   * The number of channels of the audio. E.g. 1, 2.
   */
  channels?: number;
  /**
   * The transcript of the audio.
   */
  transcript?: string;
  /**
   * The ID of the audio part, if applicable.
   */
  id?: string;
}

type AudioFormat =
  | "wav"
  | "mp3"
  | "linear16"
  | "flac"
  | "mulaw"
  | "alaw"
  | "aac"
  | "opus";

Individual audio_data chunks can be played back as they are received. They can also be combined to produce the final audio output, as sketched after the streaming example below.

stream-audio.ts
import Speaker from "speaker";
import { getModel } from "./get-model.ts";

let speaker: Speaker | undefined;

const model = getModel("openai-chat-completion", "gpt-4o-audio-preview");

const response = model.stream({
  modalities: ["text", "audio"],
  audio: {
    format: "linear16",
    voice: "alloy",
  },
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: "Is a golden retriever a good family dog?",
        },
      ],
    },
  ],
});

let current = await response.next();
while (!current.done) {
  console.dir(current.value, { depth: null });
  const part = current.value.delta?.part;
  if (part?.type === "audio") {
    if (part.audio_data) {
      speaker =
        speaker ??
        new Speaker({
          sampleRate: part.sample_rate ?? 24000,
          bitDepth: 16,
          channels: part.channels ?? 1,
        });
      speaker.write(Buffer.from(part.audio_data, "base64"));
    }
  }
  current = await response.next();
}
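To keep the complete audio as well, you can collect the decoded chunks while streaming and concatenate them once the stream ends. A minimal sketch on top of the loop above (the output path is illustrative):

import { writeFile } from "node:fs/promises";

const chunks: Buffer[] = [];
// Inside the streaming loop above, alongside speaker.write(...):
//   chunks.push(Buffer.from(part.audio_data, "base64"));

// After the loop: linear16 chunks are headerless PCM, so they can simply be
// concatenated. Keep the sample rate and channel count around if you later
// wrap the data in a container format such as WAV.
await writeFile("output.pcm", Buffer.concat(chunks));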