Text-to-Speech with Orpheus TTS models


Orpheus TTS is a series of open-source text-to-speech models that use the same architecture as Llama 3, so they can be run like any other Llama-based LLM. This cookbook shows how to use Orpheus TTS for text-to-speech applications.

Note that the SNAC audio codec currently has to run on the client side.

Hello world

First, install and import the dependencies, then define a helper function that converts audio token IDs into audio data.

!pip install torchaudio snac openai soundfile &> /dev/null

from IPython.display import Audio, display

import torch
import openai
import snac
import re
import torchaudio

AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")

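# Convert a flat list of audio token IDs into audio data.
# `model` is the SNAC codec; len(audio_ids) must be a multiple of 7.
def convert_to_audio(audio_ids, model):
  # Each row is one frame of 7 interleaved codes
  audio_ids = torch.tensor(audio_ids, dtype=torch.int32).reshape(-1, 7)
  # Level 0: 1 code per frame, at position 0
  codes_0 = audio_ids[:, 0].unsqueeze(0)
  # Level 1: 2 codes per frame, at positions 1 and 4
  codes_1 = torch.stack((audio_ids[:, 1], audio_ids[:, 4])).t().flatten().unsqueeze(0)
  # Level 2: 4 codes per frame, at positions 2, 3, 5, and 6
  codes_2 = (
    torch.stack((audio_ids[:, 2], audio_ids[:, 3], audio_ids[:, 5], audio_ids[:, 6]))
    .t()
    .flatten()
    .unsqueeze(0)
  )

  # Decode the three code levels without tracking gradients
  with torch.inference_mode():
    audio_hat = model.decode([codes_0, codes_1, codes_2])

  # Return the first item of the decoded batch
  return audio_hat[0]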
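
To make the frame layout concrete, here is a minimal pure-Python sketch of the same indexing, paired with the token regex defined above. The helper names `extract_audio_ids` and `split_frame_levels` are hypothetical and not part of any library. The sketch assumes the model emits audio tokens as `<custom_token_N>` text, and it applies no offset or remapping to the token numbers, which may be required before decoding.

```python
import re

# Same pattern as AUDIO_TOKENS_REGEX in the cell above
AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")

FRAME_SIZE = 7  # convert_to_audio reshapes the ids into rows of 7


def extract_audio_ids(text):
    """Hypothetical helper: collect token numbers and drop a trailing partial frame."""
    ids = [int(n) for n in AUDIO_TOKENS_REGEX.findall(text)]
    usable = len(ids) - len(ids) % FRAME_SIZE  # reshape(-1, 7) needs whole frames
    return ids[:usable]


def split_frame_levels(ids):
    """Hypothetical helper: pure-Python mirror of the indexing in convert_to_audio."""
    frames = [ids[i:i + FRAME_SIZE] for i in range(0, len(ids), FRAME_SIZE)]
    codes_0 = [f[0] for f in frames]                        # 1 code per frame
    codes_1 = [f[j] for f in frames for j in (1, 4)]        # 2 codes per frame
    codes_2 = [f[j] for f in frames for j in (2, 3, 5, 6)]  # 4 codes per frame
    return codes_0, codes_1, codes_2


# 16 tokens: two complete frames plus 2 leftover tokens that get dropped
sample_text = "".join(f"<custom_token_{i}>" for i in range(16))
sample_ids = extract_audio_ids(sample_text)
codes_0, codes_1, codes_2 = split_frame_levels(sample_ids)
print(codes_0)  # [0, 7]
print(codes_1)  # [1, 4, 8, 11]
print(codes_2)  # [2, 3, 5, 6, 9, 10, 12, 13]
```

Trimming to a whole number of frames matters because `reshape(-1, 7)` raises an error when the number of ids is not a multiple of 7, which can happen if generation stops mid-frame.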

Next, prompt the model to generate audio.
