Text-to-Speech with Orpheus TTS models
Orpheus TTS is a series of open-source text-to-speech models that share the Llama 3 architecture, so they can be served like any other Llama-based LLM. In this cookbook, we show how to use Orpheus TTS for text-to-speech applications.
Note that currently the SNAC audio codec has to be run on the client-side.
Hello world
First, import the needed dependencies and define a helper function to convert tokens into audio data.
!pip install torchaudio snac openai soundfile &> /dev/null
from IPython.display import Audio, display
import torch
import openai
import snac
import re
import torchaudio
AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")
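The model emits audio as a stream of `<custom_token_N>` markers, and the regex above extracts the numeric IDs. A quick sanity check (the sample string below is made up for illustration):

```python
import re

AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")

# Hypothetical fragment of a completion: each marker carries one code ID
sample = "<custom_token_1234><custom_token_56><custom_token_7>"
ids = AUDIO_TOKENS_REGEX.findall(sample)
print(ids)  # ['1234', '56', '7']
```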
# Convert tokens into audio data
def convert_to_audio(audio_ids, model):
    # Each frame of audio is encoded as 7 tokens: 1 coarse code, 2 medium
    # codes, and 4 fine codes, matching SNAC's three codebook resolutions.
    audio_ids = torch.tensor(audio_ids, dtype=torch.int32).reshape(-1, 7)
    codes_0 = audio_ids[:, 0].unsqueeze(0)
    codes_1 = torch.stack((audio_ids[:, 1], audio_ids[:, 4])).t().flatten().unsqueeze(0)
    codes_2 = (
        torch.stack((audio_ids[:, 2], audio_ids[:, 3], audio_ids[:, 5], audio_ids[:, 6]))
        .t()
        .flatten()
        .unsqueeze(0)
    )
    # Decode the three codebooks into a waveform with the SNAC model
    with torch.inference_mode():
        audio_hat = model.decode([codes_0, codes_1, codes_2])
    return audio_hat[0]

Next, prompt the model to generate audio.
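A minimal sketch of this step, assuming an OpenAI-compatible server hosting an Orpheus checkpoint. The base URL, model name, voice, and prompt format below are placeholders for your own deployment, and the ID offsets follow the reference Orpheus decoder (subtract 10, plus 4096 per position within each 7-token frame) — verify they match your checkpoint:

```python
import re

AUDIO_TOKENS_REGEX = re.compile(r"<custom_token_(\d+)>")

def extract_audio_ids(text):
    # Map each <custom_token_N> back to a raw SNAC code ID. Assumption: the
    # checkpoint uses the reference Orpheus token layout, i.e. a fixed offset
    # of 10 plus 4096 for each position within a 7-token frame.
    return [
        int(n) - 10 - (i % 7) * 4096
        for i, n in enumerate(AUDIO_TOKENS_REGEX.findall(text))
    ]

def generate_speech(prompt, voice="tara"):
    # Deferred import so the decoding helper works without the client installed
    import openai

    # Placeholder endpoint and model id -- point these at your own server
    client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")
    completion = client.completions.create(
        model="orpheus-3b",           # hypothetical model id
        prompt=f"{voice}: {prompt}",  # assumed "voice: text" prompt format
        max_tokens=2048,
    )
    return extract_audio_ids(completion.choices[0].text)

# audio_ids = generate_speech("Hello world!")
# audio = convert_to_audio(audio_ids, snac_model)
```

The returned IDs can be passed straight to `convert_to_audio` along with a loaded SNAC model, and the resulting waveform played back with `Audio`/`display`.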