Speech & Audio

egaki handles text-to-speech, voice cloning, and audio stem separation.

Text-to-speech

egaki speech "Hello, this is a test." -o hello.mp3

Specify a model and voice:

egaki speech "Welcome to the future of video." \
  --model sonic-3.5 \
  --voice <voice-id> \
  -o narration.mp3

Read from stdin

Pipe text from a file:

cat script.txt | egaki speech --stdin -o narration.mp3
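When several scripts need narrating, the same stdin form can be wrapped in a loop. The following is a minimal sketch, not part of egaki itself: `narrate_dir` is a hypothetical helper name, and it uses only the `--stdin` and `-o` flags shown above.

```shell
# Hypothetical helper: narrate every .txt file in a directory,
# writing NAME.mp3 next to NAME.txt.
narrate_dir() {
  for nd_script in "$1"/*.txt; do
    [ -e "$nd_script" ] || continue   # the directory contained no .txt files
    egaki speech --stdin -o "${nd_script%.txt}.mp3" < "$nd_script"
  done
}
```

Calling `narrate_dir scripts` would then produce `scripts/intro.mp3` from `scripts/intro.txt`, and so on for each text file.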

Speed control

Cartesia models support --speed from 0.6 to 1.5:

egaki speech "Speak faster." --speed 1.3 -o fast.mp3
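If the speed comes from user input or a config file, it may be worth checking it against the documented range before calling egaki. This is a small sketch under that assumption; `check_speed` is a hypothetical helper, and how egaki itself reacts to out-of-range values is not covered here.

```shell
# Hypothetical helper: succeed only if the value lies in the
# documented Cartesia range of 0.6 to 1.5 (inclusive).
check_speed() {
  awk -v s="$1" 'BEGIN { exit !(s + 0 >= 0.6 && s + 0 <= 1.5) }'
}
```

For example, `check_speed "$speed" && egaki speech "Speak faster." --speed "$speed" -o fast.mp3` skips the call when the value is outside the range.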

Available providers

Provider	Models	Notes
OpenAI	`tts-1`, `tts-1-hd`	Standard quality
Cartesia	`sonic-3.5`, `sonic-3`	Best quality, speed control
ElevenLabs	`eleven_v3`, `eleven_multilingual_v2`, `eleven_flash_v2_5`	Multilingual

Voice cloning

Clone a voice from an audio clip and get a reusable voice ID:

egaki voice clone recording.mp3 --name "my-voice" --json
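To reuse the voice ID in a script, it has to be pulled out of the JSON output. The structure of that output is not described on this page, so the sketch below makes an assumption: that the ID is a top-level string field, here called `id`. Check the real output and adjust the field name. `json_string_field` is a hypothetical helper; a proper JSON parser is more robust if one is available.

```shell
# Hypothetical helper: read JSON on stdin and print the first string
# value found for the field named in $1. The field name used for the
# voice ID is an assumption; verify it against egaki's actual output.
json_string_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Assuming the field really is named `id`, the ID could be captured with `voice_id=$(egaki voice clone recording.mp3 --name "my-voice" --json | json_string_field id)` and then passed to `--voice "$voice_id"`.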

Workflow: clone from a song

Separate vocals first, then clone:

# 1. Separate vocals
egaki demucs song.mp3 --stems vocals -o stems/

# 2. Clone the isolated voice
egaki voice clone stems/song-vocals.mp3 --name "singer" --json

# 3. Generate speech with the cloned voice
egaki speech "Your text here." --voice <voice-id> -m sonic-3.5 -o output.mp3

Providers

Cartesia (default): instant cloning, up to 10s of audio, free
ElevenLabs: longer clips, --remove-background-noise option

egaki voice clone noisy-audio.mp3 \
  --provider elevenlabs \
  --name "clean-voice" \
  --remove-background-noise \
  --json

Audio stem separation (demucs)

Separate a song into individual stems using fal.ai's Demucs model:

egaki demucs song.mp3 --stems vocals,other -o stems/

Available stems

vocals, drums, bass, other, guitar, piano (guitar and piano are only available with the htdemucs_6s model)

Models

Model	Stems	Best for
`htdemucs`	vocals, drums, bass, other	General purpose
`htdemucs_ft`	vocals, drums, bass, other	Fine-tuned, higher quality
`htdemucs_6s` (default)	vocals, drums, bass, other, guitar, piano	6-stem separation

egaki demucs song.mp3 --model htdemucs_6s --stems vocals,guitar -o stems/
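Because the models differ in which stems they produce, a script can check a requested stem list against the table above before starting a separation. The sketch below encodes that table by hand; `stems_for_model` and `check_stems` are hypothetical helpers, not egaki commands.

```shell
# Hypothetical helper: print the stems a model produces, per the table above.
stems_for_model() {
  case "$1" in
    htdemucs|htdemucs_ft) echo "vocals drums bass other" ;;
    htdemucs_6s)          echo "vocals drums bass other guitar piano" ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeed only if every stem in the comma-separated
# list $2 is available for model $1.
check_stems() {
  cs_supported=$(stems_for_model "$1") || { echo "unknown model: $1" >&2; return 1; }
  for cs_stem in $(printf '%s' "$2" | tr ',' ' '); do
    case " $cs_supported " in
      *" $cs_stem "*) ;;
      *) echo "stem '$cs_stem' is not available for $1" >&2; return 1 ;;
    esac
  done
  return 0
}
```

For example, `check_stems htdemucs_6s vocals,guitar` succeeds, while `check_stems htdemucs vocals,guitar` fails because the 4-stem models do not produce a guitar stem.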