Transcription

Transcribe audio files to text with word-level timestamps. Useful for generating captions in egaki videos.

Basic usage

egaki transcribe recording.mp3
Specify a model:
egaki transcribe recording.mp3 --model whisper-1

Models with word timestamps

Not all models return word-level timing. These do:
| Model | Provider | Notes |
| --- | --- | --- |
| whisper-1 | OpenAI | Reliable, good accuracy |
| ink-whisper | Cartesia | Cheapest option |
| scribe_v1 | ElevenLabs | High accuracy |
| nova-3 | Deepgram | Fast |
| whisper-large-v3 | Groq | Fast, open weights |
| whisper-large-v3-turbo | Groq | Fastest |
| distil-whisper-large-v3-en | Groq | English only, very fast |

gpt-4o-transcribe and gpt-4o-mini-transcribe do not support word timestamps. The OpenAI API rejects verbose_json for these models.
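
If a script picks the model dynamically, it may help to check up front whether that model returns word timing. The sketch below is only an illustration: `supportsWordTimestamps` and `WORD_TIMESTAMP_MODELS` are hypothetical names, not part of egaki, and the set simply mirrors the models listed above, so it may lag behind what egaki actually supports.

```typescript
// Models listed above as returning word-level timestamps.
// This set is copied from the table and is an assumption about completeness.
const WORD_TIMESTAMP_MODELS: ReadonlySet<string> = new Set([
  "whisper-1",
  "ink-whisper",
  "scribe_v1",
  "nova-3",
  "whisper-large-v3",
  "whisper-large-v3-turbo",
  "distil-whisper-large-v3-en",
]);

// Hypothetical helper (not an egaki API): true if the model is in the list above.
function supportsWordTimestamps(model: string): boolean {
  return WORD_TIMESTAMP_MODELS.has(model);
}
```

A caller could use this to fall back to whisper-1 when the requested model is not in the set.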

Output formats

By default, egaki transcribe prints the text and word timestamps to stdout. To save them to a file instead, pass -o:
egaki transcribe recording.mp3 -o transcript.json
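
The saved file can then be read from a script. The exact JSON layout is not documented on this page, so the sketch below rests on an assumption: that the file holds either a top-level `words` array or a bare array, and that each entry carries `word` and `startSecond` fields. `parseTranscriptWords` and `TimedWord` are hypothetical names used for illustration; check a real transcript.json before relying on them.

```typescript
// Assumed shape of one timed word; the field names are an assumption
// based on the `word` and `startSecond` values used for captions.
interface TimedWord {
  word: string;
  startSecond: number;
}

// Hypothetical helper: extract timed words from the transcript JSON text.
function parseTranscriptWords(json: string): TimedWord[] {
  const data: unknown = JSON.parse(json);
  // Accept either { words: [...] } or a bare array (both are assumptions).
  const raw: unknown = Array.isArray(data)
    ? data
    : (data as { words?: unknown } | null)?.words;
  if (!Array.isArray(raw)) {
    throw new Error("no words array found in transcript");
  }
  return raw.map((entry: unknown, i: number) => {
    const { word, startSecond } = (entry ?? {}) as {
      word?: unknown;
      startSecond?: unknown;
    };
    // Fail loudly if the real output does not match the assumed shape.
    if (typeof word !== "string" || typeof startSecond !== "number") {
      throw new Error(`entry ${i} is missing "word" or "startSecond"`);
    }
    return { word, startSecond };
  });
}
```

In Node, the input string would typically come from `fs.readFileSync("transcript.json", "utf8")`.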

Using timestamps for video captions

The typical workflow for adding captions to an egaki video:
# 1. Generate TTS narration
egaki speech "Your narration text." --voice <id> -m sonic-3.5 -o public/narration.mp3
# 2. Transcribe to get word timestamps
egaki transcribe public/narration.mp3 --model whisper-1
# 3. Use the timestamps in your MDX video
Convert each word's startSecond to a frame delay by multiplying it by FPS:
<Caption
  words={[
    { word: "Your", delay: 0 },
    { word: "narration", delay: 0.26 * FPS },
    { word: "text.", delay: 0.48 * FPS },
  ]}
/>
When you regenerate the TTS audio, always re-transcribe and update the delays.
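
Writing the delays by hand is error-prone when the narration changes often, so the conversion can be scripted. The following is a minimal sketch, not part of egaki: `toCaptionWords`, `TimedWord`, and `CaptionWord` are hypothetical names, and the input is assumed to be a list of words that each carry `word` and `startSecond`. It applies the same `startSecond * FPS` rule as the example above and does not round, matching the hand-written delays.

```typescript
// Assumed input shape: one transcribed word with its start time in seconds.
interface TimedWord {
  word: string;
  startSecond: number;
}

// Shape used in the <Caption words={...} /> example: delay is in frames.
interface CaptionWord {
  word: string;
  delay: number;
}

// Hypothetical helper: convert start times in seconds to frame delays.
function toCaptionWords(words: TimedWord[], fps: number): CaptionWord[] {
  if (!Number.isFinite(fps) || fps <= 0) {
    throw new RangeError("fps must be a positive number");
  }
  // delay (frames) = startSecond (seconds) * fps (frames per second)
  return words.map(({ word, startSecond }) => ({
    word,
    delay: startSecond * fps,
  }));
}
```

The result could then be passed as the `words` prop, for example `<Caption words={toCaptionWords(words, FPS)} />`, assuming your MDX setup lets you import the helper. Re-running it after each re-transcription keeps the delays in sync with the audio.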