I have XLRS, dyslexia, and ADHD.
I built Rift because every voice tool I tried fought how my brain works. This one doesn’t.
Voice to text. Text to voice. Entirely on your Mac. Nothing leaves your machine. Ever.
Why I built Rift
XLRS is a congenital eye condition that makes reading on screens harder. I depend on text-to-speech for hours a day — I wanted voices that don’t create their own fatigue.
Text doesn’t cooperate with my brain. I needed a way to speak instead of type and listen instead of read — without sending my voice to someone’s cloud.
My thinking doesn’t follow a straight line. Most voice tools cut you off after two seconds of silence. Rift waits until you’re actually done.
The features I built for myself turn out to help everyone.
What is XLRS?
X-linked retinoschisis is a rare genetic condition that affects the retina’s layers and central vision. It’s uncorrectable with glasses. Symptoms vary; prolonged screen reading often causes extra strain and fatigue.
Voice to Text
Speak naturally. A local model cleans and merges as you go — not just raw transcription.
You decide
when you're done.
My thoughts don’t follow a timer. ADHD means I pause mid-sentence to find the right word — other tools treated that pause as “done.”
No auto-endpointing
Speak. Pause. Think. Rift waits.
Other apps cut you off after 2 seconds of silence.
Others
"The quick brown—"
Cut off after pause
Rift
"The quick brown fox jumps over the lazy dog."
You press stop when ready
250ms
First-word capture
Your first word is never lost.
A 250ms lead-in buffer starts recording before you even finish pressing the button.
Buffered
Button pressed
Recording
"Hel—" is already captured
25s
Rolling context window
The model considers the last 25 seconds of audio.
It understands context, not just isolated words.
Live paste
Text appears in your app as you speak.
Real-time streaming with final reconciliation when you stop.
The quick brown fox jumps over the lazy dog.
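As a rough illustration of final reconciliation (a simplified, hypothetical helper, not Rift's implementation), one way to update already-pasted text is to keep the shared prefix and rewrite only the tail that changed:

```swift
// Hypothetical reconciliation helper: compare what was pasted with the final transcript
// and compute the smallest edit that fixes the tail.
func reconcile(pasted: String, finalText: String) -> (deleteCount: Int, insert: String) {
    let a = Array(pasted), b = Array(finalText)
    var i = 0
    while i < min(a.count, b.count), a[i] == b[i] { i += 1 }   // length of the common prefix
    return (deleteCount: a.count - i, insert: String(b[i...]))
}

// Example: the live paste guessed "jump", the final pass hears "jumps".
let step = reconcile(pasted: "The quick brown fox jump over",
                     finalText: "The quick brown fox jumps over the lazy dog.")
// step.deleteCount == 5, step.insert == "s over the lazy dog."
// Delete 5 characters at the cursor, insert the corrected tail, and the text matches.
```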
Auto-fix
Hallucination detection
If the first transcription guess is wrong, Rift detects the mistake and replaces it automatically.
No manual cleanup. No re-recording.
Real-time
Streaming transcription
Audio is processed in chunks as you speak.
No waiting for you to finish.
Silence polish
~5 seconds of quiet
After a few seconds of silence while dictating, Rift can polish what you already pasted — fillers, lists, grammar — using the same on-device model that powers final polish. Pauses aren’t wasted.
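Here is a toy sketch of how such a silence trigger could be wired up. It is illustrative only, and polishPastedText is a stand-in for whatever invokes the on-device model: each new burst of pasted words resets a five-second timer, and if the timer fires, the pause is spent polishing.

```swift
import Foundation

// Illustrative silence-polish trigger (hypothetical; not Rift's actual scheduling code).
final class SilencePolishTrigger {
    private var timer: Timer?
    private let quietThreshold: TimeInterval = 5.0
    var polishPastedText: () -> Void = {}          // stand-in for the on-device polish pass

    // Call whenever fresh text is pasted; every new word pushes the deadline back.
    func noteNewWords() {
        timer?.invalidate()
        timer = Timer.scheduledTimer(withTimeInterval: quietThreshold, repeats: false) { [weak self] _ in
            self?.polishPastedText()               // ~5 seconds of quiet: use the pause
        }
    }

    // Call when dictation stops; the final polish pass takes over from here.
    func stopDictation() {
        timer?.invalidate()
    }
}
```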
Polish modes
Verbatim keeps your words. Clean fixes obvious issues. Professional tightens tone more aggressively. You choose how much help you want.
Audio cues
Soft tones mark recording start and stop — so I get confirmation even when I’m not looking at the screen.
Hold to talk · Auto-send
Toggle or hold-to-talk dictation — pick what fits your hands and attention. Optional auto-send after paste (e.g. Return in chat apps) reduces friction after a burst of speech.
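The auto-send step is conceptually tiny. A hedged sketch, assuming the standard CGEvent approach to posting a Return keypress after the paste lands (not necessarily how Rift does it):

```swift
import CoreGraphics

// Post a synthetic Return keypress after pasting (36 is kVK_Return on macOS).
// Sketch only; like any synthetic key event, this needs Accessibility permission.
func pressReturn() {
    let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: 36, keyDown: true)
    let keyUp   = CGEvent(keyboardEventSource: nil, virtualKey: 36, keyDown: false)
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```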
Text to Voice
Select text. Multiple engines. Natural speech — including code.
First word in
150 milliseconds.
I can’t always read the screen for long stretches. When audio is how I read, the first syllable can’t arrive late.
150ms
First-word latency
You hear the first word before the sentence finishes generating.
No loading spinners. No waiting.
Seamless
Clause-level streaming
The next sentence is synthesized while the current one plays.
No gaps. No stutters. Continuous audio.
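Conceptually this is a one-clause lookahead. A simplified sketch, with synthesize and play as hypothetical stand-ins for the real engine and audio output:

```swift
// One-clause lookahead: while clause N plays, clause N+1 is already being synthesized.
// Sketch under assumed APIs; `synthesize` and `play` are placeholders.
func speak(_ clauses: [String],
           synthesize: @escaping (String) async -> [Float],
           play: @escaping ([Float]) async -> Void) async {
    guard let first = clauses.first else { return }
    var pending = Task { await synthesize(first) }       // start on clause 1 immediately
    for next in clauses.dropFirst() {
        let audio = await pending.value
        pending = Task { await synthesize(next) }        // overlap: synthesize the next clause ...
        await play(audio)                                // ... while the current one plays
    }
    await play(await pending.value)                      // last clause
}
```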
20ms
Audio poll rate
The audio buffer is checked every 20 milliseconds.
Imperceptible latency between chunks.
50 checks per second
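A bare-bones version of that poll loop might look like this. The buffer and playback hooks are hypothetical stand-ins, not Rift's audio stack:

```swift
import Foundation

// Illustrative 20 ms poll loop: 50 times a second, hand any finished samples to playback.
final class AudioPoller {
    private var timer: Timer?
    var drainSynthesizedSamples: () -> [Float] = { [] }   // stand-in: pulls ready audio
    var enqueueForPlayback: ([Float]) -> Void = { _ in }  // stand-in: feeds the output device

    func start() {
        timer = Timer.scheduledTimer(withTimeInterval: 0.020, repeats: true) { [weak self] _ in
            guard let self else { return }
            let samples = self.drainSynthesizedSamples()
            if !samples.isEmpty { self.enqueueForPlayback(samples) }
        }
    }

    func stop() { timer?.invalidate() }
}
```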
Pause anywhere
Tap to pause mid-syllable. Tap again to resume from the exact position.
Your place is never lost.
Tap to pause
0.5× – 2×
Playback speed
Speed up for skimming. Slow down for comprehension.
Adjust in real-time without restarting.
Code Talk
IDEs, terminals, docs
In Cursor, VS Code, Terminal, or developer sites, Rift detects context and can transform technical text into speakable phrasing before TTS — e.g. CSS overflow-x: hidden becomes “overflow-x set to hidden.” I read a lot of code with my ears.
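To give a flavor of the transformation (Rift's actual transform is model-driven; this rule-based version is only a toy illustration):

```swift
import Foundation

// Toy Code Talk rewrite for a single CSS declaration, e.g.
// speakableCSS("overflow-x: hidden;")  ->  "overflow-x set to hidden"
func speakableCSS(_ line: String) -> String {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
        .replacingOccurrences(of: ";", with: "")
    let parts = trimmed.split(separator: ":", maxSplits: 1).map {
        $0.trimmingCharacters(in: .whitespaces)
    }
    guard parts.count == 2 else { return trimmed }
    return "\(parts[0]) set to \(parts[1])"
}
```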
Engines & voices
Kokoro (stable) and Chatterbox variants (including MLX fast paths) — pick what sounds right. 14+ voices across engines. Download extra models from the tray when you need them.
⌃3 — Show / hide / pause
Global shortcuts: ⌃1 read selection, ⌃2 dictation, ⌃3 show or hide the widget and pause audio. Your flow stays one keystroke away.
How it works
Two pipelines. Local speech models. A local language model for merge, correction, and polish. Zero cloud for your voice and text.
Start dictation
Capture
Core Audio streams from your microphone with a 250ms lead-in buffer. Your first word is never lost.
Process
Parakeet runs on the Neural Engine and GPU via MLX. 25 seconds of rolling context. Real-time streaming.
Paste
Text appears at your cursor as you speak. Final reconciliation when you stop. On-device Gemma 4 polishes your text — see Intelligence.
Speak selected text
Select
Highlight text in any app or copy to clipboard. Rift reads whatever you give it.
Synthesize
Kokoro or Chatterbox generates audio clause-by-clause. First word in 150ms. Next sentence ready before current ends. Code Talk may run an LLM transform first in developer contexts.
Play
Audio streams to system output. Pause anywhere, resume from exact position. 0.5× to 2× speed.
Four phases of local intelligence
Rift runs local language models (Gemma 4 and Qwen3, via MLX) alongside Parakeet and TTS. It’s not just transcription — it’s understanding and cleanup that never leaves your Mac.
- Merge — New words fold into what came before. Fewer duplicates and jumps as the recognizer updates.
- Correct — Grammar, punctuation, and light formatting in real time. Numbers and phrasing stay intentional.
- Extract — When the model revises earlier audio, only genuinely new words are appended.
- Polish — On pause or stop (and silence polish), fillers can be trimmed, lists formatted, sentences smoothed — per your polish mode.
A fast Qwen3 0.6B tier handles real-time phases; a deeper Gemma 4 E4B tier powers polish and Code Talk transforms. All on-device.
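In pseudocode terms, the split looks roughly like this (the types are illustrative; only the phase and model names come from the description above):

```swift
// Illustrative phase-to-tier routing; not Rift's actual types.
enum Phase { case merge, correct, extract, polish }

enum ModelTier: String {
    case fast = "Qwen3 0.6B"     // real-time phases while you speak
    case deep = "Gemma 4 E4B"    // polish and Code Talk transforms
}

// Route each phase to the tier that can keep up with it.
func tier(for phase: Phase) -> ModelTier {
    switch phase {
    case .merge, .correct, .extract: return .fast
    case .polish:                    return .deep
    }
}
```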
Privacy.
That's Rift.
Your voice never leaves your Mac. Ever. When assistive tech is how you read and write, that isn’t abstract — it’s dignity.
Zero file I/O
Audio is synthesized directly to memory. Nothing is written to disk. Nothing persists after you close the app.
Who Rift is for
The same design choices that help me help anyone who wants patient dictation, natural TTS, and privacy.
Dyslexia
I think better out loud than on paper. Rift turns speech into text without fighting me — and reads it back when I need to hear what I wrote.
ADHD
My brain takes detours. Rift doesn’t punish pauses, restarts, or nonlinear thinking — and live paste keeps the feedback loop tight.
Low vision
I can’t always read the screen. Rift reads to me — fast first word, adjustable speed, pause anywhere — with voices I can listen to for hours.
Motor differences
Hold-to-talk, global shortcuts, and no forced auto-cutoff mean less reliance on precise timing and fewer repeated keypresses.
Writers & thinkers
If you think by talking, Rift captures voice privately — on your Mac, under your control.
See the patience in action
A simplified replay: streaming text, a long pause, then an auto-fix.
Demo transcript
Recording starts → text streams in → a 3s pause (where other tools would have cut off) → speech resumes → a wrong word auto-corrects.
Performance
Tested on real hardware. Real workloads.
M1 MacBook Air
M3 MacBook Pro
M4 Mac Studio
How Rift compares
| Feature | Rift | Whisper.cpp | macOS Dictation |
|---|---|---|---|
| On-device | Yes | Yes | Partial |
| No auto-cutoff | Yes | No | No |
| Live paste | Yes | No | Yes |
| First-word buffer | 250ms | None | None |
| Local LLM polish | Yes | No | No |
| TTS included | Yes | No | Basic |
| TTS latency (first word) | ~150ms | N/A | ~500ms |
| Voice & text privacy | 100% local | 100% local | Cloud fallback |
Requirements
- macOS: Sonoma 14.0+
- Chip: Apple Silicon
- RAM: 8GB minimum
- Disk: ~2GB
The visual metaphor
Nothing escapes.
A black hole where your data goes in — and stays in.
The Singularity
Your Mac is the center of gravity. All processing happens here — voice recognition, text synthesis, everything. No servers. No cloud. One machine.
The Accretion Disk
Your voice flows in like matter spiraling toward the event horizon. It gets captured, processed, transformed. The warm glow is energy being released as computation.
The Event Horizon
The point of no return — but in a good way. Once your words enter Rift, they never leave your machine. No telemetry, no uploads, no exceptions.
Gravitational Lensing
Just as light bends around a black hole, your voice bends into text. Text bends into voice. Transformation through the most powerful force — local compute.
How the visualization works
Raymarching
Volumetric rendering via signed distance functions. The sphere-traced shader calculates 128 iterations per pixel to simulate photon paths.
Schwarzschild geodesics
Light follows the curved spacetime geometry of a non-rotating black hole. The photon sphere appears as a bright ring at 1.5× the event horizon radius.
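That 1.5× figure falls straight out of the Schwarzschild solution: for a black hole of mass M, the photon sphere sits at three halves of the Schwarzschild radius.

$$r_\mathrm{s} = \frac{2GM}{c^{2}}, \qquad r_\mathrm{ph} = \frac{3GM}{c^{2}} = \tfrac{3}{2}\,r_\mathrm{s}$$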
Keplerian disk
Accretion disk particles orbit according to Kepler's laws. Inner particles orbit faster, creating the characteristic spiral structure.
ACES tonemapping
Film-industry-standard color grading compresses the HDR luminance into displayable range while preserving the fiery accretion glow.
Visualization based on Singularity by MisterPrada
Frequently asked
Does it work offline?
Yes for voice and text — all STT, TTS, and LLM work runs on your Mac. Rift does not send your speech or transcripts to the cloud. Optional Check for Updates and first-run model downloads use the network; you can use the app fully offline after models are cached.
What languages are supported?
Currently English only. The underlying Parakeet model supports multiple languages, and we're working on enabling them in future updates.
What voices and TTS engines are available?
Kokoro ships with multiple built-in voices. Chatterbox variants (including MLX fast options) add more voices and can be downloaded from the app when needed. Custom voice cloning is not available yet.
What is Code Talk?
In IDEs, terminals, and docs sites, Rift can transform technical text into natural speech before TTS — so code and symbols are spoken clearly instead of letter-by-letter noise.
What is Silence Polish?
When you pause for a few seconds while dictating, Rift can use that silence to clean up pasted text (fillers, lists, light grammar) using the on-device model — without sending anything off your Mac.
Is my voice data stored anywhere?
Never. Audio is processed in memory and discarded immediately. Nothing is written to disk or sent anywhere.
Why is the first run slow?
On first launch, Rift downloads and caches the ML models (~2GB). Subsequent launches are instant.
Does it work on Intel Macs?
No. Rift requires Apple Silicon (M1 or later) for the MLX machine learning framework.
Is Rift open source?
Yes. The full source code is available on GitHub under the MIT license.
How do I install it?
Download the DMG, drag to Applications, and launch. Apple Silicon (M1+) required. If macOS shows a security warning, check the installation guide for a quick fix. First launch downloads ~2GB of ML models.
The Technology
Built different.
Four pillars — a shared foundation, speech in, intelligence in the middle, speech out — all on Apple Silicon. No cloud for your content.
The Foundation
MLX
Apple's machine learning framework. Runs entirely on your Mac's Neural Engine and GPU.
Voice to Text
Parakeet
NVIDIA's state-of-the-art speech recognition, optimized for Apple Silicon.
Text to Voice
Kokoro
Neural TTS with natural voices; Chatterbox variants optional. Real-time synthesis.
Intelligence
Gemma 4 + Qwen3
Local LLMs for merge, correct, extract, polish, and Code Talk. A fast Qwen3 0.6B for real-time work plus Gemma 4 E4B for deep cleanup — all via MLX.
Rift
Your voice. Your Mac. Nothing else.
Download for macOS
Apple Silicon (M1+) · macOS 14+ · English