Why offline
Cloud speech-to-text is a solved problem if you can send the audio. The interesting work is when you cannot. A legal interview covered by privilege. A medical session. Internal recordings you would prefer not to hand to a third party for transcription, let alone for the third party's training data.
The market for on-device speech is real, but the tooling for it has been thin. The off-the-shelf desktop apps are slow, tied to paid subscriptions that still ping a server, or locked to a specific hardware vendor. I wanted a tool that was honest about its architecture: the audio file enters the process, the transcript comes out, and nothing crosses the boundary in either direction.
The technical choices
Whisper via CTranslate2
OpenAI's Whisper models are the obvious starting point. The reference PyTorch implementation, though, is slow on CPU. CTranslate2 reimplements transformer inference in optimized C++ with quantization support, dramatically cutting both memory use and latency.
In practice the same audio file that takes a reference Whisper model a minute on CPU runs in a fraction of that with faster-whisper / CTranslate2 at int8 quantization. The quality drop at int8 on the medium and large models is small enough that a human reviewer rarely catches a difference compared to the float16 baseline.
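To make the int8 claim concrete, here is a toy per-tensor quantization round trip in numpy. This is not CTranslate2's actual kernel, just the storage idea: weights become int8 plus a single float scale, a 4x memory cut versus float32 with small reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 512)).astype(np.float32)

# Per-tensor symmetric quantization: one scale maps float32 onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize and measure the round-trip error.
restored = q.astype(np.float32) * scale
err = np.abs(weights - restored).max()

print(f"storage: {weights.nbytes} B float32 -> {q.nbytes} B int8")
print(f"max round-trip error: {err:.4f}")
```

The real engine quantizes per-channel and fuses the dequantize into the matmul, but the memory arithmetic is the same: one byte per weight instead of four.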
Real-time waveform feedback
The thing that made the tool feel professional instead of academic was not the transcription quality. It was the waveform.
During recording or playback, the UI renders a live waveform driven from the audio buffer. Spikes show where speech is present. Silent gaps are visible. The user can see which slice of the recording is being processed and seek directly to a region. Pairing waveform-driven seeking with the segmented output Whisper produces gave the UI an editing-board feel rather than a black-box-with-a-progress-bar feel.
# waveform.py — buffer-driven peaks
import numpy as np

def compute_peaks(samples: np.ndarray, target_buckets: int) -> np.ndarray:
    """Collapse an audio buffer into one peak amplitude per display bucket."""
    chunk = max(1, len(samples) // target_buckets)
    chunks = np.array_split(samples[: chunk * target_buckets], target_buckets)
    return np.array([np.abs(c).max() for c in chunks], dtype=np.float32)
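A quick sanity check of the peak computation on a synthetic signal (the function is restated so the snippet runs standalone): silence produces flat buckets, speech-like energy produces spikes.

```python
import numpy as np

# Restated from above so this snippet is self-contained.
def compute_peaks(samples: np.ndarray, target_buckets: int) -> np.ndarray:
    chunk = max(1, len(samples) // target_buckets)
    chunks = np.array_split(samples[: chunk * target_buckets], target_buckets)
    return np.array([np.abs(c).max() for c in chunks], dtype=np.float32)

# One second of silence, then one second of a full-scale 440 Hz tone, at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
samples = np.concatenate([np.zeros(sr), np.sin(2 * np.pi * 440 * t)]).astype(np.float32)

peaks = compute_peaks(samples, target_buckets=8)
# The first four buckets cover the silence, the last four cover the tone.
```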
Segment-level output, not a wall of text
Whisper returns segments with start and end timestamps. The transcript view in the app respects those boundaries — each segment is its own clickable line. Clicking a segment seeks the audio to its start. Editing a segment edits only that segment's text. The user is never staring at an unbroken paragraph and trying to find where their attention was three minutes ago.
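The data model behind that view is tiny. A sketch of the shape, with field names that are my guess at a minimal version rather than the app's actual types, plus the reverse mapping from a playhead position back to the segment under it:

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str

def segment_at(segments: List[Segment], t: float) -> Optional[Segment]:
    """Find the segment containing playhead time t, if any.

    Segments are sorted by start time, so a binary search over the
    start times finds the candidate in O(log n).
    """
    starts = [s.start for s in segments]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < segments[i].end:
        return segments[i]
    return None

segs = [Segment(0.0, 2.4, "Hello."), Segment(2.4, 5.1, "Let's begin.")]
```

Click-to-seek is the forward direction (segment start to audio position); `segment_at` is what keeps the highlighted line in sync during playback.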
Packaging the offline promise
A tool that claims to be offline cannot ship as a Python script that pulls models from the internet on first run. The packaging had to make the offline promise legible.
- Models are downloaded once, into a known local cache directory under the user's data path. The download is an explicit, one-time UI step, not an invisible side-effect.
- After the first download, the app runs with the network off. I verify this by literally disabling the network adapter during integration tests on the build machine.
- No telemetry. No analytics. No update-check ping. The app does not call out, period. That is the product.
What surprised me
CPU was enough
I designed the architecture assuming I would eventually need to ship a GPU path. For the audiences this tool serves — attorneys with modern laptops, clinicians with workstations — the CPU path is already faster than real-time on the medium-quality model. The GPU path is a future-me problem, not a today-me problem.
The hard part was the editing experience
Transcription accuracy plateaued early. The remaining work was all about giving the human reviewer good tools to fix the mistakes Whisper made — keyboard navigation across segments, undo on edits, fast playback for the segment under the cursor, export to plain text or DOCX. The interface is now most of the value of the tool.
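Undo on segment edits, for example, falls out of a plain stack of (segment index, previous text) snapshots. A sketch of the idea, not the app's actual implementation:

```python
from typing import List, Tuple

class TranscriptEditor:
    """Segment texts plus an undo stack of (index, previous_text) snapshots."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = list(texts)
        self._undo: List[Tuple[int, str]] = []

    def edit(self, index: int, new_text: str) -> None:
        # Snapshot the old text before overwriting it.
        self._undo.append((index, self.texts[index]))
        self.texts[index] = new_text

    def undo(self) -> None:
        if self._undo:
            index, old_text = self._undo.pop()
            self.texts[index] = old_text

ed = TranscriptEditor(["Hello.", "Let's begin."])
ed.edit(1, "Let us begin.")
ed.undo()
```

Because edits are scoped to a segment, undo never has to diff a wall of text; it just restores one line.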
What I would do next
- Speaker diarization. Whisper doesn't separate speakers. A diarization step layered on top would turn the tool from a transcription tool into a meeting tool.
- Vocabulary biasing. Domain-specific names — case numbers, medical terms, proper nouns — are where on-device transcription still fumbles. A user-managed glossary that reweights the decoder would close most of that gap.
- Cross-platform installers. Currently focused on Windows. macOS and Linux installer pipelines are the next unglamorous deliverable.
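On the vocabulary point, one lightweight lever exists today: Whisper conditions its decoder on an initial prompt, and faster-whisper exposes this as the `initial_prompt` argument to `transcribe`. Seeding that prompt from a user-managed glossary nudges spellings toward the domain terms. A sketch of the glossary-to-prompt step; the prompt wording and truncation limit are my own choices, not a documented recipe:

```python
from typing import List

def glossary_prompt(terms: List[str], max_chars: int = 600) -> str:
    """Fold a user glossary into a prompt string.

    Whisper's prompt window is limited, so the list is deduplicated
    (case-insensitively) and truncated rather than sent wholesale.
    """
    unique: List[str] = []
    seen: set = set()
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique.append(term)
    prompt = "Vocabulary: " + ", ".join(unique)
    return prompt[:max_chars]

prompt = glossary_prompt(["Ms. Okonkwo", "habeas corpus", "Xarelto", "habeas corpus"])
# then, with faster-whisper: model.transcribe(audio_path, initial_prompt=prompt)
```

This is biasing by conditioning, not a true decoder reweighting, but it closes part of the proper-noun gap with no model changes at all.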