// Case Study 03  ·  on-device AI

Offline AI Transcriber — speech-to-text without the cloud

A local-first speech-to-text tool built on Whisper via CTranslate2. Zero cloud dependency. Real-time waveform feedback, CPU inference fast enough to compete with cloud APIs, and built for the contexts where you cannot ship audio off the machine in the first place.

Role
Sole engineer — research, build, packaging
Stack
Python, CTranslate2, faster-whisper, PyQt, NumPy
Status
Working tool, used in private engagements

Why offline

Cloud speech-to-text is a solved problem if you can send the audio. The interesting work is when you cannot. A legal interview covered by privilege. A medical session. Internal recordings you would prefer not to hand to a third party for transcription, let alone for the third party's training data.

The market for on-device speech is real, but the tooling has been thin. The off-the-shelf desktop apps are slow, are paid subscriptions that still ping a server, or are locked to a specific hardware vendor. I wanted a tool that was honest about its architecture: the audio file enters the process, the transcript comes out, and nothing crosses that boundary in either direction.

The technical choices

Whisper via CTranslate2

OpenAI's Whisper models are the obvious starting point. The reference PyTorch implementation, though, is slow on CPU. CTranslate2 reimplements transformer inference in optimized C++ with quantization, dramatically cutting both memory use and latency.

In practice, an audio file that takes the reference Whisper model a minute on CPU runs in a fraction of that time with faster-whisper / CTranslate2 at int8 quantization. The quality drop at int8 on the medium and large models is small enough that a human reviewer rarely catches a difference against the float16 baseline.
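
A minimal sketch of that path using faster-whisper's standard API — the model size and file name here are placeholders, not the shipped configuration:

# transcribe.py — minimal faster-whisper sketch
from faster_whisper import WhisperModel

# int8 quantization on CPU is the configuration described above
model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("interview.wav")  # placeholder path
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text.strip()}")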

Real-time waveform feedback

The thing that made the tool feel professional instead of academic was not the transcription quality. It was the waveform.

During recording or playback, the UI renders a live waveform driven from the audio buffer. Spikes show where speech is present. Silent gaps are visible. The user can see which slice of the recording is being processed and seek directly to a region. Pairing waveform-driven seeking with the segmented output Whisper produces gave the UI an editing-board feel rather than a black-box-with-a-progress-bar feel.

# waveform.py — buffer-driven peaks
import numpy as np

def compute_peaks(samples: np.ndarray, target_buckets: int) -> np.ndarray:
    # Pad short buffers so np.array_split never yields an empty chunk.
    if len(samples) < target_buckets:
        samples = np.pad(samples, (0, target_buckets - len(samples)))
    chunk = len(samples) // target_buckets
    chunks = np.array_split(samples[: chunk * target_buckets], target_buckets)
    return np.array([np.abs(c).max() for c in chunks], dtype=np.float32)
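
Typical use, assuming one bucket per horizontal pixel of the waveform widget (the names at this call site are hypothetical):

# hypothetical call site — one peak per pixel column
peaks = compute_peaks(audio_buffer, waveform_widget.width())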

Segment-level output, not a wall of text

Whisper returns segments with start and end timestamps. The transcript view in the app respects those boundaries — each segment is its own clickable line. Clicking a segment seeks the audio to its start. Editing a segment edits only that segment's text. The user is never staring at an unbroken paragraph and trying to find where their attention was three minutes ago.
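
A sketch of the click-to-seek wiring. The widget calls are real PyQt5 APIs; the player backend with a seek(seconds) method and the segment shape are assumptions standing in for the app's own audio layer:

# transcript_view.py — sketch: one clickable line per Whisper segment
from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import QListWidget, QListWidgetItem

class TranscriptView(QListWidget):
    def __init__(self, player, segments, parent=None):
        super().__init__(parent)
        self.player = player  # assumed backend with a seek(seconds) method
        for seg in segments:  # each segment has .start, .end, .text
            item = QListWidgetItem(f"[{seg.start:7.2f}s]  {seg.text.strip()}")
            item.setData(Qt.UserRole, seg.start)  # remember where it begins
            self.addItem(item)
        self.itemClicked.connect(self._seek_to_segment)

    def _seek_to_segment(self, item):
        # Clicking a segment jumps playback to that segment's start time.
        self.player.seek(item.data(Qt.UserRole))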

Packaging the offline promise

A tool that claims to be offline cannot ship as a Python script that pulls models from the internet on first run. The packaging had to make the offline promise legible.
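
One way to make it legible, sketched under the assumption of a PyInstaller-style bundle with the model weights shipped inside it — the bundle layout and directory names here are hypothetical:

# model_loading.py — sketch: resolve the model from a bundled directory
import sys
from pathlib import Path
from faster_whisper import WhisperModel

def bundled_model_dir() -> Path:
    # PyInstaller-style bundles expose their unpack root via sys._MEIPASS;
    # fall back to the source tree when running unpackaged.
    root = Path(getattr(sys, "_MEIPASS", Path(__file__).parent))
    return root / "models" / "medium-int8"  # hypothetical layout

def load_model() -> WhisperModel:
    # A local path plus local_files_only=True means the first run works
    # with the network cable unplugged — nothing is fetched on demand.
    return WhisperModel(str(bundled_model_dir()), device="cpu",
                        compute_type="int8", local_files_only=True)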

What surprised me

CPU was enough

I designed the architecture assuming I would eventually need to ship a GPU path. For the audiences this tool serves — attorneys with modern laptops, clinicians with workstations — the CPU path is already faster than real-time on the medium model. The GPU path is a future-me problem, not a today-me problem.

The hard part was the editing experience

Transcription accuracy plateaued early. The remaining work was all about giving the human reviewer good tools to fix the mistakes Whisper made — keyboard navigation across segments, undo on edits, fast playback for the segment under the cursor, export to plain text or DOCX. The interface is now most of the value of the tool.
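
The plain-text exporter, as a sketch — the timestamp format is an assumption, and the DOCX path (not shown) walks the same segment list:

# export.py — sketch of the plain-text export path
def export_txt(segments, path: str) -> None:
    # One line per segment keeps the export diff-able and greppable.
    with open(path, "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(f"[{seg.start:8.2f} -> {seg.end:8.2f}] {seg.text.strip()}\n")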

What I would do next

The GPU path, first: as noted above, the architecture was designed assuming one would eventually be needed.

Need on-device AI integration? I take on offline / on-prem speech, NLP, and inference pipeline work end-to-end.
Start a conversation