The tech behind DijiFlow: Whisper, CoreML and Apple Silicon, explained simply
Apple Silicon · 4 min read

How DijiFlow Dictate turns your voice into text entirely on your device, using Whisper, CoreML and Apple Silicon. Explained in plain language.

Most dictation feels like magic until you ask the obvious question: where does my voice actually go? With DijiFlow Dictate, the honest answer is nowhere. You speak, text appears at your cursor, and not one word travels to a server. No account, no upload, no telemetry. That is not a privacy promise bolted on at the end — it falls out of how the app is built.

Three well-understood pieces make it work: Whisper, the open speech model that does the listening; CoreML, the framework that runs it efficiently on a Mac; and Apple Silicon, the chip that makes it feel instant. No prior knowledge needed — here is each one in plain terms.

  • ~12 MB
    app download
  • 300 MB–6 GB
    speech model, downloaded once
  • Neural Engine
    where the work actually runs

Whisper: turning sound into words

At the heart of DijiFlow Dictate is Whisper, a family of open-source speech recognition models from OpenAI. A speech model is, in plain terms, a very large pattern-matcher trained on enormous amounts of audio paired with its transcript. From that data it learns how the sounds people make line up with the words they mean — across accents, background noise, and the natural pauses of real speech.

When you dictate, Whisper predicts the most likely sequence of words from your microphone audio, and it is genuinely good at it. On clear speech it reaches around 98% accuracy, and the most capable version, Whisper large-v3, handles more than 90 languages. Because it reads context rather than matching one word at a time, it copes with the messy way people actually talk.
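
For readers curious how an accuracy figure like that is typically measured: speech recognition is usually scored by word error rate (WER), the number of word-level edits needed to turn the model's output into the correct transcript, divided by the length of the correct transcript. The sketch below assumes "accuracy" means 1 minus WER, which the article does not state; the function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words (classic Levenshtein table, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: "friday" was misheard as "fried day".
reference = "please send the quarterly report to the finance team before friday"
hypothesis = "please send the quarterly report to the finance team before fried day"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}  accuracy (1 - WER): {1 - wer:.2%}")
```

One misheard word in a short sentence moves the score a lot, which is why accuracy figures are normally averaged over long recordings rather than single phrases.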

Why the model is a separate download

This is the part that surprises people: the app and the intelligence are two different files. DijiFlow Dictate itself is tiny — about 12 MB. The Whisper speech models are the heavy part, ranging from roughly 300 MB to 6 GB depending on which you pick. Larger models are generally more accurate on difficult audio but ask more of your hardware, so you choose the balance of speed and accuracy that suits you.
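
To make that trade-off concrete, here is a minimal sketch of picking the largest model that fits a storage budget. The catalog is hypothetical: only the 300 MB and 6 GB endpoints come from the figures above, while the model names and the middle size are placeholders, not DijiFlow Dictate's actual list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str      # hypothetical label, not a real model identifier
    size_mb: int   # illustrative download size in megabytes


# Assumed catalog: endpoints match the article's range, the middle entry is invented.
CATALOG = [
    ModelOption("small-example", 300),
    ModelOption("medium-example", 1500),
    ModelOption("large-example", 6000),
]


def pick_model(catalog: List[ModelOption], budget_mb: int) -> Optional[ModelOption]:
    """Return the largest model that fits within the budget, or None if none fit."""
    fitting = [m for m in catalog if m.size_mb <= budget_mb]
    return max(fitting, key=lambda m: m.size_mb) if fitting else None


choice = pick_model(CATALOG, budget_mb=2000)
print(choice.name if choice else "no model fits this budget")
```

Size is used here as a rough stand-in for accuracy; a real choice would also weigh speed and available memory.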

You download a model once; after that, transcription needs no internet at all. That one-time step is exactly why your voice can stay on your machine.
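
The "download once" idea can be sketched in a few lines: check for the model file on disk, and only reach for the network when it is missing. This is an illustration of the pattern, not the app's code. The `ensure_model` helper, the file name, and the fake download function are all assumptions.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(model_path: Path, fetch: Callable[[], bytes]) -> bool:
    """Make sure the model file exists locally.

    Returns True if a download was needed, False if the cached copy was used.
    """
    if model_path.exists():
        return False  # already on disk: no network access at all
    model_path.parent.mkdir(parents=True, exist_ok=True)
    model_path.write_bytes(fetch())
    return True


# Stand-in for a real download so the sketch runs offline; it records each call.
download_log = []


def fake_fetch() -> bytes:
    download_log.append("download")
    return b"model-weights-placeholder"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "models" / "example-model.bin"  # hypothetical file name
    first_run = ensure_model(path, fake_fetch)    # downloads
    second_run = ensure_model(path, fake_fetch)   # uses the cached file

print(first_run, second_run, len(download_log))
```

After the first run, every later call takes the cached branch, which is the point at which transcription stops depending on a connection.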

CoreML: running the model the efficient way

A speech model is only useful if it runs quickly without draining your battery. That is the job of CoreML, Apple's framework for running machine-learning models on its devices. Think of it as a translator and traffic controller: it takes a model like Whisper and works out how to run it using the most suitable parts of your hardware.

DijiFlow Dictate uses WhisperKit, an open-source runtime that compiles Whisper to run through CoreML. That means the model is optimized specifically for Apple hardware instead of running as generic, slower code, so dictation keeps pace with natural speech while staying light on system resources. And it all happens locally — CoreML is not a cloud service. It is part of the operating system that lets apps run intelligent features privately and offline.

Apple Silicon: the chip that makes it instant

The last piece is the hardware. On modern Macs that means Apple Silicon: the M-series chips, in machines running macOS 14 or later. These chips include a dedicated Neural Engine, a section of silicon built specifically to run machine-learning models fast and with very little power. The GPU is also available through Metal when extra horsepower helps.

You configure none of this. CoreML spreads the work across the right hardware automatically; you just speak, and the chip handles it in real time. That is the quiet advantage of on-device design: the same silicon that makes your Mac feel responsive is what makes private dictation practical.

The whole pipeline, start to finish

Put the three pieces in order and the round trip is short — and entirely local.

  1. You speak

    Audio from your microphone is captured on the device, never streamed anywhere.

  2. Whisper runs via CoreML on the Neural Engine

    The model turns sound into words right there on Apple Silicon, in real time.

  3. Text lands at your cursor

    Your words appear in whatever app you are already in. Nothing is sent away, so there is nothing to leak.
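
The three steps above can be sketched as a small loop. Whisper models take 16 kHz audio and process it in 30-second windows; everything else here is a stand-in. The `dictate` function, the fake transcriber, and the list that plays the role of the cursor are hypothetical, and a real dictation app may well feed the model shorter chunks to keep latency low.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # Whisper models expect 16 kHz audio
WINDOW_SECONDS = 30      # and process it in 30-second windows


def windows(samples: Sequence[float]) -> Iterator[Sequence[float]]:
    """Split captured audio into consecutive windows of at most 30 seconds."""
    step = SAMPLE_RATE_HZ * WINDOW_SECONDS
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def dictate(
    samples: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],
    insert_text: Callable[[str], None],
) -> None:
    """Capture -> local model -> cursor. No step in this loop touches the network."""
    for window in windows(samples):
        text = transcribe(window)  # step 2: the on-device model
        if text:
            insert_text(text)      # step 3: text lands at the cursor


# Stand-ins so the sketch runs anywhere: a fake model and a list as the "cursor".
def fake_transcribe(window: Sequence[float]) -> str:
    return f"[{len(window) / SAMPLE_RATE_HZ:.0f}s of speech]"


typed: List[str] = []
audio = [0.0] * (SAMPLE_RATE_HZ * 70)  # step 1: 70 seconds of captured samples
dictate(audio, fake_transcribe, typed.append)
print(typed)
```

Because the transcriber is just a local function call, privacy here is a property of the structure: there is no upload step to disable.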

Key takeaway

The model lives on your machine, so transcription is just local computation — there is no server in the loop to store, intercept, or quietly retain your voice.

Download once, then offline forever

Most voice tools are cloud services wearing an app icon: they need a connection and an account every time, because the model that understands you lives on someone else's hardware. DijiFlow Dictate flips that — you install once, and the work moves to your chip.

How it behaves                     DijiFlow (on-device)   Cloud dictation
Works after a one-time download    Yes                    No
Transcribes with no internet       Yes                    No
No account required                Yes                    No
Audio stays on your device         Yes                    No

And beyond the Mac

The same on-device approach extends to Windows 10 and 11, where DijiFlow Dictate runs on AMD, Intel, and NVIDIA GPUs. NVIDIA hardware needs CUDA and a current driver, but the principle is identical: your speech is transcribed locally, and nothing is sent away.

No trick, just good engineering

There is nothing exotic happening here. DijiFlow Dictate is built on open, well-understood technology: Whisper for the speech model, WhisperKit and CoreML for the runtime, and Apple Silicon for the hardware. The decision that matters is keeping all of it on your device, so you get the convenience of modern dictation without ever handing your voice to anyone, whether you are on the Free, Trial, or Pro plan.

If you would rather feel it than read about it, you can try private, on-device dictation free for 30 days on the Pro plan.

The DijiFlow Dictate Team

Notes on private, on-device dictation and getting more done with your voice.
