How-To
Build a Local Voice Assistant with Home Assistant
Step-by-step guide to building a fully local voice assistant using Whisper, Ollama, and Home Assistant — zero cloud calls, complete privacy.
Quick answer: Can I build a fully local voice assistant that rivals Alexa or Google?
Yes. Using Whisper.cpp for speech-to-text (under 200 MB RAM), Ollama for local LLM intent processing, and Home Assistant for device control, you can build a voice pipeline that makes zero external API calls. Over 17,000 Home Assistant users already run local STT/LLM pipelines. A CPU-only setup on an i5-6400 with 8 GB RAM handles basic commands; an RTX 3060 12 GB enables 2–5x faster inference for conversational interactions.
Executive Summary
Every voice command sent to Alexa or Google Assistant is recorded, transmitted to remote servers, and processed alongside metadata that reveals far more than the words you spoke — your accent, emotional state, age range, household composition, and daily routines are all extractable from audio streams[1]. For privacy-conscious households, this is not an acceptable trade-off for the convenience of turning off a light by voice.
The good news: in 2026, the local voice assistant stack has matured to the point where a fully self-hosted pipeline — speech recognition, natural language understanding, and device control — runs on modest hardware with response times that approach cloud assistants. The three key components are Whisper.cpp for speech-to-text (STT), Ollama for running local large language models (LLMs), and Home Assistant for orchestrating device control across 2,000+ integrations.
This guide walks through hardware selection, software installation, model choices, performance tuning, and privacy implications. The goal is a working voice assistant that processes every interaction locally, with zero audio streaming to external servers.
Bottom line: A local voice assistant is no longer a science project. With the right hardware and an afternoon of setup, you get privacy-respecting voice control that handles 90%+ of typical smart home commands — and it keeps getting better as open-source models improve monthly.
1) The Architecture: How the Three Components Work Together
A local voice pipeline has three stages, each handled by a dedicated component. Understanding how they connect is essential for troubleshooting and optimization.
Stage 1 — Wake Word Detection: A lightweight model (openWakeWord or Porcupine) runs continuously on minimal resources, listening for a trigger phrase like “Hey Jarvis” or “OK Home.” This uses less than 50 MB of RAM and near-zero CPU. When the wake word is detected, the microphone stream is routed to Stage 2.
Stage 2 — Speech-to-Text (STT): Whisper.cpp receives the audio buffer and transcribes it into text. Whisper is OpenAI’s open-source speech recognition model, and Whisper.cpp is a highly optimized C/C++ port that runs efficiently on CPUs. The base.en model (under 200 MB RAM) handles English transcription with word-error rates comparable to commercial cloud STT services. Larger models (small.en at 500 MB, medium.en at 1.5 GB) improve accuracy for accented speech and noisy environments.
Stage 3 — Intent Processing + Device Control: The transcribed text is passed to a local LLM running in Ollama, which parses the intent (“turn off the living room lights”) and generates a structured command. Home Assistant’s Assist API receives this command and executes the appropriate action through its integration with physical devices — Zigbee bulbs, Z-Wave locks, Wi-Fi plugs, or anything in its 2,000+ device ecosystem.
| Pipeline stage | Component | Resource usage | Latency |
|---|---|---|---|
| Wake word | openWakeWord | < 50 MB RAM | < 100 ms |
| Speech-to-text | Whisper.cpp (base.en) | < 200 MB RAM | 1–3 s (CPU) |
| Intent + command | Ollama (Phi-3-mini) | 2.5 GB RAM | 0.5–2 s (CPU) |
| Device execution | Home Assistant | Negligible | < 200 ms |
| Total end-to-end | All components | ~3 GB RAM | 2–5 s (CPU) |
The entire pipeline runs inside your LAN. No DNS lookups to external AI endpoints, no audio uploads, no metadata leakage. Even your ISP cannot observe voice assistant traffic because there is none.
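The hand-off between stages can be sketched in a few lines of Python. This is an illustrative stand-in, not the real pipeline: the rule-based `parse_intent` below plays the role the Ollama LLM fills in Stage 3, and the command dictionary it emits is a hypothetical shape, similar in spirit to what Home Assistant consumes but not its actual schema.

```python
import re

# Hypothetical stand-in for the Stage 3 LLM: in the real pipeline an LLM
# produces this structured command from free-form transcribed text.
def parse_intent(utterance: str) -> dict:
    """Map a transcribed utterance to a structured smart-home command."""
    text = utterance.lower().strip()
    # Crude action detection: any "off"/"stop" keyword means turn_off.
    action = "turn_off" if re.search(r"\b(off|stop)\b", text) else "turn_on"
    # Crude entity extraction: everything after "the" (minus a trailing
    # on/off) is treated as the target device.
    m = re.search(r"\bthe\s+(.+?)(?:\s+(?:on|off))?$", text)
    target = m.group(1) if m else "unknown"
    return {"service": f"light.{action}", "entity": target.replace(" ", "_")}

command = parse_intent("turn off the living room lights")
print(command)  # {'service': 'light.turn_off', 'entity': 'living_room_lights'}
```

In the real stack, Stage 3 swaps this regex for an LLM call, which handles phrasing variation ("kill the lights in the lounge") far better than hand-written rules ever could.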
2) Hardware Requirements: Three Tiers
The beauty of this stack is its hardware flexibility. You can start on a Raspberry Pi and scale up to a GPU-accelerated server as your needs grow.
Budget Tier: Raspberry Pi 5 (~$100 total)
The RPi 5 with 8 GB RAM is the minimum viable platform. It runs Whisper.cpp tiny.en and quantized 3B-parameter models (like Phi-3-mini Q4) with acceptable latency for simple commands. Expect 4–8 second total response times. This tier is best for single-room setups with basic commands like lights, switches, and thermostats.
Mid-Range Tier: Intel NUC or Mini-PC ($200–400)
An Intel NUC or equivalent mini-PC with an i5-6400 (or newer) CPU and 8–16 GB RAM provides a significant performance jump. Whisper.cpp base.en runs in real time, and Ollama can serve quantized 7B–8B-parameter models (Llama 3 8B at 4.7 GB Q4, Qwen 2.5 7B) with 2–4 second response times. This is the sweet spot for most households: reliable, responsive, and powerful enough for multi-room voice control.
Enthusiast Tier: GPU-Accelerated Server ($500–1,000+)
An NVIDIA RTX 3060 12 GB (or RTX 4060) delivers 2–5x faster inference across both Whisper and Ollama. Whisper medium.en transcribes in under 1 second, and 7B-parameter LLMs respond in 0.5–1.5 seconds — approaching Alexa-level responsiveness. This tier supports simultaneous multi-room voice processing, larger context windows for conversational interactions, and experimental features like voice-driven automation creation.
| Hardware tier | CPU / GPU | RAM | Whisper model | LLM model | Response time |
|---|---|---|---|---|---|
| Budget (RPi 5) | ARM Cortex-A76 | 8 GB | tiny.en | Phi-3-mini Q4 (2.5 GB) | 4–8 s |
| Mid-range (NUC) | i5-6400+ | 8–16 GB | base.en | Llama 3 8B Q4 (4.7 GB) | 2–4 s |
| Enthusiast (GPU) | i7-8700K + RTX 3060 12 GB | 16–32 GB | medium.en | Qwen 2.5 7B | 1–2 s |
Even the RPi 5 can run quantized 3B–7B models using GGUF format in Ollama. The 4-bit quantized Phi-3-mini fits in 2.5 GB and handles structured intent parsing well. For conversational depth (follow-up questions, multi-step automations), upgrade to 7B+ models on mid-range or GPU hardware.
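As a sanity check on the sizes above, a rough RAM estimate for a 4-bit quantized model can be computed from its parameter count. The constants below (about 4.5 bits per weight for Q4_K_M-style quantization, plus roughly 1 GB of runtime overhead) are back-of-the-envelope assumptions, not exact GGUF accounting:

```python
def estimate_q4_ram_gb(params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    Assumes ~4.5 bits per weight (Q4_K_M keeps some tensors at higher
    precision) plus a fixed allowance for KV cache and runtime buffers.
    These constants are approximations for planning, not exact accounting.
    """
    bytes_per_weight = 4.5 / 8
    return round(params_billion * bytes_per_weight + overhead_gb, 1)

for name, billions in [("Phi-3-mini", 3.8), ("Llama 3 8B", 8.0)]:
    print(f"{name}: ~{estimate_q4_ram_gb(billions)} GB")
# Phi-3-mini: ~3.1 GB, Llama 3 8B: ~5.5 GB — in line with the tier table.
```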
3) Software Installation: Step-by-Step
This section covers the practical installation path. We assume Home Assistant is already running (either as HAOS, a Docker container, or a supervised install).
Install Whisper.cpp Add-on
In Home Assistant, navigate to Settings → Add-ons → Add-on Store. Search for “Whisper” and install the official Whisper add-on. Configure the model size (base.en recommended for mid-range hardware) and language. Start the add-on — it automatically exposes a Wyoming protocol endpoint that Home Assistant’s voice pipeline can consume.
Install Ollama
Ollama runs as a separate service on the same machine or a dedicated server. Install via the official one-liner:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Pull your preferred model:

```shell
ollama pull phi3:mini
```
For Home Assistant integration, use the built-in Ollama integration (Settings → Devices & Services) pointed at http://localhost:11434, or install the Ollama Conversation custom component via HACS.
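Before wiring Ollama into the voice pipeline, it is worth confirming the server answers on its REST API. The sketch below targets Ollama's documented `/api/generate` endpoint; the model name and prompt are placeholders to adapt:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "phi3:mini") -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }

def ask_ollama(prompt: str) -> str:
    """POST the prompt and return the model's text response."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask_ollama("Reply OK if you can hear me.")  # requires a running Ollama server
```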
Configure the Voice Pipeline
In Home Assistant, go to Settings → Voice Assistants → Add Assistant. Select:
- Speech-to-text: Whisper (local add-on)
- Conversation agent: Ollama (local)
- Text-to-speech: Piper (local, optional — adds spoken responses)
Assign this pipeline to your voice satellite devices (ESP32-S3-BOX, ATOM Echo, or any Wyoming-compatible microphone).
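Once the pipeline exists, you can exercise it from a script, with no microphone involved, by posting text to Home Assistant's `/api/conversation/process` REST endpoint. The host and token below are placeholders; generate a long-lived access token from your Home Assistant user profile:

```python
import json
import urllib.request

# Placeholder host and token — substitute your Home Assistant URL and a
# long-lived access token (Profile → Security → Long-lived access tokens).
HA_URL = "http://homeassistant.local:8123/api/conversation/process"
HA_TOKEN = "YOUR_LONG_LIVED_TOKEN"

def build_conversation_request(text: str, language: str = "en") -> dict:
    """Body for Home Assistant's /api/conversation/process endpoint."""
    return {"text": text, "language": language}

def send_command(text: str) -> dict:
    """POST a text command through the configured voice pipeline."""
    body = json.dumps(build_conversation_request(text)).encode()
    req = urllib.request.Request(
        HA_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# send_command("turn off the living room lights")  # needs a reachable HA instance
```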
Model Selection Guide
Choosing the right LLM model is critical for balancing response quality and hardware constraints.
| Model | Size (GGUF Q4) | RAM required | Best for | Notes |
|---|---|---|---|---|
| Phi-3-mini | 2.5 GB | 4 GB | Basic commands, structured intent | Fastest on low-end hardware |
| Llama 3 8B | 4.7 GB | 8 GB | Conversational, multi-step commands | Best quality-to-size ratio |
| Qwen 2.5 7B | 4.4 GB | 8 GB | Multilingual households | Strong non-English performance |
| Mistral 7B | 4.1 GB | 8 GB | General purpose | Good instruction following |
| Llama 3 70B (Q4) | 40 GB | 48 GB | Power users with high-end GPUs | Near-cloud quality |
For most households, Llama 3 8B on mid-range hardware provides the best balance of response quality, latency, and resource usage.
4) Privacy Comparison: Local Pipeline vs Cloud Assistants
The privacy advantages of a local pipeline extend far beyond “my audio stays home.” Understanding exactly what cloud assistants capture helps quantify the value of going local.
What Alexa and Google Assistant collect from every interaction[1]:
- Full audio recording (stored 3–18 months by default)
- Transcribed text and parsed intent
- Device metadata: model, firmware, network environment
- Acoustic fingerprint: extractable age range, gender, accent, and emotional state
- Temporal patterns: wake/sleep schedule, occupancy, meal times
- Cross-device correlation: linking voice data with shopping, browsing, and location history
What a local pipeline collects: Nothing external. Audio is processed in RAM, intent is parsed locally, and the command is executed on your LAN. No recordings are stored unless you explicitly configure retention. No metadata leaves your network. Your ISP sees zero voice-related traffic.
The metadata dimension is often underestimated. Research published by Northeastern University demonstrated that household size, socioeconomic indicators, and health conditions can be inferred from smart speaker audio through acoustic analysis alone[1]. None of this data extraction is possible when audio never leaves your hardware.
| Privacy dimension | Alexa / Google | Local pipeline (Whisper + Ollama + HA) |
|---|---|---|
| Audio recording storage | Cloud (months) | None (RAM only, unless configured) |
| Voice metadata extraction | Age, accent, emotion | Impossible (local processing) |
| Behavioral pattern logging | Yes (routines, schedules) | No external logging |
| Cross-service data linking | Yes (shopping, ads) | No vendor accounts |
| Third-party data sharing | Per privacy policy | None |
| Law enforcement access | Subpoena / warrant | Physical device seizure only |
Over 17,000 Home Assistant users now run local STT/LLM voice pipelines, according to Home Assistant analytics[3]. This is not a fringe experiment — it is a growing movement toward voice interaction without surveillance.
5) Performance Tuning and Optimization
Out-of-the-box performance is good, but targeted tuning can significantly improve response times and accuracy.
Whisper Optimization
- Use quantized models: Whisper.cpp supports INT8 and Q5 quantization, reducing model size and inference time by 30–40% with minimal accuracy loss.
- Set language explicitly: Configuring `language: en` in the Whisper add-on skips automatic language detection, saving 200–500 ms per utterance.
- Adjust VAD sensitivity: Voice Activity Detection (VAD) determines when you have finished speaking. Tuning the silence threshold prevents premature cutoffs for slower speakers and unnecessary waiting for fast speakers.
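To build intuition for the VAD tuning above, here is a toy end-of-utterance detector based on per-frame RMS energy. It is a simplified sketch of the idea, not the add-on's actual VAD implementation, and the parameter names are invented:

```python
def utterance_ended(frames, silence_threshold=0.01, silence_frames=30):
    """Return True once `silence_frames` consecutive frames fall below the
    RMS energy threshold, i.e. the speaker has stopped talking.

    `frames` is an iterable of audio frames (lists of float samples).
    Raising `silence_frames` tolerates slow speakers (fewer premature
    cutoffs); lowering it makes the pipeline respond sooner.
    """
    quiet = 0
    for frame in frames:
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        quiet = quiet + 1 if rms < silence_threshold else 0
        if quiet >= silence_frames:
            return True
    return False

speech = [[0.3] * 160] * 20      # 20 loud frames: someone speaking
silence = [[0.001] * 160] * 40   # 40 near-silent frames: done speaking
print(utterance_ended(speech + silence))  # True
```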
Ollama Optimization
- System prompt engineering: A well-crafted system prompt that constrains the LLM to smart-home intents dramatically reduces token generation. Instead of generating paragraphs, the model outputs structured commands in 10–30 tokens.
- Context window limiting: Set `num_ctx` to 2048 (or lower) for simple command processing. Smaller context windows reduce memory pressure and generation time.
- GPU offloading: If you have a dedicated GPU, set `num_gpu` (the number of transformer layers to offload to VRAM) in Ollama's Modelfile or request options, achieving a 2–5x speedup.
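The three tunings combine naturally into a single request body. The `system`, `options.num_ctx`, and `options.num_gpu` fields follow Ollama's generate API; the prompt wording, layer count, and token cap are illustrative assumptions to adapt for your setup:

```python
# Constrain the model to emit a short JSON intent rather than prose.
# The prompt wording is illustrative; tune it for your entity names.
SYSTEM_PROMPT = (
    "You are a smart-home intent parser. Reply with ONLY a JSON object "
    'of the form {"service": ..., "entity": ...}. No explanations.'
)

def build_tuned_request(utterance: str) -> dict:
    """Request body for Ollama's /api/generate with all three tunings applied."""
    return {
        "model": "phi3:mini",
        "system": SYSTEM_PROMPT,
        "prompt": utterance,
        "stream": False,
        "options": {
            "num_ctx": 2048,    # small context: commands do not need 8k tokens
            "num_gpu": 33,      # layers to offload to VRAM; 0 = CPU-only
            "num_predict": 64,  # hard cap on generated tokens
        },
    }

req = build_tuned_request("turn off the porch light")
print(req["options"])  # {'num_ctx': 2048, 'num_gpu': 33, 'num_predict': 64}
```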
Network Optimization
- Wired microphone satellites: ESP32-S3-BOX devices on Ethernet (via USB-Ethernet adapters) eliminate Wi-Fi jitter that causes audio dropouts.
- Dedicated VLAN: Isolate voice assistant traffic on a dedicated VLAN to prevent bandwidth contention with video streams from security cameras.
| Optimization | Impact | Effort |
|---|---|---|
| Whisper quantized model | 30–40% faster STT | Low (config change) |
| Explicit language setting | 200–500 ms saved | Low (config change) |
| LLM system prompt tuning | 50–70% fewer tokens | Medium (prompt engineering) |
| GPU layer offloading | 2–5x faster LLM | Low (if GPU available) |
| Wired microphone satellites | Eliminates audio jitter | Medium (hardware change) |
6) Practical Limitations and Honest Assessment
A local voice assistant in 2026 is genuinely capable, but it is not a perfect replacement for Alexa or Google Assistant in every scenario. Honest assessment helps set expectations.
What works well: Simple device control (lights, switches, thermostats, locks), status queries (“Is the garage door open?”), timer and alarm setting, basic scene activation, and structured multi-step commands (“Turn off all downstairs lights and lock the front door”).
What is still improving: Complex conversational follow-ups, music playback integration (requires additional setup with Music Assistant), natural-sounding TTS (Piper is good but not Alexa-quality), multi-language switching mid-sentence, and large-vocabulary proper noun recognition (contact names, obscure device names).
What does not work locally: Third-party cloud skills (ordering pizza, calling Uber, real-time weather from APIs). Some of these can be replicated with local alternatives (weather from a self-hosted API, shopping lists via HA integrations), but the ecosystem is thinner than commercial platforms.
| Capability | Cloud assistants | Local pipeline |
|---|---|---|
| Basic device control | Excellent | Excellent |
| Conversational follow-ups | Excellent | Good (improving) |
| Third-party skills/actions | Extensive | Limited |
| Music playback | Native | Requires Music Assistant |
| TTS voice quality | Excellent | Good (Piper) |
| Multi-language | Excellent | Good (model-dependent) |
| Privacy | Poor | Excellent |
The trajectory matters: local models improve every month. Llama 3 and Phi-3 represent a generational leap over what was available 12 months ago, and the trend shows no signs of slowing.
Privacy comparison: local voice pipeline vs cloud assistants
| Product | Cloud required | Local processing | Mandatory account | Offline control | Score / 10 |
|---|---|---|---|---|---|
| Whisper + Ollama + Home Assistant | No | Full (RAM-only processing) | No | Excellent | 9.7 |
| Amazon Alexa | Yes | None | Yes (Amazon) | None | 3.5 |
| Google Assistant | Yes | None | Yes (Google) | None | 3.8 |
Build checklist: local voice assistant setup
- Choose hardware tier based on your response-time expectations and budget.
- Install Home Assistant (HAOS recommended for beginners, Docker for advanced users).
- Install and configure the Whisper add-on with your preferred model size.
- Install Ollama and pull a suitable LLM model (Phi-3-mini for budget, Llama 3 8B for mid-range).
- Configure a voice pipeline in Home Assistant connecting Whisper → Ollama → device control.
- Set up at least one microphone satellite (ATOM Echo or ESP32-S3-BOX).
- Tune Whisper VAD sensitivity and Ollama system prompt for your household's speech patterns.
- Test offline: disconnect internet and verify the full pipeline still processes commands.
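Step 8's offline test can be partially automated: verify that every configured endpoint resolves to a private or loopback address, which confirms no pipeline component lives in a cloud. A small sketch, where the example URLs are placeholders for the hosts in your own configuration:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_local_endpoint(url: str) -> bool:
    """True if the URL's host resolves to a private or loopback address,
    i.e. the pipeline component lives on your LAN, not in a cloud."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable — treat as not verified
    return addr.is_private or addr.is_loopback

# Placeholder endpoints — substitute the hosts from your pipeline config.
for url in ("http://127.0.0.1:11434", "http://192.168.1.20:8123"):
    print(url, is_local_endpoint(url))
```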
FAQ
Frequently Asked Questions
Can a Raspberry Pi 5 really run a local voice assistant?
Yes. The RPi 5 with 8 GB RAM runs Whisper.cpp tiny.en and quantized 3B-parameter LLMs like Phi-3-mini Q4. Expect 4–8 second response times for simple commands. It works well for single-room setups with basic device control. For multi-room or conversational use, upgrade to a mini-PC or GPU-equipped server.
How does the response time compare to Alexa or Google Assistant?
On GPU-accelerated hardware (RTX 3060 12 GB), total end-to-end response time is 1–2 seconds — within the range users find acceptable. On CPU-only mid-range hardware, expect 2–4 seconds. Cloud assistants typically respond in 1–3 seconds, so the gap has narrowed significantly. The trade-off is privacy, not speed.
Do I need to retrain Whisper on my voice?
No. Whisper models are pre-trained on 680,000 hours of multilingual audio and generalize well across accents and speaking styles. The base.en model handles most English accents without any fine-tuning. For non-English languages, select the appropriate multilingual model variant.
What microphone hardware works best for wake-word detection?
The ESP32-S3-BOX-3 ($45) is the most popular dedicated satellite — it includes a microphone array, speaker, and display. The M5Stack ATOM Echo ($13) is a minimal, affordable option. Both support the Wyoming protocol natively and integrate seamlessly with Home Assistant’s voice pipeline.
Can I use this alongside Alexa or Google Assistant as a fallback?
Yes. Home Assistant supports multiple voice pipelines simultaneously. You can configure a local pipeline as the default and keep a cloud assistant in a specific room as a fallback for complex queries or music playback. The local pipeline handles privacy-sensitive commands while the cloud assistant handles tasks that require internet services.
Primary Sources Table
| ID | Title / Description | Direct URL |
|---|---|---|
| [1] | Northeastern University — Smart Speaker Privacy and Acoustic Metadata Extraction Research | https://www.northeastern.edu/news/smart-speaker-privacy-research/ |
| [2] | Whisper.cpp — High-performance C/C++ port of OpenAI’s Whisper speech recognition model | https://github.com/ggerganov/whisper.cpp |
| [3] | Home Assistant Voice Pipeline Analytics — community adoption data for local STT and LLM pipelines | https://analytics.home-assistant.io/ |
| [4] | Ollama Documentation — local LLM serving, model management, and API reference | https://ollama.com/docs |
| [5] | Home Assistant Year of the Voice — official roadmap and progress updates for local voice control | https://www.home-assistant.io/blog/2024/12/voice-chapter-6/ |
Conclusion
Building a local voice assistant is one of the most impactful privacy upgrades you can make to a smart home. The Whisper + Ollama + Home Assistant stack eliminates audio surveillance, metadata extraction, and cloud dependency in a single architectural decision. The hardware investment is modest ($100–500 depending on tier), the software is entirely free and open-source, and the community of 17,000+ local voice users provides a deep well of configuration examples and troubleshooting support.
The technology is ready. The models are capable. The only remaining variable is whether you value your household’s voice privacy enough to spend an afternoon setting it up.
For further reading, explore our guides on Local Voice Assistant Without Amazon or Google 2026, Best Open Source Smart Home Software for Privacy Advocates, and Privacy-Focused Smart Home Hubs Compatible with Home Assistant.
Footnotes

1. Northeastern University research on smart speaker acoustic metadata extraction, including age, emotion, and household composition inference.
2. Whisper.cpp — high-performance C/C++ implementation of OpenAI's Whisper model for local speech-to-text processing.
3. Home Assistant analytics showing 17,000+ users running local STT/LLM voice pipelines as of early 2026.