How-To
Build a Local Voice Assistant with Home Assistant
Step-by-step guide to building a fully local voice assistant using Whisper, Ollama, and Home Assistant — zero cloud calls, complete privacy.
Quick answer: Can I build a fully local voice assistant that rivals Alexa or Google?
Yes. Using Whisper.cpp for speech-to-text (under 200 MB RAM), Ollama for local LLM intent processing, and Home Assistant for device control, you can build a voice pipeline that makes zero external API calls. Over 17,000 Home Assistant users already run local STT/LLM pipelines. A CPU-only setup on an i5-6400 with 8 GB RAM handles basic commands; an RTX 3060 12 GB enables 2–5x faster inference for conversational interactions.
Executive Summary
Every voice command sent to Alexa or Google Assistant is recorded, transmitted to remote servers, and processed alongside metadata that reveals far more than the words you spoke — your accent, emotional state, age range, household composition, and daily routines are all extractable from audio streams[1]. For privacy-conscious households, this is not an acceptable trade-off for the convenience of turning off a light by voice.
The good news: in 2026, the local voice assistant stack has matured to the point where a fully self-hosted pipeline — speech recognition, natural language understanding, and device control — runs on modest hardware with response times that approach cloud assistants. The three key components are Whisper.cpp for speech-to-text (STT), Ollama for running local large language models (LLMs), and Home Assistant for orchestrating device control across 2,000+ integrations.
This guide walks through hardware selection, software installation, model choices, performance tuning, and privacy implications. The goal is a working voice assistant that processes every interaction locally, with zero audio streaming to external servers.
Bottom line: A local voice assistant is no longer a science project. With the right hardware and an afternoon of setup, you get privacy-respecting voice control that handles 90%+ of typical smart home commands — and it keeps getting better as open-source models improve monthly.
1) The Architecture: How the Three Components Work Together
A local voice pipeline has three stages, each handled by a dedicated component. Understanding how they connect is essential for troubleshooting and optimization.
Stage 1 — Wake Word Detection: A lightweight model (openWakeWord or Porcupine) runs continuously on minimal resources, listening for a trigger phrase like “Hey Jarvis” or “OK Home.” This uses less than 50 MB of RAM and near-zero CPU. When the wake word is detected, the microphone stream is routed to Stage 2.
Stage 2 — Speech-to-Text (STT): Whisper.cpp receives the audio buffer and transcribes it into text. Whisper is OpenAI’s open-source speech recognition model, and Whisper.cpp is a highly optimized C/C++ port that runs efficiently on CPUs. The base.en model (under 200 MB RAM) handles English transcription with word-error rates comparable to commercial cloud STT services. Larger models (small.en at 500 MB, medium.en at 1.5 GB) improve accuracy for accented speech and noisy environments.
Stage 3 — Intent Processing + Device Control: The transcribed text is passed to a local LLM running in Ollama, which parses the intent (“turn off the living room lights”) and generates a structured command. Home Assistant’s Assist API receives this command and executes the appropriate action through its integration with physical devices — Zigbee bulbs, Z-Wave locks, Wi-Fi plugs, or anything in its 2,000+ device ecosystem.
| Pipeline stage | Component | Resource usage | Latency |
|---|---|---|---|
| Wake word | openWakeWord | < 50 MB RAM | < 100 ms |
| Speech-to-text | Whisper.cpp (base.en) | < 200 MB RAM | 1–3 s (CPU) |
| Intent + command | Ollama (Phi-3-mini) | 2.5 GB RAM | 0.5–2 s (CPU) |
| Device execution | Home Assistant | Negligible | < 200 ms |
| Total end-to-end | All components | ~3 GB RAM | 2–5 s (CPU) |
The entire pipeline runs inside your LAN. No DNS lookups to external AI endpoints, no audio uploads, no metadata leakage. Even your ISP cannot observe voice assistant traffic because there is none.
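The hand-off between stages can be sketched in a few lines of Python. This is an illustrative stand-in, not the real pipeline: the rule-based `parse_intent` below plays the role the Ollama LLM fills in Stage 3, and the command dictionary it emits is a hypothetical shape, similar in spirit to what Home Assistant consumes but not its actual schema.

```python
import re

# Hypothetical stand-in for the Stage 3 LLM: in the real pipeline an LLM
# produces this structured command from free-form transcribed text.
def parse_intent(utterance: str) -> dict:
    """Map a transcribed utterance to a structured smart-home command."""
    text = utterance.lower().strip()
    # Crude action detection: any "off"/"stop" keyword means turn_off.
    action = "turn_off" if re.search(r"\b(off|stop)\b", text) else "turn_on"
    # Crude entity extraction: everything after "the" (minus a trailing
    # on/off) is treated as the target device.
    m = re.search(r"\bthe\s+(.+?)(?:\s+(?:on|off))?$", text)
    target = m.group(1) if m else "unknown"
    return {"service": f"light.{action}", "entity": target.replace(" ", "_")}

command = parse_intent("turn off the living room lights")
print(command)  # {'service': 'light.turn_off', 'entity': 'living_room_lights'}
```

In the real stack, Stage 3 swaps this regex for an LLM call, which handles phrasing variation ("kill the lights in the lounge") far better than hand-written rules ever could.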
2) Hardware Requirements: Three Tiers
The beauty of this stack is its hardware flexibility. You can start on a Raspberry Pi and scale up to a GPU-accelerated server as your needs grow.
Budget Tier: Raspberry Pi 5 (~$100 total)
The RPi 5 with 8 GB RAM is the minimum viable platform. It runs Whisper.cpp tiny.en and quantized 3B-parameter models (like Phi-3-mini Q4) with acceptable latency for simple commands. Expect 4–8 second total response times. This tier is best for single-room setups with basic commands like lights, switches, and thermostats.
Mid-Range Tier: Intel NUC or Mini-PC ($200–400)
An Intel NUC or equivalent mini-PC with an i5-6400 (or newer) CPU and 8–16 GB RAM provides a significant performance jump. Whisper.cpp base.en runs in real time, and Ollama can serve quantized 7B–8B-parameter models (Llama 3 8B at 4.7 GB Q4, Qwen 2.5 7B) with 2–4 second response times. This is the sweet spot for most households: reliable, responsive, and powerful enough for multi-room voice control.
Enthusiast Tier: GPU-Accelerated Server ($500–1,000+)
An NVIDIA RTX 3060 12 GB (or RTX 4060) delivers 2–5x faster inference across both Whisper and Ollama. Whisper medium.en transcribes in under 1 second, and 7B-parameter LLMs respond in 0.5–1.5 seconds — approaching Alexa-level responsiveness. This tier supports simultaneous multi-room voice processing, larger context windows for conversational interactions, and experimental features like voice-driven automation creation.
| Hardware tier | CPU / GPU | RAM | Whisper model | LLM model | Response time |
|---|---|---|---|---|---|
| Budget (RPi 5) | ARM Cortex-A76 | 8 GB | tiny.en | Phi-3-mini Q4 (2.5 GB) | 4–8 s |
| Mid-range (NUC) | i5-6400+ | 8–16 GB | base.en | Llama 3 8B Q4 (4.7 GB) | 2–4 s |
| Enthusiast (GPU) | i7-8700K + RTX 3060 12 GB | 16–32 GB | medium.en | Qwen 2.5 7B | 1–2 s |
Even the RPi 5 can run quantized 3B–7B models using GGUF format in Ollama. The 4-bit quantized Phi-3-mini fits in 2.5 GB and handles structured intent parsing well. For conversational depth (follow-up questions, multi-step automations), upgrade to 7B+ models on mid-range or GPU hardware.
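As a sanity check on the sizes above, a rough RAM estimate for a 4-bit quantized model can be computed from its parameter count. The constants below (about 4.5 bits per weight for Q4_K_M-style quantization, plus roughly 1 GB of runtime overhead) are back-of-the-envelope assumptions, not exact GGUF accounting:

```python
def estimate_q4_ram_gb(params_billion: float, overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    Assumes ~4.5 bits per weight (Q4_K_M keeps some tensors at higher
    precision) plus a fixed allowance for KV cache and runtime buffers.
    These constants are approximations for planning, not exact accounting.
    """
    bytes_per_weight = 4.5 / 8
    return round(params_billion * bytes_per_weight + overhead_gb, 1)

for name, billions in [("Phi-3-mini", 3.8), ("Llama 3 8B", 8.0)]:
    print(f"{name}: ~{estimate_q4_ram_gb(billions)} GB")
# Phi-3-mini: ~3.1 GB, Llama 3 8B: ~5.5 GB — in line with the tier table.
```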
3) Software Installation: Step-by-Step
This section covers the practical installation path. We assume Home Assistant is already running (either as HAOS, a Docker container, or a supervised install).
Install Whisper.cpp Add-on
In Home Assistant, navigate to Settings → Add-ons → Add-on Store. Search for “Whisper” and install the official Whisper add-on. Configure the model size (base.en recommended for mid-range hardware) and language. Start the add-on — it automatically exposes a Wyoming protocol endpoint that Home Assistant’s voice pipeline can consume.
Install Ollama
Ollama runs as a separate service on the same machine or a dedicated server. Install via the official one-liner:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Pull your preferred model:

```shell
ollama pull phi3:mini
```
For Home Assistant integration, use the built-in Ollama integration (Settings → Devices & Services) pointed at http://localhost:11434, or install the Ollama Conversation custom component via HACS.
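Before wiring Ollama into the voice pipeline, it is worth confirming the server answers on its REST API. The sketch below targets Ollama's documented `/api/generate` endpoint; the model name and prompt are placeholders to adapt:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "phi3:mini") -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }

def ask_ollama(prompt: str) -> str:
    """POST the prompt and return the model's text response."""
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask_ollama("Reply OK if you can hear me.")  # requires a running Ollama server
```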
Configure the Voice Pipeline
In Home Assistant, go to Settings → Voice Assistants → Add Assistant. Select:
- Speech-to-text: Whisper (local add-on)
- Conversation agent: Ollama (local)
- Text-to-speech: Piper (local, optional — adds spoken responses)
Assign this pipeline to your voice satellite devices (ESP32-S3-BOX, ATOM Echo, or any Wyoming-compatible microphone).
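Once the pipeline exists, you can exercise it from a script, with no microphone involved, by posting text to Home Assistant's `/api/conversation/process` REST endpoint. The host and token below are placeholders; generate a long-lived access token from your Home Assistant user profile:

```python
import json
import urllib.request

# Placeholder host and token — substitute your Home Assistant URL and a
# long-lived access token (Profile → Security → Long-lived access tokens).
HA_URL = "http://homeassistant.local:8123/api/conversation/process"
HA_TOKEN = "YOUR_LONG_LIVED_TOKEN"

def build_conversation_request(text: str, language: str = "en") -> dict:
    """Body for Home Assistant's /api/conversation/process endpoint."""
    return {"text": text, "language": language}

def send_command(text: str) -> dict:
    """POST a text command through the configured voice pipeline."""
    body = json.dumps(build_conversation_request(text)).encode()
    req = urllib.request.Request(
        HA_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# send_command("turn off the living room lights")  # needs a reachable HA instance
```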
Model Selection Guide
Choosing the right LLM model is critical for balancing response quality and hardware constraints.
| Model | Size (GGUF Q4) | RAM required | Best for | Notes |
|---|---|---|---|---|
| Phi-3-mini | 2.5 GB | 4 GB | Basic commands, structured intent | Fastest on low-end hardware |
| Llama 3 8B | 4.7 GB | 8 GB | Conversational, multi-step commands | Best quality-to-size ratio |
| Qwen 2.5 7B | 4.4 GB | 8 GB | Multilingual households | Strong non-English performance |
| Mistral 7B | 4.1 GB | 8 GB | General purpose | Good instruction following |
| Llama 3 70B (Q4) | 40 GB | 48 GB | Power users with high-end GPUs | Near-cloud quality |
For most households, Llama 3 8B on mid-range hardware provides the best balance of response quality, latency, and resource usage.
4) Privacy Comparison: Local Pipeline vs Cloud Assistants
The privacy advantages of a local pipeline extend far beyond “my audio stays home.” Understanding exactly what cloud assistants capture helps quantify the value of going local.
What Alexa and Google Assistant collect from every interaction[1]:
- Full audio recording (stored 3–18 months by default)
- Transcribed text and parsed intent
- Device metadata: model, firmware, network environment
- Acoustic fingerprint: extractable age range, gender, accent, and emotional state
- Temporal patterns: wake/sleep schedule, occupancy, meal times
- Cross-device correlation: linking voice data with shopping, browsing, and location history
What a local pipeline collects: Nothing external. Audio is processed in RAM, intent is parsed locally, and the command is executed on your LAN. No recordings are stored unless you explicitly configure retention. No metadata leaves your network. Your ISP sees zero voice-related traffic.
The metadata dimension is often underestimated. Research published by Northeastern University demonstrated that household size, socioeconomic indicators, and health conditions can be inferred from smart speaker audio through acoustic analysis alone[1]. None of this data extraction is possible when audio never leaves your hardware.
| Privacy dimension | Alexa / Google | Local pipeline (Whisper + Ollama + HA) |
|---|---|---|
| Audio recording storage | Cloud (months) | None (RAM only, unless configured) |
| Voice metadata extraction | Age, accent, emotion | Impossible (local processing) |
| Behavioral pattern logging | Yes (routines, schedules) | No external logging |
| Cross-service data linking | Yes (shopping, ads) | No vendor accounts |
| Third-party data sharing | Per privacy policy | None |
| Law enforcement access | Subpoena / warrant | Physical device seizure only |
Over 17,000 Home Assistant users now run local STT/LLM voice pipelines, according to Home Assistant analytics[3]. This is not a fringe experiment — it is a growing movement toward voice interaction without surveillance.
5) Performance Tuning and Optimization
Out-of-the-box performance is good, but targeted tuning can significantly improve response times and accuracy.
Whisper Optimization
- Use quantized models: Whisper.cpp supports INT8 and Q5 quantization, reducing model size and inference time by 30–40% with minimal accuracy loss.
- Set language explicitly: Configuring `language: en` in the Whisper add-on skips automatic language detection, saving 200–500 ms per utterance.
- Adjust VAD sensitivity: Voice Activity Detection (VAD) determines when you have finished speaking. Tuning the silence threshold prevents premature cutoffs for slower speakers and unnecessary waiting for fast speakers.
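To build intuition for the VAD tuning above, here is a toy end-of-utterance detector based on per-frame RMS energy. It is a simplified sketch of the idea, not the add-on's actual VAD implementation, and the parameter names are invented:

```python
def utterance_ended(frames, silence_threshold=0.01, silence_frames=30):
    """Return True once `silence_frames` consecutive frames fall below the
    RMS energy threshold, i.e. the speaker has stopped talking.

    `frames` is an iterable of audio frames (lists of float samples).
    Raising `silence_frames` tolerates slow speakers (fewer premature
    cutoffs); lowering it makes the pipeline respond sooner.
    """
    quiet = 0
    for frame in frames:
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        quiet = quiet + 1 if rms < silence_threshold else 0
        if quiet >= silence_frames:
            return True
    return False

speech = [[0.3] * 160] * 20      # 20 loud frames: someone speaking
silence = [[0.001] * 160] * 40   # 40 near-silent frames: done speaking
print(utterance_ended(speech + silence))  # True
```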
Ollama Optimization
- System prompt engineering: A well-crafted system prompt that constrains the LLM to smart-home intents dramatically reduces token generation. Instead of generating paragraphs, the model outputs structured commands in 10–30 tokens.
- Context window limiting: Set `num_ctx` to 2048 (or lower) for simple command processing. Smaller context windows reduce memory pressure and generation time.
- GPU offloading: If you have a dedicated GPU, set `num_gpu` (the number of transformer layers to offload to VRAM) in Ollama's Modelfile or request options, achieving a 2–5x speedup.
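The three tunings combine naturally into a single request body. The `system`, `options.num_ctx`, and `options.num_gpu` fields follow Ollama's generate API; the prompt wording, layer count, and token cap are illustrative assumptions to adapt for your setup:

```python
# Constrain the model to emit a short JSON intent rather than prose.
# The prompt wording is illustrative; tune it for your entity names.
SYSTEM_PROMPT = (
    "You are a smart-home intent parser. Reply with ONLY a JSON object "
    'of the form {"service": ..., "entity": ...}. No explanations.'
)

def build_tuned_request(utterance: str) -> dict:
    """Request body for Ollama's /api/generate with all three tunings applied."""
    return {
        "model": "phi3:mini",
        "system": SYSTEM_PROMPT,
        "prompt": utterance,
        "stream": False,
        "options": {
            "num_ctx": 2048,    # small context: commands do not need 8k tokens
            "num_gpu": 33,      # layers to offload to VRAM; 0 = CPU-only
            "num_predict": 64,  # hard cap on generated tokens
        },
    }

req = build_tuned_request("turn off the porch light")
print(req["options"])  # {'num_ctx': 2048, 'num_gpu': 33, 'num_predict': 64}
```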
Network Optimization
- Wired microphone satellites: ESP32-S3-BOX devices on Ethernet (via USB-Ethernet adapters) eliminate Wi-Fi jitter that causes audio dropouts.
- Dedicated VLAN: Isolate voice assistant traffic on a dedicated VLAN to prevent bandwidth contention with video streams from security cameras.
| Optimization | Impact | Effort |
|---|---|---|
| Whisper quantized model | 30–40% faster STT | Low (config change) |
| Explicit language setting | 200–500 ms saved | Low (config change) |
| LLM system prompt tuning | 50–70% fewer tokens | Medium (prompt engineering) |
| GPU layer offloading | 2–5x faster LLM | Low (if GPU available) |
| Wired microphone satellites | Eliminates audio jitter | Medium (hardware change) |
6) Practical Limitations and Honest Assessment
A local voice assistant in 2026 is genuinely capable, but it is not a perfect replacement for Alexa or Google Assistant in every scenario. Honest assessment helps set expectations.
What works well: Simple device control (lights, switches, thermostats, locks), status queries (“Is the garage door open?”), timer and alarm setting, basic scene activation, and structured multi-step commands (“Turn off all downstairs lights and lock the front door”).
What is still improving: Complex conversational follow-ups, music playback integration (requires additional setup with Music Assistant), natural-sounding TTS (Piper is good but not Alexa-quality), multi-language switching mid-sentence, and large-vocabulary proper noun recognition (contact names, obscure device names).
What does not work locally: Third-party cloud skills (ordering pizza, calling Uber, real-time weather from APIs). Some of these can be replicated with local alternatives (weather from a self-hosted API, shopping lists via HA integrations), but the ecosystem is thinner than commercial platforms.
| Capability | Cloud assistants | Local pipeline |
|---|---|---|
| Basic device control | Excellent | Excellent |
| Conversational follow-ups | Excellent | Good (improving) |
| Third-party skills/actions | Extensive | Limited |
| Music playback | Native | Requires Music Assistant |
| TTS voice quality | Excellent | Good (Piper) |
| Multi-language | Excellent | Good (model-dependent) |
| Privacy | Poor | Excellent |
The trajectory matters: local models improve every month. Llama 3 and Phi-3 represent a generational leap over what was available 12 months ago, and the trend shows no signs of slowing.
Privacy comparison: local voice pipeline vs cloud assistants
| Product | Cloud required | Local processing | Mandatory account | Offline control | Score / 10 |
|---|---|---|---|---|---|
| Whisper + Ollama + Home Assistant | No | Full (RAM-only processing) | No | Excellent | 9.7 |
| Amazon Alexa | Yes | None | Yes (Amazon) | None | 3.5 |
| Google Assistant | Yes | None | Yes (Google) | None | 3.8 |
Build checklist: local voice assistant setup
- Choose hardware tier based on your response-time expectations and budget.
- Install Home Assistant (HAOS recommended for beginners, Docker for advanced users).
- Install and configure the Whisper add-on with your preferred model size.
- Install Ollama and pull a suitable LLM model (Phi-3-mini for budget, Llama 3 8B for mid-range).
- Configure a voice pipeline in Home Assistant connecting Whisper → Ollama → device control.
- Set up at least one microphone satellite (ATOM Echo or ESP32-S3-BOX).
- Tune Whisper VAD sensitivity and Ollama system prompt for your household's speech patterns.
- Test offline: disconnect internet and verify the full pipeline still processes commands.
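Step 8's offline test can be partially automated: verify that every configured endpoint resolves to a private or loopback address, which confirms no pipeline component lives in a cloud. A small sketch, where the example URLs are placeholders for the hosts in your own configuration:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_local_endpoint(url: str) -> bool:
    """True if the URL's host resolves to a private or loopback address,
    i.e. the pipeline component lives on your LAN, not in a cloud."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False  # unresolvable — treat as not verified
    return addr.is_private or addr.is_loopback

# Placeholder endpoints — substitute the hosts from your pipeline config.
for url in ("http://127.0.0.1:11434", "http://192.168.1.20:8123"):
    print(url, is_local_endpoint(url))
```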
FAQ
Frequently Asked Questions
Can a Raspberry Pi 5 really run a local voice assistant?
Yes. The RPi 5 with 8 GB RAM runs Whisper.cpp tiny.en and quantized 3B-parameter LLMs like Phi-3-mini Q4. Expect 4–8 second response times for simple commands. It works well for single-room setups with basic device control. For multi-room or conversational use, upgrade to a mini-PC or GPU-equipped server.
How does the response time compare to Alexa or Google Assistant?
On GPU-accelerated hardware (RTX 3060 12 GB), total end-to-end response time is 1–2 seconds — within the range users find acceptable. On CPU-only mid-range hardware, expect 2–4 seconds. Cloud assistants typically respond in 1–3 seconds, so the gap has narrowed significantly. The trade-off is privacy, not speed.
Do I need to retrain Whisper on my voice?
No. Whisper models are pre-trained on 680,000 hours of multilingual audio and generalize well across accents and speaking styles. The base.en model handles most English accents without any fine-tuning. For non-English languages, select the appropriate multilingual model variant.
What microphone hardware works best for wake-word detection?
The ESP32-S3-BOX-3 ($45) is the most popular dedicated satellite — it includes a microphone array, speaker, and display. The M5Stack ATOM Echo ($13) is a minimal, affordable option. Both support the Wyoming protocol natively and integrate seamlessly with Home Assistant’s voice pipeline.
Can I use this alongside Alexa or Google Assistant as a fallback?
Yes. Home Assistant supports multiple voice pipelines simultaneously. You can configure a local pipeline as the default and keep a cloud assistant in a specific room as a fallback for complex queries or music playback. The local pipeline handles privacy-sensitive commands while the cloud assistant handles tasks that require internet services.
Primary Sources Table
| ID | Title / Description | Direct URL |
|---|---|---|
| [1] | Northeastern University — Smart Speaker Privacy and Acoustic Metadata Extraction Research | https://www.northeastern.edu/news/smart-speaker-privacy-research/ |
| [2] | Whisper.cpp — High-performance C/C++ port of OpenAI’s Whisper speech recognition model | https://github.com/ggerganov/whisper.cpp |
| [3] | Home Assistant Voice Pipeline Analytics — community adoption data for local STT and LLM pipelines | https://analytics.home-assistant.io/ |
| [4] | Ollama Documentation — local LLM serving, model management, and API reference | https://ollama.com/docs |
| [5] | Home Assistant Year of the Voice — official roadmap and progress updates for local voice control | https://www.home-assistant.io/blog/2024/12/voice-chapter-6/ |
Conclusion
Building a local voice assistant is one of the most impactful privacy upgrades you can make to a smart home. The Whisper + Ollama + Home Assistant stack eliminates audio surveillance, metadata extraction, and cloud dependency in a single architectural decision. The hardware investment is modest ($100–500 depending on tier), the software is entirely free and open-source, and the community of 17,000+ local voice users provides a deep well of configuration examples and troubleshooting support.
The technology is ready. The models are capable. The only remaining variable is whether you value your household’s voice privacy enough to spend an afternoon setting it up.
For further reading, explore our guides on Local Voice Assistant Without Amazon or Google 2026, Best Open Source Smart Home Software for Privacy Advocates, and Privacy-Focused Smart Home Hubs Compatible with Home Assistant.
Footnotes

1. Northeastern University research on smart speaker acoustic metadata extraction, including age, emotion, and household composition inference.
2. Whisper.cpp — high-performance C/C++ implementation of OpenAI's Whisper model for local speech-to-text processing.
3. Home Assistant analytics showing 17,000+ users running local STT/LLM voice pipelines as of early 2026.