Controlling Elgato Key Lights with Voice Agent 

18 May 2026 · Originally published on x.com

I have an Elgato Air Light sitting on my desk. It’s great for video recording and calls. But every time I want to turn it on, adjust the brightness, or change the color temperature, I have to reach for my phone, open the app, and tap through menus. It’s a small friction, but it adds up.

I also had an M5Stack Core2 gathering dust — an ESP32-based device with a built-in microphone, speaker, and touch screen. I kept thinking: what if I could just talk to it? “Turn on my light.” “Make it warmer.” “Dim it to 30 percent.”

That’s when I thought: why not build an agent with the Cloudflare Agents SDK? I started with custom functions to handle audio input and output. The M5Stack would connect to the agent deployed on the edge and send audio chunks; the agent would process them, perform the action, and stream the audio response back to the device. This worked, but it took a lot of fragile code. If I switched the Text-To-Speech (TTS) or Speech-To-Text (STT) model, I had to update the code that handled encoding and decoding. This wasn’t fun at all.

Then, in April 2026, Cloudflare announced its Voice SDK. The SDK turns an agent into a real-time voice agent with streaming speech-to-text, text-to-speech, and conversation history. Combine that with Workers AI for the LLM and Cloudflare Mesh for reaching local devices from the edge, and I had everything I needed.

In this article, I’ll walk you through how I built a voice-controlled smart light system — from the ESP32 firmware to the Worker running on Cloudflare’s edge, and all the gotchas I hit along the way. The article will focus more on the Cloudflare stack, and not the device code.

What I built

A voice assistant running on the M5Stack Core2 that can:

  • Have natural conversations using streaming speech-to-text and text-to-speech
  • Control my Elgato Air Light on the local network — turn it on/off, adjust brightness and color temperature
  • Do all processing on Cloudflare’s edge — the ESP32 is just a microphone, speaker, and display

Here’s the architecture:

[Figure: Voice-controlled smart home architecture. ESP32 audio flows to a Cloudflare Worker Durable Object, then through Workers AI and Cloudflare Mesh to control an Elgato light on the local network. Cloudflare edge: VoiceAgent Durable Object (@cloudflare/voice), Workers AI (Flux, Kimi, Aura TTS), VPC Network binding MESH. Home network: ESP32 (mic + speaker), Mesh node (Raspberry Pi), Elgato Light (HTTP :9123). Links: PCM16 audio, PCM16 speech, LLM + tools, private fetch.]

When I say “turn on my light,” the LLM recognizes the intent, calls a tool function, which reaches the Elgato light through Cloudflare Mesh — and then speaks back “Done, I’ve turned on your light.”

Prerequisites

Before you follow along, here’s what you’ll need:

The Voice SDK: withVoice

The @cloudflare/voice SDK provides a withVoice mixin that turns any Cloudflare Agent (Durable Object) into a real-time voice agent. It handles:

  • Continuous streaming STT (speech-to-text) via the Flux model
  • Sentence-level TTS (text-to-speech) via Deepgram Aura
  • Conversation history persistence in SQLite
  • Interruption handling (new speech cancels in-progress TTS)
  • A WebSocket protocol that clients connect to

This is where things got exciting for me. The SDK abstracts away so much of the complexity that the core server code is surprisingly compact.

Here’s what a single spoken turn looks like inside the Worker:

[Figure: Voice turn pipeline. One spoken turn flows from the ESP32 (PCM16 audio frames) through Flux STT (turn detection, transcript), onTurn() (LLM handoff), Workers AI (Kimi + tools, text stream), sentence chunking, and Aura TTS, then back to the device as PCM16 chunks.]

The server

import { Agent, routeAgentRequest, type Connection } from 'agents';
import { withVoice, WorkersAIFluxSTT, type VoiceTurnContext } from '@cloudflare/voice';
import { streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';

const VoiceAgentBase = withVoice(Agent, { audioFormat: 'pcm16' });

export class VoiceAgent extends VoiceAgentBase<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new PCM16TTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const workersAi = createWorkersAI({ binding: this.env.AI });
    const result = streamText({
      model: workersAi('@cf/moonshotai/kimi-k2.6'),
      system: 'You are a helpful voice assistant. Keep responses concise.',
      messages: [
        ...context.messages.map(m => ({
          role: m.role as 'user' | 'assistant',
          content: m.content,
        })),
        { role: 'user', content: transcript },
      ],
      abortSignal: context.signal,
    });
    return result.textStream;
  }
}

The onTurn method is called whenever the user finishes speaking. It receives the transcript and returns a text stream — the SDK handles converting that text to speech and streaming the audio back. Make sure to append the current transcript to context.messages when building the message list for the LLM.

PCM16 TTS: Why I needed a custom class

This was the first gotcha I hit. The built-in WorkersAITTS class sends { text, speaker } to the Deepgram model (it defaults to aura-1), which outputs MP3 by default. The ESP32 doesn’t have an MP3 decoder (or at least that’s what the coding agents told me), so I needed raw PCM16 audio instead.

The fix: a custom TTS class that calls the aura-2-en model directly and passes encoding: "linear16", sample_rate: 24000, and container: "none":

class PCM16TTS {
  #ai: Ai;

  constructor(ai: Ai) {
    this.#ai = ai;
  }

  async synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer | null> {
    const resp = await this.#ai.run(
      '@cf/deepgram/aura-2-en' as any,
      {
        text,
        speaker: 'luna',
        // Request raw 16-bit PCM at 24 kHz with no container, instead of the default MP3
        encoding: 'linear16',
        sample_rate: 24000,
        container: 'none',
      } as any,
      { returnRawResponse: true, ...(signal ? { signal } : {}) }
    );
    const response = resp as Response;
    // Don't hand an error body to the speaker as if it were audio
    if (!response.ok) return null;
    return await response.arrayBuffer();
  }
}

Chunking audio for the ESP32

TTS generates audio per-sentence. A short sentence like “Hi! How can I help you?” produces ~90KB of PCM16 data. The ESP32 WebSocket library has a maximum frame size (WEBSOCKETS_MAX_DATA_SIZE), and the device has limited heap (~170KB free). Sending a single 90KB frame works but leaves little headroom.
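
To put the ~90KB figure in perspective, here is a rough sizing sketch. It assumes mono PCM16 (2 bytes per sample) at the 24000 Hz sample rate requested from the TTS model. The helper names are mine and not part of any SDK, and the 4096-byte frame size is just an example value.

```typescript
// PCM16 stores each sample in 2 bytes; mono audio is assumed throughout.
const BYTES_PER_SAMPLE = 2;

// Bytes needed to hold `seconds` of audio at `sampleRate` Hz.
function pcm16Bytes(seconds: number, sampleRate: number): number {
  return Math.round(seconds * sampleRate) * BYTES_PER_SAMPLE;
}

// Playback duration of a PCM16 buffer of `bytes` bytes.
function pcm16Seconds(bytes: number, sampleRate: number): number {
  return bytes / (sampleRate * BYTES_PER_SAMPLE);
}

// Number of WebSocket frames needed if each frame carries at most `chunkSize` bytes.
function frameCount(bytes: number, chunkSize: number): number {
  return Math.ceil(bytes / chunkSize);
}

console.log(pcm16Seconds(90000, 24000)); // 1.875
console.log(frameCount(90000, 4096)); // 22
```

So a 90KB sentence is a little under two seconds of speech, and more than half of the ~170KB of free heap if it arrives as a single frame.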

The afterSynthesize hook lets me chunk audio into smaller frames before sending:

const AUDIO_CHUNK_SIZE = 4096;

afterSynthesize(audio: ArrayBuffer | null, _text: string, connection: Connection) {
  if (!audio) return null;
  const src = new Uint8Array(audio);
  for (let offset = 0; offset < src.byteLength; offset += AUDIO_CHUNK_SIZE) {
    const end = Math.min(offset + AUDIO_CHUNK_SIZE, src.byteLength);
    connection.send(src.slice(offset, end));
  }
  return null; // returning null tells the SDK we handled sending ourselves
}
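
The same splitting logic can be pulled into a standalone helper that is easy to test away from the Worker. This is a sketch, not SDK code: chunkPcm16 is a hypothetical name, and the even-size check is my own addition so that a 16-bit sample is never split across two frames.

```typescript
// Split a PCM16 buffer into frames of at most `chunkSize` bytes.
// The last frame may be shorter than `chunkSize`.
function* chunkPcm16(audio: ArrayBuffer, chunkSize: number): Generator<Uint8Array> {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0 || chunkSize % 2 !== 0) {
    throw new RangeError('chunkSize must be a positive even integer');
  }
  const src = new Uint8Array(audio);
  for (let offset = 0; offset < src.byteLength; offset += chunkSize) {
    const end = Math.min(offset + chunkSize, src.byteLength);
    yield src.slice(offset, end); // slice copies, so each frame owns its bytes
  }
}

const demoFrames = Array.from(chunkPcm16(new ArrayBuffer(10000), 4096));
console.log(demoFrames.map(f => f.byteLength)); // [4096, 4096, 1808]
```

Inside afterSynthesize, each yielded frame would be passed to connection.send() in order.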

The WebSocket protocol

The withVoice SDK defines a WebSocket protocol between the client and the server. Here’s the full message flow:

Client → Server

| Message | When |
| --- | --- |
| {"type":"hello","protocol_version":1} | On connect |
| {"type":"start_call","preferred_format":"pcm16"} | User taps to start |
| Binary PCM16 frames (16kHz, 16-bit, mono) | Continuously while in call |
| {"type":"end_call"} | User taps to end |

Server → Client

| Message | Description |
| --- | --- |
| welcome | Connection acknowledged |
| status | State changes: idle, listening, thinking, speaking |
| transcript | Final transcript with role: "user" or "assistant" |
| transcript_interim | Partial STT result while user is speaking |
| transcript_start/delta/end | Streaming LLM response tokens |
| audio_config | Audio format info (format, sampleRate) |
| metrics | Timing info (llm_ms, tts_ms, first_audio_ms) |
| Binary frames | PCM16 audio during speaking status |
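
To make the two tables concrete, here is a minimal client-side sketch in TypeScript (the real client in this project is ESP32 firmware). It relies only on the message shapes and type names listed above. The function names and the 'binary-audio' and 'unknown' labels are illustrative, and I am assuming that transcript_start/delta/end expands to three separate type values.

```typescript
// Control messages the client sends, as listed in the Client → Server table.
type ClientControl =
  | { type: 'hello'; protocol_version: 1 }
  | { type: 'start_call'; preferred_format: 'pcm16' }
  | { type: 'end_call' };

const hello = (): ClientControl => ({ type: 'hello', protocol_version: 1 });
const startCall = (): ClientControl => ({ type: 'start_call', preferred_format: 'pcm16' });
const endCall = (): ClientControl => ({ type: 'end_call' });

// JSON message types the server sends, as listed in the Server → Client table.
const SERVER_TYPES = [
  'welcome', 'status', 'transcript', 'transcript_interim',
  'transcript_start', 'transcript_delta', 'transcript_end',
  'audio_config', 'metrics',
] as const;
type ServerType = (typeof SERVER_TYPES)[number];

// Decide what an incoming WebSocket frame is: binary frames carry PCM16 audio,
// text frames carry JSON with a `type` field.
function classifyFrame(data: string | ArrayBuffer): ServerType | 'binary-audio' | 'unknown' {
  if (typeof data !== 'string') return 'binary-audio';
  try {
    const parsed: unknown = JSON.parse(data);
    const type = (parsed as { type?: unknown } | null)?.type;
    if (typeof type !== 'string') return 'unknown';
    return (SERVER_TYPES as readonly string[]).includes(type) ? (type as ServerType) : 'unknown';
  } catch {
    return 'unknown'; // not valid JSON
  }
}

console.log(JSON.stringify(hello())); // {"type":"hello","protocol_version":1}
console.log(classifyFrame('{"type":"status"}')); // status
```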

The ESP32 client

The M5Stack Core2 has a built-in microphone, speaker, display, and touch screen. The firmware does the following:

  1. Connects to WiFi, then opens a WebSocket to the Worker
  2. Sends hello and waits for welcome
  3. On touch: sends start_call, receives the agent’s greeting, then begins streaming mic audio as binary PCM16 frames
  4. Receives status updates, transcripts, and audio — plays audio through the speaker using triple-buffered playRaw()
  5. On touch again: sends end_call

This part took the most debugging. The ESP32 is a constrained device, and the M5Stack Core2 has some quirks that weren’t obvious from the documentation.

Gotcha: mic reinit after speaker playback

The M5Stack Core2 has separate I2S buses for the microphone and speaker, but Speaker.playRaw() disrupts the mic’s I2S state. After playback stops, the mic produces silence. This one took me a while to figure out — I kept thinking my WebSocket connection was dropping, but the mic was just… silent.

The fix: fully tear down and reinitialize the mic after each playback session:

void stop_playback() {
  M5.Speaker.stop();
  M5.Speaker.end();
  is_playing = false;

  // Restart mic: Speaker.playRaw disrupts the mic I2S bus
  M5.Mic.end();
  delay(10);
  auto mic_cfg = M5.Mic.config();
  mic_cfg.sample_rate = 16000;
  mic_cfg.magnification = 16;
  M5.Mic.config(mic_cfg);
  M5.Mic.begin();
}

Gotcha: WebSocket Host header

The ESP32 WebSocket library sends Host: hostname:443 in the header, but routeAgentRequest (which uses partyserver internally) expects just Host: hostname. The extra :443 causes routing to fail silently — no error, no log, just a connection that never reaches the Durable Object.

Important: You need to patch the WebSocketsClient library to omit the port when it’s 443 or 80.
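
The patch itself lives in the C++ WebSocketsClient library, but the rule it has to follow is small enough to sketch. The TypeScript below only illustrates that logic. hostHeaderValue is a hypothetical helper, not a function from the library or from routeAgentRequest, and the hostname is a placeholder.

```typescript
// Build the Host header value, omitting the port when it is 443 or 80.
function hostHeaderValue(hostname: string, port: number): string {
  const isStandardPort = port === 443 || port === 80;
  return isStandardPort ? hostname : `${hostname}:${port}`;
}

console.log(hostHeaderValue('my-worker.example.com', 443)); // my-worker.example.com
console.log(hostHeaderValue('my-worker.example.com', 8787)); // my-worker.example.com:8787
```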

Gotcha: Durable Object path routing

routeAgentRequest converts Durable Object binding names to kebab-case for URL routing. The binding VoiceAgent maps to path /agents/voice-agent/default, not /agents/VoiceAgent/default. The coding agents spent an embarrassing amount of time on this one.
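
A small helper makes the mapping explicit when you build the WebSocket URL for a client. This is a sketch of the naming rule described above, not the library's own implementation. toKebabCase and agentPath are hypothetical names, and the conversion only handles simple PascalCase or camelCase names like VoiceAgent.

```typescript
// Insert a hyphen at each lower-to-upper boundary, then lowercase everything.
function toKebabCase(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1-$2').toLowerCase();
}

// Path for an agent instance; "default" is the instance name used in this article.
function agentPath(bindingName: string, instance = 'default'): string {
  return `/agents/${toKebabCase(bindingName)}/${instance}`;
}

console.log(agentPath('VoiceAgent')); // /agents/voice-agent/default
```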

Greeting on call start

One nice touch: the agent can speak immediately when a call begins by implementing onCallStart:

async onCallStart(connection: Connection) {
  await this.speak(connection, 'Hi! How can I help you?');
}

That means start_call can produce server audio before the user says anything. It makes the experience feel much more natural.

Adding smart home control: Elgato Air Light via Mesh

Now for the interesting part. I wanted to say “turn on my light” and have the Worker control the Elgato Air Light sitting on my local network.

The challenge

The Elgato Air Light exposes a REST API on the local network (http://<ip>:9123/elgato/lights). But the Worker runs on Cloudflare’s edge — it can’t reach 192.168.x.x directly.

The solution: Cloudflare Mesh + VPC Networks

Why Cloudflare Mesh and not Cloudflare Tunnel?

If you’ve used Cloudflare before, you might be wondering: why not just use Cloudflare Tunnel? Both connect your private network to Cloudflare, but they solve different problems.

Cloudflare Tunnel (cloudflared) is designed for publishing specific services to the internet. You configure a public hostname (like light.example.com), and Tunnel proxies inbound traffic from the internet to your local service. It’s great for “I want my app reachable at this URL.” But each service needs its own tunnel route, and the Worker can’t initiate arbitrary requests to any local IP — it can only reach the services you’ve explicitly published.

Cloudflare Mesh (formerly WARP Connector) is designed for private network connectivity. A Mesh node advertises CIDR routes, making an entire subnet reachable. With a VPC Network binding, your Worker gets a MESH.fetch() that can reach any IP and port in the advertised range — no per-service configuration needed.

| | Cloudflare Tunnel | Cloudflare Mesh |
| --- | --- | --- |
| Traffic direction | Inbound to origin: clients connect to published services | Bidirectional: any participant can initiate |
| Addressing | By public hostname | By private IP (every participant gets a Mesh IP) |
| Worker access | Reach specific published services | Reach any IP/port in the advertised subnet |
| Connector | cloudflared | warp-cli |
| Protocols | HTTP/S, TCP, SSH, RDP, SMB | TCP, UDP, ICMP |
| Best for | Exposing apps to the internet | Private network connectivity, VPN replacement |

For this project, the Worker needs to call the Elgato’s local REST API at 192.168.x.x:9123 — a private IP that shouldn’t be exposed publicly. Mesh gives the Worker outbound access to the entire local subnet with a single binding. If I add more smart devices later, they’re automatically reachable too — no new tunnel routes to configure.

This is the same approach I used in my previous article to expose OpenClaw to the internet, but this time using Mesh instead of Tunnels.

The flow:

Worker calls env.MESH.fetch("http://192.168.x.x:9123/elgato/lights")
→ Cloudflare routes to Mesh network
→ Mesh node on local LAN receives the request
→ Forwards to Elgato at 192.168.x.x:9123
→ Response flows back the same path

Setting up the Mesh node

Mesh nodes require Linux. I used a Raspberry Pi 4 sitting on the same local network as the Elgato light.

Step 1: Create a Mesh node in the Cloudflare dashboard

Go to Networking > Mesh and select Add a node. Name it (e.g. home-network) and copy the connector token.

Step 2: Install the WARP client on the Raspberry Pi

SSH into the Pi and run:

# Add Cloudflare's GPG key and repo
curl -fsSL https://pkg.cloudflareclient.com/pubkey.gpg \
| sudo gpg --yes --dearmor -o /usr/share/keyrings/cloudflare-warp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/cloudflare-warp-archive-keyring.gpg] https://pkg.cloudflareclient.com/ $(lsb_release -cs) main" \
| sudo tee /etc/apt/sources.list.d/cloudflare-client.list
sudo apt-get update && sudo apt-get install -y cloudflare-warp

Step 3: Register as a Mesh connector and connect

sudo warp-cli connector new <YOUR_TOKEN>
sudo warp-cli connect

Verify:

sudo warp-cli status
# Should show: Status update: Connected

The node should appear as Online in the Mesh dashboard with a Mesh IP assigned.

Step 4: Add a CIDR route

In the Mesh dashboard, go to your node > Routes tab > Add route: 192.168.x.0/24. This tells Cloudflare that this Mesh node can forward traffic to devices on the local 192.168.x.x subnet — including the Elgato light.

Step 5: Configure NAT/MASQUERADE on the Mesh node

Important: By default, traffic from your Worker arrives at the Mesh node with a source IP in the 100.96.0.0/12 WARP range. When the Mesh node forwards this to a local device (like your Elgato), that device will try to reply to its default gateway (your router) instead of the Mesh node, causing connection timeouts.

You need to configure the Mesh node to rewrite the source IP before forwarding to local devices. I cover the exact nftables commands in the Gotchas section below. However, if your application is running on the same machine as the Mesh node, you don’t need to set this up.
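
The routing problem comes down to a subnet check: the light only replies directly to addresses inside its own subnet, and everything else goes to the default gateway. The sketch below shows that check for IPv4. The helper names are mine, and 192.168.8.0/24 is only an example LAN range.

```typescript
// Convert dotted-quad IPv4 text into an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some(p => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when `ip` falls inside the CIDR block, e.g. inCidr('10.0.0.5', '10.0.0.0/8').
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Invalid prefix length in ${cidr}`);
  }
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// The Worker's source address is in the WARP range, not in the LAN subnet,
// so the light sends its reply to the router instead of the Mesh node.
console.log(inCidr('100.96.0.2', '100.96.0.0/12')); // true
console.log(inCidr('100.96.0.2', '192.168.8.0/24')); // false
```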

The Elgato REST API

The Elgato Key Light / Air Light exposes a simple HTTP API on port 9123:

| Endpoint | Method | Description |
| --- | --- | --- |
| /elgato/lights | GET | Get current state |
| /elgato/lights | PUT | Set state |
| /elgato/accessory-info | GET | Device info |

The state payload:

{
  "numberOfLights": 1,
  "lights": [{
    "on": 1,
    "brightness": 50,
    "temperature": 200
  }]
}

  • on: 1 = on, 0 = off
  • brightness: 0–100
  • temperature: 143–344 (mirek scale: 143 = ~7000K cool white, 344 = ~2900K warm white)
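
The temperature value is roughly one million divided by the color temperature in kelvin, which is how 143 lines up with ~7000K and 344 with ~2900K. A minimal conversion sketch follows. The function names are mine, and clamping to 143–344 is an assumption based on the range listed above.

```typescript
const ELGATO_MIN = 143; // coolest value the API accepts (~7000K)
const ELGATO_MAX = 344; // warmest value the API accepts (~2900K)

// Convert kelvin to the API's temperature value, clamped to the supported range.
function kelvinToElgato(kelvin: number): number {
  const value = Math.round(1e6 / kelvin);
  return Math.min(ELGATO_MAX, Math.max(ELGATO_MIN, value));
}

// Convert an API temperature value back to approximate kelvin.
function elgatoToKelvin(value: number): number {
  return Math.round(1e6 / value);
}

console.log(kelvinToElgato(5000)); // 200
console.log(elgatoToKelvin(143)); // 6993
console.log(elgatoToKelvin(344)); // 2907
```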

Adding tool calling to the voice agent

Now that I had a way to reach the Elgato from the Worker, I needed the LLM to call the right API based on what I say. The Vercel AI SDK supports tool calling — you define tools with descriptions and parameters, and the LLM decides when to call them based on user intent.

The kimi-k2.6 model on Workers AI supports multi-turn tool calling natively. When you pass tools to streamText, the SDK:

  1. Sends tool definitions to the LLM
  2. When the LLM returns a tool call, executes the execute function
  3. Feeds the result back to the LLM
  4. The LLM generates a natural language response

The textStream returned to onTurn only contains the final spoken text — all the tool calling happens transparently.
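
To see why the caller only ever receives the final text, here is a toy version of that loop with a scripted stand-in for the model. It is not how the AI SDK is implemented. Every type and function name here is hypothetical, and the step cap is my own simplification of a bounded loop.

```typescript
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelReply = { text?: string; toolCall?: ToolCall };
type Model = (history: string[]) => ModelReply;
type Tools = Record<string, (args: Record<string, unknown>) => unknown>;

// Run one user turn: keep executing tool calls until the model answers in
// plain text or the step cap is reached.
function runTurn(model: Model, tools: Tools, userText: string, maxSteps: number): string {
  const history: string[] = [`user: ${userText}`];
  for (let step = 0; step < maxSteps; step++) {
    const reply = model(history);
    if (!reply.toolCall) return reply.text ?? ''; // final spoken answer
    const impl = tools[reply.toolCall.tool];
    const result = impl ? impl(reply.toolCall.args) : { error: 'unknown tool' };
    // Feed the tool result back so the next model call can see it
    history.push(`tool ${reply.toolCall.tool}: ${JSON.stringify(result)}`);
  }
  return ''; // step cap reached without a final answer
}

// Scripted model: call the tool first, then answer once a tool result is in the history.
const demoModel: Model = history =>
  history.length === 1
    ? { toolCall: { tool: 'turn_light_on', args: {} } }
    : { text: 'Done, the light is on.' };
const demoTools: Tools = { turn_light_on: () => ({ success: true }) };

console.log(runTurn(demoModel, demoTools, 'turn on my light', 5)); // Done, the light is on.
```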

Wrangler config

wrangler.jsonc
{
  "compatibility_flags": ["nodejs_compat"],
  "compatibility_date": "2025-09-21",
  "migrations": [
    {
      "new_sqlite_classes": ["VoiceAgent"],
      "tag": "v1"
    }
  ],
  "durable_objects": {
    "bindings": [
      {
        "class_name": "VoiceAgent",
        "name": "VoiceAgent"
      }
    ]
  },
  "ai": {
    "binding": "AI"
  },
  "vpc_networks": [
    {
      "binding": "MESH",
      "network_id": "cf1:network",
      "remote": true
    }
  ],
  "vars": {
    "ELGATO_IP": "192.168.8.187"
  }
}

Tool definitions

agent.ts
import { tool } from 'ai';
import { z } from 'zod/v4';

const ELGATO_PORT = 9123;

function elgatoUrl(env: Env, path: string) {
  return `http://${env.ELGATO_IP}:${ELGATO_PORT}${path}`;
}

function elgatoTools(env: Env) {
  // All requests go through the Mesh binding; a network failure becomes a 503 response
  const fetchLight = async (init?: RequestInit) => {
    try {
      return await env.MESH.fetch(elgatoUrl(env, '/elgato/lights'), init);
    } catch {
      return new Response(JSON.stringify({ error: 'Light unreachable via Mesh' }), {
        status: 503,
        headers: { 'Content-Type': 'application/json' },
      });
    }
  };

  const putLight = (body: object) =>
    fetchLight({
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });

  // Read the current state so each tool changes one field and keeps the others
  const getLightState = async () => {
    const res = await fetchLight();
    return (await res.json()) as {
      numberOfLights: number;
      lights: Array<{ on: number; brightness: number; temperature: number }>;
    };
  };

  return {
    get_light_status: tool({
      description: 'Get the current status of the desk light',
      inputSchema: z.object({}),
      execute: async () => {
        const res = await fetchLight();
        return await res.json();
      },
    }),
    turn_light_on: tool({
      description: 'Turn the desk light on',
      inputSchema: z.object({}),
      execute: async () => {
        const state = await getLightState();
        const light = state.lights[0];
        light.on = 1;
        const res = await putLight({ numberOfLights: 1, lights: [light] });
        return { success: res.ok };
      },
    }),
    turn_light_off: tool({
      description: 'Turn the desk light off',
      inputSchema: z.object({}),
      execute: async () => {
        const state = await getLightState();
        const light = state.lights[0];
        light.on = 0;
        const res = await putLight({ numberOfLights: 1, lights: [light] });
        return { success: res.ok };
      },
    }),
    set_light_brightness: tool({
      description: 'Set the desk light brightness (0-100)',
      inputSchema: z.object({
        brightness: z.number().min(0).max(100),
      }),
      execute: async ({ brightness }) => {
        const state = await getLightState();
        const light = state.lights[0];
        light.brightness = brightness;
        const res = await putLight({
          numberOfLights: 1,
          lights: [light],
        });
        return { success: res.ok, brightness };
      },
    }),
    set_light_temperature: tool({
      description: 'Set the color temperature (143=cool to 344=warm)',
      inputSchema: z.object({
        temperature: z.number().min(143).max(344),
      }),
      execute: async ({ temperature }) => {
        const state = await getLightState();
        const light = state.lights[0];
        light.temperature = temperature;
        const res = await putLight({
          numberOfLights: 1,
          lights: [light],
        });
        return { success: res.ok, temperature };
      },
    }),
  };
}

Then in onTurn:

agent.ts
import { stepCountIs } from 'ai';

// Inside the VoiceAgent class
async onTurn(transcript: string, context: VoiceTurnContext) {
  const workersAi = createWorkersAI({ binding: this.env.AI });
  const messages = [
    ...context.messages.map(m => ({
      role: m.role as 'user' | 'assistant',
      content: m.content,
    })),
    // `as const` keeps the role typed as the literal 'user' rather than string
    { role: 'user' as const, content: transcript },
  ];
  const result = streamText({
    model: workersAi('@cf/moonshotai/kimi-k2.6'),
    system: `You are a helpful voice assistant that can also control the desk light.
When asked about the light, use the available tools. Keep responses concise and natural.`,
    tools: elgatoTools(this.env),
    messages,
    abortSignal: context.signal,
    stopWhen: stepCountIs(5),
  });
  return result.textStream;
}

That’s it. The LLM handles intent recognition. When I say “make it brighter,” the model calls set_light_brightness. When I say “what’s the weather,” it just responds normally. No keyword parsing, no intent classification system — the LLM figures it out.

Note: The stopWhen: stepCountIs(5) option gives the model enough room for the tool-call → tool-result → final-answer loop, while preventing an accidental unbounded tool loop. In my Worker, I also log tool-call start/finish and step summaries so Mesh or schema failures are visible in Worker logs.

What’s running where

| Component | Where | What it does |
| --- | --- | --- |
| ESP32 firmware | M5Stack Core2 on my desk | Mic input, speaker output, touch UI, WebSocket client |
| VoiceAgent | Cloudflare Worker (Durable Object) | STT, LLM, TTS, tool execution, conversation history |
| Workers AI | Cloudflare edge | Flux STT, kimi-k2.6 LLM, Deepgram aura-2-en TTS |
| Mesh node | Raspberry Pi 4 on local LAN | WARP connector bridging Cloudflare to local network |
| Elgato Air Light | Local network (192.168.8.187:9123) | HTTP API for light control |

Gotchas and lessons learned

I hit a lot of issues building this. Here’s a summary of everything I ran into, including some I already mentioned above.

  1. The built-in WorkersAITTS defaults to MP3. If your client can’t decode MP3, you need a custom TTS class that explicitly requests encoding: "linear16". I covered this earlier in the PCM16 TTS section.

  2. routeAgentRequest uses kebab-case paths. The Durable Object binding VoiceAgent maps to URL path /agents/voice-agent/default, not /agents/VoiceAgent/default.

  3. ESP32 mic needs reinit after speaker playback. On the M5Stack Core2, Speaker.playRaw() disrupts the mic I2S bus. You must call Speaker.end(), Mic.end(), then Mic.begin() to restore it.

  4. WebSocket Host header matters. The ESP32 WebSocket library sends Host: hostname:443, which breaks routeAgentRequest routing. Patch the library to omit standard ports.

  5. afterSynthesize returning null is valid. You can use it to chunk large TTS audio into smaller WebSocket frames — just send them yourself via connection.send() and return null so the SDK doesn’t double-send.

  6. Tool calling needs a bounded multi-step loop. Define tools with execute functions, pass them to streamText, and use stopWhen: stepCountIs(5) so the SDK can run the tool-call → execute → feed-result → generate-response loop. The textStream only yields the final spoken text.

  7. Mesh routing requires NAT/MASQUERADE on the Mesh node. If your Worker gets HandshakeTimeoutError when calling a local device via env.MESH.fetch(), the issue is asymmetric routing. When a packet arrives from the Worker, its source IP is in the 100.96.0.0/12 WARP range. The local device replies to its default gateway (your router), not back to the Mesh node.

    The official Cloudflare docs recommend solving this by either making the Mesh node the subnet’s default gateway, or adding a static route on your router that points 100.96.0.0/12 to the Mesh node. The coding agent went with a different approach: rewriting the source IP before forwarding to local devices using nftables:

    [Figure: Cloudflare Mesh return traffic routing. Traffic from a Worker (source 100.96.x.x) reaches a local device through the Mesh node (Raspberry Pi, 192.168.x.10), which forwards it to the Elgato Air Light at 192.168.x.x:9123. Return traffic needs either a router static route (100.96.0.0/12 → Mesh node) or MASQUERADE on the Mesh node, which rewrites the source to the Pi's LAN IP. With neither in place, the reply is lost.]

    On modern Linux systems using nftables (most newer Raspberry Pi OS versions), add this rule:

# Check which firewall tool is available
which nft # If this returns a path, use nftables. If not, install iptables.
# Add MASQUERADE rule (replace wlan0 with your LAN interface: eth0, wlan0, etc.)
sudo nft add table ip nat
sudo nft add chain ip nat postrouting { type nat hook postrouting priority 100 \; }
sudo nft add rule ip nat postrouting oifname "wlan0" iifname "CloudflareWARP" masquerade

To verify this is the issue before fixing it, SSH into your Mesh node and run:

# This will fail (simulates the Worker's packet path)
curl --interface 100.96.0.2 http://<ELGATO_IP>:9123/elgato/lights
# Error: Failed to connect / Handshake timeout
# This works (local origin traffic)
curl http://<ELGATO_IP>:9123/elgato/lights
# Returns: {"numberOfLights":1,...}

If the first curl fails but the second succeeds, you need the MASQUERADE rule. Make it persistent across reboots by saving the ruleset and loading it at boot:

echo 'table ip nat {
  chain postrouting {
    type nat hook postrouting priority 100;
    oifname "wlan0" iifname "CloudflareWARP" masquerade
  }
}' | sudo tee /etc/nftables-mesh-nat.nft
sudo nft -f /etc/nftables-mesh-nat.nft
# Persist across reboots (add to crontab)
(crontab -l 2>/dev/null; echo "@reboot sleep 10 && sudo nft -f /etc/nftables-mesh-nat.nft") | crontab -

Summary

I started with a simple frustration — reaching for my phone every time I wanted to adjust my desk light. What I ended up with is a voice assistant that runs on an ESP32, processes everything on Cloudflare’s edge, and controls local devices through Mesh networking.

The stack that made this possible:

  • @cloudflare/voice SDK — handles the hard parts of real-time voice (STT, TTS, conversation state, interruption)
  • Workers AI — LLM with tool calling for intent recognition
  • Cloudflare Mesh — bridges the gap between the edge and my local network
  • Vercel AI SDK — clean tool calling abstraction on top of Workers AI

The biggest surprises were the ESP32 quirks (mic reinit after speaker, WebSocket Host header) and the Mesh NAT issue. None of these were obvious from the documentation I found, so I hope this article saves you some debugging time.

What’s next

There’s a lot more I want to do with this setup:

  • Add more devices — I have other smart lights on my local network that I’d love to control by voice
  • Improve the ESP32 experience — a proper UI on the display showing conversation state and light status
  • Experiment with wake word detection instead of the touch-to-talk button
  • Try different LLMs as Workers AI adds more options. Kimi K2.6 is excellent but a bit overkill for this, so I might try smaller models such as Granite 4.0 or others from Workers AI.

If you are building something similar or run into any issues, feel free to hit me up on X (Twitter) or LinkedIn. I’d love to hear about what you’re building with the Voice SDK and Mesh. I also co-authored a book, Building a Virtual Assistant with Raspberry Pi, that will help you learn how to build an offline-first virtual assistant!

Last updated on 18 May 2026