Controlling Elgato Key Lights with Voice Agent

I have an Elgato Air Light sitting on my desk. It’s great for video recording and calls. But every time I want to turn it on, adjust the brightness, or change the color temperature, I have to reach for my phone, open the app, and tap through menus. It’s a small friction, but it adds up.

I also had an M5Stack Core2 gathering dust — an ESP32-based device with a built-in microphone, speaker, and touch screen. I kept thinking: what if I could just talk to it? “Turn on my light.” “Make it warmer.” “Dim it to 30 percent.”

That’s when I thought: why not build an agent with the Cloudflare Agents SDK? I started with custom functions to handle audio input and output. The M5Stack would connect to the agent deployed on the edge and send audio chunks; the agent would process them, perform the action, and stream the audio response back to the device. This worked, but it took a lot of code, and the code was fragile. If I switched the text-to-speech (TTS) or speech-to-text (STT) model, I had to update the encoding and decoding logic to match. This wasn’t fun at all.

Then, in April 2026, Cloudflare announced its Voice SDK. The SDK turns an agent into a real-time voice agent with streaming speech-to-text, text-to-speech, and conversation history. Combine that with Workers AI for the LLM and Cloudflare Mesh for reaching local devices from the edge, and I had everything I needed.

In this article, I’ll walk you through how I built a voice-controlled smart light system, from the ESP32 firmware to the Worker running on Cloudflare’s edge, along with all the gotchas I hit along the way. The focus is on the Cloudflare stack rather than the device code.

What I built

A voice assistant running on the M5Stack Core2 that can:

Have natural conversations using streaming speech-to-text and text-to-speech
Control my Elgato Air Light on the local network — turn it on/off, adjust brightness and color temperature
Do all processing on Cloudflare’s edge — the ESP32 is just a microphone, speaker, and display

Here’s the architecture in brief:

When I say “turn on my light,” the LLM recognizes the intent and calls a tool function, which reaches the Elgato light through Cloudflare Mesh. The agent then speaks back: “Done, I’ve turned on your light.”

Prerequisites

Before you follow along, here’s what you’ll need:

An M5Stack Core2 (or any ESP32 with mic and speaker)
An Elgato Key Light or Air Light on your local network
A Raspberry Pi 3/4/5 (or any Linux machine) on the same local network as the light
A Cloudflare account with Workers AI enabled
Familiarity with TypeScript and Arduino/C++

The Voice SDK: `withVoice`

The @cloudflare/voice SDK provides a withVoice mixin that turns any Cloudflare Agent (Durable Object) into a real-time voice agent. It handles:

Continuous streaming STT (speech-to-text) via the Flux model
Sentence-level TTS (text-to-speech) via Deepgram Aura
Conversation history persistence in SQLite
Interruption handling (new speech cancels in-progress TTS)
A WebSocket protocol that clients connect to

This is where things got exciting for me. The SDK abstracts away so much of the complexity that the core server code is surprisingly compact.

Here’s what a single spoken turn looks like inside the Worker:

The server

import { Agent, routeAgentRequest, type Connection } from 'agents';
import { withVoice, WorkersAIFluxSTT, type VoiceTurnContext } from '@cloudflare/voice';
import { streamText } from 'ai';
import { createWorkersAI } from 'workers-ai-provider';

const VoiceAgentBase = withVoice(Agent, { audioFormat: 'pcm16' });

export class VoiceAgent extends VoiceAgentBase<Env> {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new PCM16TTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const workersAi = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: workersAi('@cf/moonshotai/kimi-k2.6'),
      system: 'You are a helpful voice assistant. Keep responses concise.',
      messages: [
        ...context.messages.map(m => ({
          role: m.role as 'user' | 'assistant',
          content: m.content,
        })),
        { role: 'user', content: transcript },
      ],
      abortSignal: context.signal,
    });

    return result.textStream;
  }
}

The onTurn method is called whenever the user finishes speaking. It receives the transcript and returns a text stream — the SDK handles converting that text to speech and streaming the audio back. Make sure to append the current transcript to context.messages when building the message list for the LLM.

PCM16 TTS: Why I needed a custom class

This was the first gotcha I hit. The built-in WorkersAITTS class sends { text, speaker } to the Deepgram model (it defaults to aura-1), which outputs MP3 by default. The ESP32 doesn’t have a built-in MP3 decoder (or at least that’s what the coding agents told me), so I needed raw PCM16 audio instead.

The fix: a custom TTS class that calls the aura-2-en model directly and passes encoding: "linear16", sample_rate: 24000, and container: "none":

class PCM16TTS {
  #ai: Ai;
  constructor(ai: Ai) {
    this.#ai = ai;
  }
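  async synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer | null> {
    const resp = await this.#ai.run(
      '@cf/deepgram/aura-2-en' as any,
      {
        text,
        speaker: 'luna',
        encoding: 'linear16', // raw 16-bit PCM instead of the default MP3
        sample_rate: 24000,
        container: 'none', // no container header, just samples
      } as any,
      { returnRawResponse: true, ...(signal ? { signal } : {}) }
    );
    const response = resp as Response;
    // Don't hand an error body to the speaker as if it were audio
    if (!response.ok) return null;
    return await response.arrayBuffer();
  }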
}

Chunking audio for the ESP32

TTS generates audio per-sentence. A short sentence like “Hi! How can I help you?” produces ~90KB of PCM16 data. The ESP32 WebSocket library has a maximum frame size (WEBSOCKETS_MAX_DATA_SIZE), and the device has limited heap (~170KB free). Sending a single 90KB frame works but leaves little headroom.

The afterSynthesize hook lets me chunk audio into smaller frames before sending:

const AUDIO_CHUNK_SIZE = 4096;

afterSynthesize(audio: ArrayBuffer | null, _text: string, connection: Connection) {
  if (!audio) return null;
  const src = new Uint8Array(audio);
  for (let offset = 0; offset < src.byteLength; offset += AUDIO_CHUNK_SIZE) {
    const end = Math.min(offset + AUDIO_CHUNK_SIZE, src.byteLength);
    connection.send(src.slice(offset, end));
  }
  return null; // returning null tells the SDK we handled sending ourselves
}
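
To put numbers on this: mono PCM16 at 24 kHz is 48,000 bytes per second, so roughly 90KB is a little under two seconds of speech. The sketch below pulls the chunking loop out into a standalone helper so the arithmetic is easy to check. The names chunkAudio and pcm16DurationSeconds are hypothetical helpers of mine, not part of the SDK, and the 90,000-byte buffer is just a stand-in for real TTS output.

```typescript
const AUDIO_CHUNK_SIZE = 4096;

// Split a buffer into frames of at most `chunkSize` bytes.
// slice() copies each frame, matching the loop in afterSynthesize.
function chunkAudio(audio: ArrayBuffer, chunkSize: number = AUDIO_CHUNK_SIZE): Uint8Array[] {
  if (chunkSize <= 0) throw new RangeError('chunkSize must be positive');
  const src = new Uint8Array(audio);
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < src.byteLength; offset += chunkSize) {
    chunks.push(src.slice(offset, Math.min(offset + chunkSize, src.byteLength)));
  }
  return chunks;
}

// Playback length of mono PCM16 audio: 2 bytes per sample.
function pcm16DurationSeconds(byteLength: number, sampleRate: number): number {
  return byteLength / (sampleRate * 2);
}

const greeting = new ArrayBuffer(90000); // stand-in for ~90KB of TTS output
const frames = chunkAudio(greeting);
// 90,000 / 4,096 = 21.97..., so 21 full frames plus one 3,984-byte remainder
console.log(frames.length, frames[frames.length - 1].byteLength); // 22 3984
console.log(pcm16DurationSeconds(90000, 24000)); // 1.875
```

With 4KB frames, the device only ever has to buffer a small slice of the sentence at a time instead of the whole thing.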

The WebSocket protocol

The withVoice SDK defines a WebSocket protocol between the client and the server. Here’s the full message flow:

Client → Server

| Message | When |
| --- | --- |
| `{"type":"hello","protocol_version":1}` | On connect |
| `{"type":"start_call","preferred_format":"pcm16"}` | User taps to start |
| Binary PCM16 frames (16kHz, 16-bit, mono) | Continuously while in call |
| `{"type":"end_call"}` | User taps to end |

Server → Client

| Message | Description |
| --- | --- |
| `welcome` | Connection acknowledged |
| `status` | State changes: `idle`, `listening`, `thinking`, `speaking` |
| `transcript` | Final transcript with `role: "user"` or `"assistant"` |
| `transcript_interim` | Partial STT result while user is speaking |
| `transcript_start/delta/end` | Streaming LLM response tokens |
| `audio_config` | Audio format info (format, sampleRate) |
| `metrics` | Timing info (llm_ms, tts_ms, first_audio_ms) |
| Binary frames | PCM16 audio during `speaking` status |
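
If you want to exercise the protocol from a desktop test client before touching firmware, the control messages can be written down as TypeScript types. This is a sketch based only on the messages listed in the tables above; ClientMessage, encodeClientMessage, isAgentStatus, and serverMessageType are hypothetical names, and the real SDK may define additional fields.

```typescript
// Control messages the client sends as JSON text frames.
type ClientMessage =
  | { type: 'hello'; protocol_version: number }
  | { type: 'start_call'; preferred_format: 'pcm16' }
  | { type: 'end_call' };

// Status values the server reports during a call.
type AgentStatus = 'idle' | 'listening' | 'thinking' | 'speaking';
const AGENT_STATUSES: readonly string[] = ['idle', 'listening', 'thinking', 'speaking'];

function encodeClientMessage(msg: ClientMessage): string {
  return JSON.stringify(msg);
}

function isAgentStatus(value: unknown): value is AgentStatus {
  return typeof value === 'string' && AGENT_STATUSES.indexOf(value) !== -1;
}

// Read the `type` discriminator of a server text frame.
// Returns null when the frame is not a JSON object with a string `type`.
function serverMessageType(frame: string): string | null {
  try {
    const parsed: unknown = JSON.parse(frame);
    if (typeof parsed === 'object' && parsed !== null && 'type' in parsed) {
      const t = (parsed as { type: unknown }).type;
      return typeof t === 'string' ? t : null;
    }
    return null;
  } catch {
    return null;
  }
}

console.log(encodeClientMessage({ type: 'hello', protocol_version: 1 }));
// {"type":"hello","protocol_version":1}
```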

The ESP32 client

The M5Stack Core2 has a built-in microphone, speaker, display, and touch screen. The firmware does the following:

Connects to WiFi, then opens a WebSocket to the Worker
Sends hello and waits for welcome
On touch: sends start_call, receives the agent’s greeting, then begins streaming mic audio as binary PCM16 frames
Receives status updates, transcripts, and audio — plays audio through the speaker using triple-buffered playRaw()
On touch again: sends end_call

This part took the most debugging. The ESP32 is a constrained device, and the M5Stack Core2 has some quirks that weren’t obvious from the documentation.

Gotcha: mic reinit after speaker playback

The M5Stack Core2 has separate I2S buses for the microphone and speaker, but Speaker.playRaw() disrupts the mic’s I2S state. After playback stops, the mic produces silence. This one took me a while to figure out — I kept thinking my WebSocket connection was dropping, but the mic was just… silent.

The fix: fully tear down and reinitialize the mic after each playback session:

void stop_playback() {
    M5.Speaker.stop();
    M5.Speaker.end();
    is_playing = false;

    // Restart mic — Speaker.playRaw disrupts the mic I2S bus
    M5.Mic.end();
    delay(10);
    auto mic_cfg = M5.Mic.config();
    mic_cfg.sample_rate = 16000;
    mic_cfg.magnification = 16;
    M5.Mic.config(mic_cfg);
    M5.Mic.begin();
}

Gotcha: WebSocket Host header

The ESP32 WebSocket library sends Host: hostname:443 in the header, but routeAgentRequest (which uses partyserver internally) expects just Host: hostname. The extra :443 causes routing to fail silently — no error, no log, just a connection that never reaches the Durable Object.

Important: You need to patch the WebSocketsClient library to omit the port when it’s 443 or 80.

Gotcha: Durable Object path routing

routeAgentRequest converts Durable Object binding names to kebab-case for URL routing. The binding VoiceAgent maps to path /agents/voice-agent/default, not /agents/VoiceAgent/default. The coding agents spent an embarrassing amount of time on this one.
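
When you build the WebSocket URL by hand in firmware, it helps to see the naming convention spelled out. The helper below is my own approximation of the mapping described above, not code from the agents package; toKebabCase and agentPath are hypothetical names, and default is the instance name used in this article.

```typescript
// Approximate the kebab-case conversion: insert a hyphen at each
// lowercase-or-digit to uppercase boundary, then lowercase everything.
function toKebabCase(name: string): string {
  return name.replace(/([a-z0-9])([A-Z])/g, '$1-$2').toLowerCase();
}

function agentPath(bindingName: string, instance: string = 'default'): string {
  return `/agents/${toKebabCase(bindingName)}/${instance}`;
}

console.log(agentPath('VoiceAgent')); // /agents/voice-agent/default
```

Names with consecutive capitals, such as acronyms, may be handled differently by the real router, so verify those against a working connection rather than trusting this sketch.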

Greeting on call start

One nice touch: the agent can speak immediately when a call begins by implementing onCallStart:

async onCallStart(connection: Connection) {
  await this.speak(connection, 'Hi! How can I help you?');
}

That means start_call can produce server audio before the user says anything. It makes the experience feel much more natural.

Adding smart home control: Elgato Air Light via Mesh

Now for the interesting part. I wanted to say “turn on my light” and have the Worker control the Elgato Air light sitting on my local network.

The challenge

The Elgato Air Light exposes a REST API on the local network (http://<ip>:9123/elgato/lights). But the Worker runs on Cloudflare’s edge — it can’t reach 192.168.x.x directly.

The solution: Cloudflare Mesh + VPC Networks

Why Cloudflare Mesh and not Cloudflare Tunnel?

If you’ve used Cloudflare before, you might be wondering: why not just use Cloudflare Tunnel? Both connect your private network to Cloudflare, but they solve different problems.

Cloudflare Tunnel (cloudflared) is designed for publishing specific services to the internet. You configure a public hostname (like light.example.com), and Tunnel proxies inbound traffic from the internet to your local service. It’s great for “I want my app reachable at this URL.” But each service needs its own tunnel route, and the Worker can’t initiate arbitrary requests to any local IP — it can only reach the services you’ve explicitly published.

Cloudflare Mesh (formerly WARP Connector) is designed for private network connectivity. A Mesh node advertises CIDR routes, making an entire subnet reachable. With a VPC Network binding, your Worker gets a MESH.fetch() that can reach any IP and port in the advertised range — no per-service configuration needed.

| | Cloudflare Tunnel | Cloudflare Mesh |
| --- | --- | --- |
| Traffic direction | Inbound to origin: clients connect to published services | Bidirectional: any participant can initiate |
| Addressing | By public hostname | By private IP (every participant gets a Mesh IP) |
| Worker access | Reach specific published services | Reach any IP/port in the advertised subnet |
| Connector | `cloudflared` | `warp-cli` |
| Protocols | HTTP/S, TCP, SSH, RDP, SMB | TCP, UDP, ICMP |
| Best for | Exposing apps to the internet | Private network connectivity, VPN replacement |

For this project, the Worker needs to call the Elgato’s local REST API at 192.168.x.x:9123 — a private IP that shouldn’t be exposed publicly. Mesh gives the Worker outbound access to the entire local subnet with a single binding. If I add more smart devices later, they’re automatically reachable too — no new tunnel routes to configure.

This is the same approach I used in my previous article to expose OpenClaw to the internet, but this time using Mesh instead of Tunnels.

The flow:

Worker calls env.MESH.fetch("http://192.168.x.x:9123/elgato/lights")
    → Cloudflare routes to Mesh network
    → Mesh node on local LAN receives the request
    → Forwards to Elgato at 192.168.x.x:9123
    → Response flows back the same path

Setting up the Mesh node

Mesh nodes require Linux. I used a Raspberry Pi 4 sitting on the same local network as the Elgato light.

Step 1: Create a Mesh node in the Cloudflare dashboard

Go to Networking > Mesh and select Add a node. Name it (e.g. home-network) and copy the connector token.

Step 2: Install the WARP client on the Raspberry Pi

SSH into the Pi and run:

# Add Cloudflare's GPG key and repo
curl -fsSL https://pkg.cloudflareclient.com/pubkey.gpg \
  | sudo gpg --yes --dearmor -o /usr/share/keyrings/cloudflare-warp-archive-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/cloudflare-warp-archive-keyring.gpg] https://pkg.cloudflareclient.com/ $(lsb_release -cs) main" \
  | sudo tee /etc/apt/sources.list.d/cloudflare-client.list

sudo apt-get update && sudo apt-get install -y cloudflare-warp

Step 3: Register as a Mesh connector and connect

sudo warp-cli connector new <YOUR_TOKEN>
sudo warp-cli connect

Verify:

sudo warp-cli status
# Should show: Status update: Connected

The node should appear as Online in the Mesh dashboard with a Mesh IP assigned.

Step 4: Add a CIDR route

In the Mesh dashboard, go to your node > Routes tab > Add route: 192.168.x.0/24. This tells Cloudflare that this Mesh node can forward traffic to devices on the local 192.168.x.x subnet — including the Elgato light.

Step 5: Configure NAT/MASQUERADE on the Mesh node

Important: By default, traffic from your Worker arrives at the Mesh node with a source IP in the 100.96.0.0/12 WARP range. When the Mesh node forwards this to a local device (like your Elgato), that device will try to reply to its default gateway (your router) instead of the Mesh node, causing connection timeouts.

You need to configure the Mesh node to rewrite the source IP before forwarding to local devices. I cover the exact nftables commands in the Gotchas section below. However, if your application is running on the same machine as the Mesh node, you don’t need to set this up.
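
The 100.96.0.0/12 range is easier to reason about with the arithmetic spelled out: a /12 fixes the top 12 bits, so it spans 100.96.0.0 through 100.111.255.255. The sketch below checks whether an address falls in a CIDR block, which can be handy when reading source addresses in a packet capture on the Mesh node. ipToInt and inCidr are hypothetical helper names.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  const parts = ip.split('.');
  const valid = parts.length === 4 && parts.every(p => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) throw new Error(`Invalid IPv4 address: ${ip}`);
  const n = parts.map(Number);
  // >>> 0 reinterprets the signed 32-bit result as unsigned
  return ((n[0] << 24) | (n[1] << 16) | (n[2] << 8) | n[3]) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Invalid prefix length in ${cidr}`);
  }
  // A /0 mask is all zeros; otherwise keep the top `bits` bits
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipToInt(ip) & mask) >>> 0) === ((ipToInt(base) & mask) >>> 0);
}

console.log(inCidr('100.96.0.2', '100.96.0.0/12')); // true
console.log(inCidr('192.168.8.187', '100.96.0.0/12')); // false
```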

The Elgato REST API

The Elgato Key Light / Air Light exposes a simple HTTP API on port 9123:

| Endpoint | Method | Description |
| --- | --- | --- |
| `/elgato/lights` | GET | Get current state |
| `/elgato/lights` | PUT | Set state |
| `/elgato/accessory-info` | GET | Device info |

The state payload:

{
  "numberOfLights": 1,
  "lights": [{
    "on": 1,
    "brightness": 50,
    "temperature": 200
  }]
}

on: 1 = on, 0 = off
brightness: 0–100
temperature: 143–344 (mirek scale — 143 = ~7000K cool white, 344 = ~2900K warm white)
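
Most people think in Kelvin rather than mireds, so a small conversion helper makes the temperature field easier to work with. The relation is mired = 1,000,000 / Kelvin. The function names below are hypothetical, and the clamp uses the 143 to 344 range listed above.

```typescript
const MIN_MIRED = 143; // coolest setting, roughly 7000K
const MAX_MIRED = 344; // warmest setting, roughly 2900K

// Convert Kelvin to the light's temperature value, clamped to its range.
function kelvinToMired(kelvin: number): number {
  if (!(kelvin > 0)) throw new RangeError('kelvin must be positive');
  const mired = Math.round(1000000 / kelvin);
  return Math.min(MAX_MIRED, Math.max(MIN_MIRED, mired));
}

function miredToKelvin(mired: number): number {
  if (!(mired > 0)) throw new RangeError('mired must be positive');
  return Math.round(1000000 / mired);
}

console.log(kelvinToMired(5000)); // 200
console.log(kelvinToMired(7000)); // 143
console.log(miredToKelvin(344)); // 2907
```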

Adding tool calling to the voice agent

Now that I had a way to reach the Elgato from the Worker, I needed the LLM to call the right API based on what I say. The Vercel AI SDK supports tool calling — you define tools with descriptions and parameters, and the LLM decides when to call them based on user intent.

The kimi-k2.6 model on Workers AI supports multi-turn tool calling natively. When you pass tools to streamText, the SDK:

Sends tool definitions to the LLM
When the LLM returns a tool call, executes the execute function
Feeds the result back to the LLM
The LLM generates a natural language response

The textStream returned to onTurn only contains the final spoken text — all the tool calling happens transparently.
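
To make that loop concrete, here is a toy version with a scripted stand-in for the model. It is not how the AI SDK is implemented; ModelStep, runToolLoop, fakeModel, and fakeTools are all hypothetical, and the sketch only illustrates the call, execute, feed back, respond cycle with a cap on the number of steps.

```typescript
type ModelStep =
  | { kind: 'tool_call'; name: string; args: Record<string, unknown> }
  | { kind: 'text'; text: string };

type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;
type Model = (history: string[]) => Promise<ModelStep>;

async function runToolLoop(
  model: Model,
  tools: Record<string, ToolFn>,
  userText: string,
  maxSteps: number,
): Promise<string> {
  const history: string[] = [`user: ${userText}`];
  for (let step = 0; step < maxSteps; step++) {
    const out = await model(history);
    if (out.kind === 'text') return out.text; // final answer: this is what gets spoken
    const tool = tools[out.name];
    const result = tool ? await tool(out.args) : { error: `unknown tool ${out.name}` };
    // Feed the tool result back so the next model step can use it
    history.push(`tool ${out.name}: ${JSON.stringify(result)}`);
  }
  return 'Sorry, I could not finish that.'; // step cap reached without a final answer
}

// Scripted model: call the tool once, then answer once a tool result is in the history.
const fakeModel: Model = async (history): Promise<ModelStep> => {
  if (history.some(h => h.startsWith('tool '))) {
    return { kind: 'text', text: 'Done, the light is on.' };
  }
  return { kind: 'tool_call', name: 'turn_light_on', args: {} };
};

const fakeTools: Record<string, ToolFn> = {
  turn_light_on: async () => ({ success: true }),
};

runToolLoop(fakeModel, fakeTools, 'turn on my light', 5).then(reply => console.log(reply));
// Done, the light is on.
```

The step cap matters: a model that keeps requesting tools would otherwise never produce the text that gets spoken.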

Wrangler config

{
  "compatibility_flags": ["nodejs_compat"],
  "compatibility_date": "2025-09-21",
  "migrations": [
    {
      "new_sqlite_classes": ["VoiceAgent"],
      "tag": "v1"
    }
  ],
  "durable_objects": {
    "bindings": [
      {
        "class_name": "VoiceAgent",
        "name": "VoiceAgent"
      }
    ]
  },
  "ai": {
    "binding": "AI"
  },
  "vpc_networks": [
    {
      "binding": "MESH",
      "network_id": "cf1:network",
      "remote": true
    }
  ],
  "vars": {
    "ELGATO_IP": "192.168.8.187"
  }
}

Tool definitions

import { tool } from 'ai';
import { z } from 'zod/v4';

const ELGATO_PORT = 9123;

function elgatoUrl(env: Env, path: string) {
  return `http://${env.ELGATO_IP}:${ELGATO_PORT}${path}`;
}

function elgatoTools(env: Env) {
  const fetchLight = async (init?: RequestInit) => {
    try {
      return await env.MESH.fetch(elgatoUrl(env, '/elgato/lights'), init);
    } catch {
      return new Response(JSON.stringify({ error: 'Light unreachable via Mesh' }), {
        status: 503,
        headers: { 'Content-Type': 'application/json' },
      });
    }
  };
  const putLight = (body: object) =>
    fetchLight({
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
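  const getLightState = async () => {
    const res = await fetchLight();
    // Surface Mesh/HTTP failures clearly instead of a TypeError on lights[0] later
    if (!res.ok) throw new Error(`Light request failed with status ${res.status}`);
    return (await res.json()) as {
      numberOfLights: number;
      lights: Array<{ on: number; brightness: number; temperature: number }>;
    };
  };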

  return {
    get_light_status: tool({
      description: 'Get the current status of the desk light',
      inputSchema: z.object({}),
      execute: async () => {
        const res = await fetchLight();
        return await res.json();
      },
    }),

    turn_light_on: tool({
      description: 'Turn the desk light on',
      inputSchema: z.object({}),
      execute: async () => {
        const state = await getLightState();
        const light = state.lights[0];
        light.on = 1;
        const res = await putLight({ numberOfLights: 1, lights: [light] });
        return { success: res.ok };
      },
    }),

    turn_light_off: tool({
      description: 'Turn the desk light off',
      inputSchema: z.object({}),
      execute: async () => {
        const state = await getLightState();
        const light = state.lights[0];
        light.on = 0;
        const res = await putLight({ numberOfLights: 1, lights: [light] });
        return { success: res.ok };
      },
    }),

    set_light_brightness: tool({
      description: 'Set the desk light brightness (0-100)',
      inputSchema: z.object({
        brightness: z.number().min(0).max(100),
      }),
      execute: async ({ brightness }) => {
        const state = await getLightState();
        const light = state.lights[0];
        light.brightness = brightness;
        const res = await putLight({
          numberOfLights: 1,
          lights: [light],
        });
        return { success: res.ok, brightness };
      },
    }),

    set_light_temperature: tool({
      description: 'Set the color temperature (143=cool to 344=warm)',
      inputSchema: z.object({
        temperature: z.number().min(143).max(344),
      }),
      execute: async ({ temperature }) => {
        const state = await getLightState();
        const light = state.lights[0];
        light.temperature = temperature;
        const res = await putLight({
          numberOfLights: 1,
          lights: [light],
        });
        return { success: res.ok, temperature };
      },
    }),
  };
}
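
The four mutating tools above repeat the same read, modify, write sequence. One way to trim that, offered as a suggestion rather than what my Worker actually does, is a pure helper that merges a partial change into the current state. LightState, ElgatoState, and applyLightPatch are hypothetical names.

```typescript
interface LightState {
  on: number;
  brightness: number;
  temperature: number;
}

interface ElgatoState {
  numberOfLights: number;
  lights: LightState[];
}

// Return a new PUT payload with `patch` applied to the first light.
// The input state is not mutated, so callers can compare before and after.
function applyLightPatch(state: ElgatoState, patch: Partial<LightState>): ElgatoState {
  const current = state.lights[0];
  if (!current) throw new Error('Light state contains no lights');
  return { numberOfLights: 1, lights: [{ ...current, ...patch }] };
}

const before: ElgatoState = {
  numberOfLights: 1,
  lights: [{ on: 0, brightness: 50, temperature: 200 }],
};
console.log(JSON.stringify(applyLightPatch(before, { on: 1, brightness: 30 })));
// {"numberOfLights":1,"lights":[{"on":1,"brightness":30,"temperature":200}]}
```

Each tool’s execute could then shrink to something like putLight(applyLightPatch(await getLightState(), { brightness })).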

Then in onTurn:

import { stepCountIs } from 'ai';

async onTurn(transcript: string, context: VoiceTurnContext) {
  const workersAi = createWorkersAI({ binding: this.env.AI });
  const messages = [
    ...context.messages.map(m => ({
      role: m.role as 'user' | 'assistant',
      content: m.content,
    })),
    // `as const` keeps the role typed as the literal 'user' rather than string
    { role: 'user' as const, content: transcript },
  ];

  const result = streamText({
    model: workersAi('@cf/moonshotai/kimi-k2.6'),
    system: `You are a helpful voice assistant that can also control the desk light.
When asked about the light, use the available tools. Keep responses concise and natural.`,
    tools: elgatoTools(this.env),
    messages,
    abortSignal: context.signal,
    stopWhen: stepCountIs(5),
  });

  return result.textStream;
}

That’s it. The LLM handles intent recognition. When I say “make it brighter,” the model calls set_light_brightness. When I say “what’s the weather,” it just responds normally. No keyword parsing, no intent classification system — the LLM figures it out.

Note: The stopWhen: stepCountIs(5) option gives the model enough room for the tool-call → tool-result → final-answer loop, while preventing an accidental unbounded tool loop. In my Worker, I also log tool-call start/finish and step summaries so Mesh or schema failures are visible in Worker logs.

What’s running where

| Component | Where | What it does |
| --- | --- | --- |
| ESP32 firmware | M5Stack Core2 on my desk | Mic input, speaker output, touch UI, WebSocket client |
| VoiceAgent | Cloudflare Worker (Durable Object) | STT, LLM, TTS, tool execution, conversation history |
| Workers AI | Cloudflare edge | Flux STT, kimi-k2.6 LLM, Deepgram aura-2-en TTS |
| Mesh node | Raspberry Pi 4 on local LAN | WARP connector bridging Cloudflare to local network |
| Elgato Air Light | Local network (192.168.8.187:9123) | HTTP API for light control |

Gotchas and lessons learned

I hit a lot of issues building this. Here’s a summary of everything I ran into, including some I already mentioned above.

The built-in WorkersAITTS defaults to MP3. If your client can’t decode MP3, you need a custom TTS class that explicitly requests encoding: "linear16". I covered this earlier in the PCM16 TTS section.
routeAgentRequest uses kebab-case paths. The Durable Object binding VoiceAgent maps to URL path /agents/voice-agent/default, not /agents/VoiceAgent/default.
ESP32 mic needs reinit after speaker playback. On the M5Stack Core2, Speaker.playRaw() disrupts the mic I2S bus. You must call Speaker.end(), Mic.end(), then Mic.begin() to restore it.
WebSocket Host header matters. The ESP32 WebSocket library sends Host: hostname:443, which breaks routeAgentRequest routing. Patch the library to omit standard ports.
afterSynthesize returning null is valid. You can use it to chunk large TTS audio into smaller WebSocket frames — just send them yourself via connection.send() and return null so the SDK doesn’t double-send.
Tool calling needs a bounded multi-step loop. Define tools with execute functions, pass them to streamText, and use stopWhen: stepCountIs(5) so the SDK can run the tool-call → execute → feed-result → generate-response loop. The textStream only yields the final spoken text.
Mesh routing requires NAT/MASQUERADE on the Mesh node. If your Worker gets HandshakeTimeoutError when calling a local device via env.MESH.fetch(), the issue is asymmetric routing. When a packet arrives from the Worker, its source IP is in the 100.96.0.0/12 WARP range. The local device replies to its default gateway (your router), not back to the Mesh node.

The official Cloudflare docs recommend solving this by either making the Mesh node the subnet’s default gateway, or adding a static route on your router that points 100.96.0.0/12 to the Mesh node. The coding agent went with a different approach: rewriting the source IP (MASQUERADE) on the Mesh node before forwarding to local devices.

On modern Linux systems using nftables (most newer Raspberry Pi OS versions), add this rule:

# Check which firewall tool is available
which nft  # If this returns a path, use nftables. If not, install iptables.

# Add MASQUERADE rule (replace wlan0 with your LAN interface, e.g. eth0)
sudo nft add table ip nat
sudo nft add chain ip nat postrouting { type nat hook postrouting priority 100 \; }
sudo nft add rule ip nat postrouting oifname "wlan0" iifname "CloudflareWARP" masquerade

To verify this is the issue before fixing it, SSH into your Mesh node and run:

# This will fail (simulates the Worker's packet path)
curl --interface 100.96.0.2 http://<ELGATO_IP>:9123/elgato/lights
# Error: Failed to connect / Handshake timeout

# This works (local origin traffic)
curl http://<ELGATO_IP>:9123/elgato/lights
# Returns: {"numberOfLights":1,...}

If the first curl fails but the second succeeds, you need the MASQUERADE rule. Make it persistent across reboots by saving the ruleset and loading it at boot:

echo 'table ip nat {
    chain postrouting {
        type nat hook postrouting priority 100;
        oifname "wlan0" iifname "CloudflareWARP" masquerade
    }
}' | sudo tee /etc/nftables-mesh-nat.nft

sudo nft -f /etc/nftables-mesh-nat.nft

# Persist across reboots (add to crontab)
(crontab -l 2>/dev/null; echo "@reboot sleep 10 && sudo nft -f /etc/nftables-mesh-nat.nft") | crontab -

Summary

I started with a simple frustration — reaching for my phone every time I wanted to adjust my desk light. What I ended up with is a voice assistant that runs on an ESP32, processes everything on Cloudflare’s edge, and controls local devices through Mesh networking.

The stack that made this possible:

@cloudflare/voice SDK — handles the hard parts of real-time voice (STT, TTS, conversation state, interruption)
Workers AI — LLM with tool calling for intent recognition
Cloudflare Mesh — bridges the gap between the edge and my local network
Vercel AI SDK — clean tool calling abstraction on top of Workers AI

The biggest surprises were the ESP32 quirks (mic reinit after speaker, WebSocket Host header) and the Mesh NAT issue. The ESP32 quirks weren’t documented anywhere I could find, and the NAT problem was easy to miss, so I hope this article saves you some debugging time.

What’s next

There’s a lot more I want to do with this setup:

Add more devices — I have other smart lights on my local network that I’d love to control by voice
Improve the ESP32 experience — a proper UI on the display showing conversation state and light status
Experiment with wake word detection instead of the touch-to-talk button
Try different LLMs as Workers AI adds more options. Kimi K2.6 is excellent, but a bit overkill for this, so I might try smaller models such as Granite 4.0 or others from Workers AI.

If you are building something similar, or run into any issues, feel free to hit me up on X (Twitter) or LinkedIn. I’d love to hear about what you’re building with the Voice SDK and Mesh. I also co-authored a book, Building a Virtual Assistant with Raspberry Pi, that will help you learn how to build an offline-first virtual assistant!
