2026-06-27

Talking to Claude Code through an Alexa

An afternoon experiment: wiring an Amazon Echo to drive Claude Code on my own Mac. The fun was not the voice; it was designing a planning conversation around an LLM that is too slow to answer inside Alexa's eight-second window, and then thinking about what a real product would need.

This was an afternoon experiment, so read it in that spirit. The goal was to start real coding work by talking to a speaker: not "set a timer", but "work on the board script", and have an agent plan it with me and go and do it. No laptop open.

I gave myself a few hours, which means there are shortcuts everywhere. The backend is literally my laptop. The public entry point is a free tunnel. There is exactly one user, me, and no accounts. None of that would survive contact with real users, and I will come back to what a grown-up version looks like at the end. But the core turned out to be more interesting than I expected, because the main constraint forced a better design than the obvious one.

The wall everything runs into

Alexa cuts off a skill's response after about eight seconds. That is the whole design problem in one sentence.

A coding agent does not fit in eight seconds. Even just asking Claude to think of a few options for a task takes anywhere from four to twenty-plus seconds. So you cannot do the obvious thing, which is take what the person said, ask Claude about it, and speak the answer. Alexa will have hung up long before Claude replies.

Once you accept that nothing useful finishes in the window, the design falls out of it: acknowledge instantly, think in the background, and surface the thinking somewhere the eight seconds do not apply.

A planning conversation, not a command

Here is what actually happens when I use it.

me:    "Alexa, open assistant"
claude: "Claude ready. What should I work on?"
 
me:    "work on the board script"
claude: "Okay, work on the board script. I'm putting a few options on the
         dashboard. Tell me which one or describe what you want, then say go."
 
me:    "the second one"
claude: "Here's the plan: add a --since flag to the board script. Say go to
         start, or add more."
 
me:    "go"
claude: "Starting now: add a --since flag to the board script. Watch the
         dashboard, and ask me for the result."

The moment I describe a task, the server does two things at once. It replies immediately with a short acknowledgement, well inside the window, and it spawns a detached job that asks Claude to turn my vague sentence into two or three concrete, short options. Those options land on a dashboard, and the next thing I say is resolved against them. "The second one", "option two", and "three" all map to a choice; anything else is folded in as more detail.
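A minimal sketch of that "reply now, think later" shape, not the actual server code. The function names and the `planner_argv` parameter are made up for illustration; the response dictionary follows the standard Alexa custom-skill JSON format, and the detached job is just a child process started in its own session.

```python
import subprocess
import sys


def alexa_speech(text: str, end_session: bool = False) -> dict:
    """Build a minimal Alexa custom-skill response that speaks `text`."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,  # keep the conversation open
        },
    }


def spawn_detached(argv: list[str]) -> subprocess.Popen:
    """Start a background job and return without waiting for it."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX: the job gets its own session
    )


def handle_task(utterance: str, planner_argv: list[str]) -> dict:
    """Kick off option planning, then acknowledge straight away."""
    # Hypothetical planner command: it receives the utterance as its last
    # argument and is expected to write its options wherever the dashboard reads.
    spawn_detached(planner_argv + [utterance])
    return alexa_speech(
        f"Okay, {utterance}. I'm putting a few options on the dashboard. "
        "Tell me which one or describe what you want, then say go."
    )


if __name__ == "__main__":
    # Stand-in planner: a no-op Python process instead of a real Claude call.
    reply = handle_task("work on the board script", [sys.executable, "-c", "pass"])
    print(reply["response"]["outputSpeech"]["text"])
```

The point of the split is that `handle_task` never waits on the planner, so its latency is the cost of starting a process, not the cost of the model.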

So the latency that made the naive design impossible becomes invisible: by the time I have heard the acknowledgement and decided what I want, the options are already there. The prompt gets built up across a couple of turns, and "go" commits it.
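The turn-by-turn resolution can be sketched as a small piece of state plus one function. This is an illustration under my own assumptions, not the real implementation: `Draft`, `pick_option`, and `take_turn` are hypothetical names, and the matching rules are the simplest ones that cover "the second one", "option two", and "three".

```python
import re
from dataclasses import dataclass, field
from typing import Optional

ORDINALS = {
    "one": 1, "first": 1, "1": 1,
    "two": 2, "second": 2, "2": 2,
    "three": 3, "third": 3, "3": 3,
}


@dataclass
class Draft:
    """The prompt being built up across voice turns."""
    task: str
    options: list[str] = field(default_factory=list)
    chosen: Optional[str] = None
    details: list[str] = field(default_factory=list)

    def prompt(self) -> str:
        # A chosen option replaces the vague task; details are appended.
        return ". ".join([self.chosen or self.task, *self.details])


def pick_option(utterance: str) -> Optional[int]:
    """Return a 1-based option number if the utterance is only a choice."""
    words = [
        w for w in re.findall(r"[a-z0-9]+", utterance.lower())
        if w not in {"the", "option", "number"}
    ]
    # Accept "second", "second one", "two", "3"; reject longer sentences.
    if words and words[0] in ORDINALS and words[1:] in ([], ["one"]):
        return ORDINALS[words[0]]
    return None


def take_turn(draft: Draft, utterance: str) -> str:
    """Apply one utterance to the draft and return what to say back."""
    if utterance.strip().lower() == "go":
        return f"Starting now: {draft.prompt()}."
    choice = pick_option(utterance)
    if choice is not None and choice <= len(draft.options):
        draft.chosen = draft.options[choice - 1]
    else:
        # Anything that is not a valid choice is folded in as more detail.
        draft.details.append(utterance.strip())
    return f"Here's the plan: {draft.prompt()}. Say go to start, or add more."
```

One consequence of this design: a choice that arrives before the options have landed is treated as detail rather than rejected, which a real version would probably want to handle more gracefully.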

(The skill is invoked as "assistant" rather than "claude", because Amazon will not let you claim a single well-known word as a custom invocation name. It just introduces itself as Claude.)

The jobs are real sessions

When I say "go", the task does not run as some opaque background process. It starts in its own tmux session, which means it shows up in the same sessions tool I use for every other Claude session on the machine, and I can attach and watch it work live:

sessions go <pid>   # drop into the running voice-started job

That mattered more than I expected. A voice command that spawns invisible work is unnerving. A voice command that spawns a session I can see, attach to, and read line by line is just a faster way to start something I already understand.
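For anyone wanting the same property, here is a rough sketch of starting a job in a detached tmux session from Python. The session name and the agent command are placeholders I made up; the only real interface relied on is `tmux new-session -d -s <name> <shell-command>`.

```python
import shlex
import subprocess


def tmux_new_session_argv(session_name: str, command: list[str]) -> list[str]:
    """Build the tmux invocation that runs `command` in a detached session."""
    # -d: do not attach to the new session; -s: give it a name.
    # tmux hands the final argument to a shell, so quote the command first.
    return ["tmux", "new-session", "-d", "-s", session_name, shlex.join(command)]


def start_job(session_name: str, command: list[str]) -> None:
    """Start the job; raises CalledProcessError if tmux refuses."""
    subprocess.run(tmux_new_session_argv(session_name, command), check=True)


if __name__ == "__main__":
    # Placeholder command standing in for the real agent invocation.
    print(tmux_new_session_argv("voice-1", ["echo", "hello world"]))
```

A session started this way can be watched with plain `tmux attach -t voice-1`, independent of any wrapper tool.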

A dashboard that stays home

Voice is a poor channel for output, so the real surface is a small web dashboard the server renders. It polls every couple of seconds and shows the prompt being built right now with its options, the jobs with running, done, and failed badges, and a log of voice turns so I can see what it heard and what it said back.

The one decision I am quietly pleased with: the dashboard is refused on the public URL. The tunnel that lets Amazon reach my Mac would also let anyone who found it load my activity dashboard. But tunnelled requests carry an X-Forwarded-For header and direct localhost requests do not, so the server serves the dashboard only to local requests and returns a 403 to anything that arrived over the tunnel.
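The check itself is tiny. A sketch, assuming the server can see the request headers and the peer address; `dashboard_status` is a hypothetical helper, and the whole approach rests on the tunnel reliably adding the header, which is worth confirming for whichever tunnel is in use.

```python
from typing import Mapping

LOOPBACK = {"127.0.0.1", "::1"}


def dashboard_status(headers: Mapping[str, str], remote_addr: str) -> int:
    """Return the HTTP status for a dashboard request: 200 or 403."""
    # The tunnel client usually connects from localhost too, so the peer
    # address alone cannot tell tunnelled traffic from a local browser.
    forwarded = any(name.lower() == "x-forwarded-for" for name in headers)
    if forwarded or remote_addr not in LOOPBACK:
        return 403
    return 200
```

Header names are compared case-insensitively because HTTP does not guarantee their casing.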

Hearing the answer

Long work reports back through whatever channel suits. The job streams to its log and its tmux pane the whole time. When I want the result out loud, I ask:

me:    "Alexa, ask assistant for the status"
claude: "Done, add a --since flag to the board script. <reads the tail of the
         output, kept short enough to be pleasant>"

If the machine has Telegram configured it also pushes the result to my phone, but that is optional sugar now rather than the main path. Short questions are the one thing answered inline: "ask assistant to tell me X" runs Claude read-only with a few seconds of grace, and speaks the answer if it finishes in time.
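The "few seconds of grace" idea can be sketched as a subprocess call with a timeout. This is an assumption about the shape, not the real code: `answer_inline` is a made-up name, and the command passed in stands for whatever read-only invocation actually answers the question.

```python
import subprocess
import sys
from typing import Optional


def answer_inline(argv: list[str], grace_seconds: float) -> Optional[str]:
    """Run a short question; return its answer, or None if it is too slow."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=grace_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout, so nothing lingers.
        return None
    if done.returncode != 0:
        return None
    return done.stdout.strip() or None


if __name__ == "__main__":
    # Stand-in for the real question command.
    print(answer_inline([sys.executable, "-c", "print('forty two')"], 10.0))
```

When the function returns None, the skill can fall back to the usual acknowledgement and leave the answer for the dashboard.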

Keeping it safe enough

Even an afternoon toy that starts a coding agent from a public URL needs a lock, so every request has to prove itself. The server follows Amazon's verification for self-hosted endpoints: the certificate chain URL has to be a genuine Amazon address, the certificate has to be valid and issued for echo-api.amazon.com, the body has to verify against its signature, and the timestamp has to be recent so a captured request cannot be replayed. The skill id is a second lock on top. Questions run read-only; tasks run with a permission mode that edits files but still stops before risky commands.
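Three of those checks need nothing beyond the standard library, so here is a sketch of them. The function names are mine; the rules (https, host s3.amazonaws.com, path under /echo.api/, port 443 if given, and a 150 second timestamp tolerance) are taken from Amazon's documentation for self-hosted skills. The certificate and signature checks are deliberately left out because they need an X.509 library.

```python
import posixpath
from datetime import datetime, timezone
from urllib.parse import urlsplit

MAX_SKEW_SECONDS = 150  # tolerance given in Amazon's documentation


def cert_url_is_amazon(url: str) -> bool:
    """Check the URL from the SignatureCertChainUrl header."""
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:  # malformed port
        return False
    path = posixpath.normpath(parts.path)  # collapse any ../ segments
    return (
        parts.scheme.lower() == "https"
        and (parts.hostname or "").lower() == "s3.amazonaws.com"
        and path.startswith("/echo.api/")  # case-sensitive on purpose
        and port in (None, 443)
    )


def timestamp_is_fresh(stamp: str, now: datetime) -> bool:
    """Reject requests whose timestamp is too far from `now` (tz-aware)."""
    sent = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return abs((now - sent).total_seconds()) <= MAX_SKEW_SECONDS


def skill_id_matches(envelope: dict, expected: str) -> bool:
    """Compare the application id in the request with the configured one."""
    app = envelope.get("context", {}).get("System", {}).get("application", {})
    return app.get("applicationId") == expected


if __name__ == "__main__":
    print(timestamp_is_fresh("2026-06-27T11:58:30Z", datetime.now(timezone.utc)))
```

All of these should fail closed: any False means the request is rejected before it gets near the agent.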

What a real product would look like

Everything above is held together with the cheapest possible parts, and that is the right call for a few hours of play. But it is worth being honest about where the shortcuts are and what you would actually build instead.

The shortcuts: the backend is my Mac, so it only works while the machine is awake with the server and tunnel running. The entry point is a personal free tunnel, not a hosted endpoint. There is one user and no authentication, so "lock it to my skill id" is doing a lot of heavy lifting. And the agent runs directly on my real machine with access to my real files, which is fine for me and unthinkable for anyone else.

A product version inverts most of that:

The skill endpoint becomes a hosted, stateless function behind an API gateway, so there is no laptop to keep awake and no tunnel to babysit. Its only jobs are to verify the request, look up the user, and drop work on a queue, all comfortably inside the eight seconds. Account linking turns the single-user assumption into real multi-tenancy: each person links their own identity, and the skill id stops being the only thing standing between a stranger and a shell.
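To make those three jobs concrete, here is a sketch of what such a handler might look like. Everything in it is hypothetical: the `verify` callable, the `users` mapping from linked-account token to user id, the in-process `queue.Queue` standing in for a real queue service, and the `task` slot name. The one real detail is that a linked account's token arrives at `context.System.user.accessToken`, and that a `LinkAccount` card prompts the user to link.

```python
import queue
from typing import Any, Callable, Mapping


def dig(data: Any, *keys: str) -> Any:
    """Walk nested dicts, returning None as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict):
            return None
        data = data.get(key)
    return data


def speak(text: str, **extra: Any) -> dict:
    """Minimal Alexa response; `extra` adds fields such as a card."""
    speech = {"type": "PlainText", "text": text}
    return {"version": "1.0", "response": {"outputSpeech": speech, **extra}}


def handle(
    envelope: dict,
    verify: Callable[[dict], bool],
    users: Mapping[str, str],
    jobs: "queue.Queue[dict]",
) -> tuple[int, dict]:
    """Verify, look up the user, enqueue the work; return (status, body)."""
    # 1. Verify the request before touching anything else.
    if not verify(envelope):
        return 400, {}
    # 2. Look up the user from the account-linking token.
    token = dig(envelope, "context", "System", "user", "accessToken")
    user_id = users.get(token) if token else None
    if user_id is None:
        return 200, speak(
            "Please link your account in the Alexa app.",
            card={"type": "LinkAccount"},
        )
    # 3. Drop the work on the queue and acknowledge immediately.
    task = dig(envelope, "request", "intent", "slots", "task", "value") or ""
    jobs.put({"user_id": user_id, "task": task})
    return 200, speak(f"Okay, {task}. I'll let you know when it's done.")
```

Nothing in the handler waits on a model, which is what keeps it comfortably inside the response window.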

The actual agent work moves onto sandboxed workers, one isolated environment per user, so a task can never touch anyone else's files or the host. Job and conversation state lives in a small database rather than on one machine's disk, which is what lets a result reach you later, on a different device, after the voice session is long gone. And the results come back through Alexa's proactive notifications or a companion app, so "ask me for the status" becomes "your phone buzzes when it is done."

None of that is exotic. It is the boring, well-trodden shape of any async job system. The interesting part, the planning-conversation-around-latency, stays exactly the same. The afternoon version just runs it on a laptop instead of a fleet.

What it taught me

I went in thinking the hard part would be the voice. It was not; Amazon's speech-to-text is good and free. The hard part, and the fun part, was that the eight-second wall made the naive design impossible and pushed me toward a better one: a system that never pretends to be fast, acknowledges instantly, thinks out of band, and gives you real attachable sessions and a live dashboard instead of a single hopeful reply.

It stopped being "an agent I sit in front of" and became "an agent I can start a conversation with from across the room, then go and watch it work." For an afternoon's wiring, that is a surprisingly nice way to begin a piece of work.

Built from scratch, like everything in this newsletter. Questions about the build? Say hello.
