// Under the hood

Six microservices. One streaming pipeline.

Kenzy isn't a black box — it's a set of small services with clear jobs, wired together by a simple protocol. Run them all on one machine or spread them across the network. Disable a stage by omitting its URL.

FIG. 01 Request lifecycle

From wake word to spoken reply.

kenzy-node · MIC · Wake word + audio capture · openWakeWord on every frame; streams only after activation

↓ raw PCM over WebSocket · 16 kHz

kenzy-server · PORT 8765 · Pipeline orchestrator · buffers audio per room; on session end, fans out

↓ on session end · run in parallel

kenzy-stt · PORT 8767 · Speech-to-text · faster-whisper → transcript

kenzy-speaker · PORT 8768 · Speaker ID · SpeechBrain → who's talking

↓ transcript + speaker

kenzy-llm · PORT 8766 · Fast path → LLM + skills · deterministic match first; LiteLLM tool-calling as fallback

↓ reply text + voice style

kenzy-tts · PORT 8769 · Text-to-speech · OpenAI or local Kokoro → PCM

↓ raw PCM over WebSocket · 24 kHz

kenzy-node · SPEAKER · Streamed playback in the room · persistent output stream; barge-in to interrupt
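
The lifecycle above can be sketched in a few lines of Python. This is a minimal illustration, not Kenzy's actual orchestrator: the four service functions are hypothetical stand-ins that return canned values, since the real request and response shapes are not shown here. The part worth noticing is that speech-to-text and speaker ID both depend only on the buffered audio, so they can run concurrently before the LLM stage.

```python
import asyncio

# Hypothetical stand-ins for the calls the server would make to each service.
# Names, arguments, and return values are illustrative assumptions.
async def transcribe(pcm: bytes) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-stt
    return "turn on the lights"

async def identify_speaker(pcm: bytes) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-speaker
    return "alex"

async def generate_reply(transcript: str, speaker: str) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-llm
    return f"Okay {speaker}, done: {transcript}"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-tts
    return text.encode("utf-8")  # stands in for 24 kHz PCM

async def handle_session(pcm: bytes) -> bytes:
    # STT and speaker ID need only the buffered audio, so run them together.
    transcript, speaker = await asyncio.gather(
        transcribe(pcm), identify_speaker(pcm)
    )
    # The LLM stage needs both results, so it runs after the fan-out.
    reply = await generate_reply(transcript, speaker)
    return await synthesize(reply)

# One 80 ms frame of silence (1280 int16 samples) as dummy input.
audio_out = asyncio.run(handle_session(b"\x00\x00" * 1280))
```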

FIG. 02 Service map

What each service does.

Service	Command	Port	Role & engine
node	kenzy-node	—	Wake word, audio capture & TTS playback · openWakeWord
server	kenzy-server	8765	WebSocket hub & pipeline orchestrator
stt	kenzy-stt	8767	Speech-to-text · faster-whisper
llm	kenzy-llm	8766	LLM, skills & fast path · LiteLLM
tts	kenzy-tts	8769	Text-to-speech · OpenAI / Kokoro
speaker	kenzy-speaker	8768	Speaker identification · SpeechBrain ECAPA-TDNN

// every downstream stage is optional — omit its URL in server.yaml to disable it.
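
As a rough sketch of how "omit the URL to disable the stage" might work inside an orchestrator, the snippet below filters a config mapping down to the stages that have a URL. The key names (stt_url and so on) and the http:// scheme are assumptions for illustration; check the docs for the real server.yaml keys.

```python
from typing import Dict, Optional

# Hypothetical config shape. Key names are illustrative, not the real
# server.yaml schema; ports match the service map above.
config = {
    "stt_url": "http://localhost:8767",
    "llm_url": "http://localhost:8766",
    "tts_url": "http://localhost:8769",
    # "speaker_url" is omitted, so speaker identification is disabled.
}

STAGES = ("stt", "speaker", "llm", "tts")

def enabled_stages(cfg: Dict[str, str]) -> Dict[str, str]:
    """Return only the stages that have a non-empty URL configured."""
    result: Dict[str, str] = {}
    for stage in STAGES:
        url: Optional[str] = cfg.get(f"{stage}_url")
        if url:
            result[stage] = url
    return result

active = enabled_stages(config)
```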

FIG. 03 · PROTOCOL

Plain PCM over WebSocket

Control messages are JSON text frames. Audio is raw int16 PCM binary frames — no custom codecs, nothing to reverse-engineer.

16 kHz capture · 80 ms frame · 1280 samples per frame · 24 kHz playback
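
The capture figures are consistent with each other: 16,000 samples per second times 0.08 seconds is 1280 samples, and at two bytes per int16 sample that is 2560 bytes per frame. The sketch below works this out and packs one frame; mono audio and little-endian byte order are assumptions, not something stated above.

```python
import struct
from typing import List

CAPTURE_HZ = 16_000    # capture sample rate
FRAME_MS = 80          # frame duration
BYTES_PER_SAMPLE = 2   # int16

samples_per_frame = CAPTURE_HZ * FRAME_MS // 1000        # 1280
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE   # 2560

def pack_frame(samples: List[int]) -> bytes:
    """Pack one frame of int16 samples (mono, little-endian assumed)."""
    if len(samples) != samples_per_frame:
        raise ValueError(f"expected {samples_per_frame} samples, got {len(samples)}")
    return struct.pack(f"<{len(samples)}h", *samples)

frame = pack_frame([0] * samples_per_frame)
```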

Control messages

hello · audio_start · audio_end · wakeword · trigger · stop · tts_start · tts_end
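
Because control messages are JSON text frames and audio is binary, a receiver can tell them apart by frame type alone. The following is a sketch under assumptions: the message names are read off the list above, and the "type" field used to carry the name is hypothetical, since the message schema is not shown here.

```python
import json
from typing import Tuple, Union

CONTROL_TYPES = {
    "hello", "audio_start", "audio_end",
    "wakeword", "trigger", "stop",
    "tts_start", "tts_end",
}

def classify_frame(frame: Union[str, bytes]) -> Tuple[str, object]:
    """Split incoming WebSocket frames into audio and control messages."""
    if isinstance(frame, bytes):
        # Binary frame: raw int16 PCM, passed through untouched.
        return ("audio", frame)
    message = json.loads(frame)
    kind = message.get("type")  # hypothetical field name
    if kind not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {kind!r}")
    return ("control", message)
```

Most Python WebSocket libraries already deliver text frames as str and binary frames as bytes, so the isinstance check is usually all the demultiplexing needed.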

FIG. 04 · NODE STATE MACHINE

Three states, on the edge

Each node runs three concurrent async tasks. Wake-word inference runs in every state, so you can interrupt playback and start a new request mid-sentence.

IDLE · Listening for the wake word · → STREAMING on detection or server trigger

STREAMING · Sending PCM to the server · ends on VAD silence, timeout, or hard cap

TTS · Playing the reply · wake word interrupts → back to STREAMING
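
The three states can be modeled as a small transition table. This is an illustrative sketch, not the node's actual code. The event names are hypothetical, and three transitions are assumptions because the figure does not spell them out: that STREAMING returns to IDLE when it ends, that tts_start moves IDLE to TTS, and that tts_end returns to IDLE.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    STREAMING = "streaming"
    TTS = "tts"

# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "wakeword"): State.STREAMING,
    (State.IDLE, "trigger"): State.STREAMING,       # server-initiated
    (State.STREAMING, "vad_silence"): State.IDLE,   # assumed target state
    (State.STREAMING, "timeout"): State.IDLE,       # assumed target state
    (State.STREAMING, "hard_cap"): State.IDLE,      # assumed target state
    (State.IDLE, "tts_start"): State.TTS,           # assumed
    (State.TTS, "wakeword"): State.STREAMING,       # barge-in
    (State.TTS, "tts_end"): State.IDLE,             # assumed
}

def step(state: State, event: str) -> State:
    """Return the next state; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Walk one request: wake, speak, hear the reply, interrupt it.
state = State.IDLE
for event in ("wakeword", "vad_silence", "tts_start", "wakeword"):
    state = step(state, event)
```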

FIG. 05 · LATENCY

A fast path around the model

The server hands the transcript to kenzy-llm, which first tries a deterministic matcher. Common commands resolve and execute locally — no model round-trip — while everything else falls through to the full tool-calling loop.

It's how "turn on the lights" feels instant while "what should I cook with what's in the fridge?" still gets the full reasoning of a language model.

/process · transcript + speaker

FAST PATH · match → act · milliseconds

LLM FALLBACK · tool-calling loop · when unmatched
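
One way to picture the fast path is a table of patterns tried before any model call. The sketch below is an assumption about the general shape, not Kenzy's matcher: the regex, the action, and the llm_fallback callable are all hypothetical, and the real skill API is covered in the docs.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical fast-path table: (pattern, action) pairs tried in order.
FAST_PATH = [
    (
        re.compile(r"^turn (on|off) the (\w+)$", re.IGNORECASE),
        lambda m: f"{m.group(2)} {m.group(1).lower()}",
    ),
]

def try_fast_path(transcript: str) -> Optional[str]:
    """Return a reply if a deterministic pattern matches, else None."""
    text = transcript.strip().rstrip(".!?")
    for pattern, action in FAST_PATH:
        match = pattern.match(text)
        if match:
            return action(match)
    return None

def process(transcript: str, llm_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Try the deterministic matcher first; call the model only when unmatched."""
    reply = try_fast_path(transcript)
    if reply is not None:
        return ("fast_path", reply)
    return ("llm", llm_fallback(transcript))
```

With this shape, process("Turn on the lights.", ...) never touches the fallback, while an open-ended question falls through to it.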

// Read the details

Every config key, documented.

The docs cover each service's settings, the skill API, deployment, and speaker enrollment in depth.
