// Under the hood

Six microservices. One streaming pipeline.

Kenzy isn't a black box — it's a set of small services with clear jobs, wired together by a simple protocol. Run them all on one machine or spread them across the network. Disable a stage by omitting its URL.

FIG. 01 Request lifecycle

From wake word to spoken reply.

kenzy-node · MIC · Wake word + audio capture · openWakeWord on every frame; streams only after activation
↓  raw PCM over WebSocket  ·  16 kHz
kenzy-server · PORT 8765 · Pipeline orchestrator · buffers audio per room; on session end, fans out
↓  on session end  ·  run in parallel
kenzy-stt · PORT 8767 · Speech-to-text · faster-whisper → transcript
kenzy-speaker · PORT 8768 · Speaker ID · SpeechBrain → who's talking
↓  transcript + speaker
kenzy-llm · PORT 8766 · Fast path → LLM + skills · deterministic match first; LiteLLM tool-calling as fallback
↓  reply text + voice style
kenzy-tts · PORT 8769 · Text-to-speech · OpenAI or local Kokoro → PCM
↓  raw PCM over WebSocket  ·  24 kHz
kenzy-node · SPEAKER · Streamed playback in the room · persistent output stream; barge-in to interrupt
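The fan-out in the lifecycle above can be sketched with asyncio. This is a minimal illustration, not Kenzy's actual code: the four coroutines are hypothetical stand-ins for requests to kenzy-stt, kenzy-speaker, kenzy-llm, and kenzy-tts, and their return values are made up so the example runs on its own.

```python
import asyncio

# Hypothetical stand-ins for the network calls to each service.
async def transcribe(pcm: bytes) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-stt
    return "turn on the lights"

async def identify_speaker(pcm: bytes) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-speaker
    return "alice"

async def respond(transcript: str, speaker: str) -> str:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-llm
    return f"Okay {speaker}, handling: {transcript}"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0)  # placeholder for a request to kenzy-tts
    return text.encode("utf-8")  # a real service would return PCM audio

async def handle_session(pcm: bytes) -> bytes:
    """Run one buffered utterance through the pipeline."""
    # STT and speaker ID both need only the audio, so they run in parallel.
    transcript, speaker = await asyncio.gather(
        transcribe(pcm), identify_speaker(pcm)
    )
    # The LLM stage needs both results; TTS needs the reply text.
    reply = await respond(transcript, speaker)
    return await synthesize(reply)
```

Calling `asyncio.run(handle_session(pcm))` runs the whole chain once. `asyncio.gather` returns results in argument order, which is why the tuple unpacking is safe.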

FIG. 02 Service map

What each service does.

Service   Command        Port   Role & engine
node      kenzy-node     -      Wake word, audio capture & TTS playback · openWakeWord
server    kenzy-server   8765   WebSocket hub & pipeline orchestrator
stt       kenzy-stt      8767   Speech-to-text · faster-whisper
llm       kenzy-llm      8766   LLM, skills & fast path · LiteLLM
tts       kenzy-tts      8769   Text-to-speech · OpenAI / Kokoro
speaker   kenzy-speaker  8768   Speaker identification · SpeechBrain ECAPA-TDNN

// every downstream stage is optional — omit its URL in server.yaml to disable it.
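One way to picture "omit the URL to disable the stage" is a lookup that treats a missing key as "off". The sketch below is an assumption throughout: the key names (`stt_url` and so on) and the `http://` scheme are hypothetical and are not taken from the real server.yaml schema. Only the port numbers come from the service map.

```python
# Hypothetical stage names, in pipeline order.
STAGES = ("stt", "speaker", "llm", "tts")

def enabled_stages(config: dict) -> list[str]:
    """Return the stages whose URL key is present and non-empty."""
    # "<stage>_url" is an assumed key name, not the documented schema.
    return [stage for stage in STAGES if config.get(f"{stage}_url")]

# Example: speaker identification is disabled by leaving its URL out.
config = {
    "stt_url": "http://localhost:8767",
    "llm_url": "http://localhost:8766",
    "tts_url": "http://localhost:8769",
}
active = enabled_stages(config)
```

The docs list the actual keys. The point here is only that a stage with no URL is skipped.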

FIG. 03 · PROTOCOL

Plain PCM over WebSocket

Control messages are JSON text frames. Audio is raw int16 PCM binary frames — no custom codecs, nothing to reverse-engineer.

16 kHz capture  ·  80 ms frame  ·  1280 samples per frame  ·  24 kHz playback
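The frame numbers follow from one another: 16,000 samples per second × 0.080 s = 1280 samples, and at two bytes per int16 sample that is 2560 bytes per binary frame. The sketch below checks the arithmetic and splits a buffer into frames. It assumes mono audio, which the page does not state, and `split_frames` is a hypothetical helper name.

```python
CAPTURE_HZ = 16_000      # capture sample rate
FRAME_MS = 80            # frame duration
BYTES_PER_SAMPLE = 2     # int16 PCM; mono is an assumption

samples_per_frame = CAPTURE_HZ * FRAME_MS // 1000
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE

def split_frames(pcm: bytes) -> list[bytes]:
    """Split a PCM buffer into whole frames, dropping a trailing partial frame."""
    count = len(pcm) // bytes_per_frame
    return [
        pcm[i * bytes_per_frame:(i + 1) * bytes_per_frame]
        for i in range(count)
    ]
```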

Control messages

hello · audio_start · audio_end · wakeword · trigger · stop · tts_start · tts_end
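Because control messages are text frames and audio is binary, a receiver can tell them apart by frame type alone. The sketch below shows that split. The `"type"` field is an assumed name for where the message name lives in the JSON, and the list of names is read from the control messages listed above. The real payload shape is in the docs.

```python
import json

# Message names as listed on this page.
CONTROL_TYPES = {
    "hello", "audio_start", "audio_end",
    "wakeword", "trigger", "stop",
    "tts_start", "tts_end",
}

def classify_frame(frame):
    """Return ("audio", bytes) for binary frames, ("control", dict) for JSON text."""
    if isinstance(frame, (bytes, bytearray)):
        return ("audio", bytes(frame))  # raw int16 PCM, passed through untouched
    message = json.loads(frame)
    # "type" is an assumed field name, used here only for illustration.
    if message.get("type") not in CONTROL_TYPES:
        raise ValueError(f"unknown control message: {message!r}")
    return ("control", message)
```

Most Python WebSocket libraries already deliver text frames as `str` and binary frames as `bytes`, so the `isinstance` check is typically all the framing logic a client needs.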

FIG. 04 · NODE STATE MACHINE

Three states, on the edge

Each node runs three concurrent async tasks. Wake-word inference runs in every state, so you can interrupt playback and start a new request mid-sentence.

IDLE · Listening for the wake word · → STREAMING on detection or server trigger
STREAMING · Sending PCM to the server · ends on VAD silence, timeout, or hard cap
TTS · Playing the reply · wake word interrupts → back to STREAMING
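The three states can be modelled as a small transition table. This is a sketch under stated assumptions: the event names are hypothetical, and the page does not say which state follows STREAMING, so the example assumes the node returns to IDLE until a reply starts playing.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    STREAMING = "streaming"
    TTS = "tts"

# (current state, event) -> next state. Event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "wakeword"): State.STREAMING,       # local detection
    (State.IDLE, "trigger"): State.STREAMING,        # server-initiated
    (State.IDLE, "tts_start"): State.TTS,            # reply arrives (assumed)
    (State.STREAMING, "stream_end"): State.IDLE,     # VAD silence, timeout, or cap (assumed target)
    (State.TTS, "wakeword"): State.STREAMING,        # barge-in
    (State.TTS, "tts_end"): State.IDLE,              # playback finished (assumed)
}

def next_state(state: State, event: str) -> State:
    """Return the next state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

The `(State.TTS, "wakeword")` entry is the barge-in behaviour described above: because wake-word inference keeps running during playback, the node can jump straight back to STREAMING.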

FIG. 05 · LATENCY

A fast path around the model

The server hands the transcript to kenzy-llm, which first tries a deterministic matcher. Common commands resolve and execute locally — no model round-trip — while everything else falls through to the full tool-calling loop.

It's how "turn on the lights" feels instant while "what should I cook with what's in the fridge?" still gets the full reasoning of a language model.

/process · transcript + speaker
FAST PATH · match → act · milliseconds
LLM FALLBACK · tool-calling loop · when unmatched
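A deterministic matcher in front of a model can be as simple as a list of patterns tried in order. The sketch below illustrates the idea with one regular expression. It is not Kenzy's matcher, and `fast_path`, `llm_fallback`, and `process` are hypothetical names (the latter borrowed loosely from the `/process` label above).

```python
import re

# Each entry pairs a pattern with an action that builds the result.
FAST_PATH = [
    (
        re.compile(r"^turn (on|off) the (\w+)$", re.IGNORECASE),
        lambda m: f"{m.group(2)} {m.group(1).lower()}",
    ),
]

def fast_path(transcript: str):
    """Return an action result if a pattern matches, else None."""
    text = transcript.strip().rstrip(".!?")  # tolerate trailing punctuation
    for pattern, action in FAST_PATH:
        match = pattern.match(text)
        if match:
            return action(match)
    return None

def llm_fallback(transcript: str) -> str:
    # Placeholder for the tool-calling loop; no model is called here.
    return f"[LLM] {transcript}"

def process(transcript: str) -> str:
    """Try the deterministic matcher first, then fall through to the model."""
    result = fast_path(transcript)
    return result if result is not None else llm_fallback(transcript)
```

With this table, `process("Turn on the lights.")` resolves locally, while a question about dinner falls through to `llm_fallback`. A regex lookup costs microseconds, which is why the matched case avoids the latency of a model round-trip.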

// Read the details

Every config key, documented.

The docs cover each service's settings, the skill API, deployment, and speaker enrollment in depth.