FIG. 03 · PROTOCOL
Plain PCM over WebSocket
Control messages are JSON text frames. Audio is raw int16 PCM binary frames — no custom codecs, nothing to reverse-engineer.
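As a rough illustration of that split, a receiver can tell the two frame kinds apart by type alone: text frames arrive as strings, binary frames as bytes. The sketch below is not Kenzy's actual code. It assumes little-endian samples, and the helper names `encode_pcm` and `decode_frame` are made up for the example.

```python
import json
import struct


def encode_pcm(samples):
    """Pack a sequence of ints into int16 PCM bytes (little-endian assumed)."""
    return struct.pack(f"<{len(samples)}h", *samples)


def decode_frame(frame):
    """Classify one WebSocket frame by its Python type.

    str   -> ("control", parsed JSON object)
    bytes -> ("audio", list of int16 samples)
    """
    if isinstance(frame, str):
        return "control", json.loads(frame)
    if len(frame) % 2:
        # int16 samples are two bytes each, so an odd length is malformed.
        raise ValueError("PCM frame has an odd byte count")
    count = len(frame) // 2
    return "audio", list(struct.unpack(f"<{count}h", frame))
```

Because the audio is plain PCM, any tool that reads raw int16 samples can inspect a captured stream without a decoder.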
Control messages
// Under the hood
Kenzy isn't a black box — it's a set of small services with clear jobs, wired together by a simple protocol. Run them all on one machine or spread them across the network. Disable a stage by omitting its URL.
FIG. 01 · REQUEST LIFECYCLE
FIG. 02 · SERVICE MAP
| Service | Command | Port | Role & engine |
|---|---|---|---|
| node | kenzy-node | — | Wake word, audio capture & TTS playback · openWakeWord |
| server | kenzy-server | 8765 | WebSocket hub & pipeline orchestrator |
| stt | kenzy-stt | 8767 | Speech-to-text · faster-whisper |
| llm | kenzy-llm | 8766 | LLM, skills & fast path · LiteLLM |
| tts | kenzy-tts | 8769 | Text-to-speech · OpenAI / Kokoro |
| speaker | kenzy-speaker | 8768 | Speaker identification · SpeechBrain ECAPA-TDNN |
// every downstream stage is optional — omit its URL in server.yaml to disable it.
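One way to picture the "omit its URL" rule is a sketch where a stage is enabled only if its URL key is present and non-empty. The key names (`stt_url`, `llm_url`, and so on) and the `ws://` scheme are assumptions made for illustration; the real server.yaml schema may differ. The ports come from the table above.

```python
# Hypothetical key names; check the docs for the real server.yaml schema.
STAGES = ("stt", "llm", "tts", "speaker")


def enabled_stages(config):
    """Return {stage: url} for each stage whose URL is present and non-empty."""
    return {
        stage: config[f"{stage}_url"]
        for stage in STAGES
        if config.get(f"{stage}_url")
    }


# Stand-in for a parsed server.yaml.
config = {
    "stt_url": "ws://localhost:8767",
    "llm_url": "ws://localhost:8766",
    "tts_url": "ws://localhost:8769",
    # speaker_url is omitted, so speaker identification is skipped.
}
```

With this config, `enabled_stages(config)` returns entries for stt, llm and tts only.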
FIG. 04 · NODE STATE MACHINE
Each node runs three concurrent async tasks. Wake-word inference runs in every state, so you can interrupt playback and start a new request mid-sentence.
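A minimal asyncio sketch of that idea follows. The state names, the split into wake-word, capture and playback tasks, and all the timings are assumptions for the example, not Kenzy's actual implementation. The point it shows: the wake-word task never checks the current state, so a detection during playback moves the node straight back to capturing.

```python
import asyncio
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CAPTURING = "capturing"
    PLAYING = "playing"


class Node:
    def __init__(self):
        self.state = State.IDLE
        self.history = [State.IDLE]
        self.wake = asyncio.Event()   # set on every wake-word detection
        self.reply = asyncio.Queue()  # replies waiting to be played

    def transition(self, new_state):
        self.state = new_state
        self.history.append(new_state)

    async def wake_word_task(self, triggers):
        # Stand-in for wake-word inference; runs regardless of self.state.
        while True:
            await triggers.get()
            self.wake.set()

    async def capture_task(self):
        while True:
            await self.wake.wait()
            self.wake.clear()
            self.transition(State.CAPTURING)  # pre-empts any playback
            await asyncio.sleep(0.03)         # stand-in for record + round trip
            await self.reply.put(["chunk"] * 10)

    async def playback_task(self):
        while True:
            chunks = await self.reply.get()
            self.transition(State.PLAYING)
            for _ in chunks:
                await asyncio.sleep(0.02)     # stand-in for playing one chunk
                if self.state is not State.PLAYING:
                    break                     # interrupted by a new request
            if self.state is State.PLAYING:
                self.transition(State.IDLE)


async def demo():
    node = Node()
    triggers = asyncio.Queue()
    tasks = [
        asyncio.create_task(coro)
        for coro in (
            node.wake_word_task(triggers),
            node.capture_task(),
            node.playback_task(),
        )
    ]
    await triggers.put("wake")  # first request
    await asyncio.sleep(0.08)   # the reply is mid-playback by now
    await triggers.put("wake")  # barge in
    await asyncio.sleep(0.5)    # let the second reply finish
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return node.history


if __name__ == "__main__":
    print([state.name for state in asyncio.run(demo())])
```

In the demo, the second wake word arrives while the first reply is still playing, so the recorded history goes from PLAYING back to CAPTURING without passing through IDLE.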
FIG. 05 · LATENCY
The server hands the transcript to kenzy-llm, which first tries a deterministic matcher. Common commands resolve and execute locally — no model round-trip — while everything else falls through to the full tool-calling loop.
That fast path is how "turn on the lights" feels instant while "what should I cook with what's in the fridge?" still gets the full reasoning of a language model.
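The matcher-then-fallback flow could look something like the sketch below. The regex table, the `set_lights` handler and the `llm_fallback` callback are hypothetical stand-ins; Kenzy's real matcher and skill API are covered in the docs.

```python
import re


def set_lights(state):
    # Hypothetical skill handler; a real one would call a device API.
    return f"lights {state}"


# Hypothetical command table: compiled pattern -> handler.
FAST_PATH = [
    (re.compile(r"^turn (on|off) the lights$"), set_lights),
]


def normalize(text):
    """Lowercase and drop punctuation so transcripts match deterministically."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def handle(transcript, llm_fallback):
    """Try the deterministic matcher first, then fall back to the model."""
    text = normalize(transcript)
    for pattern, handler in FAST_PATH:
        match = pattern.match(text)
        if match:
            # Resolved locally: no model round-trip.
            return "fast", handler(*match.groups())
    # Nothing matched: hand the original transcript to the LLM loop.
    return "llm", llm_fallback(transcript)
```

Here `handle("Turn on the lights.", fallback)` resolves on the fast path without calling `fallback`, while an open-ended question falls through to it.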
// Read the details
The docs cover each service's settings, the skill API, deployment, and speaker enrollment in depth.