feat(social): speech + LLM + TTS + orchestrator (#81 #83 #85 #89) #102

Merged
sl-jetson merged 1 commit from sl-jetson/social-speech-llm-tts into main 2026-03-02 08:24:25 -05:00

Summary

Issue #81 — Speech pipeline (speech_pipeline_node.py):

  • OpenWakeWord 'hey_salty' gate → Silero VAD → faster-whisper STT (Orin GPU FP16) → ECAPA-TDNN speaker diarization
  • Latency target: <500ms wake-to-first-token, streaming partial transcripts
  • speech_utils.py: EnergyVad, UtteranceSegmenter, cosine speaker ID — all pure Python (28 tests)
  • Publishes: /social/speech/transcript (SpeechTranscript) + /social/speech/vad_state (VadState)
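The pure-Python energy gate in speech_utils could look roughly like the sketch below. The class name matches the summary, but the threshold, hangover behavior, and method names are assumptions, not the actual implementation:

```python
import numpy as np

class EnergyVad:
    """Flags a frame as speech when its RMS energy exceeds a threshold.

    Hypothetical sketch; threshold and hangover defaults are assumptions.
    """

    def __init__(self, threshold: float = 0.01, hangover_frames: int = 5):
        self.threshold = threshold              # RMS level treated as speech
        self.hangover_frames = hangover_frames  # keep "speech" alive briefly after energy drops
        self._hang = 0

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(frame.astype(np.float32) ** 2)))
        if rms >= self.threshold:
            self._hang = self.hangover_frames
            return True
        if self._hang > 0:
            self._hang -= 1  # decay the hangover counter on quiet frames
            return True
        return False
```

The hangover keeps short intra-word pauses from chopping an utterance into fragments before the segmenter sees it.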

Issue #83 — Conversation engine (conversation_node.py):

  • Phi-3-mini Q4_K_M GGUF via llama-cpp-python (CUDA, 20 GPU layers), streaming token output
  • Per-person sliding-window context (4K tokens), summary compression, SOUL.md system prompt, group mode
  • llm_context.py: PersonContext, ContextStore (JSON persistence), build_llama_prompt (ChatML)
  • Publishes: /social/conversation/response (ConversationResponse, partial + final)
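A prompt builder in the spirit of build_llama_prompt can be sketched as follows. The function name, signature, and the Phi-3-style role markers are assumptions; the real ChatML rendering in llm_context.py may differ:

```python
# Hypothetical sketch of a chat-template prompt builder for llama.cpp.
def build_chatml_prompt(system: str, history: list[tuple[str, str]], user_msg: str) -> str:
    """Render a system prompt plus (role, text) history into a single prompt string."""
    parts = [f"<|system|>\n{system}<|end|>"]
    for role, text in history:  # role is "user" or "assistant"
        parts.append(f"<|{role}|>\n{text}<|end|>")
    parts.append(f"<|user|>\n{user_msg}<|end|>")
    parts.append("<|assistant|>")  # generation starts after this marker
    return "\n".join(parts)
```

The sliding-window context would trim or summarize `history` before calling this, so the rendered prompt stays inside the 4K-token budget.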

Issue #85 — Streaming TTS (tts_node.py):

  • Piper ONNX streaming synthesis, sentence-by-sentence first-chunk streaming (<200ms to first audio)
  • sounddevice USB speaker playback, volume control, optional PCM ROS2 publish
  • tts_utils.py: split_sentences, pcm16_to_wav_bytes, chunk_pcm, apply_volume, strip_ssml (30 tests)
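The sentence splitter is what enables the sentence-by-sentence first-chunk streaming above: synthesis can start on the first sentence while the LLM is still emitting the rest. A minimal sketch, assuming a regex-based split (the actual rules in tts_utils.py may be more elaborate):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split text on sentence-final punctuation so TTS can start on the first sentence."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```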

Issue #89 — Pipeline orchestrator (orchestrator_node.py):

  • State machine: IDLE→LISTENING→THINKING→SPEAKING→THROTTLED
  • GPU memory watchdog (throttle when <2 GB free), rolling latency stats (p50/p95 per stage), VAD watchdog
  • social_bot.launch.py: all 4 nodes with TimerAction delays, IfCondition gates
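The state machine above can be sketched as a transition table. State names follow the summary, but the event names and the exact transitions (including how THROTTLED is entered and exited) are assumptions:

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()
    SPEAKING = auto()
    THROTTLED = auto()

# Hypothetical transition table; event names are illustrative.
TRANSITIONS = {
    (State.IDLE, "wake_word"): State.LISTENING,
    (State.LISTENING, "transcript_final"): State.THINKING,
    (State.THINKING, "response_final"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.IDLE,
    # GPU watchdog can throttle the busy states; recovery returns to IDLE.
    (State.THINKING, "low_gpu_memory"): State.THROTTLED,
    (State.SPEAKING, "low_gpu_memory"): State.THROTTLED,
    (State.THROTTLED, "gpu_recovered"): State.IDLE,
}

def step(state: State, event: str) -> State:
    """Apply one event; events with no defined transition are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the table as plain data makes the orchestrator's behavior easy to unit-test without ROS2.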

New msgs: SpeechTranscript.msg, VadState.msg, ConversationResponse.msg
Config: speech_params, conversation_params, tts_params, orchestrator_params (all YAML)
Tests: 58 passing (28 speech_utils + 30 llm_context/tts_utils)

Closes #81
Closes #83
Closes #85
Closes #89

sl-jetson added 1 commit 2026-03-02 08:19:13 -05:00
Issue #81 — Speech pipeline:
- speech_pipeline_node.py: OpenWakeWord "hey_salty" → Silero VAD → faster-whisper
  STT (Orin GPU, <500ms wake-to-transcript) → ECAPA-TDNN speaker diarization
- speech_utils.py: pcm16↔float32, EnergyVad, UtteranceSegmenter (pre-roll, max-
  duration), cosine speaker identification — all pure Python, no ROS2/GPU needed
- Publishes /social/speech/transcript (SpeechTranscript) + /social/speech/vad_state
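The pcm16↔float32 helpers might look like the sketch below; the names and the full-scale-at-32768 convention are assumptions about what speech_utils.py actually does:

```python
import numpy as np

def pcm16_to_float32(pcm: bytes) -> np.ndarray:
    """Decode little-endian int16 PCM bytes into float32 samples in [-1, 1)."""
    return np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Clip to [-1, 1] and encode as little-endian int16 PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()
```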

Issue #83 — Conversation engine:
- conversation_node.py: llama-cpp-python GGUF (Phi-3-mini Q4_K_M, 20 GPU layers),
  streaming token output, per-person sliding-window context (4K tokens), summary
  compression, SOUL.md system prompt, group mode
- llm_context.py: PersonContext, ContextStore (JSON persistence), build_llama_prompt
  (ChatML format), context compression via LLM summarization
- Publishes /social/conversation/response (ConversationResponse, partial + final)

Issue #85 — Streaming TTS:
- tts_node.py: Piper ONNX streaming synthesis, sentence-by-sentence first-chunk
  streaming (<200ms to first audio), sounddevice USB speaker playback, volume control
- tts_utils.py: split_sentences, pcm16_to_wav_bytes, chunk_pcm, apply_volume, strip_ssml

Issue #89 — Pipeline orchestrator:
- orchestrator_node.py: IDLE→LISTENING→THINKING→SPEAKING state machine, GPU memory
  watchdog (throttle at <2GB free), rolling latency stats (p50/p95 per stage),
  VAD watchdog (alert if speech pipeline hangs), /social/orchestrator/state JSON pub
- social_bot.launch.py: brings up all 4 nodes with TimerAction delays
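A rolling p50/p95 tracker like the one described can be sketched with a bounded deque; the class name, window size, and quantile method are assumptions:

```python
from collections import deque
import statistics

class RollingLatency:
    """Per-stage latency tracker over a sliding window of recent samples."""

    def __init__(self, window: int = 100):
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, latency_s: float) -> None:
        self.samples.append(latency_s)

    def percentiles(self) -> tuple[float, float]:
        """Return (p50, p95) over the current window (needs >= 2 samples)."""
        qs = statistics.quantiles(self.samples, n=100, method="inclusive")
        return qs[49], qs[94]  # 50th and 95th percentile cut points
```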

New messages: SpeechTranscript.msg, VadState.msg, ConversationResponse.msg
Config YAMLs: speech_params, conversation_params, tts_params, orchestrator_params
Tests: 58 tests (28 speech_utils + 30 llm_context/tts_utils), all passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
sl-webui force-pushed sl-jetson/social-speech-llm-tts from 71d6ce610b to 5043578934 2026-03-02 08:23:25 -05:00
sl-jetson merged commit 0f2ea7931b into main 2026-03-02 08:24:25 -05:00
Reference: seb/saltylab-firmware#102