sl-jetson a9b2242a2c feat(social): Orin dev environment — JetPack 6 + TRT conversion + systemd (#88)
- Dockerfile.social: social-bot container with faster-whisper, llama-cpp-python
  (CUDA), piper-tts, insightface, pyannote.audio, OpenWakeWord, pyaudio
- scripts/convert_models.sh: TRT FP16 conversion for SCRFD-10GF, ArcFace-R100,
  ECAPA-TDNN; CTranslate2 setup for Whisper; Piper voice download; benchmark suite
- config/asound.conf: ALSA USB mic (card1) + USB speaker (card2) config
- models/README.md: version-pinned model table, /models/ layout, perf targets
- systemd/: saltybot-social.service + saltybot.target + install_systemd.sh
- docker-compose.yml: saltybot-social service with GPU, audio device passthrough,
  NVMe volume mounts for /models and /social_db

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 08:08:57 -05:00


Social-bot Model Directory

Layout

/models/
├── onnx/                          # Source ONNX models (version-pinned)
│   ├── scrfd_10g_bnkps.onnx      # Face detection — InsightFace SCRFD-10GF
│   ├── arcface_r100.onnx         # Face recognition — ArcFace R100 (buffalo_l)
│   └── ecapa_tdnn.onnx           # Speaker embedding — ECAPA-TDNN (SpeechBrain export)
│
├── engines/                       # TensorRT FP16 compiled engines
│   ├── scrfd_10g_fp16.engine     # SCRFD → TRT FP16 (640×640)
│   ├── arcface_r100_fp16.engine  # ArcFace → TRT FP16 (112×112)
│   └── ecapa_tdnn_fp16.engine    # ECAPA-TDNN → TRT FP16 (variable len)
│
├── whisper-small-ct2/             # faster-whisper CTranslate2 format (auto-downloaded)
│   ├── model.bin
│   └── tokenizer.json
│
├── piper/                         # Piper TTS voice models
│   ├── en_US-lessac-medium.onnx
│   └── en_US-lessac-medium.onnx.json
│
├── gguf/                          # Quantized LLM (llama-cpp-python)
│   └── phi-3-mini-4k-instruct-q4_k_m.gguf  # ~2.2GB — Phi-3-mini Q4_K_M
│
└── speechbrain_ecapa/             # SpeechBrain pretrained checkpoint cache
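
The layout above can double as a pre-flight checklist. A minimal sketch (the file list is taken from the tree; the `missing_models` helper and default root are illustrative, not part of the repo):

```python
# Check that the expected model files from the layout above exist.
# EXPECTED mirrors the tree; auto-downloaded/cache dirs are omitted.
from pathlib import Path

EXPECTED = [
    "onnx/scrfd_10g_bnkps.onnx",
    "onnx/arcface_r100.onnx",
    "onnx/ecapa_tdnn.onnx",
    "engines/scrfd_10g_fp16.engine",
    "engines/arcface_r100_fp16.engine",
    "engines/ecapa_tdnn_fp16.engine",
    "piper/en_US-lessac-medium.onnx",
    "piper/en_US-lessac-medium.onnx.json",
]

def missing_models(root: str = "/models") -> list[str]:
    """Return the relative paths from EXPECTED that are absent under root."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]
```

Running `missing_models()` before starting the container surfaces incomplete conversions early, instead of failing at first inference.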

Model Versions

| Model | Version | Source | Size |
|---|---|---|---|
| SCRFD-10GF | InsightFace 0.7 | GitHub releases | 17MB |
| ArcFace R100 (w600k_r50) | InsightFace buffalo_l | Auto via insightface | 166MB |
| ECAPA-TDNN | SpeechBrain spkrec-ecapa-voxceleb | HuggingFace | 87MB |
| Whisper small | faster-whisper 1.0+ | CTranslate2 hub | 488MB |
| Piper en_US-lessac-medium | Rhasspy piper-voices | HuggingFace | 63MB |
| Phi-3-mini-4k Q4_K_M | microsoft/Phi-3-mini-4k-instruct | GGUF / HuggingFace | 2.2GB |
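
The sizes in the table sum to roughly 3 GB, which matters on an NVMe-backed /models volume. A quick sanity check (sizes copied from the table; the dict and its keys are illustrative):

```python
# Disk-space estimate for the pinned models above (sizes in MB, from the
# Model Versions table; whisper-small-ct2 counts the whole CT2 directory).
MODEL_SIZES_MB = {
    "scrfd_10g_bnkps.onnx": 17,
    "arcface_r100.onnx": 166,
    "ecapa_tdnn.onnx": 87,
    "whisper-small-ct2": 488,
    "en_US-lessac-medium.onnx": 63,
    "phi-3-mini-4k-instruct-q4_k_m.gguf": 2200,
}

total_mb = sum(MODEL_SIZES_MB.values())
print(f"total download: {total_mb} MB (~{total_mb / 1024:.1f} GiB)")
```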

Setup

# From within the social container:
/scripts/convert_models.sh all          # download + convert all models
/scripts/convert_models.sh benchmark    # run latency benchmark suite
/scripts/convert_models.sh health       # check GPU memory

Performance Targets (Orin Nano Super, JetPack 6, FP16)

| Model | Input | Target | Typical |
|---|---|---|---|
| SCRFD-10GF | 640×640 | <15ms | ~8ms |
| ArcFace R100 | 4×112×112 | <5ms | ~3ms |
| ECAPA-TDNN | 1s audio | <20ms | ~12ms |
| Whisper small | 1s audio | <300ms | ~180ms |
| Piper lessac-medium | 10 words | <200ms | ~60ms |
| Phi-3-mini Q4_K_M | prompt | <500ms TTFT | ~350ms |
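
Summing the "typical" column gives a rough end-to-end budget for one spoken interaction. The pipeline ordering below is a hypothetical sketch (the README does not specify stage ordering or parallelism); the numbers are taken directly from the table:

```python
# Back-of-envelope latency budget: sum of typical per-stage latencies,
# assuming the stages run sequentially (vision stages could overlap ASR).
typical_ms = {
    "asr (Whisper small, 1s audio)": 180,
    "speaker id (ECAPA-TDNN)": 12,
    "face detect (SCRFD-10GF)": 8,
    "face id (ArcFace R100)": 3,
    "llm first token (Phi-3-mini)": 350,
    "tts first audio (Piper)": 60,
}

total = sum(typical_ms.values())
print(f"typical sequential pipeline latency: {total} ms")
```

Even fully serialized, the typical path stays under ~650 ms, with the LLM time-to-first-token dominating.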

LLM Download

# Download Phi-3-mini GGUF manually (2.2GB):
wget -O /models/gguf/phi-3-mini-4k-instruct-q4_k_m.gguf \
  "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"

# Or use llama-cpp-python's built-in download:
python3 -c "
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id='microsoft/Phi-3-mini-4k-instruct-gguf',
    filename='Phi-3-mini-4k-instruct-q4.gguf',
    cache_dir='/models/gguf',
    n_gpu_layers=20
)
"