sl-jetson a9b2242a2c feat(social): Orin dev environment — JetPack 6 + TRT conversion + systemd (#88)
- Dockerfile.social: social-bot container with faster-whisper, llama-cpp-python
  (CUDA), piper-tts, insightface, pyannote.audio, OpenWakeWord, pyaudio
- scripts/convert_models.sh: TRT FP16 conversion for SCRFD-10GF, ArcFace-R100,
  ECAPA-TDNN; CTranslate2 setup for Whisper; Piper voice download; benchmark suite
- config/asound.conf: ALSA USB mic (card1) + USB speaker (card2) config
- models/README.md: version-pinned model table, /models/ layout, perf targets
- systemd/: saltybot-social.service + saltybot.target + install_systemd.sh
- docker-compose.yml: saltybot-social service with GPU, audio device passthrough,
  NVMe volume mounts for /models and /social_db

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 08:08:57 -05:00

# Social-bot Model Directory
## Layout
```
/models/
├── onnx/                                # Source ONNX models (version-pinned)
│   ├── scrfd_10g_bnkps.onnx             # Face detection — InsightFace SCRFD-10GF
│   ├── arcface_r100.onnx                # Face recognition — ArcFace R100 (buffalo_l)
│   └── ecapa_tdnn.onnx                  # Speaker embedding — ECAPA-TDNN (SpeechBrain export)
├── engines/                             # TensorRT FP16 compiled engines
│   ├── scrfd_10g_fp16.engine            # SCRFD → TRT FP16 (640×640)
│   ├── arcface_r100_fp16.engine         # ArcFace → TRT FP16 (112×112)
│   └── ecapa_tdnn_fp16.engine           # ECAPA-TDNN → TRT FP16 (variable len)
├── whisper-small-ct2/                   # faster-whisper CTranslate2 format (auto-downloaded)
│   ├── model.bin
│   └── tokenizer.json
├── piper/                               # Piper TTS voice models
│   ├── en_US-lessac-medium.onnx
│   └── en_US-lessac-medium.onnx.json
├── gguf/                                # Quantized LLM (llama-cpp-python)
│   └── phi-3-mini-4k-instruct-q4_k_m.gguf  # ~2.2GB — Phi-3-mini Q4_K_M
└── speechbrain_ecapa/                   # SpeechBrain pretrained checkpoint cache
```
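The services fail at startup if any of these files are absent, so it can be useful to sanity-check the layout before launching the container. A minimal sketch (the `missing_models` helper and the default `/models` root are illustrative, not part of this repo; the file list mirrors the tree above):

```python
from pathlib import Path

# Model files the container expects at startup (mirrors the tree above).
EXPECTED = [
    "onnx/scrfd_10g_bnkps.onnx",
    "onnx/arcface_r100.onnx",
    "onnx/ecapa_tdnn.onnx",
    "engines/scrfd_10g_fp16.engine",
    "engines/arcface_r100_fp16.engine",
    "engines/ecapa_tdnn_fp16.engine",
    "piper/en_US-lessac-medium.onnx",
    "gguf/phi-3-mini-4k-instruct-q4_k_m.gguf",
]

def missing_models(root: str = "/models") -> list[str]:
    """Return the expected model files that are absent under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]

if __name__ == "__main__":
    gaps = missing_models()
    print("missing:\n  " + "\n  ".join(gaps) if gaps else "all models present")
```

The TRT engines and the Whisper/Piper downloads are produced by `convert_models.sh`, so an empty `engines/` directory just means the conversion step has not run yet.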
## Model Versions
| Model | Version | Source | Size |
|---|---|---|---|
| SCRFD-10GF | InsightFace 0.7 | GitHub releases | 17MB |
| ArcFace R100 (w600k_r50) | InsightFace buffalo_l | Auto via insightface | 166MB |
| ECAPA-TDNN | SpeechBrain spkrec-ecapa-voxceleb | HuggingFace | 87MB |
| Whisper small | faster-whisper 1.0+ | CTranslate2 hub | 488MB |
| Piper en_US-lessac-medium | Rhasspy piper-voices | HuggingFace | 63MB |
| Phi-3-mini-4k Q4_K_M | microsoft/Phi-3-mini-4k-instruct | GGUF / HuggingFace | 2.2GB |
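ArcFace and ECAPA-TDNN both produce fixed-length embedding vectors that are matched by cosine similarity against enrolled identities. A minimal stdlib sketch of that comparison (the 0.4 threshold is an illustrative assumption, not a value from this repo; real thresholds must be tuned per model and deployment):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(emb: list[float], enrolled: list[float], threshold: float = 0.4) -> bool:
    # Threshold is illustrative only; face and speaker models need
    # separately tuned operating points.
    return cosine_similarity(emb, enrolled) >= threshold
```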
## Setup
```bash
# From within the social container:
/scripts/convert_models.sh all # download + convert all models
/scripts/convert_models.sh benchmark # run latency benchmark suite
/scripts/convert_models.sh health # check GPU memory
```
## Performance Targets (Orin Nano Super, JetPack 6, FP16)
| Model | Input | Target | Typical |
|---|---|---|---|
| SCRFD-10GF | 640×640 | <15ms | ~8ms |
| ArcFace R100 | 4×112×112 | <5ms | ~3ms |
| ECAPA-TDNN | 1s audio | <20ms | ~12ms |
| Whisper small | 1s audio | <300ms | ~180ms |
| Piper lessac-medium | 10 words | <200ms | ~60ms |
| Phi-3-mini Q4_K_M | prompt | <500ms TTFT | ~350ms |
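The targets above can be checked with a simple timing harness. The sketch below is generic (it is not the repo's benchmark suite): wrap any model's inference call as `fn` and read off median and tail latency:

```python
import time
import statistics

def benchmark(fn, warmup: int = 5, runs: int = 50) -> dict[str, float]:
    """Time fn() repeatedly; report p50/p95 latency in milliseconds."""
    for _ in range(warmup):  # warm-up runs stabilize clocks and caches
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }
```

On Jetson, lock clocks first (`jetson_clocks`) or the numbers will drift with DVFS.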
## LLM Download
```bash
# Download Phi-3-mini GGUF manually (2.2GB):
wget -O /models/gguf/phi-3-mini-4k-instruct-q4_k_m.gguf \
"https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"
# Or use llama-cpp-python's built-in download:
python3 -c "
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id='microsoft/Phi-3-mini-4k-instruct-gguf',
    filename='Phi-3-mini-4k-instruct-q4.gguf',
    cache_dir='/models/gguf',
    n_gpu_layers=20,
)
"
```
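After a manual download it is worth verifying the file before first use; a truncated 2.2GB GGUF fails in confusing ways at load time. A generic checksum sketch (the helper is illustrative; compare the result against the digest published on the model's HuggingFace page):

```python
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a 2.2GB GGUF never loads fully into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

A quicker smoke test is to compare the on-disk size against the ~2.2GB listed in the table above.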