# Social-bot Model Directory

## Layout

```
/models/
├── onnx/                          # Source ONNX models (version-pinned)
│   ├── scrfd_10g_bnkps.onnx       # Face detection — InsightFace SCRFD-10GF
│   ├── arcface_r100.onnx          # Face recognition — ArcFace R100 (buffalo_l)
│   └── ecapa_tdnn.onnx            # Speaker embedding — ECAPA-TDNN (SpeechBrain export)
│
├── engines/                       # TensorRT FP16 compiled engines
│   ├── scrfd_10g_fp16.engine      # SCRFD → TRT FP16 (640×640)
│   ├── arcface_r100_fp16.engine   # ArcFace → TRT FP16 (112×112)
│   └── ecapa_tdnn_fp16.engine     # ECAPA-TDNN → TRT FP16 (variable length)
│
├── whisper-small-ct2/             # faster-whisper CTranslate2 format (auto-downloaded)
│   ├── model.bin
│   └── tokenizer.json
│
├── piper/                         # Piper TTS voice models
│   ├── en_US-lessac-medium.onnx
│   └── en_US-lessac-medium.onnx.json
│
├── gguf/                          # Quantized LLM (llama-cpp-python)
│   └── phi-3-mini-4k-instruct-q4_k_m.gguf   # ~2.2GB — Phi-3-mini Q4_K_M
│
└── speechbrain_ecapa/             # SpeechBrain pretrained checkpoint cache
```
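
After a `convert_models.sh all` run, the tree can be sanity-checked before the service starts. A minimal pre-flight sketch (`check_models` is a hypothetical helper, not something the repo ships):

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: confirm every model file the services
# expect exists under the given models root; prints each missing path.
check_models() {
  local root="$1" missing=0 f
  local required=(
    onnx/scrfd_10g_bnkps.onnx
    onnx/arcface_r100.onnx
    onnx/ecapa_tdnn.onnx
    engines/scrfd_10g_fp16.engine
    engines/arcface_r100_fp16.engine
    engines/ecapa_tdnn_fp16.engine
    piper/en_US-lessac-medium.onnx
    gguf/phi-3-mini-4k-instruct-q4_k_m.gguf
  )
  for f in "${required[@]}"; do
    [ -f "$root/$f" ] || { echo "MISSING: $root/$f"; missing=1; }
  done
  return "$missing"
}

# check_models /models && echo "all models present"
```

Running it as an `ExecStartPre=` step in the service unit would fail fast on a missing model instead of at first inference.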

## Model Versions

| Model | Version | Source | Size |
|---|---|---|---|
| SCRFD-10GF | InsightFace 0.7 | GitHub releases | 17MB |
| ArcFace R100 (w600k_r50) | InsightFace buffalo_l | Auto via insightface | 166MB |
| ECAPA-TDNN | SpeechBrain spkrec-ecapa-voxceleb | HuggingFace | 87MB |
| Whisper small | faster-whisper 1.0+ | CTranslate2 hub | 488MB |
| Piper en_US-lessac-medium | Rhasspy piper-voices | HuggingFace | 63MB |
| Phi-3-mini-4k Q4_K_M | microsoft/Phi-3-mini-4k-instruct | GGUF / HuggingFace | 2.2GB |
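
Because the table pins exact versions, a checksum manifest is the cheapest way to detect drift or a corrupted download. A sketch, assuming a `SHA256SUMS` file that the repo does not currently ship:

```shell
# Hypothetical checksum pinning. Generate the manifest once after a
# known-good "convert_models.sh all" run, then verify on every start.
pin_models() {
  (cd "$1" && sha256sum onnx/*.onnx gguf/*.gguf > SHA256SUMS)
}

verify_models() {
  (cd "$1" && sha256sum -c --quiet SHA256SUMS)
}

# pin_models /models       # once, after downloads finish
# verify_models /models    # e.g. at container start
```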

## Setup

```bash
# From within the social container:
/scripts/convert_models.sh all         # download + convert all models
/scripts/convert_models.sh benchmark   # run latency benchmark suite
/scripts/convert_models.sh health      # check GPU memory
```
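
For reference, the engine builds and latency runs can be reproduced by hand with `trtexec` (TensorRT's bundled CLI). Using it here is an assumption; the authoritative flags are whatever `convert_models.sh` actually passes:

```shell
# Assumed manual equivalent of the "all" step, for one engine:
trtexec --onnx=/models/onnx/scrfd_10g_bnkps.onnx \
        --saveEngine=/models/engines/scrfd_10g_fp16.engine \
        --fp16

# Assumed manual equivalent of the "benchmark" step: time a built engine.
trtexec --loadEngine=/models/engines/scrfd_10g_fp16.engine \
        --warmUp=500 --iterations=100
```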

## Performance Targets (Orin Nano Super, JetPack 6, FP16)

| Model | Input | Target | Typical |
|---|---|---|---|
| SCRFD-10GF | 640×640 | <15ms | ~8ms |
| ArcFace R100 | 4×112×112 | <5ms | ~3ms |
| ECAPA-TDNN | 1s audio | <20ms | ~12ms |
| Whisper small | 1s audio | <300ms | ~180ms |
| Piper lessac-medium | 10 words | <200ms | ~60ms |
| Phi-3-mini Q4_K_M | prompt | <500ms TTFT | ~350ms |

TTFT: time to first token.

## LLM Download

```bash
# Download the Phi-3-mini GGUF manually (2.2GB):
wget -O /models/gguf/phi-3-mini-4k-instruct-q4_k_m.gguf \
  "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-q4.gguf"

# Or use llama-cpp-python's built-in download. Note that from_pretrained
# stores the file under a HuggingFace cache tree inside cache_dir, not at
# the flat path used above:
python3 -c "
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id='microsoft/Phi-3-mini-4k-instruct-gguf',
    filename='Phi-3-mini-4k-instruct-q4.gguf',
    cache_dir='/models/gguf',
    n_gpu_layers=20,
)
"
```
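
Either path deserves a size sanity check afterwards: a failed download (auth wall, 404 page) often leaves a small HTML file behind that llama.cpp will then refuse to load. A sketch (`check_gguf_size` is a hypothetical helper):

```shell
# Hypothetical post-download check: the Q4_K_M file should be ~2.2GB, so
# anything far smaller is almost certainly an error page, not a model.
check_gguf_size() {
  local f="$1" min_bytes="${2:-2000000000}"
  local size
  size=$(stat -c%s "$f" 2>/dev/null || stat -f%z "$f")  # GNU stat, BSD fallback
  if [ "$size" -lt "$min_bytes" ]; then
    echo "WARNING: $f is only $size bytes; re-download it" >&2
    return 1
  fi
}

# check_gguf_size /models/gguf/phi-3-mini-4k-instruct-q4_k_m.gguf
```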