# saltybot_audio_direction

Audio direction estimator for sound source localization (Issue #430). Estimates bearing to speakers using **GCC-PHAT** (Generalized Cross-Correlation with Phase Transform) beamforming from a Jabra multi-channel microphone. Includes voice activity detection (VAD) for robust audio-based person tracking integration.

## Features

- **GCC-PHAT Beamforming**: Phase-domain cross-correlation for direction-of-arrival estimation
- **Voice Activity Detection (VAD)**: RMS energy-based speech detection with smoothing
- **Stereo/Quadraphonic Support**: Handles Jabra 2-channel and 4-channel modes
- **Robot Self-Noise Filtering**: Optional suppression of motor/wheel noise (future enhancement)
- **ROS2 Integration**: Standard ROS2 topic publishing at configurable rates

## Topics

### Published

- **`/saltybot/audio_direction`** (`std_msgs/Float32`)
  Estimated bearing in degrees (0–360, where 0° = front, 90° = right, 180° = rear, 270° = left)
- **`/saltybot/audio_activity`** (`std_msgs/Bool`)
  Voice activity detected (true if speech-like energy)

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `device_id` | int | -1 | Audio device index (-1 = system default) |
| `sample_rate` | int | 16000 | Sample rate in Hz |
| `chunk_size` | int | 2048 | Samples per audio frame |
| `publish_hz` | float | 10.0 | Output publication rate (Hz) |
| `vad_threshold` | float | 0.02 | RMS energy threshold for VAD |
| `gcc_phat_max_lag` | int | 64 | Max lag for correlation (determines angle resolution) |
| `self_noise_filter` | bool | true | Apply robot motor noise suppression |

## Usage

### Launch Node

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py
```

### With Parameters

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py \
  device_id:=0 \
  publish_hz:=20.0 \
  vad_threshold:=0.01
```

### Using Config File

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py \
  --ros-args \
  --params-file config/audio_direction_params.yaml
```

## Algorithm

### GCC-PHAT

1. Compute the cross-spectrum of stereo/quad microphone pairs in the frequency domain
2. Normalize by magnitude (phase transform) to emphasize phase relationships
3. Inverse FFT to obtain the time-domain cross-correlation
4. Find the maximum correlation lag → time delay between channels
5. Map the time delay to an azimuth angle based on mic geometry

**Resolution**: A 64-sample max lag at 16 kHz spans time delays of ±4 ms; delay estimates are quantized to one sample (62.5 µs).

### VAD (Voice Activity Detection)

- Compute the RMS energy of each frame
- Compare against a threshold (default 0.02)
- Smooth over a 5-frame window to reduce spurious detections

## Dependencies

- `rclpy`
- `numpy`
- `scipy`
- `python3-sounddevice` (audio input)

## Build & Test

### Build Package

```bash
colcon build --packages-select saltybot_audio_direction
```

### Run Tests

```bash
pytest jetson/ros2_ws/src/saltybot_audio_direction/test/
```

## Integration with Multi-Person Tracker

The audio direction node publishes bearing to speakers, enabling the `saltybot_multi_person_tracker` to:

- Cross-validate visual detections with audio localization
- Prioritize targets based on audio activity (speaker attention model)
- Improve person tracking in low-light or occluded scenarios

### Future Enhancements

- **Self-noise filtering**: Spectral subtraction for motor/wheel noise
- **TDOA (Time Difference of Arrival)**: Use the quad-mic setup for improved angle precision
- **Elevation estimation**: With 4+ channels in a 3D array configuration
- **Multi-speaker tracking**: Simultaneous localization of multiple speakers
- **Adaptive beamforming**: MVDR or GSC methods for SNR improvement

## References

- Benesty, J., Sondhi, M. M., & Huang, Y. (2008). *Handbook of Speech Processing*. Springer.
- Knapp, C. H., & Carter, G. C. (1976). "The Generalized Correlation Method for Estimation of Time Delay". *IEEE Transactions on Acoustics, Speech, and Signal Processing*, 24(4), 320–327.

## License

MIT
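## Appendix: Algorithm Sketches

The GCC-PHAT steps in the Algorithm section (cross-spectrum → phase transform → inverse FFT → peak lag → azimuth) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch against a synthetic two-channel signal, not the package's actual implementation; the function name, the mic spacing `d`, and the test signal are assumptions.

```python
# Illustrative GCC-PHAT time-delay estimation between two mic channels.
# Not the package's real API; mic spacing below is an assumed value.
import numpy as np

def gcc_phat(sig, ref, max_lag=64):
    """Estimate the delay (in samples) of `sig` relative to `ref`."""
    n = sig.shape[0] + ref.shape[0]
    # Step 1: cross-spectrum in the frequency domain
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Step 2: phase transform -- discard magnitude, keep only phase
    R /= np.abs(R) + 1e-15
    # Step 3: inverse FFT back to a time-domain cross-correlation
    cc = np.fft.irfft(R, n=n)
    # Step 4: re-center so lags run from -max_lag .. +max_lag and take the peak
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(np.abs(cc))) - max_lag

# Synthetic check: a noise burst delayed by 5 samples
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
delay = gcc_phat(np.roll(ref, 5), ref)
print(delay)  # 5

# Step 5: map the delay to azimuth via tau = d*sin(theta)/c, where d is
# the mic spacing (illustrative value, not the Jabra's real geometry)
c, d = 343.0, 0.1
tau = delay / 16000.0
theta_deg = np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0)))
```

The phase transform is what makes the correlation peak sharp: by whitening the cross-spectrum, every frequency bin contributes equally to the delay estimate, which is why GCC-PHAT is robust to moderate reverberation.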
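The three-step VAD (RMS energy → threshold → 5-frame smoothing) can likewise be sketched as a small stateful class. This is an assumed illustration, not the node's code; the class name is hypothetical, while the threshold and window length follow the defaults in the Parameters table. Majority voting over the window is one plausible reading of "smooth over a 5-frame window".

```python
# Illustrative RMS-energy VAD with majority-vote smoothing.
# Class name is hypothetical; defaults mirror the Parameters table.
from collections import deque
import numpy as np

class SimpleVAD:
    def __init__(self, threshold=0.02, smooth_frames=5):
        self.threshold = threshold
        self.history = deque(maxlen=smooth_frames)  # recent per-frame decisions

    def update(self, frame):
        # RMS energy of the current audio frame
        rms = float(np.sqrt(np.mean(frame ** 2)))
        self.history.append(rms > self.threshold)
        # Declare speech only when a majority of recent frames were active,
        # suppressing single-frame spikes
        return sum(self.history) > len(self.history) // 2

vad = SimpleVAD()
quiet = np.zeros(2048)                                      # RMS = 0.0
loud = 0.1 * np.sin(2 * np.pi * 440 * np.arange(2048) / 16000)  # RMS ~ 0.07
print(vad.update(quiet))  # False
for _ in range(5):
    vad.update(loud)
print(vad.update(loud))   # True
```

In the node, `update()` would be called once per `chunk_size` frame and its result published on `/saltybot/audio_activity`.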