Implements GCC-PHAT beamforming for sound source localization via Jabra mic.

- GCC-PHAT cross-correlation for direction of arrival (DoA) estimation
- Voice activity detection (VAD) using RMS energy + smoothing
- Stereo/quadrophonic channel support (left/right/front/rear estimation)
- ROS2 publishers: /saltybot/audio_direction (Float32 bearing), /saltybot/audio_activity (Bool VAD)
- Configurable parameters: sample_rate, chunk_size, publish_hz, vad_threshold, gcc_phat_max_lag
- Integration-ready for multi-person tracker speaker tracking

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
saltybot_audio_direction
Audio direction estimator for sound source localization (Issue #430).
Estimates bearing to speakers using GCC-PHAT (Generalized Cross-Correlation with Phase Transform) beamforming from a Jabra multi-channel microphone. Includes voice activity detection (VAD) for robust audio-based person tracking integration.
Features
- GCC-PHAT Beamforming: Phase-domain cross-correlation for direction of arrival estimation
- Voice Activity Detection (VAD): RMS energy-based speech detection with smoothing
- Stereo/Quadrophonic Support: Handles Jabra 2-channel and 4-channel modes
- Robot Self-Noise Filtering: Optional suppression of motor/wheel noise (future enhancement)
- ROS2 Integration: Standard ROS2 topic publishing at configurable rates
Topics
Published
- /saltybot/audio_direction (std_msgs/Float32): Estimated bearing in degrees (0–360, where 0° = front, 90° = right, 180° = rear, 270° = left)
- /saltybot/audio_activity (std_msgs/Bool): Voice activity detected (true if speech-like energy)
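A consumer of the bearing topic may want a coarse label rather than a raw angle. The sketch below maps a bearing to the nearest of the four directions named above. The function name `bearing_to_sector` and the 90°-wide sectors centred on each axis are assumptions made for illustration; they are not part of the package.

```python
def bearing_to_sector(bearing_deg: float) -> str:
    """Map a bearing in degrees to the nearest cardinal sector.

    Uses the convention from the topic description:
    0 = front, 90 = right, 180 = rear, 270 = left.
    The 90-degree-wide sectors are an assumption of this example.
    """
    sectors = ("front", "right", "rear", "left")
    wrapped = bearing_deg % 360.0              # fold any angle into [0, 360)
    index = int((wrapped + 45.0) // 90.0) % 4  # shift by half a sector, then bin
    return sectors[index]
```

Negative or out-of-range inputs are folded into the 0–360 range first, so a bearing of -90 is treated the same as 270.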
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| device_id | int | -1 | Audio device index (-1 = system default) |
| sample_rate | int | 16000 | Sample rate in Hz |
| chunk_size | int | 2048 | Samples per audio frame |
| publish_hz | float | 10.0 | Output publication rate (Hz) |
| vad_threshold | float | 0.02 | RMS energy threshold for VAD |
| gcc_phat_max_lag | int | 64 | Max lag for correlation (determines angle resolution) |
| self_noise_filter | bool | true | Apply robot motor noise suppression |
Usage
Launch Node
ros2 launch saltybot_audio_direction audio_direction.launch.py
With Parameters
ros2 launch saltybot_audio_direction audio_direction.launch.py \
device_id:=0 \
publish_hz:=20.0 \
vad_threshold:=0.01
Using Config File
ros2 launch saltybot_audio_direction audio_direction.launch.py \
--ros-args --params-file config/audio_direction_params.yaml
Algorithm
GCC-PHAT
- Compute cross-spectrum of stereo/quad microphone pairs in frequency domain
- Normalize by magnitude (phase transform) to emphasize phase relationships
- Inverse FFT to time-domain cross-correlation
- Find maximum correlation lag → time delay between channels
- Map time delay to azimuth angle based on mic geometry
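The first four steps can be sketched with NumPy for a single microphone pair. This is a minimal illustration, not the node's actual code: the function name `gcc_phat`, the zero-padding length, and the small `eps` regulariser are assumptions.

```python
import numpy as np


def gcc_phat(sig, ref, max_lag=64, eps=1e-12):
    """Estimate how many samples `sig` is delayed relative to `ref`.

    A positive result means `sig` lags `ref`; a negative result means it leads.
    """
    sig = np.asarray(sig, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    n = sig.size + ref.size              # zero-pad to avoid circular wrap-around

    # Step 1: cross-spectrum in the frequency domain
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))

    # Step 2: phase transform, keep only the phase of each bin
    cross /= np.abs(cross) + eps

    # Step 3: back to the time domain
    cc = np.fft.irfft(cross, n=n)

    # Step 4: search lags -max_lag..+max_lag for the peak
    # (negative lags are stored at the end of the irfft output)
    window = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(window)) - max_lag
```

For a quad-channel input, the same function could be applied to each microphone pair in turn. Converting the returned lag to an angle depends on the array geometry.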
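Resolution: With a 64-sample max lag at 16 kHz, the correlation search covers delays of up to ±4 ms (64 / 16000 s). One sample of lag corresponds to 62.5 µs, so the time delay is resolved to the nearest sample unless the correlation peak is interpolated.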
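To make the delay-to-angle mapping concrete, the sketch below uses the standard far-field relation sin(θ) = c·τ / d for a single microphone pair. The 0.1 m microphone spacing is a placeholder, not the Jabra's real geometry, and the function names are hypothetical. The result is an angle relative to the pair's broadside in the range -90° to 90°. A single pair cannot tell front from rear, so mapping onto the full 0–360° bearing needs additional pairs.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at about 20 °C


def lag_to_angle_deg(lag_samples, sample_rate=16000, mic_spacing_m=0.1):
    """Convert an inter-channel lag to an angle from broadside, in degrees.

    mic_spacing_m is an assumed placeholder value, not a measured one.
    """
    tau = lag_samples / sample_rate                    # delay in seconds
    ratio = SPEED_OF_SOUND * tau / mic_spacing_m       # equals sin(theta)
    ratio = max(-1.0, min(1.0, ratio))                 # clamp non-physical lags
    return math.degrees(math.asin(ratio))


def max_physical_lag(sample_rate=16000, mic_spacing_m=0.1):
    """Largest lag (in samples) a real source can produce for this spacing."""
    return mic_spacing_m / SPEED_OF_SOUND * sample_rate
```

Under these assumed numbers, only lags of about ±4.7 samples are physically reachable, so the angular steps between adjacent integer lags are coarse. Interpolating the correlation peak or using a higher sample rate would refine them.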
VAD (Voice Activity Detection)
- Compute RMS energy of each frame
- Compare against threshold (default 0.02)
- Smooth over 5-frame window to reduce spurious detections
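A minimal sketch of these three steps follows. The README does not say how the 5-frame smoothing is done, so the majority vote used here is an assumption, as are the class name `RmsVad` and the expectation that samples are floats normalised to [-1, 1].

```python
from collections import deque

import numpy as np


class RmsVad:
    """RMS-energy voice activity detector with majority-vote smoothing."""

    def __init__(self, threshold=0.02, window=5):
        self.threshold = threshold
        self._history = deque(maxlen=window)   # most recent per-frame decisions

    def update(self, frame) -> bool:
        """Feed one audio frame and return the smoothed activity flag."""
        frame = np.asarray(frame, dtype=np.float64)
        rms = float(np.sqrt(np.mean(frame ** 2)))
        self._history.append(rms > self.threshold)
        # Active only if more than half of the remembered frames were loud
        return sum(self._history) * 2 > len(self._history)
```

With a majority vote, a single loud frame surrounded by silence does not flip the output, at the cost of a delay of a few frames before speech onset is reported.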
Dependencies
- rclpy
- numpy
- scipy
- python3-sounddevice (audio input)
Build & Test
Build Package
colcon build --packages-select saltybot_audio_direction
Run Tests
pytest jetson/ros2_ws/src/saltybot_audio_direction/test/
Integration with Multi-Person Tracker
The audio direction node publishes the bearing toward the active speaker, enabling the saltybot_multi_person_tracker to:
- Cross-validate visual detections with audio localization
- Prioritize targets based on audio activity (speaker attention model)
- Improve person tracking in low-light or occluded scenarios
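As an illustration of the cross-validation idea, the sketch below checks whether an audio bearing agrees with the bearing of a visually detected person. The tracker's real interface is not described here, so both function names and the 20° tolerance are hypothetical. The point of the example is the wrap-around handling near 0°/360°.

```python
def angular_difference_deg(a: float, b: float) -> float:
    """Smallest signed difference a - b in degrees, in [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def audio_matches_visual(audio_bearing_deg, visual_bearing_deg,
                         voice_active, tolerance_deg=20.0):
    """Return True if an active audio source lies near a visual detection.

    tolerance_deg is an assumed value and would need tuning on the robot.
    """
    if not voice_active:
        return False   # ignore the bearing when no speech is detected
    diff = angular_difference_deg(audio_bearing_deg, visual_bearing_deg)
    return abs(diff) <= tolerance_deg
```

Gating on the activity flag keeps stale or noise-driven bearings from being matched to a person when nobody is talking.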
Future Enhancements
- Self-noise filtering: Spectral subtraction for motor/wheel noise
- TDOA (Time Difference of Arrival): Use quad-mic setup for improved angle precision
- Elevation estimation: With 4+ channels in 3D array configuration
- Multi-speaker tracking: Simultaneous localization of multiple speakers
- Adaptive beamforming: MVDR or GSC methods for SNR improvement
References
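- Benesty, J., Sondhi, M. M., Huang, Y. (Eds.) (2008). "Springer Handbook of Speech Processing." Springer.
- Knapp, C. H., Carter, G. C. (1976). "The Generalized Correlation Method for Estimation of Time Delay." IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320–327.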
License
MIT