sl-perception d7b1366d6c feat: Add audio direction estimator (Issue #430)
Implements GCC-PHAT beamforming for sound source localization via Jabra mic.
- GCC-PHAT cross-correlation for direction of arrival (DoA) estimation
- Voice activity detection (VAD) using RMS energy + smoothing
- Stereo/quadraphonic channel support (left/right/front/rear estimation)
- ROS2 publishers: /saltybot/audio_direction (Float32 bearing), /saltybot/audio_activity (Bool VAD)
- Configurable parameters: sample_rate, chunk_size, publish_hz, vad_threshold, gcc_phat_max_lag
- Integration-ready for multi-person tracker speaker tracking

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-05 08:53:43 -05:00

saltybot_audio_direction

Audio direction estimator for sound source localization (Issue #430).

Estimates bearing to speakers using GCC-PHAT (Generalized Cross-Correlation with Phase Transform) time-delay estimation across the channels of a Jabra multi-channel microphone. Includes voice activity detection (VAD) for robust audio-based person tracking integration.

Features

  • GCC-PHAT Localization: Phase-transform cross-correlation for direction-of-arrival estimation
  • Voice Activity Detection (VAD): RMS energy-based speech detection with smoothing
  • Stereo/Quadraphonic Support: Handles Jabra 2-channel and 4-channel modes
  • Robot Self-Noise Filtering: Optional suppression of motor/wheel noise (future enhancement)
  • ROS2 Integration: Standard ROS2 topic publishing at configurable rates

Topics

Published

  • /saltybot/audio_direction (std_msgs/Float32) Estimated bearing in degrees (0–360, where 0° = front, 90° = right, 180° = rear, 270° = left)

  • /saltybot/audio_activity (std_msgs/Bool) Voice activity detected (true if speech-like energy)
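For consumers of /saltybot/audio_direction, a small sketch converting the clockwise bearing convention above into a planar unit vector. The REP 103 body frame (x forward, y left, CCW-positive yaw) and the function name are assumptions of this sketch, not part of the package:

```python
import math

def bearing_to_vector(bearing_deg: float) -> tuple[float, float]:
    """Convert the published bearing (0° = front, 90° = right, i.e.
    clockwise positive) into a unit vector in an assumed REP 103 body
    frame (x forward, y left)."""
    yaw = -math.radians(bearing_deg)  # clockwise bearing -> CCW-positive yaw
    return (math.cos(yaw), math.sin(yaw))
```

With this convention a bearing of 90° (right) maps to (0, -1), since +y points left in REP 103.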

Parameters

Parameter          Type   Default  Description
device_id          int    -1       Audio device index (-1 = system default)
sample_rate        int    16000    Sample rate in Hz
chunk_size         int    2048     Samples per audio frame
publish_hz         float  10.0     Output publication rate (Hz)
vad_threshold      float  0.02     RMS energy threshold for VAD
gcc_phat_max_lag   int    64       Max correlation lag in samples (bounds angle resolution)
self_noise_filter  bool   true     Apply robot motor noise suppression

Usage

Launch Node

ros2 launch saltybot_audio_direction audio_direction.launch.py

With Parameters

ros2 launch saltybot_audio_direction audio_direction.launch.py \
  device_id:=0 \
  publish_hz:=20.0 \
  vad_threshold:=0.01

Using Config File

ros2 launch saltybot_audio_direction audio_direction.launch.py \
  params_file:=config/audio_direction_params.yaml

Note: --ros-args --params-file is a ros2 run option; ros2 launch takes launch arguments instead (this assumes the launch file declares a params_file argument).
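A sample config/audio_direction_params.yaml matching the parameter table above. The node name the parameters nest under is an assumption here:

```yaml
audio_direction:          # assumed node name
  ros__parameters:
    device_id: -1
    sample_rate: 16000
    chunk_size: 2048
    publish_hz: 10.0
    vad_threshold: 0.02
    gcc_phat_max_lag: 64
    self_noise_filter: true
```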

Algorithm

GCC-PHAT

  1. Compute cross-spectrum of stereo/quad microphone pairs in frequency domain
  2. Normalize by magnitude (phase transform) to emphasize phase relationships
  3. Inverse FFT to time-domain cross-correlation
  4. Find maximum correlation lag → time delay between channels
  5. Map time delay to azimuth angle based on mic geometry

Resolution: A 64-sample max lag at 16 kHz bounds the measurable inter-channel delay to ±4 ms; delay estimates are quantized to one sample (62.5 µs), which sets the achievable angular resolution for a given mic spacing.
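The five steps above can be sketched with NumPy. The function names, the 0.1 m mic spacing, and the single-pair ±90° mapping are illustrative assumptions (the full 0–360° estimate would combine multiple mic pairs), not the package's actual implementation:

```python
import numpy as np

def gcc_phat(sig, ref, max_lag=64):
    """Return the delay (in samples) of `sig` relative to `ref`."""
    n = sig.size + ref.size  # zero-pad to avoid circular-correlation wraparound
    # 1. Cross-spectrum of the channel pair in the frequency domain
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # 2. Phase transform: divide out the magnitude, keeping only phase
    R /= np.abs(R) + 1e-12
    # 3. Inverse FFT back to a time-domain cross-correlation
    cc = np.fft.irfft(R, n=n)
    # 4. Peak search restricted to lags in [-max_lag, +max_lag]
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_to_bearing(delay_samples, fs=16000, mic_spacing=0.1, c=343.0):
    """5. Map the inter-channel delay to azimuth for one mic pair.
    mic_spacing (meters) is an assumed value, not the Jabra's geometry."""
    tau = delay_samples / fs
    sin_theta = np.clip(tau * c / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))
```

The phase transform whitens the cross-spectrum, so the correlation peak stays sharp under reverberation at the cost of discarding amplitude information.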

VAD (Voice Activity Detection)

  • Compute RMS energy of each frame
  • Compare against threshold (default 0.02)
  • Smooth over 5-frame window to reduce spurious detections
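A minimal sketch of this VAD. The 5-frame window and 0.02 threshold follow the text above; the class name and the majority-vote smoothing rule are assumptions, since the README does not specify the exact smoothing:

```python
from collections import deque
import numpy as np

class EnergyVAD:
    """RMS-energy voice activity detector with frame smoothing."""

    def __init__(self, threshold=0.02, window=5):
        self.threshold = threshold
        self.history = deque(maxlen=window)  # recent per-frame decisions

    def update(self, frame):
        """Return True if a majority of recent frames exceed the threshold."""
        rms = float(np.sqrt(np.mean(np.square(frame, dtype=np.float64))))
        self.history.append(rms > self.threshold)
        # Majority vote over the window suppresses spurious single-frame blips
        return sum(self.history) > len(self.history) / 2
```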

Dependencies

  • rclpy
  • numpy
  • scipy
  • python3-sounddevice (audio input)

Build & Test

Build Package

colcon build --packages-select saltybot_audio_direction

Run Tests

pytest jetson/ros2_ws/src/saltybot_audio_direction/test/

Integration with Multi-Person Tracker

The audio direction node publishes bearing to speakers, enabling the saltybot_multi_person_tracker to:

  • Cross-validate visual detections with audio localization
  • Prioritize targets based on audio activity (speaker attention model)
  • Improve person tracking in low-light or occluded scenarios

Future Enhancements

  • Self-noise filtering: Spectral subtraction for motor/wheel noise
  • Multi-pair TDOA fusion: Combine time delays from all quad-mic pairs for improved angle precision
  • Elevation estimation: With 4+ channels in 3D array configuration
  • Multi-speaker tracking: Simultaneous localization of multiple speakers
  • Adaptive beamforming: MVDR or GSC methods for SNR improvement

References

  • Benesty, J., Sondhi, M. M., Huang, Y. (Eds.) (2008). Springer Handbook of Speech Processing. Springer.
  • Knapp, C. H., Carter, G. C. (1976). "The Generalized Correlation Method for Estimation of Time Delay". IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320–327.

License

MIT