# saltybot_audio_direction

Audio direction estimator for sound source localization (Issue #430). Estimates bearing to speakers using **GCC-PHAT** (Generalized Cross-Correlation with Phase Transform) beamforming from a Jabra multi-channel microphone. Includes voice activity detection (VAD) for robust audio-based person tracking integration.

## Features

- **GCC-PHAT Beamforming**: Phase-domain cross-correlation for direction-of-arrival estimation
- **Voice Activity Detection (VAD)**: RMS energy-based speech detection with smoothing
- **Stereo/Quadraphonic Support**: Handles Jabra 2-channel and 4-channel modes
- **Robot Self-Noise Filtering**: Optional suppression of motor/wheel noise (future enhancement)
- **ROS2 Integration**: Standard ROS2 topic publishing at configurable rates

## Topics

### Published

- **`/saltybot/audio_direction`** (`std_msgs/Float32`)
  Estimated bearing in degrees (0–360, where 0° = front, 90° = right, 180° = rear, 270° = left)
- **`/saltybot/audio_activity`** (`std_msgs/Bool`)
  Voice activity detected (true if speech-like energy)

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `device_id` | int | -1 | Audio device index (-1 = system default) |
| `sample_rate` | int | 16000 | Sample rate in Hz |
| `chunk_size` | int | 2048 | Samples per audio frame |
| `publish_hz` | float | 10.0 | Output publication rate (Hz) |
| `vad_threshold` | float | 0.02 | RMS energy threshold for VAD |
| `gcc_phat_max_lag` | int | 64 | Max lag for correlation (determines angle resolution) |
| `self_noise_filter` | bool | true | Apply robot motor noise suppression |

## Usage

### Launch Node

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py
```

### With Parameters

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py \
  device_id:=0 \
  publish_hz:=20.0 \
  vad_threshold:=0.01
```

### Using Config File

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py \
  --ros-args \
  --params-file config/audio_direction_params.yaml
```

## Algorithm

### GCC-PHAT

1. Compute the cross-spectrum of stereo/quad microphone pairs in the frequency domain
2. Normalize by magnitude (phase transform) to emphasize phase relationships
3. Inverse FFT to obtain the time-domain cross-correlation
4. Find the maximum correlation lag → time delay between channels
5. Map the time delay to an azimuth angle based on mic geometry

**Resolution**: A 64-sample max lag at 16 kHz spans time delays of ±4 ms; delay estimates are quantized to one sample (62.5 µs).

### VAD (Voice Activity Detection)

- Compute the RMS energy of each frame
- Compare against a threshold (default 0.02)
- Smooth over a 5-frame window to reduce spurious detections

## Dependencies

- `rclpy`
- `numpy`
- `scipy`
- `python3-sounddevice` (audio input)

## Build & Test

### Build Package

```bash
colcon build --packages-select saltybot_audio_direction
```

### Run Tests

```bash
pytest jetson/ros2_ws/src/saltybot_audio_direction/test/
```

## Integration with Multi-Person Tracker

The audio direction node publishes bearing to speakers, enabling the `saltybot_multi_person_tracker` to:

- Cross-validate visual detections with audio localization
- Prioritize targets based on audio activity (speaker attention model)
- Improve person tracking in low-light or occluded scenarios

### Future Enhancements

- **Self-noise filtering**: Spectral subtraction for motor/wheel noise
- **TDOA (Time Difference of Arrival)**: Use the quad-mic setup for improved angle precision
- **Elevation estimation**: With 4+ channels in a 3D array configuration
- **Multi-speaker tracking**: Simultaneous localization of multiple speakers
- **Adaptive beamforming**: MVDR or GSC methods for SNR improvement

## References

- Benesty, J., Sondhi, M. M., & Huang, Y. (2008). *Handbook of Speech Processing*. Springer.
- Knapp, C. H., & Carter, G. C. (1976). "The Generalized Correlation Method for Estimation of Time Delay". *IEEE Transactions on Acoustics, Speech, and Signal Processing*, 24(4), 320–327.

## License

MIT
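## Appendix: Algorithm Sketches

The GCC-PHAT steps in the Algorithm section (cross-spectrum → phase transform → inverse FFT → peak lag → azimuth) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch against a synthetic two-channel signal, not the package's actual implementation; the function name, the mic spacing `d`, and the test signal are assumptions.

```python
# Illustrative GCC-PHAT time-delay estimation between two mic channels.
# Not the package's real API; mic spacing below is an assumed value.
import numpy as np

def gcc_phat(sig, ref, max_lag=64):
    """Estimate the delay (in samples) of `sig` relative to `ref`."""
    n = sig.shape[0] + ref.shape[0]
    # Step 1: cross-spectrum in the frequency domain
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # Step 2: phase transform -- discard magnitude, keep only phase
    R /= np.abs(R) + 1e-15
    # Step 3: inverse FFT back to a time-domain cross-correlation
    cc = np.fft.irfft(R, n=n)
    # Step 4: re-center so lags run from -max_lag .. +max_lag and take the peak
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(np.abs(cc))) - max_lag

# Synthetic check: a noise burst delayed by 5 samples
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
delay = gcc_phat(np.roll(ref, 5), ref)
print(delay)  # 5

# Step 5: map the delay to azimuth via tau = d*sin(theta)/c, where d is
# the mic spacing (illustrative value, not the Jabra's real geometry)
c, d = 343.0, 0.1
tau = delay / 16000.0
theta_deg = np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0)))
```

The phase transform is what makes the correlation peak sharp: by whitening the cross-spectrum, every frequency bin contributes equally to the delay estimate, which is why GCC-PHAT is robust to moderate reverberation.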
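The three-step VAD (RMS energy → threshold → 5-frame smoothing) can likewise be sketched as a small stateful class. This is an assumed illustration, not the node's code; the class name is hypothetical, while the threshold and window length follow the defaults in the Parameters table. Majority voting over the window is one plausible reading of "smooth over a 5-frame window".

```python
# Illustrative RMS-energy VAD with majority-vote smoothing.
# Class name is hypothetical; defaults mirror the Parameters table.
from collections import deque
import numpy as np

class SimpleVAD:
    def __init__(self, threshold=0.02, smooth_frames=5):
        self.threshold = threshold
        self.history = deque(maxlen=smooth_frames)  # recent per-frame decisions

    def update(self, frame):
        # RMS energy of the current audio frame
        rms = float(np.sqrt(np.mean(frame ** 2)))
        self.history.append(rms > self.threshold)
        # Declare speech only when a majority of recent frames were active,
        # suppressing single-frame spikes
        return sum(self.history) > len(self.history) // 2

vad = SimpleVAD()
quiet = np.zeros(2048)                                      # RMS = 0.0
loud = 0.1 * np.sin(2 * np.pi * 440 * np.arange(2048) / 16000)  # RMS ~ 0.07
print(vad.update(quiet))  # False
for _ in range(5):
    vad.update(loud)
print(vad.update(loud))   # True
```

In the node, `update()` would be called once per `chunk_size` frame and its result published on `/saltybot/audio_activity`.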