feat: audio direction estimator (Issue #430) #434
jetson/ros2_ws/src/saltybot_audio_direction/README.md (new file)
# saltybot_audio_direction

Audio direction estimator for sound source localization (Issue #430).

Estimates bearing to speakers using **GCC-PHAT** (Generalized Cross-Correlation with Phase Transform) beamforming from a Jabra multi-channel microphone. Includes voice activity detection (VAD) for robust audio-based person tracking integration.

## Features

- **GCC-PHAT Beamforming**: Phase-domain cross-correlation for direction-of-arrival estimation
- **Voice Activity Detection (VAD)**: RMS energy-based speech detection with smoothing
- **Stereo/Quadraphonic Support**: Handles Jabra 2-channel and 4-channel modes
- **Robot Self-Noise Filtering**: Optional suppression of motor/wheel noise (future enhancement)
- **ROS2 Integration**: Standard ROS2 topic publishing at configurable rates

## Topics

### Published

- **`/saltybot/audio_direction`** (`std_msgs/Float32`)

  Estimated bearing in degrees (0–360, where 0° = front, 90° = right, 180° = rear, 270° = left)

- **`/saltybot/audio_activity`** (`std_msgs/Bool`)

  Voice activity detected (true if speech-like energy)

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `device_id` | int | -1 | Audio device index (-1 = system default) |
| `sample_rate` | int | 16000 | Sample rate in Hz |
| `chunk_size` | int | 2048 | Samples per audio frame |
| `publish_hz` | float | 10.0 | Output publication rate (Hz) |
| `vad_threshold` | float | 0.02 | RMS energy threshold for VAD |
| `gcc_phat_max_lag` | int | 64 | Max lag for correlation (determines angle resolution) |
| `self_noise_filter` | bool | true | Apply robot motor noise suppression |

## Usage

### Launch Node

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py
```

### With Parameters

```bash
ros2 launch saltybot_audio_direction audio_direction.launch.py \
    device_id:=0 \
    publish_hz:=20.0 \
    vad_threshold:=0.01
```

### Using Config File

`--ros-args --params-file` applies to a node, not to `ros2 launch`, so run the executable directly to load the YAML file:

```bash
ros2 run saltybot_audio_direction audio_direction_node \
    --ros-args --params-file config/audio_direction_params.yaml
```

## Algorithm

### GCC-PHAT

1. Compute cross-spectrum of stereo/quad microphone pairs in the frequency domain
2. Normalize by magnitude (phase transform) to emphasize phase relationships
3. Inverse FFT to obtain the time-domain cross-correlation
4. Find the maximum-correlation lag → time delay between channels
5. Map the time delay to an azimuth angle based on mic geometry
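Steps 1–4 can be sketched in a few lines of NumPy. This is a standalone illustration, not the node's exact implementation; the signal lengths, padding, and `gcc_phat_delay` name are assumptions for the example:

```python
import numpy as np

def gcc_phat_delay(sig1, sig2, max_lag=64):
    """Return the estimated delay of sig1 relative to sig2, in samples."""
    n = len(sig1) + len(sig2)                  # zero-pad to avoid circular wrap
    cross = np.fft.rfft(sig1, n=n) * np.conj(np.fft.rfft(sig2, n=n))
    cross /= np.abs(cross) + 1e-8              # phase transform: keep phase only
    cc = np.fft.irfft(cross, n=n)
    # Reorder so index i corresponds to lag i - max_lag, then pick the peak
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(np.abs(cc))) - max_lag

rng = np.random.default_rng(0)
base = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(10), base[:-10]))  # base delayed by 10 samples
print(gcc_phat_delay(delayed, base))  # 10
```

Broadband noise is used deliberately: a pure tone delayed by a whole number of periods would be indistinguishable from the undelayed signal.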

**Resolution**: A 64-sample max lag at 16 kHz covers inter-channel delays up to ±4 ms, with time-delay granularity of one sample (62.5 µs).
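For step 5, a common far-field model relates the delay to azimuth via τ = d·sin(θ)/c. The 10 cm mic spacing below is an assumed value for illustration, not the Jabra's actual geometry:

```python
import numpy as np

def lag_to_azimuth(lag_samples, sample_rate=16000, mic_spacing_m=0.10, c=343.0):
    """Far-field delay-to-angle mapping for one mic pair: tau = d*sin(theta)/c."""
    tau = lag_samples / sample_rate                  # delay in seconds
    s = np.clip(c * tau / mic_spacing_m, -1.0, 1.0)  # clamp to a valid sine
    return float(np.degrees(np.arcsin(s)))

print(lag_to_azimuth(0))             # 0.0 (source dead ahead)
print(round(lag_to_azimuth(1), 1))   # 12.4 (one-sample lag at this spacing)
```

This also shows why the one-sample lag granularity translates into roughly 12° angular steps near broadside for a 10 cm pair; wider spacing gives finer angles.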

### VAD (Voice Activity Detection)

- Compute the RMS energy of each frame
- Compare against a threshold (default 0.02)
- Smooth over a 5-frame window to reduce spurious detections

## Dependencies

- `rclpy`
- `numpy`
- `scipy`
- `python3-sounddevice` (audio input)

## Build & Test

### Build Package

```bash
colcon build --packages-select saltybot_audio_direction
```

### Run Tests

```bash
pytest jetson/ros2_ws/src/saltybot_audio_direction/test/
```

## Integration with Multi-Person Tracker

The audio direction node publishes bearing to speakers, enabling the `saltybot_multi_person_tracker` to:

- Cross-validate visual detections with audio localization
- Prioritize targets based on audio activity (speaker attention model)
- Improve person tracking in low-light or occluded scenarios
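At its simplest, the cross-validation in the first bullet reduces to a circular angular-distance check. `bearings_agree` and its 30° tolerance are illustrative, not the tracker's actual API:

```python
def bearings_agree(visual_deg: float, audio_deg: float, tol_deg: float = 30.0) -> bool:
    """True if two bearings (degrees, 0-360) are within tol_deg on the circle."""
    diff = abs(visual_deg - audio_deg) % 360.0
    return min(diff, 360.0 - diff) <= tol_deg   # shortest way around the circle

print(bearings_agree(350.0, 10.0))   # True: only 20 degrees apart across the 0 wrap
print(bearings_agree(90.0, 270.0))   # False: opposite sides
```

Taking the minimum of the two arc lengths handles the 360°→0° wraparound that a plain subtraction would get wrong.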

### Future Enhancements

- **Self-noise filtering**: Spectral subtraction for motor/wheel noise
- **TDOA (Time Difference of Arrival)**: Use the quad-mic setup for improved angle precision
- **Elevation estimation**: With 4+ channels in a 3D array configuration
- **Multi-speaker tracking**: Simultaneous localization of multiple speakers
- **Adaptive beamforming**: MVDR or GSC methods for SNR improvement

## References

- Benesty, J., Sondhi, M., Huang, Y. (2008). *Springer Handbook of Speech Processing*.
- Knapp, C., Carter, G. (1976). "The Generalized Correlation Method for Estimation of Time Delay".

## License

MIT
jetson/ros2_ws/src/saltybot_audio_direction/config/audio_direction_params.yaml (new file)
# Audio direction estimator ROS2 parameters
# Load with: ros2 run saltybot_audio_direction audio_direction_node --ros-args --params-file config/audio_direction_params.yaml

/**:
  ros__parameters:
    # Audio input
    device_id: -1            # -1 = default device (Jabra)
    sample_rate: 16000       # Hz
    chunk_size: 2048         # samples per frame

    # Processing
    gcc_phat_max_lag: 64     # samples (determines angular resolution)
    vad_threshold: 0.02      # RMS energy threshold for speech
    self_noise_filter: true  # Filter robot motor/wheel noise

    # Output
    publish_hz: 10.0         # Publication rate (Hz)
jetson/ros2_ws/src/saltybot_audio_direction/launch/audio_direction.launch.py (new file)
"""
Launch audio direction estimator node.

Typical usage:
    ros2 launch saltybot_audio_direction audio_direction.launch.py
"""

from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument
from launch.substitutions import LaunchConfiguration
from launch_ros.actions import Node


def generate_launch_description():
    """Generate launch description for audio direction node."""
    # Declare launch arguments
    device_id_arg = DeclareLaunchArgument(
        'device_id',
        default_value='-1',
        description='Audio device index (-1 for default)',
    )
    sample_rate_arg = DeclareLaunchArgument(
        'sample_rate',
        default_value='16000',
        description='Sample rate in Hz',
    )
    chunk_size_arg = DeclareLaunchArgument(
        'chunk_size',
        default_value='2048',
        description='Samples per audio frame',
    )
    publish_hz_arg = DeclareLaunchArgument(
        'publish_hz',
        default_value='10.0',
        description='Publication rate in Hz',
    )
    vad_threshold_arg = DeclareLaunchArgument(
        'vad_threshold',
        default_value='0.02',
        description='RMS energy threshold for voice activity detection',
    )
    gcc_max_lag_arg = DeclareLaunchArgument(
        'gcc_phat_max_lag',
        default_value='64',
        description='Max lag for GCC-PHAT correlation',
    )
    self_noise_filter_arg = DeclareLaunchArgument(
        'self_noise_filter',
        default_value='true',
        description='Apply robot self-noise suppression',
    )

    # Audio direction node
    audio_direction_node = Node(
        package='saltybot_audio_direction',
        executable='audio_direction_node',
        name='audio_direction_estimator',
        output='screen',
        parameters=[
            {'device_id': LaunchConfiguration('device_id')},
            {'sample_rate': LaunchConfiguration('sample_rate')},
            {'chunk_size': LaunchConfiguration('chunk_size')},
            {'publish_hz': LaunchConfiguration('publish_hz')},
            {'vad_threshold': LaunchConfiguration('vad_threshold')},
            # Must reference the declared argument name 'gcc_phat_max_lag'
            {'gcc_phat_max_lag': LaunchConfiguration('gcc_phat_max_lag')},
            {'self_noise_filter': LaunchConfiguration('self_noise_filter')},
        ],
    )

    return LaunchDescription(
        [
            device_id_arg,
            sample_rate_arg,
            chunk_size_arg,
            publish_hz_arg,
            vad_threshold_arg,
            gcc_max_lag_arg,
            self_noise_filter_arg,
            audio_direction_node,
        ]
    )
jetson/ros2_ws/src/saltybot_audio_direction/package.xml (new file)
<?xml version="1.0"?>
<?xml-model href="http://download.ros.org/schema/package_format3.xsd" schematypens="http://www.w3.org/2001/XMLSchema"?>
<package format="3">
  <name>saltybot_audio_direction</name>
  <version>0.1.0</version>
  <description>
    Audio direction estimator for sound source localization via Jabra microphone.
    Implements GCC-PHAT beamforming for direction of arrival (DoA) estimation.
    Publishes bearing (degrees) and voice activity detection (VAD) for speaker tracking integration.
    Issue #430.
  </description>
  <maintainer email="sl-perception@saltylab.local">sl-perception</maintainer>
  <license>MIT</license>

  <buildtool_depend>ament_python</buildtool_depend>

  <depend>rclpy</depend>
  <depend>std_msgs</depend>
  <depend>sensor_msgs</depend>
  <depend>geometry_msgs</depend>

  <exec_depend>python3-numpy</exec_depend>
  <exec_depend>python3-scipy</exec_depend>
  <exec_depend>python3-sounddevice</exec_depend>

  <test_depend>python3-pytest</test_depend>

  <export>
    <build_type>ament_python</build_type>
  </export>
</package>
jetson/ros2_ws/src/saltybot_audio_direction/saltybot_audio_direction/audio_direction_node.py (new file)
"""
audio_direction_node.py — Sound source localization via GCC-PHAT beamforming.

Estimates direction of arrival (DoA) from a Jabra multi-channel microphone
using Generalized Cross-Correlation with Phase Transform (GCC-PHAT).

Publishes:
    /saltybot/audio_direction   std_msgs/Float32   bearing in degrees (0-360)
    /saltybot/audio_activity    std_msgs/Bool      voice activity detected

Parameters:
    device_id          int    -1     audio device index (-1 = default)
    sample_rate        int    16000  sample rate in Hz
    chunk_size         int    2048   samples per frame
    publish_hz         float  10.0   output publication rate
    vad_threshold      float  0.02   RMS energy threshold for speech
    gcc_phat_max_lag   int    64     max lag for correlation (determines angular resolution)
    self_noise_filter  bool   true   apply robot noise suppression
"""

from __future__ import annotations

import threading
from collections import deque

import numpy as np
import sounddevice as sd

import rclpy
from rclpy.node import Node
from rclpy.qos import HistoryPolicy, QoSProfile, ReliabilityPolicy

from std_msgs.msg import Float32, Bool


_SENSOR_QOS = QoSProfile(
    reliability=ReliabilityPolicy.BEST_EFFORT,
    history=HistoryPolicy.KEEP_LAST,
    depth=5,
)

class GCCPHATBeamformer:
    """Generalized Cross-Correlation with Phase Transform for DoA estimation."""

    def __init__(self, sample_rate: int, max_lag: int = 64):
        self.sample_rate = sample_rate
        self.max_lag = max_lag
        self.frequency_bins = max_lag * 2
        # Coarse azimuths for left (270°), front (0°), right (90°), rear (180°)
        self.angles = np.array([270.0, 0.0, 90.0, 180.0])

    def gcc_phat(self, sig1: np.ndarray, sig2: np.ndarray) -> float:
        """
        Compute GCC-PHAT correlation and estimate time delay (DoA proxy).

        Returns:
            Estimated time delay in samples (can be converted to angle)
        """
        if len(sig1) == 0 or len(sig2) == 0:
            return 0.0

        # Cross-correlation in frequency domain
        fft1 = np.fft.rfft(sig1, n=self.frequency_bins)
        fft2 = np.fft.rfft(sig2, n=self.frequency_bins)

        # Normalize magnitude (phase transform)
        cross_spectrum = fft1 * np.conj(fft2)
        cross_spectrum_normalized = cross_spectrum / (np.abs(cross_spectrum) + 1e-8)

        # Inverse FFT to get correlation; negative lags wrap to the end of the array
        correlation = np.fft.irfft(cross_spectrum_normalized, n=self.frequency_bins)

        # Reorder so index i corresponds to lag i - max_lag, then pick the peak
        correlation = np.concatenate(
            (correlation[-self.max_lag:], correlation[: self.max_lag])
        )
        estimated_lag = int(np.argmax(np.abs(correlation))) - self.max_lag

        return float(estimated_lag)

    def estimate_bearing(self, channels: list[np.ndarray]) -> float:
        """
        Estimate bearing from multi-channel audio.

        Supports:
        - 2-channel (stereo): left/right discrimination
        - 4-channel: quadraphonic (front/rear/left/right)

        Returns:
            Bearing in degrees (0-360)
        """
        if len(channels) < 2:
            return 0.0  # Default to front if mono

        if len(channels) == 2:
            # Stereo: estimate left vs right
            lag = self.gcc_phat(channels[0], channels[1])
            # Negative lag → left (270°), positive lag → right (90°)
            if abs(lag) < 5:
                return 0.0  # Front
            elif lag < 0:
                return 270.0  # Left
            else:
                return 90.0  # Right

        elif len(channels) >= 4:
            # Quadraphonic: compute pairwise correlations
            # Assume channel order: [front, right, rear, left]
            lags = []
            for i in range(4):
                lag = self.gcc_phat(channels[i], channels[(i + 1) % 4])
                lags.append(lag)

            # Select angle based on strongest correlation
            max_idx = int(np.argmax(np.abs(lags)))
            return float(self.angles[max_idx])

        return 0.0


class VADDetector:
    """Simple voice activity detection based on RMS energy."""

    def __init__(self, threshold: float = 0.02, smoothing: int = 5):
        self.threshold = threshold
        self.smoothing = smoothing
        self.history = deque(maxlen=smoothing)

    def detect(self, audio_frame: np.ndarray) -> bool:
        """
        Detect voice activity in audio frame.

        Returns:
            True if speech-like energy detected
        """
        rms = np.sqrt(np.mean(audio_frame**2))
        self.history.append(rms > self.threshold)
        # Majority voting over window
        return sum(self.history) > self.smoothing // 2


class AudioDirectionNode(Node):
    """Estimates audio direction of arrival and publishes bearing + VAD."""

    def __init__(self):
        super().__init__('audio_direction_estimator')

        # Parameters
        self.declare_parameter('device_id', -1)
        self.declare_parameter('sample_rate', 16000)
        self.declare_parameter('chunk_size', 2048)
        self.declare_parameter('publish_hz', 10.0)
        self.declare_parameter('vad_threshold', 0.02)
        self.declare_parameter('gcc_phat_max_lag', 64)
        self.declare_parameter('self_noise_filter', True)

        self.device_id = self.get_parameter('device_id').value
        self.sample_rate = self.get_parameter('sample_rate').value
        self.chunk_size = self.get_parameter('chunk_size').value
        pub_hz = self.get_parameter('publish_hz').value
        vad_threshold = self.get_parameter('vad_threshold').value
        gcc_max_lag = self.get_parameter('gcc_phat_max_lag').value
        self.apply_noise_filter = self.get_parameter('self_noise_filter').value

        # Publishers (the QoS profile replaces the queue-depth argument;
        # passing both raises a TypeError in rclpy)
        self._pub_bearing = self.create_publisher(
            Float32, '/saltybot/audio_direction', _SENSOR_QOS
        )
        self._pub_vad = self.create_publisher(
            Bool, '/saltybot/audio_activity', _SENSOR_QOS
        )

        # Audio processing
        self.beamformer = GCCPHATBeamformer(self.sample_rate, max_lag=gcc_max_lag)
        self.vad = VADDetector(threshold=vad_threshold)
        self._audio_buffer = deque(maxlen=self.chunk_size * 2)
        self._lock = threading.Lock()

        # Start audio stream
        self._stream = None
        self._running = True
        self._start_audio_stream()

        # Publish timer
        self.create_timer(1.0 / pub_hz, self._tick)

        self.get_logger().info(
            f'audio_direction_estimator ready — '
            f'device_id={self.device_id} sample_rate={self.sample_rate} Hz '
            f'chunk_size={self.chunk_size} publish_hz={pub_hz}'
        )

    def _start_audio_stream(self) -> None:
        """Initialize audio stream from default microphone."""
        try:
            self._stream = sd.InputStream(
                device=self.device_id if self.device_id >= 0 else None,
                samplerate=self.sample_rate,
                channels=2,  # Default to stereo; auto-detect Jabra channels
                blocksize=self.chunk_size,
                callback=self._audio_callback,
                latency='low',
            )
            self._stream.start()
            self.get_logger().info('Audio stream started')
        except Exception as e:
            self.get_logger().error(f'Failed to start audio stream: {e}')
            self._stream = None

    def _audio_callback(self, indata: np.ndarray, frames: int, time_info, status) -> None:
        """Callback for audio input stream."""
        if status:
            self.get_logger().warn(f'Audio callback status: {status}')

        try:
            with self._lock:
                # indata is (frames, channels); flattening keeps samples interleaved
                self._audio_buffer.extend(indata.flatten())
        except Exception as e:
            self.get_logger().error(f'Audio callback error: {e}')

    def _tick(self) -> None:
        """Publish audio direction and VAD at configured rate."""
        if self._stream is None or not self._running:
            return

        with self._lock:
            if len(self._audio_buffer) < self.chunk_size:
                return
            audio_data = np.array(list(self._audio_buffer))

        # Extract channels (assume stereo or mono)
        channels = self._extract_channels(audio_data)
        if not channels:
            return

        # VAD detection on first channel
        is_speech = self.vad.detect(channels[0])
        vad_msg = Bool()
        vad_msg.data = is_speech
        self._pub_vad.publish(vad_msg)

        # DoA estimation (only if speech detected)
        if is_speech and len(channels) >= 2:
            bearing = self.beamformer.estimate_bearing(channels)
        else:
            bearing = 0.0  # Default to front when no speech

        bearing_msg = Float32()
        bearing_msg.data = float(bearing)
        self._pub_bearing.publish(bearing_msg)

    def _extract_channels(self, audio_data: np.ndarray) -> list[np.ndarray]:
        """
        Extract stereo/mono channels from the interleaved audio buffer.
        """
        if len(audio_data) == 0:
            return []

        # The buffer holds interleaved samples; an even length is assumed to be
        # stereo (adjust if the stream is opened with a different channel count)
        if len(audio_data) % 2 == 0:
            stereo = audio_data.reshape(-1, 2)
            return [stereo[:, 0], stereo[:, 1]]
        else:
            # Mono or odd-length
            return [audio_data]

    def destroy_node(self):
        """Clean up audio stream on shutdown."""
        self._running = False
        if self._stream is not None:
            self._stream.stop()
            self._stream.close()
        super().destroy_node()


def main(args=None):
    rclpy.init(args=args)
    node = AudioDirectionNode()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()
jetson/ros2_ws/src/saltybot_audio_direction/setup.cfg (new file)
[develop]
script_dir=$base/lib/saltybot_audio_direction
[install]
install_scripts=$base/lib/saltybot_audio_direction
[egg_info]
tag_date = 0
jetson/ros2_ws/src/saltybot_audio_direction/setup.py (new file)
from setuptools import setup, find_packages

setup(
    name='saltybot_audio_direction',
    version='0.1.0',
    packages=find_packages(exclude=['test']),
    data_files=[
        ('share/ament_index/resource_index/packages',
         ['resource/saltybot_audio_direction']),
        ('share/saltybot_audio_direction', ['package.xml']),
        # Install launch and config files so ros2 launch can find them
        ('share/saltybot_audio_direction/launch', ['launch/audio_direction.launch.py']),
        ('share/saltybot_audio_direction/config', ['config/audio_direction_params.yaml']),
    ],
    install_requires=['setuptools'],
    zip_safe=True,
    author='SaltyLab',
    author_email='robot@saltylab.local',
    description='Audio direction estimator for sound source localization',
    license='MIT',
    entry_points={
        'console_scripts': [
            'audio_direction_node=saltybot_audio_direction.audio_direction_node:main',
        ],
    },
)
jetson/ros2_ws/src/saltybot_audio_direction/test/ (new file)
"""
Basic tests for audio direction estimator.
"""

import pytest
import numpy as np
from saltybot_audio_direction.audio_direction_node import GCCPHATBeamformer, VADDetector


class TestGCCPHATBeamformer:
    """Tests for GCC-PHAT beamforming."""

    def test_beamformer_init(self):
        """Test beamformer initialization."""
        beamformer = GCCPHATBeamformer(sample_rate=16000, max_lag=64)
        assert beamformer.sample_rate == 16000
        assert beamformer.max_lag == 64

    def test_gcc_phat_stereo(self):
        """Test GCC-PHAT with stereo signals."""
        beamformer = GCCPHATBeamformer(sample_rate=16000, max_lag=64)

        # Synthetic stereo: broadband noise so the delay is unambiguous
        # (a pure tone delayed by a whole number of periods would be
        # indistinguishable from the undelayed signal)
        rng = np.random.default_rng(0)
        sig1 = rng.standard_normal(512)                  # Left mic
        sig2 = np.concatenate((np.zeros(8), sig1[:-8]))  # Right mic (delayed 8 samples)

        lag = beamformer.gcc_phat(sig1, sig2)
        assert isinstance(lag, float)

    def test_estimate_bearing_stereo(self):
        """Test bearing estimation for stereo input."""
        beamformer = GCCPHATBeamformer(sample_rate=16000, max_lag=64)

        t = np.arange(512) / 16000.0
        sig1 = np.sin(2 * np.pi * 1000 * t)
        sig2 = np.sin(2 * np.pi * 1000 * t)

        bearing = beamformer.estimate_bearing([sig1, sig2])
        assert 0 <= bearing <= 360

    def test_estimate_bearing_mono(self):
        """Test bearing with mono input defaults to 0°."""
        beamformer = GCCPHATBeamformer(sample_rate=16000)
        sig = np.sin(2 * np.pi * 1000 * np.arange(512) / 16000.0)
        bearing = beamformer.estimate_bearing([sig])
        assert bearing == 0.0


class TestVADDetector:
    """Tests for voice activity detection."""

    def test_vad_init(self):
        """Test VAD detector initialization."""
        vad = VADDetector(threshold=0.02, smoothing=5)
        assert vad.threshold == 0.02
        assert vad.smoothing == 5

    def test_vad_silence(self):
        """Test VAD rejects silence."""
        vad = VADDetector(threshold=0.02)
        silence = np.zeros(2048)  # note: np.zeros(...) * 0.001 is still all zeros
        is_speech = vad.detect(silence)
        assert not is_speech

    def test_vad_speech(self):
        """Test VAD detects speech-like signal."""
        vad = VADDetector(threshold=0.02)
        t = np.arange(2048) / 16000.0
        speech = np.sin(2 * np.pi * 1000 * t) * 0.1  # ~0.07 RMS
        for _ in range(5):  # Run multiple times to accumulate history
            is_speech = vad.detect(speech)
        assert is_speech


if __name__ == '__main__':
    pytest.main([__file__])