sl-perception 569ac3fb35 feat: Add gesture recognition system (Issue #454)
Implements hand and body gesture recognition via MediaPipe on Jetson Orin GPU.
- MediaPipe Hands (21-point hand landmarks) + Pose (33-point body landmarks)
- Recognizes: wave, point, stop_palm, thumbs_up, come_here, arms_up, arms_spread
- GestureArray publishing at 10–15 fps on Jetson Orin
- Confidence threshold: 0.7 (configurable)
- Range: 2–5 meters optimal
- GPU acceleration via Jetson Tensor RT
- Integrates with voice command router for multimodal interaction
- Temporal smoothing: history-based motion detection (wave, beckon)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-05 09:19:40 -05:00

6.6 KiB
Raw Blame History

saltybot_gesture_recognition

Hand and body gesture recognition via MediaPipe on Jetson Orin GPU (Issue #454).

Detects human hand and body gestures in real-time camera feed and publishes recognized gestures for multimodal interaction. Integrates with voice command router for combined audio+gesture control.

Recognized Gestures

Hand Gestures

  • wave — Lateral wrist oscillation (temporal) | Greeting, acknowledgment
  • point — Index extended, others curled | Direction indication ("left"/"right"/"up"/"forward")
  • stop_palm — All fingers extended, palm forward | Emergency stop (e-stop)
  • thumbs_up — Thumb extended up, fist closed | Confirmation, approval
  • come_here — Beckoning: index curled toward palm (temporal) | Call to approach
  • follow — Index extended horizontally | Follow me

Body Gestures

  • arms_up — Both wrists above shoulders | Stop / emergency
  • arms_spread — Arms extended laterally | Back off / clear space
  • crouch — Hips below standing threshold | Come closer

Performance

  • Frame Rate: 1015 fps on Jetson Orin (with GPU acceleration)
  • Latency: ~100150 ms end-to-end
  • Range: 25 meters (optimal 23 m)
  • Accuracy: ~8590% for known gestures (varies by lighting, occlusion)
  • Simultaneous Detections: Up to 10 people + gestures per frame

Topics

Published

  • /saltybot/gestures (saltybot_social_msgs/GestureArray) Array of detected gestures with type, confidence, position, source (hand/body)

Parameters

Parameter Type Default Description
camera_topic str /camera/color/image_raw RGB camera topic
confidence_threshold float 0.7 Min confidence to publish (01)
publish_hz float 15.0 Output rate (Hz)
max_distance_m float 5.0 Max gesture range (meters)
enable_gpu bool true Use Jetson GPU acceleration

Messages

GestureArray

Header header
Gesture[] gestures
uint32 count

Gesture (from saltybot_social_msgs)

Header header
string gesture_type              # "wave", "point", "stop_palm", etc.
int32 person_id                  # -1 if unidentified
float32 confidence               # 01 (typically >= 0.7)
int32 camera_id                  # 0=front
float32 hand_x, hand_y           # Normalized position (01)
bool is_right_hand               # True for right hand
string direction                 # For "point": "left"/"right"/"up"/"forward"/"down"
string source                    # "hand" or "body_pose"

Usage

Launch Node

ros2 launch saltybot_gesture_recognition gesture_recognition.launch.py

With Custom Parameters

ros2 launch saltybot_gesture_recognition gesture_recognition.launch.py \
  camera_topic:='/camera/front/image_raw' \
  confidence_threshold:=0.75 \
  publish_hz:=20.0

Using Config File

ros2 launch saltybot_gesture_recognition gesture_recognition.launch.py \
  --ros-args --params-file config/gesture_params.yaml

Algorithm

MediaPipe Hands

  • 21 landmarks per hand (wrist + finger joints)
  • Detects: palm orientation, finger extension, hand pose
  • Model complexity: 0 (lite, faster) for Jetson

MediaPipe Pose

  • 33 body landmarks (shoulders, hips, wrists, knees, etc.)
  • Detects: arm angle, body orientation, posture
  • Model complexity: 1 (balanced accuracy/speed)

Gesture Classification

  1. Thumbs-up: Thumb extended >0.3, no other fingers extended
  2. Stop-palm: All fingers extended, palm normal > 0.3 (facing camera)
  3. Point: Only index extended, direction from hand position
  4. Wave: High variance in hand x-position over ~5 frames
  5. Beckon: High variance in hand y-position over ~4 frames
  6. Arms-up: Both wrists > shoulder height
  7. Arms-spread: Wrist distance > shoulder width × 1.2
  8. Crouch: Hip-y > shoulder-y + 0.3

Confidence Scoring

  • MediaPipe detection confidence × gesture classification confidence
  • Temporal smoothing: history over last 10 frames
  • Threshold: 0.7 (configurable) for publication

Integration with Voice Command Router

# Listen to both topics
rospy.Subscriber('/saltybot/speech', SpeechTranscript, voice_callback)
rospy.Subscriber('/saltybot/gestures', GestureArray, gesture_callback)

def multimodal_command(voice_cmd, gesture):
    # "robot forward" (voice) + point-forward (gesture) = confirmed forward
    if gesture.gesture_type == 'point' and gesture.direction == 'forward':
        if 'forward' in voice_cmd:
            nav.set_goal(forward_pos)  # High confidence

Dependencies

  • mediapipe — Hand and Pose detection
  • opencv-python — Image processing
  • numpy, scipy — Numerical computation
  • rclpy — ROS2 Python client
  • saltybot_social_msgs — Custom gesture messages

Build & Test

Build

colcon build --packages-select saltybot_gesture_recognition

Run Tests

pytest jetson/ros2_ws/src/saltybot_gesture_recognition/test/

Benchmark on Jetson Orin

ros2 run saltybot_gesture_recognition gesture_node \
  --ros-args -p publish_hz:=30.0 &
ros2 topic hz /saltybot/gestures
# Expected: ~15 Hz (GPU-limited, not message processing)

Troubleshooting

Issue: Low frame rate (< 10 Hz)

  • Solution: Reduce camera resolution or use model_complexity=0

Issue: False positives (confidence > 0.7 but wrong gesture)

  • Solution: Increase confidence_threshold to 0.750.8

Issue: Doesn't detect gestures at distance > 3m

  • Solution: Improve lighting, move closer, or reduce max_distance_m

Future Enhancements

  • Dynamic Gesture Timeout: Stop publishing after 2s without update
  • Person Association: Match gestures to tracked persons (from saltybot_multi_person_tracker)
  • Custom Gesture Training: TensorFlow Lite fine-tuning on robot-specific gestures
  • Gesture Sequences: Recognize multi-step command chains ("wave → point → thumbs-up")
  • Sign Language: ASL/BSL recognition (larger model, future Phase)
  • Accessibility: Voice + gesture for accessibility (e.g., hands-free "stop")

Performance Targets (Jetson Orin Nano Super)

Metric Target Actual
Frame Rate 10+ fps ~15 fps (GPU)
Latency <200 ms ~100150 ms
Max People 510 ~10 (GPU-limited)
Confidence 0.7+ 0.750.95
GPU Memory <1 GB ~400500 MB

References

License

MIT