saltylab-firmware/jetson/ros2_ws/src/saltybot_health_monitor
sl-firmware 9683fd3685 feat: Add ROS2 system health monitor (Issue #408)
Implement centralized health monitoring node that:
- Subscribes to /saltybot/<node>/heartbeat from all tracked nodes
- Tracks expected nodes from YAML configuration
- Marks nodes DEAD if silent >5 seconds
- Triggers auto-restart via ros2 launch when nodes fail
- Publishes /saltybot/system_health JSON with full status
- Alerts face display on critical node failures

Features:
- Configurable heartbeat timeout (default 5s)
- Automatic dead node detection and restart
- System health JSON publishing (timestamp, uptime, node status, critical alerts)
- Face alert system for critical failures
- Rate-limited alerting to avoid spam
- Comprehensive monitoring config with critical/important node tiers

Package structure:
- saltybot_health_monitor: Main health monitoring node
- health_config.yaml: Configurable list of monitored nodes
- health_monitor.launch.py: Launch file with parameters
- Unit tests for heartbeat parsing and health status generation

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-05 08:52:52 -05:00

SaltyBot Health Monitor

Central system health monitor for SaltyBot. It tracks heartbeats from all monitored nodes, detects failures, triggers auto-restart, and publishes system health status.

Features

  • Heartbeat Monitoring: Subscribes to heartbeat signals from all tracked nodes
  • Automatic Dead Node Detection: Marks nodes as DOWN if silent for longer than the heartbeat timeout (default 5 seconds)
  • Auto-Restart Capability: Attempts to restart dead nodes via ROS2 launch
  • System Health Publishing: Publishes /saltybot/system_health JSON with full status
  • Face Alerts: Triggers visual alerts on robot face display for critical failures
  • Configurable: YAML-based node list and timeout parameters

Topics

Subscribed

  • /saltybot/<node_name>/heartbeat (std_msgs/String): Heartbeat from each monitored node

Published

  • /saltybot/system_health (std_msgs/String): System health status as JSON
  • /saltybot/face/alert (std_msgs/String): Critical alerts for face display

Configuration

Edit config/health_config.yaml to configure:

  • monitored_nodes: List of all nodes to track
  • heartbeat_timeout_s: Seconds of silence before a node is marked DOWN (default: 5)
  • check_frequency_hz: Health check rate in Hz (default: 1)
  • enable_auto_restart: Enable automatic restart attempts (default: true)
  • critical_nodes: Nodes that trigger face alerts when down
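
The parameters above can be pictured as a small settings object. The following is only a sketch: the HealthConfig class and its from_dict helper are hypothetical names that do not come from the package, the defaults mirror the ones listed above, and the rule that every critical node must also be monitored is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class HealthConfig:
    """Hypothetical in-memory view of health_config.yaml (class name assumed)."""
    monitored_nodes: List[str] = field(default_factory=list)
    critical_nodes: List[str] = field(default_factory=list)
    heartbeat_timeout_s: float = 5.0   # documented default
    check_frequency_hz: float = 1.0    # documented default
    enable_auto_restart: bool = True   # documented default

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "HealthConfig":
        """Build a config from already-parsed YAML and reject obvious mistakes."""
        cfg = cls(**raw)
        if cfg.heartbeat_timeout_s <= 0 or cfg.check_frequency_hz <= 0:
            raise ValueError("timeout and check frequency must be positive")
        # Assumption: a critical node only makes sense if it is also monitored.
        unknown = set(cfg.critical_nodes) - set(cfg.monitored_nodes)
        if unknown:
            raise ValueError(f"critical nodes not monitored: {sorted(unknown)}")
        return cfg


# Node names borrowed from the sample health JSON in this README.
cfg = HealthConfig.from_dict({
    "monitored_nodes": ["rover_driver", "slam_node"],
    "critical_nodes": ["slam_node"],
})
```

Omitted keys fall back to the documented defaults, so a config that only lists nodes still yields a 5 second timeout and a 1 Hz check rate.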

Launch

# Default launch with built-in config
ros2 launch saltybot_health_monitor health_monitor.launch.py

# Custom config
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  config_file:=/path/to/custom_config.yaml

# Disable auto-restart
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  enable_auto_restart:=false

Health Status JSON

The /saltybot/system_health topic publishes:

{
  "timestamp": "2025-03-05T10:00:00.123456",
  "uptime_s": 3600.5,
  "nodes": {
    "rover_driver": {
      "status": "UP",
      "time_since_heartbeat_s": 0.5,
      "heartbeat_count": 1200,
      "restart_count": 0,
      "expected": true
    },
    "slam_node": {
      "status": "DOWN",
      "time_since_heartbeat_s": 6.0,
      "heartbeat_count": 500,
      "restart_count": 1,
      "expected": true
    }
  },
  "critical_down": ["slam_node"],
  "system_healthy": false
}
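
A consumer of this topic only needs a JSON parser. The sketch below assumes the message payload has the shape shown above; down_nodes is a hypothetical helper name, and the sample is a trimmed copy of the example message.

```python
import json
from typing import List


def down_nodes(health_json: str) -> List[str]:
    """Return names of nodes whose status is not "UP", sorted for stable output."""
    report = json.loads(health_json)
    return sorted(
        name for name, info in report["nodes"].items() if info["status"] != "UP"
    )


# Trimmed copy of the example payload above.
sample = json.dumps({
    "timestamp": "2025-03-05T10:00:00.123456",
    "uptime_s": 3600.5,
    "nodes": {
        "rover_driver": {"status": "UP", "time_since_heartbeat_s": 0.5},
        "slam_node": {"status": "DOWN", "time_since_heartbeat_s": 6.0},
    },
    "critical_down": ["slam_node"],
    "system_healthy": False,
})

print(down_nodes(sample))  # ['slam_node']
```

Filtering on "not UP" rather than "DOWN" also catches nodes reported as "RESTARTING".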

Node Integration

Each node should publish heartbeats periodically (e.g., every 1-2 seconds):

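# In your ROS2 node's __init__ (requires: from std_msgs.msg import String)
self.heartbeat_pub = self.create_publisher(String, "/saltybot/node_name/heartbeat", 10)
# Publish once per second so the gap between heartbeats stays well under the 5 s timeout
self.heartbeat_timer = self.create_timer(
    1.0, lambda: self.heartbeat_pub.publish(String(data="node_name:alive")))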
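
On the receiving side, the "node_name:alive" payload has to be split back into a name and a state. How the monitor actually parses heartbeats is not shown in this README, so the parse_heartbeat function below is a hypothetical sketch that assumes the "<node_name>:<state>" layout used in the example.

```python
from typing import Optional, Tuple


def parse_heartbeat(payload: str) -> Optional[Tuple[str, str]]:
    """Split "<node_name>:<state>" into its parts; return None if malformed."""
    name, sep, state = payload.strip().partition(":")
    if not sep or not name or not state:
        return None
    return name, state


print(parse_heartbeat("rover_driver:alive"))  # ('rover_driver', 'alive')
print(parse_heartbeat("garbage"))             # None
```

Returning None for malformed input lets the caller ignore a bad message instead of raising inside a subscription callback.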

Restart Behavior

When a node is detected as DOWN:

  1. Health monitor logs a warning
  2. If enable_auto_restart is true, queues a restart command
  3. Node status changes to "RESTARTING"
  4. Restart count is incremented
  5. Face alert is published for critical nodes
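
The steps above can be sketched as a small state tracker. This is not the package's implementation: HealthTracker and NodeState are hypothetical names, the clock is passed in explicitly so the logic can be exercised without ROS2, and logging, the restart queue, and the face alert publisher are reduced to comments and a returned list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class NodeState:
    last_heartbeat_s: float
    status: str = "UP"
    restart_count: int = 0


class HealthTracker:
    """Hypothetical tracker; times are plain floats supplied by the caller."""

    def __init__(self, nodes: Iterable[str], critical: Iterable[str],
                 timeout_s: float = 5.0, auto_restart: bool = True,
                 now_s: float = 0.0) -> None:
        self.nodes: Dict[str, NodeState] = {n: NodeState(now_s) for n in nodes}
        self.critical = set(critical)
        self.timeout_s = timeout_s
        self.auto_restart = auto_restart

    def on_heartbeat(self, name: str, now_s: float) -> None:
        """Record a heartbeat and mark the node healthy again."""
        state = self.nodes[name]
        state.last_heartbeat_s = now_s
        state.status = "UP"

    def check(self, now_s: float) -> List[str]:
        """Run one health check; return critical nodes that newly went down."""
        alerts: List[str] = []
        for name, state in self.nodes.items():
            silent_for = now_s - state.last_heartbeat_s
            if state.status != "UP" or silent_for <= self.timeout_s:
                continue  # already handled, or still healthy
            state.status = "DOWN"              # step 1: a warning would be logged here
            if self.auto_restart:
                state.status = "RESTARTING"    # steps 2-3: restart queued
                state.restart_count += 1       # step 4
            if name in self.critical:
                alerts.append(name)            # step 5: face alert
        return alerts


tracker = HealthTracker(["rover_driver", "slam_node"], critical=["slam_node"])
tracker.on_heartbeat("rover_driver", now_s=4.0)
print(tracker.check(now_s=6.0))             # ['slam_node']
print(tracker.nodes["slam_node"].status)    # RESTARTING
```

Because a node is only processed while its status is "UP", a second check does not re-alert or re-count the same outage; this also gives a simple form of rate limiting. One simplification to be aware of: a node that stays silent after the restart attempt is not retried in this sketch.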

The actual restart mechanism can be:

  • Direct ROS2 launch subprocess
  • Systemd service restart
  • Custom restart script
  • Manual restart via external monitor
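
For the first option, a minimal sketch of a launch subprocess might look like the following. The helper names and the my_package / my_node.launch.py arguments are placeholders; how the monitor maps a node name to a launch file is not described in this README.

```python
import subprocess
from typing import List


def build_restart_command(package: str, launch_file: str) -> List[str]:
    """Return the argv for relaunching a node via `ros2 launch`."""
    return ["ros2", "launch", package, launch_file]


def restart_node(package: str, launch_file: str) -> subprocess.Popen:
    """Start the launch in the background so the health check loop is not blocked."""
    return subprocess.Popen(
        build_restart_command(package, launch_file),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


# Placeholder names; only the command is built here, nothing is launched.
print(build_restart_command("my_package", "my_node.launch.py"))
```

Passing the command as a list avoids shell quoting issues, and keeping the returned Popen handle lets the caller poll or terminate the restarted process later.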

Debugging

Check health status:

ros2 topic echo /saltybot/system_health

Simulate a node heartbeat (ros2 topic pub repeats the message at 1 Hz until interrupted):

ros2 topic pub /saltybot/test_node/heartbeat std_msgs/String '{data: "test_node:alive"}'

View monitor logs:

ros2 launch saltybot_health_monitor health_monitor.launch.py | grep health