# SaltyBot Health Monitor

Central system health monitor for SaltyBot. Tracks heartbeats from all critical nodes, detects failures, triggers auto-restart, and publishes system health status.

## Features

- **Heartbeat Monitoring**: Subscribes to heartbeat signals from all tracked nodes
- **Automatic Dead Node Detection**: Marks a node as DOWN if it has been silent for more than 5 seconds
- **Auto-Restart Capability**: Attempts to restart dead nodes via ROS2 launch
- **System Health Publishing**: Publishes `/saltybot/system_health` JSON with full status
- **Face Alerts**: Triggers visual alerts on the robot face display for critical failures
- **Configurable**: YAML-based node list and timeout parameters

## Topics

### Subscribed

- `/saltybot/<node_name>/heartbeat` (std_msgs/String): Heartbeat from each monitored node

### Published

- `/saltybot/system_health` (std_msgs/String): System health status as JSON
- `/saltybot/face/alert` (std_msgs/String): Critical alerts for face display

## Configuration

Edit `config/health_config.yaml` to configure:

- **monitored_nodes**: List of all nodes to track
- **heartbeat_timeout_s**: Seconds of silence before a node is marked DOWN (default: 5s)
- **check_frequency_hz**: Health check rate (default: 1Hz)
- **enable_auto_restart**: Enable automatic restart attempts (default: true)
- **critical_nodes**: Nodes that trigger face alerts when down

## Launch

```bash
# Default launch with built-in config
ros2 launch saltybot_health_monitor health_monitor.launch.py

# Custom config
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  config_file:=/path/to/custom_config.yaml

# Disable auto-restart
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  enable_auto_restart:=false
```

## Health Status JSON

The `/saltybot/system_health` topic publishes:

```json
{
  "timestamp": "2025-03-05T10:00:00.123456",
  "uptime_s": 3600.5,
  "nodes": {
    "rover_driver": {
      "status": "UP",
      "time_since_heartbeat_s": 0.5,
      "heartbeat_count": 1200,
      "restart_count": 0,
      "expected": true
    },
    "slam_node": {
      "status": "DOWN",
      "time_since_heartbeat_s": 6.0,
      "heartbeat_count": 500,
      "restart_count": 1,
      "expected": true
    }
  },
  "critical_down": ["slam_node"],
  "system_healthy": false
}
```

## Node Integration

Each node should publish heartbeats periodically (e.g., every 1-2 seconds):

```python
# In your ROS2 node
from std_msgs.msg import String

heartbeat_pub = self.create_publisher(String, "/saltybot/node_name/heartbeat", 10)
heartbeat_pub.publish(String(data="node_name:alive"))
```

## Restart Behavior

When a node is detected as DOWN:

1. Health monitor logs a warning
2. If `enable_auto_restart: true`, queues a restart command
3. Node status changes to "RESTARTING"
4. Restart count is incremented
5. Face alert is published for critical nodes

The actual restart mechanism can be:

- Direct ROS2 launch subprocess
- Systemd service restart
- Custom restart script
- Manual restart via external monitor

## Debugging

Check health status:

```bash
ros2 topic echo /saltybot/system_health
```

Simulate a node heartbeat:

```bash
ros2 topic pub /saltybot/test_node/heartbeat std_msgs/String '{data: "test_node:alive"}'
```

View monitor logs:

```bash
ros2 launch saltybot_health_monitor health_monitor.launch.py | grep health
```