# SaltyBot Health Monitor

Central system health monitor for SaltyBot. It tracks heartbeats from all monitored nodes, detects failures, triggers automatic restarts, and publishes the system health status.

## Features

- **Heartbeat Monitoring**: Subscribes to heartbeat signals from all tracked nodes
- **Automatic Dead Node Detection**: Marks a node as DOWN when no heartbeat arrives within the timeout (default: 5 seconds)
- **Auto-Restart Capability**: Attempts to restart dead nodes, e.g., via a ROS2 launch subprocess
- **System Health Publishing**: Publishes `/saltybot/system_health` JSON with full status
- **Face Alerts**: Triggers visual alerts on robot face display for critical failures
- **Configurable**: YAML-based node list and timeout parameters

## Topics

### Subscribed
- `/saltybot/<node_name>/heartbeat` (std_msgs/String): Heartbeat from each monitored node

### Published
- `/saltybot/system_health` (std_msgs/String): System health status as JSON
- `/saltybot/face/alert` (std_msgs/String): Critical alerts for face display
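
The heartbeat topic follows the pattern `/saltybot/<node_name>/heartbeat`, so topic names can be derived from the node list. The helpers below are a minimal sketch of that mapping; `heartbeat_topic` and `node_from_topic` are hypothetical names and are not part of the package.

```python
from typing import Optional

HEARTBEAT_PREFIX = "/saltybot/"
HEARTBEAT_SUFFIX = "/heartbeat"


def heartbeat_topic(node_name: str) -> str:
    """Build the heartbeat topic for a monitored node."""
    return f"{HEARTBEAT_PREFIX}{node_name}{HEARTBEAT_SUFFIX}"


def node_from_topic(topic: str) -> Optional[str]:
    """Return the node name, or None if the topic is not a heartbeat topic."""
    if not (topic.startswith(HEARTBEAT_PREFIX) and topic.endswith(HEARTBEAT_SUFFIX)):
        return None
    name = topic[len(HEARTBEAT_PREFIX):-len(HEARTBEAT_SUFFIX)]
    # Reject empty names and nested paths such as /saltybot/a/b/heartbeat.
    return name if name and "/" not in name else None


print(heartbeat_topic("slam_node"))
print(node_from_topic("/saltybot/system_health"))
```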

## Configuration

Edit `config/health_config.yaml` to configure:

- **monitored_nodes**: List of all nodes to track
- **heartbeat_timeout_s**: Seconds of silence before a node is marked DOWN (default: 5s)
- **check_frequency_hz**: Health check rate (default: 1Hz)
- **enable_auto_restart**: Enable automatic restart attempts (default: true)
- **critical_nodes**: Nodes that trigger face alerts when down
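
Once the YAML file has been parsed into a dictionary, the parameters above could be modeled as shown below. This is a minimal sketch, not the package's actual loader: the `HealthConfig` class and its `from_dict` helper are hypothetical, and only the three defaults (5 s, 1 Hz, auto-restart enabled) come from this README. The check that critical nodes are also monitored is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class HealthConfig:
    """Hypothetical container for the documented parameters."""
    monitored_nodes: List[str] = field(default_factory=list)
    critical_nodes: List[str] = field(default_factory=list)
    heartbeat_timeout_s: float = 5.0   # documented default
    check_frequency_hz: float = 1.0    # documented default
    enable_auto_restart: bool = True   # documented default

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "HealthConfig":
        """Build a config from a parsed YAML mapping, applying defaults."""
        cfg = cls(
            monitored_nodes=list(raw.get("monitored_nodes", [])),
            critical_nodes=list(raw.get("critical_nodes", [])),
            heartbeat_timeout_s=float(raw.get("heartbeat_timeout_s", 5.0)),
            check_frequency_hz=float(raw.get("check_frequency_hz", 1.0)),
            enable_auto_restart=bool(raw.get("enable_auto_restart", True)),
        )
        if cfg.heartbeat_timeout_s <= 0 or cfg.check_frequency_hz <= 0:
            raise ValueError("timeout and check frequency must be positive")
        # Assumption: every critical node should also be a monitored node.
        unknown = set(cfg.critical_nodes) - set(cfg.monitored_nodes)
        if unknown:
            raise ValueError(f"critical nodes not monitored: {sorted(unknown)}")
        return cfg


cfg = HealthConfig.from_dict(
    {"monitored_nodes": ["rover_driver", "slam_node"], "critical_nodes": ["slam_node"]}
)
print(cfg.heartbeat_timeout_s, cfg.check_frequency_hz, cfg.enable_auto_restart)
```

Validating at load time means a typo in `critical_nodes` surfaces at startup rather than as a face alert that never fires.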

## Launch

```bash
# Default launch with built-in config
ros2 launch saltybot_health_monitor health_monitor.launch.py

# Custom config
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  config_file:=/path/to/custom_config.yaml

# Disable auto-restart
ros2 launch saltybot_health_monitor health_monitor.launch.py \
  enable_auto_restart:=false
```

## Health Status JSON

Each message on the `/saltybot/system_health` topic carries a JSON string like this:

```json
{
  "timestamp": "2025-03-05T10:00:00.123456",
  "uptime_s": 3600.5,
  "nodes": {
    "rover_driver": {
      "status": "UP",
      "time_since_heartbeat_s": 0.5,
      "heartbeat_count": 1200,
      "restart_count": 0,
      "expected": true
    },
    "slam_node": {
      "status": "DOWN",
      "time_since_heartbeat_s": 6.0,
      "heartbeat_count": 500,
      "restart_count": 1,
      "expected": true
    }
  },
  "critical_down": ["slam_node"],
  "system_healthy": false
}
```
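
A consumer can decode this payload with the standard `json` module. The sketch below uses only the fields shown in the example above; the function names `down_nodes` and `summarize` are hypothetical. In a ROS2 subscriber callback, the JSON string would be the `data` field of the received `String` message.

```python
import json
from typing import List


def down_nodes(payload: str) -> List[str]:
    """Return the names of nodes whose status is not "UP", sorted by name."""
    health = json.loads(payload)
    return sorted(
        name for name, info in health["nodes"].items() if info["status"] != "UP"
    )


def summarize(payload: str) -> str:
    """Build a one-line summary from the health payload."""
    health = json.loads(payload)
    if health["system_healthy"]:
        return "healthy"
    critical = ", ".join(health["critical_down"]) or "none"
    not_up = ", ".join(down_nodes(payload)) or "none"
    return f"unhealthy; critical down: {critical}; not up: {not_up}"


# Same content as the example message above.
sample = json.dumps({
    "timestamp": "2025-03-05T10:00:00.123456",
    "uptime_s": 3600.5,
    "nodes": {
        "rover_driver": {"status": "UP", "time_since_heartbeat_s": 0.5,
                         "heartbeat_count": 1200, "restart_count": 0, "expected": True},
        "slam_node": {"status": "DOWN", "time_since_heartbeat_s": 6.0,
                      "heartbeat_count": 500, "restart_count": 1, "expected": True},
    },
    "critical_down": ["slam_node"],
    "system_healthy": False,
})
print(summarize(sample))
```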

## Node Integration

Each node should publish heartbeats periodically (e.g., every 1-2 seconds):

```python
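from std_msgs.msg import String

# In your ROS2 node's __init__ (replace node_name with the node's actual name)
self.heartbeat_pub = self.create_publisher(String, "/saltybot/node_name/heartbeat", 10)
# Publish once per second, well inside the default 5 s timeout
self.create_timer(1.0, lambda: self.heartbeat_pub.publish(String(data="node_name:alive")))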
```

## Restart Behavior

When a node is detected as DOWN:

1. Health monitor logs a warning
2. If `enable_auto_restart: true`, queues a restart command
3. Node status changes to "RESTARTING"
4. Restart count is incremented
5. Face alert is published for critical nodes

The actual restart mechanism can be:
- Direct ROS2 launch subprocess
- Systemd service restart
- Custom restart script
- Manual restart via external monitor
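
The detection and restart steps above can be sketched as a small state machine in plain Python, without ROS2. This is an illustration, not the monitor's actual code: `HealthTracker`, `NodeState`, and `restart_queue` are hypothetical names. It also assumes that steps 3 and 4 apply only when auto-restart is enabled, that a node already marked DOWN or RESTARTING is not re-queued until it sends a heartbeat, and that a heartbeat returns a node to UP.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class NodeState:
    last_heartbeat: float
    status: str = "UP"
    restart_count: int = 0


class HealthTracker:
    """Hypothetical in-memory model of the monitor's bookkeeping."""

    def __init__(self, nodes: List[str], critical: Set[str],
                 timeout_s: float = 5.0, auto_restart: bool = True,
                 now: float = 0.0) -> None:
        self.nodes: Dict[str, NodeState] = {n: NodeState(now) for n in nodes}
        self.critical = critical
        self.timeout_s = timeout_s
        self.auto_restart = auto_restart
        self.restart_queue: List[str] = []  # stand-in for the real restart mechanism

    def on_heartbeat(self, name: str, now: float) -> None:
        """Record a heartbeat; assumed to bring the node back to UP."""
        state = self.nodes[name]
        state.last_heartbeat = now
        state.status = "UP"

    def check(self, now: float) -> List[str]:
        """Run one health check; return critical nodes needing a face alert."""
        alerts: List[str] = []
        for name, state in self.nodes.items():
            silent_for = now - state.last_heartbeat
            if state.status != "UP" or silent_for <= self.timeout_s:
                continue  # already handled, or still within the timeout
            state.status = "DOWN"                    # step 1: log a warning here
            if self.auto_restart:
                self.restart_queue.append(name)      # step 2: queue a restart
                state.status = "RESTARTING"          # step 3
                state.restart_count += 1             # step 4
            if name in self.critical:
                alerts.append(name)                  # step 5: face alert
        return alerts


tracker = HealthTracker(["rover_driver", "slam_node"], critical={"slam_node"})
tracker.on_heartbeat("rover_driver", now=4.0)
alerts = tracker.check(now=6.0)  # slam_node has been silent for 6 s
print(alerts, tracker.nodes["slam_node"].status, tracker.restart_queue)
```

Passing the current time in as an argument, rather than reading a clock inside `check`, keeps the logic easy to test with made-up timestamps.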

## Debugging

Check health status:
```bash
ros2 topic echo /saltybot/system_health
```

Simulate a node heartbeat (`ros2 topic pub` repeats the message at 1 Hz until interrupted):
```bash
ros2 topic pub /saltybot/test_node/heartbeat std_msgs/String '{data: "test_node:alive"}'
```

View monitor logs:
```bash
ros2 launch saltybot_health_monitor health_monitor.launch.py | grep health
```