Files
nannyagent/README.md
harsha f69e1dbc66 add-bpf-capability (#1)
1) add-bpf-capability
2) Not so clean but for now it's okay to start with

Co-authored-by: Harshavardhan Musanalli <harshavmb@gmail.com>
Reviewed-on: #1
2025-10-22 08:16:40 +00:00

221 lines
7.3 KiB
Markdown

# Linux Diagnostic Agent
A Go-based AI agent that diagnoses Linux system issues using the NannyAPI gateway with OpenAI-compatible SDK.
## Features
- Interactive command-line interface for submitting system issues
- **Automatic system information gathering** - Includes OS, kernel, CPU, memory, network info
- **eBPF-powered deep system monitoring** - Advanced tracing for network, processes, files, and security events
- Integrates with NannyAPI using OpenAI-compatible Go SDK
- Executes diagnostic commands safely and collects output
- Provides step-by-step resolution plans
- **Comprehensive integration tests** with realistic Linux problem scenarios
## Setup
1. Clone this repository
2. Copy `.env.example` to `.env` and configure your NannyAPI endpoint:
```bash
cp .env.example .env
```
3. Install dependencies:
```bash
go mod tidy
```
4. Build and run:
```bash
make build
./nanny-agent
```
## Configuration
The agent can be configured using environment variables:
- `NANNYAPI_ENDPOINT`: The NannyAPI endpoint (default: `http://tensorzero.netcup.internal:3000/openai/v1`)
- `NANNYAPI_MODEL`: The model identifier (default: `nannyapi::function_name::diagnose_and_heal`)
## Installation on Linux VM
### Direct Installation
1. **Install Go** (if not already installed):
```bash
# For Ubuntu/Debian
sudo apt update
sudo apt install golang-go
# For RHEL/CentOS/Fedora
sudo dnf install golang
# or
sudo yum install golang
```
2. **Clone and build the agent**:
```bash
git clone <your-repo-url>
cd nannyagentv2
go mod tidy
make build
```
3. **Install as system service** (optional):
```bash
sudo cp nanny-agent /usr/local/bin/
sudo chmod +x /usr/local/bin/nanny-agent
```
4. **Set environment variables**:
```bash
export NANNYAPI_ENDPOINT="http://your-nannyapi-endpoint:3000/openai/v1"
export NANNYAPI_MODEL="your-model-identifier"
```
## Usage
1. Start the agent:
```bash
./nanny-agent
```
2. Enter a system issue description when prompted:
```
> On /var filesystem I cannot create any file but df -h shows 30% free space available.
```
3. The agent will:
- Send the issue to the AI via NannyAPI using OpenAI SDK
- Execute diagnostic commands as suggested by the AI
- Provide command outputs back to the AI
- Display the final diagnosis and resolution plan
4. Type `quit` or `exit` to stop the agent
## How It Works
1. **User Input**: Submit a description of the system issue you're experiencing
2. **System Info Gathering**: Agent automatically collects comprehensive system information and eBPF capabilities
3. **AI Analysis**: Sends the issue description + system info to NannyAPI for analysis
4. **Diagnostic Phase**: AI returns structured commands and eBPF monitoring requests for investigation
5. **Command Execution**: Agent safely executes diagnostic commands and runs eBPF traces in parallel
6. **eBPF Monitoring**: Real-time system tracing (network, processes, files, syscalls) provides deep insights
7. **Iterative Analysis**: Command results and eBPF trace data are sent back to AI for further analysis
8. **Resolution**: AI provides root cause analysis and step-by-step resolution plan based on comprehensive data
## Testing & Integration Tests
The agent includes comprehensive integration tests that simulate realistic Linux problems:
### Available Test Scenarios:
1. **Disk Space Issues** - Inode exhaustion scenarios
2. **Memory Problems** - OOM killer and memory pressure
3. **Network Issues** - DNS resolution problems
4. **Performance Issues** - High load averages and I/O bottlenecks
5. **Web Server Problems** - Permission and configuration issues
6. **Hardware/Boot Issues** - Kernel module and device problems
7. **Database Performance** - Slow queries and I/O contention
8. **Service Failures** - Startup and configuration problems
### Run Integration Tests:
```bash
# Interactive test scenarios
./test-examples.sh
# Automated integration tests
./integration-tests.sh
# Function discovery (find valid NannyAPI functions)
./discover-functions.sh
```
## Safety
## eBPF Monitoring Capabilities
The agent includes advanced eBPF (Extended Berkeley Packet Filter) monitoring for deep system investigation:
- **System Call Tracing**: Monitor process behavior through syscall analysis
- **Network Activity**: Track network connections, data flow, and protocol usage
- **Process Monitoring**: Real-time process creation, execution, and lifecycle tracking
- **File System Events**: Monitor file access, creation, deletion, and permission changes
- **Performance Analysis**: CPU, memory, and I/O performance profiling
- **Security Events**: Detect privilege escalation and suspicious activities
The AI automatically requests appropriate eBPF monitoring based on the issue type, providing unprecedented visibility into system behavior during problem diagnosis.
For detailed eBPF documentation, see [EBPF_README.md](EBPF_README.md).
## Safety
- All commands are validated before execution to prevent dangerous operations
- Read-only diagnostic commands are prioritized
- No commands that modify system state (rm, mv, etc.) are executed
- Commands have timeouts to prevent hanging
- Secure execution environment with proper error handling
- eBPF monitoring is read-only and time-limited for safety
## API Integration
The agent uses the `github.com/sashabaranov/go-openai` SDK to communicate with NannyAPI's OpenAI-compatible API endpoint. This provides:
- Robust HTTP client with retries and timeouts
- Structured request/response handling
- Automatic JSON marshaling/unmarshaling
- Error handling and validation
## Example Session
```
Linux Diagnostic Agent Started
Enter a system issue description (or 'quit' to exit):
> Cannot create files in /var but df shows space available
Diagnosing issue: Cannot create files in /var but df shows space available
Gathering system information...
AI Response:
{
"response_type": "diagnostic",
"reasoning": "The 'No space left on device' error despite available disk space suggests inode exhaustion...",
"commands": [
{"id": "check_inodes", "command": "df -i /var", "description": "Check inode usage..."}
]
}
Executing command 'check_inodes': df -i /var
Output:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 1000000 999999 1 100% /var
=== DIAGNOSIS COMPLETE ===
Root Cause: The /var filesystem has exhausted all available inodes
Resolution Plan: 1. Find and remove unnecessary files...
Confidence: High
```
Note: The AI receives comprehensive system information including:
- Hostname, OS version, kernel version
- CPU cores, memory, system uptime
- Network interfaces and private IPs
- Current load average and disk usage
## Available Make Commands
- `make build` - Build the application
- `make run` - Build and run the application
- `make clean` - Clean build artifacts
- `make test` - Run unit tests
- `make install` - Install dependencies
- `make build-prod` - Build for production
- `make install-system` - Install system-wide (requires sudo)
- `make fmt` - Format code
- `make help` - Show available commands
## Testing Commands
- `./test-examples.sh` - Show interactive test scenarios
- `./integration-tests.sh` - Run automated integration tests
- `./discover-functions.sh` - Find available NannyAPI functions
- `./install.sh` - Installation script for Linux VMs