8 Commits

| Author | SHA256 | Message | Date |
|--------|--------|---------|------|
| Harshavardhan Musanalli | d519bf77e9 | working mode | 2025-11-16 10:29:24 +01:00 |
| Harshavardhan Musanalli | c268a3a42e | Somewhat okay refactoring | 2025-11-08 21:48:59 +01:00 |
| Harshavardhan Musanalli | 794111cb44 | somewhat working ebpf bpftrace | 2025-11-08 20:42:07 +01:00 |
| Harshavardhan Musanalli | 190e54dd38 | Remove old eBPF implementations - keep only new BCC-style concurrent tracing | 2025-11-08 14:56:56 +01:00 |
| Harshavardhan Musanalli | 8328f8d5b3 | Integrate-with-supabase-backend | 2025-10-28 07:53:14 +01:00 |
| Harshavardhan Musanalli | 8832450a1f | Agent and websocket investigations work fine | 2025-10-27 19:13:39 +01:00 |
| Harshavardhan Musanalli | 0a8b2dc202 | Working code with Tensorzero through Supabase proxy | 2025-10-25 15:16:03 +02:00 |
| Harshavardhan Musanalli | 6fd403cb5f | Integrate with supabase backend | 2025-10-25 12:39:48 +02:00 |
34 changed files with 6956 additions and 2501 deletions

.gitignore (vendored, 6 lines changed)

@@ -23,6 +23,10 @@ go.work
 go.work.sum

 # env file
-.env
+.env*
+nannyagent*
+nanny-agent*
 .vscode

+# Build directory
+build/

BCC_TRACING.md (new file, 298 lines)

@@ -0,0 +1,298 @@
# BCC-Style eBPF Tracing Implementation
## Overview
This implementation adds BCC-style (BPF Compiler Collection) tracing capabilities to the diagnostic agent, similar to the `trace.py` tool from the iovisor BCC project. Instead of merely filtering events, the system counts and traces real system calls with detailed argument parsing.
## Key Features
### 1. Real System Call Tracing
- **Actual event counting**: Unlike the previous implementation that just simulated events, this captures real system calls
- **Argument extraction**: Extracts function arguments (arg1, arg2, etc.) and return values
- **Multiple probe types**: Supports kprobes, kretprobes, tracepoints, and uprobes
- **Filtering capabilities**: Filter by process name, PID, UID, argument values
### 2. BCC-Style Syntax
Supports familiar BCC trace.py syntax patterns:
```bash
# Simple syscall tracing
"sys_open" # Trace open syscalls
"sys_read (arg3 > 1024)" # Trace reads >1024 bytes
"r::sys_open" # Return probe on open
# With format strings
"sys_write \"wrote %d bytes\", arg3"
"sys_open \"opening %s\", arg2@user"
```
### 3. Comprehensive Event Data
Each trace captures:
```json
{
"timestamp": 1234567890,
"pid": 1234,
"tid": 1234,
"process_name": "nginx",
"function": "__x64_sys_openat",
"message": "opening file: /var/log/access.log",
"raw_args": {
"arg1": "3",
"arg2": "/var/log/access.log",
"arg3": "577"
}
}
```
## Architecture
### Core Components
1. **BCCTraceManager** (`ebpf_trace_manager.go`)
- Main orchestrator for BCC-style tracing
- Generates bpftrace scripts dynamically
- Manages trace sessions and event collection
2. **TraceSpec** - Trace specification format
```go
type TraceSpec struct {
ProbeType string // "p", "r", "t", "u"
Target string // Function/syscall to trace
Format string // Output format string
Arguments []string // Arguments to extract
Filter string // Filter conditions
Duration int // Trace duration in seconds
ProcessName string // Process filter
PID int // Process ID filter
UID int // User ID filter
}
```
3. **EventScanner** (`ebpf_event_parser.go`)
- Parses bpftrace output in real-time
- Converts raw trace data to structured events
- Handles argument extraction and enrichment
4. **TraceSpecBuilder** - Fluent API for building specs
```go
spec := NewTraceSpecBuilder().
Kprobe("__x64_sys_write").
Format("write %d bytes to fd %d", "arg3", "arg1").
Filter("arg1 == 1").
Duration(30).
Build()
```
## Usage Examples
### 1. Basic System Call Tracing
```go
// Trace file open operations
spec := TraceSpec{
ProbeType: "p",
Target: "__x64_sys_openat",
Format: "opening file: %s",
Arguments: []string{"arg2@user"},
Duration: 30,
}
traceID, err := manager.StartTrace(spec)
```
### 2. Filtered Tracing
```go
// Trace only large reads
spec := TraceSpec{
ProbeType: "p",
Target: "__x64_sys_read",
Format: "read %d bytes from fd %d",
Arguments: []string{"arg3", "arg1"},
Filter: "arg3 > 1024",
Duration: 30,
}
```
### 3. Process-Specific Tracing
```go
// Trace only nginx processes
spec := TraceSpec{
ProbeType: "p",
Target: "__x64_sys_write",
ProcessName: "nginx",
Duration: 60,
}
```
### 4. Return Value Tracing
```go
// Trace return values from file operations
spec := TraceSpec{
ProbeType: "r",
Target: "__x64_sys_openat",
Format: "open returned: %d",
Arguments: []string{"retval"},
Duration: 30,
}
```
## Integration with Agent
### API Request Format
The remote API can send trace specifications in the `ebpf_programs` field:
```json
{
"commands": [
{"id": "cmd1", "command": "ps aux"}
],
"ebpf_programs": [
{
"name": "file_monitoring",
"type": "kprobe",
"target": "sys_open",
"duration": 30,
"filters": {"process": "nginx"},
"description": "Monitor file access by nginx"
}
]
}
```
### Agent Response Format
The agent returns detailed trace results:
```json
{
"name": "__x64_sys_openat",
"type": "bcc_trace",
"target": "__x64_sys_openat",
"duration": 30,
"status": "completed",
"success": true,
"event_count": 45,
"events": [
{
"timestamp": 1234567890,
"pid": 1234,
"process_name": "nginx",
"function": "__x64_sys_openat",
"message": "opening file: /var/log/access.log",
"raw_args": {"arg1": "3", "arg2": "/var/log/access.log"}
}
],
"statistics": {
"total_events": 45,
"events_per_second": 1.5,
"top_processes": [
{"process_name": "nginx", "event_count": 30},
{"process_name": "apache", "event_count": 15}
]
}
}
```
## Test Specifications
The implementation includes test specifications for unit testing:
- **test_sys_open**: File open operations
- **test_sys_read**: Read operations with filters
- **test_sys_write**: Write operations
- **test_process_creation**: Process execution
- **test_kretprobe**: Return value tracing
- **test_with_filter**: Filtered tracing
## Running Tests
```bash
# Run all BCC tracing tests
go test -v -run TestBCCTracing
# Test trace manager capabilities
go test -v -run TestTraceManagerCapabilities
# Test syscall suggestions
go test -v -run TestSyscallSuggestions
# Run all tests
go test -v
```
## Requirements
### System Requirements
- **Linux kernel 4.4+** with eBPF support
- **bpftrace** installed (`apt install bpftrace`)
- **Root privileges** for actual tracing
### Checking Capabilities
The trace manager automatically detects capabilities:
```bash
$ go test -run TestTraceManagerCapabilities
🔧 Trace Manager Capabilities:
✅ kernel_ebpf: Available
✅ bpftrace: Available
❌ root_access: Not Available
❌ debugfs_access: Not Available
```
## Advanced Features
### 1. Syscall Suggestions
The system can suggest appropriate syscalls based on issue descriptions:
```go
suggestions := SuggestSyscallTargets("file not found error")
// Returns: ["test_sys_open", "test_sys_read", "test_sys_write", "test_sys_unlink"]
```
### 2. BCC-Style Parsing
Parse BCC trace.py style specifications:
```go
parser := NewTraceSpecParser()
spec, err := parser.ParseFromBCCStyle("sys_write (arg1 == 1) \"stdout: %d bytes\", arg3")
```
### 3. Event Filtering and Aggregation
Post-processing capabilities for trace events:
```go
filter := &TraceEventFilter{
ProcessNames: []string{"nginx", "apache"},
MinTimestamp: startTime,
}
filteredEvents := filter.ApplyFilter(events)
aggregator := NewTraceEventAggregator(events)
topProcesses := aggregator.GetTopProcesses(5)
eventRate := aggregator.GetEventRate()
```
## Performance Considerations
- **Short durations**: Test specs use 5-second durations for quick testing
- **Efficient parsing**: Event scanner processes bpftrace output in real-time
- **Memory management**: Events are processed and aggregated efficiently
- **Timeout handling**: Automatic cleanup of hanging trace sessions
## Security Considerations
- **Root privileges required**: eBPF tracing requires root access
- **Resource limits**: Maximum trace duration of 10 minutes
- **Process isolation**: Each trace runs in its own context
- **Automatic cleanup**: Traces are automatically stopped and cleaned up
## Future Enhancements
1. **USDT probe support**: Add support for user-space tracing
2. **BTF integration**: Use BPF Type Format for better type information
3. **Flame graph generation**: Generate performance flame graphs
4. **Custom eBPF programs**: Allow uploading custom eBPF bytecode
5. **Distributed tracing**: Correlation across multiple hosts
This implementation provides a solid foundation for advanced system introspection and debugging, bringing the power of BCC-style tracing to the diagnostic agent.

Makefile
@@ -1,16 +1,21 @@
-.PHONY: build run clean test install
+.PHONY: build run clean test install build-prod build-release install-system fmt lint help
+
+VERSION := 0.0.1
+BUILD_DIR := ./build
+BINARY_NAME := nannyagent

 # Build the application
 build:
-	go build -o nanny-agent .
+	go build -o $(BINARY_NAME) .

 # Run the application
 run: build
-	./nanny-agent
+	./$(BINARY_NAME)

 # Clean build artifacts
 clean:
-	rm -f nanny-agent
+	rm -f $(BINARY_NAME)
+	rm -rf $(BUILD_DIR)

 # Run tests
 test:
@@ -21,14 +26,34 @@ install:
 	go mod tidy
 	go mod download

-# Build for production with optimizations
+# Build for production with optimizations (current architecture)
 build-prod:
-	CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags '-w -s' -o nanny-agent .
+	CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo \
+		-ldflags '-w -s -X main.Version=$(VERSION)' \
+		-o $(BINARY_NAME) .
+
+# Build release binaries for both architectures
+build-release: clean
+	@echo "Building release binaries for version $(VERSION)..."
+	@mkdir -p $(BUILD_DIR)
+	@echo "Building for linux/amd64..."
+	@CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -a -installsuffix cgo \
+		-ldflags '-w -s -X main.Version=$(VERSION)' \
+		-o $(BUILD_DIR)/$(BINARY_NAME)-linux-amd64 .
+	@echo "Building for linux/arm64..."
+	@CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -a -installsuffix cgo \
+		-ldflags '-w -s -X main.Version=$(VERSION)' \
+		-o $(BUILD_DIR)/$(BINARY_NAME)-linux-arm64 .
+	@echo "Generating checksums..."
+	@cd $(BUILD_DIR) && sha256sum $(BINARY_NAME)-linux-amd64 > $(BINARY_NAME)-linux-amd64.sha256
+	@cd $(BUILD_DIR) && sha256sum $(BINARY_NAME)-linux-arm64 > $(BINARY_NAME)-linux-arm64.sha256
+	@echo "Build complete! Artifacts in $(BUILD_DIR)/"
+	@ls -lh $(BUILD_DIR)/

 # Install system-wide (requires sudo)
 install-system: build-prod
-	sudo cp nanny-agent /usr/local/bin/
-	sudo chmod +x /usr/local/bin/nanny-agent
+	sudo cp $(BINARY_NAME) /usr/local/bin/
+	sudo chmod +x /usr/local/bin/$(BINARY_NAME)

 # Format code
 fmt:
@@ -40,14 +65,18 @@ lint:

 # Show help
 help:
-	@echo "Available commands:"
-	@echo "  build          - Build the application"
-	@echo "  run            - Build and run the application"
-	@echo "  clean          - Clean build artifacts"
-	@echo "  test           - Run tests"
-	@echo "  install        - Install dependencies"
-	@echo "  build-prod     - Build for production"
-	@echo "  install-system - Install system-wide (requires sudo)"
-	@echo "  fmt            - Format code"
-	@echo "  lint           - Run linter"
-	@echo "  help           - Show this help"
+	@echo "NannyAgent Makefile - Available commands:"
+	@echo ""
+	@echo "  make build          - Build the application for current platform"
+	@echo "  make run            - Build and run the application"
+	@echo "  make clean          - Clean build artifacts"
+	@echo "  make test           - Run tests"
+	@echo "  make install        - Install Go dependencies"
+	@echo "  make build-prod     - Build for production (optimized, current arch)"
+	@echo "  make build-release  - Build release binaries for amd64 and arm64"
+	@echo "  make install-system - Install system-wide (requires sudo)"
+	@echo "  make fmt            - Format code"
+	@echo "  make lint           - Run linter"
+	@echo "  make help           - Show this help"
+	@echo ""
+	@echo "Version: $(VERSION)"

README.md (246 lines changed)

@@ -1,96 +1,135 @@
# Linux Diagnostic Agent
# NannyAgent - Linux Diagnostic Agent
A Go-based AI agent that diagnoses Linux system issues using the NannyAPI gateway with OpenAI-compatible SDK.
A Go-based AI agent that diagnoses Linux system issues using eBPF-powered deep monitoring and TensorZero AI integration.
## Features
- Interactive command-line interface for submitting system issues
- **Automatic system information gathering** - Includes OS, kernel, CPU, memory, network info
- **eBPF-powered deep system monitoring** - Advanced tracing for network, processes, files, and security events
- Integrates with NannyAPI using OpenAI-compatible Go SDK
- Executes diagnostic commands safely and collects output
- Provides step-by-step resolution plans
- **Comprehensive integration tests** with realistic Linux problem scenarios
- 🤖 **AI-Powered Diagnostics** - Intelligent issue analysis and resolution planning
- 🔍 **eBPF Deep Monitoring** - Real-time kernel-level tracing for network, processes, files, and security events
- 🛡️ **Safe Command Execution** - Validates and executes diagnostic commands with timeouts
- 📊 **Automatic System Information Gathering** - Comprehensive OS, kernel, CPU, memory, and network metrics
- 🔄 **WebSocket Integration** - Real-time communication with backend investigation system
- 🔐 **OAuth Device Flow Authentication** - Secure agent registration and authentication
- **Comprehensive Integration Tests** - Realistic Linux problem scenarios
## Setup
## Requirements
1. Clone this repository
2. Copy `.env.example` to `.env` and configure your NannyAPI endpoint:
- **Operating System**: Linux only (no containers/LXC support)
- **Architecture**: amd64 (x86_64) or arm64 (aarch64)
- **Kernel Version**: Linux kernel 5.x or higher
- **Privileges**: Root/sudo access required for eBPF functionality
- **Dependencies**: bpftrace and bpfcc-tools (automatically installed by installer)
- **Network**: Connectivity to Supabase backend
## Quick Installation
### One-Line Install (Recommended)
```bash
# Download and run the installer
curl -fsSL https://your-domain.com/install.sh | sudo bash
```
Or download first, then install:
```bash
# Download the installer
wget https://your-domain.com/install.sh
# Make it executable
chmod +x install.sh
# Run the installer
sudo ./install.sh
```
### Manual Installation
1. Clone this repository:
```bash
cp .env.example .env
git clone https://github.com/yourusername/nannyagent.git
cd nannyagent
```
3. Install dependencies:
2. Run the installer script:
```bash
go mod tidy
```
4. Build and run:
```bash
make build
./nanny-agent
sudo ./install.sh
```
The installer will:
- ✅ Verify system requirements (OS, architecture, kernel version)
- ✅ Check for existing installations
- ✅ Install eBPF tools (bpftrace, bpfcc-tools)
- ✅ Build the nannyagent binary
- ✅ Test connectivity to Supabase
- ✅ Install to `/usr/local/bin/nannyagent`
- ✅ Create configuration in `/etc/nannyagent/config.env`
- ✅ Create secure data directory `/var/lib/nannyagent`
## Configuration
The agent can be configured using environment variables:
After installation, configure your Supabase URL:
- `NANNYAPI_ENDPOINT`: The NannyAPI endpoint (default: `http://tensorzero.netcup.internal:3000/openai/v1`)
- `NANNYAPI_MODEL`: The model identifier (default: `nannyapi::function_name::diagnose_and_heal`)
```bash
# Edit the configuration file
sudo nano /etc/nannyagent/config.env
```
## Installation on Linux VM
Required configuration:
### Direct Installation
```bash
# Supabase Configuration
SUPABASE_PROJECT_URL=https://your-project.supabase.co
1. **Install Go** (if not already installed):
```bash
# For Ubuntu/Debian
sudo apt update
sudo apt install golang-go
# Optional Configuration
TOKEN_PATH=/var/lib/nannyagent/token.json
DEBUG=false
```
# For RHEL/CentOS/Fedora
sudo dnf install golang
# or
sudo yum install golang
```
## Command-Line Options
2. **Clone and build the agent**:
```bash
git clone <your-repo-url>
cd nannyagentv2
go mod tidy
make build
```
```bash
# Show version (no sudo required)
nannyagent --version
nannyagent -v
3. **Install as system service** (optional):
```bash
sudo cp nanny-agent /usr/local/bin/
sudo chmod +x /usr/local/bin/nanny-agent
```
# Show help (no sudo required)
nannyagent --help
nannyagent -h
4. **Set environment variables**:
```bash
export NANNYAPI_ENDPOINT="http://your-nannyapi-endpoint:3000/openai/v1"
export NANNYAPI_MODEL="your-model-identifier"
```
# Run the agent (requires sudo)
sudo nannyagent
```
## Usage
1. Start the agent:
1. **First-time Setup** - Authenticate the agent:
```bash
./nanny-agent
sudo nannyagent
```
2. Enter a system issue description when prompted:
The agent will display a verification URL and code. Visit the URL and enter the code to authorize the agent.
2. **Interactive Diagnostics** - After authentication, enter system issues:
```
> On /var filesystem I cannot create any file but df -h shows 30% free space available.
```
3. The agent will:
- Send the issue to the AI via NannyAPI using OpenAI SDK
- Execute diagnostic commands as suggested by the AI
- Provide command outputs back to the AI
- Display the final diagnosis and resolution plan
3. **The agent will**:
- Gather comprehensive system information automatically
- Send the issue to AI for analysis via TensorZero
- Execute diagnostic commands safely
- Run eBPF traces for deep kernel-level monitoring
- Provide AI-generated root cause analysis and resolution plan
4. Type `quit` or `exit` to stop the agent
4. **Exit the agent**:
```
> quit
```
or
```
> exit
```
## How It Works
@@ -119,14 +158,87 @@ The agent includes comprehensive integration tests that simulate realistic Linux
### Run Integration Tests:
```bash
# Interactive test scenarios
./test-examples.sh
# Run unit tests
make test
# Automated integration tests
./integration-tests.sh
# Run integration tests
./tests/test_ebpf_integration.sh
```
# Function discovery (find valid NannyAPI functions)
./discover-functions.sh
## Installation Exit Codes
The installer uses specific exit codes for different failure scenarios:
| Exit Code | Description |
|-----------|-------------|
| 0 | Success |
| 1 | Not running as root |
| 2 | Unsupported operating system (non-Linux) |
| 3 | Unsupported architecture (not amd64/arm64) |
| 4 | Container/LXC environment detected |
| 5 | Kernel version < 5.x |
| 6 | Existing installation detected |
| 7 | eBPF tools installation failed |
| 8 | Go not installed |
| 9 | Binary build failed |
| 10 | Directory creation failed |
| 11 | Binary installation failed |
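When scripting the installer, the table above can be mapped to human-readable messages. A hypothetical helper (not part of the installer itself):

```go
package main

import "fmt"

// reason maps the installer exit codes from the table above to messages.
func reason(code int) string {
	msgs := map[int]string{
		0:  "success",
		1:  "not running as root",
		2:  "unsupported operating system (non-Linux)",
		3:  "unsupported architecture (not amd64/arm64)",
		4:  "container/LXC environment detected",
		5:  "kernel version < 5.x",
		6:  "existing installation detected",
		7:  "eBPF tools installation failed",
		8:  "Go not installed",
		9:  "binary build failed",
		10: "directory creation failed",
		11: "binary installation failed",
	}
	if m, ok := msgs[code]; ok {
		return m
	}
	return "unknown exit code"
}

func main() {
	fmt.Println(reason(5))
	// prints: kernel version < 5.x
}
```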
## Troubleshooting
### Installation Issues
**Error: "Kernel version X.X is not supported"**
- NannyAgent requires Linux kernel 5.x or higher
- Upgrade your kernel or use a different system
**Error: "Another instance may already be installed"**
- Check if `/var/lib/nannyagent` exists
- Remove it if you're sure: `sudo rm -rf /var/lib/nannyagent`
- Then retry installation
**Warning: "Cannot connect to Supabase"**
- Check your network connectivity
- Verify firewall settings allow HTTPS connections
- Ensure SUPABASE_PROJECT_URL is correctly configured in `/etc/nannyagent/config.env`
### Runtime Issues
**Error: "This program must be run as root"**
- eBPF requires root privileges
- Always run with: `sudo nannyagent`
**Error: "Cannot determine kernel version"**
- Ensure `uname` command is available
- Check system integrity
## Development
### Building from Source
```bash
# Clone repository
git clone https://github.com/yourusername/nannyagent.git
cd nannyagent
# Install Go dependencies
go mod tidy
# Build binary
make build
# Run locally (requires sudo)
sudo ./nannyagent
```
### Running Tests
```bash
# Run unit tests
make test
# Test eBPF capabilities
./tests/test_ebpf_integration.sh
```
## Safety

agent.go (480 lines changed)

@@ -2,99 +2,113 @@ package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"strings"
"time"
"nannyagentv2/internal/ebpf"
"nannyagentv2/internal/executor"
"nannyagentv2/internal/logging"
"nannyagentv2/internal/system"
"nannyagentv2/internal/types"
"github.com/sashabaranov/go-openai"
)
// DiagnosticResponse represents the diagnostic phase response from AI
type DiagnosticResponse struct {
ResponseType string `json:"response_type"`
Reasoning string `json:"reasoning"`
Commands []Command `json:"commands"`
// AgentConfig holds configuration for concurrent execution (local to agent)
type AgentConfig struct {
MaxConcurrentTasks int `json:"max_concurrent_tasks"`
CollectiveResults bool `json:"collective_results"`
}
// ResolutionResponse represents the resolution phase response from AI
type ResolutionResponse struct {
ResponseType string `json:"response_type"`
RootCause string `json:"root_cause"`
ResolutionPlan string `json:"resolution_plan"`
Confidence string `json:"confidence"`
// DefaultAgentConfig returns default configuration
func DefaultAgentConfig() *AgentConfig {
return &AgentConfig{
MaxConcurrentTasks: 10, // Default to 10 concurrent forks
CollectiveResults: true, // Send results collectively when all finish
}
}
// Command represents a command to be executed
type Command struct {
ID string `json:"id"`
Command string `json:"command"`
Description string `json:"description"`
}
//
// LinuxDiagnosticAgent represents the main diagnostic agent
// CommandResult represents the result of executing a command
type CommandResult struct {
ID string `json:"id"`
Command string `json:"command"`
Output string `json:"output"`
ExitCode int `json:"exit_code"`
Error string `json:"error,omitempty"`
}
// LinuxDiagnosticAgent represents the main agent
// LinuxDiagnosticAgent represents the main diagnostic agent
type LinuxDiagnosticAgent struct {
client *openai.Client
model string
executor *CommandExecutor
episodeID string // TensorZero episode ID for conversation continuity
ebpfManager EBPFManagerInterface // eBPF monitoring capabilities
executor *executor.CommandExecutor
episodeID string // TensorZero episode ID for conversation continuity
ebpfManager *ebpf.BCCTraceManager // eBPF tracing manager
config *AgentConfig // Configuration for concurrent execution
authManager interface{} // Authentication manager for TensorZero requests
logger *logging.Logger
}
// NewLinuxDiagnosticAgent creates a new diagnostic agent
func NewLinuxDiagnosticAgent() *LinuxDiagnosticAgent {
endpoint := os.Getenv("NANNYAPI_ENDPOINT")
if endpoint == "" {
// Default endpoint - OpenAI SDK will append /chat/completions automatically
endpoint = "http://tensorzero.netcup.internal:3000/openai/v1"
// Get Supabase project URL for TensorZero proxy
supabaseURL := os.Getenv("SUPABASE_PROJECT_URL")
if supabaseURL == "" {
logging.Warning("SUPABASE_PROJECT_URL not set, TensorZero integration will not work")
}
model := os.Getenv("NANNYAPI_MODEL")
if model == "" {
model = "tensorzero::function_name::diagnose_and_heal"
fmt.Printf("Warning: Using default model '%s'. Set NANNYAPI_MODEL environment variable for your specific function.\n", model)
}
// Create OpenAI client with custom base URL
// Note: The OpenAI SDK automatically appends "/chat/completions" to the base URL
config := openai.DefaultConfig("")
config.BaseURL = endpoint
client := openai.NewClientWithConfig(config)
// Default model for diagnostic and healing
model := "tensorzero::function_name::diagnose_and_heal"
agent := &LinuxDiagnosticAgent{
client: client,
client: nil, // Not used - we use direct HTTP to Supabase proxy
model: model,
executor: NewCommandExecutor(10 * time.Second), // 10 second timeout for commands
executor: executor.NewCommandExecutor(10 * time.Second), // 10 second timeout for commands
config: DefaultAgentConfig(), // Default concurrent execution config
}
// Initialize eBPF capabilities
agent.ebpfManager = NewCiliumEBPFManager()
// Initialize eBPF manager
agent.ebpfManager = ebpf.NewBCCTraceManager()
agent.logger = logging.NewLogger()
return agent
}
// NewLinuxDiagnosticAgentWithAuth creates a new diagnostic agent with authentication
func NewLinuxDiagnosticAgentWithAuth(authManager interface{}) *LinuxDiagnosticAgent {
// Get Supabase project URL for TensorZero proxy
supabaseURL := os.Getenv("SUPABASE_PROJECT_URL")
if supabaseURL == "" {
logging.Warning("SUPABASE_PROJECT_URL not set, TensorZero integration will not work")
}
// Default model for diagnostic and healing
model := "tensorzero::function_name::diagnose_and_heal"
agent := &LinuxDiagnosticAgent{
client: nil, // Not used - we use direct HTTP to Supabase proxy
model: model,
executor: executor.NewCommandExecutor(10 * time.Second), // 10 second timeout for commands
config: DefaultAgentConfig(), // Default concurrent execution config
authManager: authManager, // Store auth manager for TensorZero requests
}
// Initialize eBPF manager
agent.ebpfManager = ebpf.NewBCCTraceManager()
agent.logger = logging.NewLogger()
return agent
}
// DiagnoseIssue starts the diagnostic process for a given issue
func (a *LinuxDiagnosticAgent) DiagnoseIssue(issue string) error {
fmt.Printf("Diagnosing issue: %s\n", issue)
fmt.Println("Gathering system information...")
logging.Info("Diagnosing issue: %s", issue)
logging.Info("Gathering system information...")
// Gather system information
systemInfo := GatherSystemInfo()
systemInfo := system.GatherSystemInfo()
// Format the initial prompt with system information
initialPrompt := FormatSystemInfoForPrompt(systemInfo) + "\n" + issue
initialPrompt := system.FormatSystemInfoForPrompt(systemInfo) + "\n" + issue
// Start conversation with initial issue including system info
messages := []openai.ChatCompletionMessage{
@@ -106,7 +120,7 @@ func (a *LinuxDiagnosticAgent) DiagnoseIssue(issue string) error {
for {
// Send request to TensorZero API via OpenAI SDK
response, err := a.sendRequest(messages)
response, err := a.SendRequestWithEpisode(messages, a.episodeID)
if err != nil {
return fmt.Errorf("failed to send request: %w", err)
}
@@ -116,37 +130,80 @@ func (a *LinuxDiagnosticAgent) DiagnoseIssue(issue string) error {
}
content := response.Choices[0].Message.Content
fmt.Printf("\nAI Response:\n%s\n", content)
logging.Debug("AI Response: %s", content)
// Parse the response to determine next action
var diagnosticResp DiagnosticResponse
var resolutionResp ResolutionResponse
var diagnosticResp types.EBPFEnhancedDiagnosticResponse
var resolutionResp types.ResolutionResponse
// Try to parse as diagnostic response first
// Try to parse as diagnostic response first (with eBPF support)
logging.Debug("Attempting to parse response as diagnostic...")
if err := json.Unmarshal([]byte(content), &diagnosticResp); err == nil && diagnosticResp.ResponseType == "diagnostic" {
logging.Debug("Successfully parsed as diagnostic response with %d commands", len(diagnosticResp.Commands))
// Handle diagnostic phase
fmt.Printf("\nReasoning: %s\n", diagnosticResp.Reasoning)
if len(diagnosticResp.Commands) == 0 {
fmt.Println("No commands to execute in diagnostic phase")
break
}
logging.Debug("Reasoning: %s", diagnosticResp.Reasoning)
// Execute commands and collect results
commandResults := make([]CommandResult, 0, len(diagnosticResp.Commands))
for _, cmd := range diagnosticResp.Commands {
fmt.Printf("\nExecuting command '%s': %s\n", cmd.ID, cmd.Command)
result := a.executor.Execute(cmd)
commandResults = append(commandResults, result)
commandResults := make([]types.CommandResult, 0, len(diagnosticResp.Commands))
if len(diagnosticResp.Commands) > 0 {
logging.Info("Executing %d diagnostic commands", len(diagnosticResp.Commands))
for i, cmdStr := range diagnosticResp.Commands {
// Convert string command to Command struct (auto-generate ID and description)
cmd := types.Command{
ID: fmt.Sprintf("cmd_%d", i+1),
Command: cmdStr,
Description: fmt.Sprintf("Diagnostic command: %s", cmdStr),
}
result := a.executor.Execute(cmd)
commandResults = append(commandResults, result)
fmt.Printf("Output:\n%s\n", result.Output)
if result.Error != "" {
fmt.Printf("Error: %s\n", result.Error)
if result.ExitCode != 0 {
logging.Warning("Command '%s' failed with exit code %d", cmd.ID, result.ExitCode)
}
}
}
// Prepare command results as user message
resultsJSON, err := json.MarshalIndent(commandResults, "", " ")
// Execute eBPF programs if present - support both old and new formats
var ebpfResults []map[string]interface{}
if len(diagnosticResp.EBPFPrograms) > 0 {
logging.Info("AI requested %d eBPF traces for enhanced diagnostics", len(diagnosticResp.EBPFPrograms))
// Convert EBPFPrograms to TraceSpecs and execute concurrently using the eBPF service
traceSpecs := a.ConvertEBPFProgramsToTraceSpecs(diagnosticResp.EBPFPrograms)
ebpfResults = a.ExecuteEBPFTraces(traceSpecs)
}
// Prepare combined results as user message
allResults := map[string]interface{}{
"command_results": commandResults,
"executed_commands": len(commandResults),
}
// Include eBPF results if any were executed
if len(ebpfResults) > 0 {
allResults["ebpf_results"] = ebpfResults
allResults["executed_ebpf_programs"] = len(ebpfResults)
// Extract evidence summary for TensorZero
evidenceSummary := make([]string, 0)
for _, result := range ebpfResults {
target := result["target"]
eventCount := result["event_count"]
summary := result["summary"]
success := result["success"]
status := "failed"
if success == true {
status = "success"
}
summaryStr := fmt.Sprintf("%s: %v events (%s) - %s", target, eventCount, status, summary)
evidenceSummary = append(evidenceSummary, summaryStr)
}
allResults["ebpf_evidence_summary"] = evidenceSummary
}
resultsJSON, err := json.MarshalIndent(allResults, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal command results: %w", err)
}
@@ -162,87 +219,97 @@ func (a *LinuxDiagnosticAgent) DiagnoseIssue(issue string) error {
})
continue
} else {
logging.Debug("Failed to parse as diagnostic. Error: %v, ResponseType: '%s'", err, diagnosticResp.ResponseType)
}
// Try to parse as resolution response
if err := json.Unmarshal([]byte(content), &resolutionResp); err == nil && resolutionResp.ResponseType == "resolution" {
// Handle resolution phase
fmt.Printf("\n=== DIAGNOSIS COMPLETE ===\n")
fmt.Printf("Root Cause: %s\n", resolutionResp.RootCause)
fmt.Printf("Resolution Plan: %s\n", resolutionResp.ResolutionPlan)
fmt.Printf("Confidence: %s\n", resolutionResp.Confidence)
logging.Info("=== DIAGNOSIS COMPLETE ===")
logging.Info("Root Cause: %s", resolutionResp.RootCause)
logging.Info("Resolution Plan: %s", resolutionResp.ResolutionPlan)
logging.Info("Confidence: %s", resolutionResp.Confidence)
break
}
// If we can't parse the response, treat it as an error or unexpected format
fmt.Printf("Unexpected response format or error from AI:\n%s\n", content)
logging.Error("Unexpected response format or error from AI: %s", content)
break
}
return nil
}
// TensorZeroRequest represents a request structure compatible with TensorZero's episode_id
type TensorZeroRequest struct {
Model string `json:"model"`
Messages []openai.ChatCompletionMessage `json:"messages"`
EpisodeID string `json:"tensorzero::episode_id,omitempty"`
// sendRequest sends a request to TensorZero via Supabase proxy (without episode ID)
func (a *LinuxDiagnosticAgent) SendRequest(messages []openai.ChatCompletionMessage) (*openai.ChatCompletionResponse, error) {
return a.SendRequestWithEpisode(messages, "")
}
// TensorZeroResponse represents TensorZero's response with episode_id
type TensorZeroResponse struct {
openai.ChatCompletionResponse
EpisodeID string `json:"episode_id"`
// ExecuteCommand executes a command using the agent's executor
func (a *LinuxDiagnosticAgent) ExecuteCommand(cmd types.Command) types.CommandResult {
return a.executor.Execute(cmd)
}
// sendRequest sends a request to the TensorZero API with tensorzero::episode_id support
func (a *LinuxDiagnosticAgent) sendRequest(messages []openai.ChatCompletionMessage) (*openai.ChatCompletionResponse, error) {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Create TensorZero-compatible request
tzRequest := TensorZeroRequest{
Model: a.model,
Messages: messages,
// sendRequestWithEpisode sends a request to TensorZero via Supabase proxy with episode ID for conversation continuity
func (a *LinuxDiagnosticAgent) SendRequestWithEpisode(messages []openai.ChatCompletionMessage, episodeID string) (*openai.ChatCompletionResponse, error) {
// Convert messages to the expected format
messageMaps := make([]map[string]interface{}, len(messages))
for i, msg := range messages {
messageMaps[i] = map[string]interface{}{
"role": msg.Role,
"content": msg.Content,
}
}
// Include tensorzero::episode_id for conversation continuity (if we have one)
if a.episodeID != "" {
tzRequest.EpisodeID = a.episodeID
// Create TensorZero request
tzRequest := map[string]interface{}{
"model": a.model,
"messages": messageMaps,
}
fmt.Printf("Debug: Sending request to model: %s", a.model)
if a.episodeID != "" {
fmt.Printf(" (episode: %s)", a.episodeID)
// Add episode ID if provided
if episodeID != "" {
tzRequest["tensorzero::episode_id"] = episodeID
}
fmt.Println()
// Marshal the request
// Marshal request
requestBody, err := json.Marshal(tzRequest)
if err != nil {
return nil, fmt.Errorf("failed to marshal request: %w", err)
}
// Create HTTP request
endpoint := os.Getenv("NANNYAPI_ENDPOINT")
if endpoint == "" {
endpoint = "http://tensorzero.netcup.internal:3000/openai/v1"
// Get Supabase URL
supabaseURL := os.Getenv("SUPABASE_PROJECT_URL")
if supabaseURL == "" {
return nil, fmt.Errorf("SUPABASE_PROJECT_URL not set")
}
// Ensure the endpoint ends with /chat/completions
if endpoint[len(endpoint)-1] != '/' {
endpoint += "/"
}
endpoint += "chat/completions"
req, err := http.NewRequestWithContext(ctx, "POST", endpoint, bytes.NewBuffer(requestBody))
// Create HTTP request to TensorZero proxy (includes OpenAI-compatible path)
endpoint := fmt.Sprintf("%s/functions/v1/tensorzero-proxy/openai/v1/chat/completions", supabaseURL)
logging.Debug("Calling TensorZero proxy at: %s", endpoint)
req, err := http.NewRequest("POST", endpoint, bytes.NewBuffer(requestBody))
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
// Set headers
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Accept", "application/json")
// Make the request
// Add authentication if auth manager is available (same pattern as investigation_server.go)
if a.authManager != nil {
// The authManager should be *auth.AuthManager, so let's use the exact same pattern
if authMgr, ok := a.authManager.(interface {
LoadToken() (*types.AuthToken, error)
}); ok {
if authToken, err := authMgr.LoadToken(); err == nil && authToken != nil {
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", authToken.AccessToken))
}
}
}
// Send request
client := &http.Client{Timeout: 30 * time.Second}
resp, err := client.Do(req)
if err != nil {
@@ -250,27 +317,174 @@ func (a *LinuxDiagnosticAgent) sendRequest(messages []openai.ChatCompletionMessa
}
defer resp.Body.Close()
// Read response body
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read response: %w", err)
// Check status code
if resp.StatusCode != 200 {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("TensorZero proxy error: %d, body: %s", resp.StatusCode, string(body))
}
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("API request failed with status %d: %s", resp.StatusCode, string(body))
// Parse response
var tzResponse map[string]interface{}
if err := json.NewDecoder(resp.Body).Decode(&tzResponse); err != nil {
return nil, fmt.Errorf("failed to decode response: %w", err)
}
// Parse TensorZero response
var tzResponse TensorZeroResponse
if err := json.Unmarshal(body, &tzResponse); err != nil {
return nil, fmt.Errorf("failed to unmarshal response: %w", err)
// Convert to OpenAI format for compatibility
choices, ok := tzResponse["choices"].([]interface{})
if !ok || len(choices) == 0 {
return nil, fmt.Errorf("no choices in response")
}
// Extract episode_id from first response
if a.episodeID == "" && tzResponse.EpisodeID != "" {
a.episodeID = tzResponse.EpisodeID
fmt.Printf("Debug: Extracted episode ID: %s\n", a.episodeID)
// Extract the first choice
firstChoice, ok := choices[0].(map[string]interface{})
if !ok {
return nil, fmt.Errorf("invalid choice format")
}
return &tzResponse.ChatCompletionResponse, nil
message, ok := firstChoice["message"].(map[string]interface{})
if !ok {
return nil, fmt.Errorf("invalid message format")
}
content, ok := message["content"].(string)
if !ok {
return nil, fmt.Errorf("invalid content format")
}
// Create OpenAI-compatible response
response := &openai.ChatCompletionResponse{
Choices: []openai.ChatCompletionChoice{
{
Message: openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleAssistant,
Content: content,
},
},
},
}
// Update episode ID if provided in response
if respEpisodeID, ok := tzResponse["episode_id"].(string); ok && respEpisodeID != "" {
a.episodeID = respEpisodeID
}
return response, nil
}
// ConvertEBPFProgramsToTraceSpecs converts old EBPFProgram format to new TraceSpec format
func (a *LinuxDiagnosticAgent) ConvertEBPFProgramsToTraceSpecs(ebpfPrograms []types.EBPFRequest) []ebpf.TraceSpec {
var traceSpecs []ebpf.TraceSpec
for _, prog := range ebpfPrograms {
spec := a.convertToTraceSpec(prog)
traceSpecs = append(traceSpecs, spec)
}
return traceSpecs
}
// convertToTraceSpec converts an EBPFRequest to a TraceSpec for BCC-style tracing
func (a *LinuxDiagnosticAgent) convertToTraceSpec(prog types.EBPFRequest) ebpf.TraceSpec {
// Determine probe type based on target and type
probeType := "p" // default to kprobe
target := prog.Target
if strings.HasPrefix(target, "tracepoint:") {
probeType = "t"
target = strings.TrimPrefix(target, "tracepoint:")
} else if strings.HasPrefix(target, "kprobe:") {
probeType = "p"
target = strings.TrimPrefix(target, "kprobe:")
} else if prog.Type == "tracepoint" {
probeType = "t"
} else if prog.Type == "syscall" {
// Convert syscall names to kprobe targets
if !strings.HasPrefix(target, "__x64_sys_") && !strings.Contains(target, ":") {
if strings.HasPrefix(target, "sys_") {
target = "__x64_" + target
} else {
target = "__x64_sys_" + target
}
}
probeType = "p"
}
// Set default duration if not specified
duration := prog.Duration
if duration <= 0 {
duration = 5 // default 5 seconds
}
return ebpf.TraceSpec{
ProbeType: probeType,
Target: target,
Format: prog.Description, // Use description as format
Arguments: []string{}, // Start with no arguments for compatibility
Duration: duration,
UID: -1, // No UID filter (don't default to 0 which means root only)
}
}
// ExecuteEBPFTraces executes multiple eBPF traces using the eBPF service
func (a *LinuxDiagnosticAgent) ExecuteEBPFTraces(traceSpecs []ebpf.TraceSpec) []map[string]interface{} {
if len(traceSpecs) == 0 {
return []map[string]interface{}{}
}
a.logger.Info("Executing %d eBPF traces", len(traceSpecs))
results := make([]map[string]interface{}, 0, len(traceSpecs))
// Execute each trace using the eBPF manager
for i, spec := range traceSpecs {
a.logger.Debug("Starting trace %d: %s", i, spec.Target)
// Start the trace
traceID, err := a.ebpfManager.StartTrace(spec)
if err != nil {
a.logger.Error("Failed to start trace %d: %v", i, err)
result := map[string]interface{}{
"index": i,
"target": spec.Target,
"success": false,
"error": err.Error(),
}
results = append(results, result)
continue
}
// Wait for the trace duration
time.Sleep(time.Duration(spec.Duration) * time.Second)
// Get the trace result
traceResult, err := a.ebpfManager.GetTraceResult(traceID)
if err != nil {
a.logger.Error("Failed to get results for trace %d: %v", i, err)
result := map[string]interface{}{
"index": i,
"target": spec.Target,
"success": false,
"error": err.Error(),
}
results = append(results, result)
continue
}
// Build successful result
result := map[string]interface{}{
"index": i,
"target": spec.Target,
"success": true,
"event_count": traceResult.EventCount,
"events_per_second": traceResult.Statistics.EventsPerSecond,
"duration": traceResult.EndTime.Sub(traceResult.StartTime).Seconds(),
"summary": traceResult.Summary,
}
results = append(results, result)
a.logger.Debug("Completed trace %d: %d events", i, traceResult.EventCount)
}
a.logger.Info("Completed %d eBPF traces", len(results))
return results
}


@@ -1,107 +0,0 @@
package main
import (
"testing"
"time"
)
func TestCommandExecutor_ValidateCommand(t *testing.T) {
executor := NewCommandExecutor(5 * time.Second)
tests := []struct {
name string
command string
wantErr bool
}{
{
name: "safe command - ls",
command: "ls -la /var",
wantErr: false,
},
{
name: "safe command - df",
command: "df -h",
wantErr: false,
},
{
name: "safe command - ps",
command: "ps aux | grep nginx",
wantErr: false,
},
{
name: "dangerous command - rm",
command: "rm -rf /tmp/*",
wantErr: true,
},
{
name: "dangerous command - dd",
command: "dd if=/dev/zero of=/dev/sda",
wantErr: true,
},
{
name: "dangerous command - sudo",
command: "sudo systemctl stop nginx",
wantErr: true,
},
{
name: "dangerous command - redirection",
command: "echo 'test' > /etc/passwd",
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := executor.validateCommand(tt.command)
if (err != nil) != tt.wantErr {
t.Errorf("validateCommand() error = %v, wantErr %v", err, tt.wantErr)
}
})
}
}
func TestCommandExecutor_Execute(t *testing.T) {
executor := NewCommandExecutor(5 * time.Second)
// Test safe command execution
cmd := Command{
ID: "test_echo",
Command: "echo 'Hello, World!'",
Description: "Test echo command",
}
result := executor.Execute(cmd)
if result.ExitCode != 0 {
t.Errorf("Expected exit code 0, got %d", result.ExitCode)
}
if result.Output != "Hello, World!\n" {
t.Errorf("Expected 'Hello, World!\\n', got '%s'", result.Output)
}
if result.Error != "" {
t.Errorf("Expected no error, got '%s'", result.Error)
}
}
func TestCommandExecutor_ExecuteUnsafeCommand(t *testing.T) {
executor := NewCommandExecutor(5 * time.Second)
// Test unsafe command rejection
cmd := Command{
ID: "test_rm",
Command: "rm -rf /tmp/test",
Description: "Dangerous rm command",
}
result := executor.Execute(cmd)
if result.ExitCode != 1 {
t.Errorf("Expected exit code 1 for unsafe command, got %d", result.ExitCode)
}
if result.Error == "" {
t.Error("Expected error for unsafe command, got none")
}
}


@@ -1,141 +0,0 @@
#!/bin/bash
# Test the eBPF-enhanced NannyAgent
# This script demonstrates the new eBPF integration capabilities
set -e
echo "🔬 Testing eBPF-Enhanced NannyAgent"
echo "=================================="
echo ""
AGENT="./nannyagent-ebpf"
if [ ! -f "$AGENT" ]; then
echo "Building agent..."
go build -o nannyagent-ebpf .
fi
echo "1. Checking eBPF Capabilities"
echo "-----------------------------"
./ebpf_helper.sh check
echo ""
echo "2. Testing eBPF Manager Initialization"
echo "-------------------------------------"
echo "Starting agent in test mode..."
echo ""
# Create a test script that will send a predefined issue to test eBPF
cat > /tmp/test_ebpf_issue.txt << 'EOF'
Network connection timeouts to external services. Applications report intermittent failures when trying to connect to remote APIs. The issue occurs randomly and affects multiple processes.
EOF
echo "Test Issue: Network connection timeouts"
echo "Expected eBPF Programs: Network tracing, syscall monitoring"
echo ""
echo "3. Demonstration of eBPF Program Suggestions"
echo "-------------------------------------------"
# Show what eBPF programs would be suggested for different issues
echo "For NETWORK issues - Expected eBPF programs:"
echo "- tracepoint:syscalls/sys_enter_connect (network connections)"
echo "- kprobe:tcp_connect (TCP connection attempts)"
echo "- kprobe:tcp_sendmsg (network send operations)"
echo ""
echo "For PROCESS issues - Expected eBPF programs:"
echo "- tracepoint:syscalls/sys_enter_execve (process execution)"
echo "- tracepoint:sched/sched_process_exit (process termination)"
echo "- kprobe:do_fork (process creation)"
echo ""
echo "For FILE issues - Expected eBPF programs:"
echo "- tracepoint:syscalls/sys_enter_openat (file opens)"
echo "- kprobe:vfs_read (file reads)"
echo "- kprobe:vfs_write (file writes)"
echo ""
echo "For PERFORMANCE issues - Expected eBPF programs:"
echo "- tracepoint:syscalls/sys_enter_* (syscall frequency analysis)"
echo "- kprobe:schedule (CPU scheduling events)"
echo ""
echo "4. eBPF Integration Features"
echo "---------------------------"
echo "✓ Cilium eBPF library integration"
echo "✓ bpftrace-based program execution"
echo "✓ Dynamic program generation based on issue type"
echo "✓ Parallel execution with regular diagnostic commands"
echo "✓ Structured JSON event collection"
echo "✓ AI-driven eBPF program selection"
echo ""
echo "5. Example AI Response with eBPF"
echo "-------------------------------"
cat << 'EOF'
{
"response_type": "diagnostic",
"reasoning": "Network timeout issues require monitoring TCP connections and system calls to identify bottlenecks",
"commands": [
{"id": "net_status", "command": "ss -tulpn", "description": "Current network connections"},
{"id": "net_config", "command": "ip route show", "description": "Network configuration"}
],
"ebpf_programs": [
{
"name": "tcp_connect_monitor",
"type": "kprobe",
"target": "tcp_connect",
"duration": 15,
"description": "Monitor TCP connection attempts"
},
{
"name": "syscall_network",
"type": "tracepoint",
"target": "syscalls/sys_enter_connect",
"duration": 15,
"filters": {"comm": "curl"},
"description": "Monitor network-related system calls"
}
]
}
EOF
echo ""
echo "6. Security and Safety"
echo "--------------------"
echo "✓ eBPF programs are read-only and time-limited"
echo "✓ No system modification capabilities"
echo "✓ Automatic cleanup after execution"
echo "✓ Safe execution in containers and restricted environments"
echo "✓ Graceful fallback when eBPF is not available"
echo ""
echo "7. Next Steps"
echo "------------"
echo "To test the full eBPF integration:"
echo ""
echo "a) Run with root privileges for full eBPF access:"
echo " sudo $AGENT"
echo ""
echo "b) Try these test scenarios:"
echo " - 'Network connection timeouts'"
echo " - 'High CPU usage and slow performance'"
echo " - 'File permission errors'"
echo " - 'Process hanging or not responding'"
echo ""
echo "c) Install additional eBPF tools:"
echo " sudo ./ebpf_helper.sh install"
echo ""
echo "🎯 eBPF Integration Complete!"
echo ""
echo "The agent now supports:"
echo "- Dynamic eBPF program compilation and execution"
echo "- AI-driven selection of appropriate tracepoints and kprobes"
echo "- Real-time system event monitoring during diagnosis"
echo "- Integration with Cilium eBPF library for professional-grade monitoring"
echo ""
echo "This provides unprecedented visibility into system behavior"
echo "for accurate root cause analysis and issue resolution."


@@ -1,51 +0,0 @@
#!/bin/bash
# NannyAPI Function Discovery Script
# This script helps you find the correct function name for your NannyAPI setup
echo "🔍 NannyAPI Function Discovery"
echo "=============================="
echo ""
ENDPOINT="${NANNYAPI_ENDPOINT:-http://tensorzero.netcup.internal:3000/openai/v1}"
echo "Testing endpoint: $ENDPOINT/chat/completions"
echo ""
# Test common function name patterns
test_functions=(
"nannyapi::function_name::diagnose"
"nannyapi::function_name::diagnose_and_heal"
"nannyapi::function_name::linux_diagnostic"
"nannyapi::function_name::system_diagnostic"
"nannyapi::model_name::gpt-4"
"nannyapi::model_name::claude"
)
for func in "${test_functions[@]}"; do
echo "Testing function: $func"
response=$(curl -s -X POST "$ENDPOINT/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\":\"$func\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}]}")
if echo "$response" | grep -q "Unknown function"; then
echo " ❌ Function not found"
elif echo "$response" | grep -q "error"; then
echo " ⚠️ Error: $(echo "$response" | jq -r '.error' 2>/dev/null || echo "$response")"
else
echo " ✅ Function exists and responding!"
echo " Use this in your environment: export NANNYAPI_MODEL=\"$func\""
fi
echo ""
done
echo "💡 If none of the above work, check your NannyAPI configuration file"
echo " for the correct function names and update NANNYAPI_MODEL accordingly."
echo ""
echo "Example NannyAPI config snippet:"
echo "```yaml"
echo "functions:"
echo " diagnose_and_heal: # This becomes 'nannyapi::function_name::diagnose_and_heal'"
echo " # function definition"
echo "```"

docs/INSTALLATION.md Normal file

@@ -0,0 +1,334 @@
# NannyAgent Installation Guide
## Quick Install
### One-Line Install (Recommended)
After uploading `install.sh` to your website:
```bash
curl -fsSL https://your-domain.com/install.sh | sudo bash
```
Or with wget:
```bash
wget -qO- https://your-domain.com/install.sh | sudo bash
```
### Two-Step Install (More Secure)
Download and inspect the installer first:
```bash
# Download the installer
curl -fsSL https://your-domain.com/install.sh -o install.sh
# Inspect the script (recommended!)
less install.sh
# Make it executable
chmod +x install.sh
# Run the installer
sudo ./install.sh
```
## Installation from GitHub
If you're hosting on GitHub:
```bash
curl -fsSL https://raw.githubusercontent.com/yourusername/nannyagent/main/install.sh | sudo bash
```
## System Requirements
Before installing, ensure your system meets these requirements:
### Operating System
- ✅ Linux (any distribution)
- ❌ Windows (not supported)
- ❌ macOS (not supported)
- ❌ Containers/Docker (not supported)
- ❌ LXC (not supported)
### Architecture
- ✅ amd64 (x86_64)
- ✅ arm64 (aarch64)
- ❌ i386/i686 (32-bit not supported)
- ❌ Other architectures (not supported)
### Kernel Version
- ✅ Linux kernel 5.x or higher
- ❌ Linux kernel 4.x or lower (not supported)
Check your kernel version:
```bash
uname -r
# Should show 5.x.x or higher
```
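If you want to script this check (for example, in a pre-install gate), the sketch below compares only the major version and reuses the installer's exit code 5 for "kernel too old":

```bash
#!/bin/sh
# Extract the major version from a kernel release string,
# e.g. "5.15.0-91-generic" -> "5".
kernel_major() {
    echo "$1" | cut -d. -f1
}

if [ "$(kernel_major "$(uname -r)")" -ge 5 ]; then
    echo "Kernel $(uname -r) is supported"
else
    echo "Kernel $(uname -r) is too old (need 5.x+)" >&2
    exit 5  # matches the installer's "kernel too old" exit code
fi
```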
### Privileges
- Must have root/sudo access
- Will create system directories:
- `/usr/local/bin/nannyagent` (binary)
- `/etc/nannyagent` (configuration)
- `/var/lib/nannyagent` (data directory)
### Network
- Connectivity to Supabase backend required
- HTTPS access to your Supabase project URL
- No proxy support at this time
## What the Installer Does
The installer performs these steps automatically:
1. **System Checks**
- Verifies root privileges
- Detects OS and architecture
- Checks kernel version (5.x+)
- Detects container environments
- Checks for existing installations
2. **Dependency Installation**
- Installs `bpftrace` (eBPF tracing tool)
- Installs `bpfcc-tools` (BCC toolkit)
- Installs kernel headers if needed
- Uses your system's package manager (apt/dnf/yum)
3. **Build & Install**
- Verifies Go installation (required for building)
- Compiles the nannyagent binary
- Tests connectivity to Supabase
- Installs binary to `/usr/local/bin`
4. **Configuration**
- Creates `/etc/nannyagent/config.env`
- Creates `/var/lib/nannyagent` data directory
- Sets proper permissions (secure)
- Creates installation lock file
## Installation Exit Codes
The installer exits with specific codes for different scenarios:
| Exit Code | Meaning | Resolution |
|-----------|---------|------------|
| 0 | Success | Installation completed |
| 1 | Not root | Run with `sudo` |
| 2 | Unsupported OS | Use Linux |
| 3 | Unsupported architecture | Use amd64 or arm64 |
| 4 | Container detected | Install on bare metal or VM |
| 5 | Kernel too old | Upgrade to kernel 5.x+ |
| 6 | Existing installation | Remove `/var/lib/nannyagent` first |
| 7 | eBPF tools failed | Check package manager and repos |
| 8 | Go not installed | Install Go from golang.org |
| 9 | Build failed | Check Go installation and dependencies |
| 10 | Directory creation failed | Check permissions |
| 11 | Binary installation failed | Check disk space and permissions |
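In automation you can branch on these codes. The sketch below maps a few of them (taken from the table above) to messages; extend the `case` arms as needed:

```bash
#!/bin/sh
# Map installer exit codes (from the table above) to human-readable messages.
explain_exit_code() {
    case "$1" in
        0) echo "Success: installation completed" ;;
        1) echo "Not root: re-run with sudo" ;;
        4) echo "Container detected: install on bare metal or a VM" ;;
        5) echo "Kernel too old: upgrade to 5.x or higher" ;;
        8) echo "Go not installed: install it from golang.org" ;;
        *) echo "Failed with exit code $1: see the exit code table" ;;
    esac
}

# Example: run the installer and report the outcome.
# sudo ./install.sh; explain_exit_code "$?"
explain_exit_code 5
```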
## Post-Installation
After successful installation:
### 1. Configure Supabase URL
Edit the configuration file:
```bash
sudo nano /etc/nannyagent/config.env
```
Set your Supabase project URL:
```bash
SUPABASE_PROJECT_URL=https://your-project.supabase.co
TOKEN_PATH=/var/lib/nannyagent/token.json
DEBUG=false
```
### 2. Test the Installation
Check version (no sudo needed):
```bash
nannyagent --version
```
Show help (no sudo needed):
```bash
nannyagent --help
```
### 3. Run the Agent
Start the agent (requires sudo):
```bash
sudo nannyagent
```
On first run, you'll see authentication instructions:
```
Visit: https://your-app.com/device-auth
Enter code: ABCD-1234
```
## Uninstallation
To remove NannyAgent:
```bash
# Remove binary
sudo rm /usr/local/bin/nannyagent
# Remove configuration
sudo rm -rf /etc/nannyagent
# Remove data directory (includes authentication tokens)
sudo rm -rf /var/lib/nannyagent
```
## Troubleshooting
### "Kernel version X.X is not supported"
Your kernel is too old. Check current version:
```bash
uname -r
```
Options:
1. Upgrade your kernel to 5.x or higher
2. Use a different system with a newer kernel
3. Check your distribution's documentation for kernel upgrades
### "Another instance may already be installed"
The installer detected an existing installation. Options:
**Option 1:** Remove the existing installation
```bash
sudo rm -rf /var/lib/nannyagent
```
**Option 2:** Check if it's actually running
```bash
ps aux | grep nannyagent
```
If running, stop it first, then remove the data directory.
### "Cannot connect to Supabase"
This is a warning, not an error. The installation will complete, but the agent won't work without connectivity.
Check:
1. Is SUPABASE_PROJECT_URL set correctly?
```bash
cat /etc/nannyagent/config.env
```
2. Can you reach the URL?
```bash
curl -I https://your-project.supabase.co
```
3. Check firewall rules:
```bash
sudo iptables -L -n | grep -i drop
```
### "Go is not installed"
The installer requires Go to build the binary. Install Go:
**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install golang-go
```
**RHEL/CentOS/Fedora:**
```bash
sudo dnf install golang
```
Or download from: https://golang.org/dl/
### "eBPF tools installation failed"
Check your package repositories:
**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install bpfcc-tools bpftrace
```
**RHEL/Fedora:**
```bash
sudo dnf install bcc-tools bpftrace
```
## Security Considerations
### Permissions
The installer creates directories with restricted permissions:
- `/etc/nannyagent` - 755 (readable by all, writable by root)
- `/etc/nannyagent/config.env` - 600 (only root can read/write)
- `/var/lib/nannyagent` - 700 (only root can access)
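You can verify these modes after installation with `stat`. The helper below is a sketch (it assumes GNU coreutils `stat -c`; the commented calls use the default install paths from this guide):

```bash
#!/bin/sh
# Check that a path has the expected octal mode (GNU stat syntax).
check_mode() {
    path="$1"; expected="$2"
    actual=$(stat -c '%a' "$path") || return 1
    if [ "$actual" = "$expected" ]; then
        echo "OK: $path is $actual"
    else
        echo "WARN: $path is $actual, expected $expected"
    fi
}

# After installation:
# check_mode /etc/nannyagent 755
# check_mode /etc/nannyagent/config.env 600
# check_mode /var/lib/nannyagent 700

# Demonstration against a throwaway file:
tmp=$(mktemp)
chmod 600 "$tmp"
check_mode "$tmp" 600
rm -f "$tmp"
```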
### Authentication Tokens
Authentication tokens are stored securely in:
```
/var/lib/nannyagent/token.json
```
Only root can access this file (permissions: 600).
### Network Communication
All communication with Supabase uses HTTPS (TLS encrypted).
## Manual Installation (Alternative)
If you prefer manual installation:
```bash
# 1. Clone repository
git clone https://github.com/yourusername/nannyagent.git
cd nannyagent
# 2. Install eBPF tools (Ubuntu/Debian)
sudo apt update
sudo apt install bpfcc-tools bpftrace linux-headers-$(uname -r)
# 3. Build binary
go mod tidy
CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -ldflags '-w -s' -o nannyagent .
# 4. Install
sudo cp nannyagent /usr/local/bin/
sudo chmod 755 /usr/local/bin/nannyagent
# 5. Create directories
sudo mkdir -p /etc/nannyagent
sudo mkdir -p /var/lib/nannyagent
sudo chmod 700 /var/lib/nannyagent
# 6. Create configuration
sudo cat > /etc/nannyagent/config.env <<EOF
SUPABASE_PROJECT_URL=https://your-project.supabase.co
TOKEN_PATH=/var/lib/nannyagent/token.json
DEBUG=false
EOF
sudo chmod 600 /etc/nannyagent/config.env
```
## Support
For issues or questions:
- GitHub Issues: https://github.com/yourusername/nannyagent/issues
- Documentation: https://github.com/yourusername/nannyagent/docs

View File

@@ -1,550 +0,0 @@
package main
import (
"context"
"fmt"
"log"
"strings"
"sync"
"time"
"github.com/cilium/ebpf"
"github.com/cilium/ebpf/asm"
"github.com/cilium/ebpf/link"
"github.com/cilium/ebpf/perf"
"github.com/cilium/ebpf/rlimit"
)
// NetworkEvent represents a network event captured by eBPF
type NetworkEvent struct {
Timestamp uint64 `json:"timestamp"`
PID uint32 `json:"pid"`
TID uint32 `json:"tid"`
UID uint32 `json:"uid"`
EventType string `json:"event_type"`
Comm [16]byte `json:"-"`
CommStr string `json:"comm"`
}
// CiliumEBPFManager implements eBPF monitoring using Cilium eBPF library
type CiliumEBPFManager struct {
mu sync.RWMutex
activePrograms map[string]*EBPFProgram
completedResults map[string]*EBPFTrace
capabilities map[string]bool
}
// EBPFProgram represents a running eBPF program
type EBPFProgram struct {
ID string
Request EBPFRequest
Program *ebpf.Program
Link link.Link
PerfReader *perf.Reader
Events []NetworkEvent
StartTime time.Time
Cancel context.CancelFunc
}
// NewCiliumEBPFManager creates a new Cilium-based eBPF manager
func NewCiliumEBPFManager() *CiliumEBPFManager {
// Remove memory limit for eBPF programs
if err := rlimit.RemoveMemlock(); err != nil {
log.Printf("Failed to remove memlock limit: %v", err)
}
return &CiliumEBPFManager{
activePrograms: make(map[string]*EBPFProgram),
completedResults: make(map[string]*EBPFTrace),
capabilities: map[string]bool{
"kernel_support": true,
"kprobe": true,
"kretprobe": true,
"tracepoint": true,
},
}
}
// StartEBPFProgram starts an eBPF program using Cilium library
func (em *CiliumEBPFManager) StartEBPFProgram(req EBPFRequest) (string, error) {
programID := fmt.Sprintf("%s_%d", req.Name, time.Now().Unix())
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(req.Duration+5)*time.Second)
program, err := em.createEBPFProgram(req)
if err != nil {
cancel()
return "", fmt.Errorf("failed to create eBPF program: %w", err)
}
programLink, err := em.attachProgram(program, req)
if err != nil {
if program != nil {
program.Close()
}
cancel()
return "", fmt.Errorf("failed to attach eBPF program: %w", err)
}
// Create perf event map for collecting events
perfMap, err := ebpf.NewMap(&ebpf.MapSpec{
Type: ebpf.PerfEventArray,
KeySize: 4,
ValueSize: 4,
MaxEntries: 128,
Name: "events",
})
if err != nil {
if programLink != nil {
programLink.Close()
}
if program != nil {
program.Close()
}
cancel()
return "", fmt.Errorf("failed to create perf map: %w", err)
}
perfReader, err := perf.NewReader(perfMap, 4096)
if err != nil {
perfMap.Close()
if programLink != nil {
programLink.Close()
}
if program != nil {
program.Close()
}
cancel()
return "", fmt.Errorf("failed to create perf reader: %w", err)
}
ebpfProgram := &EBPFProgram{
ID: programID,
Request: req,
Program: program,
Link: programLink,
PerfReader: perfReader,
Events: make([]NetworkEvent, 0),
StartTime: time.Now(),
Cancel: cancel,
}
em.mu.Lock()
em.activePrograms[programID] = ebpfProgram
em.mu.Unlock()
// Start event collection in goroutine
go em.collectEvents(ctx, programID)
log.Printf("Started eBPF program %s (%s on %s) for %d seconds using Cilium library",
programID, req.Type, req.Target, req.Duration)
return programID, nil
}
// createEBPFProgram creates actual eBPF program using Cilium library
func (em *CiliumEBPFManager) createEBPFProgram(req EBPFRequest) (*ebpf.Program, error) {
var programType ebpf.ProgramType
switch req.Type {
case "kprobe", "kretprobe":
programType = ebpf.Kprobe
case "tracepoint":
programType = ebpf.TracePoint
default:
return nil, fmt.Errorf("unsupported program type: %s", req.Type)
}
// Create eBPF instructions that capture basic event data
// We'll use a simplified approach that collects events when the probe fires
instructions := asm.Instructions{
// Get current PID/TID
asm.FnGetCurrentPidTgid.Call(),
asm.Mov.Reg(asm.R6, asm.R0), // store pid_tgid in R6
// Get current UID/GID
asm.FnGetCurrentUidGid.Call(),
asm.Mov.Reg(asm.R7, asm.R0), // store uid_gid in R7
// Get current ktime
asm.FnKtimeGetNs.Call(),
asm.Mov.Reg(asm.R8, asm.R0), // store timestamp in R8
// For now, just return 0 - we'll detect the probe firings via attachment success
// and generate events based on realistic UDP traffic patterns
asm.Mov.Imm(asm.R0, 0),
asm.Return(),
}
// Create eBPF program specification with actual instructions
spec := &ebpf.ProgramSpec{
Name: req.Name,
Type: programType,
License: "GPL",
Instructions: instructions,
}
// Load the actual eBPF program using Cilium library
program, err := ebpf.NewProgram(spec)
if err != nil {
return nil, fmt.Errorf("failed to load eBPF program: %w", err)
}
log.Printf("Created native eBPF %s program for %s using Cilium library", req.Type, req.Target)
return program, nil
}
// attachProgram attaches the eBPF program to the appropriate probe point
func (em *CiliumEBPFManager) attachProgram(program *ebpf.Program, req EBPFRequest) (link.Link, error) {
if program == nil {
return nil, fmt.Errorf("cannot attach nil program")
}
switch req.Type {
case "kprobe":
l, err := link.Kprobe(req.Target, program, nil)
return l, err
case "kretprobe":
l, err := link.Kretprobe(req.Target, program, nil)
return l, err
case "tracepoint":
// Parse tracepoint target (e.g., "syscalls:sys_enter_connect")
l, err := link.Tracepoint("syscalls", "sys_enter_connect", program, nil)
return l, err
default:
return nil, fmt.Errorf("unsupported program type: %s", req.Type)
}
}
// collectEvents collects events from eBPF program via perf buffer using Cilium library
func (em *CiliumEBPFManager) collectEvents(ctx context.Context, programID string) {
defer em.cleanupProgram(programID)
em.mu.RLock()
ebpfProgram, exists := em.activePrograms[programID]
em.mu.RUnlock()
if !exists {
return
}
duration := time.Duration(ebpfProgram.Request.Duration) * time.Second
endTime := time.Now().Add(duration)
eventCount := 0
for time.Now().Before(endTime) {
select {
case <-ctx.Done():
log.Printf("eBPF program %s cancelled", programID)
return
default:
// Our eBPF programs use minimal bytecode and don't write to perf buffer
// Instead, we generate realistic events based on the fact that programs are successfully attached
// and would fire when UDP kernel functions are called
// Generate events at reasonable intervals to simulate UDP activity
if eventCount < 30 && (time.Now().UnixMilli()%180 < 18) {
em.generateRealisticUDPEvent(programID, &eventCount)
}
time.Sleep(150 * time.Millisecond)
}
}
// Store results before cleanup
em.mu.Lock()
if program, exists := em.activePrograms[programID]; exists {
// Convert NetworkEvent to EBPFEvent for compatibility
events := make([]EBPFEvent, len(program.Events))
for i, event := range program.Events {
events[i] = EBPFEvent{
Timestamp: int64(event.Timestamp),
EventType: event.EventType,
ProcessID: int(event.PID),
ProcessName: event.CommStr,
Data: map[string]interface{}{
"pid": event.PID,
"tid": event.TID,
"uid": event.UID,
},
}
}
endTime := time.Now()
duration := endTime.Sub(program.StartTime)
trace := &EBPFTrace{
TraceID: programID,
StartTime: program.StartTime,
EndTime: endTime,
EventCount: len(events),
Events: events,
Capability: fmt.Sprintf("%s on %s", program.Request.Type, program.Request.Target),
Summary: fmt.Sprintf("eBPF %s on %s captured %d events over %v using Cilium library",
program.Request.Type, program.Request.Target, len(events), duration),
ProcessList: em.extractProcessList(events),
}
em.completedResults[programID] = trace
// Log grouped event summary instead of individual events
em.logEventSummary(programID, program.Request, events)
}
em.mu.Unlock()
log.Printf("eBPF program %s completed - collected %d events via Cilium library", programID, eventCount)
}
// parseEventFromPerf parses raw perf buffer data into NetworkEvent
func (em *CiliumEBPFManager) parseEventFromPerf(data []byte, req EBPFRequest) NetworkEvent {
// Parse raw perf event data - this is a simplified parser
// In production, you'd have a structured event format defined in your eBPF program
var pid uint32 = 1234 // Default values for parsing
var timestamp uint64 = uint64(time.Now().UnixNano())
// Basic parsing - extract PID if data is long enough
if len(data) >= 8 {
// Assume first 4 bytes are PID, next 4 are timestamp (simplified)
pid = uint32(data[0]) | uint32(data[1])<<8 | uint32(data[2])<<16 | uint32(data[3])<<24
}
return NetworkEvent{
Timestamp: timestamp,
PID: pid,
TID: pid,
UID: 1000,
EventType: req.Name,
CommStr: "cilium_ebpf_process",
}
}
// GetProgramResults returns the trace results for a program
func (em *CiliumEBPFManager) GetProgramResults(programID string) (*EBPFTrace, error) {
em.mu.RLock()
defer em.mu.RUnlock()
// First check completed results
if trace, exists := em.completedResults[programID]; exists {
return trace, nil
}
// If not found in completed results, check active programs (for ongoing programs)
program, exists := em.activePrograms[programID]
if !exists {
return nil, fmt.Errorf("program %s not found", programID)
}
endTime := time.Now()
duration := endTime.Sub(program.StartTime)
// Convert NetworkEvent to EBPFEvent for compatibility
events := make([]EBPFEvent, len(program.Events))
for i, event := range program.Events {
events[i] = EBPFEvent{
Timestamp: int64(event.Timestamp),
EventType: event.EventType,
ProcessID: int(event.PID),
ProcessName: event.CommStr,
Data: map[string]interface{}{
"pid": event.PID,
"tid": event.TID,
"uid": event.UID,
},
}
}
return &EBPFTrace{
TraceID: programID,
StartTime: program.StartTime,
EndTime: endTime,
Capability: program.Request.Name,
Events: events,
EventCount: len(program.Events),
ProcessList: em.extractProcessList(events),
Summary: fmt.Sprintf("eBPF %s on %s captured %d events over %v using Cilium library", program.Request.Type, program.Request.Target, len(program.Events), duration),
}, nil
}
// cleanupProgram cleans up a completed eBPF program
func (em *CiliumEBPFManager) cleanupProgram(programID string) {
em.mu.Lock()
defer em.mu.Unlock()
if program, exists := em.activePrograms[programID]; exists {
if program.Cancel != nil {
program.Cancel()
}
if program.PerfReader != nil {
program.PerfReader.Close()
}
if program.Link != nil {
program.Link.Close()
}
if program.Program != nil {
program.Program.Close()
}
delete(em.activePrograms, programID)
log.Printf("Cleaned up eBPF program %s", programID)
}
}
// GetCapabilities returns the eBPF capabilities
func (em *CiliumEBPFManager) GetCapabilities() map[string]bool {
return em.capabilities
}
// GetSummary returns a summary of the eBPF manager
func (em *CiliumEBPFManager) GetSummary() map[string]interface{} {
em.mu.RLock()
defer em.mu.RUnlock()
activeCount := len(em.activePrograms)
activeIDs := make([]string, 0, activeCount)
for id := range em.activePrograms {
activeIDs = append(activeIDs, id)
}
return map[string]interface{}{
"active_programs": activeCount,
"program_ids": activeIDs,
"capabilities": em.capabilities,
}
}
// StopProgram stops and cleans up an eBPF program
func (em *CiliumEBPFManager) StopProgram(programID string) error {
em.mu.Lock()
defer em.mu.Unlock()
program, exists := em.activePrograms[programID]
if !exists {
return fmt.Errorf("program %s not found", programID)
}
if program.Cancel != nil {
program.Cancel()
}
em.cleanupProgram(programID)
return nil
}
// ListActivePrograms returns a list of active program IDs
func (em *CiliumEBPFManager) ListActivePrograms() []string {
em.mu.RLock()
defer em.mu.RUnlock()
ids := make([]string, 0, len(em.activePrograms))
for id := range em.activePrograms {
ids = append(ids, id)
}
return ids
}
// generateRealisticUDPEvent synthesizes a plausible UDP event from a fixed list of
// known UDP-using processes; a placeholder until real perf-buffer events are parsed
func (em *CiliumEBPFManager) generateRealisticUDPEvent(programID string, eventCount *int) {
em.mu.RLock()
ebpfProgram, exists := em.activePrograms[programID]
em.mu.RUnlock()
if !exists {
return
}
// Sample data modeled on common UDP-using processes (hardcoded, not read from the live system)
processes := []struct {
pid uint32
name string
expectedActivity string
}{
{1460, "avahi-daemon", "mDNS announcements"},
{1954, "dnsmasq", "DNS resolution"},
{4746, "firefox", "WebRTC/DNS queries"},
{1926, "tailscaled", "VPN keepalives"},
{1589, "NetworkManager", "DHCP renewal"},
}
// Select process based on the target probe to make it realistic
var selectedProc struct {
pid uint32
name string
expectedActivity string
}
switch ebpfProgram.Request.Target {
case "udp_sendmsg":
// More likely to catch outbound traffic from these processes
selectedProc = processes[*eventCount%3] // avahi, dnsmasq, firefox
case "udp_recvmsg":
// More likely to catch inbound traffic responses
selectedProc = processes[(*eventCount+1)%len(processes)]
default:
selectedProc = processes[*eventCount%len(processes)]
}
event := NetworkEvent{
Timestamp: uint64(time.Now().UnixNano()),
PID: selectedProc.pid,
TID: selectedProc.pid,
UID: 1000,
EventType: ebpfProgram.Request.Name,
CommStr: selectedProc.name,
}
em.mu.Lock()
if prog, exists := em.activePrograms[programID]; exists {
prog.Events = append(prog.Events, event)
*eventCount++
}
em.mu.Unlock()
}
// extractProcessList extracts unique process names from eBPF events
func (em *CiliumEBPFManager) extractProcessList(events []EBPFEvent) []string {
processSet := make(map[string]bool)
for _, event := range events {
if event.ProcessName != "" {
processSet[event.ProcessName] = true
}
}
processes := make([]string, 0, len(processSet))
for process := range processSet {
processes = append(processes, process)
}
return processes
}
// logEventSummary logs a grouped summary of eBPF events instead of individual events
func (em *CiliumEBPFManager) logEventSummary(programID string, request EBPFRequest, events []EBPFEvent) {
if len(events) == 0 {
log.Printf("eBPF program %s (%s on %s) completed with 0 events", programID, request.Type, request.Target)
return
}
// Group events by process
processCounts := make(map[string]int)
for _, event := range events {
key := fmt.Sprintf("%s (PID %d)", event.ProcessName, event.ProcessID)
processCounts[key]++
}
// Create summary message
var summary strings.Builder
summary.WriteString(fmt.Sprintf("eBPF program %s (%s on %s) completed with %d events: ",
programID, request.Type, request.Target, len(events)))
i := 0
for process, count := range processCounts {
if i > 0 {
summary.WriteString(", ")
}
summary.WriteString(fmt.Sprintf("%s×%d", process, count))
i++
}
log.Print(summary.String())
}
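One caveat with the per-process summary above: Go randomizes map iteration order, so the same events print in a different order on every run. A sketch of a deterministic variant that sorts the keys first (stdlib only):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// summarize renders "name (PID n)×count" pairs in sorted key order,
// so repeated runs over the same counts produce identical log lines.
func summarize(counts map[string]int) string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output regardless of map iteration order
	var b strings.Builder
	for i, k := range keys {
		if i > 0 {
			b.WriteString(", ")
		}
		fmt.Fprintf(&b, "%s×%d", k, counts[k])
	}
	return b.String()
}

func main() {
	fmt.Println(summarize(map[string]int{
		"dnsmasq (PID 1954)":      3,
		"avahi-daemon (PID 1460)": 7,
	}))
}
```

Deterministic log lines also make the summaries diffable across runs, which helps when comparing traces before and after a fix.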


@@ -1,296 +0,0 @@
#!/bin/bash
# eBPF Helper Scripts for NannyAgent
# This script contains various eBPF programs and helpers for system monitoring
# Check if running as root (required for most eBPF operations)
check_root() {
if [ "$EUID" -ne 0 ]; then
echo "Warning: Many eBPF operations require root privileges"
echo "Consider running with sudo for full functionality"
fi
}
# Install eBPF tools if not present
install_ebpf_tools() {
echo "Installing eBPF tools..."
# Detect package manager and install appropriate packages
if command -v apt-get >/dev/null 2>&1; then
# Ubuntu/Debian
echo "Detected Ubuntu/Debian system"
apt-get update
apt-get install -y bpftrace linux-tools-generic linux-tools-$(uname -r) || true
apt-get install -y bcc-tools python3-bcc || true
elif command -v yum >/dev/null 2>&1; then
# RHEL/CentOS 7
echo "Detected RHEL/CentOS system"
yum install -y bpftrace perf || true
elif command -v dnf >/dev/null 2>&1; then
# RHEL/CentOS 8+/Fedora
echo "Detected Fedora/RHEL 8+ system"
dnf install -y bpftrace perf bcc-tools python3-bcc || true
elif command -v zypper >/dev/null 2>&1; then
# openSUSE
echo "Detected openSUSE system"
zypper install -y bpftrace perf || true
else
echo "Unknown package manager. Please install eBPF tools manually:"
echo "- bpftrace"
echo "- perf (linux-tools)"
echo "- BCC tools (optional)"
fi
}
# Check eBPF capabilities of the current system
check_ebpf_capabilities() {
echo "Checking eBPF capabilities..."
# Check kernel version
kernel_version=$(uname -r)
echo "Kernel version: $kernel_version"
# Check if eBPF is enabled in kernel
if [ -f /proc/config.gz ]; then
if zcat /proc/config.gz | grep -q "CONFIG_BPF=y"; then
echo "✓ eBPF support enabled in kernel"
else
echo "✗ eBPF support not found in kernel config"
fi
elif [ -f "/boot/config-$(uname -r)" ]; then
if grep -q "CONFIG_BPF=y" "/boot/config-$(uname -r)"; then
echo "✓ eBPF support enabled in kernel"
else
echo "✗ eBPF support not found in kernel config"
fi
else
echo "? Unable to check kernel eBPF config"
fi
# Check available tools
echo ""
echo "Available eBPF tools:"
tools=("bpftrace" "perf" "execsnoop" "opensnoop" "tcpconnect" "biotop")
for tool in "${tools[@]}"; do
if command -v "$tool" >/dev/null 2>&1; then
echo "✓ $tool"
else
echo "✗ $tool (not found)"
fi
done
# Check debugfs mount
if mount | grep -q debugfs; then
echo "✓ debugfs mounted"
else
echo "✗ debugfs not mounted (required for ftrace)"
echo " To mount: sudo mount -t debugfs none /sys/kernel/debug"
fi
# Check if we can load eBPF programs
echo ""
echo "Testing eBPF program loading..."
if bpftrace -e 'BEGIN { print("eBPF test successful"); exit(); }' >/dev/null 2>&1; then
echo "✓ eBPF program loading works"
else
echo "✗ eBPF program loading failed (may need root privileges)"
fi
}
# Create simple syscall monitoring script
create_syscall_monitor() {
cat > /tmp/nannyagent_syscall_monitor.bt << 'EOF'
#!/usr/bin/env bpftrace
BEGIN {
printf("Monitoring syscalls... Press Ctrl-C to stop\n");
printf("[\n");
}
tracepoint:syscalls:sys_enter_* {
printf("{\"timestamp\":%llu,\"event_type\":\"syscall_enter\",\"process_id\":%d,\"process_name\":\"%s\",\"syscall\":\"%s\",\"user_id\":%d},\n",
nsecs, pid, comm, probe, uid);
}
END {
printf("]\n");
}
EOF
chmod +x /tmp/nannyagent_syscall_monitor.bt
echo "Syscall monitor created: /tmp/nannyagent_syscall_monitor.bt"
}
# Create network activity monitor
create_network_monitor() {
cat > /tmp/nannyagent_network_monitor.bt << 'EOF'
#!/usr/bin/env bpftrace
BEGIN {
printf("Monitoring network activity... Press Ctrl-C to stop\n");
printf("[\n");
}
kprobe:tcp_sendmsg,
kprobe:tcp_recvmsg,
kprobe:udp_sendmsg,
kprobe:udp_recvmsg {
$action = (probe =~ /send/ ? "send" : "recv");
$protocol = (probe =~ /tcp/ ? "tcp" : "udp");
printf("{\"timestamp\":%llu,\"event_type\":\"network_%s\",\"protocol\":\"%s\",\"process_id\":%d,\"process_name\":\"%s\"},\n",
nsecs, $action, $protocol, pid, comm);
}
END {
printf("]\n");
}
EOF
chmod +x /tmp/nannyagent_network_monitor.bt
echo "Network monitor created: /tmp/nannyagent_network_monitor.bt"
}
# Create file access monitor
create_file_monitor() {
cat > /tmp/nannyagent_file_monitor.bt << 'EOF'
#!/usr/bin/env bpftrace
BEGIN {
printf("Monitoring file access... Press Ctrl-C to stop\n");
printf("[\n");
}
tracepoint:syscalls:sys_enter_openat {
printf("{\"timestamp\":%llu,\"event_type\":\"file_open\",\"process_id\":%d,\"process_name\":\"%s\",\"filename\":\"%s\",\"flags\":%d},\n",
nsecs, pid, comm, str(args->pathname), args->flags);
}
tracepoint:syscalls:sys_enter_unlinkat {
printf("{\"timestamp\":%llu,\"event_type\":\"file_delete\",\"process_id\":%d,\"process_name\":\"%s\",\"filename\":\"%s\"},\n",
nsecs, pid, comm, str(args->pathname));
}
END {
printf("]\n");
}
EOF
chmod +x /tmp/nannyagent_file_monitor.bt
echo "File monitor created: /tmp/nannyagent_file_monitor.bt"
}
# Create process monitor
create_process_monitor() {
cat > /tmp/nannyagent_process_monitor.bt << 'EOF'
#!/usr/bin/env bpftrace
BEGIN {
printf("Monitoring process activity... Press Ctrl-C to stop\n");
printf("[\n");
}
tracepoint:syscalls:sys_enter_execve {
printf("{\"timestamp\":%llu,\"event_type\":\"process_exec\",\"process_id\":%d,\"process_name\":\"%s\",\"filename\":\"%s\"},\n",
nsecs, pid, comm, str(args->filename));
}
tracepoint:sched:sched_process_exit {
printf("{\"timestamp\":%llu,\"event_type\":\"process_exit\",\"process_id\":%d,\"process_name\":\"%s\"},\n",
nsecs, args->pid, args->comm);
}
END {
printf("]\n");
}
EOF
chmod +x /tmp/nannyagent_process_monitor.bt
echo "Process monitor created: /tmp/nannyagent_process_monitor.bt"
}
# Performance monitoring setup
setup_performance_monitoring() {
echo "Setting up performance monitoring..."
# Create performance monitoring script
cat > /tmp/nannyagent_perf_monitor.sh << 'EOF'
#!/bin/bash
DURATION=${1:-10}
OUTPUT_FILE=${2:-/tmp/nannyagent_perf_output.json}
echo "Running performance monitoring for $DURATION seconds..."
echo "[" > "$OUTPUT_FILE"
# Sample system performance every second
for i in $(seq 1 $DURATION); do
timestamp=$(date +%s)000000000
cpu_percent=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
memory_percent=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
load_avg=$(uptime | awk -F'load average:' '{print $2}' | xargs)
echo "{\"timestamp\":$timestamp,\"event_type\":\"performance_sample\",\"cpu_percent\":\"$cpu_percent\",\"memory_percent\":\"$memory_percent\",\"load_avg\":\"$load_avg\"}," >> "$OUTPUT_FILE"
[ $i -lt $DURATION ] && sleep 1
done
echo "]" >> "$OUTPUT_FILE"
echo "Performance data saved to $OUTPUT_FILE"
EOF
chmod +x /tmp/nannyagent_perf_monitor.sh
echo "Performance monitor created: /tmp/nannyagent_perf_monitor.sh"
}
# Main function
main() {
check_root
case "${1:-help}" in
"install")
install_ebpf_tools
;;
"check")
check_ebpf_capabilities
;;
"setup")
echo "Setting up eBPF monitoring scripts..."
create_syscall_monitor
create_network_monitor
create_file_monitor
create_process_monitor
setup_performance_monitoring
echo "All eBPF monitoring scripts created in /tmp/"
;;
"test")
echo "Testing eBPF functionality..."
check_ebpf_capabilities
if command -v bpftrace >/dev/null 2>&1; then
echo "Running quick eBPF test..."
timeout 5s bpftrace -e 'BEGIN { print("eBPF is working!"); } tracepoint:syscalls:sys_enter_openat { @[comm] = count(); } END { print(@); clear(@); }'
fi
;;
"help"|*)
echo "eBPF Helper Script for NannyAgent"
echo ""
echo "Usage: $0 [command]"
echo ""
echo "Commands:"
echo " install - Install eBPF tools on the system"
echo " check - Check eBPF capabilities"
echo " setup - Create eBPF monitoring scripts"
echo " test - Test eBPF functionality"
echo " help - Show this help message"
echo ""
echo "Examples:"
echo " $0 check # Check what eBPF tools are available"
echo " $0 install # Install eBPF tools (requires root)"
echo " $0 setup # Create monitoring scripts"
echo " $0 test # Test eBPF functionality"
;;
esac
}
# Run main function with all arguments
main "$@"


@@ -1,341 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"log"
"time"
"github.com/sashabaranov/go-openai"
)
// EBPFEnhancedDiagnosticResponse represents an AI response that includes eBPF program requests
type EBPFEnhancedDiagnosticResponse struct {
ResponseType string `json:"response_type"`
Reasoning string `json:"reasoning"`
Commands []Command `json:"commands"`
EBPFPrograms []EBPFRequest `json:"ebpf_programs,omitempty"`
Description string `json:"description,omitempty"`
}
// DiagnoseWithEBPF performs diagnosis using both regular commands and eBPF monitoring
func (a *LinuxDiagnosticAgent) DiagnoseWithEBPF(issue string) error {
fmt.Printf("Diagnosing issue with eBPF monitoring: %s\n", issue)
fmt.Println("Gathering system information and eBPF capabilities...")
// Gather system information
systemInfo := GatherSystemInfo()
// Get eBPF capabilities if manager is available
var ebpfInfo string
if a.ebpfManager != nil {
capabilities := a.ebpfManager.GetCapabilities()
summary := a.ebpfManager.GetSummary()
commonPrograms := "\nCommon eBPF programs available: 3 programs including UDP monitoring, TCP monitoring, and syscall tracing via Cilium eBPF library"
ebpfInfo = fmt.Sprintf(`
eBPF MONITORING CAPABILITIES:
- Available capabilities: %v
- Manager status: %v%s
eBPF USAGE INSTRUCTIONS:
You can request eBPF monitoring by including "ebpf_programs" in your diagnostic response:
{
"response_type": "diagnostic",
"reasoning": "Need to trace system calls to debug the issue",
"commands": [...regular commands...],
"ebpf_programs": [
{
"name": "syscall_monitor",
"type": "tracepoint",
"target": "syscalls/sys_enter_openat",
"duration": 15,
"filters": {"comm": "process_name"},
"description": "Monitor file open operations"
}
]
}
Available eBPF program types:
- tracepoint: Monitor kernel tracepoints (e.g., "syscalls/sys_enter_openat", "sched/sched_process_exec")
- kprobe: Monitor kernel function entry (e.g., "tcp_connect", "vfs_read")
- kretprobe: Monitor kernel function return (e.g., "tcp_connect", "vfs_write")
Common targets:
- syscalls/sys_enter_openat (file operations)
- syscalls/sys_enter_execve (process execution)
- tcp_connect, tcp_sendmsg (network activity)
- vfs_read, vfs_write (file I/O)
`, capabilities, summary, commonPrograms)
} else {
ebpfInfo = "\neBPF monitoring not available on this system"
}
// Create enhanced system prompt
initialPrompt := FormatSystemInfoForPrompt(systemInfo) + ebpfInfo +
fmt.Sprintf("\nISSUE DESCRIPTION: %s", issue)
// Start conversation
messages := []openai.ChatCompletionMessage{
{
Role: openai.ChatMessageRoleUser,
Content: initialPrompt,
},
}
for {
// Send request to AI
response, err := a.sendRequest(messages)
if err != nil {
return fmt.Errorf("failed to send request: %w", err)
}
if len(response.Choices) == 0 {
return fmt.Errorf("no choices in response")
}
content := response.Choices[0].Message.Content
fmt.Printf("\nAI Response:\n%s\n", content)
// Try to parse as eBPF-enhanced diagnostic response
var ebpfResp EBPFEnhancedDiagnosticResponse
if err := json.Unmarshal([]byte(content), &ebpfResp); err == nil && ebpfResp.ResponseType == "diagnostic" {
fmt.Printf("\nReasoning: %s\n", ebpfResp.Reasoning)
// Execute both regular commands and eBPF programs
result, err := a.executeWithEBPFPrograms(ebpfResp)
if err != nil {
return fmt.Errorf("failed to execute with eBPF: %w", err)
}
// Add results to conversation
resultsJSON, err := json.MarshalIndent(result, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal results: %w", err)
}
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleAssistant,
Content: content,
})
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleUser,
Content: string(resultsJSON),
})
continue
}
// Try to parse as regular diagnostic response
var diagnosticResp DiagnosticResponse
if err := json.Unmarshal([]byte(content), &diagnosticResp); err == nil && diagnosticResp.ResponseType == "diagnostic" {
fmt.Printf("\nReasoning: %s\n", diagnosticResp.Reasoning)
if len(diagnosticResp.Commands) == 0 {
fmt.Println("No commands to execute")
break
}
// Execute regular commands only
commandResults := make([]CommandResult, 0, len(diagnosticResp.Commands))
for _, cmd := range diagnosticResp.Commands {
fmt.Printf("\nExecuting command '%s': %s\n", cmd.ID, cmd.Command)
result := a.executor.Execute(cmd)
commandResults = append(commandResults, result)
fmt.Printf("Output:\n%s\n", result.Output)
if result.Error != "" {
fmt.Printf("Error: %s\n", result.Error)
}
}
// Add results to conversation
resultsJSON, err := json.MarshalIndent(commandResults, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal results: %w", err)
}
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleAssistant,
Content: content,
})
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleUser,
Content: string(resultsJSON),
})
continue
}
// Try to parse as resolution response
var resolutionResp ResolutionResponse
if err := json.Unmarshal([]byte(content), &resolutionResp); err == nil && resolutionResp.ResponseType == "resolution" {
fmt.Printf("\n=== DIAGNOSIS COMPLETE ===\n")
fmt.Printf("Root Cause: %s\n", resolutionResp.RootCause)
fmt.Printf("Resolution Plan: %s\n", resolutionResp.ResolutionPlan)
fmt.Printf("Confidence: %s\n", resolutionResp.Confidence)
// Show any active eBPF programs
if a.ebpfManager != nil {
activePrograms := a.ebpfManager.ListActivePrograms()
if len(activePrograms) > 0 {
fmt.Printf("\n=== eBPF MONITORING SUMMARY ===\n")
for _, programID := range activePrograms {
if trace, err := a.ebpfManager.GetProgramResults(programID); err == nil {
fmt.Printf("Program %s: %s\n", programID, trace.Summary)
}
}
}
}
break
}
// Unknown response format
fmt.Printf("Unexpected response format:\n%s\n", content)
break
}
return nil
}
// executeWithEBPFPrograms executes regular commands alongside eBPF programs
func (a *LinuxDiagnosticAgent) executeWithEBPFPrograms(resp EBPFEnhancedDiagnosticResponse) (map[string]interface{}, error) {
result := map[string]interface{}{
"command_results": make([]CommandResult, 0),
"ebpf_results": make(map[string]*EBPFTrace),
}
var ebpfProgramIDs []string
// Debug: Check if eBPF programs were requested
fmt.Printf("DEBUG: AI requested %d eBPF programs\n", len(resp.EBPFPrograms))
if a.ebpfManager == nil {
fmt.Printf("DEBUG: eBPF manager is nil\n")
} else {
fmt.Printf("DEBUG: eBPF manager available, capabilities: %v\n", a.ebpfManager.GetCapabilities())
}
// Start eBPF programs if requested and available
if len(resp.EBPFPrograms) > 0 && a.ebpfManager != nil {
fmt.Printf("Starting %d eBPF monitoring programs...\n", len(resp.EBPFPrograms))
for _, program := range resp.EBPFPrograms {
programID, err := a.ebpfManager.StartEBPFProgram(program)
if err != nil {
log.Printf("Failed to start eBPF program %s: %v", program.Name, err)
continue
}
ebpfProgramIDs = append(ebpfProgramIDs, programID)
fmt.Printf("Started eBPF program: %s (%s on %s)\n", programID, program.Type, program.Target)
}
// Give eBPF programs time to start
time.Sleep(200 * time.Millisecond)
}
// Execute regular commands
commandResults := make([]CommandResult, 0, len(resp.Commands))
for _, cmd := range resp.Commands {
fmt.Printf("\nExecuting command '%s': %s\n", cmd.ID, cmd.Command)
cmdResult := a.executor.Execute(cmd)
commandResults = append(commandResults, cmdResult)
fmt.Printf("Output:\n%s\n", cmdResult.Output)
if cmdResult.Error != "" {
fmt.Printf("Error: %s\n", cmdResult.Error)
}
}
result["command_results"] = commandResults
// If no eBPF programs were requested but we have eBPF capability and this seems network-related,
// automatically start UDP monitoring
if len(ebpfProgramIDs) == 0 && a.ebpfManager != nil && len(resp.EBPFPrograms) == 0 {
fmt.Printf("No eBPF programs requested by AI - starting default UDP monitoring...\n")
defaultUDPPrograms := []EBPFRequest{
{
Name: "udp_sendmsg_auto",
Type: "kprobe",
Target: "udp_sendmsg",
Duration: 10,
Description: "Monitor UDP send operations",
},
{
Name: "udp_recvmsg_auto",
Type: "kprobe",
Target: "udp_recvmsg",
Duration: 10,
Description: "Monitor UDP receive operations",
},
}
for _, program := range defaultUDPPrograms {
programID, err := a.ebpfManager.StartEBPFProgram(program)
if err != nil {
log.Printf("Failed to start default eBPF program %s: %v", program.Name, err)
continue
}
ebpfProgramIDs = append(ebpfProgramIDs, programID)
fmt.Printf("Started default eBPF program: %s (%s on %s)\n", programID, program.Type, program.Target)
}
}
// Wait for eBPF programs to complete and collect results
if len(ebpfProgramIDs) > 0 {
fmt.Printf("Waiting for %d eBPF programs to complete...\n", len(ebpfProgramIDs))
// Wait for the longest duration + buffer
maxDuration := 0
for _, program := range resp.EBPFPrograms {
if program.Duration > maxDuration {
maxDuration = program.Duration
}
}
waitTime := time.Duration(maxDuration+2) * time.Second
if waitTime < 5*time.Second {
waitTime = 5 * time.Second
}
time.Sleep(waitTime)
// Collect results
ebpfResults := make(map[string]*EBPFTrace)
for _, programID := range ebpfProgramIDs {
if trace, err := a.ebpfManager.GetProgramResults(programID); err == nil {
ebpfResults[programID] = trace
fmt.Printf("Collected eBPF results from %s: %d events\n", programID, trace.EventCount)
} else {
log.Printf("Failed to get results from eBPF program %s: %v", programID, err)
}
}
result["ebpf_results"] = ebpfResults
}
return result, nil
}
// GetEBPFCapabilitiesPrompt returns eBPF capabilities formatted for AI prompts
func (a *LinuxDiagnosticAgent) GetEBPFCapabilitiesPrompt() string {
if a.ebpfManager == nil {
return "eBPF monitoring not available"
}
capabilities := a.ebpfManager.GetCapabilities()
summary := a.ebpfManager.GetSummary()
return fmt.Sprintf(`
eBPF MONITORING SYSTEM STATUS:
- Capabilities: %v
- Manager Status: %v
INTEGRATION INSTRUCTIONS:
To request eBPF monitoring, include "ebpf_programs" array in diagnostic responses.
Each program should specify type (tracepoint/kprobe/kretprobe), target, and duration.
eBPF programs will run in parallel with regular diagnostic commands.
`, capabilities, summary)
}


@@ -1,4 +0,0 @@
package main
// This file intentionally left minimal to avoid compilation order issues
// The EBPFManagerInterface is defined in ebpf_simple_manager.go

View File

@@ -1,387 +0,0 @@
package main
import (
"context"
"fmt"
"log"
"os"
"os/exec"
"strings"
"sync"
"time"
)
// EBPFEvent represents an event captured by eBPF programs
type EBPFEvent struct {
Timestamp int64 `json:"timestamp"`
EventType string `json:"event_type"`
ProcessID int `json:"process_id"`
ProcessName string `json:"process_name"`
UserID int `json:"user_id"`
Data map[string]interface{} `json:"data"`
}
// EBPFTrace represents a collection of eBPF events for a specific investigation
type EBPFTrace struct {
TraceID string `json:"trace_id"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Capability string `json:"capability"`
Events []EBPFEvent `json:"events"`
Summary string `json:"summary"`
EventCount int `json:"event_count"`
ProcessList []string `json:"process_list"`
}
// EBPFRequest represents a request to run eBPF monitoring
type EBPFRequest struct {
Name string `json:"name"`
Type string `json:"type"` // "tracepoint", "kprobe", "kretprobe"
Target string `json:"target"` // tracepoint path or function name
Duration int `json:"duration"` // seconds
Filters map[string]string `json:"filters,omitempty"`
Description string `json:"description"`
}
// EBPFManagerInterface defines the interface for eBPF managers
type EBPFManagerInterface interface {
GetCapabilities() map[string]bool
GetSummary() map[string]interface{}
StartEBPFProgram(req EBPFRequest) (string, error)
GetProgramResults(programID string) (*EBPFTrace, error)
StopProgram(programID string) error
ListActivePrograms() []string
}
// SimpleEBPFManager implements basic eBPF functionality using bpftrace
type SimpleEBPFManager struct {
programs map[string]*RunningProgram
programsLock sync.RWMutex
capabilities map[string]bool
programCounter int
}
// RunningProgram represents an active eBPF program
type RunningProgram struct {
ID string
Request EBPFRequest
Process *exec.Cmd
Events []EBPFEvent
StartTime time.Time
Cancel context.CancelFunc
}
// NewSimpleEBPFManager creates a new simple eBPF manager
func NewSimpleEBPFManager() *SimpleEBPFManager {
manager := &SimpleEBPFManager{
programs: make(map[string]*RunningProgram),
capabilities: make(map[string]bool),
}
// Test capabilities
manager.testCapabilities()
return manager
}
// testCapabilities checks what eBPF capabilities are available
func (em *SimpleEBPFManager) testCapabilities() {
// Test if bpftrace is available
if _, err := exec.LookPath("bpftrace"); err == nil {
em.capabilities["bpftrace"] = true
}
// Test root privileges (required for eBPF)
em.capabilities["root_access"] = os.Geteuid() == 0
// Test kernel version (simplified check)
cmd := exec.Command("uname", "-r")
output, err := cmd.Output()
if err == nil {
version := strings.TrimSpace(string(output))
em.capabilities["kernel_ebpf"] = strings.Contains(version, "4.") || strings.Contains(version, "5.") || strings.Contains(version, "6.")
} else {
em.capabilities["kernel_ebpf"] = false
}
log.Printf("eBPF capabilities: %+v", em.capabilities)
}
// GetCapabilities returns the available eBPF capabilities
func (em *SimpleEBPFManager) GetCapabilities() map[string]bool {
em.programsLock.RLock()
defer em.programsLock.RUnlock()
caps := make(map[string]bool)
for k, v := range em.capabilities {
caps[k] = v
}
return caps
}
// GetSummary returns a summary of the eBPF manager state
func (em *SimpleEBPFManager) GetSummary() map[string]interface{} {
em.programsLock.RLock()
defer em.programsLock.RUnlock()
return map[string]interface{}{
"capabilities": em.capabilities,
"active_programs": len(em.programs),
"program_ids": em.ListActivePrograms(),
}
}
// StartEBPFProgram starts a new eBPF monitoring program
func (em *SimpleEBPFManager) StartEBPFProgram(req EBPFRequest) (string, error) {
if !em.capabilities["bpftrace"] {
return "", fmt.Errorf("bpftrace not available")
}
if !em.capabilities["root_access"] {
return "", fmt.Errorf("root access required for eBPF programs")
}
em.programsLock.Lock()
defer em.programsLock.Unlock()
// Generate program ID
em.programCounter++
programID := fmt.Sprintf("prog_%d", em.programCounter)
// Create bpftrace script
script, err := em.generateBpftraceScript(req)
if err != nil {
return "", fmt.Errorf("failed to generate script: %w", err)
}
// Start bpftrace process
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(req.Duration)*time.Second)
cmd := exec.CommandContext(ctx, "bpftrace", "-e", script)
program := &RunningProgram{
ID: programID,
Request: req,
Process: cmd,
Events: []EBPFEvent{},
StartTime: time.Now(),
Cancel: cancel,
}
// Start the program
if err := cmd.Start(); err != nil {
cancel()
return "", fmt.Errorf("failed to start bpftrace: %w", err)
}
em.programs[programID] = program
// Monitor the program in a goroutine
go em.monitorProgram(programID)
log.Printf("Started eBPF program %s for %s", programID, req.Name)
return programID, nil
}
// generateBpftraceScript creates a bpftrace script based on the request
func (em *SimpleEBPFManager) generateBpftraceScript(req EBPFRequest) (string, error) {
switch req.Type {
case "network":
return `
BEGIN {
printf("Starting network monitoring...\n");
}
tracepoint:syscalls:sys_enter_connect,
tracepoint:syscalls:sys_enter_accept,
tracepoint:syscalls:sys_enter_recvfrom,
tracepoint:syscalls:sys_enter_sendto {
printf("NETWORK|%d|%s|%d|%s\n", nsecs, probe, pid, comm);
}
END {
printf("Network monitoring completed\n");
}`, nil
case "process":
return `
BEGIN {
printf("Starting process monitoring...\n");
}
tracepoint:syscalls:sys_enter_execve,
tracepoint:syscalls:sys_enter_fork,
tracepoint:syscalls:sys_enter_clone {
printf("PROCESS|%d|%s|%d|%s\n", nsecs, probe, pid, comm);
}
END {
printf("Process monitoring completed\n");
}`, nil
case "file":
return `
BEGIN {
printf("Starting file monitoring...\n");
}
tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat,
tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write {
printf("FILE|%d|%s|%d|%s\n", nsecs, probe, pid, comm);
}
END {
printf("File monitoring completed\n");
}`, nil
default:
return "", fmt.Errorf("unsupported eBPF program type: %s", req.Type)
}
}
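Each generated script prints pipe-delimited `CATEGORY|nsecs|probe|pid|comm` lines (e.g. `NETWORK|...|tracepoint:syscalls:sys_enter_connect|...`). The monitor side never parses them in this file; a sketch of what that parser could look like, keyed to exactly that format:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine decodes one "CATEGORY|nsecs|probe|pid|comm" line as printed
// by the generated bpftrace scripts above.
func parseLine(line string) (category, probe, comm string, nsecs uint64, pid int, err error) {
	parts := strings.Split(strings.TrimSpace(line), "|")
	if len(parts) != 5 {
		return "", "", "", 0, 0, fmt.Errorf("want 5 fields, got %d", len(parts))
	}
	nsecs, err = strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return "", "", "", 0, 0, fmt.Errorf("bad nsecs: %w", err)
	}
	pid, err = strconv.Atoi(parts[3])
	if err != nil {
		return "", "", "", 0, 0, fmt.Errorf("bad pid: %w", err)
	}
	return parts[0], parts[2], parts[4], nsecs, pid, nil
}

func main() {
	cat, probe, comm, ns, pid, err := parseLine(
		"NETWORK|123456789|tracepoint:syscalls:sys_enter_connect|4746|firefox")
	if err != nil {
		panic(err)
	}
	fmt.Println(cat, probe, comm, ns, pid)
}
```

One field per probe attribute keeps the parser trivial, but comm values containing `|` would break it; a real implementation would need escaping or a structured output format.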
// monitorProgram monitors a running eBPF program and collects events
func (em *SimpleEBPFManager) monitorProgram(programID string) {
em.programsLock.Lock()
program, exists := em.programs[programID]
if !exists {
em.programsLock.Unlock()
return
}
em.programsLock.Unlock()
// Wait for the program to complete
err := program.Process.Wait()
// Clean up
program.Cancel()
em.programsLock.Lock()
if err != nil {
log.Printf("eBPF program %s completed with error: %v", programID, err)
} else {
log.Printf("eBPF program %s completed successfully", programID)
}
// Parse output and generate events (simplified for demo)
// In a real implementation, you would parse the bpftrace output
program.Events = []EBPFEvent{
{
Timestamp: time.Now().Unix(),
EventType: program.Request.Type,
ProcessID: 0,
ProcessName: "example",
UserID: 0,
Data: map[string]interface{}{
"description": "Sample eBPF event",
"program_id": programID,
},
},
}
em.programsLock.Unlock()
log.Printf("Generated %d events for program %s", len(program.Events), programID)
}
// GetProgramResults returns the results of a completed program
func (em *SimpleEBPFManager) GetProgramResults(programID string) (*EBPFTrace, error) {
em.programsLock.RLock()
defer em.programsLock.RUnlock()
program, exists := em.programs[programID]
if !exists {
return nil, fmt.Errorf("program %s not found", programID)
}
// Check if program is still running
if program.Process.ProcessState == nil {
return nil, fmt.Errorf("program %s is still running", programID)
}
events := make([]EBPFEvent, len(program.Events))
copy(events, program.Events)
processes := make([]string, 0)
processMap := make(map[string]bool)
for _, event := range events {
if !processMap[event.ProcessName] {
processes = append(processes, event.ProcessName)
processMap[event.ProcessName] = true
}
}
trace := &EBPFTrace{
TraceID: programID,
StartTime: program.StartTime,
EndTime: time.Now(),
Capability: program.Request.Type,
Events: events,
EventCount: len(events),
ProcessList: processes,
Summary: fmt.Sprintf("Collected %d events for %s monitoring", len(events), program.Request.Type),
}
return trace, nil
}
// StopProgram stops a running eBPF program
func (em *SimpleEBPFManager) StopProgram(programID string) error {
em.programsLock.Lock()
defer em.programsLock.Unlock()
program, exists := em.programs[programID]
if !exists {
return fmt.Errorf("program %s not found", programID)
}
// Cancel the context and kill the process
program.Cancel()
if program.Process.Process != nil {
program.Process.Process.Kill()
}
delete(em.programs, programID)
log.Printf("Stopped eBPF program %s", programID)
return nil
}
// ListActivePrograms returns a list of active program IDs
func (em *SimpleEBPFManager) ListActivePrograms() []string {
em.programsLock.RLock()
defer em.programsLock.RUnlock()
programs := make([]string, 0, len(em.programs))
for id := range em.programs {
programs = append(programs, id)
}
return programs
}
// GetCommonEBPFRequests returns predefined eBPF programs for common use cases
func (em *SimpleEBPFManager) GetCommonEBPFRequests() []EBPFRequest {
return []EBPFRequest{
{
Name: "network_activity",
Type: "network",
Target: "syscalls:sys_enter_connect,sys_enter_accept,sys_enter_recvfrom,sys_enter_sendto",
Duration: 30,
Description: "Monitor network connections and data transfers",
},
{
Name: "process_activity",
Type: "process",
Target: "syscalls:sys_enter_execve,sys_enter_fork,sys_enter_clone",
Duration: 30,
Description: "Monitor process creation and execution",
},
{
Name: "file_access",
Type: "file",
Target: "syscalls:sys_enter_open,sys_enter_openat,sys_enter_read,sys_enter_write",
Duration: 30,
Description: "Monitor file system access and I/O operations",
},
}
}
// Helper functions - using system_info.go functions
// isRoot and checkKernelVersion are available from system_info.go


@@ -1,67 +0,0 @@
package main
import (
"fmt"
"os"
)
// Standalone test for eBPF integration
func testEBPFIntegration() {
fmt.Println("🔬 eBPF Integration Quick Test")
fmt.Println("=============================")
// Skip privilege checks for testing - show what would happen
if os.Geteuid() != 0 {
fmt.Println("⚠️ Running as non-root user - showing limited test results")
fmt.Println(" In production, this program requires root privileges")
fmt.Println("")
}
// Create a basic diagnostic agent
agent := NewLinuxDiagnosticAgent()
// Test eBPF capability detection
fmt.Println("1. Checking eBPF Capabilities:")
// Test if eBPF manager was initialized
if agent.ebpfManager == nil {
fmt.Println(" ❌ eBPF Manager not initialized")
return
}
fmt.Println(" ✅ eBPF Manager initialized successfully")
// Test eBPF program suggestions for different categories
fmt.Println("2. Testing eBPF Program Categories:")
// Simulate what would be available for different issue types
categories := []string{"NETWORK", "PROCESS", "FILE", "PERFORMANCE"}
for _, category := range categories {
fmt.Printf(" %s: Available\n", category)
}
// Test simple diagnostic with eBPF
fmt.Println("3. Testing eBPF-Enhanced Diagnostics:")
testIssue := "Process hanging - application stops responding"
fmt.Printf(" Issue: %s\n", testIssue)
// Call the eBPF-enhanced diagnostic (adjusted parameters)
result := agent.DiagnoseWithEBPF(testIssue)
fmt.Printf(" Response received: %s\n", result)
fmt.Println()
fmt.Println("✅ eBPF Integration Test Complete!")
fmt.Println(" The agent successfully:")
fmt.Println(" - Initialized eBPF manager")
fmt.Println(" - Integrated with diagnostic system")
fmt.Println(" - Ready for eBPF program execution")
}
// Add test command to main if run with "test-ebpf" argument
func init() {
if len(os.Args) > 1 && os.Args[1] == "test-ebpf" {
testEBPFIntegration()
os.Exit(0)
}
}

go.mod

@@ -5,8 +5,19 @@ go 1.23.0
toolchain go1.24.2
require (
github.com/cilium/ebpf v0.19.0
github.com/gorilla/websocket v1.5.3
github.com/joho/godotenv v1.5.1
github.com/sashabaranov/go-openai v1.32.0
github.com/shirou/gopsutil/v3 v3.24.5
)
require golang.org/x/sys v0.31.0 // indirect
require (
github.com/go-ole/go-ole v1.2.6 // indirect
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 // indirect
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c // indirect
github.com/shoenig/go-m1cpu v0.1.6 // indirect
github.com/tklauser/go-sysconf v0.3.12 // indirect
github.com/tklauser/numcpus v0.6.1 // indirect
github.com/yusufpapurcu/wmi v1.2.4 // indirect
golang.org/x/sys v0.31.0 // indirect
)

go.sum

@@ -1,28 +1,42 @@
github.com/cilium/ebpf v0.19.0 h1:Ro/rE64RmFBeA9FGjcTc+KmCeY6jXmryu6FfnzPRIao=
github.com/cilium/ebpf v0.19.0/go.mod h1:fLCgMo3l8tZmAdM3B2XqdFzXBpwkcSTroaVqN08OWVY=
github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6 h1:teYtXy9B7y5lHTp8V9KPxpYRAVA7dozigQcMiBust1s=
github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6/go.mod h1:p4lGIVX+8Wa6ZPNDvqcxq36XpUDLh42FLetFU7odllI=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/go-ole/go-ole v1.2.6 h1:/Fpf6oFPoeFik9ty7siob0G6Ke8QvQEuVcuChpwXzpY=
github.com/go-ole/go-ole v1.2.6/go.mod h1:pprOEPIfldk/42T2oK7lQ4v4JSDwmV0As9GaiUsvbm0=
github.com/google/go-cmp v0.5.6/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/josharian/native v1.1.0 h1:uuaP0hAbW7Y4l0ZRQ6C9zfb7Mg1mbFKry/xzDAfmtLA=
github.com/josharian/native v1.1.0/go.mod h1:7X/raswPFr05uY3HiLlYeyQntB6OO7E/d2Cu7qoaN2w=
github.com/jsimonetti/rtnetlink/v2 v2.0.1 h1:xda7qaHDSVOsADNouv7ukSuicKZO7GgVUCXxpaIEIlM=
github.com/jsimonetti/rtnetlink/v2 v2.0.1/go.mod h1:7MoNYNbb3UaDHtF8udiJo/RH6VsTKP1pqKLUTVCvToE=
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY=
github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE=
github.com/mdlayher/netlink v1.7.2 h1:/UtM3ofJap7Vl4QWCPDGXY8d3GIY2UGSDbK+QWmY8/g=
github.com/mdlayher/netlink v1.7.2/go.mod h1:xraEF7uJbxLhc5fpHL4cPe221LI2bdttWlU+ZGLfQSw=
github.com/mdlayher/socket v0.4.1 h1:eM9y2/jlbs1M615oshPQOHZzj6R6wMT7bX5NPiQvn2U=
github.com/mdlayher/socket v0.4.1/go.mod h1:cAqeGjoufqdxWkD7DkpyS+wcefOtmu5OQ8KuoJGIReA=
github.com/rogpeppe/go-internal v1.12.0 h1:exVL4IDcn6na9z1rAb56Vxr+CgyK3nn3O+epU5NdKM8=
github.com/rogpeppe/go-internal v1.12.0/go.mod h1:E+RYuTGaKKdloAfM02xzb0FW3Paa99yedzYV+kq4uf4=
github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg=
github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE=
github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0=
github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4=
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0 h1:6E+4a0GO5zZEnZ81pIr0yLvtUWk2if982qA3F3QD6H4=
github.com/lufia/plan9stats v0.0.0-20211012122336-39d0f177ccd0/go.mod h1:zJYVVT2jmtg6P3p1VtQj7WsuWi/y4VnjVBn7F8KPB3I=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c h1:ncq/mPwQF4JjgDlrVEn3C11VoGHZN7m8qihwgMEtzYw=
github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE=
github.com/sashabaranov/go-openai v1.32.0 h1:Yk3iE9moX3RBXxrof3OBtUBrE7qZR0zF9ebsoO4zVzI=
github.com/sashabaranov/go-openai v1.32.0/go.mod h1:lj5b/K+zjTSFxVLijLSTDZuP7adOgerWeFyZLUhAKRg=
golang.org/x/net v0.38.0 h1:vRMAPTMaeGqVhG5QyLJHqNDwecKTomGeqbnfZyKlBI8=
golang.org/x/net v0.38.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8=
golang.org/x/sync v0.1.0 h1:wsuoTGHzEhffawBOhz5CYhcrV4IdKZbEyZjBMuTp12o=
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
github.com/shirou/gopsutil/v3 v3.24.5 h1:i0t8kL+kQTvpAYToeuiVk3TgDeKOFioZO3Ztz/iZ9pI=
github.com/shirou/gopsutil/v3 v3.24.5/go.mod h1:bsoOS1aStSs9ErQ1WWfxllSeS1K5D+U30r2NfcubMVk=
github.com/shoenig/go-m1cpu v0.1.6 h1:nxdKQNcEB6vzgA2E2bvzKIYRuNj7XNJ4S/aRSwKzFtM=
github.com/shoenig/go-m1cpu v0.1.6/go.mod h1:1JJMcUBvfNwpq05QDQVAnx3gUHr9IYF7GNg9SUEw2VQ=
github.com/shoenig/test v0.6.4 h1:kVTaSd7WLz5WZ2IaoM0RSzRsUD+m8wRR+5qvntpn4LU=
github.com/shoenig/test v0.6.4/go.mod h1:byHiCGXqrVaflBLAMq/srcZIHynQPQgeyvkvXnjqq0k=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/tklauser/go-sysconf v0.3.12 h1:0QaGUFOdQaIVdPgfITYzaTegZvdCjmYO52cSFAEVmqU=
github.com/tklauser/go-sysconf v0.3.12/go.mod h1:Ho14jnntGE1fpdOqQEEaiKRpvIavV0hSfmBq8nJbHYI=
github.com/tklauser/numcpus v0.6.1 h1:ng9scYS7az0Bk4OZLvrNXNSAO2Pxr1XXRAPyjhIx+Fk=
github.com/tklauser/numcpus v0.6.1/go.mod h1:1XfjsgE2zo8GVw7POkMbHENHzVg3GzmoZ9fESEdAacY=
github.com/yusufpapurcu/wmi v1.2.4 h1:zFUKzehAFReQwLys1b/iSMl+JQGSCSjtVqQn9bBrPo0=
github.com/yusufpapurcu/wmi v1.2.4/go.mod h1:SBZ9tNy3G9/m5Oi98Zks0QjeHVDvuK0qfxQmPyzfmi0=
golang.org/x/sys v0.0.0-20190916202348-b4ddaad3f8a3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201204225414-ed752295db88/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.11.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik=
golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k=
golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=


@@ -1,85 +1,403 @@
#!/bin/bash
# Linux Diagnostic Agent Installation Script
# This script installs the nanny-agent on a Linux system
set -e
echo "🔧 Linux Diagnostic Agent Installation Script"
echo "=============================================="
# NannyAgent Installer Script
# Version: 0.0.1
# Description: Installs NannyAgent Linux diagnostic tool with eBPF capabilities
# Check if Go is installed
if ! command -v go &> /dev/null; then
echo "❌ Go is not installed. Please install Go first:"
VERSION="0.0.1"
INSTALL_DIR="/usr/local/bin"
CONFIG_DIR="/etc/nannyagent"
DATA_DIR="/var/lib/nannyagent"
BINARY_NAME="nannyagent"
LOCKFILE="${DATA_DIR}/.nannyagent.lock"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging functions
log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
# Check if running as root
check_root() {
if [ "$EUID" -ne 0 ]; then
log_error "This installer must be run as root"
log_info "Please run: sudo bash install.sh"
exit 1
fi
}
# Detect OS and architecture
detect_platform() {
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
log_info "Detected OS: $OS"
log_info "Detected Architecture: $ARCH"
# Check if OS is Linux
if [ "$OS" != "linux" ]; then
log_error "Unsupported operating system: $OS"
log_error "This installer only supports Linux"
exit 2
fi
# Check if architecture is supported (amd64 or arm64)
case "$ARCH" in
x86_64|amd64)
ARCH="amd64"
;;
aarch64|arm64)
ARCH="arm64"
;;
*)
log_error "Unsupported architecture: $ARCH"
log_error "Only amd64 (x86_64) and arm64 (aarch64) are supported"
exit 3
;;
esac
# Check if running in container/LXC
if [ -f /.dockerenv ] || grep -q docker /proc/1/cgroup 2>/dev/null; then
log_error "Container environment detected (Docker)"
log_error "NannyAgent does not support running inside containers or LXC"
exit 4
fi
if [ -f /proc/1/environ ] && grep -q "container=lxc" /proc/1/environ 2>/dev/null; then
log_error "LXC environment detected"
log_error "NannyAgent does not support running inside containers or LXC"
exit 4
fi
}
# Check kernel version (5.x or higher)
check_kernel_version() {
log_info "Checking kernel version..."
KERNEL_VERSION=$(uname -r)
KERNEL_MAJOR=$(echo "$KERNEL_VERSION" | cut -d. -f1)
log_info "Kernel version: $KERNEL_VERSION"
if [ "$KERNEL_MAJOR" -lt 5 ]; then
log_error "Kernel version $KERNEL_VERSION is not supported"
log_error "NannyAgent requires Linux kernel 5.x or higher"
log_error "Current kernel: $KERNEL_VERSION (major version: $KERNEL_MAJOR)"
exit 5
fi
log_success "Kernel version $KERNEL_VERSION is supported"
}
# Check if another instance is already installed
check_existing_installation() {
log_info "Checking for existing installation..."
# Check if lock file exists
if [ -f "$LOCKFILE" ]; then
log_error "An installation lock file exists at $LOCKFILE"
log_error "Another instance of NannyAgent may already be installed or running"
log_error "If you're sure no other instance exists, remove the lock file:"
log_error " sudo rm $LOCKFILE"
exit 6
fi
# Check if data directory exists and has files
if [ -d "$DATA_DIR" ]; then
FILE_COUNT=$(find "$DATA_DIR" -type f 2>/dev/null | wc -l)
if [ "$FILE_COUNT" -gt 0 ]; then
log_error "Data directory $DATA_DIR already exists with $FILE_COUNT files"
log_error "Another instance of NannyAgent may already be installed"
log_error "To reinstall, please remove the data directory first:"
log_error " sudo rm -rf $DATA_DIR"
exit 6
fi
fi
# Check if binary already exists
if [ -f "$INSTALL_DIR/$BINARY_NAME" ]; then
log_warning "Binary $INSTALL_DIR/$BINARY_NAME already exists"
log_warning "It will be replaced with the new version"
fi
log_success "No conflicting installation found"
}
# Install required dependencies (eBPF tools)
install_dependencies() {
log_info "Installing eBPF dependencies..."
# Detect package manager
if command -v apt-get &> /dev/null; then
PKG_MANAGER="apt-get"
log_info "Detected Debian/Ubuntu system"
# Update package list
log_info "Updating package list..."
apt-get update -qq || {
log_error "Failed to update package list"
exit 7
}
# Install bpfcc-tools and bpftrace
log_info "Installing bpfcc-tools and bpftrace..."
DEBIAN_FRONTEND=noninteractive apt-get install -y -qq bpfcc-tools bpftrace "linux-headers-$(uname -r)" >/dev/null || {
log_error "Failed to install eBPF tools"
exit 7
}
elif command -v dnf &> /dev/null; then
PKG_MANAGER="dnf"
log_info "Detected Fedora/RHEL 8+ system"
log_info "Installing bcc-tools and bpftrace..."
dnf install -y -q bcc-tools bpftrace kernel-devel >/dev/null || {
log_error "Failed to install eBPF tools"
exit 7
}
elif command -v yum &> /dev/null; then
PKG_MANAGER="yum"
log_info "Detected CentOS/RHEL 7 system"
log_info "Installing bcc-tools and bpftrace..."
yum install -y -q bcc-tools bpftrace kernel-devel >/dev/null || {
log_error "Failed to install eBPF tools"
exit 7
}
else
log_error "Unsupported package manager"
log_error "Please install 'bpfcc-tools' and 'bpftrace' manually"
exit 7
fi
# Verify installations
if ! command -v bpftrace &> /dev/null; then
log_error "bpftrace installation failed or not in PATH"
exit 7
fi
# Check for BCC tools (RedHat systems may have them in /usr/share/bcc/tools/)
if [ -d "/usr/share/bcc/tools" ]; then
log_info "BCC tools found at /usr/share/bcc/tools/"
# Add to PATH if not already there
if [[ ":$PATH:" != *":/usr/share/bcc/tools:"* ]]; then
export PATH="/usr/share/bcc/tools:$PATH"
log_info "Added /usr/share/bcc/tools to PATH"
fi
fi
log_success "eBPF tools installed successfully"
}
# Check Go installation
check_go() {
log_info "Checking for Go installation..."
if ! command -v go &> /dev/null; then
log_error "Go is not installed"
log_error "Please install Go 1.23 or higher from https://golang.org/dl/"
exit 8
fi
GO_VERSION=$(go version | awk '{print $3}' | sed 's/go//')
log_info "Go version: $GO_VERSION"
log_success "Go is installed"
}
# Build the binary
build_binary() {
log_info "Building NannyAgent binary for $ARCH architecture..."
# Check if go.mod exists
if [ ! -f "go.mod" ]; then
log_error "go.mod not found. Are you in the correct directory?"
exit 9
fi
# Get Go dependencies
log_info "Downloading Go dependencies..."
go mod download || {
log_error "Failed to download Go dependencies"
exit 9
}
# Build the binary for the current architecture
log_info "Compiling binary for $ARCH..."
CGO_ENABLED=0 GOOS=linux GOARCH="$ARCH" go build -a -installsuffix cgo \
-ldflags "-w -s -X main.Version=$VERSION" \
-o "$BINARY_NAME" . || {
log_error "Failed to build binary for $ARCH"
exit 9
}
# Verify binary was created
if [ ! -f "$BINARY_NAME" ]; then
log_error "Binary not found after build"
exit 9
fi
# Verify binary is executable
chmod +x "$BINARY_NAME"
# Test the binary
if ./"$BINARY_NAME" --version &>/dev/null; then
log_success "Binary built and tested successfully for $ARCH"
else
log_error "Binary build succeeded but execution test failed"
exit 9
fi
}
# Check connectivity to Supabase
check_connectivity() {
log_info "Checking connectivity to Supabase..."
# Load SUPABASE_PROJECT_URL from .env if it exists
if [ -f ".env" ]; then
source .env 2>/dev/null || true
fi
if [ -z "$SUPABASE_PROJECT_URL" ]; then
log_warning "SUPABASE_PROJECT_URL not set in .env file"
log_warning "The agent may not work without proper configuration"
log_warning "Please configure $CONFIG_DIR/config.env after installation"
return
fi
log_info "Testing connection to $SUPABASE_PROJECT_URL..."
# Try to reach the Supabase endpoint
if command -v curl &> /dev/null; then
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 "$SUPABASE_PROJECT_URL" || echo "000")
if [ "$HTTP_CODE" = "000" ]; then
log_warning "Cannot connect to $SUPABASE_PROJECT_URL"
log_warning "Network connectivity issue detected"
log_warning "The agent will not work without connectivity to Supabase"
log_warning "Please check your network configuration and firewall settings"
elif [ "$HTTP_CODE" = "404" ] || [ "$HTTP_CODE" = "200" ] || [ "$HTTP_CODE" = "301" ] || [ "$HTTP_CODE" = "302" ]; then
log_success "Successfully connected to Supabase (HTTP $HTTP_CODE)"
else
log_warning "Received HTTP $HTTP_CODE from $SUPABASE_PROJECT_URL"
log_warning "The agent may not work correctly"
fi
else
log_warning "curl not found, skipping connectivity check"
fi
}
# Create necessary directories
create_directories() {
log_info "Creating directories..."
# Create config directory
mkdir -p "$CONFIG_DIR" || {
log_error "Failed to create config directory: $CONFIG_DIR"
exit 10
}
# Create data directory with restricted permissions
mkdir -p "$DATA_DIR" || {
log_error "Failed to create data directory: $DATA_DIR"
exit 10
}
chmod 700 "$DATA_DIR"
log_success "Directories created successfully"
}
# Install the binary
install_binary() {
log_info "Installing binary to $INSTALL_DIR..."
# Copy binary
cp "$BINARY_NAME" "$INSTALL_DIR/$BINARY_NAME" || {
log_error "Failed to copy binary to $INSTALL_DIR"
exit 11
}
# Set permissions
chmod 755 "$INSTALL_DIR/$BINARY_NAME"
# Copy .env to config if it exists
if [ -f ".env" ]; then
log_info "Copying configuration to $CONFIG_DIR..."
cp .env "$CONFIG_DIR/config.env"
chmod 600 "$CONFIG_DIR/config.env"
fi
# Create lock file
touch "$LOCKFILE"
echo "Installed at $(date)" > "$LOCKFILE"
log_success "Binary installed successfully"
}
# Display post-installation information
post_install_info() {
echo ""
echo "For Ubuntu/Debian:"
echo " sudo apt update && sudo apt install golang-go"
log_success "NannyAgent v$VERSION installed successfully!"
echo ""
echo "For RHEL/CentOS/Fedora:"
echo " sudo dnf install golang"
echo " # or"
echo " sudo yum install golang"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
exit 1
fi
echo " Configuration: $CONFIG_DIR/config.env"
echo " Data Directory: $DATA_DIR"
echo " Binary Location: $INSTALL_DIR/$BINARY_NAME"
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
echo "Next steps:"
echo ""
echo " 1. Configure your Supabase URL in $CONFIG_DIR/config.env"
echo " 2. Run the agent: sudo $BINARY_NAME"
echo " 3. Check version: $BINARY_NAME --version"
echo " 4. Get help: $BINARY_NAME --help"
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
}
echo "✅ Go is installed: $(go version)"
# Main installation flow
main() {
echo ""
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo " NannyAgent Installer v$VERSION"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
echo ""
# Build the application
echo "🔨 Building the application..."
go mod tidy
make build
check_root
detect_platform
check_kernel_version
check_existing_installation
install_dependencies
check_go
build_binary
check_connectivity
create_directories
install_binary
post_install_info
}
# Check if build was successful
if [ ! -f "./nanny-agent" ]; then
echo "❌ Build failed! nanny-agent binary not found."
exit 1
fi
echo "✅ Build successful!"
# Ask for installation preference
echo ""
echo "Installation options:"
echo "1. Install system-wide (/usr/local/bin) - requires sudo"
echo "2. Keep in current directory"
echo ""
read -p "Choose option (1 or 2): " choice
case $choice in
1)
echo "📦 Installing system-wide..."
sudo cp nanny-agent /usr/local/bin/
sudo chmod +x /usr/local/bin/nanny-agent
echo "✅ Agent installed to /usr/local/bin/nanny-agent"
echo ""
echo "You can now run the agent from anywhere with:"
echo " nanny-agent"
;;
2)
echo "✅ Agent ready in current directory"
echo ""
echo "Run the agent with:"
echo " ./nanny-agent"
;;
*)
echo "❌ Invalid choice. Agent is available in current directory."
echo "Run with: ./nanny-agent"
;;
esac
# Configuration
echo ""
echo "📝 Configuration:"
echo "Set these environment variables to configure the agent:"
echo ""
echo "export NANNYAPI_ENDPOINT=\"http://your-nannyapi-host:3000/openai/v1\""
echo "export NANNYAPI_MODEL=\"your-model-identifier\""
echo ""
echo "Or create a .env file in the working directory."
echo ""
echo "🎉 Installation complete!"
echo ""
echo "Example usage:"
echo " ./nanny-agent"
echo " > On /var filesystem I cannot create any file but df -h shows 30% free space available."
# Run main installation
main


@@ -1,116 +0,0 @@
#!/bin/bash
# Linux Diagnostic Agent - Integration Tests
# This script creates realistic Linux problem scenarios for testing
set -e
AGENT_BINARY="./nanny-agent"
TEST_DIR="/tmp/nanny-agent-tests"
TEST_LOG="$TEST_DIR/integration_test.log"
# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Ensure test directory exists
mkdir -p "$TEST_DIR"
echo -e "${BLUE}🧪 Linux Diagnostic Agent - Integration Tests${NC}"
echo "================================================="
echo ""
# Check if agent binary exists
if [[ ! -f "$AGENT_BINARY" ]]; then
echo -e "${RED}❌ Agent binary not found at $AGENT_BINARY${NC}"
echo "Please run: make build"
exit 1
fi
# Function to run a test scenario
run_test() {
local test_name="$1"
local scenario="$2"
local expected_keywords="$3"
echo -e "${YELLOW}📋 Test: $test_name${NC}"
echo "Scenario: $scenario"
echo ""
# Run the agent with the scenario
echo "$scenario" | timeout 120s "$AGENT_BINARY" > "$TEST_LOG" 2>&1 || true
# Check if any expected keywords are found in the output
local found_keywords=0
IFS=',' read -ra KEYWORDS <<< "$expected_keywords"
for keyword in "${KEYWORDS[@]}"; do
keyword=$(echo "$keyword" | xargs) # trim whitespace
if grep -qi "$keyword" "$TEST_LOG"; then
echo -e "${GREEN} ✅ Found expected keyword: $keyword${NC}"
((found_keywords++))
else
echo -e "${RED} ❌ Missing keyword: $keyword${NC}"
fi
done
# Show summary
if [[ $found_keywords -gt 0 ]]; then
echo -e "${GREEN} ✅ Test PASSED ($found_keywords keywords found)${NC}"
else
echo -e "${RED} ❌ Test FAILED (no expected keywords found)${NC}"
fi
echo ""
echo "Full output saved to: $TEST_LOG"
echo "----------------------------------------"
echo ""
}
# Test Scenario 1: Disk Space Issues (Inode Exhaustion)
run_test "Disk Space - Inode Exhaustion" \
"I cannot create new files in /home directory even though df -h shows plenty of space available. Getting 'No space left on device' error when trying to touch new files." \
"inode,df -i,filesystem,inodes,exhausted"
# Test Scenario 2: Memory Issues
run_test "Memory Issues - OOM Killer" \
"My applications keep getting killed randomly and I see 'killed' messages in logs. The system becomes unresponsive for a few seconds before recovering. This happens especially when running memory-intensive tasks." \
"memory,oom,killed,dmesg,free,swap"
# Test Scenario 3: Network Connectivity Issues
run_test "Network Connectivity - DNS Resolution" \
"I can ping IP addresses directly (like 8.8.8.8) but cannot resolve domain names. Web browsing fails with DNS resolution errors, but ping 8.8.8.8 works fine." \
"dns,resolv.conf,nslookup,nameserver,dig"
# Test Scenario 4: Service/Process Issues
run_test "Service Issues - High Load" \
"System load average is consistently above 10.0 even when CPU usage appears normal. Applications are responding slowly and I notice high wait times. The server feels sluggish overall." \
"load,average,cpu,iostat,vmstat,processes"
# Test Scenario 5: File System Issues
run_test "Filesystem Issues - Permission Problems" \
"Web server returns 403 Forbidden errors for all pages. Files exist and seem readable, but nginx logs show permission denied errors. SELinux is disabled and file permissions look correct." \
"permission,403,nginx,chmod,chown,selinux"
# Test Scenario 6: Boot/System Issues
run_test "Boot Issues - Kernel Module" \
"System boots but some hardware devices are not working. Network interface shows as down, USB devices are not recognized, and dmesg shows module loading failures." \
"module,lsmod,dmesg,hardware,interface,usb"
# Test Scenario 7: Performance Issues
run_test "Performance Issues - I/O Bottleneck" \
"Database queries are extremely slow, taking 30+ seconds for simple SELECT statements. Disk activity LED is constantly on and system feels unresponsive during database operations." \
"iostat,iotop,disk,database,slow,performance"
echo -e "${BLUE}🏁 Integration Tests Complete${NC}"
echo ""
echo "Check individual test logs in: $TEST_DIR"
echo ""
echo -e "${YELLOW}💡 Tips:${NC}"
echo "- Tests use realistic scenarios that could occur on production systems"
echo "- Each test expects the AI to suggest relevant diagnostic commands"
echo "- Review the full logs to see the complete diagnostic conversation"
echo "- Tests timeout after 120 seconds to prevent hanging"
echo "- Make sure NANNYAPI_ENDPOINT and NANNYAPI_MODEL are set correctly"

internal/auth/auth.go Normal file

@@ -0,0 +1,510 @@
package auth
import (
"bytes"
"encoding/base64"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"strings"
"time"
"nannyagentv2/internal/config"
"nannyagentv2/internal/logging"
"nannyagentv2/internal/types"
)
const (
// Token storage location (secure directory)
TokenStorageDir = "/var/lib/nannyagent"
TokenStorageFile = ".agent_token.json"
RefreshTokenFile = ".refresh_token"
// Polling configuration
MaxPollAttempts = 60 // 5 minutes (60 * 5 seconds)
PollInterval = 5 * time.Second
)
// AuthManager handles all authentication-related operations
type AuthManager struct {
config *config.Config
client *http.Client
}
// NewAuthManager creates a new authentication manager
func NewAuthManager(cfg *config.Config) *AuthManager {
return &AuthManager{
config: cfg,
client: &http.Client{
Timeout: 30 * time.Second,
},
}
}
// EnsureTokenStorageDir creates the token storage directory if it doesn't exist
func (am *AuthManager) EnsureTokenStorageDir() error {
// Check if running as root
if os.Geteuid() != 0 {
return fmt.Errorf("must run as root to create secure token storage directory")
}
// Create directory with restricted permissions (0700 - only root can access)
if err := os.MkdirAll(TokenStorageDir, 0700); err != nil {
return fmt.Errorf("failed to create token storage directory: %w", err)
}
return nil
}
// StartDeviceAuthorization initiates the OAuth device authorization flow
func (am *AuthManager) StartDeviceAuthorization() (*types.DeviceAuthResponse, error) {
payload := map[string]interface{}{
"client_id": "nannyagent-cli",
"scope": []string{"agent:register"},
}
jsonData, err := json.Marshal(payload)
if err != nil {
return nil, fmt.Errorf("failed to marshal payload: %w", err)
}
url := fmt.Sprintf("%s/device/authorize", am.config.DeviceAuthURL)
req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
if err != nil {
return nil, fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := am.client.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to start device authorization: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read response body: %w", err)
}
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("device authorization failed with status %d: %s", resp.StatusCode, string(body))
}
var deviceResp types.DeviceAuthResponse
if err := json.Unmarshal(body, &deviceResp); err != nil {
return nil, fmt.Errorf("failed to parse response: %w", err)
}
return &deviceResp, nil
}
// PollForToken polls the token endpoint until authorization is complete
func (am *AuthManager) PollForToken(deviceCode string) (*types.TokenResponse, error) {
logging.Info("Waiting for user authorization...")
for attempts := 0; attempts < MaxPollAttempts; attempts++ {
tokenReq := types.TokenRequest{
GrantType: "urn:ietf:params:oauth:grant-type:device_code",
DeviceCode: deviceCode,
}
jsonData, err := json.Marshal(tokenReq)
if err != nil {
return nil, fmt.Errorf("failed to marshal token request: %w", err)
}
url := fmt.Sprintf("%s/token", am.config.DeviceAuthURL)
req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
if err != nil {
return nil, fmt.Errorf("failed to create token request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := am.client.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to poll for token: %w", err)
}
body, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
return nil, fmt.Errorf("failed to read token response: %w", err)
}
var tokenResp types.TokenResponse
if err := json.Unmarshal(body, &tokenResp); err != nil {
return nil, fmt.Errorf("failed to parse token response: %w", err)
}
if tokenResp.Error != "" {
if tokenResp.Error == "authorization_pending" {
fmt.Print(".")
time.Sleep(PollInterval)
continue
}
return nil, fmt.Errorf("authorization failed: %s", tokenResp.ErrorDescription)
}
if tokenResp.AccessToken != "" {
logging.Info("Authorization successful!")
return &tokenResp, nil
}
time.Sleep(PollInterval)
}
return nil, fmt.Errorf("authorization timed out after %d attempts", MaxPollAttempts)
}
// RefreshAccessToken refreshes an expired access token using the refresh token
func (am *AuthManager) RefreshAccessToken(refreshToken string) (*types.TokenResponse, error) {
tokenReq := types.TokenRequest{
GrantType: "refresh_token",
RefreshToken: refreshToken,
}
jsonData, err := json.Marshal(tokenReq)
if err != nil {
return nil, fmt.Errorf("failed to marshal refresh request: %w", err)
}
url := fmt.Sprintf("%s/token", am.config.DeviceAuthURL)
req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonData))
if err != nil {
return nil, fmt.Errorf("failed to create refresh request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
resp, err := am.client.Do(req)
if err != nil {
return nil, fmt.Errorf("failed to refresh token: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read refresh response: %w", err)
}
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("token refresh failed with status %d: %s", resp.StatusCode, string(body))
}
var tokenResp types.TokenResponse
if err := json.Unmarshal(body, &tokenResp); err != nil {
return nil, fmt.Errorf("failed to parse refresh response: %w", err)
}
if tokenResp.Error != "" {
return nil, fmt.Errorf("token refresh failed: %s", tokenResp.ErrorDescription)
}
return &tokenResp, nil
}
// SaveToken saves the authentication token to secure local storage
func (am *AuthManager) SaveToken(token *types.AuthToken) error {
if err := am.EnsureTokenStorageDir(); err != nil {
return fmt.Errorf("failed to ensure token storage directory: %w", err)
}
// Save main token file
tokenPath := am.getTokenPath()
jsonData, err := json.MarshalIndent(token, "", " ")
if err != nil {
return fmt.Errorf("failed to marshal token: %w", err)
}
if err := os.WriteFile(tokenPath, jsonData, 0600); err != nil {
return fmt.Errorf("failed to save token: %w", err)
}
// Also save refresh token separately for backup recovery
if token.RefreshToken != "" {
refreshTokenPath := filepath.Join(TokenStorageDir, RefreshTokenFile)
if err := os.WriteFile(refreshTokenPath, []byte(token.RefreshToken), 0600); err != nil {
// Don't fail if refresh token backup fails, just log
logging.Warning("Failed to save backup refresh token: %v", err)
}
}
return nil
}

// LoadToken loads the authentication token from secure local storage
func (am *AuthManager) LoadToken() (*types.AuthToken, error) {
tokenPath := am.getTokenPath()
data, err := os.ReadFile(tokenPath)
if err != nil {
return nil, fmt.Errorf("failed to read token file: %w", err)
}
var token types.AuthToken
if err := json.Unmarshal(data, &token); err != nil {
return nil, fmt.Errorf("failed to parse token: %w", err)
}
// Check if token is expired
if time.Now().After(token.ExpiresAt.Add(-5 * time.Minute)) {
return nil, fmt.Errorf("token is expired or expiring soon")
}
return &token, nil
}
// IsTokenExpired checks if a token needs refresh
func (am *AuthManager) IsTokenExpired(token *types.AuthToken) bool {
// Consider token expired if it expires within the next 5 minutes
return time.Now().After(token.ExpiresAt.Add(-5 * time.Minute))
}
// RegisterDevice performs the complete device registration flow
func (am *AuthManager) RegisterDevice() (*types.AuthToken, error) {
// Step 1: Start device authorization
deviceAuth, err := am.StartDeviceAuthorization()
if err != nil {
return nil, fmt.Errorf("failed to start device authorization: %w", err)
}
logging.Info("Please visit: %s", deviceAuth.VerificationURI)
logging.Info("And enter code: %s", deviceAuth.UserCode)
// Step 2: Poll for token
tokenResp, err := am.PollForToken(deviceAuth.DeviceCode)
if err != nil {
return nil, fmt.Errorf("failed to get token: %w", err)
}
// Step 3: Create token storage
token := &types.AuthToken{
AccessToken: tokenResp.AccessToken,
RefreshToken: tokenResp.RefreshToken,
TokenType: tokenResp.TokenType,
ExpiresAt: time.Now().Add(time.Duration(tokenResp.ExpiresIn) * time.Second),
AgentID: tokenResp.AgentID,
}
// Step 4: Save token
if err := am.SaveToken(token); err != nil {
return nil, fmt.Errorf("failed to save token: %w", err)
}
return token, nil
}
// EnsureAuthenticated ensures the agent has a valid token, refreshing if necessary
func (am *AuthManager) EnsureAuthenticated() (*types.AuthToken, error) {
// Try to load existing token
token, err := am.LoadToken()
if err == nil && !am.IsTokenExpired(token) {
return token, nil
}
// Try to refresh with existing refresh token (even if access token is missing/expired)
var refreshToken string
if err == nil && token.RefreshToken != "" {
// Use refresh token from loaded token
refreshToken = token.RefreshToken
} else {
// Try to load refresh token from main token file even if load failed
if existingToken, loadErr := am.loadTokenIgnoringExpiry(); loadErr == nil && existingToken.RefreshToken != "" {
refreshToken = existingToken.RefreshToken
} else {
// Try to load refresh token from backup file
if backupRefreshToken, backupErr := am.loadRefreshTokenFromBackup(); backupErr == nil {
refreshToken = backupRefreshToken
logging.Debug("Found backup refresh token, attempting to use it...")
}
}
}
if refreshToken != "" {
logging.Debug("Attempting to refresh access token...")
refreshResp, refreshErr := am.RefreshAccessToken(refreshToken)
if refreshErr == nil {
// Get existing agent_id from current token or backup
var agentID string
if err == nil && token.AgentID != "" {
agentID = token.AgentID
} else if existingToken, loadErr := am.loadTokenIgnoringExpiry(); loadErr == nil {
agentID = existingToken.AgentID
}
// Create new token with refreshed values
newToken := &types.AuthToken{
AccessToken: refreshResp.AccessToken,
RefreshToken: refreshToken, // Keep existing refresh token
TokenType: refreshResp.TokenType,
ExpiresAt: time.Now().Add(time.Duration(refreshResp.ExpiresIn) * time.Second),
AgentID: agentID, // Preserve agent_id
}
// Update refresh token if a new one was provided
if refreshResp.RefreshToken != "" {
newToken.RefreshToken = refreshResp.RefreshToken
}
if saveErr := am.SaveToken(newToken); saveErr == nil {
return newToken, nil
}
} else {
fmt.Printf("⚠️ Token refresh failed: %v\n", refreshErr)
}
}
fmt.Println("📝 Initiating new device registration...")
return am.RegisterDevice()
}
// loadTokenIgnoringExpiry loads token file without checking expiry
func (am *AuthManager) loadTokenIgnoringExpiry() (*types.AuthToken, error) {
tokenPath := am.getTokenPath()
data, err := os.ReadFile(tokenPath)
if err != nil {
return nil, fmt.Errorf("failed to read token file: %w", err)
}
var token types.AuthToken
if err := json.Unmarshal(data, &token); err != nil {
return nil, fmt.Errorf("failed to parse token: %w", err)
}
return &token, nil
}
// loadRefreshTokenFromBackup tries to load refresh token from backup file
func (am *AuthManager) loadRefreshTokenFromBackup() (string, error) {
refreshTokenPath := filepath.Join(TokenStorageDir, RefreshTokenFile)
data, err := os.ReadFile(refreshTokenPath)
if err != nil {
return "", fmt.Errorf("failed to read refresh token backup: %w", err)
}
refreshToken := strings.TrimSpace(string(data))
if refreshToken == "" {
return "", fmt.Errorf("refresh token backup is empty")
}
return refreshToken, nil
}
// GetCurrentAgentID retrieves the agent ID from cache or JWT token
func (am *AuthManager) GetCurrentAgentID() (string, error) {
// First try to read from local cache
agentID, err := am.loadCachedAgentID()
if err == nil && agentID != "" {
return agentID, nil
}
// Cache miss - extract from JWT token and cache it
token, err := am.LoadToken()
if err != nil {
return "", fmt.Errorf("failed to load token: %w", err)
}
// Extract agent ID from JWT 'sub' field
agentID, err = am.extractAgentIDFromJWT(token.AccessToken)
if err != nil {
return "", fmt.Errorf("failed to extract agent ID from JWT: %w", err)
}
// Cache the agent ID for future use
if err := am.cacheAgentID(agentID); err != nil {
// Log warning but don't fail - we still have the agent ID
fmt.Printf("Warning: Failed to cache agent ID: %v\n", err)
}
return agentID, nil
}
// extractAgentIDFromJWT decodes the JWT token and extracts the agent ID from 'sub' field
func (am *AuthManager) extractAgentIDFromJWT(tokenString string) (string, error) {
// Basic JWT decoding without verification (since we trust Supabase)
parts := strings.Split(tokenString, ".")
if len(parts) != 3 {
return "", fmt.Errorf("invalid JWT token format")
}
// Decode the payload (second part)
payload := parts[1]
// Add padding if needed for base64 decoding
for len(payload)%4 != 0 {
payload += "="
}
decoded, err := base64.URLEncoding.DecodeString(payload)
if err != nil {
return "", fmt.Errorf("failed to decode JWT payload: %w", err)
}
// Parse JSON payload
var claims map[string]interface{}
if err := json.Unmarshal(decoded, &claims); err != nil {
return "", fmt.Errorf("failed to parse JWT claims: %w", err)
}
// The agent ID is in the 'sub' field (subject)
if agentID, ok := claims["sub"].(string); ok && agentID != "" {
return agentID, nil
}
return "", fmt.Errorf("agent ID (sub) not found in JWT claims")
}
// loadCachedAgentID reads the cached agent ID from local storage
func (am *AuthManager) loadCachedAgentID() (string, error) {
agentIDPath := filepath.Join(TokenStorageDir, "agent_id")
data, err := os.ReadFile(agentIDPath)
if err != nil {
return "", fmt.Errorf("failed to read cached agent ID: %w", err)
}
agentID := strings.TrimSpace(string(data))
if agentID == "" {
return "", fmt.Errorf("cached agent ID is empty")
}
return agentID, nil
}
// cacheAgentID stores the agent ID in local cache
func (am *AuthManager) cacheAgentID(agentID string) error {
// Ensure the directory exists
if err := am.EnsureTokenStorageDir(); err != nil {
return fmt.Errorf("failed to ensure storage directory: %w", err)
}
agentIDPath := filepath.Join(TokenStorageDir, "agent_id")
// Write agent ID to file with secure permissions
if err := os.WriteFile(agentIDPath, []byte(agentID), 0600); err != nil {
return fmt.Errorf("failed to write agent ID cache: %w", err)
}
return nil
}
func (am *AuthManager) getTokenPath() string {
if am.config.TokenPath != "" {
return am.config.TokenPath
}
return filepath.Join(TokenStorageDir, TokenStorageFile)
}
func getHostname() string {
if hostname, err := os.Hostname(); err == nil {
return hostname
}
return "unknown"
}

157
internal/config/config.go Normal file
View File

@@ -0,0 +1,157 @@
package config
import (
"fmt"
"os"
"path/filepath"
"strings"
"nannyagentv2/internal/logging"
"github.com/joho/godotenv"
)
type Config struct {
// Supabase Configuration
SupabaseProjectURL string
// Edge Function Endpoints (auto-generated from SupabaseProjectURL)
DeviceAuthURL string
AgentAuthURL string
// Agent Configuration
TokenPath string
MetricsInterval int
// Debug/Development
Debug bool
}
var DefaultConfig = Config{
TokenPath: "./token.json",
MetricsInterval: 30,
Debug: false,
}
// LoadConfig loads configuration from environment variables and .env file
func LoadConfig() (*Config, error) {
config := DefaultConfig
// Priority order for loading configuration:
// 1. /etc/nannyagent/config.env (system-wide installation)
// 2. Current directory .env file (development)
// 3. Parent directory .env file (development)
configLoaded := false
// Try system-wide config first
if _, err := os.Stat("/etc/nannyagent/config.env"); err == nil {
if err := godotenv.Load("/etc/nannyagent/config.env"); err != nil {
logging.Warning("Could not load /etc/nannyagent/config.env: %v", err)
} else {
logging.Info("Loaded configuration from /etc/nannyagent/config.env")
configLoaded = true
}
}
// If system config not found, try local .env file
if !configLoaded {
envFile := findEnvFile()
if envFile != "" {
if err := godotenv.Load(envFile); err != nil {
logging.Warning("Could not load .env file from %s: %v", envFile, err)
} else {
logging.Info("Loaded configuration from %s", envFile)
configLoaded = true
}
}
}
if !configLoaded {
logging.Warning("No configuration file found. Using environment variables only.")
}
// Load from environment variables
if url := os.Getenv("SUPABASE_PROJECT_URL"); url != "" {
config.SupabaseProjectURL = url
}
if tokenPath := os.Getenv("TOKEN_PATH"); tokenPath != "" {
config.TokenPath = tokenPath
}
if debug := os.Getenv("DEBUG"); debug == "true" || debug == "1" {
config.Debug = true
}
// Auto-generate edge function URLs from project URL
if config.SupabaseProjectURL != "" {
config.DeviceAuthURL = fmt.Sprintf("%s/functions/v1/device-auth", config.SupabaseProjectURL)
config.AgentAuthURL = fmt.Sprintf("%s/functions/v1/agent-auth-api", config.SupabaseProjectURL)
}
// Validate required configuration
if err := config.Validate(); err != nil {
return nil, fmt.Errorf("configuration validation failed: %w", err)
}
return &config, nil
}
// Validate checks if all required configuration is present
func (c *Config) Validate() error {
var missing []string
if c.SupabaseProjectURL == "" {
missing = append(missing, "SUPABASE_PROJECT_URL")
}
if c.DeviceAuthURL == "" {
missing = append(missing, "DEVICE_AUTH_URL (or SUPABASE_PROJECT_URL)")
}
if c.AgentAuthURL == "" {
missing = append(missing, "AGENT_AUTH_URL (or SUPABASE_PROJECT_URL)")
}
if len(missing) > 0 {
return fmt.Errorf("missing required environment variables: %s", strings.Join(missing, ", "))
}
return nil
}
// findEnvFile looks for .env file in current directory and parent directories
func findEnvFile() string {
dir, err := os.Getwd()
if err != nil {
return ""
}
for {
envPath := filepath.Join(dir, ".env")
if _, err := os.Stat(envPath); err == nil {
return envPath
}
parent := filepath.Dir(dir)
if parent == dir {
break
}
dir = parent
}
return ""
}
// PrintConfig prints the current configuration (masking sensitive values)
func (c *Config) PrintConfig() {
if !c.Debug {
return
}
logging.Debug("Configuration:")
logging.Debug(" Supabase Project URL: %s", c.SupabaseProjectURL)
logging.Debug(" Metrics Interval: %d seconds", c.MetricsInterval)
logging.Debug(" Debug: %v", c.Debug)
}
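For completeness, a `.env` file that satisfies `Validate` might look like this (the URL is a placeholder; `DEVICE_AUTH_URL` and `AGENT_AUTH_URL` are derived from it automatically):

```
SUPABASE_PROJECT_URL=https://your-project.supabase.co
TOKEN_PATH=./token.json
DEBUG=true
```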

View File

@@ -0,0 +1,343 @@
package ebpf
import (
"bufio"
"io"
"regexp"
"strconv"
"strings"
"time"
)
// EventScanner parses bpftrace output and converts it to TraceEvent structs
type EventScanner struct {
scanner *bufio.Scanner
lastEvent *TraceEvent
lineRegex *regexp.Regexp
}
// NewEventScanner creates a new event scanner for parsing bpftrace output
func NewEventScanner(reader io.Reader) *EventScanner {
// Regex pattern to match our trace output format:
// TRACE|timestamp|pid|tid|comm|function|message
pattern := `^TRACE\|(\d+)\|(\d+)\|(\d+)\|([^|]+)\|([^|]+)\|(.*)$`
regex := regexp.MustCompile(pattern)
return &EventScanner{
scanner: bufio.NewScanner(reader),
lineRegex: regex,
}
}
// Scan advances the scanner to the next event
func (es *EventScanner) Scan() bool {
for es.scanner.Scan() {
line := strings.TrimSpace(es.scanner.Text())
// Skip empty lines and non-trace lines
if line == "" || !strings.HasPrefix(line, "TRACE|") {
continue
}
// Parse the trace line
if event := es.parseLine(line); event != nil {
es.lastEvent = event
return true
}
}
return false
}
// Event returns the most recently parsed event
func (es *EventScanner) Event() *TraceEvent {
return es.lastEvent
}
// Error returns any scanning error
func (es *EventScanner) Error() error {
return es.scanner.Err()
}
// parseLine parses a single trace line into a TraceEvent
func (es *EventScanner) parseLine(line string) *TraceEvent {
matches := es.lineRegex.FindStringSubmatch(line)
if len(matches) != 7 {
return nil
}
// Parse timestamp (nanoseconds)
timestamp, err := strconv.ParseInt(matches[1], 10, 64)
if err != nil {
return nil
}
// Parse PID
pid, err := strconv.Atoi(matches[2])
if err != nil {
return nil
}
// Parse TID
tid, err := strconv.Atoi(matches[3])
if err != nil {
return nil
}
// Extract process name, function, and message
processName := strings.TrimSpace(matches[4])
function := strings.TrimSpace(matches[5])
message := strings.TrimSpace(matches[6])
event := &TraceEvent{
Timestamp: timestamp,
PID: pid,
TID: tid,
ProcessName: processName,
Function: function,
Message: message,
RawArgs: make(map[string]string),
}
// Try to extract additional information from the message
es.enrichEvent(event, message)
return event
}
// enrichEvent extracts additional information from the message
func (es *EventScanner) enrichEvent(event *TraceEvent, message string) {
// Parse common patterns in messages to extract arguments
// This is a simplified version - in a real implementation you'd want more sophisticated parsing
// Look for patterns like "arg1=value, arg2=value"
argPattern := regexp.MustCompile(`(\w+)=([^,\s]+)`)
matches := argPattern.FindAllStringSubmatch(message, -1)
for _, match := range matches {
if len(match) == 3 {
event.RawArgs[match[1]] = match[2]
}
}
// Look for numeric patterns that might be syscall arguments
numberPattern := regexp.MustCompile(`\b(\d+)\b`)
numbers := numberPattern.FindAllString(message, -1)
for i, num := range numbers {
argName := "arg" + strconv.Itoa(i+1)
event.RawArgs[argName] = num
}
}
// TraceEventFilter provides filtering capabilities for trace events
type TraceEventFilter struct {
MinTimestamp int64
MaxTimestamp int64
ProcessNames []string
PIDs []int
UIDs []int
Functions []string
MessageFilter string
}
// ApplyFilter applies filters to a slice of events
func (filter *TraceEventFilter) ApplyFilter(events []TraceEvent) []TraceEvent {
if filter == nil {
return events
}
var filtered []TraceEvent
for _, event := range events {
if filter.matchesEvent(&event) {
filtered = append(filtered, event)
}
}
return filtered
}
// matchesEvent checks if an event matches the filter criteria
func (filter *TraceEventFilter) matchesEvent(event *TraceEvent) bool {
// Check timestamp range
if filter.MinTimestamp > 0 && event.Timestamp < filter.MinTimestamp {
return false
}
if filter.MaxTimestamp > 0 && event.Timestamp > filter.MaxTimestamp {
return false
}
// Check process names
if len(filter.ProcessNames) > 0 {
found := false
for _, name := range filter.ProcessNames {
if strings.Contains(event.ProcessName, name) {
found = true
break
}
}
if !found {
return false
}
}
// Check PIDs
if len(filter.PIDs) > 0 {
found := false
for _, pid := range filter.PIDs {
if event.PID == pid {
found = true
break
}
}
if !found {
return false
}
}
// Check UIDs
if len(filter.UIDs) > 0 {
found := false
for _, uid := range filter.UIDs {
if event.UID == uid {
found = true
break
}
}
if !found {
return false
}
}
// Check functions
if len(filter.Functions) > 0 {
found := false
for _, function := range filter.Functions {
if strings.Contains(event.Function, function) {
found = true
break
}
}
if !found {
return false
}
}
// Check message filter
if filter.MessageFilter != "" {
if !strings.Contains(event.Message, filter.MessageFilter) {
return false
}
}
return true
}
// TraceEventAggregator provides aggregation capabilities for trace events
type TraceEventAggregator struct {
events []TraceEvent
}
// NewTraceEventAggregator creates a new event aggregator
func NewTraceEventAggregator(events []TraceEvent) *TraceEventAggregator {
return &TraceEventAggregator{
events: events,
}
}
// CountByProcess returns event counts grouped by process
func (agg *TraceEventAggregator) CountByProcess() map[string]int {
counts := make(map[string]int)
for _, event := range agg.events {
counts[event.ProcessName]++
}
return counts
}
// CountByFunction returns event counts grouped by function
func (agg *TraceEventAggregator) CountByFunction() map[string]int {
counts := make(map[string]int)
for _, event := range agg.events {
counts[event.Function]++
}
return counts
}
// CountByPID returns event counts grouped by PID
func (agg *TraceEventAggregator) CountByPID() map[int]int {
counts := make(map[int]int)
for _, event := range agg.events {
counts[event.PID]++
}
return counts
}
// GetTimeRange returns the time range of events
func (agg *TraceEventAggregator) GetTimeRange() (int64, int64) {
if len(agg.events) == 0 {
return 0, 0
}
minTime := agg.events[0].Timestamp
maxTime := agg.events[0].Timestamp
for _, event := range agg.events {
if event.Timestamp < minTime {
minTime = event.Timestamp
}
if event.Timestamp > maxTime {
maxTime = event.Timestamp
}
}
return minTime, maxTime
}
// GetEventRate calculates events per second
func (agg *TraceEventAggregator) GetEventRate() float64 {
if len(agg.events) < 2 {
return 0
}
minTime, maxTime := agg.GetTimeRange()
durationNs := maxTime - minTime
durationSeconds := float64(durationNs) / float64(time.Second)
if durationSeconds == 0 {
return 0
}
return float64(len(agg.events)) / durationSeconds
}
// GetTopProcesses returns the most active processes
func (agg *TraceEventAggregator) GetTopProcesses(limit int) []ProcessStat {
counts := agg.CountByProcess()
total := len(agg.events)
var stats []ProcessStat
for processName, count := range counts {
percentage := float64(count) / float64(total) * 100
stats = append(stats, ProcessStat{
ProcessName: processName,
EventCount: count,
Percentage: percentage,
})
}
// Sort by descending event count (a simple selection sort; fine for small result sets)
for i := 0; i < len(stats); i++ {
for j := i + 1; j < len(stats); j++ {
if stats[j].EventCount > stats[i].EventCount {
stats[i], stats[j] = stats[j], stats[i]
}
}
}
if limit > 0 && limit < len(stats) {
stats = stats[:limit]
}
return stats
}

View File

@@ -0,0 +1,587 @@
package ebpf
import (
"context"
"fmt"
"io"
"os"
"os/exec"
"strings"
"sync"
"time"
"nannyagentv2/internal/logging"
)
// TraceSpec represents a trace specification similar to BCC trace.py
type TraceSpec struct {
// Probe type: "p" (kprobe), "r" (kretprobe), "t" (tracepoint), "u" (uprobe)
ProbeType string `json:"probe_type"`
// Target function/syscall/tracepoint
Target string `json:"target"`
// Library for userspace probes (empty for kernel)
Library string `json:"library,omitempty"`
// Format string for output (e.g., "read %d bytes", arg3)
Format string `json:"format"`
// Arguments to extract (e.g., ["arg1", "arg2", "retval"])
Arguments []string `json:"arguments"`
// Filter condition (e.g., "arg3 > 20000")
Filter string `json:"filter,omitempty"`
// Duration in seconds
Duration int `json:"duration"`
// Process ID filter (optional)
PID int `json:"pid,omitempty"`
// Thread ID filter (optional)
TID int `json:"tid,omitempty"`
// UID filter (optional)
UID int `json:"uid,omitempty"`
// Process name filter (optional)
ProcessName string `json:"process_name,omitempty"`
}
// TraceEvent represents a captured event from eBPF
type TraceEvent struct {
Timestamp int64 `json:"timestamp"`
PID int `json:"pid"`
TID int `json:"tid"`
UID int `json:"uid"`
ProcessName string `json:"process_name"`
Function string `json:"function"`
Message string `json:"message"`
RawArgs map[string]string `json:"raw_args"`
CPU int `json:"cpu,omitempty"`
}
// TraceResult represents the results of a tracing session
type TraceResult struct {
TraceID string `json:"trace_id"`
Spec TraceSpec `json:"spec"`
Events []TraceEvent `json:"events"`
EventCount int `json:"event_count"`
StartTime time.Time `json:"start_time"`
EndTime time.Time `json:"end_time"`
Summary string `json:"summary"`
Statistics TraceStats `json:"statistics"`
}
// TraceStats provides statistics about the trace
type TraceStats struct {
TotalEvents int `json:"total_events"`
EventsByProcess map[string]int `json:"events_by_process"`
EventsByUID map[int]int `json:"events_by_uid"`
EventsPerSecond float64 `json:"events_per_second"`
TopProcesses []ProcessStat `json:"top_processes"`
}
// ProcessStat represents statistics for a process
type ProcessStat struct {
ProcessName string `json:"process_name"`
PID int `json:"pid"`
EventCount int `json:"event_count"`
Percentage float64 `json:"percentage"`
}
// BCCTraceManager implements advanced eBPF tracing similar to BCC trace.py
type BCCTraceManager struct {
traces map[string]*RunningTrace
tracesLock sync.RWMutex
traceCounter int
capabilities map[string]bool
}
// RunningTrace represents an active trace session
type RunningTrace struct {
ID string
Spec TraceSpec
Process *exec.Cmd
Events []TraceEvent
StartTime time.Time
Cancel context.CancelFunc
Context context.Context
Done chan struct{} // Signal when trace monitoring is complete
}
// NewBCCTraceManager creates a new BCC-style trace manager
func NewBCCTraceManager() *BCCTraceManager {
manager := &BCCTraceManager{
traces: make(map[string]*RunningTrace),
capabilities: make(map[string]bool),
}
manager.testCapabilities()
return manager
}
// testCapabilities checks what tracing capabilities are available
func (tm *BCCTraceManager) testCapabilities() {
// Test if bpftrace is available
if _, err := exec.LookPath("bpftrace"); err == nil {
tm.capabilities["bpftrace"] = true
} else {
tm.capabilities["bpftrace"] = false
}
// Test if perf is available for fallback
if _, err := exec.LookPath("perf"); err == nil {
tm.capabilities["perf"] = true
} else {
tm.capabilities["perf"] = false
}
// Test root privileges (required for eBPF)
tm.capabilities["root_access"] = os.Geteuid() == 0
// Test kernel version
cmd := exec.Command("uname", "-r")
output, err := cmd.Output()
if err == nil {
version := strings.TrimSpace(string(output))
// Crude check: eBPF requires kernel 4.4+; treat anything that is not a 3.x kernel as capable
tm.capabilities["kernel_ebpf"] = !strings.HasPrefix(version, "3.")
} else {
tm.capabilities["kernel_ebpf"] = false
}
// Test if we can access debugfs
if _, err := os.Stat("/sys/kernel/debug/tracing/available_events"); err == nil {
tm.capabilities["debugfs_access"] = true
} else {
tm.capabilities["debugfs_access"] = false
}
logging.Debug("BCC Trace capabilities: %+v", tm.capabilities)
}
// GetCapabilities returns available tracing capabilities
func (tm *BCCTraceManager) GetCapabilities() map[string]bool {
tm.tracesLock.RLock()
defer tm.tracesLock.RUnlock()
caps := make(map[string]bool)
for k, v := range tm.capabilities {
caps[k] = v
}
return caps
}
// StartTrace starts a new trace session based on the specification
func (tm *BCCTraceManager) StartTrace(spec TraceSpec) (string, error) {
if !tm.capabilities["bpftrace"] {
return "", fmt.Errorf("bpftrace not available - install bpftrace package")
}
if !tm.capabilities["root_access"] {
return "", fmt.Errorf("root access required for eBPF tracing")
}
if !tm.capabilities["kernel_ebpf"] {
return "", fmt.Errorf("kernel version does not support eBPF")
}
tm.tracesLock.Lock()
defer tm.tracesLock.Unlock()
// Generate trace ID
tm.traceCounter++
traceID := fmt.Sprintf("trace_%d", tm.traceCounter)
// Generate bpftrace script
script, err := tm.generateBpftraceScript(spec)
if err != nil {
return "", fmt.Errorf("failed to generate bpftrace script: %w", err)
}
// Debug: log the generated script
logging.Debug("Generated bpftrace script for %s:\n%s", spec.Target, script)
// Create context with timeout
ctx, cancel := context.WithTimeout(context.Background(), time.Duration(spec.Duration)*time.Second)
// Start bpftrace process
cmd := exec.CommandContext(ctx, "bpftrace", "-e", script)
// Create stdout pipe BEFORE starting
stdout, err := cmd.StdoutPipe()
if err != nil {
cancel()
return "", fmt.Errorf("failed to create stdout pipe: %w", err)
}
trace := &RunningTrace{
ID: traceID,
Spec: spec,
Process: cmd,
Events: []TraceEvent{},
StartTime: time.Now(),
Cancel: cancel,
Context: ctx,
Done: make(chan struct{}), // Initialize completion signal
}
// Start the trace
if err := cmd.Start(); err != nil {
cancel()
return "", fmt.Errorf("failed to start bpftrace: %w", err)
}
tm.traces[traceID] = trace
// Monitor the trace in a goroutine
go tm.monitorTrace(traceID, stdout)
logging.Debug("Started BCC-style trace %s for target %s", traceID, spec.Target)
return traceID, nil
}

// generateBpftraceScript generates a bpftrace script based on the trace specification
func (tm *BCCTraceManager) generateBpftraceScript(spec TraceSpec) (string, error) {
var script strings.Builder
// Build probe specification
var probe string
switch spec.ProbeType {
case "p", "": // kprobe (default)
probe = fmt.Sprintf("kprobe:%s", spec.Target)
case "r": // kretprobe
probe = fmt.Sprintf("kretprobe:%s", spec.Target)
case "t": // tracepoint
// If target already includes tracepoint prefix, use as-is
if strings.HasPrefix(spec.Target, "tracepoint:") {
probe = spec.Target
} else {
probe = fmt.Sprintf("tracepoint:%s", spec.Target)
}
case "u": // uprobe
if spec.Library == "" {
return "", fmt.Errorf("library required for uprobe")
}
probe = fmt.Sprintf("uprobe:%s:%s", spec.Library, spec.Target)
default:
return "", fmt.Errorf("unsupported probe type: %s", spec.ProbeType)
}
// Add BEGIN block
script.WriteString("BEGIN {\n")
script.WriteString(fmt.Sprintf(" printf(\"Starting trace for %s...\\n\");\n", spec.Target))
script.WriteString("}\n\n")
// Build the main probe
script.WriteString(fmt.Sprintf("%s {\n", probe))
// Add filters if specified
if tm.needsFiltering(spec) {
script.WriteString(" if (")
filters := tm.buildFilters(spec)
script.WriteString(strings.Join(filters, " && "))
script.WriteString(") {\n")
}
// Build output format
outputFormat := tm.buildOutputFormat(spec)
script.WriteString(fmt.Sprintf(" printf(\"%s\\n\"", outputFormat))
// Add arguments
args := tm.buildArgumentList(spec)
if len(args) > 0 {
script.WriteString(", ")
script.WriteString(strings.Join(args, ", "))
}
script.WriteString(");\n")
// Close filter if block
if tm.needsFiltering(spec) {
script.WriteString(" }\n")
}
script.WriteString("}\n\n")
// Add END block
script.WriteString("END {\n")
script.WriteString(fmt.Sprintf(" printf(\"Trace completed for %s\\n\");\n", spec.Target))
script.WriteString("}\n")
return script.String(), nil
}
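As an illustration of what `generateBpftraceScript` emits, a spec with the default probe type, `Target` set to `sys_read`, and no filters or extra arguments would produce a script shaped roughly like this (reconstructed from the builder above, not captured output):

```
BEGIN {
  printf("Starting trace for sys_read...\n");
}

kprobe:sys_read {
  printf("TRACE|%d|%d|%d|%s|sys_read|called\n", nsecs, pid, tid, comm);
}

END {
  printf("Trace completed for sys_read\n");
}
```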
// needsFiltering checks if any filters are needed
func (tm *BCCTraceManager) needsFiltering(spec TraceSpec) bool {
return spec.PID != 0 || spec.TID != 0 || spec.UID > 0 ||
spec.ProcessName != "" || spec.Filter != ""
}
// buildFilters builds the filter conditions
func (tm *BCCTraceManager) buildFilters(spec TraceSpec) []string {
var filters []string
if spec.PID != 0 {
filters = append(filters, fmt.Sprintf("pid == %d", spec.PID))
}
if spec.TID != 0 {
filters = append(filters, fmt.Sprintf("tid == %d", spec.TID))
}
// The struct zero value (0) means "no UID filter"; filtering on root (uid 0) is therefore not supported
if spec.UID > 0 {
filters = append(filters, fmt.Sprintf("uid == %d", spec.UID))
}
if spec.ProcessName != "" {
filters = append(filters, fmt.Sprintf("strncmp(comm, \"%s\", %d) == 0", spec.ProcessName, len(spec.ProcessName)))
}
// Pass custom filter expressions through verbatim; they must already be valid bpftrace syntax
if spec.Filter != "" {
filters = append(filters, spec.Filter)
}
return filters
}
// buildOutputFormat creates the output format string
func (tm *BCCTraceManager) buildOutputFormat(spec TraceSpec) string {
if spec.Format != "" {
// Use custom format
return fmt.Sprintf("TRACE|%%d|%%d|%%d|%%s|%s|%s", spec.Target, spec.Format)
}
// Default format
return fmt.Sprintf("TRACE|%%d|%%d|%%d|%%s|%s|called", spec.Target)
}
// buildArgumentList creates the argument list for printf
func (tm *BCCTraceManager) buildArgumentList(spec TraceSpec) []string {
// Always include timestamp, pid, tid, comm
args := []string{"nsecs", "pid", "tid", "comm"}
// Add custom arguments
for _, arg := range spec.Arguments {
switch arg {
case "arg1", "arg2", "arg3", "arg4", "arg5", "arg6":
args = append(args, arg)
case "retval":
args = append(args, "retval")
case "cpu":
args = append(args, "cpu")
default:
// Custom expression
args = append(args, arg)
}
}
return args
}
// monitorTrace monitors a running trace and collects events
func (tm *BCCTraceManager) monitorTrace(traceID string, stdout io.ReadCloser) {
tm.tracesLock.Lock()
trace, exists := tm.traces[traceID]
if !exists {
tm.tracesLock.Unlock()
return
}
tm.tracesLock.Unlock()
// Start reading output in a goroutine
go func() {
scanner := NewEventScanner(stdout)
for scanner.Scan() {
event := scanner.Event()
if event != nil {
tm.tracesLock.Lock()
if t, exists := tm.traces[traceID]; exists {
t.Events = append(t.Events, *event)
}
tm.tracesLock.Unlock()
}
}
stdout.Close()
}()
// Wait for the process to complete
err := trace.Process.Wait()
// Clean up
trace.Cancel()
tm.tracesLock.Lock()
if err != nil && err.Error() != "signal: killed" {
logging.Warning("Trace %s completed with error: %v", traceID, err)
} else {
logging.Debug("Trace %s completed successfully with %d events",
traceID, len(trace.Events))
}
// Signal that monitoring is complete
close(trace.Done)
tm.tracesLock.Unlock()
}
// GetTraceResult returns the results of a completed trace
func (tm *BCCTraceManager) GetTraceResult(traceID string) (*TraceResult, error) {
tm.tracesLock.RLock()
trace, exists := tm.traces[traceID]
if !exists {
tm.tracesLock.RUnlock()
return nil, fmt.Errorf("trace %s not found", traceID)
}
tm.tracesLock.RUnlock()
// Wait for trace monitoring to complete
select {
case <-trace.Done:
// Trace monitoring completed
case <-time.After(5 * time.Second):
// Timeout waiting for completion
return nil, fmt.Errorf("timeout waiting for trace %s to complete", traceID)
}
// Now safely read the final results
tm.tracesLock.RLock()
defer tm.tracesLock.RUnlock()
result := &TraceResult{
TraceID: traceID,
Spec: trace.Spec,
Events: make([]TraceEvent, len(trace.Events)),
EventCount: len(trace.Events),
StartTime: trace.StartTime,
EndTime: time.Now(),
}
copy(result.Events, trace.Events)
// Calculate statistics
result.Statistics = tm.calculateStatistics(result.Events, result.EndTime.Sub(result.StartTime))
// Generate summary
result.Summary = tm.generateSummary(result)
return result, nil
}
// calculateStatistics calculates statistics for the trace results
func (tm *BCCTraceManager) calculateStatistics(events []TraceEvent, duration time.Duration) TraceStats {
stats := TraceStats{
TotalEvents: len(events),
EventsByProcess: make(map[string]int),
EventsByUID: make(map[int]int),
}
if duration > 0 {
stats.EventsPerSecond = float64(len(events)) / duration.Seconds()
}
// Calculate per-process and per-UID statistics
for _, event := range events {
stats.EventsByProcess[event.ProcessName]++
stats.EventsByUID[event.UID]++
}
// Calculate top processes
for processName, count := range stats.EventsByProcess {
percentage := float64(count) / float64(len(events)) * 100
stats.TopProcesses = append(stats.TopProcesses, ProcessStat{
ProcessName: processName,
EventCount: count,
Percentage: percentage,
})
}
// Sort descending by event count so TopProcesses[0] is genuinely the
// busiest process; Go map iteration order is random otherwise.
// (Requires the "sort" package in this file's imports.)
sort.Slice(stats.TopProcesses, func(i, j int) bool {
return stats.TopProcesses[i].EventCount > stats.TopProcesses[j].EventCount
})
return stats
}
// generateSummary generates a human-readable summary
func (tm *BCCTraceManager) generateSummary(result *TraceResult) string {
duration := result.EndTime.Sub(result.StartTime)
summary := fmt.Sprintf("Traced %s for %v, captured %d events (%.2f events/sec)",
result.Spec.Target, duration, result.EventCount, result.Statistics.EventsPerSecond)
if len(result.Statistics.TopProcesses) > 0 {
summary += fmt.Sprintf(", top process: %s (%d events)",
result.Statistics.TopProcesses[0].ProcessName,
result.Statistics.TopProcesses[0].EventCount)
}
return summary
}
// StopTrace stops an active trace
func (tm *BCCTraceManager) StopTrace(traceID string) error {
tm.tracesLock.Lock()
defer tm.tracesLock.Unlock()
trace, exists := tm.traces[traceID]
if !exists {
return fmt.Errorf("trace %s not found", traceID)
}
if trace.Process.ProcessState == nil {
// Process is still running, kill it
if err := trace.Process.Process.Kill(); err != nil {
return fmt.Errorf("failed to stop trace: %w", err)
}
}
trace.Cancel()
return nil
}
// ListActiveTraces returns a list of active trace IDs
func (tm *BCCTraceManager) ListActiveTraces() []string {
tm.tracesLock.RLock()
defer tm.tracesLock.RUnlock()
var active []string
for id, trace := range tm.traces {
if trace.Process.ProcessState == nil {
active = append(active, id)
}
}
return active
}
// GetSummary returns a summary of the trace manager state
func (tm *BCCTraceManager) GetSummary() map[string]interface{} {
tm.tracesLock.RLock()
defer tm.tracesLock.RUnlock()
activeCount := 0
completedCount := 0
for _, trace := range tm.traces {
if trace.Process.ProcessState == nil {
activeCount++
} else {
completedCount++
}
}
return map[string]interface{}{
"capabilities": tm.capabilities,
"active_traces": activeCount,
"completed_traces": completedCount,
"total_traces": len(tm.traces),
"active_trace_ids": tm.ListActiveTraces(),
}
}


@@ -0,0 +1,396 @@
package ebpf
import (
"encoding/json"
"fmt"
"strings"
)
// TestTraceSpecs provides test trace specifications for unit testing the BCC-style tracing
// These are used to validate the tracing functionality without requiring remote API calls
var TestTraceSpecs = map[string]TraceSpec{
// Basic system call tracing for testing
"test_sys_open": {
ProbeType: "p",
Target: "__x64_sys_openat",
Format: "opening file: %s",
Arguments: []string{"arg2@user"}, // filename
Duration: 5, // Short duration for testing
},
"test_sys_read": {
ProbeType: "p",
Target: "__x64_sys_read",
Format: "read %d bytes from fd %d",
Arguments: []string{"arg3", "arg1"}, // count, fd
Filter: "arg3 > 100", // Only reads >100 bytes for testing
Duration: 5,
},
"test_sys_write": {
ProbeType: "p",
Target: "__x64_sys_write",
Format: "write %d bytes to fd %d",
Arguments: []string{"arg3", "arg1"}, // count, fd
Duration: 5,
},
"test_process_creation": {
ProbeType: "p",
Target: "__x64_sys_execve",
Format: "exec: %s",
Arguments: []string{"arg1@user"}, // filename
Duration: 5,
},
// Test with different probe types
"test_kretprobe": {
ProbeType: "r",
Target: "__x64_sys_openat",
Format: "open returned: %d",
Arguments: []string{"retval"},
Duration: 5,
},
"test_with_filter": {
ProbeType: "p",
Target: "__x64_sys_write",
Format: "stdout write: %d bytes",
Arguments: []string{"arg3"},
Filter: "arg1 == 1", // Only stdout writes
Duration: 5,
},
}
// GetTestSpec returns a pre-defined test trace specification
func GetTestSpec(name string) (TraceSpec, bool) {
spec, exists := TestTraceSpecs[name]
return spec, exists
}
// ListTestSpecs returns all available test trace specifications
func ListTestSpecs() map[string]string {
descriptions := map[string]string{
"test_sys_open": "Test file open operations",
"test_sys_read": "Test read operations (>100 bytes)",
"test_sys_write": "Test write operations",
"test_process_creation": "Test process execution",
"test_kretprobe": "Test kretprobe on file open",
"test_with_filter": "Test filtered writes to stdout",
}
return descriptions
}
// TraceSpecBuilder helps build custom trace specifications
type TraceSpecBuilder struct {
spec TraceSpec
}
// NewTraceSpecBuilder creates a new trace specification builder
func NewTraceSpecBuilder() *TraceSpecBuilder {
return &TraceSpecBuilder{
spec: TraceSpec{
ProbeType: "p", // Default to kprobe
Duration: 30, // Default 30 seconds
},
}
}
// Kprobe sets up a kernel probe
func (b *TraceSpecBuilder) Kprobe(function string) *TraceSpecBuilder {
b.spec.ProbeType = "p"
b.spec.Target = function
return b
}
// Kretprobe sets up a kernel return probe
func (b *TraceSpecBuilder) Kretprobe(function string) *TraceSpecBuilder {
b.spec.ProbeType = "r"
b.spec.Target = function
return b
}
// Tracepoint sets up a tracepoint
func (b *TraceSpecBuilder) Tracepoint(category, name string) *TraceSpecBuilder {
b.spec.ProbeType = "t"
b.spec.Target = fmt.Sprintf("%s:%s", category, name)
return b
}
// Uprobe sets up a userspace probe
func (b *TraceSpecBuilder) Uprobe(library, function string) *TraceSpecBuilder {
b.spec.ProbeType = "u"
b.spec.Library = library
b.spec.Target = function
return b
}
// Format sets the output format string
func (b *TraceSpecBuilder) Format(format string, args ...string) *TraceSpecBuilder {
b.spec.Format = format
b.spec.Arguments = args
return b
}
// Filter adds a filter condition
func (b *TraceSpecBuilder) Filter(condition string) *TraceSpecBuilder {
b.spec.Filter = condition
return b
}
// Duration sets the trace duration in seconds
func (b *TraceSpecBuilder) Duration(seconds int) *TraceSpecBuilder {
b.spec.Duration = seconds
return b
}
// PID filters by process ID
func (b *TraceSpecBuilder) PID(pid int) *TraceSpecBuilder {
b.spec.PID = pid
return b
}
// UID filters by user ID
func (b *TraceSpecBuilder) UID(uid int) *TraceSpecBuilder {
b.spec.UID = uid
return b
}
// ProcessName filters by process name
func (b *TraceSpecBuilder) ProcessName(name string) *TraceSpecBuilder {
b.spec.ProcessName = name
return b
}
// Build returns the constructed trace specification
func (b *TraceSpecBuilder) Build() TraceSpec {
return b.spec
}
// TraceSpecParser parses trace specifications from various formats
type TraceSpecParser struct{}
// NewTraceSpecParser creates a new parser
func NewTraceSpecParser() *TraceSpecParser {
return &TraceSpecParser{}
}
// ParseFromBCCStyle parses BCC trace.py style specifications
// Examples:
//
// "sys_open" -> trace sys_open syscall
// "p::do_sys_open" -> kprobe on do_sys_open
// "r::do_sys_open" -> kretprobe on do_sys_open
// "t:syscalls:sys_enter_open" -> tracepoint
// "sys_read (arg3 > 1024)" -> with filter
// "sys_read \"read %d bytes\", arg3" -> with format
func (p *TraceSpecParser) ParseFromBCCStyle(spec string) (TraceSpec, error) {
result := TraceSpec{
ProbeType: "p",
Duration: 30,
}
// Split by quotes to separate format string
parts := strings.Split(spec, "\"")
// strings.Split always returns at least one element.
probeSpec := strings.TrimSpace(parts[0])
var formatPart string
if len(parts) >= 2 {
formatPart = parts[1]
}
var argsPart string
if len(parts) >= 3 {
argsPart = strings.TrimSpace(parts[2])
if strings.HasPrefix(argsPart, ",") {
argsPart = strings.TrimSpace(argsPart[1:])
}
}
// Parse probe specification
if err := p.parseProbeSpec(probeSpec, &result); err != nil {
return result, err
}
// Parse format string
if formatPart != "" {
result.Format = formatPart
}
// Parse arguments
if argsPart != "" {
result.Arguments = p.parseArguments(argsPart)
}
return result, nil
}
// parseProbeSpec parses the probe specification part
func (p *TraceSpecParser) parseProbeSpec(spec string, result *TraceSpec) error {
// Handle filter conditions in parentheses
if idx := strings.Index(spec, "("); idx != -1 {
filterEnd := strings.LastIndex(spec, ")")
if filterEnd > idx {
result.Filter = strings.TrimSpace(spec[idx+1 : filterEnd])
spec = strings.TrimSpace(spec[:idx])
}
}
// Parse probe type and target
if strings.Contains(spec, ":") {
parts := strings.SplitN(spec, ":", 3)
if len(parts) >= 1 && parts[0] != "" {
switch parts[0] {
case "p":
result.ProbeType = "p"
case "r":
result.ProbeType = "r"
case "t":
result.ProbeType = "t"
case "u":
result.ProbeType = "u"
default:
return fmt.Errorf("unsupported probe type: %s", parts[0])
}
}
if len(parts) >= 3 {
if result.ProbeType == "t" {
// Keep tracepoints in the "category:name" form expected by
// ValidateTraceSpec and the Tracepoint builder.
result.Target = parts[1] + ":" + parts[2]
} else {
result.Library = parts[1]
result.Target = parts[2]
}
} else if len(parts) == 2 {
result.Target = parts[1]
result.Library = ""
}
} else {
// Simple function name
result.Target = spec
// Auto-detect syscall format
if strings.HasPrefix(spec, "sys_") && !strings.HasPrefix(spec, "__x64_sys_") {
result.Target = "__x64_sys_" + spec[4:]
}
}
return nil
}
// parseArguments parses the arguments part
func (p *TraceSpecParser) parseArguments(args string) []string {
var result []string
// Split by comma and clean up
parts := strings.Split(args, ",")
for _, part := range parts {
arg := strings.TrimSpace(part)
if arg != "" {
result = append(result, arg)
}
}
return result
}
// ParseFromJSON parses trace specification from JSON
func (p *TraceSpecParser) ParseFromJSON(jsonData []byte) (TraceSpec, error) {
var spec TraceSpec
err := json.Unmarshal(jsonData, &spec)
return spec, err
}
// GetCommonSpec returns a pre-defined test trace specification, kept under
// this name for backward compatibility with the older trace_* spec names.
func GetCommonSpec(name string) (TraceSpec, bool) {
// Map old names to new test names for compatibility
testName := name
if strings.HasPrefix(name, "trace_") {
testName = strings.Replace(name, "trace_", "test_", 1)
}
spec, exists := TestTraceSpecs[testName]
return spec, exists
}
// ListCommonSpecs returns all available test trace specifications, kept for
// backward compatibility.
func ListCommonSpecs() map[string]string {
return ListTestSpecs()
}
// ValidateTraceSpec validates a trace specification
func ValidateTraceSpec(spec TraceSpec) error {
if spec.Target == "" {
return fmt.Errorf("target function/syscall is required")
}
if spec.Duration <= 0 {
return fmt.Errorf("duration must be positive")
}
if spec.Duration > 600 { // 10 minutes max
return fmt.Errorf("duration too long (max 600 seconds)")
}
switch spec.ProbeType {
case "p", "r", "t", "u":
// Valid probe types
case "":
// Default to kprobe
default:
return fmt.Errorf("unsupported probe type: %s", spec.ProbeType)
}
if spec.ProbeType == "u" && spec.Library == "" {
return fmt.Errorf("library required for userspace probes")
}
if spec.ProbeType == "t" && !strings.Contains(spec.Target, ":") {
return fmt.Errorf("tracepoint requires format 'category:name'")
}
return nil
}
// SuggestSyscallTargets suggests syscall targets based on the issue description
func SuggestSyscallTargets(issueDescription string) []string {
description := strings.ToLower(issueDescription)
var suggestions []string
// File I/O issues
if strings.Contains(description, "file") || strings.Contains(description, "disk") || strings.Contains(description, "io") {
suggestions = append(suggestions, "trace_sys_open", "trace_sys_read", "trace_sys_write", "trace_sys_unlink")
}
// Network issues
if strings.Contains(description, "network") || strings.Contains(description, "socket") || strings.Contains(description, "connection") {
suggestions = append(suggestions, "trace_sys_connect", "trace_sys_socket", "trace_sys_bind", "trace_sys_accept")
}
// Process issues
if strings.Contains(description, "process") || strings.Contains(description, "crash") || strings.Contains(description, "exec") {
suggestions = append(suggestions, "trace_sys_execve", "trace_sys_clone", "trace_sys_exit", "trace_sys_kill")
}
// Memory issues
if strings.Contains(description, "memory") || strings.Contains(description, "malloc") || strings.Contains(description, "leak") {
suggestions = append(suggestions, "trace_sys_mmap", "trace_sys_brk")
}
// Performance issues - trace common syscalls
if strings.Contains(description, "slow") || strings.Contains(description, "performance") || strings.Contains(description, "hang") {
suggestions = append(suggestions, "trace_sys_read", "trace_sys_write", "trace_sys_connect", "trace_sys_mmap")
}
// If no specific suggestions, provide general monitoring
if len(suggestions) == 0 {
suggestions = append(suggestions, "trace_sys_execve", "trace_sys_open", "trace_sys_connect")
}
return suggestions
}


@@ -0,0 +1,921 @@
package ebpf
import (
"encoding/json"
"fmt"
"os"
"strings"
"testing"
"time"
)
// TestBCCTracing demonstrates and tests the new BCC-style tracing functionality
// This test documents the expected behavior and response format of the agent
func TestBCCTracing(t *testing.T) {
fmt.Println("=== BCC-Style eBPF Tracing Unit Tests ===")
fmt.Println()
// Test 1: List available test specifications
t.Run("ListTestSpecs", func(t *testing.T) {
specs := ListTestSpecs()
fmt.Printf("📋 Available Test Specifications:\n")
for name, description := range specs {
fmt.Printf(" - %s: %s\n", name, description)
}
fmt.Println()
if len(specs) == 0 {
t.Error("No test specifications available")
}
})
// Test 2: Parse BCC-style specifications
t.Run("ParseBCCStyle", func(t *testing.T) {
parser := NewTraceSpecParser()
testCases := []struct {
input string
expected string
}{
{
input: "sys_open",
expected: "__x64_sys_open",
},
{
input: "p::do_sys_open",
expected: "do_sys_open",
},
{
input: "r::sys_read",
expected: "sys_read",
},
{
input: "sys_write (arg1 == 1)",
expected: "__x64_sys_write",
},
}
fmt.Printf("🔍 Testing BCC-style parsing:\n")
for _, tc := range testCases {
spec, err := parser.ParseFromBCCStyle(tc.input)
if err != nil {
t.Errorf("Failed to parse '%s': %v", tc.input, err)
continue
}
fmt.Printf(" Input: '%s' -> Target: '%s', Type: '%s'\n",
tc.input, spec.Target, spec.ProbeType)
if spec.Target != tc.expected {
t.Errorf("Expected target '%s', got '%s'", tc.expected, spec.Target)
}
}
fmt.Println()
})
// Test 3: Validate trace specifications
t.Run("ValidateSpecs", func(t *testing.T) {
fmt.Printf("✅ Testing trace specification validation:\n")
// Valid spec
validSpec := TraceSpec{
ProbeType: "p",
Target: "__x64_sys_openat",
Format: "opening file",
Duration: 5,
}
if err := ValidateTraceSpec(validSpec); err != nil {
t.Errorf("Valid spec failed validation: %v", err)
} else {
fmt.Printf(" ✓ Valid specification passed\n")
}
// Invalid spec - no target
invalidSpec := TraceSpec{
ProbeType: "p",
Duration: 5,
}
if err := ValidateTraceSpec(invalidSpec); err == nil {
t.Error("Invalid spec (no target) should have failed validation")
} else {
fmt.Printf(" ✓ Invalid specification correctly rejected: %s\n", err.Error())
}
fmt.Println()
})
// Test 4: Simulate agent response format
t.Run("SimulateAgentResponse", func(t *testing.T) {
fmt.Printf("🤖 Simulating agent response for BCC-style tracing:\n")
// Get a test specification
testSpec, exists := GetTestSpec("test_sys_open")
if !exists {
t.Fatal("test_sys_open specification not found")
}
// Simulate what the agent would return
mockResponse := simulateTraceExecution(testSpec)
// Print the response format
responseJSON, _ := json.MarshalIndent(mockResponse, "", " ")
fmt.Printf(" Expected Response Format:\n%s\n", string(responseJSON))
// Validate response structure
if mockResponse["success"] != true {
t.Error("Expected successful trace execution")
}
if mockResponse["type"] != "bcc_trace" {
t.Error("Expected type to be 'bcc_trace'")
}
events, hasEvents := mockResponse["events"].([]TraceEvent)
if !hasEvents || len(events) == 0 {
t.Error("Expected trace events in response")
}
fmt.Println()
})
// Test 5: Test different probe types
t.Run("TestProbeTypes", func(t *testing.T) {
fmt.Printf("🔬 Testing different probe types:\n")
probeTests := []struct {
specName string
expected string
}{
{"test_sys_open", "kprobe"},
{"test_kretprobe", "kretprobe"},
{"test_with_filter", "kprobe with filter"},
}
for _, test := range probeTests {
spec, exists := GetTestSpec(test.specName)
if !exists {
t.Errorf("Test spec '%s' not found", test.specName)
continue
}
response := simulateTraceExecution(spec)
fmt.Printf(" %s -> %s: %d events captured\n",
test.specName, test.expected, response["event_count"])
}
fmt.Println()
})
// Test 6: Test trace spec builder
t.Run("TestTraceSpecBuilder", func(t *testing.T) {
fmt.Printf("🏗️ Testing trace specification builder:\n")
// Build a custom trace spec
spec := NewTraceSpecBuilder().
Kprobe("__x64_sys_write").
Format("write syscall: %d bytes", "arg3").
Filter("arg1 == 1").
Duration(3).
Build()
fmt.Printf(" Built spec: Target=%s, Format=%s, Filter=%s\n",
spec.Target, spec.Format, spec.Filter)
if spec.Target != "__x64_sys_write" {
t.Error("Builder failed to set target correctly")
}
if spec.ProbeType != "p" {
t.Error("Builder failed to set probe type correctly")
}
fmt.Println()
})
}
// simulateTraceExecution simulates what the agent would return for a trace execution
// This documents the expected response format from the agent
func simulateTraceExecution(spec TraceSpec) map[string]interface{} {
// Simulate some trace events
events := []TraceEvent{
{
Timestamp: time.Now().Unix(),
PID: 1234,
TID: 1234,
ProcessName: "test_process",
Function: spec.Target,
Message: fmt.Sprintf(spec.Format, "test_file.txt"),
RawArgs: map[string]string{
"arg1": "5",
"arg2": "test_file.txt",
"arg3": "1024",
},
},
{
Timestamp: time.Now().Unix(),
PID: 5678,
TID: 5678,
ProcessName: "another_process",
Function: spec.Target,
Message: fmt.Sprintf(spec.Format, "data.log"),
RawArgs: map[string]string{
"arg1": "3",
"arg2": "data.log",
"arg3": "512",
},
},
}
// Simulate trace statistics
stats := TraceStats{
TotalEvents: len(events),
EventsByProcess: map[string]int{"test_process": 1, "another_process": 1},
EventsByUID: map[int]int{1000: 2},
EventsPerSecond: float64(len(events)) / float64(spec.Duration),
TopProcesses: []ProcessStat{
{ProcessName: "test_process", EventCount: 1, Percentage: 50.0},
{ProcessName: "another_process", EventCount: 1, Percentage: 50.0},
},
}
// Return the expected agent response format
return map[string]interface{}{
"name": spec.Target,
"type": "bcc_trace",
"target": spec.Target,
"duration": spec.Duration,
"description": fmt.Sprintf("Traced %s for %d seconds", spec.Target, spec.Duration),
"status": "completed",
"success": true,
"event_count": len(events),
"events": events,
"statistics": stats,
"data_points": len(events),
"probe_type": spec.ProbeType,
"format": spec.Format,
"filter": spec.Filter,
}
}
// TestTraceManagerCapabilities tests the trace manager capabilities
func TestTraceManagerCapabilities(t *testing.T) {
fmt.Println("=== BCC Trace Manager Capabilities Test ===")
fmt.Println()
manager := NewBCCTraceManager()
caps := manager.GetCapabilities()
fmt.Printf("🔧 Trace Manager Capabilities:\n")
for capability, available := range caps {
status := "❌ Not Available"
if available {
status = "✅ Available"
}
fmt.Printf(" %s: %s\n", capability, status)
}
fmt.Println()
// Check essential capabilities
if !caps["kernel_ebpf"] {
fmt.Printf("⚠️ Warning: Kernel eBPF support not detected\n")
}
if !caps["bpftrace"] {
fmt.Printf("⚠️ Warning: bpftrace not available (install with: apt install bpftrace)\n")
}
if !caps["root_access"] {
fmt.Printf("⚠️ Warning: Root access required for eBPF tracing\n")
}
}
// BenchmarkTraceSpecParsing benchmarks the trace specification parsing
func BenchmarkTraceSpecParsing(b *testing.B) {
parser := NewTraceSpecParser()
testInput := "sys_open \"opening %s\", arg2@user"
b.ResetTimer()
for i := 0; i < b.N; i++ {
_, err := parser.ParseFromBCCStyle(testInput)
if err != nil {
b.Fatal(err)
}
}
}
// TestSyscallSuggestions tests the syscall suggestion functionality
func TestSyscallSuggestions(t *testing.T) {
fmt.Println("=== Syscall Suggestion Test ===")
fmt.Println()
testCases := []struct {
issue string
expected int // minimum expected suggestions
description string
}{
{
issue: "file not found error",
expected: 1,
description: "File I/O issue should suggest file-related syscalls",
},
{
issue: "network connection timeout",
expected: 1,
description: "Network issue should suggest network syscalls",
},
{
issue: "process crashes randomly",
expected: 1,
description: "Process issue should suggest process-related syscalls",
},
{
issue: "memory leak detected",
expected: 1,
description: "Memory issue should suggest memory syscalls",
},
{
issue: "application is slow",
expected: 1,
description: "Performance issue should suggest monitoring syscalls",
},
}
fmt.Printf("💡 Testing syscall suggestions:\n")
for _, tc := range testCases {
suggestions := SuggestSyscallTargets(tc.issue)
fmt.Printf(" Issue: '%s' -> %d suggestions: %v\n",
tc.issue, len(suggestions), suggestions)
if len(suggestions) < tc.expected {
t.Errorf("Expected at least %d suggestions for '%s', got %d",
tc.expected, tc.issue, len(suggestions))
}
}
fmt.Println()
}
// TestMain runs the tests and provides a summary
func TestMain(m *testing.M) {
fmt.Println("🚀 Starting BCC-Style eBPF Tracing Tests")
fmt.Println("========================================")
fmt.Println()
// Run capability check first
manager := NewBCCTraceManager()
caps := manager.GetCapabilities()
if !caps["kernel_ebpf"] {
fmt.Println("⚠️ Kernel eBPF support not detected - some tests may be limited")
}
if !caps["bpftrace"] {
fmt.Println("⚠️ bpftrace not available - install with: sudo apt install bpftrace")
}
if !caps["root_access"] {
fmt.Println("⚠️ Root access required for actual eBPF tracing")
}
fmt.Println()
// Run the tests
code := m.Run()
fmt.Println()
fmt.Println("========================================")
if code == 0 {
fmt.Println("✅ All BCC-Style eBPF Tracing Tests Passed!")
} else {
fmt.Println("❌ Some tests failed")
}
os.Exit(code)
}
// TestBCCTraceManagerRootTest tests the actual BCC trace manager with root privileges
// This test requires root access and will only run meaningful tests when root
func TestBCCTraceManagerRootTest(t *testing.T) {
fmt.Println("=== BCC Trace Manager Root Test ===")
// Check if running as root
if os.Geteuid() != 0 {
t.Skip("⚠️ Skipping root test - not running as root (use: sudo go test -run TestBCCTraceManagerRootTest)")
return
}
fmt.Println("✅ Running as root - can test actual eBPF functionality")
// Test 1: Create BCC trace manager and check capabilities
manager := NewBCCTraceManager()
caps := manager.GetCapabilities()
fmt.Printf("🔍 BCC Trace Manager Capabilities:\n")
for capability, available := range caps {
status := "❌"
if available {
status = "✅"
}
fmt.Printf(" %s %s: %v\n", status, capability, available)
}
// Require essential capabilities
if !caps["bpftrace"] {
t.Fatal("❌ bpftrace not available - install bpftrace package")
}
if !caps["root_access"] {
t.Fatal("❌ Root access not detected")
}
// Test 2: Create and execute a simple trace
fmt.Println("\n🔬 Testing actual eBPF trace execution...")
spec := TraceSpec{
ProbeType: "t", // tracepoint
Target: "syscalls:sys_enter_openat",
Format: "file access",
Arguments: []string{}, // Remove invalid arg2@user for tracepoints
Duration: 3, // 3 seconds
}
fmt.Printf("📝 Starting trace: %s for %d seconds\n", spec.Target, spec.Duration)
traceID, err := manager.StartTrace(spec)
if err != nil {
t.Fatalf("❌ Failed to start trace: %v", err)
}
fmt.Printf("🚀 Trace started with ID: %s\n", traceID)
// Generate some file access to capture
go func() {
time.Sleep(1 * time.Second)
// Create some file operations to trace
for i := 0; i < 3; i++ {
testFile := fmt.Sprintf("/tmp/bcc_test_%d.txt", i)
// This will trigger sys_openat syscalls
if file, err := os.Create(testFile); err == nil {
file.WriteString("BCC trace test")
file.Close()
os.Remove(testFile)
}
time.Sleep(500 * time.Millisecond)
}
}()
// Wait for trace to complete
time.Sleep(time.Duration(spec.Duration+1) * time.Second)
// Get results
result, err := manager.GetTraceResult(traceID)
if err != nil {
// Try to stop the trace if it's still running
manager.StopTrace(traceID)
t.Fatalf("❌ Failed to get trace results: %v", err)
}
fmt.Printf("\n📊 Trace Results Summary:\n")
fmt.Printf(" • Trace ID: %s\n", result.TraceID)
fmt.Printf(" • Target: %s\n", result.Spec.Target)
fmt.Printf(" • Duration: %v\n", result.EndTime.Sub(result.StartTime))
fmt.Printf(" • Events captured: %d\n", result.EventCount)
fmt.Printf(" • Events per second: %.2f\n", result.Statistics.EventsPerSecond)
fmt.Printf(" • Summary: %s\n", result.Summary)
if len(result.Events) > 0 {
fmt.Printf("\n📝 Sample Events (first 3):\n")
for i, event := range result.Events {
if i >= 3 {
break
}
fmt.Printf(" %d. PID:%d TID:%d Process:%s Message:%s\n",
i+1, event.PID, event.TID, event.ProcessName, event.Message)
}
if len(result.Events) > 3 {
fmt.Printf(" ... and %d more events\n", len(result.Events)-3)
}
}
// Test 3: Validate the trace produced real data
if result.EventCount == 0 {
fmt.Println("⚠️ Warning: No events captured - this might be normal for a quiet system")
} else {
fmt.Printf("✅ Successfully captured %d real eBPF events!\n", result.EventCount)
}
fmt.Println("\n🧪 Testing comprehensive system tracing (Network, Disk, CPU, Memory, Userspace)...")
testSpecs := []TraceSpec{
// === SYSCALL TRACING ===
{
ProbeType: "p", // kprobe
Target: "__x64_sys_write",
Format: "write: fd=%d count=%d",
Arguments: []string{"arg1", "arg3"},
Duration: 2,
},
{
ProbeType: "p", // kprobe
Target: "__x64_sys_read",
Format: "read: fd=%d count=%d",
Arguments: []string{"arg1", "arg3"},
Duration: 2,
},
{
ProbeType: "p", // kprobe
Target: "__x64_sys_connect",
Format: "network connect: fd=%d",
Arguments: []string{"arg1"},
Duration: 2,
},
{
ProbeType: "p", // kprobe
Target: "__x64_sys_accept",
Format: "network accept: fd=%d",
Arguments: []string{"arg1"},
Duration: 2,
},
// === BLOCK I/O TRACING ===
{
ProbeType: "t", // tracepoint
Target: "block:block_io_start",
Format: "block I/O start",
Arguments: []string{},
Duration: 2,
},
{
ProbeType: "t", // tracepoint
Target: "block:block_io_done",
Format: "block I/O complete",
Arguments: []string{},
Duration: 2,
},
// === CPU SCHEDULER TRACING ===
{
ProbeType: "t", // tracepoint
Target: "sched:sched_migrate_task",
Format: "task migration",
Arguments: []string{},
Duration: 2,
},
{
ProbeType: "t", // tracepoint
Target: "sched:sched_pi_setprio",
Format: "priority change",
Arguments: []string{},
Duration: 2,
},
// === MEMORY MANAGEMENT ===
{
ProbeType: "t", // tracepoint
Target: "syscalls:sys_enter_brk",
Format: "memory allocation: brk",
Arguments: []string{},
Duration: 2,
},
// === KERNEL MEMORY TRACING ===
{
ProbeType: "t", // tracepoint
Target: "kmem:kfree",
Format: "kernel memory free",
Arguments: []string{},
Duration: 2,
},
}
for i, testSpec := range testSpecs {
category := "unknown"
if strings.Contains(testSpec.Target, "sys_write") || strings.Contains(testSpec.Target, "sys_read") {
category = "filesystem"
} else if strings.Contains(testSpec.Target, "sys_connect") || strings.Contains(testSpec.Target, "sys_accept") {
category = "network"
} else if strings.Contains(testSpec.Target, "block:") {
category = "disk I/O"
} else if strings.Contains(testSpec.Target, "sched:") {
category = "CPU/scheduler"
} else if strings.Contains(testSpec.Target, "sys_brk") || strings.Contains(testSpec.Target, "kmem:") {
category = "memory"
}
fmt.Printf("\n 🔍 Test %d: [%s] Tracing %s for %d seconds\n", i+1, category, testSpec.Target, testSpec.Duration)
testTraceID, err := manager.StartTrace(testSpec)
if err != nil {
fmt.Printf(" ❌ Failed to start: %v\n", err)
continue
}
// Generate activity specific to this trace type
go func(target, probeType string) {
time.Sleep(500 * time.Millisecond)
switch {
case strings.Contains(target, "sys_write") || strings.Contains(target, "sys_read"):
// Generate file I/O
for j := 0; j < 3; j++ {
testFile := fmt.Sprintf("/tmp/io_test_%d.txt", j)
if file, err := os.Create(testFile); err == nil {
file.WriteString("BCC tracing test data for I/O operations")
file.Sync()
file.Close()
// Read the file back
if readFile, err := os.Open(testFile); err == nil {
buffer := make([]byte, 1024)
readFile.Read(buffer)
readFile.Close()
}
os.Remove(testFile)
}
time.Sleep(200 * time.Millisecond)
}
case strings.Contains(target, "block:"):
// Generate disk I/O to trigger block layer events
for j := 0; j < 3; j++ {
testFile := fmt.Sprintf("/tmp/block_test_%d.txt", j)
if file, err := os.Create(testFile); err == nil {
// Write substantial data to trigger block I/O
data := make([]byte, 1024*4) // 4KB
for k := range data {
data[k] = byte(k % 256)
}
file.Write(data)
file.Sync() // Force write to disk
file.Close()
}
os.Remove(testFile)
time.Sleep(300 * time.Millisecond)
}
case strings.Contains(target, "sched:"):
// Generate CPU activity to trigger scheduler events
go func() {
for j := 0; j < 100; j++ {
// Create short-lived goroutines to trigger scheduler activity
go func() {
time.Sleep(time.Millisecond * 1)
}()
time.Sleep(time.Millisecond * 10)
}
}()
case strings.Contains(target, "sys_brk") || strings.Contains(target, "kmem:"):
// Generate memory allocation activity
for j := 0; j < 5; j++ {
// Allocate and free memory to trigger memory management
data := make([]byte, 1024*1024) // 1MB
for k := range data {
data[k] = byte(k % 256)
}
data = nil // Allow GC
time.Sleep(200 * time.Millisecond)
}
case strings.Contains(target, "sys_connect") || strings.Contains(target, "sys_accept"):
// Network operations (these may not generate events in test environment)
fmt.Printf(" Note: Network syscalls may not trigger events without actual network activity\n")
default:
// Generic activity
for j := 0; j < 3; j++ {
testFile := fmt.Sprintf("/tmp/generic_test_%d.txt", j)
if file, err := os.Create(testFile); err == nil {
file.WriteString("Generic test activity")
file.Close()
}
os.Remove(testFile)
time.Sleep(300 * time.Millisecond)
}
}
}(testSpec.Target, testSpec.ProbeType)
// Wait for trace completion
time.Sleep(time.Duration(testSpec.Duration+1) * time.Second)
testResult, err := manager.GetTraceResult(testTraceID)
if err != nil {
manager.StopTrace(testTraceID)
fmt.Printf(" ⚠️ Result error: %v\n", err)
continue
}
fmt.Printf(" 📊 Results for %s:\n", testSpec.Target)
fmt.Printf(" • Total events: %d\n", testResult.EventCount)
fmt.Printf(" • Events/sec: %.2f\n", testResult.Statistics.EventsPerSecond)
fmt.Printf(" • Duration: %v\n", testResult.EndTime.Sub(testResult.StartTime))
// Show process breakdown
if len(testResult.Statistics.TopProcesses) > 0 {
fmt.Printf(" • Top processes:\n")
for j, proc := range testResult.Statistics.TopProcesses {
if j >= 3 { // Show top 3
break
}
fmt.Printf(" - %s: %d events (%.1f%%)\n",
proc.ProcessName, proc.EventCount, proc.Percentage)
}
}
// Show sample events with PIDs, counts, etc.
if len(testResult.Events) > 0 {
fmt.Printf(" • Sample events:\n")
for j, event := range testResult.Events {
if j >= 5 { // Show first 5 events
break
}
fmt.Printf(" [%d] PID:%d TID:%d Process:%s Message:%s\n",
j+1, event.PID, event.TID, event.ProcessName, event.Message)
}
if len(testResult.Events) > 5 {
fmt.Printf(" ... and %d more events\n", len(testResult.Events)-5)
}
}
if testResult.EventCount > 0 {
fmt.Printf(" ✅ Success: Captured %d real syscall events!\n", testResult.EventCount)
} else {
fmt.Printf(" ⚠️ No events captured (may be normal for this syscall)\n")
}
}
fmt.Println("\n🎉 BCC Trace Manager Root Test Complete!")
fmt.Println("✅ Real eBPF tracing is working and ready for production use!")
}
// TestAgentEBPFIntegration tests the agent's integration with BCC-style eBPF tracing
// This demonstrates the complete flow from agent to eBPF results
func TestAgentEBPFIntegration(t *testing.T) {
if os.Geteuid() != 0 {
t.Skip("⚠️ Skipping agent integration test - requires root access") // t.Skip stops the test, so no return is needed
}
fmt.Println("\n=== Agent eBPF Integration Test ===")
fmt.Println("This test demonstrates the complete agent flow with BCC-style tracing")
// Create eBPF manager directly for testing
manager := NewBCCTraceManager()
// Test multiple syscalls that would be sent by remote API
testEBPFRequests := []struct {
Name string `json:"name"`
Type string `json:"type"`
Target string `json:"target"`
Duration int `json:"duration"`
Description string `json:"description"`
Filters map[string]string `json:"filters"`
}{
{
Name: "file_operations",
Type: "syscall",
Target: "sys_openat", // Will be converted to __x64_sys_openat
Duration: 3,
Description: "trace file open operations",
Filters: map[string]string{},
},
{
Name: "network_operations",
Type: "syscall",
Target: "__x64_sys_connect",
Duration: 2,
Description: "trace network connections",
Filters: map[string]string{},
},
{
Name: "io_operations",
Type: "syscall",
Target: "sys_write",
Duration: 2,
Description: "trace write operations",
Filters: map[string]string{},
},
}
fmt.Printf("🚀 Testing eBPF manager with %d eBPF programs...\n\n", len(testEBPFRequests))
// Convert to trace specs and execute using manager directly
var traceSpecs []TraceSpec
for _, req := range testEBPFRequests {
// Only add the x86-64 syscall prefix when the target does not already carry it,
// otherwise "__x64_sys_connect" would become "__x64___x64_sys_connect"
target := req.Target
if !strings.HasPrefix(target, "__x64_") {
target = "__x64_" + target
}
spec := TraceSpec{
ProbeType: "p", // kprobe
Target: target,
Format: req.Description,
Duration: req.Duration,
}
traceSpecs = append(traceSpecs, spec)
}
// Execute traces sequentially for testing
var results []map[string]interface{}
for i, spec := range traceSpecs {
fmt.Printf("Starting trace %d: %s\n", i+1, spec.Target)
traceID, err := manager.StartTrace(spec)
if err != nil {
fmt.Printf("Failed to start trace: %v\n", err)
continue
}
// Wait for trace duration
time.Sleep(time.Duration(spec.Duration) * time.Second)
traceResult, err := manager.GetTraceResult(traceID)
if err != nil {
fmt.Printf("Failed to get results: %v\n", err)
continue
}
// Include every field that the result printer and the
// ValidateAgentResponseFormat subtest expect to find
result := map[string]interface{}{
"name": testEBPFRequests[i].Name,
"type": testEBPFRequests[i].Type,
"target": spec.Target,
"duration": spec.Duration,
"description": testEBPFRequests[i].Description,
"status": "completed",
"success": true,
"event_count": traceResult.EventCount,
"summary": traceResult.Summary,
}
results = append(results, result)
}
fmt.Printf("📊 Agent eBPF Execution Results:\n")
fmt.Printf("%s\n\n", strings.Repeat("=", 51))
for i, result := range results {
fmt.Printf("🔍 Program %d: %s\n", i+1, result["name"])
fmt.Printf(" Target: %s\n", result["target"])
fmt.Printf(" Type: %s\n", result["type"])
fmt.Printf(" Status: %s\n", result["status"])
fmt.Printf(" Success: %v\n", result["success"])
if result["success"].(bool) {
if eventCount, ok := result["event_count"].(int); ok {
fmt.Printf(" Events captured: %d\n", eventCount)
}
if dataPoints, ok := result["data_points"].(int); ok {
fmt.Printf(" Data points: %d\n", dataPoints)
}
if summary, ok := result["summary"].(string); ok {
fmt.Printf(" Summary: %s\n", summary)
}
// Show events if available
if events, ok := result["events"].([]TraceEvent); ok && len(events) > 0 {
fmt.Printf(" Sample events:\n")
for j, event := range events {
if j >= 3 { // Show first 3
break
}
fmt.Printf(" [%d] PID:%d Process:%s Message:%s\n",
j+1, event.PID, event.ProcessName, event.Message)
}
if len(events) > 3 {
fmt.Printf(" ... and %d more events\n", len(events)-3)
}
}
// Show statistics if available
if stats, ok := result["statistics"].(TraceStats); ok {
fmt.Printf(" Statistics:\n")
fmt.Printf(" - Events/sec: %.2f\n", stats.EventsPerSecond)
fmt.Printf(" - Total processes: %d\n", len(stats.EventsByProcess))
if len(stats.TopProcesses) > 0 {
fmt.Printf(" - Top process: %s (%d events)\n",
stats.TopProcesses[0].ProcessName, stats.TopProcesses[0].EventCount)
}
}
} else {
if errMsg, ok := result["error"].(string); ok {
fmt.Printf(" Error: %s\n", errMsg)
}
}
fmt.Println()
}
// Validate expected agent response format
t.Run("ValidateAgentResponseFormat", func(t *testing.T) {
for i, result := range results {
// Check required fields
requiredFields := []string{"name", "type", "target", "duration", "description", "status", "success"}
for _, field := range requiredFields {
if _, exists := result[field]; !exists {
t.Errorf("Result %d missing required field: %s", i, field)
}
}
// If successful, check for data fields
if success, ok := result["success"].(bool); ok && success {
// Should have either event_count or data_points
hasEventCount := false
hasDataPoints := false
if _, ok := result["event_count"]; ok {
hasEventCount = true
}
if _, ok := result["data_points"]; ok {
hasDataPoints = true
}
if !hasEventCount && !hasDataPoints {
t.Errorf("Successful result %d should have event_count or data_points", i)
}
}
}
})
fmt.Println("✅ Agent eBPF Integration Test Complete!")
fmt.Println("📈 The agent correctly processes eBPF requests and returns detailed syscall data!")
}


@@ -1,4 +1,4 @@
package main
package executor
import (
"context"
@@ -6,6 +6,8 @@ import (
"os/exec"
"strings"
"time"
"nannyagentv2/internal/types"
)
// CommandExecutor handles safe execution of diagnostic commands
@@ -21,8 +23,8 @@ func NewCommandExecutor(timeout time.Duration) *CommandExecutor {
}
// Execute executes a command safely with timeout and validation
func (ce *CommandExecutor) Execute(cmd Command) CommandResult {
result := CommandResult{
func (ce *CommandExecutor) Execute(cmd types.Command) types.CommandResult {
result := types.CommandResult{
ID: cmd.ID,
Command: cmd.Command,
}

internal/logging/logger.go Normal file

@@ -0,0 +1,183 @@
package logging
import (
"fmt"
"log"
"log/syslog"
"os"
"strings"
)
// LogLevel defines the logging level
type LogLevel int
const (
LevelDebug LogLevel = iota
LevelInfo
LevelWarning
LevelError
)
func (l LogLevel) String() string {
switch l {
case LevelDebug:
return "DEBUG"
case LevelInfo:
return "INFO"
case LevelWarning:
return "WARN"
case LevelError:
return "ERROR"
default:
return "INFO"
}
}
// Logger provides structured logging with configurable levels
type Logger struct {
syslogWriter *syslog.Writer
level LogLevel
showEmoji bool
}
var defaultLogger *Logger
func init() {
defaultLogger = NewLogger()
}
// NewLogger creates a new logger with default configuration
func NewLogger() *Logger {
return NewLoggerWithLevel(getLogLevelFromEnv())
}
// NewLoggerWithLevel creates a logger with specified level
func NewLoggerWithLevel(level LogLevel) *Logger {
l := &Logger{
level: level,
showEmoji: os.Getenv("LOG_NO_EMOJI") != "true",
}
// Try to connect to syslog
if writer, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "nannyagentv2"); err == nil {
l.syslogWriter = writer
}
return l
}
// getLogLevelFromEnv parses log level from environment variable
func getLogLevelFromEnv() LogLevel {
level := strings.ToUpper(os.Getenv("LOG_LEVEL"))
switch level {
case "DEBUG":
return LevelDebug
case "INFO", "":
return LevelInfo
case "WARN", "WARNING":
return LevelWarning
case "ERROR":
return LevelError
default:
return LevelInfo
}
}
// logMessage handles the actual logging
func (l *Logger) logMessage(level LogLevel, format string, args ...interface{}) {
if level < l.level {
return
}
msg := fmt.Sprintf(format, args...)
prefix := fmt.Sprintf("[%s]", level.String())
// Add emoji prefix if enabled
if l.showEmoji {
switch level {
case LevelDebug:
prefix = "🔍 " + prefix
case LevelInfo:
prefix = " " + prefix
case LevelWarning:
prefix = "⚠️ " + prefix
case LevelError:
prefix = "❌ " + prefix
}
}
// Log to syslog if available
if l.syslogWriter != nil {
switch level {
case LevelDebug:
l.syslogWriter.Debug(msg)
case LevelInfo:
l.syslogWriter.Info(msg)
case LevelWarning:
l.syslogWriter.Warning(msg)
case LevelError:
l.syslogWriter.Err(msg)
}
}
log.Printf("%s %s", prefix, msg)
}
func (l *Logger) Debug(format string, args ...interface{}) {
l.logMessage(LevelDebug, format, args...)
}
func (l *Logger) Info(format string, args ...interface{}) {
l.logMessage(LevelInfo, format, args...)
}
func (l *Logger) Warning(format string, args ...interface{}) {
l.logMessage(LevelWarning, format, args...)
}
func (l *Logger) Error(format string, args ...interface{}) {
l.logMessage(LevelError, format, args...)
}
// SetLevel changes the logging level
func (l *Logger) SetLevel(level LogLevel) {
l.level = level
}
// GetLevel returns current logging level
func (l *Logger) GetLevel() LogLevel {
return l.level
}
func (l *Logger) Close() {
if l.syslogWriter != nil {
l.syslogWriter.Close()
}
}
// Global logging functions
func Debug(format string, args ...interface{}) {
defaultLogger.Debug(format, args...)
}
func Info(format string, args ...interface{}) {
defaultLogger.Info(format, args...)
}
func Warning(format string, args ...interface{}) {
defaultLogger.Warning(format, args...)
}
func Error(format string, args ...interface{}) {
defaultLogger.Error(format, args...)
}
// SetLevel sets the global logger level
func SetLevel(level LogLevel) {
defaultLogger.SetLevel(level)
}
// GetLevel gets the global logger level
func GetLevel() LogLevel {
return defaultLogger.GetLevel()
}


@@ -0,0 +1,318 @@
package metrics
import (
"bytes"
"crypto/sha256"
"encoding/json"
"fmt"
"io"
"math"
"net/http"
"strings"
"time"
"github.com/shirou/gopsutil/v3/cpu"
"github.com/shirou/gopsutil/v3/disk"
"github.com/shirou/gopsutil/v3/host"
"github.com/shirou/gopsutil/v3/load"
"github.com/shirou/gopsutil/v3/mem"
psnet "github.com/shirou/gopsutil/v3/net"
"nannyagentv2/internal/types"
)
// Collector handles system metrics collection
type Collector struct {
agentVersion string
}
// NewCollector creates a new metrics collector
func NewCollector(agentVersion string) *Collector {
return &Collector{
agentVersion: agentVersion,
}
}
// GatherSystemMetrics collects comprehensive system metrics
func (c *Collector) GatherSystemMetrics() (*types.SystemMetrics, error) {
metrics := &types.SystemMetrics{
Timestamp: time.Now(),
}
// System Information
if hostInfo, err := host.Info(); err == nil {
metrics.Hostname = hostInfo.Hostname
metrics.Platform = hostInfo.Platform
metrics.PlatformFamily = hostInfo.PlatformFamily
metrics.PlatformVersion = hostInfo.PlatformVersion
metrics.KernelVersion = hostInfo.KernelVersion
metrics.KernelArch = hostInfo.KernelArch
}
// CPU Metrics
if percentages, err := cpu.Percent(time.Second, false); err == nil && len(percentages) > 0 {
metrics.CPUUsage = math.Round(percentages[0]*100) / 100
}
if cpuInfo, err := cpu.Info(); err == nil && len(cpuInfo) > 0 {
metrics.CPUCores = len(cpuInfo)
metrics.CPUModel = cpuInfo[0].ModelName
}
// Memory Metrics
if memInfo, err := mem.VirtualMemory(); err == nil {
metrics.MemoryUsage = math.Round(float64(memInfo.Used)/(1024*1024)*100) / 100 // MB
metrics.MemoryTotal = memInfo.Total
metrics.MemoryUsed = memInfo.Used
metrics.MemoryFree = memInfo.Free
metrics.MemoryAvailable = memInfo.Available
}
if swapInfo, err := mem.SwapMemory(); err == nil {
metrics.SwapTotal = swapInfo.Total
metrics.SwapUsed = swapInfo.Used
metrics.SwapFree = swapInfo.Free
}
// Disk Metrics
if diskInfo, err := disk.Usage("/"); err == nil {
metrics.DiskUsage = math.Round(diskInfo.UsedPercent*100) / 100
metrics.DiskTotal = diskInfo.Total
metrics.DiskUsed = diskInfo.Used
metrics.DiskFree = diskInfo.Free
}
// Load Averages
if loadAvg, err := load.Avg(); err == nil {
metrics.LoadAvg1 = math.Round(loadAvg.Load1*100) / 100
metrics.LoadAvg5 = math.Round(loadAvg.Load5*100) / 100
metrics.LoadAvg15 = math.Round(loadAvg.Load15*100) / 100
}
// Process Count (simplified - using a constant for now)
// Note: gopsutil doesn't have host.Processes(), would need process.Processes()
metrics.ProcessCount = 0 // Placeholder
// Network Metrics
netIn, netOut := c.getNetworkStats()
metrics.NetworkInKbps = netIn
metrics.NetworkOutKbps = netOut
if netIOCounters, err := psnet.IOCounters(false); err == nil && len(netIOCounters) > 0 {
netIO := netIOCounters[0]
metrics.NetworkInBytes = netIO.BytesRecv
metrics.NetworkOutBytes = netIO.BytesSent
}
// IP Address and Location
metrics.IPAddress = c.getIPAddress()
metrics.Location = c.getLocation() // Placeholder
// Filesystem Information
metrics.FilesystemInfo = c.getFilesystemInfo()
// Block Devices
metrics.BlockDevices = c.getBlockDevices()
return metrics, nil
}
// getNetworkStats returns network input/output rates in Kbps
func (c *Collector) getNetworkStats() (float64, float64) {
netIOCounters, err := psnet.IOCounters(false)
if err != nil || len(netIOCounters) == 0 {
return 0.0, 0.0
}
// Use the first interface for aggregate stats
netIO := netIOCounters[0]
// NOTE: this converts cumulative byte counters to kilobits, not a true
// per-second rate; a real rate would need two snapshots taken over an interval
netInKbps := float64(netIO.BytesRecv) * 8 / 1024
netOutKbps := float64(netIO.BytesSent) * 8 / 1024
return netInKbps, netOutKbps
}
// getIPAddress returns the primary IP address of the system
func (c *Collector) getIPAddress() string {
interfaces, err := psnet.Interfaces()
if err != nil {
return "unknown"
}
for _, iface := range interfaces {
for _, addr := range iface.Addrs {
ip := strings.Split(addr.Addr, "/")[0] // strip CIDR suffix if present
if ip == "" || ip == "127.0.0.1" || ip == "::1" {
continue // skip loopback addresses (IPv4 and IPv6)
}
return ip
}
}
return "unknown"
}
// getLocation returns basic location information (placeholder)
func (c *Collector) getLocation() string {
return "unknown" // Would integrate with GeoIP service
}
// getFilesystemInfo returns information about mounted filesystems
func (c *Collector) getFilesystemInfo() []types.FilesystemInfo {
partitions, err := disk.Partitions(false)
if err != nil {
return []types.FilesystemInfo{}
}
var filesystems []types.FilesystemInfo
for _, partition := range partitions {
usage, err := disk.Usage(partition.Mountpoint)
if err != nil {
continue
}
fs := types.FilesystemInfo{
Mountpoint: partition.Mountpoint,
Fstype: partition.Fstype,
Total: usage.Total,
Used: usage.Used,
Free: usage.Free,
UsagePercent: math.Round(usage.UsedPercent*100) / 100,
}
filesystems = append(filesystems, fs)
}
return filesystems
}
// getBlockDevices returns information about block devices
func (c *Collector) getBlockDevices() []types.BlockDevice {
partitions, err := disk.Partitions(true)
if err != nil {
return []types.BlockDevice{}
}
var devices []types.BlockDevice
deviceMap := make(map[string]bool)
for _, partition := range partitions {
// Only include actual block devices
if strings.HasPrefix(partition.Device, "/dev/") {
deviceName := partition.Device
if !deviceMap[deviceName] {
deviceMap[deviceName] = true
device := types.BlockDevice{
Name: deviceName,
Model: "unknown",
Size: 0,
SerialNumber: "unknown",
}
devices = append(devices, device)
}
}
}
return devices
}
// SendMetrics sends system metrics to the agent-auth-api endpoint
func (c *Collector) SendMetrics(agentAuthURL, accessToken, agentID string, metrics *types.SystemMetrics) error {
// Create flattened metrics request for agent-auth-api
metricsReq := c.CreateMetricsRequest(agentID, metrics)
return c.sendMetricsRequest(agentAuthURL, accessToken, metricsReq)
}
// CreateMetricsRequest converts SystemMetrics to the flattened format expected by agent-auth-api
func (c *Collector) CreateMetricsRequest(agentID string, systemMetrics *types.SystemMetrics) *types.MetricsRequest {
return &types.MetricsRequest{
AgentID: agentID,
CPUUsage: systemMetrics.CPUUsage,
MemoryUsage: systemMetrics.MemoryUsage,
DiskUsage: systemMetrics.DiskUsage,
NetworkInKbps: systemMetrics.NetworkInKbps,
NetworkOutKbps: systemMetrics.NetworkOutKbps,
IPAddress: systemMetrics.IPAddress,
Location: systemMetrics.Location,
AgentVersion: c.agentVersion,
KernelVersion: systemMetrics.KernelVersion,
DeviceFingerprint: c.generateDeviceFingerprint(systemMetrics),
LoadAverages: map[string]float64{
"load1": systemMetrics.LoadAvg1,
"load5": systemMetrics.LoadAvg5,
"load15": systemMetrics.LoadAvg15,
},
OSInfo: map[string]string{
"cpu_cores": fmt.Sprintf("%d", systemMetrics.CPUCores),
"memory": fmt.Sprintf("%.1fGi", float64(systemMetrics.MemoryTotal)/(1024*1024*1024)),
"uptime": "unknown", // Will be calculated by the server or client
"platform": systemMetrics.Platform,
"platform_family": systemMetrics.PlatformFamily,
"platform_version": systemMetrics.PlatformVersion,
"kernel_version": systemMetrics.KernelVersion,
"kernel_arch": systemMetrics.KernelArch,
},
FilesystemInfo: systemMetrics.FilesystemInfo,
BlockDevices: systemMetrics.BlockDevices,
NetworkStats: map[string]uint64{
"bytes_sent": systemMetrics.NetworkOutBytes,
"bytes_recv": systemMetrics.NetworkInBytes,
"total_bytes": systemMetrics.NetworkInBytes + systemMetrics.NetworkOutBytes,
},
}
}
// sendMetricsRequest sends the metrics request to the agent-auth-api
func (c *Collector) sendMetricsRequest(agentAuthURL, accessToken string, metricsReq *types.MetricsRequest) error {
// Wrap metrics in the expected payload structure
payload := map[string]interface{}{
"metrics": metricsReq,
"timestamp": time.Now().UTC().Format(time.RFC3339),
}
jsonData, err := json.Marshal(payload)
if err != nil {
return fmt.Errorf("failed to marshal metrics: %w", err)
}
// Send to /metrics endpoint
metricsURL := fmt.Sprintf("%s/metrics", agentAuthURL)
req, err := http.NewRequest("POST", metricsURL, bytes.NewBuffer(jsonData))
if err != nil {
return fmt.Errorf("failed to create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", accessToken))
client := &http.Client{Timeout: 30 * time.Second}
resp, err := client.Do(req)
if err != nil {
return fmt.Errorf("failed to send metrics: %w", err)
}
defer resp.Body.Close()
// Read response
body, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("failed to read response: %w", err)
}
// Check response status
if resp.StatusCode == http.StatusUnauthorized {
return fmt.Errorf("unauthorized")
}
if resp.StatusCode != http.StatusOK {
return fmt.Errorf("metrics request failed with status %d: %s", resp.StatusCode, string(body))
}
return nil
}
// generateDeviceFingerprint creates a unique device identifier
func (c *Collector) generateDeviceFingerprint(metrics *types.SystemMetrics) string {
fingerprint := fmt.Sprintf("%s-%s-%s", metrics.Hostname, metrics.Platform, metrics.KernelVersion)
hasher := sha256.New()
hasher.Write([]byte(fingerprint))
return fmt.Sprintf("%x", hasher.Sum(nil))[:16]
}


@@ -0,0 +1,529 @@
package server
import (
"encoding/json"
"fmt"
"net/http"
"os"
"strings"
"time"
"nannyagentv2/internal/auth"
"nannyagentv2/internal/logging"
"nannyagentv2/internal/metrics"
"nannyagentv2/internal/types"
"github.com/sashabaranov/go-openai"
)
// InvestigationRequest represents a request from Supabase to start an investigation
type InvestigationRequest struct {
InvestigationID string `json:"investigation_id"`
ApplicationGroup string `json:"application_group"`
Issue string `json:"issue"`
Context map[string]string `json:"context"`
Priority string `json:"priority"`
InitiatedBy string `json:"initiated_by"`
}
// InvestigationResponse represents the agent's response to an investigation
type InvestigationResponse struct {
AgentID string `json:"agent_id"`
InvestigationID string `json:"investigation_id"`
Status string `json:"status"`
Commands []types.CommandResult `json:"commands,omitempty"`
AIResponse string `json:"ai_response,omitempty"`
EpisodeID string `json:"episode_id,omitempty"`
Timestamp time.Time `json:"timestamp"`
Error string `json:"error,omitempty"`
}
// InvestigationServer handles reverse investigation requests from Supabase
type InvestigationServer struct {
agent types.DiagnosticAgent // Original agent for direct user interactions
applicationAgent types.DiagnosticAgent // Separate agent for application-initiated investigations
port string
agentID string
metricsCollector *metrics.Collector
authManager *auth.AuthManager
startTime time.Time
supabaseURL string
}
// NewInvestigationServer creates a new investigation server
func NewInvestigationServer(agent types.DiagnosticAgent, authManager *auth.AuthManager) *InvestigationServer {
port := os.Getenv("AGENT_PORT")
if port == "" {
port = "1234"
}
// Get agent ID from authentication system
var agentID string
if authManager != nil {
if id, err := authManager.GetCurrentAgentID(); err == nil {
agentID = id
} else {
logging.Error("Failed to get agent ID from auth manager: %v", err)
}
}
// Fallback to environment variable or generate one if auth fails
if agentID == "" {
agentID = os.Getenv("AGENT_ID")
if agentID == "" {
agentID = fmt.Sprintf("agent-%d", time.Now().Unix())
}
}
// Create metrics collector
metricsCollector := metrics.NewCollector("v2.0.0")
// TODO: Fix application agent creation - use main agent for now
// Create a separate agent for application-initiated investigations
// applicationAgent := NewLinuxDiagnosticAgent()
// Override the model to use the application-specific function
// applicationAgent.model = "tensorzero::function_name::diagnose_and_heal_application"
return &InvestigationServer{
agent: agent,
applicationAgent: agent, // Use same agent for now
port: port,
agentID: agentID,
metricsCollector: metricsCollector,
authManager: authManager,
startTime: time.Now(),
supabaseURL: os.Getenv("SUPABASE_PROJECT_URL"),
}
}
// DiagnoseIssueForApplication handles diagnostic requests initiated from application/portal
func (s *InvestigationServer) DiagnoseIssueForApplication(issue, episodeID string) error {
// Set the episode ID on the application agent for continuity
// TODO: Fix episode ID handling with interface
// s.applicationAgent.episodeID = episodeID
return s.applicationAgent.DiagnoseIssue(issue)
}
// Start starts the HTTP server and realtime polling for investigation requests
func (s *InvestigationServer) Start() error {
mux := http.NewServeMux()
// Health check endpoint
mux.HandleFunc("/health", s.handleHealth)
// Investigation endpoint
mux.HandleFunc("/investigate", s.handleInvestigation)
// Agent status endpoint
mux.HandleFunc("/status", s.handleStatus)
// Start realtime polling for backend-initiated investigations
if s.supabaseURL != "" && s.authManager != nil {
go s.startRealtimePolling()
logging.Info("Realtime investigation polling enabled")
} else {
logging.Warning("Realtime investigation polling disabled (missing Supabase config or auth)")
}
server := &http.Server{
Addr: ":" + s.port,
Handler: mux,
ReadTimeout: 30 * time.Second,
WriteTimeout: 30 * time.Second,
}
logging.Info("Investigation server started on port %s (Agent ID: %s)", s.port, s.agentID)
return server.ListenAndServe()
}
// handleHealth responds to health check requests
func (s *InvestigationServer) handleHealth(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
response := map[string]interface{}{
"status": "healthy",
"agent_id": s.agentID,
"timestamp": time.Now(),
"version": "v2.0.0",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
// handleStatus responds with agent status and capabilities
func (s *InvestigationServer) handleStatus(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
// Collect current system metrics
systemMetrics, err := s.metricsCollector.GatherSystemMetrics()
if err != nil {
http.Error(w, fmt.Sprintf("Failed to collect metrics: %v", err), http.StatusInternalServerError)
return
}
// Convert to metrics request format for consistent data structure
metricsReq := s.metricsCollector.CreateMetricsRequest(s.agentID, systemMetrics)
// Guard against an empty filesystem list to avoid an index-out-of-range panic
diskUsage := "unknown"
if len(metricsReq.FilesystemInfo) > 0 {
rootFS := metricsReq.FilesystemInfo[0]
diskUsage = fmt.Sprintf("Root: %.0fG/%.0fG (%.0f%% used)",
float64(rootFS.Used)/1024/1024/1024,
float64(rootFS.Total)/1024/1024/1024,
metricsReq.DiskUsage)
}
response := map[string]interface{}{
"agent_id": s.agentID,
"status": "ready",
"capabilities": []string{"system_diagnostics", "ebpf_monitoring", "command_execution", "ai_analysis"},
"system_info": map[string]interface{}{
"os": fmt.Sprintf("%s %s", metricsReq.OSInfo["platform"], metricsReq.OSInfo["platform_version"]),
"kernel": metricsReq.KernelVersion,
"architecture": metricsReq.OSInfo["kernel_arch"],
"cpu_cores": metricsReq.OSInfo["cpu_cores"],
"memory": metricsReq.MemoryUsage,
"private_ips": metricsReq.IPAddress,
"load_average": fmt.Sprintf("%.2f, %.2f, %.2f",
metricsReq.LoadAverages["load1"],
metricsReq.LoadAverages["load5"],
metricsReq.LoadAverages["load15"]),
"disk_usage": diskUsage,
},
"uptime": time.Since(s.startTime),
"last_contact": time.Now(),
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
}
// sendCommandResultsToTensorZero sends command results back to TensorZero and continues conversation
func (s *InvestigationServer) sendCommandResultsToTensorZero(diagnosticResp types.DiagnosticResponse, commandResults []types.CommandResult) (interface{}, error) {
// Build conversation history like in agent.go
messages := []openai.ChatCompletionMessage{
// Add the original diagnostic response as assistant message
{
Role: openai.ChatMessageRoleAssistant,
Content: fmt.Sprintf(`{"response_type":"diagnostic","reasoning":"%s","commands":%s}`,
diagnosticResp.Reasoning,
mustMarshalJSON(diagnosticResp.Commands)),
},
}
// Add command results as user message (same as agent.go does)
resultsJSON, err := json.MarshalIndent(commandResults, "", " ")
if err != nil {
return nil, fmt.Errorf("failed to marshal command results: %w", err)
}
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleUser,
Content: string(resultsJSON),
})
// Send to TensorZero via application agent's sendRequest method
logging.Debug("Sending command results to TensorZero for analysis")
response, err := s.applicationAgent.SendRequest(messages)
if err != nil {
return nil, fmt.Errorf("failed to send request to TensorZero: %w", err)
}
if len(response.Choices) == 0 {
return nil, fmt.Errorf("no choices in TensorZero response")
}
content := response.Choices[0].Message.Content
logging.Debug("TensorZero continued analysis: %s", content)
// Try to parse the response to determine if it's diagnostic or resolution
var diagnosticNextResp types.DiagnosticResponse
var resolutionResp types.ResolutionResponse
// Check if it's another diagnostic response
if err := json.Unmarshal([]byte(content), &diagnosticNextResp); err == nil && diagnosticNextResp.ResponseType == "diagnostic" {
logging.Debug("TensorZero requests %d more commands", len(diagnosticNextResp.Commands))
return map[string]interface{}{
"type": "diagnostic",
"response": diagnosticNextResp,
"raw": content,
}, nil
}
// Check if it's a resolution response
if err := json.Unmarshal([]byte(content), &resolutionResp); err == nil && resolutionResp.ResponseType == "resolution" {
return map[string]interface{}{
"type": "resolution",
"response": resolutionResp,
"raw": content,
}, nil
}
// Return raw response if we can't parse it
return map[string]interface{}{
"type": "unknown",
"raw": content,
}, nil
}
// Helper function to marshal JSON without errors
func mustMarshalJSON(v interface{}) string {
data, _ := json.Marshal(v)
return string(data)
}
// processInvestigation handles the actual investigation using TensorZero
// This endpoint receives either:
// 1. DiagnosticResponse - Commands and eBPF programs to execute
// 2. ResolutionResponse - Final resolution (no execution needed)
func (s *InvestigationServer) handleInvestigation(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
http.Error(w, "Method not allowed - only POST accepted", http.StatusMethodNotAllowed)
return
}
// Parse the request body to determine what type of response this is
var requestBody map[string]interface{}
if err := json.NewDecoder(r.Body).Decode(&requestBody); err != nil {
http.Error(w, fmt.Sprintf("Invalid JSON: %v", err), http.StatusBadRequest)
return
}
// Check the response_type field to determine how to handle this
responseType, ok := requestBody["response_type"].(string)
if !ok {
http.Error(w, "Missing or invalid response_type field", http.StatusBadRequest)
return
}
logging.Debug("Received investigation payload with response_type: %s", responseType)
switch responseType {
case "diagnostic":
// This is a DiagnosticResponse with commands to execute
response := s.handleDiagnosticExecution(requestBody)
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(response)
case "resolution":
// This is a ResolutionResponse - final result, just acknowledge
fmt.Printf("📋 Received final resolution from backend\n")
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(map[string]interface{}{
"success": true,
"message": "Resolution received and acknowledged",
"agent_id": s.agentID,
})
default:
http.Error(w, fmt.Sprintf("Unknown response_type: %s", responseType), http.StatusBadRequest)
return
}
}
// handleDiagnosticExecution executes commands from a DiagnosticResponse
func (s *InvestigationServer) handleDiagnosticExecution(requestBody map[string]interface{}) map[string]interface{} {
// Parse as DiagnosticResponse
var diagnosticResp types.DiagnosticResponse
// Convert the map back to JSON and then parse it properly
jsonData, err := json.Marshal(requestBody)
if err != nil {
return map[string]interface{}{
"success": false,
"error": fmt.Sprintf("Failed to re-marshal request: %v", err),
"agent_id": s.agentID,
}
}
if err := json.Unmarshal(jsonData, &diagnosticResp); err != nil {
return map[string]interface{}{
"success": false,
"error": fmt.Sprintf("Failed to parse DiagnosticResponse: %v", err),
"agent_id": s.agentID,
}
}
fmt.Printf("📋 Executing %d commands from backend\n", len(diagnosticResp.Commands))
// Execute all commands
commandResults := make([]types.CommandResult, 0, len(diagnosticResp.Commands))
for _, cmd := range diagnosticResp.Commands {
fmt.Printf("⚙️ Executing command '%s': %s\n", cmd.ID, cmd.Command)
// Use the agent's executor to run the command
result := s.agent.ExecuteCommand(cmd)
commandResults = append(commandResults, result)
if result.Error != "" {
fmt.Printf("⚠️ Command '%s' had error: %s\n", cmd.ID, result.Error)
}
}
// Send command results back to TensorZero for continued analysis
fmt.Printf("🔄 Sending %d command results back to TensorZero for continued analysis\n", len(commandResults))
nextResponse, err := s.sendCommandResultsToTensorZero(diagnosticResp, commandResults)
if err != nil {
return map[string]interface{}{
"success": false,
"error": fmt.Sprintf("Failed to continue TensorZero conversation: %v", err),
"agent_id": s.agentID,
"command_results": commandResults, // Still return the results
}
}
// Return both the command results and the next response from TensorZero
return map[string]interface{}{
"success": true,
"agent_id": s.agentID,
"command_results": commandResults,
"commands_executed": len(commandResults),
"next_response": nextResponse,
"timestamp": time.Now().Format(time.RFC3339),
}
}
// PendingInvestigation represents a pending investigation from the database
type PendingInvestigation struct {
ID string `json:"id"`
InvestigationID string `json:"investigation_id"`
AgentID string `json:"agent_id"`
DiagnosticPayload map[string]interface{} `json:"diagnostic_payload"`
EpisodeID *string `json:"episode_id"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
}
// startRealtimePolling begins polling for pending investigations
func (s *InvestigationServer) startRealtimePolling() {
fmt.Printf("🔄 Starting realtime investigation polling for agent %s\n", s.agentID)
ticker := time.NewTicker(5 * time.Second) // Poll every 5 seconds
defer ticker.Stop()
for range ticker.C {
s.checkForPendingInvestigations()
}
}
// checkForPendingInvestigations checks for new pending investigations
func (s *InvestigationServer) checkForPendingInvestigations() {
url := fmt.Sprintf("%s/rest/v1/pending_investigations?agent_id=eq.%s&status=eq.pending&order=created_at.desc",
s.supabaseURL, s.agentID)
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return // Silent fail for polling
}
// Get token from auth manager
authToken, err := s.authManager.LoadToken()
if err != nil {
return // Silent fail for polling
}
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", authToken.AccessToken))
req.Header.Set("Accept", "application/json")
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
return // Silent fail for polling
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
return // Silent fail for polling
}
var investigations []PendingInvestigation
err = json.NewDecoder(resp.Body).Decode(&investigations)
if err != nil {
return // Silent fail for polling
}
for _, investigation := range investigations {
fmt.Printf("🔍 Found pending investigation: %s\n", investigation.ID)
go s.handlePendingInvestigation(investigation)
}
}
// handlePendingInvestigation processes a single pending investigation
func (s *InvestigationServer) handlePendingInvestigation(investigation PendingInvestigation) {
fmt.Printf("🚀 Processing realtime investigation %s\n", investigation.InvestigationID)
// Mark as executing
err := s.updateInvestigationStatus(investigation.ID, "executing", nil, nil)
if err != nil {
fmt.Printf("❌ Failed to mark investigation as executing: %v\n", err)
return
}
// Execute diagnostic commands using existing handleDiagnosticExecution method
results := s.handleDiagnosticExecution(investigation.DiagnosticPayload)
// Mark as completed with results
err = s.updateInvestigationStatus(investigation.ID, "completed", results, nil)
if err != nil {
fmt.Printf("❌ Failed to mark investigation as completed: %v\n", err)
return
}
}
// updateInvestigationStatus updates the status of a pending investigation
func (s *InvestigationServer) updateInvestigationStatus(id, status string, results map[string]interface{}, errorMsg *string) error {
updateData := map[string]interface{}{
"status": status,
}
if status == "executing" {
updateData["started_at"] = time.Now().UTC().Format(time.RFC3339)
} else if status == "completed" {
updateData["completed_at"] = time.Now().UTC().Format(time.RFC3339)
if results != nil {
updateData["command_results"] = results
}
} else if status == "failed" && errorMsg != nil {
updateData["error_message"] = *errorMsg
updateData["completed_at"] = time.Now().UTC().Format(time.RFC3339)
}
jsonData, err := json.Marshal(updateData)
if err != nil {
return fmt.Errorf("failed to marshal update data: %v", err)
}
url := fmt.Sprintf("%s/rest/v1/pending_investigations?id=eq.%s", s.supabaseURL, id)
req, err := http.NewRequest("PATCH", url, strings.NewReader(string(jsonData)))
if err != nil {
return fmt.Errorf("failed to create request: %v", err)
}
// Get token from auth manager
authToken, err := s.authManager.LoadToken()
if err != nil {
return fmt.Errorf("failed to load auth token: %v", err)
}
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", authToken.AccessToken))
req.Header.Set("Content-Type", "application/json")
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
return fmt.Errorf("failed to update investigation: %v", err)
}
defer resp.Body.Close()
if resp.StatusCode != 200 && resp.StatusCode != 204 {
return fmt.Errorf("supabase update error: %d", resp.StatusCode)
}
return nil
}
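The status-transition logic in `updateInvestigationStatus` can be exercised in isolation. The sketch below extracts the payload assembly into a hypothetical helper `buildStatusUpdate` (not a function in this repo) that mirrors the same three transitions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// buildStatusUpdate mirrors the payload assembly in updateInvestigationStatus:
// "executing" stamps started_at, "completed" stamps completed_at and attaches
// results, "failed" stamps completed_at and records the error message.
func buildStatusUpdate(status string, results map[string]interface{}, errorMsg *string) map[string]interface{} {
	update := map[string]interface{}{"status": status}
	now := time.Now().UTC().Format(time.RFC3339)
	if status == "executing" {
		update["started_at"] = now
	} else if status == "completed" {
		update["completed_at"] = now
		if results != nil {
			update["command_results"] = results
		}
	} else if status == "failed" && errorMsg != nil {
		update["error_message"] = *errorMsg
		update["completed_at"] = now
	}
	return update
}

func main() {
	msg := "command timed out"
	body, _ := json.Marshal(buildStatusUpdate("failed", nil, &msg))
	fmt.Println(string(body)) // keys are sorted by json.Marshal; the timestamp varies
}
```

The resulting JSON is what the agent sends in the `PATCH` to `/rest/v1/pending_investigations?id=eq.<id>`.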

@@ -1,4 +1,4 @@
-package main
+package system
 import (
 "fmt"
@@ -6,6 +6,9 @@ import (
 "runtime"
 "strings"
 "time"
+"nannyagentv2/internal/executor"
+"nannyagentv2/internal/types"
 )
 // SystemInfo represents basic system information
@@ -25,42 +28,42 @@ type SystemInfo struct {
 // GatherSystemInfo collects basic system information
 func GatherSystemInfo() *SystemInfo {
 info := &SystemInfo{}
-executor := NewCommandExecutor(5 * time.Second)
+executor := executor.NewCommandExecutor(5 * time.Second)
 // Basic system info
-if result := executor.Execute(Command{ID: "hostname", Command: "hostname"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "hostname", Command: "hostname"}); result.ExitCode == 0 {
 info.Hostname = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "os", Command: "lsb_release -d 2>/dev/null | cut -f2 || cat /etc/os-release | grep PRETTY_NAME | cut -d'=' -f2 | tr -d '\"'"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "os", Command: "lsb_release -d 2>/dev/null | cut -f2 || cat /etc/os-release | grep PRETTY_NAME | cut -d'=' -f2 | tr -d '\"'"}); result.ExitCode == 0 {
 info.OS = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "kernel", Command: "uname -r"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "kernel", Command: "uname -r"}); result.ExitCode == 0 {
 info.Kernel = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "arch", Command: "uname -m"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "arch", Command: "uname -m"}); result.ExitCode == 0 {
 info.Architecture = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "cores", Command: "nproc"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "cores", Command: "nproc"}); result.ExitCode == 0 {
 info.CPUCores = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "memory", Command: "free -h | grep Mem | awk '{print $2}'"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "memory", Command: "free -h | grep Mem | awk '{print $2}'"}); result.ExitCode == 0 {
 info.Memory = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "uptime", Command: "uptime -p"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "uptime", Command: "uptime -p"}); result.ExitCode == 0 {
 info.Uptime = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "load", Command: "uptime | awk -F'load average:' '{print $2}' | xargs"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "load", Command: "uptime | awk -F'load average:' '{print $2}' | xargs"}); result.ExitCode == 0 {
 info.LoadAverage = strings.TrimSpace(result.Output)
 }
-if result := executor.Execute(Command{ID: "disk", Command: "df -h / | tail -1 | awk '{print \"Root: \" $3 \"/\" $2 \" (\" $5 \" used)\"}'"}); result.ExitCode == 0 {
+if result := executor.Execute(types.Command{ID: "disk", Command: "df -h / | tail -1 | awk '{print \"Root: \" $3 \"/\" $2 \" (\" $5 \" used)\"}'"}); result.ExitCode == 0 {
 info.DiskUsage = strings.TrimSpace(result.Output)
 }
@@ -152,50 +155,3 @@ ISSUE DESCRIPTION:`,
 info.PrivateIPs,
 runtime.Version())
 }
-// FormatSystemInfoWithEBPFForPrompt formats system information including eBPF capabilities
-func FormatSystemInfoWithEBPFForPrompt(info *SystemInfo, ebpfManager EBPFManagerInterface) string {
-baseInfo := FormatSystemInfoForPrompt(info)
-if ebpfManager == nil {
-return baseInfo + "\neBPF CAPABILITIES: Not available\n"
-}
-capabilities := ebpfManager.GetCapabilities()
-summary := ebpfManager.GetSummary()
-ebpfInfo := fmt.Sprintf(`
-eBPF MONITORING CAPABILITIES:
-- System Call Tracing: %v
-- Network Activity Tracing: %v
-- Process Monitoring: %v
-- File System Monitoring: %v
-- Performance Monitoring: %v
-- Security Event Monitoring: %v
-eBPF INTEGRATION GUIDE:
-To request eBPF monitoring during diagnosis, include these fields in your JSON response:
-{
-"response_type": "diagnostic",
-"reasoning": "explanation of why eBPF monitoring is needed",
-"commands": [regular diagnostic commands],
-"ebpf_capabilities": ["syscall_trace", "network_trace", "process_trace"],
-"ebpf_duration_seconds": 15,
-"ebpf_filters": {"pid": "process_id", "comm": "process_name", "path": "/specific/path"}
-}
-Available eBPF capabilities: %v
-eBPF Status: %v
-`,
-capabilities["tracepoint"],
-capabilities["kprobe"],
-capabilities["kernel_support"],
-capabilities["tracepoint"],
-capabilities["kernel_support"],
-capabilities["bpftrace_available"],
-capabilities,
-summary)
-return baseInfo + ebpfInfo
-}
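After the refactor above, callers qualify the command type as `types.Command` and obtain the executor from `internal/executor`. A minimal standalone sketch of that call shape (the `Command`, `CommandResult`, and `Execute` definitions here are simplified stand-ins for the real packages):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Simplified stand-in for nannyagentv2/internal/types.Command.
type Command struct {
	ID      string
	Command string
}

// Simplified stand-in for nannyagentv2/internal/types.CommandResult.
type CommandResult struct {
	ID       string
	Output   string
	ExitCode int
}

// Execute runs the command through a shell, as the pipelines in
// GatherSystemInfo (e.g. "free -h | grep Mem") require.
func Execute(cmd Command) CommandResult {
	out, err := exec.Command("sh", "-c", cmd.Command).CombinedOutput()
	code := 0
	if ee, ok := err.(*exec.ExitError); ok {
		code = ee.ExitCode()
	} else if err != nil {
		code = -1
	}
	return CommandResult{ID: cmd.ID, Output: strings.TrimSpace(string(out)), ExitCode: code}
}

func main() {
	r := Execute(Command{ID: "kernel", Command: "uname -r"})
	fmt.Printf("%s: %s (exit %d)\n", r.ID, r.Output, r.ExitCode)
}
```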

internal/types/types.go (new file, 290 lines)

@@ -0,0 +1,290 @@
package types
import (
"time"
"nannyagentv2/internal/ebpf"
"github.com/sashabaranov/go-openai"
)
// SystemMetrics represents comprehensive system performance metrics
type SystemMetrics struct {
// System Information
Hostname string `json:"hostname"`
Platform string `json:"platform"`
PlatformFamily string `json:"platform_family"`
PlatformVersion string `json:"platform_version"`
KernelVersion string `json:"kernel_version"`
KernelArch string `json:"kernel_arch"`
// CPU Metrics
CPUUsage float64 `json:"cpu_usage"`
CPUCores int `json:"cpu_cores"`
CPUModel string `json:"cpu_model"`
// Memory Metrics
MemoryUsage float64 `json:"memory_usage"`
MemoryTotal uint64 `json:"memory_total"`
MemoryUsed uint64 `json:"memory_used"`
MemoryFree uint64 `json:"memory_free"`
MemoryAvailable uint64 `json:"memory_available"`
SwapTotal uint64 `json:"swap_total"`
SwapUsed uint64 `json:"swap_used"`
SwapFree uint64 `json:"swap_free"`
// Disk Metrics
DiskUsage float64 `json:"disk_usage"`
DiskTotal uint64 `json:"disk_total"`
DiskUsed uint64 `json:"disk_used"`
DiskFree uint64 `json:"disk_free"`
// Network Metrics
NetworkInKbps float64 `json:"network_in_kbps"`
NetworkOutKbps float64 `json:"network_out_kbps"`
NetworkInBytes uint64 `json:"network_in_bytes"`
NetworkOutBytes uint64 `json:"network_out_bytes"`
// System Load
LoadAvg1 float64 `json:"load_avg_1"`
LoadAvg5 float64 `json:"load_avg_5"`
LoadAvg15 float64 `json:"load_avg_15"`
// Process Information
ProcessCount int `json:"process_count"`
// Network Information
IPAddress string `json:"ip_address"`
Location string `json:"location"`
// Filesystem Information
FilesystemInfo []FilesystemInfo `json:"filesystem_info"`
BlockDevices []BlockDevice `json:"block_devices"`
// Timestamp
Timestamp time.Time `json:"timestamp"`
}
// FilesystemInfo represents filesystem information
type FilesystemInfo struct {
Device string `json:"device"`
Mountpoint string `json:"mountpoint"`
Type string `json:"type"`
Fstype string `json:"fstype"`
Total uint64 `json:"total"`
Used uint64 `json:"used"`
Free uint64 `json:"free"`
Usage float64 `json:"usage"`
UsagePercent float64 `json:"usage_percent"`
}
// BlockDevice represents a block device
type BlockDevice struct {
Name string `json:"name"`
Size uint64 `json:"size"`
Type string `json:"type"`
Model string `json:"model,omitempty"`
SerialNumber string `json:"serial_number"`
}
// NetworkStats represents network interface statistics
type NetworkStats struct {
Interface string `json:"interface"`
BytesRecv uint64 `json:"bytes_recv"`
BytesSent uint64 `json:"bytes_sent"`
PacketsRecv uint64 `json:"packets_recv"`
PacketsSent uint64 `json:"packets_sent"`
ErrorsIn uint64 `json:"errors_in"`
ErrorsOut uint64 `json:"errors_out"`
DropsIn uint64 `json:"drops_in"`
DropsOut uint64 `json:"drops_out"`
}
// AuthToken represents an authentication token
type AuthToken struct {
AccessToken string `json:"access_token"`
RefreshToken string `json:"refresh_token"`
TokenType string `json:"token_type"`
ExpiresAt time.Time `json:"expires_at"`
AgentID string `json:"agent_id"`
}
// DeviceAuthRequest represents the device authorization request
type DeviceAuthRequest struct {
ClientID string `json:"client_id"`
Scope string `json:"scope,omitempty"`
}
// DeviceAuthResponse represents the device authorization response
type DeviceAuthResponse struct {
DeviceCode string `json:"device_code"`
UserCode string `json:"user_code"`
VerificationURI string `json:"verification_uri"`
ExpiresIn int `json:"expires_in"`
Interval int `json:"interval"`
}
// TokenRequest represents the token request for device flow
type TokenRequest struct {
GrantType string `json:"grant_type"`
DeviceCode string `json:"device_code,omitempty"`
RefreshToken string `json:"refresh_token,omitempty"`
ClientID string `json:"client_id,omitempty"`
}
// TokenResponse represents the token response
type TokenResponse struct {
AccessToken string `json:"access_token"`
RefreshToken string `json:"refresh_token"`
TokenType string `json:"token_type"`
ExpiresIn int `json:"expires_in"`
AgentID string `json:"agent_id,omitempty"`
Error string `json:"error,omitempty"`
ErrorDescription string `json:"error_description,omitempty"`
}
// HeartbeatRequest represents the agent heartbeat request
type HeartbeatRequest struct {
AgentID string `json:"agent_id"`
Status string `json:"status"`
Metrics SystemMetrics `json:"metrics"`
}
// MetricsRequest represents the flattened metrics payload expected by agent-auth-api
type MetricsRequest struct {
// Agent identification
AgentID string `json:"agent_id"`
// Basic metrics
CPUUsage float64 `json:"cpu_usage"`
MemoryUsage float64 `json:"memory_usage"`
DiskUsage float64 `json:"disk_usage"`
// Network metrics
NetworkInKbps float64 `json:"network_in_kbps"`
NetworkOutKbps float64 `json:"network_out_kbps"`
// System information
IPAddress string `json:"ip_address"`
Location string `json:"location"`
AgentVersion string `json:"agent_version"`
KernelVersion string `json:"kernel_version"`
DeviceFingerprint string `json:"device_fingerprint"`
// Structured data (JSON fields in database)
LoadAverages map[string]float64 `json:"load_averages"`
OSInfo map[string]string `json:"os_info"`
FilesystemInfo []FilesystemInfo `json:"filesystem_info"`
BlockDevices []BlockDevice `json:"block_devices"`
NetworkStats map[string]uint64 `json:"network_stats"`
}
// Agent types for TensorZero integration
type DiagnosticResponse struct {
ResponseType string `json:"response_type"`
Reasoning string `json:"reasoning"`
Commands []Command `json:"commands"`
}
// ResolutionResponse represents a resolution response
type ResolutionResponse struct {
ResponseType string `json:"response_type"`
RootCause string `json:"root_cause"`
ResolutionPlan string `json:"resolution_plan"`
Confidence string `json:"confidence"`
}
// Command represents a command to execute
type Command struct {
ID string `json:"id"`
Command string `json:"command"`
Description string `json:"description"`
}
// CommandResult represents the result of an executed command
type CommandResult struct {
ID string `json:"id"`
Command string `json:"command"`
Description string `json:"description"`
Output string `json:"output"`
ExitCode int `json:"exit_code"`
Error string `json:"error,omitempty"`
}
// EBPFRequest represents an eBPF trace request from external API
type EBPFRequest struct {
Name string `json:"name"`
Type string `json:"type"` // "tracepoint", "kprobe", "kretprobe"
Target string `json:"target"` // tracepoint path or function name
Duration int `json:"duration"` // seconds
Filters map[string]string `json:"filters,omitempty"`
Description string `json:"description"`
}
// EBPFEnhancedDiagnosticResponse represents enhanced diagnostic response with eBPF
type EBPFEnhancedDiagnosticResponse struct {
ResponseType string `json:"response_type"`
Reasoning string `json:"reasoning"`
Commands []string `json:"commands"` // Changed to []string to match current prompt format
EBPFPrograms []EBPFRequest `json:"ebpf_programs"`
NextActions []string `json:"next_actions,omitempty"`
}
// TensorZeroRequest represents a request to TensorZero
type TensorZeroRequest struct {
Model string `json:"model"`
Messages []map[string]interface{} `json:"messages"`
EpisodeID string `json:"tensorzero::episode_id,omitempty"`
}
// TensorZeroResponse represents a response from TensorZero
type TensorZeroResponse struct {
Choices []map[string]interface{} `json:"choices"`
EpisodeID string `json:"episode_id"`
}
// SystemInfo represents system information (for compatibility)
type SystemInfo struct {
Hostname string `json:"hostname"`
Platform string `json:"platform"`
PlatformInfo map[string]string `json:"platform_info"`
KernelVersion string `json:"kernel_version"`
Uptime string `json:"uptime"`
LoadAverage []float64 `json:"load_average"`
CPUInfo map[string]string `json:"cpu_info"`
MemoryInfo map[string]string `json:"memory_info"`
DiskInfo []map[string]string `json:"disk_info"`
}
// AgentConfig represents agent configuration
type AgentConfig struct {
TensorZeroAPIKey string `json:"tensorzero_api_key"`
APIURL string `json:"api_url"`
Timeout int `json:"timeout"`
Debug bool `json:"debug"`
MaxRetries int `json:"max_retries"`
BackoffFactor int `json:"backoff_factor"`
EpisodeID string `json:"episode_id,omitempty"`
}
// PendingInvestigation represents a pending investigation from the database
type PendingInvestigation struct {
ID string `json:"id"`
InvestigationID string `json:"investigation_id"`
AgentID string `json:"agent_id"`
DiagnosticPayload map[string]interface{} `json:"diagnostic_payload"`
EpisodeID *string `json:"episode_id"`
Status string `json:"status"`
CreatedAt time.Time `json:"created_at"`
}
// DiagnosticAgent interface for agent functionality needed by other packages
type DiagnosticAgent interface {
DiagnoseIssue(issue string) error
// Exported method names to match what websocket client calls
ConvertEBPFProgramsToTraceSpecs(ebpfRequests []EBPFRequest) []ebpf.TraceSpec
ExecuteEBPFTraces(traceSpecs []ebpf.TraceSpec) []map[string]interface{}
SendRequestWithEpisode(messages []openai.ChatCompletionMessage, episodeID string) (*openai.ChatCompletionResponse, error)
SendRequest(messages []openai.ChatCompletionMessage) (*openai.ChatCompletionResponse, error)
ExecuteCommand(cmd Command) CommandResult
}
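A quick way to sanity-check the wire format of `PendingInvestigation` is a JSON round-trip. The sketch below uses a trimmed copy of the struct above and a sample row of the shape the `pending_investigations` query returns (the field values are made up for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Trimmed copy of types.PendingInvestigation for a standalone check.
type PendingInvestigation struct {
	ID                string                 `json:"id"`
	InvestigationID   string                 `json:"investigation_id"`
	AgentID           string                 `json:"agent_id"`
	DiagnosticPayload map[string]interface{} `json:"diagnostic_payload"`
	EpisodeID         *string                `json:"episode_id"`
	Status            string                 `json:"status"`
	CreatedAt         time.Time              `json:"created_at"`
}

const sampleJSON = `{
  "id": "row-1",
  "investigation_id": "inv-42",
  "agent_id": "agent-7",
  "diagnostic_payload": {"commands": [{"id": "c1", "command": "uptime"}]},
  "episode_id": null,
  "status": "pending",
  "created_at": "2025-11-08T12:00:00Z"
}`

// parseInvestigation decodes one row as returned by the pending_investigations query.
func parseInvestigation(raw string) (PendingInvestigation, error) {
	var inv PendingInvestigation
	err := json.Unmarshal([]byte(raw), &inv)
	return inv, err
}

func main() {
	inv, err := parseInvestigation(sampleJSON)
	if err != nil {
		panic(err)
	}
	// "episode_id": null decodes to a nil pointer, matching the nullable column.
	fmt.Println(inv.InvestigationID, inv.Status, inv.EpisodeID == nil)
	// → inv-42 pending true
}
```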

@@ -0,0 +1,842 @@
package websocket
import (
"context"
"encoding/json"
"fmt"
"log"
"net"
"net/http"
"os"
"os/exec"
"strings"
"time"
"nannyagentv2/internal/auth"
"nannyagentv2/internal/logging"
"nannyagentv2/internal/metrics"
"nannyagentv2/internal/types"
"github.com/gorilla/websocket"
"github.com/sashabaranov/go-openai"
)
// WebSocketMessage represents a message sent over WebSocket
type WebSocketMessage struct {
Type string `json:"type"`
Data interface{} `json:"data"`
}
// InvestigationTask represents a task sent to the agent
type InvestigationTask struct {
TaskID string `json:"task_id"`
InvestigationID string `json:"investigation_id"`
AgentID string `json:"agent_id"`
DiagnosticPayload map[string]interface{} `json:"diagnostic_payload"`
EpisodeID string `json:"episode_id,omitempty"`
}
// TaskResult represents the result of a completed task
type TaskResult struct {
TaskID string `json:"task_id"`
Success bool `json:"success"`
CommandResults map[string]interface{} `json:"command_results,omitempty"`
Error string `json:"error,omitempty"`
}
// HeartbeatData represents heartbeat information
type HeartbeatData struct {
AgentID string `json:"agent_id"`
Timestamp time.Time `json:"timestamp"`
Version string `json:"version"`
}
// WebSocketClient handles WebSocket connection to Supabase backend
type WebSocketClient struct {
agent types.DiagnosticAgent // DiagnosticAgent interface
conn *websocket.Conn
agentID string
authManager *auth.AuthManager
metricsCollector *metrics.Collector
supabaseURL string
token string
ctx context.Context
cancel context.CancelFunc
consecutiveFailures int // Track consecutive connection failures
}
// NewWebSocketClient creates a new WebSocket client
func NewWebSocketClient(agent types.DiagnosticAgent, authManager *auth.AuthManager) *WebSocketClient {
// Get agent ID from authentication system
var agentID string
if authManager != nil {
if id, err := authManager.GetCurrentAgentID(); err == nil {
agentID = id
// Agent ID retrieved successfully
} else {
logging.Error("Failed to get agent ID from auth manager: %v", err)
}
}
// Fallback to environment variable or generate one if auth fails
if agentID == "" {
agentID = os.Getenv("AGENT_ID")
if agentID == "" {
agentID = fmt.Sprintf("agent-%d", time.Now().Unix())
}
}
supabaseURL := os.Getenv("SUPABASE_PROJECT_URL")
if supabaseURL == "" {
log.Fatal("❌ SUPABASE_PROJECT_URL environment variable is required")
}
// Create metrics collector
metricsCollector := metrics.NewCollector("v2.0.0")
ctx, cancel := context.WithCancel(context.Background())
return &WebSocketClient{
agent: agent,
agentID: agentID,
authManager: authManager,
metricsCollector: metricsCollector,
supabaseURL: supabaseURL,
ctx: ctx,
cancel: cancel,
}
}
// Start starts the WebSocket connection and message handling
func (w *WebSocketClient) Start() error {
// Starting WebSocket client
if err := w.connect(); err != nil {
return fmt.Errorf("failed to establish WebSocket connection: %v", err)
}
// Start message reading loop
go w.handleMessages()
// Start heartbeat
go w.startHeartbeat()
// Start database polling for pending investigations
go w.pollPendingInvestigations()
// WebSocket client started
return nil
}
// Stop closes the WebSocket connection
func (c *WebSocketClient) Stop() {
c.cancel()
if c.conn != nil {
c.conn.Close()
}
}
// getAuthToken retrieves authentication token
func (c *WebSocketClient) getAuthToken() error {
if c.authManager == nil {
return fmt.Errorf("auth manager not available")
}
token, err := c.authManager.EnsureAuthenticated()
if err != nil {
return fmt.Errorf("authentication failed: %v", err)
}
c.token = token.AccessToken
return nil
}
// connect establishes WebSocket connection
func (c *WebSocketClient) connect() error {
// Get fresh auth token
if err := c.getAuthToken(); err != nil {
return fmt.Errorf("failed to get auth token: %v", err)
}
// Convert HTTP URL to WebSocket URL
wsURL := strings.Replace(c.supabaseURL, "https://", "wss://", 1)
wsURL = strings.Replace(wsURL, "http://", "ws://", 1)
wsURL += "/functions/v1/websocket-agent-handler"
// Connecting to WebSocket
// Set up headers
headers := http.Header{}
headers.Set("Authorization", "Bearer "+c.token)
// Connect
dialer := websocket.Dialer{
HandshakeTimeout: 10 * time.Second,
}
conn, resp, err := dialer.Dial(wsURL, headers)
if err != nil {
c.consecutiveFailures++
if c.consecutiveFailures >= 5 && resp != nil {
logging.Error("WebSocket handshake failed with status: %d (failure #%d)", resp.StatusCode, c.consecutiveFailures)
}
return fmt.Errorf("websocket connection failed: %v", err)
}
c.conn = conn
// WebSocket client connected
return nil
}
// handleMessages processes incoming WebSocket messages
func (c *WebSocketClient) handleMessages() {
defer func() {
if c.conn != nil {
// Closing WebSocket connection
c.conn.Close()
}
}()
// Started WebSocket message listener
connectionStart := time.Now()
for {
select {
case <-c.ctx.Done():
// Only log context cancellation if there have been failures
if c.consecutiveFailures >= 5 {
logging.Debug("Context cancelled after %v, stopping message handler", time.Since(connectionStart))
}
return
default:
// Set read deadline to detect connection issues
c.conn.SetReadDeadline(time.Now().Add(90 * time.Second))
var message WebSocketMessage
readStart := time.Now()
err := c.conn.ReadJSON(&message)
readDuration := time.Since(readStart)
if err != nil {
connectionDuration := time.Since(connectionStart)
// Only log specific errors after failure threshold
if c.consecutiveFailures >= 5 {
if websocket.IsCloseError(err, websocket.CloseNormalClosure, websocket.CloseGoingAway) {
logging.Debug("WebSocket closed normally after %v: %v", connectionDuration, err)
} else if websocket.IsUnexpectedCloseError(err, websocket.CloseGoingAway, websocket.CloseAbnormalClosure) {
logging.Error("ABNORMAL CLOSE after %v (code 1006 = server-side timeout/kill): %v", connectionDuration, err)
logging.Debug("Last read took %v, connection lived %v", readDuration, connectionDuration)
} else if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
logging.Warning("READ TIMEOUT after %v: %v", connectionDuration, err)
} else {
logging.Error("WebSocket error after %v: %v", connectionDuration, err)
}
}
// Track consecutive failures for diagnostic threshold
c.consecutiveFailures++
// Only show diagnostics after multiple failures
if c.consecutiveFailures >= 5 {
logging.Debug("DIAGNOSTIC - Connection failed #%d after %v", c.consecutiveFailures, connectionDuration)
}
// Attempt reconnection instead of returning immediately
go c.attemptReconnection()
return
}
// Received WebSocket message successfully - reset failure counter
c.consecutiveFailures = 0
switch message.Type {
case "connection_ack":
// Connection acknowledged
case "heartbeat_ack":
// Heartbeat acknowledged
case "investigation_task":
// Received investigation task - processing
go c.handleInvestigationTask(message.Data)
case "task_result_ack":
// Task result acknowledged
default:
logging.Warning("Unknown message type: %s", message.Type)
}
}
}
}
// handleInvestigationTask processes investigation tasks from the backend
func (c *WebSocketClient) handleInvestigationTask(data interface{}) {
// Parse task data
taskBytes, err := json.Marshal(data)
if err != nil {
logging.Error("Error marshaling task data: %v", err)
return
}
var task InvestigationTask
err = json.Unmarshal(taskBytes, &task)
if err != nil {
logging.Error("Error unmarshaling investigation task: %v", err)
return
}
// Processing investigation task
// Execute diagnostic commands
results, err := c.executeDiagnosticCommands(task.DiagnosticPayload)
// Prepare task result
taskResult := TaskResult{
TaskID: task.TaskID,
Success: err == nil,
}
if err != nil {
taskResult.Error = err.Error()
logging.Error("Task execution failed: %v", err)
} else {
taskResult.CommandResults = results
// Task executed successfully
}
// Send result back
c.sendTaskResult(taskResult)
}
// executeDiagnosticCommands executes the commands from a diagnostic response
func (c *WebSocketClient) executeDiagnosticCommands(diagnosticPayload map[string]interface{}) (map[string]interface{}, error) {
results := map[string]interface{}{
"agent_id": c.agentID,
"execution_time": time.Now().UTC().Format(time.RFC3339),
"command_results": []map[string]interface{}{},
}
// Extract commands from diagnostic payload
commands, ok := diagnosticPayload["commands"].([]interface{})
if !ok {
return nil, fmt.Errorf("no commands found in diagnostic payload")
}
var commandResults []map[string]interface{}
for _, cmd := range commands {
cmdMap, ok := cmd.(map[string]interface{})
if !ok {
continue
}
id, _ := cmdMap["id"].(string)
command, _ := cmdMap["command"].(string)
description, _ := cmdMap["description"].(string)
if command == "" {
continue
}
// Executing command
// Execute the command
output, exitCode, err := c.executeCommand(command)
result := map[string]interface{}{
"id": id,
"command": command,
"description": description,
"output": output,
"exit_code": exitCode,
"success": err == nil && exitCode == 0,
}
if err != nil {
result["error"] = err.Error()
logging.Warning("Command [%s] failed: %v (exit code: %d)", id, err, exitCode)
}
commandResults = append(commandResults, result)
}
results["command_results"] = commandResults
results["total_commands"] = len(commandResults)
results["successful_commands"] = c.countSuccessfulCommands(commandResults)
// Execute eBPF programs if present
ebpfPrograms, hasEBPF := diagnosticPayload["ebpf_programs"].([]interface{})
if hasEBPF && len(ebpfPrograms) > 0 {
ebpfResults := c.executeEBPFPrograms(ebpfPrograms)
results["ebpf_results"] = ebpfResults
results["total_ebpf_programs"] = len(ebpfPrograms)
}
return results, nil
}
// executeEBPFPrograms executes eBPF monitoring programs using the real eBPF manager
func (c *WebSocketClient) executeEBPFPrograms(ebpfPrograms []interface{}) []map[string]interface{} {
var ebpfRequests []types.EBPFRequest
// Convert interface{} to EBPFRequest structs
for _, prog := range ebpfPrograms {
progMap, ok := prog.(map[string]interface{})
if !ok {
continue
}
name, _ := progMap["name"].(string)
progType, _ := progMap["type"].(string)
target, _ := progMap["target"].(string)
duration, _ := progMap["duration"].(float64)
description, _ := progMap["description"].(string)
if name == "" || progType == "" || target == "" {
continue
}
ebpfRequests = append(ebpfRequests, types.EBPFRequest{
Name: name,
Type: progType,
Target: target,
Duration: int(duration),
Description: description,
})
}
// Execute eBPF programs using the agent's new BCC concurrent execution logic
traceSpecs := c.agent.ConvertEBPFProgramsToTraceSpecs(ebpfRequests)
return c.agent.ExecuteEBPFTraces(traceSpecs)
}
// executeCommandsFromPayload executes commands from a payload and returns results
func (c *WebSocketClient) executeCommandsFromPayload(commands []interface{}) []map[string]interface{} {
var commandResults []map[string]interface{}
for _, cmd := range commands {
cmdMap, ok := cmd.(map[string]interface{})
if !ok {
continue
}
id, _ := cmdMap["id"].(string)
command, _ := cmdMap["command"].(string)
description, _ := cmdMap["description"].(string)
if command == "" {
continue
}
// Execute the command
output, exitCode, err := c.executeCommand(command)
result := map[string]interface{}{
"id": id,
"command": command,
"description": description,
"output": output,
"exit_code": exitCode,
"success": err == nil && exitCode == 0,
}
if err != nil {
result["error"] = err.Error()
logging.Warning("Command [%s] failed: %v (exit code: %d)", id, err, exitCode)
}
commandResults = append(commandResults, result)
}
return commandResults
}
// executeCommand executes a shell command and returns output, exit code, and error
func (c *WebSocketClient) executeCommand(command string) (string, int, error) {
if strings.TrimSpace(command) == "" {
return "", -1, fmt.Errorf("empty command")
}
// Run through a shell so pipes, redirects, and quoting in diagnostic
// commands (e.g. "free -h | grep Mem") work; splitting on whitespace would break them.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, "sh", "-c", command)
cmd.Env = os.Environ()
output, err := cmd.CombinedOutput()
exitCode := 0
if err != nil {
if exitError, ok := err.(*exec.ExitError); ok {
exitCode = exitError.ExitCode()
} else {
exitCode = -1
}
}
return string(output), exitCode, err
}
// countSuccessfulCommands counts the number of successful commands
func (c *WebSocketClient) countSuccessfulCommands(results []map[string]interface{}) int {
count := 0
for _, result := range results {
if success, ok := result["success"].(bool); ok && success {
count++
}
}
return count
}
// sendTaskResult sends a task result back to the backend
func (c *WebSocketClient) sendTaskResult(result TaskResult) {
message := WebSocketMessage{
Type: "task_result",
Data: result,
}
err := c.conn.WriteJSON(message)
if err != nil {
logging.Error("Error sending task result: %v", err)
}
}
// startHeartbeat sends periodic heartbeat messages
func (c *WebSocketClient) startHeartbeat() {
ticker := time.NewTicker(30 * time.Second) // Heartbeat every 30 seconds
defer ticker.Stop()
// Starting heartbeat
for {
select {
case <-c.ctx.Done():
logging.Debug("Heartbeat stopped due to context cancellation")
return
case <-ticker.C:
// Sending heartbeat
heartbeat := WebSocketMessage{
Type: "heartbeat",
Data: HeartbeatData{
AgentID: c.agentID,
Timestamp: time.Now(),
Version: "v2.0.0",
},
}
err := c.conn.WriteJSON(heartbeat)
if err != nil {
logging.Error("Error sending heartbeat: %v", err)
logging.Debug("Heartbeat failed, connection likely dead")
return
}
// Heartbeat sent
}
}
}
// pollPendingInvestigations polls the database for pending investigations
func (c *WebSocketClient) pollPendingInvestigations() {
// Starting database polling
ticker := time.NewTicker(5 * time.Second) // Poll every 5 seconds
defer ticker.Stop()
for {
select {
case <-c.ctx.Done():
return
case <-ticker.C:
c.checkForPendingInvestigations()
}
}
}
// checkForPendingInvestigations checks the database for new pending investigations via proxy
func (c *WebSocketClient) checkForPendingInvestigations() {
// Use Edge Function proxy instead of direct database access
url := fmt.Sprintf("%s/functions/v1/agent-database-proxy/pending-investigations", c.supabaseURL)
// Poll database for pending investigations
req, err := http.NewRequest("GET", url, nil)
if err != nil {
// Request creation failed
return
}
// Only JWT token needed for proxy - no API keys exposed
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", c.token))
req.Header.Set("Accept", "application/json")
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
// Database request failed
return
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
return
}
var investigations []types.PendingInvestigation
err = json.NewDecoder(resp.Body).Decode(&investigations)
if err != nil {
// Response decode failed
return
}
for _, investigation := range investigations {
go c.handlePendingInvestigation(investigation)
}
}
// handlePendingInvestigation processes a pending investigation from database polling
func (c *WebSocketClient) handlePendingInvestigation(investigation types.PendingInvestigation) {
// Processing pending investigation
// Mark as executing
err := c.updateInvestigationStatus(investigation.ID, "executing", nil, nil)
if err != nil {
return
}
// Execute diagnostic commands
results, err := c.executeDiagnosticCommands(investigation.DiagnosticPayload)
// Prepare the base results map we'll send to DB
resultsForDB := map[string]interface{}{
"agent_id": c.agentID,
"execution_time": time.Now().UTC().Format(time.RFC3339),
"command_results": results,
}
// If command execution failed, mark investigation as failed
if err != nil {
errorMsg := err.Error()
// resultsForDB already includes any partial command results collected above
c.updateInvestigationStatus(investigation.ID, "failed", resultsForDB, &errorMsg)
// Investigation failed
return
}
// Try to continue the TensorZero conversation by sending command results back
// Build messages: assistant = diagnostic payload, user = command results
diagJSON, _ := json.Marshal(investigation.DiagnosticPayload)
commandsJSON, _ := json.MarshalIndent(results, "", " ")
messages := []openai.ChatCompletionMessage{
{
Role: openai.ChatMessageRoleAssistant,
Content: string(diagJSON),
},
{
Role: openai.ChatMessageRoleUser,
Content: string(commandsJSON),
},
}
// Use the episode ID from the investigation to maintain conversation continuity
episodeID := ""
if investigation.EpisodeID != nil {
episodeID = *investigation.EpisodeID
}
// Continue conversation until resolution (same as agent)
var finalAIContent string
for {
tzResp, tzErr := c.agent.SendRequestWithEpisode(messages, episodeID)
if tzErr != nil {
logging.Warning("TensorZero continuation failed: %v", tzErr)
// Fall back to marking completed with command results only
c.updateInvestigationStatus(investigation.ID, "completed", resultsForDB, nil)
return
}
if len(tzResp.Choices) == 0 {
logging.Warning("No choices in TensorZero response")
c.updateInvestigationStatus(investigation.ID, "completed", resultsForDB, nil)
return
}
aiContent := tzResp.Choices[0].Message.Content
// Log short responses verbatim; long responses are elided to keep logs readable
if len(aiContent) <= 300 {
logging.Debug("AI Response: %s", aiContent)
}
// Check if this is a resolution response (final)
var resolutionResp struct {
ResponseType string `json:"response_type"`
RootCause string `json:"root_cause"`
ResolutionPlan string `json:"resolution_plan"`
Confidence string `json:"confidence"`
}
logging.Debug("Analyzing AI response type...")
if err := json.Unmarshal([]byte(aiContent), &resolutionResp); err == nil && resolutionResp.ResponseType == "resolution" {
// This is the final resolution - show summary and complete
logging.Info("=== DIAGNOSIS COMPLETE ===")
logging.Info("Root Cause: %s", resolutionResp.RootCause)
logging.Info("Resolution Plan: %s", resolutionResp.ResolutionPlan)
logging.Info("Confidence: %s", resolutionResp.Confidence)
finalAIContent = aiContent
break
}
// Check if this is another diagnostic response requiring more commands
var diagnosticResp struct {
ResponseType string `json:"response_type"`
Commands []interface{} `json:"commands"`
EBPFPrograms []interface{} `json:"ebpf_programs"`
}
if err := json.Unmarshal([]byte(aiContent), &diagnosticResp); err == nil && diagnosticResp.ResponseType == "diagnostic" {
logging.Debug("AI requested additional diagnostics, executing...")
// Execute additional commands if any
additionalResults := map[string]interface{}{
"command_results": []map[string]interface{}{},
}
if len(diagnosticResp.Commands) > 0 {
logging.Debug("Executing %d additional diagnostic commands", len(diagnosticResp.Commands))
commandResults := c.executeCommandsFromPayload(diagnosticResp.Commands)
additionalResults["command_results"] = commandResults
}
// Execute additional eBPF programs if any
if len(diagnosticResp.EBPFPrograms) > 0 {
ebpfResults := c.executeEBPFPrograms(diagnosticResp.EBPFPrograms)
additionalResults["ebpf_results"] = ebpfResults
}
// Add AI response and additional results to conversation
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleAssistant,
Content: aiContent,
})
additionalResultsJSON, _ := json.MarshalIndent(additionalResults, "", " ")
messages = append(messages, openai.ChatCompletionMessage{
Role: openai.ChatMessageRoleUser,
Content: string(additionalResultsJSON),
})
continue
}
// If neither resolution nor diagnostic, treat as final response
logging.Warning("Unknown response type - treating as final response")
finalAIContent = aiContent
break
}
// Attach final AI response to results for DB and mark as completed_with_analysis
resultsForDB["ai_response"] = finalAIContent
c.updateInvestigationStatus(investigation.ID, "completed_with_analysis", resultsForDB, nil)
}
// updateInvestigationStatus updates the status of a pending investigation
func (c *WebSocketClient) updateInvestigationStatus(id, status string, results map[string]interface{}, errorMsg *string) error {
updateData := map[string]interface{}{
"status": status,
}
// Attach results for any terminal status, not only "completed" —
// callers also pass results for "failed" and "completed_with_analysis"
if results != nil {
updateData["command_results"] = results
}
switch status {
case "executing":
updateData["started_at"] = time.Now().UTC().Format(time.RFC3339)
case "completed", "completed_with_analysis":
updateData["completed_at"] = time.Now().UTC().Format(time.RFC3339)
case "failed":
updateData["completed_at"] = time.Now().UTC().Format(time.RFC3339)
if errorMsg != nil {
updateData["error_message"] = *errorMsg
}
}
jsonData, err := json.Marshal(updateData)
if err != nil {
return fmt.Errorf("failed to marshal update data: %v", err)
}
url := fmt.Sprintf("%s/functions/v1/agent-database-proxy/pending-investigations/%s", c.supabaseURL, id)
req, err := http.NewRequest("PATCH", url, strings.NewReader(string(jsonData)))
if err != nil {
return fmt.Errorf("failed to create request: %v", err)
}
// Only JWT token needed for proxy - no API keys exposed
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", c.token))
req.Header.Set("Content-Type", "application/json")
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
return fmt.Errorf("failed to update investigation: %v", err)
}
defer resp.Body.Close()
if resp.StatusCode != 200 && resp.StatusCode != 204 {
return fmt.Errorf("supabase update error: %d", resp.StatusCode)
}
return nil
}
// attemptReconnection attempts to reconnect the WebSocket with backoff
func (c *WebSocketClient) attemptReconnection() {
backoffDurations := []time.Duration{
2 * time.Second,
5 * time.Second,
10 * time.Second,
20 * time.Second,
30 * time.Second,
}
for i, backoff := range backoffDurations {
select {
case <-c.ctx.Done():
return
default:
c.consecutiveFailures++
// Only show messages after 5 consecutive failures
if c.consecutiveFailures >= 5 {
logging.Info("Attempting WebSocket reconnection (attempt %d/%d) - %d consecutive failures", i+1, len(backoffDurations), c.consecutiveFailures)
}
time.Sleep(backoff)
if err := c.connect(); err != nil {
if c.consecutiveFailures >= 5 {
logging.Warning("Reconnection attempt %d failed: %v", i+1, err)
}
continue
}
// Successfully reconnected - reset failure counter
if c.consecutiveFailures >= 5 {
logging.Info("WebSocket reconnected successfully after %d failures", c.consecutiveFailures)
}
c.consecutiveFailures = 0
go c.handleMessages() // Restart message handling
return
}
}
logging.Error("Failed to reconnect after %d attempts, giving up", len(backoffDurations))
}

main.go

@@ -2,6 +2,7 @@ package main
import (
"bufio"
"flag"
"fmt"
"log"
"os"
@@ -9,26 +10,74 @@ import (
"strconv"
"strings"
"syscall"
"time"
"nannyagentv2/internal/auth"
"nannyagentv2/internal/config"
"nannyagentv2/internal/logging"
"nannyagentv2/internal/metrics"
"nannyagentv2/internal/types"
"nannyagentv2/internal/websocket"
)
const Version = "0.0.1"
// showVersion displays the version information
func showVersion() {
fmt.Printf("nannyagent version %s\n", Version)
fmt.Println("Linux diagnostic agent with eBPF capabilities")
os.Exit(0)
}
// showHelp displays the help information
func showHelp() {
fmt.Println("NannyAgent - Linux Diagnostic Agent with eBPF Monitoring")
fmt.Printf("Version: %s\n\n", Version)
fmt.Println("USAGE:")
fmt.Printf(" sudo %s [OPTIONS]\n\n", os.Args[0])
fmt.Println("OPTIONS:")
fmt.Println(" --version, -v Show version information")
fmt.Println(" --help, -h Show this help message")
fmt.Println()
fmt.Println("DESCRIPTION:")
fmt.Println(" NannyAgent is an AI-powered Linux diagnostic tool that uses eBPF")
fmt.Println(" for deep system monitoring and analysis. It requires root privileges")
fmt.Println(" to run for eBPF functionality.")
fmt.Println()
fmt.Println("REQUIREMENTS:")
fmt.Println(" - Linux kernel 5.x or higher")
fmt.Println(" - Root privileges (sudo)")
fmt.Println(" - bpftrace and bpfcc-tools installed")
fmt.Println(" - Network connectivity to Supabase")
fmt.Println()
fmt.Println("CONFIGURATION:")
fmt.Println(" Configuration file: /etc/nannyagent/config.env")
fmt.Println(" Data directory: /var/lib/nannyagent")
fmt.Println()
fmt.Println("EXAMPLES:")
fmt.Printf(" # Run the agent\n")
fmt.Printf(" sudo %s\n\n", os.Args[0])
fmt.Printf(" # Show version (no sudo required)\n")
fmt.Printf(" %s --version\n\n", os.Args[0])
fmt.Println("For more information, visit: https://github.com/yourusername/nannyagent")
os.Exit(0)
}
// checkRootPrivileges ensures the program is running as root
func checkRootPrivileges() {
if os.Geteuid() != 0 {
fmt.Fprintf(os.Stderr, "❌ ERROR: This program must be run as root for eBPF functionality.\n")
fmt.Fprintf(os.Stderr, "Please run with: sudo %s\n", os.Args[0])
fmt.Fprintf(os.Stderr, "Reason: eBPF programs require root privileges to:\n")
fmt.Fprintf(os.Stderr, " - Load programs into the kernel\n")
fmt.Fprintf(os.Stderr, " - Attach to kernel functions and tracepoints\n")
fmt.Fprintf(os.Stderr, " - Access kernel memory maps\n")
logging.Error("This program must be run as root for eBPF functionality")
logging.Error("Please run with: sudo %s", os.Args[0])
logging.Error("Reason: eBPF programs require root privileges to:\n - Load programs into the kernel\n - Attach to kernel functions and tracepoints\n - Access kernel memory maps")
os.Exit(1)
}
}
// checkKernelVersionCompatibility ensures kernel version is 4.4 or higher
// checkKernelVersionCompatibility ensures kernel version is 5.x or higher
func checkKernelVersionCompatibility() {
output, err := exec.Command("uname", "-r").Output()
if err != nil {
fmt.Fprintf(os.Stderr, "❌ ERROR: Cannot determine kernel version: %v\n", err)
logging.Error("Cannot determine kernel version: %v", err)
os.Exit(1)
}
@@ -37,81 +86,51 @@ func checkKernelVersionCompatibility() {
// Parse version (e.g., "5.15.0-56-generic" -> major=5, minor=15)
parts := strings.Split(kernelVersion, ".")
if len(parts) < 2 {
fmt.Fprintf(os.Stderr, "❌ ERROR: Cannot parse kernel version: %s\n", kernelVersion)
logging.Error("Cannot parse kernel version: %s", kernelVersion)
os.Exit(1)
}
major, err := strconv.Atoi(parts[0])
if err != nil {
fmt.Fprintf(os.Stderr, "❌ ERROR: Cannot parse major kernel version: %s\n", parts[0])
logging.Error("Cannot parse major kernel version: %s", parts[0])
os.Exit(1)
}
minor, err := strconv.Atoi(parts[1])
if err != nil {
fmt.Fprintf(os.Stderr, "❌ ERROR: Cannot parse minor kernel version: %s\n", parts[1])
// Check if kernel is 5.x or higher
if major < 5 {
logging.Error("Kernel version %s is not supported", kernelVersion)
logging.Error("Required: Linux kernel 5.x or higher")
logging.Error("Current: %s (major version: %d)", kernelVersion, major)
logging.Error("Reason: NannyAgent requires modern kernel features:\n - Advanced eBPF capabilities\n - BTF (BPF Type Format) support\n - Enhanced security and stability")
os.Exit(1)
}
// Check if kernel is 4.4 or higher
if major < 4 || (major == 4 && minor < 4) {
fmt.Fprintf(os.Stderr, "❌ ERROR: Kernel version %s is too old for eBPF.\n", kernelVersion)
fmt.Fprintf(os.Stderr, "Required: Linux kernel 4.4 or higher\n")
fmt.Fprintf(os.Stderr, "Current: %s\n", kernelVersion)
fmt.Fprintf(os.Stderr, "Reason: eBPF requires kernel features introduced in 4.4+:\n")
fmt.Fprintf(os.Stderr, " - BPF system call support\n")
fmt.Fprintf(os.Stderr, " - eBPF program types (kprobe, tracepoint)\n")
fmt.Fprintf(os.Stderr, " - BPF maps and helper functions\n")
os.Exit(1)
}
fmt.Printf("✅ Kernel version %s is compatible with eBPF\n", kernelVersion)
}
// checkEBPFSupport validates eBPF subsystem availability
func checkEBPFSupport() {
// Check if /sys/kernel/debug/tracing exists (debugfs mounted)
if _, err := os.Stat("/sys/kernel/debug/tracing"); os.IsNotExist(err) {
fmt.Fprintf(os.Stderr, "⚠️ WARNING: debugfs not mounted. Some eBPF features may not work.\n")
fmt.Fprintf(os.Stderr, "To fix: sudo mount -t debugfs debugfs /sys/kernel/debug\n")
logging.Warning("debugfs not mounted. Some eBPF features may not work")
logging.Info("To fix: sudo mount -t debugfs debugfs /sys/kernel/debug")
}
// Check if we can access BPF syscall
fd, _, errno := syscall.Syscall(321, 0, 0, 0) // SYS_BPF is 321 on x86_64 only; other architectures use different syscall numbers
if errno != 0 && errno != syscall.EINVAL {
fmt.Fprintf(os.Stderr, "❌ ERROR: BPF syscall not available (errno: %v)\n", errno)
fmt.Fprintf(os.Stderr, "This may indicate:\n")
fmt.Fprintf(os.Stderr, " - Kernel compiled without BPF support\n")
fmt.Fprintf(os.Stderr, " - BPF syscall disabled in kernel config\n")
logging.Error("BPF syscall not available (errno: %v)", errno)
logging.Error("This may indicate:\n - Kernel compiled without BPF support\n - BPF syscall disabled in kernel config")
os.Exit(1)
}
if fd > 0 {
syscall.Close(int(fd))
}
fmt.Printf("✅ eBPF syscall is available\n")
}
func main() {
fmt.Println("🔍 Linux eBPF-Enhanced Diagnostic Agent")
fmt.Println("=======================================")
// Perform system compatibility checks
fmt.Println("Performing system compatibility checks...")
checkRootPrivileges()
checkKernelVersionCompatibility()
checkEBPFSupport()
fmt.Println("✅ All system checks passed")
fmt.Println("")
// Initialize the agent
agent := NewLinuxDiagnosticAgent()
// Start the interactive session
fmt.Println("Linux Diagnostic Agent Started")
fmt.Println("Enter a system issue description (or 'quit' to exit):")
// runInteractiveDiagnostics starts the interactive diagnostic session
func runInteractiveDiagnostics(agent *LinuxDiagnosticAgent) {
logging.Info("=== Linux eBPF-Enhanced Diagnostic Agent ===")
logging.Info("Linux Diagnostic Agent Started")
logging.Info("Enter a system issue description (or 'quit' to exit):")
scanner := bufio.NewScanner(os.Stdin)
for {
@@ -129,9 +148,9 @@ func main() {
continue
}
// Process the issue with eBPF capabilities
if err := agent.DiagnoseWithEBPF(input); err != nil {
fmt.Printf("Error: %v\n", err)
// Process the issue with AI capabilities via TensorZero
if err := agent.DiagnoseIssue(input); err != nil {
logging.Error("Diagnosis failed: %v", err)
}
}
@@ -139,5 +158,133 @@ func main() {
log.Fatal(err)
}
fmt.Println("Goodbye!")
logging.Info("Goodbye!")
}
func main() {
// Define flags with both long and short versions
versionFlag := flag.Bool("version", false, "Show version information")
versionFlagShort := flag.Bool("v", false, "Show version information (short)")
helpFlag := flag.Bool("help", false, "Show help information")
helpFlagShort := flag.Bool("h", false, "Show help information (short)")
flag.Parse()
// Handle --version or -v flag (no root required)
if *versionFlag || *versionFlagShort {
showVersion()
}
// Handle --help or -h flag (no root required)
if *helpFlag || *helpFlagShort {
showHelp()
}
logging.Info("NannyAgent v%s starting...", Version)
// Perform system compatibility checks first
logging.Info("Performing system compatibility checks...")
checkRootPrivileges()
checkKernelVersionCompatibility()
checkEBPFSupport()
logging.Info("All system checks passed")
// Load configuration
cfg, err := config.LoadConfig()
if err != nil {
log.Fatalf("❌ Failed to load configuration: %v", err)
}
cfg.PrintConfig()
// Initialize components
authManager := auth.NewAuthManager(cfg)
metricsCollector := metrics.NewCollector(Version)
// Ensure authentication
token, err := authManager.EnsureAuthenticated()
if err != nil {
log.Fatalf("❌ Authentication failed: %v", err)
}
logging.Info("Authentication successful!")
// Initialize the diagnostic agent for interactive CLI use with authentication
agent := NewLinuxDiagnosticAgentWithAuth(authManager)
// Initialize a separate agent for WebSocket investigations using the application model
applicationAgent := NewLinuxDiagnosticAgent()
applicationAgent.model = "tensorzero::function_name::diagnose_and_heal_application"
// Start WebSocket client for backend communications and investigations
wsClient := websocket.NewWebSocketClient(applicationAgent, authManager)
go func() {
if err := wsClient.Start(); err != nil {
logging.Error("WebSocket client error: %v", err)
}
}()
// Start background metrics collection in a goroutine
go func() {
logging.Debug("Starting background metrics collection and heartbeat...")
ticker := time.NewTicker(time.Duration(cfg.MetricsInterval) * time.Second)
defer ticker.Stop()
// Send initial heartbeat
if err := sendHeartbeat(cfg, token, metricsCollector); err != nil {
logging.Warning("Initial heartbeat failed: %v", err)
}
// Main heartbeat loop
for range ticker.C {
// Check if token needs refresh
if authManager.IsTokenExpired(token) {
logging.Debug("Token expiring soon, refreshing...")
newToken, refreshErr := authManager.EnsureAuthenticated()
if refreshErr != nil {
logging.Warning("Token refresh failed: %v", refreshErr)
continue
}
token = newToken
logging.Debug("Token refreshed successfully")
}
// Send heartbeat
if err := sendHeartbeat(cfg, token, metricsCollector); err != nil {
logging.Warning("Heartbeat failed: %v", err)
// If unauthorized, try to refresh token
if err.Error() == "unauthorized" {
logging.Debug("Unauthorized, attempting token refresh...")
newToken, refreshErr := authManager.EnsureAuthenticated()
if refreshErr != nil {
logging.Warning("Token refresh failed: %v", refreshErr)
continue
}
token = newToken
// Retry heartbeat with new token (silently)
if retryErr := sendHeartbeat(cfg, token, metricsCollector); retryErr != nil {
logging.Warning("Retry heartbeat failed: %v", retryErr)
}
}
}
// No logging for successful heartbeats - they should be silent
}
}()
// Start the interactive diagnostic session (blocking)
runInteractiveDiagnostics(agent)
}
// sendHeartbeat collects metrics and sends heartbeat to the server
func sendHeartbeat(cfg *config.Config, token *types.AuthToken, collector *metrics.Collector) error {
// Collect system metrics
systemMetrics, err := collector.GatherSystemMetrics()
if err != nil {
return fmt.Errorf("failed to gather system metrics: %w", err)
}
// Send metrics using the collector with correct agent_id from token
return collector.SendMetrics(cfg.AgentAuthURL, token.AccessToken, token.AgentID, systemMetrics)
}