Files
nannyagent/EBPF_TENSORZERO_INTEGRATION.md
Harshavardhan Musanalli 4b442ab169 Adding ebpf capability now
2025-09-28 12:10:52 +02:00

6.2 KiB

eBPF Integration Summary for TensorZero

🎯 Overview

Your Linux diagnostic agent now has advanced eBPF monitoring capabilities integrated with the Cilium eBPF Go library. This enables real-time kernel-level monitoring alongside traditional system commands for unprecedented diagnostic precision.

🔄 Key Changes from Previous System Prompt

Before (Traditional Commands Only):

{
  "response_type": "diagnostic",
  "reasoning": "Need to check network connections",
  "commands": [
    {"id": "net_check", "command": "netstat -tulpn", "description": "Check connections"}
  ]
}

After (eBPF-Enhanced):

{
  "response_type": "diagnostic", 
  "reasoning": "Network timeout issues require monitoring TCP connections and system calls to identify bottlenecks",
  "commands": [
    {"id": "net_status", "command": "ss -tulpn", "description": "Current network connections"}
  ],
  "ebpf_programs": [
    {
      "name": "tcp_connect_monitor",
      "type": "kprobe",
      "target": "tcp_connect", 
      "duration": 15,
      "description": "Monitor TCP connection attempts in real-time"
    }
  ]
}

🔧 TensorZero Configuration Steps

1. Update System Prompt

Replace your current system prompt with the content from TENSORZERO_SYSTEM_PROMPT.md. Key additions:

  • eBPF program request format in diagnostic responses
  • Comprehensive eBPF guidelines for different issue types
  • Enhanced resolution format with ebpf_evidence field
  • Specific tracepoint/kprobe recommendations per issue category

2. Response Format Changes

Diagnostic Phase (Enhanced):

{
  "response_type": "diagnostic",
  "reasoning": "Analysis explanation...",
  "commands": [...],
  "ebpf_programs": [
    {
      "name": "program_name",
      "type": "tracepoint|kprobe|kretprobe", 
      "target": "kernel_function_or_tracepoint",
      "duration": 10-30,
      "filters": {"comm": "process_name", "pid": 1234},
      "description": "Why this monitoring is needed"
    }
  ]
}

Resolution Phase (Enhanced):

{
  "response_type": "resolution",
  "root_cause": "Definitive root cause statement",
  "resolution_plan": "Step-by-step fix plan", 
  "confidence": "High|Medium|Low",
  "ebpf_evidence": "Summary of eBPF findings that led to diagnosis"
}

3. eBPF Program Categories (AI Guidelines)

The system prompt now includes specific eBPF program recommendations:

Issue Type Recommended eBPF Programs
Network syscalls/sys_enter_connect, kprobe:tcp_connect, kprobe:tcp_sendmsg
Process syscalls/sys_enter_execve, sched/sched_process_exit, kprobe:do_fork
File I/O syscalls/sys_enter_openat, kprobe:vfs_read, kprobe:vfs_write
Performance syscalls/sys_enter_*, kprobe:schedule, irq/irq_handler_entry
Memory kprobe:__alloc_pages_nodemask, kmem/kmalloc

🔍 Data Flow

1. AI Request → Agent

{
  "ebpf_programs": [
    {"name": "tcp_monitor", "type": "kprobe", "target": "tcp_connect", "duration": 15}
  ]
}

2. Agent → eBPF Manager

programID, err := ebpfManager.StartEBPFProgram(ebpfRequest)

3. eBPF Results → AI

{
  "ebpf_results": {
    "tcp_monitor_1695902400": {
      "program_name": "tcp_monitor", 
      "event_count": 42,
      "events": [
        {
          "timestamp": 1695902400000000000,
          "process_id": 1234,
          "process_name": "curl",
          "event_type": "tcp_connect", 
          "data": {"destination": "192.168.1.1:443"}
        }
      ],
      "summary": "Captured 42 TCP connection attempts over 15 seconds"
    }
  }
}

Validation Checklist

Before deploying to TensorZero:

  • System Prompt Updated: Copy complete content from TENSORZERO_SYSTEM_PROMPT.md
  • JSON Format Validated: Ensure AI model can generate structured eBPF requests
  • Agent Endpoint: Verify NANNYAPI_MODEL environment variable points to your TensorZero function
  • Test Scenarios: Prepare test cases for network, process, file, and performance issues
  • Root Privileges: Ensure production agent runs with sufficient privileges for eBPF

🚀 Example Diagnostic Flow

User Issue: "Application randomly fails to connect to database"

AI Response:

{
  "response_type": "diagnostic",
  "reasoning": "Database connection failures could be due to network issues, DNS resolution, or connection pool exhaustion. Need to monitor both system-level network activity and application behavior.",
  "commands": [
    {"id": "db_connections", "command": "ss -tlnp | grep :5432", "description": "Check PostgreSQL connections"},
    {"id": "dns_check", "command": "nslookup db.example.com", "description": "Verify DNS resolution"}
  ],
  "ebpf_programs": [
    {
      "name": "tcp_connect_db",
      "type": "kprobe", 
      "target": "tcp_connect",
      "duration": 20,
      "filters": {"comm": "myapp"},
      "description": "Monitor TCP connection attempts from application"
    },
    {
      "name": "dns_queries",
      "type": "tracepoint",
      "target": "syscalls/sys_enter_connect", 
      "duration": 20,
      "description": "Monitor network system calls for DNS resolution"
    }
  ]
}

Agent Execution:

  1. Runs ss and nslookup commands
  2. Starts eBPF programs to monitor TCP connections and DNS queries
  3. Collects real-time kernel events for 20 seconds
  4. Returns combined traditional + eBPF results to AI

AI Analysis:

{
  "response_type": "resolution",
  "root_cause": "Application is experiencing DNS resolution timeouts. eBPF traces show successful TCP connections to IP addresses but failed connections when using hostname.",
  "resolution_plan": "1. Configure application to use IP address directly\n2. Fix DNS timeout: echo 'nameserver 8.8.8.8' >> /etc/resolv.conf\n3. Test connectivity: dig db.example.com",
  "confidence": "High",
  "ebpf_evidence": "eBPF tcp_connect traces show 15 successful connections to 10.0.1.50:5432 but 8 failed connection attempts during DNS lookups. DNS query monitoring revealed 3-5 second delays in resolution."
}

This integration provides your diagnostic agent with professional-grade system monitoring capabilities that were previously only available in dedicated observability tools!