1) add-bpf-capability 2) Not so clean but for now it's okay to start with Co-authored-by: Harshavardhan Musanalli <harshavmb@gmail.com> Reviewed-on: #1
6.2 KiB
6.2 KiB
eBPF Integration Summary for TensorZero
🎯 Overview
Your Linux diagnostic agent now has advanced eBPF monitoring capabilities integrated with the Cilium eBPF Go library. This enables real-time kernel-level monitoring alongside traditional system commands for unprecedented diagnostic precision.
🔄 Key Changes from Previous System Prompt
Before (Traditional Commands Only):
{
"response_type": "diagnostic",
"reasoning": "Need to check network connections",
"commands": [
{"id": "net_check", "command": "netstat -tulpn", "description": "Check connections"}
]
}
After (eBPF-Enhanced):
{
"response_type": "diagnostic",
"reasoning": "Network timeout issues require monitoring TCP connections and system calls to identify bottlenecks",
"commands": [
{"id": "net_status", "command": "ss -tulpn", "description": "Current network connections"}
],
"ebpf_programs": [
{
"name": "tcp_connect_monitor",
"type": "kprobe",
"target": "tcp_connect",
"duration": 15,
"description": "Monitor TCP connection attempts in real-time"
}
]
}
🔧 TensorZero Configuration Steps
1. Update System Prompt
Replace your current system prompt with the content from TENSORZERO_SYSTEM_PROMPT.md. Key additions:
- eBPF program request format in diagnostic responses
- Comprehensive eBPF guidelines for different issue types
- Enhanced resolution format with
ebpf_evidencefield - Specific tracepoint/kprobe recommendations per issue category
2. Response Format Changes
Diagnostic Phase (Enhanced):
{
"response_type": "diagnostic",
"reasoning": "Analysis explanation...",
"commands": [...],
"ebpf_programs": [
{
"name": "program_name",
"type": "tracepoint|kprobe|kretprobe",
"target": "kernel_function_or_tracepoint",
"duration": 10-30,
"filters": {"comm": "process_name", "pid": 1234},
"description": "Why this monitoring is needed"
}
]
}
Resolution Phase (Enhanced):
{
"response_type": "resolution",
"root_cause": "Definitive root cause statement",
"resolution_plan": "Step-by-step fix plan",
"confidence": "High|Medium|Low",
"ebpf_evidence": "Summary of eBPF findings that led to diagnosis"
}
3. eBPF Program Categories (AI Guidelines)
The system prompt now includes specific eBPF program recommendations:
| Issue Type | Recommended eBPF Programs |
|---|---|
| Network | syscalls/sys_enter_connect, kprobe:tcp_connect, kprobe:tcp_sendmsg |
| Process | syscalls/sys_enter_execve, sched/sched_process_exit, kprobe:do_fork |
| File I/O | syscalls/sys_enter_openat, kprobe:vfs_read, kprobe:vfs_write |
| Performance | syscalls/sys_enter_*, kprobe:schedule, irq/irq_handler_entry |
| Memory | kprobe:__alloc_pages_nodemask, kmem/kmalloc |
🔍 Data Flow
1. AI Request → Agent
{
"ebpf_programs": [
{"name": "tcp_monitor", "type": "kprobe", "target": "tcp_connect", "duration": 15}
]
}
2. Agent → eBPF Manager
programID, err := ebpfManager.StartEBPFProgram(ebpfRequest)
3. eBPF Results → AI
{
"ebpf_results": {
"tcp_monitor_1695902400": {
"program_name": "tcp_monitor",
"event_count": 42,
"events": [
{
"timestamp": 1695902400000000000,
"process_id": 1234,
"process_name": "curl",
"event_type": "tcp_connect",
"data": {"destination": "192.168.1.1:443"}
}
],
"summary": "Captured 42 TCP connection attempts over 15 seconds"
}
}
}
✅ Validation Checklist
Before deploying to TensorZero:
- System Prompt Updated: Copy complete content from
TENSORZERO_SYSTEM_PROMPT.md - JSON Format Validated: Ensure AI model can generate structured eBPF requests
- Agent Endpoint: Verify
NANNYAPI_MODELenvironment variable points to your TensorZero function - Test Scenarios: Prepare test cases for network, process, file, and performance issues
- Root Privileges: Ensure production agent runs with sufficient privileges for eBPF
🚀 Example Diagnostic Flow
User Issue: "Application randomly fails to connect to database"
AI Response:
{
"response_type": "diagnostic",
"reasoning": "Database connection failures could be due to network issues, DNS resolution, or connection pool exhaustion. Need to monitor both system-level network activity and application behavior.",
"commands": [
{"id": "db_connections", "command": "ss -tlnp | grep :5432", "description": "Check PostgreSQL connections"},
{"id": "dns_check", "command": "nslookup db.example.com", "description": "Verify DNS resolution"}
],
"ebpf_programs": [
{
"name": "tcp_connect_db",
"type": "kprobe",
"target": "tcp_connect",
"duration": 20,
"filters": {"comm": "myapp"},
"description": "Monitor TCP connection attempts from application"
},
{
"name": "dns_queries",
"type": "tracepoint",
"target": "syscalls/sys_enter_connect",
"duration": 20,
"description": "Monitor network system calls for DNS resolution"
}
]
}
Agent Execution:
- Runs
ssandnslookupcommands - Starts eBPF programs to monitor TCP connections and DNS queries
- Collects real-time kernel events for 20 seconds
- Returns combined traditional + eBPF results to AI
AI Analysis:
{
"response_type": "resolution",
"root_cause": "Application is experiencing DNS resolution timeouts. eBPF traces show successful TCP connections to IP addresses but failed connections when using hostname.",
"resolution_plan": "1. Configure application to use IP address directly\n2. Fix DNS timeout: echo 'nameserver 8.8.8.8' >> /etc/resolv.conf\n3. Test connectivity: dig db.example.com",
"confidence": "High",
"ebpf_evidence": "eBPF tcp_connect traces show 15 successful connections to 10.0.1.50:5432 but 8 failed connection attempts during DNS lookups. DNS query monitoring revealed 3-5 second delays in resolution."
}
This integration provides your diagnostic agent with professional-grade system monitoring capabilities that were previously only available in dedicated observability tools!