vExpertAI
✓ What Works Well
Redis for A2A
LPUSH/BRPOP queue pattern for agent-to-agent messaging: simple, reliable, observable. Under 10 ms latency.
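A minimal sketch of the queue pattern with redis-py; the queue name, message fields, and connection details are illustrative, not the project's actual values.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
QUEUE = "a2a:diagnostics"  # one list per consumer agent (illustrative name)

# Producer agent: push a JSON-encoded task onto the head of the list.
task = {"trace_id": "abc123", "device_id": "r1", "command": "show ip interface brief"}
r.lpush(QUEUE, json.dumps(task))

# Consumer agent: BRPOP blocks until a task arrives; the timeout lets a
# worker loop wake up periodically for shutdown or health checks.
item = r.brpop(QUEUE, timeout=5)
if item is not None:
    _, raw = item
    print("got task:", json.loads(raw))
```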
Idempotency Cache
24-hour TTL prevents duplicate command execution. Key: hash of device_id + command + timestamp.
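A minimal sketch of the idempotency check, assuming a Redis-backed cache; the idem: key prefix and field separator are assumptions.

```python
import hashlib
import redis

r = redis.Redis(decode_responses=True)
TTL_SECONDS = 24 * 60 * 60  # 24-hour window

def already_executed(device_id: str, command: str, timestamp: str) -> bool:
    """Return True if this exact command was already dispatched in the window."""
    digest = hashlib.sha256(f"{device_id}|{command}|{timestamp}".encode()).hexdigest()
    # SET NX only creates the key if it does not exist; None means it was
    # already there, i.e. a duplicate request.
    created = r.set(f"idem:{digest}", "1", nx=True, ex=TTL_SECONDS)
    return created is None
```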
TextFSM Parsing
Converts raw Cisco CLI output to structured JSON. Templates from the ntc-templates repo.
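A small sketch using the parse_output helper from the ntc-templates package; the sample CLI output is trimmed for illustration.

```python
from ntc_templates.parse import parse_output

raw = (
    "Interface              IP-Address      OK? Method Status                Protocol\n"
    "GigabitEthernet0/0     10.0.0.1        YES NVRAM  up                    up\n"
    "GigabitEthernet0/1     unassigned      YES NVRAM  administratively down down\n"
)

# parse_output picks the matching TextFSM template based on platform + command
# and returns a list of dicts keyed by the template's field names.
records = parse_output(platform="cisco_ios", command="show ip interface brief", data=raw)
print(records)
```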
Graceful Fallback
HF endpoint fails → auto-switch to OpenAI. Zero-downtime demos.
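A minimal sketch of the provider switch; call_hf and call_openai stand in for thin wrappers around each provider's client and are assumptions, not the project's actual names.

```python
import logging

log = logging.getLogger("llm")

def generate(prompt: str, call_hf, call_openai) -> str:
    """Try the Hugging Face endpoint first; on any error, fall back to OpenAI."""
    try:
        return call_hf(prompt)
    except Exception as exc:  # timeouts, 503s while the endpoint scales up, auth errors
        log.warning("HF endpoint failed (%s); falling back to OpenAI", exc)
        return call_openai(prompt)
```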
⚠ Challenges & Solutions
Challenge: HF Cold Starts
Dedicated endpoints sleep after 15 min idle → 60-90 s for the first call
Solution: Warmup ping every 10min + OpenAI fallback
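A sketch of the keep-warm loop, assuming a TGI-style text-generation endpoint; the URL and token are placeholders, and the 10-minute interval mirrors the number above.

```python
import time
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HEADERS = {"Authorization": "Bearer <hf_token>"}                      # placeholder

def keep_warm(interval_s: int = 600) -> None:
    """Send a tiny generation request every 10 minutes so the endpoint never idles out."""
    while True:
        try:
            requests.post(
                ENDPOINT_URL,
                headers=HEADERS,
                json={"inputs": "ping", "parameters": {"max_new_tokens": 1}},
                timeout=120,  # generous: the first ping after a sleep may hit a cold start
            )
        except requests.RequestException:
            pass  # the OpenAI fallback covers calls made while the endpoint is still cold
        time.sleep(interval_s)
```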
Challenge: LLM Token Noise
Llama 3 emits raw <|eot_id|> tokens, and system messages leak through
Solution: Regex cleanup + structured prompts with examples
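A sketch of the post-processing step; the exact special tokens depend on the model's chat template, so treat the patterns below as assumptions.

```python
import re

# Llama 3 chat-template tokens that sometimes leak into plain-text output.
SPECIAL_TOKENS = re.compile(
    r"<\|eot_id\|>|<\|start_header_id\|>.*?<\|end_header_id\|>", re.DOTALL
)

def clean_response(text: str) -> str:
    text = SPECIAL_TOKENS.sub("", text)
    # Drop a leaked leading "system"/"assistant" role label if present.
    text = re.sub(r"^\s*(system|assistant)\s*:?\s*", "", text, flags=re.IGNORECASE)
    return text.strip()

print(clean_response("assistant\n\nInterface Gi0/1 is down.<|eot_id|>"))
# -> "Interface Gi0/1 is down."
```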
Challenge: SSH Tunnel Instability
Azure VM → Router connections drop randomly
Solution: Auto-reconnect with exponential backoff + health checks
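A sketch of the reconnect logic, assuming Netmiko for the SSH session and a local tunnel port; host details and retry limits are illustrative.

```python
import time
from netmiko import ConnectHandler

DEVICE = {
    "device_type": "cisco_ios",
    "host": "127.0.0.1",     # local end of the SSH tunnel from the Azure VM
    "port": 2222,            # forwarded tunnel port (placeholder)
    "username": "admin",
    "password": "<secret>",
}

def run_command(command: str, max_retries: int = 5) -> str:
    """Run a show command, reconnecting with exponential backoff on failure."""
    delay = 1
    for attempt in range(max_retries):
        try:
            with ConnectHandler(**DEVICE) as conn:
                return conn.send_command(command)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 60)  # 1 s, 2 s, 4 s, ... capped at 60 s
```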
Challenge: Duplicate Approvals
Same incident → multiple agents → N approval cards
Solution: Dedup by (device_id + agent_type), show latest only
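A sketch of the dedup rule: key pending cards by (device_id, agent_type) and keep only the newest one. Field names are assumptions.

```python
from typing import Dict, Tuple

pending: Dict[Tuple[str, str], dict] = {}

def upsert_approval(card: dict) -> None:
    """Newer card replaces the older one, so the operator sees a single,
    up-to-date approval per device/agent pair."""
    key = (card["device_id"], card["agent_type"])
    existing = pending.get(key)
    if existing is None or card["created_at"] > existing["created_at"]:
        pending[key] = card
```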
💡 Best Practices
1. Start Simple
Single agent + OpenAI first. Add specialists & fine-tuning later.
2. Log Everything
Trace IDs, timestamps, provider used. Essential for debugging; see the logging sketch after this list.
3. Test with Mocks
Mock routers and LLM responses. E2E tests without production risk; see the test sketch after this list.
4. Version Everything
LoRA adapters, prompts, schemas. Roll back on regression; see the config sketch after this list.
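For practice 2, a sketch of a structured log line carrying trace ID, timestamp, and provider; the logger name and fields are illustrative.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("vexpertai")

def log_llm_call(provider: str, prompt: str, trace_id=None) -> str:
    """Emit one structured line per LLM call; returns the trace ID to propagate."""
    trace_id = trace_id or uuid.uuid4().hex
    log.info(json.dumps({
        "trace_id": trace_id,
        "ts": time.time(),
        "provider": provider,      # which backend actually served the call ("hf" / "openai")
        "prompt_chars": len(prompt),
    }))
    return trace_id
```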
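For practice 3, a sketch of an E2E test with the router and the LLM both mocked via unittest.mock; all module paths and the diagnose_interface entry point are hypothetical stand-ins.

```python
from unittest.mock import patch

CANNED_CLI = "GigabitEthernet0/1 is down, line protocol is down"
CANNED_LLM = "Interface Gi0/1 appears administratively down; recommend 'no shutdown'."

# app.llm.generate, app.network.run_command, and app.workflows.diagnose_interface
# are hypothetical module paths for the LLM wrapper, router access, and workflow.
@patch("app.llm.generate", return_value=CANNED_LLM)
@patch("app.network.run_command", return_value=CANNED_CLI)
def test_diagnosis_flow(mock_router, mock_llm):
    from app.workflows import diagnose_interface
    result = diagnose_interface(device_id="r1", interface="Gi0/1")
    assert "no shutdown" in result["recommendation"]
```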
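For practice 4, a sketch of pinning adapter, prompt, and schema versions in one config so a regression rolls back with a one-line change; names and version strings are illustrative.

```python
# All runtime artifacts pinned in one place and kept in version control.
RUNTIME_CONFIG = {
    "lora_adapter": "netops-lora@v0.3.1",        # fine-tuned adapter weights
    "prompt_template": "diagnose_interface@v7",  # prompt text
    "output_schema": "diagnosis_result@v2",      # JSON schema the agent must emit
}
```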