How an AI Agent Killed Our Alert Storms

It started with a calendar invite: “Google ADK Hackathon - All Night.”

The Google team was joining live from our Denver office - big leaders, on a call, watching. From Hyderabad, that meant one thing: we were doing this through the night.

By 9 PM our office was a different place. The usual hum of meetings and standups was gone. It was just our team of four, laptops open, empty coffee cups slowly accumulating, and a Google Meet window in the corner showing Denver. That particular mix of exhaustion and adrenaline that only a night hackathon produces - you either know it or you don’t.

The stakes felt real. This wasn’t an internal demo for a small team. Senior leaders were watching from across the world, and we had until morning to show them something worth staying up for.

Our team huddled up and asked the question every hackathon team asks: what problem do we actually care about solving? We didn’t have to think long.

Anyone who’s worked in an office with network infrastructure knows the pain. Monitoring tools like ManageEngine, Statseeker, and SD-WAN vManage each generate their own stream of alerts - high latency here, packet loss there, an interface going down at 2 AM. And the tricky part? A single real incident can trigger alerts across all three tools at once, leaving an engineer jumping between dashboards to piece together what’s actually happening. That triage is repetitive, mentally taxing, and it happens dozens of times a day.
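To make the cross-tool overlap concrete, here is a minimal Python sketch of the grouping an engineer does by hand: alerts from different tools that hit the same device within a short window are treated as one candidate incident. The `Alert` record, its field names, the `edge-router-1` device, and the 10-minute window are all assumptions for illustration, not the format any of these tools actually emits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical normalized record; real tool payloads will differ.
    source: str   # e.g. "ManageEngine", "Statseeker", "vManage"
    device: str
    kind: str
    at: datetime


def group_incidents(alerts, window=timedelta(minutes=10)):
    """Group alerts on the same device that arrive within `window` of the previous one."""
    incidents = []
    latest = {}  # device -> (open incident list, time of its most recent alert)
    for alert in sorted(alerts, key=lambda a: a.at):
        entry = latest.get(alert.device)
        if entry and alert.at - entry[1] <= window:
            entry[0].append(alert)            # same burst: extend the open incident
            latest[alert.device] = (entry[0], alert.at)
        else:
            incident = [alert]                # quiet gap or new device: new incident
            incidents.append(incident)
            latest[alert.device] = (incident, alert.at)
    return incidents


alerts = [
    Alert("ManageEngine", "edge-router-1", "interface_down", datetime(2026, 5, 24, 2, 0)),
    Alert("Statseeker", "edge-router-1", "packet_loss", datetime(2026, 5, 24, 2, 3)),
    Alert("vManage", "edge-router-1", "high_latency", datetime(2026, 5, 24, 2, 5)),
    Alert("ManageEngine", "core-switch-floor3", "high_latency", datetime(2026, 5, 24, 8, 12)),
]
incidents = group_incidents(alerts)
print(len(incidents))  # 2 candidate incidents instead of 4 separate alerts
```

A fixed window like this is crude. It says nothing about why the alerts are related, which is the part we wanted an agent to reason about.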

We thought: what if an AI agent could do the first pass?

What We Built

By the end of the hackathon, we had a working agent - let’s call it NetSense - that could:

That last point is what made it genuinely useful. Instead of an engineer staring at 47 alerts and deciding where to start, NetSense would hand them a prioritized briefing - and tell them what resolved the same issue six months ago.

Think of it as a first-responder that never sleeps, never gets alert fatigue, and has a perfect memory of everything that’s ever gone wrong on your network.

Why BigQuery?

We loaded 6 months of historical ManageEngine alert data into BigQuery and gave the agent a tool to query it. This was a game-changer.

Without history, the agent could only describe what’s happening right now. With history, it could answer questions like:

Enter Google ADK

Google’s Agent Development Kit (ADK) is a framework for building multi-step AI agents powered by Gemini. What makes it interesting is that it’s not just a wrapper around an LLM - it gives you a structured way to define:

Here’s a simplified version of how we set up a tool for the agent:

from google.adk.agents import Agent
from google.adk.tools import FunctionTool
from google.cloud import bigquery

def get_alert_history(device_id: str, hours: int = 24) -> dict:
    """Fetch recent alerts for a specific network device."""
    # In production, this calls ManageEngine API
    return {
        "device_id": device_id,
        "alerts": [
            {"time": "2026-05-24T08:12:00", "type": "high_latency", "value": "320ms"},
            {"time": "2026-05-24T09:45:00", "type": "packet_loss", "value": "4.2%"},
        ]
    }

def get_device_info(device_id: str) -> dict:
    """Get metadata about a network device."""
    return {
        "device_id": device_id,
        "name": "core-switch-floor3",
        "location": "Building A, Floor 3",
        "criticality": "high"
    }

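def query_historical_alerts(device_id: str, alert_type: str) -> dict:
    """Query 6 months of historical alert data from BigQuery to find patterns and past resolutions."""
    client = bigquery.Client()
    # Use query parameters rather than string interpolation: the agent supplies
    # these values, so they must never be spliced into the SQL text.
    query = """
        SELECT
            alert_type,
            occurred_at,
            resolution,
            resolved_by,
            time_to_resolve_mins
        FROM `your_project.network_ops.alerts_history`
        WHERE device_id = @device_id
          AND alert_type = @alert_type
          AND occurred_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 180 DAY)
        ORDER BY occurred_at DESC
        LIMIT 10
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("device_id", "STRING", device_id),
            bigquery.ScalarQueryParameter("alert_type", "STRING", alert_type),
        ]
    )
    results = client.query(query, job_config=job_config).result()
    incidents = [dict(row) for row in results]
    return {
        "historical_incidents": incidents,
        # Report the actual count so the agent can tell "no history" from "recurring issue"
        "message": f"Found {len(incidents)} past incidents for {device_id}",
    }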

# Register all three tools and create the agent
alert_tool = FunctionTool(get_alert_history)
device_tool = FunctionTool(get_device_info)
history_tool = FunctionTool(query_historical_alerts)

agent = Agent(
    name="netsense",  # ADK agents need a name; it must be a valid identifier
    model="gemini-2.0-flash",
    tools=[alert_tool, device_tool, history_tool],
    # ADK takes the system prompt via the `instruction` parameter
    instruction="""You are a network operations analyst with access to
    real-time alerts and 6 months of historical incident data.
    When given an alert, investigate it using available tools, check if this
    is a recurring issue, identify the root cause, assess urgency, and suggest
    solutions based on what has worked in the past."""
)

What the agent then does - on its own - is decide which tools to call, in what order, based on what it learns along the way. You don’t hardcode the logic. The agent reasons through it.
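The loop behind that behavior can be sketched without ADK at all. In the toy Python below, a scripted `fake_model` function stands in for Gemini and the tool functions are stubs; none of this is ADK's actual implementation or API. It only illustrates the general function-calling pattern: the model proposes a tool call, the runtime executes it, and the result is fed back until the model produces a final answer.

```python
def get_device_info(device_id):
    # Stub tool returning canned metadata, for illustration only.
    return {"device_id": device_id, "criticality": "high"}


def get_alert_history(device_id):
    # Stub tool returning canned alerts, for illustration only.
    return {"device_id": device_id, "alerts": ["high_latency", "packet_loss"]}


TOOLS = {"get_device_info": get_device_info, "get_alert_history": get_alert_history}


def fake_model(observations):
    """Scripted stand-in for the LLM: chooses the next step from what it has seen so far."""
    seen = {name for name, _ in observations}
    if "get_alert_history" not in seen:
        return {"tool": "get_alert_history", "args": {"device_id": "core-switch-floor3"}}
    if "get_device_info" not in seen:
        return {"tool": "get_device_info", "args": {"device_id": "core-switch-floor3"}}
    return {"final": f"Investigated using {len(observations)} tool calls"}


def run_agent(model, tools, max_steps=5):
    observations = []  # (tool name, result) pairs fed back to the model each turn
    for _ in range(max_steps):
        decision = model(observations)
        if "final" in decision:
            return decision["final"], observations
        result = tools[decision["tool"]](**decision["args"])
        observations.append((decision["tool"], result))
    return "Stopped: step limit reached", observations


answer, trace = run_agent(fake_model, TOOLS)
print(answer)
print([name for name, _ in trace])
```

With a real model, the scripted `if` chain is replaced by the model's own choice, based on each tool's name, docstring, and type hints. That is why descriptive docstrings on the tool functions matter.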

The Moment It Clicked

About four hours into the hackathon, we fed the agent a batch of real-looking test alerts we’d mocked up - things like:

[ALERT] core-switch-floor3 | Latency > 300ms | 08:12 AM
[ALERT] core-switch-floor3 | Packet loss 4.2% | 09:45 AM
[ALERT] ap-conf-room-b | Client disconnections | 09:50 AM
[ALERT] ap-hallway-2 | Client disconnections | 09:52 AM

Without any guidance, the agent:

  1. Pulled the alert history for core-switch-floor3
  2. Noticed the latency issue had started before the packet loss
  3. Recognized that the access point disconnections happened downstream of the same switch
  4. Concluded that the switch was the likely root cause, not four separate problems

Its output looked like this:

Summary: High-priority incident on core-switch-floor3 (Building A, Floor 3). Degraded performance beginning ~8:12 AM likely cascaded into packet loss and downstream client disconnections on two access points. This appears to be a single root cause, not isolated events.

Historical context: This device has had 3 similar latency spikes in the past 6 months — all on Monday mornings, all resolved by restarting the uplink interface. Last occurrence: March 11, 2026. Resolved in 22 minutes by the network team.

Urgency: High. Affects conference room connectivity during business hours.

Suggested fix: Restart uplink interface on core-switch-floor3 (worked in all 3 prior incidents). If issue persists beyond 15 minutes, escalate to network team lead and check for firmware updates flagged in the February maintenance log.

The room got quiet for a second. That’s the kind of analysis that usually takes a network engineer 15-20 minutes to piece together. The agent did it in seconds.
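The correlation the agent performed can be approximated by hand, which is a useful way to sanity-check its output. The Python sketch below parses alert lines in the format of the test batch and attributes each alert to an upstream device when that device alerted first. The `UPSTREAM` topology map is hypothetical (this post does not cover how the agent learned which access points sit behind the switch), and the "upstream alerted earlier" rule is a deliberate simplification, not the agent's actual reasoning.

```python
import re
from collections import defaultdict
from datetime import datetime

ALERT_RE = re.compile(
    r"\[ALERT\]\s*(?P<device>\S+)\s*\|\s*(?P<msg>.+?)\s*\|\s*(?P<time>\d{2}:\d{2} [AP]M)"
)

# Hypothetical topology: which device each access point uplinks to.
UPSTREAM = {
    "ap-conf-room-b": "core-switch-floor3",
    "ap-hallway-2": "core-switch-floor3",
}


def parse_alert(line):
    m = ALERT_RE.match(line.strip())
    if m is None:
        raise ValueError(f"Unrecognized alert line: {line!r}")
    at = datetime.strptime(m.group("time"), "%I:%M %p").time()
    return {"device": m.group("device"), "msg": m.group("msg"), "time": at}


def correlate(alerts):
    """Attribute each alert to its upstream device if that device alerted first."""
    first_seen = {}
    for a in alerts:
        prev = first_seen.get(a["device"], a["time"])
        first_seen[a["device"]] = min(prev, a["time"])
    groups = defaultdict(list)
    for a in alerts:
        parent = UPSTREAM.get(a["device"])
        if parent in first_seen and first_seen[parent] <= a["time"]:
            groups[parent].append(a)       # likely a downstream symptom
        else:
            groups[a["device"]].append(a)  # treat as its own problem
    return dict(groups)


raw = """\
[ALERT] core-switch-floor3 | Latency > 300ms | 08:12 AM
[ALERT] core-switch-floor3 | Packet loss 4.2% | 09:45 AM
[ALERT] ap-conf-room-b | Client disconnections | 09:50 AM
[ALERT] ap-hallway-2 | Client disconnections | 09:52 AM
"""
groups = correlate([parse_alert(line) for line in raw.splitlines()])
for root, items in groups.items():
    print(root, len(items))
```

On this batch, all four alerts land under `core-switch-floor3`, which matches the agent's single-root-cause conclusion. The sketch has none of the historical context, though; that came from the BigQuery tool.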

What Surprised Us

What We’d Build Next

We barely scratched the surface in a single hackathon day. The natural next steps would be:

The Bigger Picture

What struck me most about this hackathon wasn’t the technology - it was how quickly a small team could build something genuinely useful with Google ADK and Gemini.

We’re at a point where the gap between “interesting demo” and “solves a real problem” is narrowing fast. Network alert triage is just one example. The same pattern - agent ingests noisy data, uses tools to gather context, reasons about root cause, surfaces a clear recommendation - applies to security operations, customer support, IT helpdesk, financial monitoring, and more.

The agents aren’t replacing the experts. They’re doing the exhausting first pass so the experts can focus on the decisions that actually need human judgment.

That feels like the right division of labor to me.