Why General-Purpose LLMs Fail in the SOC

Why Traditional SOAR Is Broken – and How the Agentic Mesh Fixes It

July 23, 2025

Why Generic LLMs Fail at Security Work
The Security-Native LLM Blueprint (What “Good” Looks Like)
What to Measure (So It’s Not Fluff)
Close Strong

In April 2025, researchers demonstrated a sobering attack against Google Gemini: a poisoned calendar invite injected hidden instructions into the model’s context. When Gemini synced with a connected smart-home system, it unlocked the front door—no clicks, no warnings.

At Black Hat, another team took it further. A single malicious Google Drive document, when accessed through a ChatGPT integration, silently exfiltrated authentication tokens to an attacker-controlled endpoint. The model was simply “helping” with a request unaware it was being manipulated.

These weren’t hypothetical red-team stunts. They were real-world, production-grade attacks exploiting a simple truth: when you connect a general-purpose LLM to sensitive tools and data, you expand your attack surface dramatically. If your SOC drops a generic model into email, repos, or calendars without hardening, you’re not just adopting AI—you’re arming your adversary.

Why Generic LLMs Fail at Security Work

Think like an attacker for a moment.

You’ve got a target that just wired a general-purpose LLM into their SOC: it can read mailboxes, query threat intel, maybe even trigger playbooks. You don’t need to break the model’s encryption — you just need to talk to it. You poison the inputs, nudge its reasoning, and watch as it faithfully carries out your instructions… right inside the organization’s most sensitive workflows.

This is the risk profile every SOC inherits the moment it deploys a drop-in LLM. And it’s why “working AI” isn’t the same as safe AI.

Generic LLMs are brilliant at sounding right. That’s their job. But in the SOC, “convincing” isn’t enough — it has to be correct, provable, and resilient under attack. Here’s why the drop-in models break under operational pressure.

They reward fluency over falsifiability

General-purpose LLMs are optimized to produce the most likely next word, not the most verifiable fact. In a 2025 study, ChatGPT generated convincing advisories for both real and fake CVE IDs. It did not flag a single fake as invalid. The problem isn’t intent—it’s architecture. The model is rewarded for sounding authoritative, not for proving its claims.

They produce insecure code

GitHub Copilot research by Gang Wang’s team found that ~40% of generated programs across 89 scenarios contained vulnerabilities—many in the CWE Top 25 category. This isn’t just about coding mistakes. In security automation, an LLM writing a detection rule or remediation script that’s subtly wrong is a gift-wrapped bypass for attackers.

RAG helps, but doesn’t cure

Retrieval-Augmented Generation can lower hallucination rates by pulling from curated sources, but even in high-stakes legal AI, Stanford researchers found persistent misgrounding. RAG is a control loop, not a cure. Without strict provenance checks, models can misinterpret retrieved facts or mix them with hallucinations.

They are exploitable as systems

The OWASP LLM Top 10 reads like a breach post-mortem for insecure SOC deployments: prompt injection, insecure output handling, training-data poisoning, and over-permissive tool use. Every one of these has been seen in the wild, and none are fixed by “better prompting.”

They lack principled uncertainty

State-of-the-art methods can detect a subset of hallucinations using entropy or uncertainty signals—but “be more truthful” fine-tuning is not enough. Without architectural mechanisms to catch low-confidence generations before they hit production tools, errors will slip through.

The Security-Native LLM Blueprint (What “Good” Looks Like)

A secure SOC LLM isn’t just better trained. It’s built from the ground up with security-first principles.

a) Domain Specialization

Training & tuning on CVE/NVD diffs, MITRE ATT&CK, CTI reports, incident post-mortems, Sigma/YARA rules, vendor APIs, and schemas.
Task-specific heads for triage, enrichment, hypothesis generation, playbook translation, and forensics reporting.

b) Grounding & Context Discipline

RAG with strong provenance: source only from authoritative datasets (e.g., NVD, internal KBs) and attach citations to every output.
Entity grounding: link alerts → assets → users → I/O artifacts so reasoning is over facts, not free text.

c) Tooling with Least Privilege

Allow-list tools and permissions (e.g., “enrich IP” = yes; “delete user” = no).
Output sanitization & policy checks before executing any action, mitigating LLM02 risks (insecure output handling).

d) Red-Team & Evals Built-In

Continuous AI red-teaming with PyRIT plus human adversaries to test jailbreak, injection, and misuse resilience.
Map controls to NIST AI RMF and UK NCSC secure AI lifecycle for verifiable compliance.

e) Uncertainty & Refusal

Entropy-gated actions: when confidence is low, the model must fetch more context, clarify, or refuse.

f) Governance

Auditability: every action is reproducible with logged inputs, retrieved data, tool calls, and policy decisions.
Supply chain security for prompts, tools, datasets, and model versions to prevent drift and hidden exploits.

What to Measure (So It’s Not Fluff)

Security-grade AI is measurable. Here’s what to track:

Hallucination rate on security tasks, fact-checked against NVD/ATT&CK.
Injection-resistance score (PyRIT + internal red team scenarios).
Safe tool-use rate (blocked dangerous actions ÷ total attempted).
Triage quality: precision/recall for enrichment & deduplication, plus time-to-first-action.
Provenance coverage: percentage of answers with validated citations.
Code security delta: CWE findings before/after adopting AI-generated detection/scripts.

Close Strong

A general-purpose LLM can sound right while being dangerously wrong — and in security, that’s not a bug; it’s a breach vector.

Verifiability, provenance, guardrails, and adversarial hardening aren’t optional extras; they’re the foundations. A security-native model isn’t just trained differently — it’s architected for containment, proof, and safe failure.

The industry doesn’t need AI that talks security.
It needs AI that thinks in security terms, acts within strict boundaries, and can prove every claim it makes.

Because in the SOC, an unverified answer isn’t insightful. It’s an incident waiting to happen.

By Features

By Usecase

Recent Blog