Dynamic Security Testing for Agentic Systems
By Miriam Horovicz // May 2026
A dynamic multi-agent security testing system for finding real risks in agentic tool workflows beyond what static red-team scripts can catch.
Dynamic Security Testing for Agentic Systems
By Miriam Horovicz // May 2026
Dynamic Security Testing for Agentic Systems
Agentic systems are becoming part of real production workflows. They do not just answer questions. They call tools, read external data, write records, trigger workflows, and make decisions across connected services.
That changes the security model.
Traditional application security testing is still important, but agentic systems add a new kind of attack surface: tool descriptions, model instructions, stored context, generated plans, multi-step reasoning, and the behavior of an agent after it reads untrusted data.
We built a dynamic security testing system to explore this new surface.
Why Static Scripts Are Not Enough Anymore
Hard-coded red-team scripts are useful when the target behavior is known in advance. A script can send a payload, check for a specific response, and repeat that pattern across many targets. For classic vulnerabilities, this can work well.
But it does not make sense to attack smart agents only with prewritten scripts.
The power is not equal.
A static script follows a fixed path. A smart agent reasons, chooses tools, interprets data, stores context, and may behave differently depending on what it sees. The risky behavior is often not visible in one request and one response. It may depend on the wording of a tool schema, whether a value is stored and later summarized, whether one tool can influence another, or whether the model treats returned data as information or as an instruction.
To test that kind of system, the attacker also needs to be dynamic.
This is the superpower of our approach: it uses a dynamic multi-agent system to evaluate other agentic systems. Instead of running a fixed checklist, the system discovers the actual tool surface, understands the parameters, chooses relevant tests, calls tools directly, studies the responses, and adapts. If something looks suspicious, it can try variants, follow a new path, or chain behavior across tools.
That makes the assessment closer to how a human red teamer works: observe, form a hypothesis, test it, refine, and validate. The system applies that loop systematically across the whole server.
How It Works
The assessment runs through four main phases.
First, it discovers the server. It connects to the MCP endpoint, enumerates tools, reads schemas, maps parameter types, and builds a profile of the attack surface.
Next, it attacks. The scanner agent uses a generated playbook of security knowledge covering injection, LLM safety, access control, data exposure, infrastructure resilience, data integrity, and business logic. The playbook gives guidance and payloads, but the agent decides what applies and how to execute it.
Then it validates. A separate validation phase challenges the initial findings with follow-up attempts. This step is intentionally skeptical. A finding should answer basic questions: who is affected, what boundary was crossed, did the server process the input or merely echo it, and can this behavior cause real harm?
Finally, it reports. The system turns validated findings into a clear security report with evidence, severity, remediation guidance, and compliance mapping.
Tests as Knowledge, Not Code
One of the core ideas is that tests should be knowledge, not rigid code.
The test library contains payloads, detection rules, and guidance. But the scanner still has to reason. It must decide which tools are relevant, craft inputs that match the schema, interpret responses, filter false positives, and adapt when it sees interesting behavior.
This matters because agentic vulnerabilities rarely fit one static pattern. A tool may be safe against direct prompt injection but vulnerable to fake transcript injection. A write tool may look harmless until a read tool later summarizes the stored value. A schema description may be metadata in one runtime and an instruction in another.
The system is built for those gray areas.
What We Found
The scans surfaced a few practical examples of why this matters.
In one case, tool responses reflected raw HTML/script-like input back to the caller. On a normal API this might look like a simple output-encoding bug. In an agentic workflow, the risk changes: MCP output may be rendered inside a browser-based assistant, dashboard, or developer tool with access to a privileged session. That turns "just reflected text" into a potential downstream XSS path.
In another case, the traditional web/API flow enforced an access check, but the newer MCP tool path did not enforce the same boundary consistently. The important lesson was not the specific product or endpoint. The lesson was architectural: when agent-facing tools are added beside existing APIs, they must inherit the same authorization, validation, and rate-limiting rules. Otherwise the MCP layer becomes the weaker door into the same data.
We also saw cases where scary-looking payloads were correctly rejected or merely echoed. Those were dismissed. The goal is not to produce dramatic findings; it is to separate real cross-boundary risk from noise.
Fighting False Positives
Security automation becomes much less useful when it floods teams with findings that cannot be reproduced or do not matter.
The system is deliberately skeptical. It distinguishes between reflection and execution, between harmless self-contained behavior and real workflow impact, and between public data exposure and sensitive data exposure.
The validation phase can dismiss findings. That is a feature. A shorter list of well-supported issues is more useful than a long list of guesses.
The Future Potential
As more systems expose tools to agents, security testing needs to understand both software behavior and model-mediated behavior. This cannot stay as a one-time checklist before launch. Agentic systems change as tools, prompts, permissions, models, and workflows change.
The future is continuous evaluation: dynamic agents testing dynamic agents, with enough structure to be repeatable and enough reasoning to adapt. These systems should be able to scan new tools, replay old findings, validate fixes, discover new chains, and explain the risk in language security and product teams can act on.
If agents are going to operate software, security testing needs agents too.