The landscape of AI-driven security research is rapidly transforming as large language models begin to operate not just as assistants but as autonomous vulnerability researchers. Recent work from Anthropic’s red teaming initiatives and the CyberExplorer benchmarking framework shows models like Claude Opus and Gemini moving beyond static analysis and brute-force fuzzing: they reason through complex codebases, prioritize attack surfaces, and uncover subtle flaws that eluded years of automated testing. Together, these efforts point to a shift in how both offensive and defensive cyber operations will be conducted at scale.
Anthropic Red Teaming (Read here):
Claude Opus 4.6 markedly advances AI-driven vulnerability discovery by autonomously finding more than 500 high‑severity 0‑days in widely used open source projects, including codebases that have already undergone extensive fuzzing and review. Unlike traditional fuzzers that bombard programs with random inputs, the model reasons like a human analyst: it reads code and commit histories, looks for partially fixed patterns, and targets historically risky APIs. This lets it uncover subtle memory corruption bugs in projects like Ghostscript, OpenSC, and CGIF that had eluded years of automated testing.
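To make the contrast with random-input fuzzing concrete, this kind of analyst-style triage can be sketched in a few dozen lines. The code below is purely illustrative and not Anthropic's pipeline: it mines a repository's commit log for security-flavored fix messages and ranks files that were repeatedly patched yet still call historically risky C APIs, a crude stand-in for the "partially fixed pattern" signal the model reasons about.

```python
import re
import subprocess
from collections import Counter

# Illustrative triage heuristic, NOT Anthropic's pipeline. Historically
# risky C APIs and security-fix keywords are hard-coded for brevity.
RISKY_APIS = re.compile(r"\b(memcpy|strcpy|strcat|sprintf|alloca|realloc)\s*\(")
FIX_HINTS = re.compile(r"overflow|use[- ]after[- ]free|oob|CVE-\d{4}-\d+|bounds", re.I)

def recent_fix_files(repo: str, n_commits: int = 500) -> Counter:
    """Count how often each file is touched by security-flavored fix commits."""
    log = subprocess.run(
        ["git", "-C", repo, "log", f"-{n_commits}", "--name-only", "--format=%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    hits: Counter = Counter()
    in_fix_commit = False
    for line in log.splitlines():
        if re.match(r"^[0-9a-f]{40}\b", line):      # commit header: hash + subject
            in_fix_commit = bool(FIX_HINTS.search(line))
        elif in_fix_commit and line.strip():        # file path touched by that fix
            hits[line.strip()] += 1
    return hits

def triage(repo: str) -> list[tuple[str, int]]:
    """Rank files that were repeatedly 'fixed' AND still call risky APIs."""
    ranked = []
    for path, n_fixes in recent_fix_files(repo).most_common():
        if not path.endswith((".c", ".h")):
            continue
        try:
            src = open(f"{repo}/{path}", encoding="utf-8", errors="ignore").read()
        except OSError:
            continue                                # file may since be deleted
        n_risky = len(RISKY_APIS.findall(src))
        if n_risky:
            ranked.append((path, n_fixes * n_risky))  # crude joint score
    return sorted(ranked, key=lambda t: -t[1])
```

Run against a checkout of a project like Ghostscript, a heuristic like this merely surfaces candidate files for deeper review; the model's advantage is that it performs the same prioritization over actual code semantics rather than regex matches.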
To mitigate the dual‑use risk of these capabilities, Anthropic validates every discovered vulnerability with human security researchers, coordinates patches with maintainers, and is working to automate safe patch development. In parallel, it is deploying activation‑level “probes” and updated enforcement pipelines to detect and intervene in cyber misuse of Claude in real time. Anthropic also acknowledges that as LLMs begin to outpace expert researchers in speed and scale, vulnerability disclosure norms and defensive workflows, such as 90‑day disclosure windows, will need to evolve to keep up.
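Activation-level probes of this kind are typically small classifiers trained on a model's hidden states. Anthropic's production probes are not public, so the sketch below is a generic illustration: a linear probe over placeholder residual-stream activations, with stand-in features and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic linear probe on hidden activations. Anthropic's real probes,
# their layer choice, and their training data are not public; the
# `activations` and `labels` arrays here are random stand-ins for
# per-prompt residual-stream features and misuse annotations.
rng = np.random.default_rng(0)
d_model = 512
activations = rng.normal(size=(2000, d_model))   # stand-in features
labels = rng.integers(0, 2, size=2000)           # 1 = misuse, 0 = benign

probe = LogisticRegression(max_iter=1000, C=0.5)
probe.fit(activations, labels)

def misuse_score(act: np.ndarray) -> float:
    """Probability the probe assigns to the 'misuse' class for one prompt."""
    return float(probe.predict_proba(act.reshape(1, -1))[0, 1])

# In a serving pipeline, a score above a tuned threshold could trigger
# human review or an automated intervention in real time.
if misuse_score(activations[0]) > 0.9:
    print("flag for enforcement review")
```

The appeal of this design is that the probe reads internal state rather than surface text, so it can fire even when a misuse attempt is phrased innocuously.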
CyberExplorer Benchmarking - LLM-driven pentesting (Read here):
CyberExplorer introduces an open-environment benchmark and multi-agent framework to evaluate how LLM-based offensive security agents behave in realistic, multi-target attack settings rather than isolated CTF-style tasks. The benchmark runs 40 vulnerable web services concurrently inside a VM, forcing agents to autonomously perform reconnaissance, prioritize targets, and iteratively refine exploitation strategies without any prior knowledge of where vulnerabilities reside. Rather than only checking for flag capture, the framework logs rich behavioral and technical signals: service discovery, interaction dynamics, and vulnerability hints.
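That behavioral logging can be pictured as a small event schema plus a scoring pass. The sketch below is our reading of the setup rather than CyberExplorer's actual code; the `Event` fields and the hint-weighted prioritizer are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical behavioral log records; field names are illustrative,
# not CyberExplorer's actual schema.
@dataclass
class Event:
    service: str       # e.g. "http://10.0.0.5:8081"
    kind: str          # "discovery" | "interaction" | "vuln_hint"
    detail: str = ""
    step: int = 0      # agent step at which the event was logged

def prioritize(events: list[Event]) -> list[tuple[str, float]]:
    """Rank services by a hint-weighted recon score. Interaction volume
    is deliberately weighted low, mirroring the paper's finding that
    volume alone predicts little."""
    weights = {"vuln_hint": 5.0, "discovery": 2.0, "interaction": 0.2}
    scores: dict[str, float] = defaultdict(float)
    for ev in events:
        scores[ev.service] += weights.get(ev.kind, 0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])

log = [
    Event("svc-08", "discovery", "Werkzeug banner", step=3),
    Event("svc-08", "vuln_hint", "debug console exposed", step=9),
    Event("svc-21", "interaction", "GET /", step=4),
]
print(prioritize(log))   # svc-08 ranks first on the hint-weighted score
```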
The architecture chains short-lived, sandboxed agents with tools, supervisor and critic roles, and budget-aware self-reflection to explore discovered services in parallel, then synthesizes the evidence into structured vulnerability reports that include severity and affected components. Experiments across several leading closed and open models (e.g., Claude Opus 4.5, Gemini 3 Pro, DeepSeek V3, Qwen 3) show that success depends less on sheer interaction volume and more on early, stable decision-making. The evaluation reveals characteristic failure modes such as uncertainty-driven agent escalation and persistent dead-end exploration, while also demonstrating that even failed exploits can produce useful security insights broadly aligned with OWASP Top-10 vulnerability categories.
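A budget-aware supervisor/critic loop of the kind described might look like the skeleton below. Every interface in it (`spawn_agent`, `critique`, the report fields) is an assumed stand-in, not the paper's API; the point is how a global step budget gates both exploration and self-reflective retries.

```python
import random

# Skeleton of a budget-aware supervisor/critic loop. Every interface
# here (spawn_agent, critique, report fields) is a hypothetical
# stand-in, not the paper's actual API.
BUDGET = 100  # total agent steps allowed across all 40 targets

def spawn_agent(service: str, steps: int) -> dict:
    """Stand-in for one short-lived, sandboxed agent exploring a service."""
    found = random.random() < 0.3                 # placeholder outcome
    return {"service": service, "evidence": ["banner", "error page"],
            "vuln": "SQL injection" if found else None, "steps_used": steps}

def critique(result: dict) -> bool:
    """Critic role: retry only if the lead still looks promising."""
    return result["vuln"] is None and len(result["evidence"]) > 1

def supervise(services: list[str]) -> list[dict]:
    reports, budget = [], BUDGET
    for svc in services:
        steps = min(10, budget)
        if steps == 0:
            break                                 # budget exhausted: stop
        result = spawn_agent(svc, steps)
        budget -= result["steps_used"]
        if not result["vuln"] and critique(result) and budget >= 5:
            budget -= 5                           # self-reflection: one retry
            result = spawn_agent(svc, 5)
        if result["vuln"]:                        # synthesize a structured report
            reports.append({"service": svc, "finding": result["vuln"],
                            "severity": "high", "components": ["login form"]})
    return reports

print(supervise([f"svc-{i:02d}" for i in range(40)]))
```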
The evaluation shows a clear performance stratification: models like Claude Opus 4.5 and Gemini 3 Pro tend to discover more unique vulnerabilities with higher precision and lower instability, while models such as Qwen 3 and DeepSeek V3 often generate longer, more meandering attack trajectories that spend budget without proportionate gains. CyberExplorer further finds that successful offensive performance correlates with early, decisive scoping and consistent hypothesis management rather than raw token usage. Even unsuccessful exploitation attempts still surface valuable security intelligence (e.g., service fingerprints, weak configurations, partial injection vectors) that can feed into human-led triage or downstream defensive analysis.
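One way to make "instability" concrete is to measure how often an agent switches targets as it acts. The metric below is our own illustrative definition, not one taken from the paper, but it captures the contrast between decisive scoping and target thrashing.

```python
def instability(trajectory: list[str]) -> float:
    """Fraction of steps on which the agent switches target service.
    An illustrative metric (not the paper's): high values indicate the
    meandering trajectories seen in weaker models, low values the early,
    decisive scoping the evaluation associates with success."""
    if len(trajectory) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return switches / (len(trajectory) - 1)

focused = ["svc-08"] * 8 + ["svc-21"] * 4            # decisive scoping
meander = ["svc-08", "svc-21", "svc-03", "svc-08",   # target thrashing
           "svc-21", "svc-03", "svc-08", "svc-21"]
print(instability(focused), instability(meander))    # ~0.09 vs 1.0
```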