DockerDash - Docker Metadata Context Injection Vulnerability

AI has quickly become embedded in the software supply chain, and Docker’s Ask Gordon assistant is a prime example of how that convenience can open a new attack surface. DockerDash, a vulnerability uncovered by Noma Labs ( Read Here ), shows how a single malicious Docker label can hijack Gordon’s reasoning flow, turning innocuous metadata into executable instructions routed through the Gordon AI → MCP Gateway → MCP Tools pipeline. Depending on whether Gordon runs in a CLI/cloud or desktop context, the same Meta‑Context Injection primitive can yield either full Remote Code Execution via Docker CLI or extensive data exfiltration using read‑only inspection tools.

What makes DockerDash so concerning is not just the specific bug, but the pattern it exposes: AI agents are now blindly trusting context—Docker labels, configs, tool outputs—as if it were safe, pre‑authorized instruction. This breaks traditional trust boundaries and lets attackers weaponize “informational” fields to drive tool calls, stop containers, and leak sensitive environment details, all behind a friendly chat interface. Docker’s mitigations in Desktop 4.50.0—blocking user‑supplied image URLs in Ask Gordon and forcing human confirmation before any MCP tool runs—are an important first step, but the research is a clear warning: AI‑driven pipelines now demand zero‑trust validation on every piece of context they consume.

Reference:

DockerDash: Two Attack Paths, One AI Supply Chain Crisis - https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/

Offensive AI Is Here – Anthropic red team & CyberExplorer benchmarking

The landscape of AI-driven security research is rapidly transforming as large language models begin to operate not just as assistants but as autonomous vulnerability researchers. Recent breakthroughs from Anthropic’s red teaming initiatives and the CyberExplorer benchmarking framework reveal how models like Claude Opus and Gemini are moving beyond static analysis or brute-force fuzzing—effectively reasoning through complex codebases, prioritizing attack surfaces, and uncovering previously unimaginable classes of software flaws. Together, these efforts signify a paradigm shift where LLMs are reshaping both offensive and defensive cyber operations at scale. 

Anthropic Red Teaming (Read here):

Claude Opus 4.6 markedly advances AI-driven vulnerability discovery by autonomously finding more than 500 high‑severity 0‑days in widely used open source projects, even in codebases that have already undergone extensive fuzzing and review. Unlike traditional fuzzers that bombard programs with random inputs, the model reasons like a human analyst: it reads code and commit histories, looks for partially fixed patterns, and targets historically risky APIs, enabling it to uncover subtle memory corruption bugs in projects like GhostScript, OpenSC, and CGIF that had eluded years of automated testing.

To mitigate the dual‑use risk of these capabilities, Anthropic validates all discovered vulnerabilities with human security researchers, coordinates patches with maintainers, and is working to automate safe patch development. In parallel, they are deploying activation‑level “probes” and updated enforcement pipelines to detect and intervene in cyber misuse of Claude in real time, acknowledging that as LLMs begin to outpace expert researchers in speed and scale, vulnerability disclosure norms and defensive workflows—such as 90‑day disclosure windows—will need to evolve to keep up.

CyberExplorer Benchmarking - LLM driven pentesting (Read here):

CyberExplorer introduces an open-environment benchmark and multi-agent framework to evaluate how LLM-based offensive security agents behave in realistic, multi-target attack settings rather than isolated CTF-style tasks. The benchmark runs 40 vulnerable web services concurrently inside a VM, forcing agents to autonomously perform reconnaissance, prioritize targets, and iteratively refine exploitation strategies without any prior knowledge of where vulnerabilities reside, while the framework logs rich behavioral and technical signals (service discovery, interaction dynamics, vulnerability hints) instead of only checking for flag capture.

The architecture chains short-lived, sandboxed agents with tools, supervisor and critic roles, and budget-aware self-reflection to explore discovered services in parallel and then synthesize evidence into structured vulnerability reports, including severity and affected components. Experiments across several leading closed and open models (e.g., Claude Opus 4.5, Gemini 3 Pro, DeepSeek V3, Qwen 3) show that success depends less on sheer interaction volume and more on early, stable decision-making, revealing characteristic failure modes such as uncertainty-driven agent escalation and persistent dead-end exploration, while also demonstrating that even failed exploits can still produce useful security insights broadly aligned with OWASP Top-10 vulnerability categories.

The evaluation shows a clear performance stratification: models like Claude Opus 4.5 and Gemini 3 Pro tend to discover more unique vulnerabilities with higher precision and lower instability, while models such as Qwen 3 and DeepSeek V3 often generate longer, more meandering attack trajectories that waste budget without proportionate gains. CyberExplorer further finds that successful offensive performance correlates with early, decisive scoping and consistent hypothesis management rather than raw token usage, and that even unsuccessful exploitation attempts still surface valuable security intelligence (e.g., service fingerprints, weak configurations, partial injection vectors) that can feed into human-centric or downstream defensive analysis.

"PenTestPrompt" 2.0 (Beta): Improvements in LLM Security Testing Tool

New prompt injection and model manipulation techniques are identified at a very fast pace, often resembling zero-day vulnerabilities in traditional penetration testing. At the same time, organizations are moving beyond single-provider deployments - adopting multiple models for different tasks, experimenting with local and open-weight models, and integrating LLMs across different architectural layers. In today’s world, whether it is testing approaches or tools everything quickly becomes outdated, thus making enhancements is essential to maintain the quality and appropriateness of tools. Below are some key changes in "PenTestPrompt" (details of the tool available in our first blog) after our first release: -

Broader Risk Coverage
In its initial version, the tool was focused primarily on generating prompts with various prompt injection attack techniques, enriched with application context. This approach was effective in identifying direct prompt injection issues, but it failed to explicitly factor in impact assessment or structured AI risk frameworks during prompt generation.

A major update in this release is the introduction of risk frameworks directly during the prompt generation process. Prompt creation is now additionally guided by categories and sub-categories including, but not limited to, security, privacy, safety, reliability, fairness, transparency, and data integrity. A niche list of categories and sub-categories is created based on OWASP LLM Top 10, NIST AI RMF, and practical assessment experience.

This allows prompts to be designed with clear risk intent, covering areas such as system prompt leakage, adversarial behavior, hallucinations and context relevance, sensitive and personal data exposure, bias and political influence, harmful or illegal content, unlicensed advice, data exfiltration, and unintended access. As a result, model’s behaviour can now also be assessed with risk and impact dimensions during evaluation.

Support for Combined Attack Techniques
The platform now supports combining two or more techniques to generate a single prompt. This change reflects on carefully crafting the prompts with multiple techniques as effective attack on model may not rely on single prompt technique.

Expanded Model Provider Support
The platform now supports locally hosted LLM, Anthropic, and Gemini models in addition to Open AI. As businesses move towards multi-model and hybrid AI architectures, this enables organization to apply AI security testing using various models for different deployment patterns. 

Refined Techniques and System Instructions
We have added new attack techniques and refined existing ones based on our experience from various assessments. System instructions have also been updated to improve the relevance, consistency, and depth of generated prompts. The focus here is not only volume, but quality—ensuring that generated prompts aid in proper coverage of all attack techniques as well as the pillars defined in risk assessment frameworks.

Improved Output and Analysis Format
Results are now downloadable in Excel format, enabling easier filtering, correlation, and visualization of findings to support better internal analysis, and reporting.
Click here to download the new version of "PenTestPrompt".


AI Agent Communication Protocols: Security Analysis of Non-MCP Standards

With every passing day, AI agents are becoming more and more self-sufficient. Gone are the days when there was a need for manual intervention just to allow agents collaborate in real-time. The latest standardized protocols have enabled the AI agents to not only seamlessly collaborate across systems but also across organizations. 

However, beyond the widely discussed MCP, multiple other protocols define how agents discover, authenticate, and interact with each other. With such a vast availability, it becomes very important for organizations to address every unique security challenge that these agents may pose. 

A2A (Agent-to-Agent Protocol): Decentralized Collaboration Risks
Let’s get started with the Agent-to-Agent protocol that is primarily used to enable direct collaboration between agents via “Cards”. These cards are HTTPS-served metadata documents that promote capabilities, skills, and endpoints. However, this architecture is prone to critical security vulnerabilities scattered across three dimensions.

Here’s a detailed overview of them all:
Agent Cards present a primary attack surface where malicious actors can spoof legitimate agents or embed prompt-injection payloads in metadata fields like descriptions and skills. It leads to LLM downstream manipulation behavior when this data is embedded into system prompts without filtering. 

Secondly, compromised agent servers can extract credentials and operational data, while attackers can infiltrate multiple roles such as tool executor, planner, and delegation helper. This can disrupt operations or redirect traffic to malicious endpoints, with man-in-the-middle and session hijacking attacks exploiting insufficiently protected identity verification.

And lastly, traditional authentication methods are not enough for autonomous agents that require rapid trust establishment. It mainly creates impersonation risks when behavioral patterns lack satisfactory verification, and weak identity protocols fail to provide robust yet dynamic trust mechanisms.

ACP (Agent Communication Protocol): Flexibility vs. Security Trade-offs
The Agent Communication Protocol, powered by IBM, utilizes a registry-based model with MIME-typed multipart messages for structured data exchange. It is further backed by JSON Web Signatures (JWS) for integrity. However, this architectural flexibility boasts significant security vulnerabilities. 

Often, predictable integrity failures are caused by optional JWS enforcement in and data exfiltration in non-strict configurations. This mainly occurs when the protocol remains exposed to reflective leakage when task generation depends on LLM reasoning.

While tokens are not fully exposed, it is advisable to use short-lived tokens. Enforcement is optional, which increases the chance for replay attacks. Extended sessions without JWS timestamps are also vulnerable to this issue. Further, loose compatibility with the legacy system creates several misconfigurations that allow the token to have a prolonged lifetime.

Also, when manifests are reused, registry-mediated routing can unintentionally collect metadata across tasks, creating secondary exposure channels that indicate internal system structure, API endpoints, and operational patterns to unauthorized agents. 

ANP (Agent Network Protocol): Decentralized Identity Challenges
The Agent Network Protocol has a three-layer architecture for agent identity authentication and dynamic negotiation. However, such a design presents serious security flaws. 

By utilizing W3C DID standards, this protocol creates difficulties in setting up trust, making revocation and compromise detection more complicated than similar actions in the case of a centralized authority.
Even though the ANP calls for minimum information disclosure and end-to-end encryption, it is challenging to implement these approaches across different operating agent networks. Care will also need to be taken to enforce the separation of human and agent authorization schemes at the right point in order to mitigate privilege escalation issues.

Moreover, enhancements to flexibility through the naturally-language negotiation and AI code generator components of the Meta-Protocol Layer create attack vectors whereby malicious agents can negotiate capabilities that initially appear harmless but contain hidden malicious functions through AI-native exploitations.

Conclusion
The architecture of A2A, ACP, and ANP protocols has flexibility, but the security analysis shows a conflict.

A2A’s Agent Cards do not have a crypto signature that prevents impersonation and prompt injection attacks. Optional components of the ACP create known states of failure. The collection of metadata in the registry leaks the internal structure. The complexity of ANP's decentralized identity creates reversal and negotiation risks.

To mitigate these threats, organizations must require checks at the transport layer and govern centrally while combining hybrid approaches.

Guardrails for AI Applications – Google Cloud

This blog series establishes a foundation for securing AI implementations across layers and platforms. It covers AI security controls and approaches for addressing AI vulnerabilities, highlighting remediation options and controls at the LLM through configurable guardrails across multiple layers. Azure Content Safety (previous blog entry) demonstrates how native safeguards can be configured in Azure based AI applications. Building on these concepts, this blog focuses on guardrails that can be leveraged within Google Cloud based AI applications. 

Google Cloud provides configurable guardrails and security options that allow teams to protect LLM and AI Agents against risks such as prompt injection, hallucinations, and adversarial inputs. It offers configurable content filters, policy enforcement controls, and monitoring features that work across the AI lifecycle. These services help organizations to apply consistent, privacy aware protections without building custom security mechanisms from scratch, supporting safer and more controlled AI deployments. 

Dedicated Safety and Security Services

Model Armor 
This security service checks the inputs (prompts) and outputs (responses) of language models to find and reduce risks like harmful content and data exposure before they reach applications. It uses adjustable filters, allowing organizations to customize protections.
The below diagram explains the flow and sanitization applied at various steps: - 

 

  

Image: https://docs.cloud.google.com/model-armor/overview

Model Armor flags content based on risk levels: High (high likelihood), Medium_and_above (medium or high likelihood), and Low_and_above (any likelihood). It features the responsible AI safety filter, which screens for hate speech, harassment, sexually explicit content, and dangerous material, while also protecting sensitive data and blocking harmful URLs to maintain trust and compliance in AI solutions.

Checks Guardrails API
Checks Guardrails are a runtime safety feature from Google that evaluates both input and output of AI models against predefined safety policies. These policies cover areas like Dangerous Content, Personally Identifiable Information (PII), Harassment, Hate Speech, and more. Each policy returns a score from 0 to 1, indicating the likelihood that the content fits the category, along with a result showing whether it passes or fails based on a set threshold.
 

 Image: https://developers.google.com/checks/guide/ai-safety/guardrails

These guardrails help ensure ethical and legal standards by identifying inappropriate or harmful content before it reaches users, enabling actions like logging, blocking, or reformulating outputs. The scoring provides insights into safety risks and supports trust and compliance in AI usage. The below image shows the supported policies provided by Check Guardrails API: -

Image: https://developers.google.com/checks/guide/ai-safety/guardrails

Vertex AI Safety Filters
Vertex AI Safety Filters are safeguards from Google Cloud's Vertex AI platform that screen prompts and output. They operate independently to evaluate content before reaching an application, reducing the risk of harmful responses. There are two types of safety scores: -

  • Based on probability of being unsafe
  • Based on severity of harmful content

The probability safety attribute reflects the likelihood that an input or model response is associated with the respective safety attribute. The severity safety attribute reflects the magnitude of how harmful an input or model response might be. When content exceeds safety thresholds, it gets blocked.

ShieldGemma
ShieldGemma is a collection of safety classifier models released by Google, designed to evaluate text and images for compliance with safety policies – it is like using LLMs to analyze LLM inputs and LLM generated outputs against defined policies. Built on the Gemma family, the models come in various sizes (ShieldGemma 1 - 2B, 9B, 27B parameters for text and ShieldGemma 2 – 4B parameters for images) and can be fine-tuned. 

These classifiers score content against categories such as sexually explicit material, hate speech, and harassment, providing clear labels on safety compliance. Their open weights allow for flexibility and integration into broader safety systems, and mitigating harmful outputs across different generative AI applications.

Conclusion
Google Cloud's AI safety systems aim to reduce harmful inputs and outputs by filtering content, but a right balance is required in the configurability to ensure that harmless information is not blocked. It meaningfully reduces risk for organizations compared to deploying an unprotected AI applications. Still, no tool on its own can provide complete safety. A well-rounded approach requires combining it with solid application design, clear organizational guidelines, tailored configurations, and continuous oversight.

References
https://docs.cloud.google.com/model-armor/overview
https://developers.google.com/checks/guide/ai-safety/guardrails
https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-filters
https://ai.google.dev/responsible/docs/safeguards/shieldgemma

Article by Hemil Shah and Rishita Sarabhai 

AI Vulnerabilities - MCP Git Serve & Copilot's Reprompt

The "Confused Deputy": Inside the Anthropic MCP Git Server Exploit

A critical security flaw has been uncovered in the official Model Context Protocol (MCP) Git server, exposing a dangerous intersection between agentic AI and supply chain security. Researchers identified three distinct vulnerabilities (CVE-2025-68143, CVE-2025-68144, CVE-2025-68145) in Anthropic’s reference implementation, mcp-server-git, which allows AI assistants like Claude Desktop or Cursor to interface with Git repositories. By chaining these flaws, an attacker can achieve full Remote Code Execution (RCE) on a developer's machine simply by asking the AI to summarize a malicious repository. This "Zero-Click" style attack highlights the fragility of current tool-use safeguards when facing indirect prompt injection.

The technical mechanics of this attack are a textbook example of the "confused deputy" problem. The attack relies on placing hidden instructions within a repository’s text files (such as a README.md or issue ticket). When the LLM ingests this context, it unknowingly follows the malicious instructions to trigger the vulnerable tools. Specifically, the exploit chains a path traversal flaw to bypass allowlists, an unrestricted git_init command to create repositories in arbitrary locations, and argument injection in git_diff to execute shell commands. Essentially, the AI is tricked into modifying its own environment—writing malicious Git configurations—under the guise of performing standard version control tasks.

This discovery serves as a stark warning for the rapidly growing ecosystem of AI agents and MCP architecture. While the vulnerabilities have been patched in the latest versions, they demonstrate that "human-in-the-loop" approval mechanisms can be bypassed if the agent itself is compromised before presenting a plan to the user. For developers and security engineers, this reinforces the need for strict sandboxing of MCP servers; granting an AI agent direct access to local system tools requires treating the agent's context window as an untrusted input vector, much like a traditional SQL injection point.

Reprompt: Understanding the Single-Click Vulnerability in Microsoft Copilot

Reprompt is a newly disclosed AI security vulnerability affecting Microsoft Copilot that researchers say enables single-click data theft: a user only needs to click a crafted Copilot link for the attack to start. Varonis Threat Labs reported the issue in January 2026, highlighting how it can silently pull sensitive information without requiring plugins or complicated user interaction.

What makes Reprompt notable is its use of Copilot’s q URL parameter to inject instructions directly into the assistant’s prompt flow. Researchers described a “double-request” technique—prompting the assistant to perform the same action twice—where the second run can bypass protections that block the first attempt. After that foothold, “chain-request” behavior can let an attacker continue steering the session through dynamic follow-up instructions from an attacker-controlled server, enabling stealthy, iterative data extraction even if the user closes the chat.

The risk is amplified because it can operate without add-ons, meaning it can succeed in environments where defenders assume “no plugin” equals “lower risk.” Reports noted the exposure was primarily tied to Copilot Personal, while Microsoft 365 Copilot enterprise customers were described as not affected. Microsoft has since patched the vulnerability as of mid-January 2026, but Reprompt is a useful reminder that LLM apps need URL/prompt hardening, stronger guardrails against multi-step bypass patterns, and careful controls on what authenticated assistants can access by default. 

Semantic Attacks on AI Security

The recent disclosure of the "Weaponized Invite" vulnerability in Google Gemini marks a critical pivot point for AI security, moving us from the era of syntactic jailbreaks to the far more dangerous realm of semantic payloads. Discovered by Miggo Security, this indirect prompt injection attack didn't rely on complex code or coercion; instead, it used polite, natural language instructions hidden within a Google Calendar invite to completely bypass enterprise-grade security filters. The flaw exposes a fundamental fragility in current Large Language Model (LLM) architectures: the inability to strictly separate the "data plane" (content to be processed) from the "control plane" (instructions to be executed), effectively allowing untrusted external data to hijack the agent’s decision-making loop.

The attack mechanism is deceptively simple yet devastatingly effective, functioning as a dormant "sleeper" agent inside a victim’s daily workflow. When a user interacts with Gemini—asking a routine question like "What is my schedule today?"—the model retrieves the poisoned calendar event via Retrieval-Augmented Generation (RAG). Because the model is conditioned to be helpful, it interprets the hidden instructions in the invite description not as text to be read, but as a command to be obeyed. The payload then directs the agent to quietly summarize private data from other meetings and exfiltrate it by creating a new calendar event with the stolen information in its description—all while presenting a benign front to the unsuspecting user.

For cybersecurity professionals, this incident serves as a stark warning that traditional signature-based detection and input sanitization are insufficient for protecting agentic AI systems. Because the malicious payload was semantically meaningful and syntactically benign, it successfully evaded Google’s specialized secondary defense models designed to catch attacks. As we integrate agents more deeply into sensitive ecosystems, defense strategies must evolve beyond simple filtering; we need strict architectural sandboxing that treats all retrieved context as untrusted, ensuring that an agent’s ability to read data never automatically grants it the authority to write based on that data’s instructions.