ProofStrike — Proof-Driven Web and Network Pentesting

Validation

Capability first. Benchmarks with proof boundaries.

We use replay artifacts, proof linting, vulnerable-app scorecards, and Vulhub-style network manifests to measure whether the agent finds real issues without drifting into target-specific shortcuts.

Current Vulhub corpus discovered: 328

ProofStrike now has a network benchmark path that can discover Vulhub compose targets, import the Pentest-Tools CSV export, and score retained network proof artifacts.

CVE-labeled: 248
PT 2024 cases: 167
Remote cases: 128

Artifact: XBOW validation suite (full 104-case XBEN suite, built-in orchestrator, proof retained)
Result: 104/104
Status: 0 false wins

Artifact: WebGoat, OWASP Top 10 web (Claude Code harness, 10-specialist multi-agent fan-out)
Result: 18/19
Status: 0 FP · tool-verified

Artifact: Network benchmark harness (pentest-agent benchmark-network: run network-mode scans + retained-evidence scoring)
Result: CLI-WIRED
Status: 10 tests

Artifact: Current Vulhub discovery (temporary clone enumerated local compose environments)
Result: 328
Status: 248 CVEs

Artifact: Pentest-Tools network benchmark (public methodology: 167 Vulhub cases, 128 remotely detectable)
Result: 128
Status: remote

Artifact: DVWA web-app replay (still retained as web workflow regression signal)
Result: 14/14
Status: reachable TP
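
Results such as 18/19 with 0 FP can be read as recall and precision. The sketch below shows one way to compute both from a scorecard row. It assumes that 18/19 means 18 proven cases out of 19 expected ones; the Scorecard class and its field names are hypothetical and are not ProofStrike's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical scorecard row; field names are illustrative only."""
    name: str
    expected: int         # cases in the benchmark
    proven: int           # cases found with retained proof (true positives)
    false_positives: int  # reported findings that matched no expected case

    @property
    def recall(self) -> float:
        # Share of expected cases that were proven.
        return self.proven / self.expected if self.expected else 0.0

    @property
    def precision(self) -> float:
        # Share of reported findings that were real.
        reported = self.proven + self.false_positives
        return self.proven / reported if reported else 0.0


webgoat = Scorecard("WebGoat", expected=19, proven=18, false_positives=0)
print(f"{webgoat.name}: recall={webgoat.recall:.3f} precision={webgoat.precision:.3f}")
```

Tracking both numbers keeps a high pass count from hiding false wins, which is the distinction the Status values draw.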

How It Works

Stateful workflows. Evidence gates.

A controlled pipeline maps the target, reasons about reachable workflows, probes for real vulnerabilities, and refuses to report anything it cannot prove.

Reconnaissance

Scope validation, technology fingerprinting, service inventory, and optional source or API review establish the testing boundary.

Surface Mapping

Crawl web routes and retain network hosts, ports, services, CPEs, TLS posture, scanner imports, and API specs as structured facts.

Workflow Planning

Build session, role, object, state-transition, and network probe plans so checks execute against realistic, policy-approved paths.

Attack Execution

Deterministic executors run targeted probes. LLM reasoning ranks web and network checks without bypassing scope, safety, or proof gates.

Verification

Each candidate finding is checked by replay, mutation, baseline diff, state readback, negative controls, and side-effect detection. Findings without proof are discarded.
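
A minimal sketch of such a gate, covering only three of the checks (replay, baseline diff, and negative control). Candidate, verify, is_reportable, and the fake target are hypothetical names used for illustration; a real probe would send scoped HTTP or network requests instead of calling a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A probe takes a payload and returns the observed response body.
Probe = Callable[[str], str]


@dataclass
class Candidate:
    payload: str         # input believed to trigger the issue
    benign_payload: str  # harmless input used as baseline and negative control
    marker: str          # evidence expected in the response only if the issue is real


def verify(candidate: Candidate, probe: Probe) -> Dict[str, bool]:
    """Run the illustrative proof checks and return one boolean per check."""
    first = probe(candidate.payload)
    second = probe(candidate.payload)          # replay of the same request
    baseline = probe(candidate.benign_payload)
    return {
        "replay": candidate.marker in first and candidate.marker in second,
        "baseline_diff": first != baseline,
        "negative_control": candidate.marker not in baseline,
    }


def is_reportable(checks: Dict[str, bool]) -> bool:
    # A finding is kept only when every check passed.
    return all(checks.values())


def fake_vulnerable_target(payload: str) -> str:
    # Stand-in for a real request: shows command output only for the attack payload.
    return "uid=33(www-data)" if payload == "; id" else "ok"


candidate = Candidate(payload="; id", benign_payload="127.0.0.1", marker="uid=")
checks = verify(candidate, fake_vulnerable_target)
print(checks, is_reportable(checks))
```

A target that returns the marker for every input fails the baseline diff and the negative control, so it is discarded instead of reported.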

Reporting

HTML, Markdown, JSON, SARIF, network operator summaries, scorecards, and comparison matrices keep proof and scanner-alert classes separate.

Use Cases

Where ProofStrike fits

Use ProofStrike when a security workflow needs repeatable execution, scoped probing, and evidence strong enough for engineering teams to act on.

Authorized Web App Pentests

Run scoped OWASP testing against applications you own, with request budgets, audit logs, and strict proof gates.

Injection, access control, auth, SSRF, XSS, and file workflows
Replayable evidence for every reportable issue

Authenticated Workflow Testing

Exercise real application paths after login instead of stopping at public pages or isolated endpoints.

Session, role, object, and state-transition planning
Provided accounts, recorded flows, and multi-role checks

API and CI Regression

Turn security behavior into repeatable checks for releases, pull requests, and remediation verification.

OpenAPI-aware route coverage and SARIF output
JSON artifacts for automation and score tracking
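
For CI use, proven findings can be serialized as a SARIF 2.1.0 log. The sketch below builds a minimal log with the standard top-level fields (version, runs, tool.driver, results). It is a generic illustration, not ProofStrike's actual output; the to_sarif function and the finding keys (rule_id, message, uri) are assumptions.

```python
import json
from typing import Dict, List


def to_sarif(tool_name: str, findings: List[Dict[str, str]]) -> Dict:
    """Build a minimal SARIF 2.1.0 log.

    Each finding dict uses hypothetical keys: rule_id, message, uri.
    """
    rule_ids = sorted({f["rule_id"] for f in findings})
    results = [
        {
            "ruleId": f["rule_id"],
            "level": "error",
            "message": {"text": f["message"]},
            "locations": [
                {"physicalLocation": {"artifactLocation": {"uri": f["uri"]}}}
            ],
        }
        for f in findings
    ]
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {
                    "driver": {
                        "name": tool_name,
                        "rules": [{"id": rule_id} for rule_id in rule_ids],
                    }
                },
                "results": results,
            }
        ],
    }


sarif_log = to_sarif(
    "example-scanner",
    [{"rule_id": "sqli", "message": "SQL injection proven by replay", "uri": "/login"}],
)
print(json.dumps(sarif_log, indent=2))
```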

Network Exposure Review

Inventory reachable services, TLS posture, CVE candidates, and validation boundaries for approved network scopes.

Nmap, Nuclei, OpenVAS, TLS, and CVE correlation
Operator summaries that separate proof from alerts

Scanner Alert Triage

Import scanner output, normalize candidates, and verify which alerts are actually exploitable in context.

Nuclei, ZAP, Burp, and OpenVAS-style imports
Scanner alerts remain non-reportable until proven
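
One way to model the "non-reportable until proven" rule is to treat an alert as reportable only once a proof reference is attached. The sketch below is an assumption about how that could look: Alert, triage, and the sample values are hypothetical, and real scanner exports would first need to be mapped into this shape.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Alert:
    source: str                      # scanner that raised the alert
    target: str                      # host, port, or URL the alert refers to
    rule_id: str                     # normalized check identifier
    proof_ref: Optional[str] = None  # ID or path of retained proof, if any

    @property
    def reportable(self) -> bool:
        # An alert without retained proof is never reportable.
        return self.proof_ref is not None


def triage(alerts: Iterable[Alert]) -> Tuple[List[Alert], List[Alert]]:
    """Deduplicate by (target, rule_id), then split into proven and unproven."""
    merged: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.target, alert.rule_id)
        kept = merged.get(key)
        # Prefer the copy that carries proof when duplicates disagree.
        if kept is None or (kept.proof_ref is None and alert.proof_ref is not None):
            merged[key] = alert
    proven = [a for a in merged.values() if a.reportable]
    unproven = [a for a in merged.values() if not a.reportable]
    return proven, unproven


alerts = [
    Alert("nuclei", "10.0.0.5:443", "weak-tls"),
    Alert("openvas", "10.0.0.5:443", "weak-tls", proof_ref="proof/tls-001.json"),
    Alert("zap", "https://app.test/login", "xss"),
]
proven, unproven = triage(alerts)
print(len(proven), "proven;", len(unproven), "still alerts only")
```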

Evidence-Ready Reporting

Produce reports that developers, security teams, and auditors can reproduce without trusting a black-box claim.

HTML, Markdown, JSON, SARIF, traces, and replay bundles
Confidence policy and quality-gate artifacts retained

Core Capabilities

Built for real security testing

Not a benchmark solver. Not a chatbot. An autonomous agent that proves what it finds on authorized web applications and approved network scopes.

Proof-First Findings

Every reported vulnerability includes replayable evidence. No scanner suspicions, no unverified claims. The verification gate runs replay, mutate, baseline diff, and side-effect detection stages, alongside the state readback and negative controls listed under Verification.

Network Scanner Lane

Opt-in network mode (--target-mode network) scans IP/host/CIDR targets: port and service discovery, Nuclei network templates, TLS evidence, and CVE correlation. Available in the built-in scan and as five scope-gated network MCP tools driven by the harness network specialist.
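
Scope gating for IP and CIDR targets can be sketched with Python's standard ipaddress module. The in_scope function and the approved ranges below are hypothetical; the sketch rejects hostnames outright, whereas a real tool would resolve them and re-check the result.

```python
import ipaddress
from typing import Iterable


def in_scope(target: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True only if the target IP falls inside an approved network."""
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        # Not a literal IP address (for example a hostname): reject in this sketch.
        return False
    return any(
        addr in ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs
    )


approved = ["10.0.0.0/24", "192.168.56.10/32"]
for candidate_target in ["10.0.0.5", "10.0.1.5", "192.168.56.10", "example.com"]:
    verdict = "scan" if in_scope(candidate_target, approved) else "skip"
    print(f"{candidate_target}: {verdict}")
```

Checking scope before any probe is sent means an out-of-scope target is never touched.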

Context-Aware Scanning

Combines black-box crawling, browser authentication, OpenAPI specs, JavaScript analysis, optional source review, and network service inventory. Multi-role sessions probe public and authenticated surfaces.

Budget and Scope Control

Request counts, time limits, and scope boundaries are enforced at every layer. The agent stops when budgets exhaust, not when it gets bored. Every action is audit-logged.
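
A minimal sketch of request and time budgets with an audit trail. Budget, BudgetExceeded, and the log format are hypothetical; the point is that each request is charged before it is sent and that denials are logged too.

```python
import time
from typing import Callable, List


class BudgetExceeded(Exception):
    """Raised when a request or time budget is exhausted."""


class Budget:
    def __init__(self, max_requests: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.used = 0
        self.audit_log: List[str] = []

    def charge(self, action: str) -> None:
        """Record one request, or raise before it is sent if a budget is exhausted."""
        if self.used >= self.max_requests:
            self.audit_log.append(f"DENIED {action}: request budget exhausted")
            raise BudgetExceeded("request budget exhausted")
        if self._clock() - self._started >= self.max_seconds:
            self.audit_log.append(f"DENIED {action}: time budget exhausted")
            raise BudgetExceeded("time budget exhausted")
        self.used += 1
        self.audit_log.append(f"ALLOWED {action}")


budget = Budget(max_requests=2, max_seconds=60.0)
for path in ["/", "/login", "/admin"]:
    try:
        budget.charge(f"GET {path}")
    except BudgetExceeded as exc:
        print(f"stopped: {exc}")
        break
print(budget.audit_log)
```

The clock is injectable so the time limit can be tested without waiting.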

Multi-Format Reports

HTML with charts, Markdown for developers, JSON for automation, SARIF for CI/CD, curl and Python PoCs, network scorecards, comparison matrices, and full execution traces.

Deterministic Executors

Real exploitation runs through scoped, tested executors that understand vulnerability mechanics. The LLM plans strategy and evaluates results without freestyling HTTP or shell actions.

This screenshot was captured from a real local ProofStrike run against DVWA. It shows how findings, severity, coverage, and proof-gated results appear in the generated HTML report.

What the report gives you

ProofStrike reports are designed for engineering follow-up: the summary is readable for decision makers, while the artifacts retain enough detail for developers to reproduce and fix the issue.

Severity distribution and OWASP category coverage
Strict quality gate: 14 evaluated, 14 reportable
Replayable evidence and trace artifacts retained
HTML, Markdown, JSON, and SARIF outputs


Image: ProofStrike sample DVWA pentest report showing 14 verified findings, severity distribution, and OWASP category coverage

Captured from a local authorized DVWA run. The image is a static sample of the generated report layout, not a customer scan.

Coming Soon

AI agent and LLM application security

Planned support will extend ProofStrike's proof-driven workflow to AI applications, agents, tool chains, and RAG systems, aligned with OWASP LLM Top 10 and emerging agentic security guidance.

Planned capability

Initial OWASP LLM 2025 map: Prompt Injection, Sensitive Information Disclosure, Supply Chain, Data and Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector and Embedding Weaknesses, Misinformation, and Unbounded Consumption.

LLM01 / Agent Goal Hijack

Prompt Injection

Probe direct and indirect prompt injection paths that try to override instructions, alter goals, or trigger unsafe tool use.

Planned: attack prompts, hidden content, tool-call diffing, and policy-bound replay evidence.

LLM02 / LLM07

Secrets and System Prompt Leakage

Check whether assistants disclose secrets, internal policy, private context, credentials, or sensitive retrieved data.

Planned: canary secrets, disclosure probes, redaction checks, and retained transcript proof.
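
As an illustration of the canary idea, the sketch below plants unique markers and checks a transcript for them. It is an assumption about how such a check could work, not a description of the planned feature; make_canary and find_leaks are hypothetical names. The exact-match search only catches verbatim leaks, not encoded or paraphrased ones.

```python
import secrets
from typing import Iterable, List


def make_canary(label: str) -> str:
    """Create a unique, harmless marker that should never appear in model output."""
    return f"CANARY-{label}-{secrets.token_hex(8)}"


def find_leaks(transcript: str, canaries: Iterable[str]) -> List[str]:
    """Return every planted canary that appears verbatim in the transcript."""
    return [canary for canary in canaries if canary in transcript]


system_canary = make_canary("sysprompt")  # would be planted in the system prompt
rag_canary = make_canary("rag")           # would be planted in a retrieved document
transcript = f"Sure, my instructions say: {system_canary}"
leaks = find_leaks(transcript, [system_canary, rag_canary])
print(f"{len(leaks)} canary leaked")
```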

LLM06 / MCP02

Excessive Agency and Tool Abuse

Map tools, permissions, scopes, and approval gates to find agents that can take actions beyond the intended boundary.

Planned: tool permission matrix, destructive-action simulation, and approval bypass tests.

LLM05 / MCP05

Improper Output Handling

Verify whether model outputs can become unsafe code, shell commands, browser actions, database queries, or workflow inputs.

Planned: output-to-sink tracing, sanitizer checks, command injection guards, and negative controls.

LLM08 / Memory

RAG and Memory Poisoning

Test retrieval, embeddings, and long-term memory for poisoned context, unsafe citations, and cross-session persistence attacks.

Planned: document poisoning fixtures, retrieval ranking checks, and memory reset verification.

LLM03 / AST01-AST07

Agent Supply Chain

Inspect skills, plugins, MCP servers, manifests, and dependencies for over-privilege, update drift, weak isolation, or tool poisoning.

Planned: manifest linting, sandbox health, source provenance, and runtime telemetry review.

Proof-driven web and network pentesting for real systems

Claude Code Harness

Real-Browser Verification

OWASP, workflows, and network exposure

SQL Injection

Broken Access Control

Command Injection

Server-Side Template Injection

Authentication Failures

Cross-Site Scripting

Path Traversal & File Inclusion

TLS & Service Exposure

CVE & Scanner Import Correlation

API Authorization (BOLA / BFLA)

GraphQL

Exposure & Misconfiguration
