This concludes our two-part exploration of evolving from intuition-based to evidence-based prompt engineering. Read Part 1 for the context and methodology behind these findings.

The Real Results: Validating and Expanding “Curious Case” Findings

After fixing the flag format bias and running the complete four-strategy comparison across all models, the results were definitive - and they validated the core insight from “The Curious Case of the Thinking Machine” while revealing much broader patterns:

Overall Performance Rankings

Zero-shot + Focused Guardrails: 80% success rate, 4/5 perfect scores ✅ WINNER

Zero-shot + Extensive Guardrails: 20% success rate, 1/5 perfect scores (Over-engineered)

Step-by-step Reasoning: 0% success rate, 0/6 perfect scores ❌ Catastrophic failure

Few-shot with Examples: 0% success rate, 0/5 perfect scores (Poor constraint satisfaction)
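
For readers who want to run this kind of comparison themselves, here is a minimal sketch of the strategy-by-model grid. Everything in it is a placeholder - the prompt snippets, model names, generate() stub, and pass criterion stand in for the real harness (see PromptFoo in the references for a production-grade, test-driven version of this idea):

```python
# Minimal sketch of a strategy-by-model comparison grid. Everything here is
# a placeholder: generate() should call the real provider APIs, and passes()
# should apply the project's own assertion set.
STRATEGIES = {
    "zero_shot_focused":   "Generate httpx commands for: {task}. Output JSON, use -silent.",
    "zero_shot_extensive": "Generate httpx commands for: {task}. <hundreds of lines of guardrails>",
    "step_by_step":        "Think step by step, then generate httpx commands for: {task}.",
    "few_shot":            "Example: ...\nNow generate httpx commands for: {task}.",
}
MODELS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder names


def generate(model: str, prompt: str) -> str:
    """Placeholder for a real API call to the named model."""
    return "httpx -u https://example.com -silent -json"


def passes(output: str) -> bool:
    """Placeholder pass criterion; the real one is the critical assertion set."""
    return "-silent" in output and "-json" in output


def run_grid(task: str) -> dict[tuple[str, str], bool]:
    return {
        (strategy, model): passes(generate(model, template.format(task=task)))
        for strategy, template in STRATEGIES.items()
        for model in MODELS
    }


if __name__ == "__main__":
    for (strategy, model), ok in run_grid("probe TLS and response headers").items():
        print(f"{strategy:20s} {model:10s} {'PASS' if ok else 'FAIL'}")
```

If strategy really matters more than model - as the results above suggest - the pass/fail pattern in a grid like this should vary far more across strategies than across models.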

The Shocking Discoveries

1. My “Optimized” 367-line prompt crushed my 901-line masterpiece

  • 4x improvement in success rate (20% → 80%)

  • Perfect scores from Claude 3.5, Claude 3.7, GPT-4o, and GPT-4-turbo

  • 60% reduction in complexity while dramatically improving results

2. Step-by-step reasoning failed catastrophically

  • 0/6 success rate despite appearing more “thoughtful”

  • Consistent failure pattern: Forgot -silent flag in 5/6 attempts

  • Process distraction: Models got caught up in steps, lost sight of requirements

  • Validates “Curious Case” thesis: More explicit “thinking” led to worse outcomes across ALL models

3. Few-shot struggled with complex constraints

  • 0/5 success rate even with perfect examples

  • Pattern mismatch: Examples couldn’t cover all probe requirement combinations

4. Model-agnostic results

  • Strategy matters more than model: Same models failed with poor prompting, succeeded with optimal approach

The Critical Insight: Component Responsibility vs. System Optimization

The systematic testing revealed something profound that initially escaped me: the “4x improvement” wasn’t about lowering standards - it was about correctly defining component responsibility.

What Actually Changed Between Tests

  • Original test: 18 assertions (comprehensive quality requirements)

  • Focused test: 9 assertions (only “critical” workflow-killer requirements)

Requirements I Removed:

  • Command efficiency (≤2 commands maximum)

  • Flag optimization preferences (short vs long forms)

  • Forbidden speculative paths (/admin, /config.js)

  • OPTIONS method handling requirements

  • Specific flag combination validations

  • Additional optimization guardrails
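
As a rough sketch of what that scope trim looks like in test code - the predicates below are illustrative stand-ins, not the actual 18 or 9 assertions - only the critical tier gates pass/fail, while the removed tier is at most a style report:

```python
# Illustrative stand-ins for the two assertion tiers, not the actual test
# suite. Only the critical tier decides pass/fail; the removed tier captures
# preferences that don't break the downstream workflow.
CRITICAL_CHECKS = {
    "json output":    lambda cmd: "-json" in cmd,
    "silent flag":    lambda cmd: "-silent" in cmd,
    "correct target": lambda cmd: "example.com" in cmd,  # placeholder target
}

REMOVED_CHECKS = {  # scope creep: engineering preferences, not workflow-killers
    "command efficiency":   lambda cmd: len(cmd.splitlines()) <= 2,
    "short flag forms":     lambda cmd: " -sc " in f" {cmd} ",
    "no speculative paths": lambda cmd: "/admin" not in cmd,
}


def grade(command_block: str) -> bool:
    """Pass/fail on the critical tier only; stylistic checks never gate."""
    return all(check(command_block) for check in CRITICAL_CHECKS.values())
```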

The Responsibility Boundary Revelation

Here’s what I initially missed: those removed requirements weren’t “production quality standards” - they were scope creep.

What Actually Breaks Reconnaissance Workflows:

  • ✅ No JSON output → Downstream processing fails completely

  • ✅ No -silent flag → Parsing breaks on verbose output

  • ✅ Wrong target specification → Probing wrong systems

  • ✅ Syntax errors → Commands fail to execute entirely

  • ✅ Missing required data capture → No TLS/header analysis

What’s Just Engineering Preference:

  • ❌ 2 commands instead of 1 → Still works, just less elegant

  • ❌ Long flags vs short flags → Functionally identical

  • ❌ Speculative paths like /admin → Time wasters, not failures

  • ❌ Suboptimal flag combinations → Still captures required data

The System Design Realization

The removed requirements fell into three categories that weren’t actually the Command Generator’s responsibility:

  1. Requirement Optimization: The Discovery Planner should specify what’s needed; the Command Generator shouldn’t second-guess those requirements

  2. Stylistic Preferences: Flag selection and command count don’t affect functional outcomes

  3. Low-Impact Inefficiencies: Speculative path probing wastes time but doesn’t break workflows

The Critical Insight: A command generator that produces 2 working commands with long flags is infinitely better than one that produces 1 “optimal” command missing the -silent flag.
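
The trade-off is easy to demonstrate against the critical tier sketched earlier; the commands below are illustrative, with flag spellings from memory rather than verified against the current httpx --help:

```python
# Illustrative comparison: two verbose-but-working commands versus one
# "optimal" command that drops -silent. Flag spellings are from memory;
# verify against httpx --help before relying on them.
working_but_verbose = [
    "httpx -u https://example.com -silent -json -status-code -title",
    "httpx -u https://example.com -silent -json -tech-detect",
]
optimal_but_broken = [
    "httpx -u https://example.com -json -sc -td -title",  # missing -silent
]


def passes_critical(commands: list[str]) -> bool:
    # Mirrors the critical tier: JSON output, -silent, and the right target
    # on every command in the block.
    return all(
        "-json" in c and "-silent" in c and "example.com" in c for c in commands
    )


print(passes_critical(working_but_verbose))  # True  -> downstream parsing works
print(passes_critical(optimal_but_broken))   # False -> verbose output breaks parsing
```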

This wasn’t lowered standards - it was correct scope definition. The 18-assertion test had scope creep, testing the component for responsibilities that belonged elsewhere in the system.

The Research-Backed Explanation: Why “Thinking” Can Hurt

The catastrophic failure of step-by-step reasoning initially puzzled me, but it perfectly validated the core finding from “The Curious Case of the Thinking Machine” - that more explicit “thinking” often produces worse results. Recent research provides the scientific explanation for this counterintuitive phenomenon.

A paper published in May 2025, “When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs” by Xiaomin Li et al., systematically documented this exact phenomenon. The researchers found that “explicit CoT reasoning can significantly degrade instruction-following accuracy” and that “CoT reasoning often diverts attention away from instruction-relevant tokens.”

The paper introduces “constraint attention” - a measure of how much the model focuses on actual requirements versus getting lost in the reasoning process. When models engage in step-by-step reasoning, they literally pay less attention to the specific constraints and requirements you’ve given them.

The Failure Analysis

Why step-by-step reasoning failed so badly:

  • 5/6 times: Missing -silent flag (required for clean output parsing)

  • 1/6 times: Missing target specification (command fails entirely)

  • Root cause: Process-focused instructions distracted from direct requirements

The models became so focused on following the reasoning steps that they lost track of the actual functional requirements. It’s like giving someone elaborate directions on how to think about making coffee and watching them forget to add water.

Why few-shot failed:

  • Flag selection errors despite correct examples

  • Difficulty generalizing from examples to new probe combinations

  • Root cause: Complex constraint satisfaction requires explicit rules, not pattern matching

The Cost/Performance Optimization Discovery

The systematic testing revealed another crucial insight: performance isn’t the only variable that matters.

With the optimal prompting strategy, I could now evaluate models based on both capability and economics:

  • Claude 3.7: Highest performance scores, highest cost

  • Claude 3.5 (Oct): Nearly identical performance, much lower cost

  • GPT-4o: Good performance, competitive pricing

The systematic evaluation enabled both performance optimization AND economic optimization. In production, I chose Claude 3.5 not because it was objectively “better,” but because it provided the optimal balance of capability and cost-effectiveness for a workflow that might generate hundreds of commands daily.
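
A hedged sketch of that capability-versus-economics calculation is below; the success rates, token counts, and prices are placeholders, not measured values or current vendor pricing:

```python
# Hedged sketch of a cost/performance ranking. All numbers are placeholders,
# not measured results or real vendor prices.
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    success_rate: float   # fraction of runs passing the critical checks
    input_price: float    # $ per 1M input tokens (placeholder)
    output_price: float   # $ per 1M output tokens (placeholder)


def cost_per_successful_command(c: Candidate,
                                input_tokens: int = 2_000,
                                output_tokens: int = 300) -> float:
    """Expected cost of one successful generation: cost per attempt divided
    by the probability that the attempt passes the critical checks."""
    per_attempt = (input_tokens * c.input_price +
                   output_tokens * c.output_price) / 1_000_000
    return per_attempt / c.success_rate


candidates = [
    Candidate("frontier-model",   success_rate=1.0, input_price=15.0, output_price=75.0),
    Candidate("mid-tier-model",   success_rate=1.0, input_price=3.0,  output_price=15.0),
    Candidate("competitor-model", success_rate=0.8, input_price=2.5,  output_price=10.0),
]
for c in sorted(candidates, key=cost_per_successful_command):
    print(f"{c.name:17s} ${cost_per_successful_command(c):.5f} per successful command")
```

Once two candidates clear the same critical checks, the decision reduces to cost per attempt - which is exactly the kind of call described above.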

The Humbling Reality Check

Throughout this process, one observation kept nagging at me: AI is never going to take over the world if it can’t properly parse help files and construct valid commands.

Here I was, an expert in both cybersecurity and AI, spending weeks struggling to get multiple sophisticated language models to correctly interpret httpx --help output and translate requirements into proper syntax. Even with the optimal approach, the success rate was only 80%.

The systematic testing revealed that the gap between AI capabilities and reliable technical execution is larger than the marketing suggests. Current models exhibit this weird capability inversion:

  • Can reason about abstract security concepts ✅

  • Can engage in sophisticated strategic dialogue ✅

  • Can generate creative problem-solving approaches ✅

  • Cannot reliably parse documentation and construct valid commands ❌

We’re building agentic systems on AI that struggles with the basic task of correctly using existing tools. And here’s the deeper issue: we’re still squarely in agent land - carefully controlling inputs, actions, and outputs through extensive guardrails - far from the truly autonomous systems everyone’s promising.

Beyond Prompt Engineering: Systematic AI Evaluation

This experience transformed my understanding of AI development from intuition-based to evidence-based engineering:

The Evolution:

  • First Generation: “Generate this function” (isolated snippets)

  • Second Generation: “Here’s my codebase, figure it out” (context overload)

  • Third Generation: Systematic validation with multi-vendor testing and bias detection

The Critical Insights:

  1. Evaluation Bias Is Real: Our assumptions about “correct” output formats can systematically favor certain models over others (see the sketch after this list).

  2. Component Scope Matters: Testing components for responsibilities outside their scope creates false quality requirements.

  3. Functional vs Stylistic Requirements: Distinguish between what breaks workflows versus what represents engineering preferences.

  4. Reasoning-Instruction Tension: Explicit reasoning processes can divert attention from constraint compliance - a fundamental LLM limitation validated by research.

  5. Responsibility Boundaries: Systems work better when each component excels within its proper scope rather than trying to optimize the entire workflow.

  6. Cost/Performance Trade-offs: The “best” model isn’t always the optimal choice for production workflows.

  7. Cross-Vendor Validation: Testing across multiple providers reveals capabilities and limitations that single-vendor testing misses.
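
To make the first insight concrete, here is a minimal sketch of one way to keep flag-format bias out of an evaluation: accept functionally equivalent flag forms instead of hard-coding one model's habit. The alias pairs are from memory and should be checked against httpx --help.

```python
# Minimal sketch of a bias-tolerant assertion: a model isn't penalized for
# choosing a short flag where another chooses the long form. Alias pairs are
# illustrative; confirm them against httpx --help.
import shlex

EQUIVALENT_FLAGS = {
    "-status-code": {"-sc", "-status-code"},
    "-tech-detect": {"-td", "-tech-detect"},
}


def has_flag(command: str, canonical: str) -> bool:
    """True if the command uses the canonical flag or any accepted alias."""
    tokens = set(shlex.split(command))
    return bool(tokens & EQUIVALENT_FLAGS.get(canonical, {canonical}))


assert has_flag("httpx -u https://example.com -silent -json -sc", "-status-code")
assert has_flag("httpx -u https://example.com -silent -json -status-code", "-status-code")
```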

The Systematic Approach in Practice

Since adopting systematic AI evaluation, the transformation has been profound:

Before:

  • 47 manual iterations based on gut feelings

  • Model selection based on marketing and intuition

  • No way to validate that improvements were actually improvements

  • Recursive AI-debugging-AI conversations

  • Over-engineered solutions based on fear rather than evidence

After:

  • Empirical evidence for scaffolding decisions

  • Data-driven model selection optimized for cost and performance

  • Bias detection in evaluation criteria

  • Confidence in production deployment decisions

  • Right-sized complexity based on systematic testing

The Methodology That Matters

The broader lesson connects to themes I’ve explored in previous writing about the evolution of security roles. The most effective security professionals won’t be those who resist AI or those who blindly embrace it, but those who can systematically engineer AI behavior for reliable production use.

This requires developing new skills:

  • Systematic evaluation design that avoids bias while capturing domain requirements

  • Multi-strategy testing to find optimal approaches rather than assuming the first solution is best

  • Multi-vendor assessment for optimal cost/performance decisions

  • Evidence-based scaffolding that transfers human expertise to AI systems efficiently

  • Reasoning-instruction balance: knowing when explicit reasoning helps versus hurts task performance

  • Production-oriented optimization that balances capability with economic constraints

Looking Forward: The Systematic Advantage

The Uncomfortable Truth About Current AI: Even with optimal prompting strategies and correct scope definition, we’re dealing with systems that succeed 80% of the time on well-defined technical tasks. More concerning, the research reveals that asking these systems to reason more explicitly can make them worse at following instructions - a fundamental architectural limitation that scaffolding alone cannot overcome.

The meta-lesson: In the age of AI-assisted development, systematic validation isn’t just about optimizing success rates - it’s about understanding the fundamental trade-offs and limitations in current AI architectures. The patterns I first observed anecdotally in “The Curious Case of the Thinking Machine” represent systematic limitations that affect all current LLMs, not just specific model versions.

Sometimes what looks like “better reasoning” actually produces worse results due to attention-diversion effects that researchers are only now beginning to understand. The future of AI-assisted development will belong to those who can navigate these limitations systematically, building reliable systems within the constraints of what AI can actually do today.

References

  1. Xiaomin Li et al. - When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs (arXiv:2505.11423, May 2025)

  2. PromptFoo: Test-driven LLM Development

  3. Nerve ADK: Agent Development Kit

This systematic approach to AI evaluation has transformed how I work with AI systems, moving from intuition-based to evidence-based engineering. The key insight: systematic validation reveals both capabilities and fundamental limitations that informal testing misses.