This blog describes a Python application built with the LangChain and LangGraph frameworks for testing agentic workflow design patterns such as Chaining, Routing, and Reflection. The application currently implements 10 AI agent patterns, each with several pre-configured representative use cases that you can run with OpenAI GPT, Anthropic Claude, or Google Gemini models for comparison.
Introduction
The rise of Large Language Models (LLMs) over the last three years has fundamentally changed how we build software: AI agents can now execute tasks without relying on preset workflows and learn from those tasks to improve over time. There is an abundance of blogs and YouTube channels with demos of various AI agents. But moving from impressive demos to AI systems ready for real-life production use requires more than API calls to GPT-4 or Claude. It requires building the agent’s workflow around relevant design patterns, called agentic workflow design patterns, that enable AI to reason, plan, collaborate, and improve over time. That is similar to building software around the well-established software design patterns described in the book “Design Patterns: Elements of Reusable Object-Oriented Software”, published in 1994 by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides.
What are AI Agents and Agentic Workflows?
An AI agent is more than an AI chatbot. It is a software system that can break down a complex task into manageable steps and then use tools and external resources as needed to execute each step. An agent can maintain context and use memory across multiple interactions with a user to complete tasks. It can collaborate with other agents, self-critique its work, and adapt and improve its actions based on feedback from agents or users.
Agentic workflows are the design patterns that make this possible. Just as software engineers use design patterns like Factory, Observer, or Strategy to solve recurring problems, agentic workflows provide proven solutions like Routing, Reflection, or Planning for common AI challenges.
Why Design Patterns Matter
Without structured patterns, AI applications become brittle and unpredictable. Consider these real-world challenges:
- Code generation: A single LLM call might produce buggy code. A Reflection pattern where one model generates and another critiques produces more reliable results.
- Customer service: Routing every query to a general-purpose agent is inefficient. A Routing pattern classifies requests and delegates to specialized handlers.
- Research analysis: Processing a paper sequentially is slow. A Parallelization pattern runs summary, question generation, and key term extraction concurrently.
Why Testing Framework
I found that most discussions of agentic patterns are theoretical. I can read a blog or watch a YouTube video, but neither answers my questions: How do these patterns perform in practice? Which LLM providers (OpenAI, Anthropic, Google) work best for each pattern? How do you actually implement them?
Hence I decided to build a practical testing framework that currently implements 10 core agentic patterns using the LangChain and LangGraph libraries, with the ability to run and compare agents across different models. The Python application code is available on GitHub.
The Testing Framework
Why LangChain and LangGraph?
Building agentic workflows from scratch means wrestling with LLM API differences, managing conversation history, implementing state machines, and handling tool calls. Agent frameworks like LangChain or CrewAI abstract these complexities.
The LangChain library provides the agent foundation:
- Model abstraction – allows switching between OpenAI GPT, Anthropic Claude, and Google Gemini with a single interface
- LCEL (LangChain Expression Language) – allows composing workflow chains with intuitive syntax:
chain = prompt | llm | parser
- Tool integration – decorator-based function calling that works across tool providers
- Memory primitives – built-in conversation buffers and history management tools
The LangGraph library extends LangChain with advanced agent capabilities:
- State machines – allows defining complex workflows as graphs with nodes (tasks) and edges (workflow routes)
- Multi-agent orchestration – allows coordinating sequential, parallel, or debate-style agent collaboration
- Persistence – provides the InMemoryStore class for semantic, episodic, or procedural memory
- Conditional routing – supports dynamic workflow paths based on agent decisions (see the minimal example below)
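For orientation, here is a minimal, self-contained sketch of what building a LangGraph workflow with a conditional edge looks like; the state fields, node functions, and categories are illustrative, not taken from the framework:
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    request: str
    category: str
    answer: str

def classify(state: State) -> dict:
    # A real implementation would call an LLM; a keyword stub keeps the sketch runnable
    return {"category": "booking" if "book" in state["request"].lower() else "info"}

def handle_booking(state: State) -> dict:
    return {"answer": f"Processing booking: {state['request']}"}

def handle_info(state: State) -> dict:
    return {"answer": f"Answering question: {state['request']}"}

workflow = StateGraph(State)
workflow.add_node("classify", classify)
workflow.add_node("booking", handle_booking)
workflow.add_node("info", handle_info)
workflow.add_edge(START, "classify")
workflow.add_conditional_edges("classify", lambda s: s["category"], {"booking": "booking", "info": "info"})
workflow.add_edge("booking", END)
workflow.add_edge("info", END)

app = workflow.compile()
print(app.invoke({"request": "Book a hotel in Rome"})["answer"])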
Framework Architecture
The testing framework follows a consistent structure that makes patterns easy to implement, run, and compare.
ModelFactory – Multi-Provider Abstraction
A central factory for creating LLM instances from any provider:
# Automatically routes to correct provider based on name
llm = ModelFactory.create("gpt-4o") # OpenAI
llm = ModelFactory.create("claude-sonnet-4-5") # Anthropic
llm = ModelFactory.create("gemini-2.5-flash") # Google
The factory handles API keys, default parameters such as temperature or max_tokens, and provider-specific configurations.
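Under the hood, such a factory typically inspects the model-name prefix and instantiates the matching provider class (API keys are read from the usual provider environment variables). A simplified sketch; the actual implementation in the repository may differ:
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import ChatGoogleGenerativeAI

class ModelFactory:
    @staticmethod
    def create(model_name: str, temperature: float = 0.0, max_tokens: int = 1024):
        """Route a model name to the matching provider's chat model class."""
        if model_name.startswith("gpt"):
            return ChatOpenAI(model=model_name, temperature=temperature, max_tokens=max_tokens)
        if model_name.startswith("claude"):
            return ChatAnthropic(model=model_name, temperature=temperature, max_tokens=max_tokens)
        if model_name.startswith("gemini"):
            return ChatGoogleGenerativeAI(model=model_name, temperature=temperature,
                                          max_output_tokens=max_tokens)
        raise ValueError(f"Unsupported model name: {model_name}")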
Pattern Folder Structure Convention
Implementation of every agentic pattern follows the same directory structure:
patterns/pattern_name/
├── config.py # Pattern-specific configuration
├── run.py # Implementation and CLI entry point
└── __init__.py # Public API exports
Each run.py implements two key functions (a minimal sketch follows the list):
- run() – execute the design pattern with a single model (or multiple for multi-role patterns)
- compare_models() – run across multiple models and compare results
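A minimal sketch of the shape such a run.py can take; the prompt, the default model, and the ModelFactory import path are illustrative assumptions, not the repository's actual code:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from agentic_patterns.common import create_writer  # per the framework description
from agentic_patterns.common import ModelFactory   # assumed import path

PROMPT = ChatPromptTemplate.from_template("Summarize in one paragraph: {input}")

def run(model_name: str = "gpt-4o", text: str = "agentic design patterns") -> str:
    llm = ModelFactory.create(model_name)
    result = (PROMPT | llm | StrOutputParser()).invoke({"input": text})
    writer = create_writer("example_pattern")
    writer.write_result(model_name, text, result)
    return result

def compare_models(models: list[str], text: str = "agentic design patterns") -> dict:
    results = {model: run(model, text) for model in models}
    create_writer("example_pattern").write_comparison(models, text, results)
    return results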
OutputWriter – Standardized Logging
All results are automatically logged to a file in the experiments/results/ folder, with a timestamp and the pattern name added to the file name:
from agentic_patterns.common import create_writer
writer = create_writer("pattern_name")
writer.write_result(model_name, input_data, result)
writer.write_comparison(models, input_data, results)
This makes it easy to compare how GPT-4, Claude, and Gemini perform on the same task.
Running Design Patterns
Any agentic design pattern can be run from CLI or imported programmatically:
# CLI: Run with default model
uv run src/agentic_patterns/patterns/chaining_01/run.py
# CLI: Specify model
uv run src/agentic_patterns/patterns/reflection_04/run.py gpt-4o
# Python: Programmatic usage
from agentic_patterns.patterns.reflection_04 import run
result = run(creator_model='gpt-4o', critic_model='claude-sonnet-4-5')
Agentic Design Patterns
This section explores the 10 core design patterns I have implemented so far (plus a RAG pattern used in other projects), organized by complexity. For complete implementation details, see my GitHub repository.
Pattern Selection Guide
Before diving into individual patterns, you can use the decision tree below to identify which pattern fits your use case:
START: What is your primary challenge?
│
├─► "Task has multiple sequential steps"
│ └─► CHAINING (Pattern 1)
│
├─► "Different request types need different handling"
│ └─► ROUTING (Pattern 2)
│
├─► "Multiple independent subtasks can run simultaneously"
│ └─► PARALLELIZATION (Pattern 3)
│
├─► "Output quality needs iterative improvement"
│ │
│ ├─► "Need automated self-correction"
│ │ └─► REFLECTION (Pattern 4)
│ │
│ └─► "Need structured review with scoring"
│ └─► GOAL SETTING & MONITORING (Pattern 10)
│
├─► "Agent needs external data or capabilities"
│ └─► TOOL USE (Pattern 5)
│
├─► "Complex task needs decomposition and strategic execution"
│ └─► PLANNING (Pattern 6)
│
├─► "Multiple specialized perspectives needed"
│ └─► MULTI-AGENT COLLABORATION (Pattern 7)
│
├─► "Need to remember context across interactions"
│ └─► MEMORY MANAGEMENT (Pattern 8)
│
├─► "Agent should autonomously optimize its performance"
│ └─► LEARNING & ADAPTING (Pattern 9)
│
└─► "Need to ground responses in external documents"
└─► RAG (Pattern 11 - separate project)
Foundation Patterns (1-3) – Building Blocks
Design patterns in this category form the basis for creating more complex agentic workflows.
1. Chaining – Sequential Processing Pipeline
The Pattern: Instead of using a single LLM call for a complex problem, break it down into simpler steps resolved by a sequence of LLM calls (a chain), where the output of each step feeds the input of the next.
How It Works: Pipeline chain is defined using LangChain Expression Language (LCEL) syntax:
chain = extraction_chain | transform_chain | generation_chain
result = chain.invoke({"input": user_text})
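Each link in that pipeline is typically itself a small prompt | llm | parser composition. A sketch of the first two stages, assuming llm was created with ModelFactory; the prompt wording is illustrative, and the third stage follows the same shape:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

extract_prompt = ChatPromptTemplate.from_template(
    "Extract the technical specifications from this order:\n{input}")
transform_prompt = ChatPromptTemplate.from_template(
    "Rewrite these specifications as a structured JSON object:\n{specs}")

extraction_chain = extract_prompt | llm | StrOutputParser()
# Map the plain-string output of the previous step into the dict the next prompt expects
transform_chain = {"specs": lambda text: text} | transform_prompt | llm | StrOutputParser()

partial_chain = extraction_chain | transform_chain
print(partial_chain.invoke({"input": "I need a quiet 27-inch 4K monitor for CAD work"}))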
Sample Use Case: An agent analyzes a product order received from a user:
- Step 1: Extract technical specifications from user description
- Step 2: Transform specs into required structured format
- Step 3: Generate implementation recommendations based on spec documentation
Why It Matters: Most real-life problems are complex and require multiple processing steps to solve. The Chaining pattern makes these steps explicit and testable.
Trade-offs and Considerations:
| Advantage | Disadvantage |
|---|---|
| Clear separation of concerns | Each step adds API latency |
| Easier debugging (inspect intermediate outputs) | Errors in early steps cascade |
| Specialized prompts per step | Higher token costs (N calls vs. 1) |
| Testable individual components | Over-decomposition adds complexity |
Latency Impact: A 3-step chain incurs 3× the latency of a single call. For latency-sensitive applications, you should balance decomposition granularity against response time requirements.
Error Handling Strategy: Consider adding validation between steps to catch errors early:
# Chain with intermediate validation
def validated_chain(input_data):
# Step 1: Extract
specs = extraction_chain.invoke(input_data)
if not validate_specs(specs):
return {"error": "Extraction failed validation"}
# Step 2: Transform
structured = transform_chain.invoke({"specs": specs})
if not validate_structure(structured):
return {"error": "Transform produced invalid structure"}
# Step 3: Generate
return generation_chain.invoke({"data": structured})
When NOT to Use Chaining Pattern:
- Simple tasks that a single well-crafted prompt can handle
- When latency is critical and steps cannot be parallelized
- When intermediate outputs don’t provide debugging value
2. Routing – Intent-Based Delegation
The Pattern: Complex problems often cannot be handled by a single sequential workflow. The Routing pattern addresses this by introducing conditional logic into the agentic workflow to choose the next step for a specific task. The system analyzes incoming requests to determine the nature of each task, classifies it, and routes it to a specialized agent for handling.
How It Works: A coordinator LLM agent analyzes the request and then selects the appropriate handler agent:
from langchain_core.runnables import RunnableBranch
router = RunnableBranch(
(lambda x: "booker" in x["category"], booking_handler),
(lambda x: "info" in x["category"], info_handler),
unclear_handler # default
)
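The category field that these branch conditions inspect comes from a classification step that runs before the branch. One way to wire the coordinator, assuming llm and the handlers above are defined; the prompt wording is illustrative:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

classifier = (
    ChatPromptTemplate.from_template(
        "Classify this request as 'booker', 'info', or 'unclear'. "
        "Reply with the single word only.\n\nRequest: {request}")
    | llm
    | StrOutputParser()
)

# Build the {"category", "request"} dict the branch conditions expect, then route
coordinator = {"category": classifier, "request": lambda x: x["request"]} | router
result = coordinator.invoke({"request": "I want to book a flight to Paris"})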
Sample Use Case: Customer service automation – the coordinator/router agent analyzes incoming requests to determine which specialist handler should process them:
- Booking requests → route to Booking agent if incoming request requires service booking
- Information queries → route to FAQ agent if incoming request is a question
- Unclear requests → route to Human if incoming request is not classified (escalation)
Why It Matters: General-purpose agents are often inefficient in handling specific requests. Routing increases efficiency by enabling specialization.
Classification Approaches:
There are several ways to implement the classification step, each with trade-offs:
| Approach | Speed | Accuracy | Best For |
|---|---|---|---|
| Keyword matching | Fast (no LLM call) | Low | Obvious, distinct categories |
| LLM classification | Slower (+1 API call) | High | Nuanced, overlapping categories |
| Embedding similarity | Medium | Medium-High | Large number of categories |
| Hybrid | Medium | High | Production systems |
Hybrid Classification Example:
def classify_request(request: str) -> str:
# Fast path: keyword matching for obvious cases
request_lower = request.lower()
if any(word in request_lower for word in ["book", "reserve", "schedule"]):
return "booking"
if any(word in request_lower for word in ["cancel", "refund"]):
return "cancellation"
# Slow path: LLM for ambiguous cases
return llm_classifier.invoke(request)
Handling Edge Cases:
Multi-Intent Requests: “Book a flight to Paris and tell me about visa requirements”
# Option A: Primary intent routing with queue
def route_multi_intent(request: str):
intents = extract_all_intents(request) # Returns: ["booking", "info"]
primary_result = handlers[intents[0]].invoke(request)
# Queue secondary intents for follow-up
for intent in intents[1:]:
queue_followup(intent, request)
return primary_result
# Option B: Split and route each part
def route_split(request: str):
sub_requests = split_request(request)
results = [handlers[classify(sub)].invoke(sub) for sub in sub_requests]
return combine_results(results)
Confidence Thresholds: Escalate to a human assistant when uncertain:
CLASSIFICATION_PROMPT = """
Classify this customer request. Return JSON with:
- category: one of [booking, info, complaint, unclear]
- confidence: 0.0 to 1.0
Request: {request}
"""
def route_with_confidence(request: str):
result = classifier.invoke(request)
if result["confidence"] < 0.7:
return escalate_to_human(request)
return handlers[result["category"]].invoke(request)
Sample Production Classification Prompt:
CLASSIFICATION_PROMPT = """
You are a customer service request classifier. Analyze the request and determine
the most appropriate handling category.
Categories:
- BOOKING: Requests to make, modify, or inquire about reservations (flights, hotels, cars)
- INFO: Questions about policies, destinations, requirements, or general information
- COMPLAINT: Issues with existing services, requests for refunds, or expressions of dissatisfaction
- URGENT: Safety concerns, stranded travelers, or time-sensitive emergencies
- UNCLEAR: Cannot determine intent or request is ambiguous
Request: {request}
Respond with exactly one category name and a brief justification:
Category: <category>
Reason: <one sentence explanation>
"""
When NOT to Use Routing Pattern:
- All requests can be handled by a single general-purpose agent
- Categories overlap significantly (consider hierarchical routing instead)
- Classification overhead exceeds the benefit of specialization
3. Parallelization – Concurrent Execution
The Pattern: Resolving complex problems often requires completing multiple sub-tasks that can be executed simultaneously rather than sequentially. The Parallelization pattern executes multiple independent LLM agent chains concurrently to reduce overall latency and then synthesizes the results to solve the problem.
How It Works: Using the RunnableParallel object in LangChain with asynchronous execution:
parallel_chains = RunnableParallel(
summary=summary_chain,
questions=questions_chain,
key_terms=key_terms_chain
)
result = await parallel_chains.ainvoke({"topic": topic})
Sample Use Case: Peer review of a research paper:
- Parallel execution: Peer agents generate summary, questions, and key terms simultaneously
- Synthesis: Combine generated reviews into a comprehensive paper critique
Why It Matters: Reduces paper review latency by roughly a factor of N, where N is the number of peers, and provides multiple perspectives on the research paper.
Understanding Concurrency vs. Parallelism:
An important distinction for LLM applications:
- Concurrency (what we achieve using this pattern): Multiple tasks in progress simultaneously via async I/O. While waiting for one API response, we can initiate other requests.
- True parallelism: Would require multiple CPU cores executing simultaneously (not typical for I/O-bound LLM calls).
The latency reduction comes from overlapping API wait times, not CPU parallelism. If each LLM call takes 2 seconds, three parallel calls still complete in ~2 seconds (not 6).
Sequential (6 seconds total):
[Call 1: 2s][Call 2: 2s][Call 3: 2s]
Concurrent (2 seconds total):
[Call 1: 2s ]
[Call 2: 2s ]
[Call 3: 2s ]
Synthesis Strategies:
After parallel execution, you need to combine results. Choose based on your use case:
| Strategy | Description | Best For |
|---|---|---|
| Aggregation | Combine all outputs into single document | Research summaries, reports |
| Voting | Multiple agents answer same question, majority wins | Factual queries, classification |
| Weighted Merge | Assign confidence scores, prioritize higher confidence | When agent reliability varies |
| Structured Merge | Each agent fills different fields of output schema | Multi-aspect analysis |
Aggregation Example:
synthesis_prompt = ChatPromptTemplate.from_template("""
Synthesize these parallel analysis results into a coherent report:
Summary Analysis:
{summary}
Key Questions Generated:
{questions}
Important Terms Identified:
{key_terms}
Create a unified analysis that integrates all perspectives.
""")
full_chain = parallel_chains | synthesis_prompt | llm | StrOutputParser()
Voting Example:
# Three agents answer the same factual question
parallel_voters = RunnableParallel(
agent1=factual_chain,
agent2=factual_chain,
agent3=factual_chain
)
def majority_vote(results: dict) -> str:
answers = [results["agent1"], results["agent2"], results["agent3"]]
return max(set(answers), key=answers.count)
result = await parallel_voters.ainvoke({"question": question})
final_answer = majority_vote(result)
Error Handling in Parallel Execution:
Parallel chains can partially fail. You should handle such failures gracefully:
import asyncio

async def run_with_fallbacks(topic: str):
    # Run each chain separately so one failure doesn't discard the others
    chain_map = {
        "summary": summary_chain,
        "questions": questions_chain,
        "key_terms": key_terms_chain,
    }
    results = await asyncio.gather(
        *(chain.ainvoke({"topic": topic}) for chain in chain_map.values()),
        return_exceptions=True  # Collect exceptions instead of raising
    )
    # Process results, handling any that failed
    processed = {}
    for key, value in zip(chain_map.keys(), results):
        if isinstance(value, Exception):
            processed[key] = f"[Failed: {type(value).__name__}]"
            logger.warning(f"Chain {key} failed: {value}")
        else:
            processed[key] = value
    return processed
When NOT to Use Parallelization Pattern:
- Tasks have dependencies (output of A is input to B)
- Order of execution matters for correctness
- Rate limits would be exceeded by concurrent requests
- Combined token usage exceeds context window for synthesis
Enhancement Patterns (4-5) – Adding Intelligence
The foundation patterns make agents efficient, fast, and flexible when resolving complex problems. However, even a sophisticated workflow will not help an agent handle a request correctly if the agent's understanding of the task is inaccurate, or if it lacks the information required to give a correct answer. The design patterns in this group make agents more capable by letting them evaluate their own work and iterate to refine their understanding of the task, or use relevant tools to obtain missing data.
4. Reflection – Iterative Improvement Through Critique
The Pattern: The Reflection pattern offers a mechanism for self-correction and refinement by establishing a feedback loop: one LLM generates output, another model evaluates it against predefined criteria, and the first model revises the output based on the feedback. This iterative process progressively improves the accuracy and quality of the final result.
How It Works: Using a dual-model approach with a specific role defined for each AI agent:
# Generate initial solution
code = creator_llm.invoke(task_prompt)
# Critique the solution
critique = critic_llm.invoke(f"Review this code: {code}")
# Revise based on feedback
improved = creator_llm.invoke(f"Improve based on: {critique}")
Sample Use Case: Code generation with review of the generated code:
- Creator (GPT-4o): Generates Python function based on a user request
- Critic (Claude Sonnet): Reviews generated code for bugs, style, or edge cases support
- Creator: Revises the code based on received feedback
Why It Matters: Single-pass code generation is often flawed. The Reflection pattern catches errors and improves code quality and style.
Framework Feature: The application supports using different models for the creator and critic roles to leverage model-specific strengths.
Iteration Control Strategies:
Deciding when to stop iterating is crucial for both quality and cost of using Reflection:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed iterations | Always run N cycles | Predictable cost/time | May over/under-iterate |
| Quality threshold | Stop when grade ≥ target | Efficient | Requires quantifiable metrics |
| Diminishing returns | Stop when delta improvement < ε | Balances quality/cost | Needs improvement tracking |
| Critic consensus | Stop when no issues found | Quality-focused | May never converge |
Implementation with Multiple Strategies:
def reflection_loop(task: str, max_iterations: int = 5,
quality_threshold: float = 0.9,
min_improvement: float = 0.05):
current_output = creator_llm.invoke(task)
previous_score = 0.0
for i in range(max_iterations):
# Get structured critique with score
critique = critic_llm.invoke(f"""
Review this output and provide:
1. Quality score (0.0 to 1.0)
2. List of issues (empty if none)
3. Specific improvement suggestions
Output: {current_output}
""")
current_score = extract_score(critique)
issues = extract_issues(critique)
# Strategy 1: Quality threshold reached
if current_score >= quality_threshold:
return current_output, f"Reached quality threshold at iteration {i+1}"
# Strategy 2: No issues found (critic consensus)
if not issues:
return current_output, f"No issues found at iteration {i+1}"
# Strategy 3: Diminishing returns
improvement = current_score - previous_score
if i > 0 and improvement < min_improvement:
return current_output, f"Diminishing returns at iteration {i+1}"
# Continue improving
current_output = creator_llm.invoke(f"""
Improve this output based on feedback:
Current output: {current_output}
Feedback: {critique}
""")
previous_score = current_score
return current_output, f"Reached max iterations ({max_iterations})"
Designing Effective Critic Prompts:
The critic prompt is critical to the pattern's success. A vague critic request produces vague feedback.
# Bad example: Vague critic prompt
WEAK_CRITIC_PROMPT = "Review this code and provide feedback."
# Good example: Structured critic prompt with specific criteria
STRONG_CRITIC_PROMPT = """
Review this Python code against these specific criteria:
1. CORRECTNESS (Critical)
- Does it handle the stated requirements?
- Are there logic errors or bugs?
- Are edge cases handled (empty input, None, large values)?
2. CODE QUALITY (Major)
- Follows PEP 8 style guidelines?
- Meaningful variable/function names?
- Appropriate use of Python idioms?
3. ERROR HANDLING (Major)
- Are exceptions caught and handled appropriately?
- Are error messages informative?
- Does it fail gracefully?
4. DOCUMENTATION (Minor)
- Are functions documented with docstrings?
- Are complex sections commented?
For each criterion, provide:
- Rating: PASS / NEEDS_IMPROVEMENT / FAIL
- Specific issues found (with line references if applicable)
- Concrete suggestions for improvement
Code to review:
```python
{code}
```
"""
Cross-Model Reflection:
Using the same model for creator and critic often creates an “echo chamber” where the critic approves flawed output because it has similar blind spots.
# Recommended approach: Cross-model reflection
creator = ModelFactory.create("gpt-4o") # Strong at generation
critic = ModelFactory.create("claude-sonnet-4-5") # Strong at analysis
# Alternative: Same provider, different temperatures
creator = ModelFactory.create("gpt-4o", temperature=0.7) # Creative
critic = ModelFactory.create("gpt-4o", temperature=0.2) # Analytical
When NOT to Use Reflection Pattern:
- Task has objective correctness criteria (you should use automated tests instead)
- Single-pass output is consistently acceptable
- Latency constraints don’t allow multiple iterations
- Cost per iteration is prohibitive for the use case
5. Tool Use – Extending Agent Capabilities
The Pattern: The Tool Use pattern enables agents to interact with external APIs, databases, or services by equipping them with domain-specific tools that they can call as needed for the task at hand.
How It Works: The pattern is often implemented through a function-calling mechanism, which involves defining and describing external functions or capabilities to the LLM. In LangChain, this is done with the @tool decorator:
from langchain_core.tools import tool
@tool
def tech_search(query: str) -> str:
"""Search for technology information."""
return search_tech_database(query)
@tool
def science_search(query: str) -> str:
"""Search for science information."""
return search_science_database(query)
llm_with_tools = llm.bind_tools([tech_search, science_search])
The LLM receives both the user's request and the available tool definitions. Based on this information, it decides whether calling one or more tools is required to generate a response.
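In code, that decision loop typically looks like this: the model's reply carries tool_calls, which the application executes and returns as ToolMessages before asking for the final answer (a sketch using the tools defined above):
from langchain_core.messages import HumanMessage, ToolMessage

tools_by_name = {"tech_search": tech_search, "science_search": science_search}
messages = [HumanMessage(content="What is quantum computing?")]

ai_msg = llm_with_tools.invoke(messages)
messages.append(ai_msg)

# Execute each tool the model requested and feed the results back
for tool_call in ai_msg.tool_calls:
    tool_result = tools_by_name[tool_call["name"]].invoke(tool_call["args"])
    messages.append(ToolMessage(content=str(tool_result), tool_call_id=tool_call["id"]))

final_answer = llm_with_tools.invoke(messages)
print(final_answer.content)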
Sample Use Case: Research assistant:
- Query: “What is quantum computing?”
- Agent: Calls tech_search("quantum computing")
- Agent: Formulates an answer using the search results returned by the tool
Why It Matters: An LLM alone is limited to its training data, which may lack relevant information or be out of date. The Tool Use pattern lets it access additional or real-time data and perform actions or calculations as needed to generate a response.
Tool Description Best Practices:
The LLM decides which tool to call based on the function name and description. Poor or incomplete descriptions often lead to incorrect tool selection.
# Bad choice: Vague description
@tool
def search(q: str) -> str:
"""Search for information."""
return search_database(q)
# Good choice: Specific description with usage guidance
@tool
def search_tech_patents(query: str, year_from: int = 2020) -> str:
"""Search USPTO patent database for technology-related patents.
Use this tool when the user asks about:
- Patents, inventions, or intellectual property
- Technology innovations and their inventors
- Prior art research
Args:
query: Search terms (e.g., "machine learning image recognition")
year_from: Filter patents from this year onward (default: 2020)
Returns:
JSON string with patent titles, numbers, abstracts, and filing dates.
Returns empty array if no matches found.
"""
return search_patent_db(query, year_from)
Tool Execution Patterns:
Tools can be used in various patterns depending on the task:
| Pattern | Description | Example |
|---|---|---|
| Single tool | One tool call, use result | “What’s the weather?” -> call weather_api() |
| Sequential | Output of A feeds into B | search -> summarize results |
| Parallel | Multiple tools simultaneously | weather + news + calendar |
| Iterative | Same tool, refined queries | search -> refine -> search again |
Sequential Tool Chain:
@tool
def web_search(query: str) -> str:
"""Search the web for current information."""
return search_api(query)
@tool
def summarize_results(search_results: str) -> str:
"""Summarize search results into key points."""
return llm.invoke(f"Summarize: {search_results}")
# Agent decides to chain: search → summarize
Error Handling for Tools:
Tools can fail. You should design agents to handle failures gracefully:
@tool
def reliable_search(query: str) -> str:
"""Search with automatic retry and fallback."""
# Attempt primary source
try:
result = primary_search_api(query, timeout=5)
if result:
return result
except TimeoutError:
pass # Fall through to backup
except APIError as e:
logger.warning(f"Primary search failed: {e}")
# Attempt backup source
try:
result = backup_search_api(query, timeout=10)
if result:
return f"[From backup source] {result}"
except Exception as e:
logger.error(f"Backup search failed: {e}")
# Graceful degradation
return "Search unavailable. Please try rephrasing your query or try again later."
Tool Selection Prompt Enhancement:
Always consider how to help the LLM make better tool-selection decisions:
TOOL_SELECTION_SYSTEM_PROMPT = """
You have access to these tools:
{tool_descriptions}
Guidelines for tool selection:
1. Only use tools when the information is not in your training data
2. For current events, prices, or real-time data: ALWAYS use tools
3. If multiple tools could work, prefer the most specific one
4. If a tool fails, try rephrasing the query before giving up
5. Explain your tool choice briefly before calling it
"""
When NOT to Use Tool Use Pattern:
- Information is certainly available in the LLM’s training data
- Tool latency would result in unacceptably slow responses
- Task can be completed with LLM reasoning alone
- Tool results would need extensive validation
Orchestration Patterns (6-7) – Complex Workflows
Intelligent behavior often requires an agent to break a complex task into smaller steps that, once completed, achieve the task's goal. Some steps may require domain expertise or specific tools, so the plan should account for collaboration with specialized agents that have the required expertise or tools. The design patterns in this group offer a standardized way for an agentic system to first create a coherent plan to meet a goal and then coordinate multiple agents to execute that plan.
6. Planning – Strategic Breakdown and Execution
The Pattern: The Planning pattern breaks a complex task into smaller steps, creates an execution plan, and then follows it step by step to achieve the final goal.
How It Works: Two-phase approach using LangGraph state machine:
from langgraph.graph import StateGraph, END
workflow = StateGraph(PlanningState)
workflow.add_node("planner", planner_agent) # Phase 1: Create plan
workflow.add_node("executor", executor_agent) # Phase 2: Execute plan
workflow.set_entry_point("planner")
workflow.add_edge("planner", "executor")
workflow.add_edge("executor", END)
app = workflow.compile()
Sample Use Case: Design a RESTful API for a book library system
- Planner: Analyzes requirements and breaks them into steps (entities, relationships, constraints)
- Executor: Creates the design for each step and produces the SQL schema
Why It Matters: Complex tasks benefit from decomposing high-level requirements into actionable, sequential steps and creating an explicit plan, with a detailed design for each step, before execution.
Plan Representation Formats:
You need to decide what structure to use for plans, as the choice can affect execution quality. For example:
1. Linear Plans – a sequential list of items
2. Directed Acyclic Graph (DAG) Plans – a non-linear list of items with dependencies (see the example after the hierarchical plan below)
3. Hierarchical Plans – nested goals:
Goal: Design Library API
├── SubGoal 1: Define Data Model
│ ├── Task 1.1: Identify core entities
│ ├── Task 1.2: Define entity attributes
│ └── Task 1.3: Map relationships
├── SubGoal 2: Design API Endpoints
│ ├── Task 2.1: CRUD operations
│ ├── Task 2.2: Search functionality
│ └── Task 2.3: Authentication endpoints
└── SubGoal 3: Define Validation Rules
├── Task 3.1: Input validation
└── Task 3.2: Business rules
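For comparison, a DAG plan (format 2) is naturally represented as a list of steps with explicit dependency references; a small illustrative example:
dag_plan = {
    "goal": "Design Library API",
    "steps": [
        {"id": 1, "description": "Identify core entities", "dependencies": []},
        {"id": 2, "description": "Define entity attributes", "dependencies": [1]},
        {"id": 3, "description": "Design CRUD endpoints", "dependencies": [1]},
        {"id": 4, "description": "Define validation rules", "dependencies": [2, 3]},
    ],
}
# Steps 2 and 3 depend only on step 1, so they can execute in parallel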
Structured Plan Generation:
PLANNING_PROMPT = """
Create an execution plan for this task. Return a structured plan with:
1. GOAL: One sentence describing the end state
2. STEPS: Numbered list where each step has:
- Description: What to do
- Inputs: What information is needed
- Outputs: What this step produces
- Dependencies: Which steps must complete first (use step numbers)
Task: {task}
Example format:
GOAL: Create a REST API design for a library system
STEPS:
1. Description: Identify core entities
Inputs: Requirements document
Outputs: Entity list with attributes
Dependencies: None
2. Description: Define relationships
Inputs: Entity list
Outputs: ER diagram description
Dependencies: Step 1
"""
Plan Validation:
Before passing the plan to an agent for execution, you should validate it, for example:
def validate_plan(plan: dict) -> tuple[bool, list[str]]:
"""Validate plan structure and dependencies."""
errors = []
# Check for circular dependencies
if has_circular_deps(plan["steps"]):
errors.append("Circular dependency detected")
# Check all dependencies exist
step_ids = {s["id"] for s in plan["steps"]}
for step in plan["steps"]:
for dep in step.get("dependencies", []):
if dep not in step_ids:
errors.append(f"Step {step['id']} depends on non-existent step {dep}")
# Check inputs are available
available_outputs = set()
for step in topological_sort(plan["steps"]):
for required_input in step.get("inputs", []):
if required_input not in available_outputs and required_input != "initial":
errors.append(f"Step {step['id']} requires unavailable input: {required_input}")
available_outputs.update(step.get("outputs", []))
return len(errors) == 0, errors
Adaptive Re-planning:
In real life, task execution often deviates from the generated plan. It is recommended to build in a re-planning capability to account for changes during execution, for example:
async def execute_with_replanning(plan: dict, max_replans: int = 2):
replan_count = 0
for step in plan["steps"]:
try:
result = await execute_step(step)
step["result"] = result
except ExecutionError as e:
if replan_count >= max_replans:
raise RuntimeError(f"Max replanning attempts exceeded at step {step['id']}")
# Generate new plan from current state
new_plan = await planner.invoke({
"original_goal": plan["goal"],
"completed_steps": [s for s in plan["steps"] if "result" in s],
"failed_step": step,
"error": str(e)
})
plan = new_plan
replan_count += 1
return plan
When NOT to Use Planning Pattern:
- Task is straightforward enough for direct execution
- Planning overhead exceeds execution time
- Requirements are too vague for meaningful decomposition
- Real-time response is required
7. Multi-Agent Collaboration – Coordinated Teamwork
The Pattern: The Multi-Agent Collaboration pattern creates a system of multiple specialized agents that work together in a structured way, through defined communication protocols and interaction models, allowing the group to deliver a solution that would be impossible for any single agent.
Sample Use Cases:
To illustrate the Multi-Agent Collaboration pattern, I used LangGraph to demonstrate three different collaboration models:
- Sequential Pipeline: Research paper analysis (Researcher -> Critic -> Synthesizer)
- Parallel & Synthesis: Product launch campaign (Marketing + Content + Analyst -> Coordinator)
- Multi-Perspective Debate: Code review system (Security + Performance + Quality -> Synthesizer)
Sequential Pipeline: Research -> Critic -> Synthesizer
workflow.add_edge("researcher", "critic")
workflow.add_edge("critic", "summarizer")
Parallel & Synthesis: Marketing + Content + Analyst -> Coordinator
workflow.add_edge("marketing", "coordinator")
workflow.add_edge("content", "coordinator")
workflow.add_edge("analyst", "coordinator")
Multi-Perspective Debate: Security + Performance + Quality -> Synthesizer
workflow.add_edge("security_reviewer", "synthesizer")
workflow.add_edge("performance_reviewer", "synthesizer")
workflow.add_edge("quality_reviewer", "synthesizer")
Why It Matters: Real-life problems often require specialized agents working together in different collaboration structures. The LangGraph library makes it easy to implement any of these collaboration models.
State Management in Multi-Agent Systems:
LangGraph uses a shared state object that all agents read from and write to:
from typing import TypedDict, List, Annotated
from langgraph.graph import StateGraph
from langchain_core.messages import BaseMessage
import operator
class CollaborationState(TypedDict):
# Original input
task: str
# Agent outputs (each agent writes to their field)
research_output: str
critique: str
final_summary: str
# Shared conversation history (appended by all agents)
messages: Annotated[List[BaseMessage], operator.add]
# Metadata for coordination
iteration: int
status: str # "in_progress", "needs_revision", "complete"
Agent Role Definition Best Practices:
You should define strong agent role boundaries to prevent agents from duplicating work or providing generic feedback, for example:
SECURITY_REVIEWER_PROMPT = """
You are a senior security engineer reviewing code.
YOUR SCOPE - Focus ONLY on:
1. Authentication and authorization vulnerabilities
2. Input validation and injection risks (SQL, XSS, command injection)
3. Sensitive data exposure (logging, error messages, hardcoded secrets)
4. Dependency vulnerabilities (known CVEs)
5. Cryptographic issues (weak algorithms, improper key handling)
OUT OF SCOPE - DO NOT comment on:
- Code style or formatting
- Performance optimizations
- Documentation quality
- General code structure
OUTPUT FORMAT:
For each issue found:
- Severity: CRITICAL / HIGH / MEDIUM / LOW
- Location: File and line number
- Issue: Brief description
- Recommendation: Specific fix
If no security issues found, respond with: "No security vulnerabilities identified."
"""
PERFORMANCE_REVIEWER_PROMPT = """
You are a performance engineer reviewing code.
YOUR SCOPE - Focus ONLY on:
1. Algorithm complexity (O(n²) when O(n) is possible)
2. Database query efficiency (N+1 queries, missing indexes)
3. Memory usage (large object creation, memory leaks)
4. Caching opportunities
5. Async/concurrent execution opportunities
OUT OF SCOPE - DO NOT comment on:
- Security vulnerabilities
- Code style
- Documentation
OUTPUT FORMAT:
For each issue found:
- Impact: HIGH / MEDIUM / LOW
- Location: File and function
- Current: What the code does now
- Suggested: Specific optimization
- Expected improvement: Estimated gain
"""
Coordination Challenges and Solutions:
| Challenge | Example | Solution |
|---|---|---|
| Conflicting outputs | Security: “add auth” vs Performance: “reduce overhead” | Synthesizer with explicit conflict resolution rules |
| Information loss | Key details lost between agents | Structured hand-off format with required fields |
| Infinite loops | Agents keep requesting revisions | Max iteration limits, improvement thresholds |
| Redundant work | Multiple agents analyze same aspect | Clear scope boundaries, explicit “out of scope” |
Conflict Resolution in Synthesizer:
SYNTHESIZER_PROMPT = """
You are a technical lead synthesizing feedback from multiple reviewers.
Reviews received:
- Security Review: {security_review}
- Performance Review: {performance_review}
- Quality Review: {quality_review}
Your task:
1. Identify CONFLICTS where reviewers disagree or recommendations are mutually exclusive
2. For each conflict, decide the resolution based on these priorities:
- Security concerns ALWAYS take precedence
- Correctness over performance
- Maintainability over micro-optimizations
3. Create a UNIFIED action plan that:
- Lists all non-conflicting recommendations
- Explains conflict resolutions with rationale
- Prioritizes items as: MUST DO / SHOULD DO / NICE TO HAVE
Output a single, coherent improvement plan the developer can follow.
"""
Structured Hand-off Between Agents:
HANDOFF_TEMPLATE = """
## Agent Handoff Document
### Completed By: {previous_agent}
### Handing To: {next_agent}
### Summary of Work Done:
{work_summary}
### Key Findings:
{key_findings}
### Open Questions:
{open_questions}
### Recommendations for Next Agent:
{recommendations}
### Artifacts Produced:
{artifacts}
"""
When NOT to Use Multi-Agent Collaboration Pattern:
- Single perspective on task execution is sufficient
- Multi-agent coordination overhead exceeds benefits
- Agents would have highly overlapping responsibilities
- Task requires sharing deep context that’s hard to transfer between agents
Advanced Patterns (8-10) – Self-Improvement & Quality Assurance
Agentic systems need to remember information from past interactions, not only to provide a coherent and personalized user experience, but also to learn and self-improve using the collected data. The design patterns in this group enable agents to remember past conversations, learn from them, and improve over time.
8. Memory Management – Context Across Interactions
The Pattern: The Memory Management pattern allows agents to keep track of conversations, personalize responses, and learn from interactions. Agents rely on three memory types:
- Semantic Memory: Facts and knowledge (user preferences, domain knowledge)
- Episodic Memory: Past experiences (conversation history, previous tickets)
- Procedural Memory: Rules and strategies (company policies, protocols)
How It Works: LangChain offers ConversationBufferMemory to automatically inject the history of a single conversation into a prompt, while LangGraph enables advanced, long-term memory via the InMemoryStore:
from langgraph.store.memory import InMemoryStore
memory = InMemoryStore()
# Store semantic knowledge
await memory.aput(
namespace=("advisor", "semantic"),
key="tax_401k",
value={"concept": "401k", "info": "Tax-advantaged retirement..."}
)
# Retrieve when needed
knowledge = await memory.aget(("advisor", "semantic"), "tax_401k")
Sample Use Case: Financial advisor chat bot:
- Remembers user’s investment preferences (semantic memory)
- Recalls past conversations (episodic memory)
- Follows fiduciary duty rules (procedural memory)
Why It Matters: Without a memory mechanism, agents are stateless. They are unable to maintain conversational context, learn from experience, or personalize responses for users.
Agent Memory Architecture Decisions:
Your choice of storage backend for an agentic application depends on the use-case requirements:
| Backend | Persistence | Scalability | Query Types | Best For |
|---|---|---|---|---|
| InMemoryStore | Session only | Single instance | Key-value | Prototyping, demos |
| Redis | Configurable | High (clustered) | Key-value, TTL | Production, multi-instance |
| PostgreSQL + pgvector | Yes | High | SQL + semantic | Complex queries + similarity |
| Pinecone/Weaviate | Yes | Very high | Semantic only | Large-scale retrieval |
| SQLite | Yes | Low | SQL | Desktop apps, edge comp. |
Memory Retrieval Strategies:
Keep in mind that how you retrieve memories can affect response quality, for example:
# Strategy 1: Recency-based (last N interactions)
async def get_recent_memories(user_id: str, n: int = 5):
memories = await memory.alist(namespace=("user", user_id, "episodic"))
return sorted(memories, key=lambda m: m["timestamp"], reverse=True)[:n]
# Strategy 2: Relevance-based (semantic similarity)
async def get_relevant_memories(user_id: str, query: str, n: int = 5):
query_embedding = embed_model.embed(query)
memories = await memory.asearch(
namespace=("user", user_id, "episodic"),
query_embedding=query_embedding,
top_k=n
)
return memories
# Strategy 3: Hybrid (recent + relevant)
async def get_hybrid_memories(user_id: str, query: str):
recent = await get_recent_memories(user_id, n=3)
relevant = await get_relevant_memories(user_id, query, n=3)
# Deduplicate and merge
seen_ids = set()
combined = []
for mem in recent + relevant:
if mem["id"] not in seen_ids:
combined.append(mem)
seen_ids.add(mem["id"])
return combined
Memory Types Implementation:
class FinancialAdvisorMemory:
def __init__(self, store: InMemoryStore, user_id: str):
self.store = store
self.user_id = user_id
# SEMANTIC: Facts and knowledge
async def store_user_preference(self, key: str, value: dict):
await self.store.aput(
namespace=("advisor", self.user_id, "semantic"),
key=key,
value={"type": "preference", "data": value, "updated": datetime.now().isoformat()}
)
# EPISODIC: Past interactions
async def log_interaction(self, interaction: dict):
interaction_id = f"interaction_{datetime.now().timestamp()}"
await self.store.aput(
namespace=("advisor", self.user_id, "episodic"),
key=interaction_id,
value={"type": "interaction", "data": interaction, "timestamp": datetime.now().isoformat()}
)
# PROCEDURAL: Rules and strategies
async def get_compliance_rules(self) -> list:
rules = await self.store.aget(
namespace=("advisor", "global", "procedural"),
key="compliance_rules"
)
return rules["data"] if rules else []
Memory Consolidation:
The volume of episodic memories can grow significantly over time. Therefore, detailed episodic memories should be periodically compressed into summarized semantic knowledge to manage the size limits of the context window, for example:
async def consolidate_memories(user_id: str, days_old: int = 30):
"""Compress old episodic memories into semantic summaries."""
cutoff = datetime.now() - timedelta(days=days_old)
old_memories = await get_memories_before(user_id, cutoff)
if not old_memories:
return
# Group by topic
grouped = group_by_topic(old_memories)
for topic, memories in grouped.items():
# Generate summary
summary = llm.invoke(f"""
Summarize these past interactions about {topic}:
{format_memories(memories)}
Extract:
1. Key facts learned about the user
2. Preferences expressed
3. Important decisions made
""")
# Store as semantic memory
await memory.aput(
namespace=("user", user_id, "semantic"),
key=f"consolidated_{topic}",
value={"summary": summary, "source_count": len(memories), "consolidated_at": datetime.now().isoformat()}
)
# Archive or delete old episodic memories
for mem in memories:
await memory.adelete(namespace=("user", user_id, "episodic"), key=mem["id"])
When NOT to Use Memory Management Pattern:
- Stateless interactions are acceptable in your use case (e.g., simple Q&A chat)
- Privacy requirements prohibit storing user data
- Context window size is sufficient for conversation history for your use case
- Memory maintenance complexity exceeds benefits
9. Learning & Adapting – Self-Improvement Through Benchmarking
The Pattern: The Learning & Adapting pattern enables agents to evolve iteratively by autonomously improving their parameters, or even their own code, based on test results. Without this ability, their performance can degrade when faced with novel tasks.
How It Works: The agent follows a Benchmark -> Analyze -> Improve cycle:
for iteration in range(max_iterations):
# Run benchmark tests
test_results = benchmark.run(current_code)
# Calculate performance score (success rate, speed, complexity)
score = calculate_score(test_results)
# If good enough, stop
if score >= threshold:
break
# Use LLM to generate improved version
improved_code = llm.invoke(f"""
Current code: {current_code}
Test failures: {test_results.failures}
Generate improved version.
""")
current_code = improved_code
Sample Use Case: An agent that improves its own code through cycles of:
- Testing current implementation
- Analyzing performance and failures
- Generating improved version
- Selecting best version for next iteration
Scoring Formula: The best version is selected using a weighted combination of three factors:
score = 0.5 × success_rate + 0.3 × speed + 0.2 × simplicity
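A possible implementation of that scoring function, assuming the benchmark results carry pass counts and total timing and that speed and simplicity are normalized to the 0-1 range (the normalization details are assumptions; only the 0.5/0.3/0.2 weights come from the text):
def calculate_score(test_results: dict, baseline_time_ms: float = 1000.0,
                    baseline_loc: int = 100) -> float:
    """Weighted score: 0.5 * success_rate + 0.3 * speed + 0.2 * simplicity."""
    total = test_results["passed"] + test_results["failed"] + test_results["errors"]
    success_rate = test_results["passed"] / total if total else 0.0
    # Faster than the baseline -> closer to 1.0 (clamped to [0, 1])
    speed = min(1.0, max(0.0, 1.0 - test_results["total_time_ms"] / baseline_time_ms))
    # Shorter code -> closer to 1.0; lines_of_code is an assumed, separately tracked metric
    loc = test_results.get("lines_of_code", baseline_loc)
    simplicity = min(1.0, max(0.0, 1.0 - loc / baseline_loc))
    return 0.5 * success_rate + 0.3 * speed + 0.2 * simplicity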
Why It Matters: Demonstrates meta-learning – an agent that autonomously improves its behavior based on new data and iterations. The pattern applies to prompt engineering, hyperparameter tuning, and automated optimization.
Designing Effective Benchmarks:
The quality of your benchmark determines the quality of agent learning. A comprehensive benchmark suite should therefore include different test types:
| Test Type | Purpose | Example |
|---|---|---|
| Correctness | Does output match expected? | sort([3,1,2]) -> [1,2,3] |
| Edge cases | Handles boundaries? | Empty list, single element, duplicates, etc. |
| Performance | Meets speed requirements? | Sort 10,000 elements in < 100ms |
| Robustness | Handles bad input? | None, wrong types, malformed data |
| Scale | Works at production volume? | 1M element sort |
Comprehensive Benchmark Example:
SORTING_BENCHMARK = [
# Correctness tests
{"name": "basic_sort", "input": [3, 1, 4, 1, 5], "expected": [1, 1, 3, 4, 5]},
{"name": "already_sorted", "input": [1, 2, 3, 4, 5], "expected": [1, 2, 3, 4, 5]},
{"name": "reverse_sorted", "input": [5, 4, 3, 2, 1], "expected": [1, 2, 3, 4, 5]},
# Edge cases
{"name": "empty_list", "input": [], "expected": []},
{"name": "single_element", "input": [42], "expected": [42]},
{"name": "all_same", "input": [7, 7, 7, 7], "expected": [7, 7, 7, 7]},
{"name": "negative_numbers", "input": [-3, -1, -4], "expected": [-4, -3, -1]},
{"name": "mixed_signs", "input": [-2, 0, 3, -1], "expected": [-2, -1, 0, 3]},
# Performance tests (check timing separately)
{"name": "large_random", "input": list(range(10000, 0, -1)), "expected": list(range(1, 10001)), "max_ms": 100},
{"name": "large_uniform", "input": [1] * 10000, "expected": [1] * 10000, "max_ms": 50},
]
def run_benchmark(code: str, benchmark: list) -> dict:
"""Execute code against benchmark suite."""
results = {
"passed": 0,
"failed": 0,
"errors": 0,
"failures": [],
"total_time_ms": 0
}
exec_globals = {}
exec(code, exec_globals)
sort_func = exec_globals.get("sort") or exec_globals.get("custom_sort")
for test in benchmark:
try:
start = time.perf_counter()
result = sort_func(test["input"].copy())
elapsed_ms = (time.perf_counter() - start) * 1000
results["total_time_ms"] += elapsed_ms
if result != test["expected"]:
results["failed"] += 1
results["failures"].append({
"test": test["name"],
"expected": test["expected"][:5], # Truncate for logging
"got": result[:5] if result else None
})
elif "max_ms" in test and elapsed_ms > test["max_ms"]:
results["failed"] += 1
results["failures"].append({
"test": test["name"],
"reason": f"Too slow: {elapsed_ms:.1f}ms > {test['max_ms']}ms"
})
else:
results["passed"] += 1
except Exception as e:
results["errors"] += 1
results["failures"].append({
"test": test["name"],
"error": str(e)
})
return results
Avoiding Local Optima:
Keep in mind that the agent learning loop can get stuck optimizing for specific failing tests while regressing on others. Therefore, you should always track overall agent performance, for example:
def adaptive_learning_loop(initial_code: str, benchmark: list, max_iterations: int = 10):
current_code = initial_code
best_code = initial_code
best_score = 0.0
# Track performance across ALL tests, not just failing ones
history = []
for i in range(max_iterations):
# Randomize test order to prevent order-dependent optimizations
shuffled_benchmark = random.sample(benchmark, len(benchmark))
results = run_benchmark(current_code, shuffled_benchmark)
score = calculate_score(results)
history.append({"iteration": i, "score": score, "passed": results["passed"]})
# Track best overall, not just most recent
if score > best_score:
best_score = score
best_code = current_code
if results["failed"] == 0 and results["errors"] == 0:
return best_code, history, "All tests passed"
# Periodically reintroduce tests that were passing
# to catch regressions
if i > 0 and i % 3 == 0:
regression_check = run_benchmark(current_code, benchmark)
if regression_check["passed"] < history[0]["passed"]:
# Regression detected, revert to best
current_code = best_code
continue
# Generate improvement focused on failures
current_code = llm.invoke(f"""
Current code:
```python
{current_code}
```
Test failures:
{json.dumps(results["failures"], indent=2)}
Previous attempts: {len(history)}
Best score achieved: {best_score:.2f}
Generate an improved version that:
1. Fixes the failing tests
2. Does NOT break currently passing tests
3. Maintains or improves performance
""")
return best_code, history, f"Max iterations reached (best score: {best_score:.2f})"
Scoring Formula Variations:
Different tasks typically need different scoring weights, for example:
# Correctness-focused (typical for most cases)
score = 0.7 * success_rate + 0.2 * (1 - normalized_time) + 0.1 * simplicity
# Performance-critical (real-time systems)
score = 0.4 * success_rate + 0.5 * (1 - normalized_time) + 0.1 * simplicity
# Maintainability-focused (enterprise code)
score = 0.5 * success_rate + 0.1 * (1 - normalized_time) + 0.4 * simplicity
When NOT to Use Learning & Adapting Pattern:
- Task doesn’t have measurable success criteria
- Cost of creating benchmark tests exceeds the expected benefits
- Solution space is too large for an iterative search to succeed quickly
- Human review is required anyway
10. Goal Setting & Monitoring – Quality Assurance Through Review Cycles
The Pattern: The Goal Setting & Monitoring pattern is about setting a specific goal for an agent and providing the means to track progress and determine whether the goal has been achieved.
How It Works: The pattern is demonstrated using two agents:
- Developer Agent:
- Analyzes requirements
- Creates implementation plan
- Writes Python code
- Revises code based on feedback
- Manager Agent:
- Reviews code against requirements
- Grades the code across 4 criteria (0-100 scale):
- Requirements coverage (40%)
- Code quality (30%)
- Error handling (15%)
- Code documentation (15%)
- Provides prioritized, actionable feedback
Agents collaborate via the following iteration cycle (sketched in code after this list):
- Developer agent that creates implementation plans and generates code
- Manager agent that monitors progress, reviews code, and provides feedback
- Iterative improvement cycle based on manager feedback
- Grade-based progress tracking – iterations stop once the grade exceeds 85
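A sketch of that cycle in code, using the manager prompt and rubric shown later in this section; extract_total_grade is a hypothetical helper that parses the TOTAL line from the review, and the prompts are illustrative:
def develop_with_review(requirements: str, target_grade: int = 85, max_iterations: int = 5):
    code = developer_llm.invoke(f"Implement this in Python:\n{requirements}").content
    grade = 0
    for _ in range(max_iterations):
        review = manager_llm.invoke(MANAGER_REVIEW_PROMPT.format(
            requirements=requirements, code=code, rubric=GRADING_RUBRIC)).content
        grade = extract_total_grade(review)  # parses "TOTAL: NN/100" from the review
        if grade > target_grade:
            break  # goal reached, stop iterating
        code = developer_llm.invoke(
            f"Revise the code to address this feedback:\n{review}\n\nCurrent code:\n{code}").content
    return code, grade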
Sample Use Case: REST API client implementation based on simple requirements: retry logic, rate limiting, error handling
Why It Matters: The pattern provides a standardized solution by giving the LLM a sense of purpose and self-assessment. The automated code review and quality assurance use case shows how multi-agent collaboration enables complex quality-control workflows.
Designing Effective Grading Rubrics:
Remember that vague prompts, such as vague rubrics, lead to inconsistent grading. Therefore, be explicit in the prompt about what each score means:
GRADING_RUBRIC = {
"requirements_coverage": {
"weight": 0.40,
"criteria": {
"90-100": "All requirements fully implemented with edge cases handled",
"80-89": "All core requirements implemented, minor edge cases missing",
"70-79": "Most requirements implemented, some gaps in functionality",
"60-69": "Partial implementation, missing key requirements",
"0-59": "Fundamental requirements not addressed"
}
},
"code_quality": {
"weight": 0.30,
"criteria": {
"90-100": "Clean, idiomatic Python following PEP 8, well-structured with appropriate abstractions",
"80-89": "Readable and maintainable, minor style inconsistencies",
"70-79": "Functional but could be cleaner, some code smells",
"60-69": "Works but hard to maintain, significant style issues",
"0-59": "Poorly structured, major code smells, difficult to understand"
}
},
"error_handling": {
"weight": 0.15,
"criteria": {
"90-100": "Comprehensive error handling with specific exceptions, informative messages, graceful degradation",
"80-89": "Good error handling for common cases, reasonable messages",
"70-79": "Basic error handling present, generic exceptions",
"60-69": "Minimal error handling, may crash on bad input",
"0-59": "No error handling, crashes easily"
}
},
"documentation": {
"weight": 0.15,
"criteria": {
"90-100": "Complete docstrings, clear comments for complex logic, usage examples",
"80-89": "Good docstrings for public functions, adequate comments",
"70-79": "Basic documentation present, some gaps",
"60-69": "Minimal documentation, unclear function purposes",
"0-59": "No documentation"
}
}
}
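Given per-criterion scores parsed from the manager's review, the weighted total follows directly from this rubric; a small helper sketch:
def weighted_total(scores: dict) -> float:
    """Combine per-criterion scores (0-100) using the rubric weights; returns 0-100."""
    return sum(scores[name] * spec["weight"] for name, spec in GRADING_RUBRIC.items())

# Example: weighted_total({"requirements_coverage": 90, "code_quality": 80,
#                          "error_handling": 75, "documentation": 70}) -> 81.75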
Manager Prompt with Explicit Rubric:
MANAGER_REVIEW_PROMPT = """
You are a senior engineering manager reviewing code against specific requirements.
## Requirements:
{requirements}
## Code to Review:
```python
{code}
```
## Grading Rubric:
{rubric}
## Your Task:
1. Grade each criterion using the rubric descriptions
2. Calculate weighted total: (req × 0.40) + (quality × 0.30) + (errors × 0.15) + (docs × 0.15)
3. Provide feedback in two categories:
### PRIORITY FEEDBACK (Must fix before passing):
- Critical bugs or missing requirements
- Security vulnerabilities
- Major functionality gaps
### SECONDARY IMPROVEMENTS (Nice to have):
- Code style refinements
- Documentation enhancements
- Performance optimizations
## Output Format:
REQUIREMENTS_COVERAGE: [score]/100
Justification: [one line]
CODE_QUALITY: [score]/100
Justification: [one line]
ERROR_HANDLING: [score]/100
Justification: [one line]
DOCUMENTATION: [score]/100
Justification: [one line]
TOTAL: [weighted score]/100
PRIORITY FEEDBACK:
1. [issue]
2. [issue]
SECONDARY IMPROVEMENTS:
1. [suggestion]
2. [suggestion]
"""
When NOT to Use Goal Setting & Monitoring Pattern:
- Task doesn’t have clear success criteria
- Single-pass generation is sufficient to generate response
- Human review of response is required regardless
- Iteration cost (API calls, time) exceeds benefits
11. Retrieval-Augmented Generation (RAG) – Access Context Specific Data
The Pattern: The RAG pattern enables an LLM to access and integrate external, current, and context-specific information to enhance the accuracy, relevance, and factual grounding of its responses.
How It Works:
- User request is analyzed to determine question type (factual, comparison, overview, etc.)
- Based on the question type, system determines what information is required to answer it
- Simple RAG involves Retrieval (searching a knowledge base for relevant content) and Augmentation (adding the retrieved content, with citations, to the LLM prompt) – see the sketch after this list
- GraphRAG additionally leverages connections between entities (nodes in a knowledge graph), which allows the system to answer questions that require knowledge of relationships between different pieces of information or documents
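The simple retrieve-then-augment flow can be sketched in a few lines of LangChain. This is a minimal sketch assuming an in-memory FAISS index, OpenAI embeddings, and gpt-4o-mini; it is illustrative only and not part of the testing framework:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate

# Hypothetical in-memory knowledge base; a real system would ingest documents into a vector store
docs = [
    "LangGraph models agent workflows as state graphs.",
    "Reflection pairs a generator with a critic to improve output quality.",
]
retriever = FAISS.from_texts(docs, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below and cite the snippets you used.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")

def rag_answer(question: str) -> str:
    retrieved = retriever.invoke(question)                        # Retrieval
    context = "\n\n".join(doc.page_content for doc in retrieved)  # Augmentation
    return (prompt | llm).invoke({"context": context, "question": question}).content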
Why It Matters:
- Grounds responses in facts: Reduces hallucinations by anchoring answers in retrieved source material
- Enables domain-specific knowledge: Provides access to proprietary documents, databases, or specialized corporate information
- Dynamic knowledge: Enables updating the knowledge base used by the LLM without retraining the model
- Transparency: Can cite sources for information used to generate answers
Sample Use Cases:
- Question answering over enterprise documents
- Customer support with knowledge base integration
- Research assistants with access to a library of scientific papers
- Legal or medical applications requiring factual accuracy of generated answers
Note: While RAG is a very important agentic design pattern, it’s not implemented in the testing framework. I have a separate GraphRAG project that uses this design pattern to build an application demonstrating advanced retrieval techniques, including graph-based knowledge representation, multi-hop reasoning, and hybrid search strategies. You can find more information about this project in the blogs:
- Building a GraphRAG System – Core Infrastructure & Document Ingestion
- GraphRAG Part 2 – Cross-Doc & Sub-graph Extraction, Multi-Vector Entity Representation
- GraphRAG Part 3 – Intelligent MVR, Query Routing and Context Generation
- GraphRAG Part 4 – Community Detection and Embedding, Search and Hybrid Retrieval Integration
Why Separate?: GraphRAG requires significant infrastructure (object storage for documents, a graph database, vector databases, embedding models, indexing pipelines) that deserves dedicated attention. The GraphRAG project explores these components in depth.
Combining Patterns: Real-World Applications
Individual patterns are building blocks for agentic applications. In a real-world production application, you would typically need to combine multiple patterns:
Example: AI Code Assistant

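One possible shape for an AI code assistant combines Routing (classify the request) with Reflection (iterate on the generated code until a quality bar or iteration cap is reached). A minimal LangGraph sketch with placeholder node logic, not the framework's actual implementation:
from typing import TypedDict, Literal
from langgraph.graph import StateGraph, START, END

class AssistantState(TypedDict):
    request: str
    route: str
    draft: str
    iterations: int

# Placeholder node functions -- in a real assistant these would call an LLM
def classify(state: AssistantState) -> dict:
    # Routing: decide whether this is a code-generation or an explanation request
    return {"route": "generate" if "write" in state["request"].lower() else "explain"}

def generate(state: AssistantState) -> dict:
    return {"draft": f"# code for: {state['request']}", "iterations": state["iterations"] + 1}

def explain(state: AssistantState) -> dict:
    return {"draft": f"Explanation of: {state['request']}"}

def reflect(state: AssistantState) -> Literal["generate", "done"]:
    # Reflection: a critic LLM would judge the draft; here we simply cap the iterations
    return "done" if state["iterations"] >= 2 else "generate"

graph = StateGraph(AssistantState)
graph.add_node("classify", classify)
graph.add_node("generate", generate)
graph.add_node("explain", explain)
graph.add_edge(START, "classify")
graph.add_conditional_edges("classify", lambda s: s["route"], {"generate": "generate", "explain": "explain"})
graph.add_conditional_edges("generate", reflect, {"generate": "generate", "done": END})
graph.add_edge("explain", END)
app = graph.compile()

print(app.invoke({"request": "Write a parser", "iterations": 0}))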
Example: Research Analyst Agent

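Similarly, a research analyst agent could combine Parallelization with a final Synthesis step: run the independent analyses concurrently, then merge them in a single call. A rough sketch assuming an OpenAI chat model through LangChain; the prompts and helper name are illustrative:
import asyncio
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

async def analyze(paper: str) -> str:
    # Parallelization: summary, question generation, and key-term extraction run concurrently
    summary, questions, terms = await asyncio.gather(
        llm.ainvoke(f"Summarize this paper:\n{paper}"),
        llm.ainvoke(f"Generate three research questions about this paper:\n{paper}"),
        llm.ainvoke(f"Extract the key terms from this paper:\n{paper}"),
    )
    # Synthesis: combine the independent analyses into one report
    report = await llm.ainvoke(
        "Write an analyst report from these sections:\n\n"
        f"{summary.content}\n\n{questions.content}\n\n{terms.content}"
    )
    return report.content

# asyncio.run(analyze(open("paper.txt").read()))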
Pattern Combination Guidelines:
| Combination | When to Use | Watch Out For |
|---|---|---|
| Routing + Specialized Chains | Multiple distinct request types | Misclassification cascades |
| Planning + Multi-Agent | Complex tasks needing expertise | Coordination overhead |
| Tool Use + Reflection | External data needs verification | Tool failures during reflection |
| Memory + Any Pattern | Personalization needed | Memory retrieval latency |
| Parallelization + Synthesis | Independent analyses to combine | Context window limits |
Anti-Pattern: Over-Engineering
Not every agentic application needs every design pattern. A simple FAQ chatbot using the Routing and Tool Use patterns (for knowledge base search) will likely be more effective than a complex multi-agent system.
Rule of thumb: Start with the simplest pattern that could work.
Add complexity only when you have evidence it’s needed.
Key Insights and Comparison Results
Cost and Latency Implications
Understanding the resource implications of each pattern helps in architecture decisions:
| Pattern | API Calls per Request | Relative Cost | Latency Impact |
|---|---|---|---|
| Chaining | N (number of steps) | Medium | Additive (+N calls) |
| Routing | 1 (classify) + handler | Low-Medium | +1 classification call |
| Parallelization | N (parallel tasks) | Medium-High | Reduced (concurrent) |
| Reflection | 2-6 (iterations × 2) | High | 2× per iteration |
| Tool Use | 1 + tool calls | Low-Medium | +tool latency |
| Planning | 2+ (plan + execute) | Medium | +planning phase |
| Multi-Agent | 3-10+ (varies) | Highest | Depends on topology |
| Memory | 1 + retrieval | Low-Medium | +retrieval latency |
| Learning & Adapting | 5-20+ (iterations) | Very High | Minutes to hours |
| Goal Monitoring | 4-14 (iterations × 2) | High | Minutes |
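As a rough illustration of how the "API Calls per Request" column translates into spend, a back-of-the-envelope estimate is simply calls per request × requests per month × cost per call. The numbers below are made-up placeholders, not measured prices:
def monthly_cost(calls_per_request: float, requests_per_month: int, cost_per_call: float) -> float:
    # Spend estimate: calls/request x requests/month x $/call
    return calls_per_request * requests_per_month * cost_per_call

# Hypothetical: Reflection at ~4 calls/request, 100k requests/month, $0.01 per call
print(f"${monthly_cost(4, 100_000, 0.01):,.0f} per month")  # -> $4,000 per month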
Cost Optimization Strategies
1. Model Tiering: Use cheaper LLM models for simpler tasks
# Cheap, fast model for classification; expensive model only for the final output
classifier = ModelFactory.create("gpt-4o-mini") # Cheap, fast
generator = ModelFactory.create("gpt-4o") # Expensive, high quality
2. Early Termination: Stop iterations when the output is good enough
if quality_score >= 0.85:
break # Don't over-optimize
3. Caching: Store and reuse common results
from functools import lru_cache

@lru_cache(maxsize=1000)
def classify_intent(request: str) -> str:
    return classifier.invoke(request).content
4. Batching: Combine multiple small requests
# Run the 10 small requests concurrently instead of sequentially
results = await asyncio.gather(*[process(item) for item in items])
Latency Budget Example:
For a 3-second response time budget:
| Component | Budget | Design Pattern Choice |
|---|---|---|
| Classification | 300ms | Keyword matching (no LLM) |
| Main processing | 2000ms | Single LLM call or 2-step chain |
| Tool calls | 500ms | Max 1-2 fast tools |
| Synthesis | 200ms | Lightweight post-processing |
For a 30-second budget, which is usually acceptable for complex tasks:
- Full planning phase
- 2-3 reflection iterations
- Multi-agent collaboration (sequential)
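One way to make a budget like the 3-second example enforceable is to wrap each stage in an explicit timeout and degrade gracefully when a stage overruns. A minimal asyncio sketch; the stage coroutines (classify, process, call_tool, synthesize) are assumed placeholders, not framework code:
import asyncio

async def run_with_budget(request: str) -> str:
    route = await asyncio.wait_for(classify(request), timeout=0.3)            # 300 ms classification
    draft = await asyncio.wait_for(process(route, request), timeout=2.0)      # 2 s main processing
    try:
        extra = await asyncio.wait_for(call_tool(route, request), timeout=0.5)  # 500 ms tool budget
    except asyncio.TimeoutError:
        extra = ""  # Degrade gracefully if the tool overruns its budget
    return await asyncio.wait_for(synthesize(draft, extra), timeout=0.2)      # 200 ms synthesis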
LLM Model Performance
After implementing these 10 patterns across multiple LLM providers, I noticed that model performance varies by pattern:
- Code generation (Reflection, Learning & Adapting): GPT-4o and Claude Sonnet excel
- Structured planning (Planning, Goal Monitoring): Claude Sonnet provides more detailed plans
- Tool use: GPT-4o has more reliable function calling
- Multi-step reasoning: All frontier models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) perform well
The Framework’s Value: Having a consistent testing harness made it possible to quickly prototype patterns, compare models objectively, and identify which combinations work best for specific tasks.
Conclusion and What’s Next
This project demonstrates that agentic design patterns are not just theoretical concepts. Like software design patterns, they’re practical solutions with measurable benefits. The testing framework implements 10 core patterns using LangChain and LangGraph, with the ability to run and compare them across OpenAI, Anthropic, and Google models.
Key Takeaways:
- Design Patterns matter: Structured workflows significantly outperform single LLM calls
- Different tasks need different patterns: There is no one-size-fits-all solution
- Model choice matters: Different LLMs have different strengths
- Frameworks accelerate development: LangChain and LangGraph abstract model APIs and provide tools that dramatically speed up workflow development
In this project I focused on intra-application agent communication, where agents collaborate within a single application runtime. But building an agentic system often involves inter-application communication:
- Inter-Agent Communication (A2A): Google’s protocol for agents from different systems to discover, negotiate, and coordinate with each other
- Model Context Protocol (MCP): Anthropic’s protocol for agents to access tools and resources across application boundaries
- Hybrid architectures: Combining intra-application patterns (like those implemented here) with inter-application protocols
Part 2 will explore these communication paradigms, their trade-offs, and how they complement the design patterns covered in this post.
The framework code is open source and shared in the GitHub repository. Try running the patterns, compare models, and adapt them to your use cases. The future of AI applications is agentic – these patterns are your starting point.