
LLM Prompt Engineering for Production Systems

Move beyond toy demos to production-ready LLM integration: structured outputs, few-shot prompting, prompt chaining, cost control, and evaluation frameworks that actually work.


Most prompt engineering content is written for demos. Production is different: you need consistent structured outputs, cost predictability, graceful degradation, and a way to know when your prompts stop working. Here's what I've learned running LLM pipelines in production.

The Foundation: Structured Outputs

The first luxury you lose in production is one demos never miss: free-form output. You can't parse free-form text reliably, so every production LLM call should return structured data.

python
from anthropic import Anthropic
import json
 
client = Anthropic()
 
EXTRACT_SCHEMA = {
    "type": "object",
    "required": ["entities", "sentiment", "summary"],
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "type", "confidence"],
                "properties": {
                    "name": {"type": "string"},
                    "type": {"enum": ["person", "org", "location", "product"]},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
                }
            }
        },
        "sentiment": {"enum": ["positive", "negative", "neutral", "mixed"]},
        "summary": {"type": "string", "maxLength": 200}
    }
}
 
def extract_entities(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are an entity extraction API. 
        Always respond with valid JSON matching the provided schema exactly.
        Do not include any text outside the JSON object.""",
        messages=[
            {
                "role": "user",
                "content": f"""Extract entities from this text. 
                Return JSON matching this schema: {json.dumps(EXTRACT_SCHEMA)}
                
                Text: {text}"""
            }
        ]
    )
 
    # Parse and validate
    result = json.loads(response.content[0].text)
    # Add schema validation here (jsonschema, pydantic, etc.)
    return result

Use Pydantic for validation

Wrap your JSON parsing in a Pydantic model. If the LLM returns a field with the wrong type, Pydantic will catch it before it causes a downstream error. This makes debugging much easier than tracing a TypeError through your pipeline.
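
A minimal sketch of what that looks like, assuming Pydantic v2 and mirroring the extraction schema above (the model and function names here are illustrative, not part of any SDK):

python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class Entity(BaseModel):
    name: str
    type: Literal["person", "org", "location", "product"]
    confidence: float = Field(ge=0, le=1)

class ExtractionResult(BaseModel):
    entities: list[Entity]
    sentiment: Literal["positive", "negative", "neutral", "mixed"]
    summary: str = Field(max_length=200)

def parse_extraction(raw_json: str) -> ExtractionResult:
    try:
        # Parses and validates in one step (Pydantic v2)
        return ExtractionResult.model_validate_json(raw_json)
    except ValidationError as exc:
        # Surface the validation error early instead of a TypeError downstream
        raise ValueError(f"LLM returned an invalid extraction payload: {exc}") from exc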

Few-Shot Prompting That Scales

The naive approach to few-shot prompting puts all examples in the system prompt. This works until you have 20 examples and you're paying for them on every call.

A better pattern: store the examples in a database and retrieve only the most relevant ones for each request using embeddings:

python
import numpy as np
from typing import Optional
 
class DynamicFewShotPrompter:
    def __init__(self, examples: list[dict], embedding_fn):
        self.examples = examples
        self.embedding_fn = embedding_fn
        # Pre-compute embeddings for all examples
        self.example_embeddings = np.array([
            embedding_fn(ex["input"])
            for ex in examples
        ])
 
    def get_relevant_examples(self, query: str, k: int = 3) -> list[dict]:
        query_embedding = self.embedding_fn(query)
 
        # Cosine similarity
        similarities = np.dot(self.example_embeddings, query_embedding) / (
            np.linalg.norm(self.example_embeddings, axis=1) *
            np.linalg.norm(query_embedding)
        )
 
        top_k_indices = np.argsort(similarities)[-k:][::-1]
        return [self.examples[i] for i in top_k_indices]
 
    def build_prompt(self, query: str) -> str:
        examples = self.get_relevant_examples(query)
        examples_text = "\n\n".join([
            f"Input: {ex['input']}\nOutput: {ex['output']}"
            for ex in examples
        ])
        return f"Examples:\n{examples_text}\n\nNow process:\n{query}"

This keeps your prompt size constant regardless of how many total examples you have.
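
Usage, continuing from the class above, might look like this; get_embedding is a stand-in for whichever embedding model or API you actually use:

python
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    """Placeholder: call your embedding model here and return a 1-D vector."""
    ...

examples = [
    {"input": "Refund my last invoice", "output": '{"category": "billing"}'},
    {"input": "App crashes on login", "output": '{"category": "technical"}'},
    # ...hundreds more, loaded from your database
]

prompter = DynamicFewShotPrompter(examples, embedding_fn=get_embedding)
prompt = prompter.build_prompt("I was charged twice this month")
# The prompt now carries only the 3 most relevant examples, not all of them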

Prompt Chaining for Complex Tasks

Don't try to do everything in one prompt. Break complex tasks into a chain:

python
def analyze_support_ticket(ticket: str) -> dict:
    # Step 1: Classify intent (cheap, fast model)
    intent = classify_intent(ticket)  # Uses claude-haiku-4-5
 
    if intent["category"] == "billing":
        # Step 2: Extract billing details
        billing_info = extract_billing_info(ticket)
        # Step 3: Generate response using billing context
        response = generate_billing_response(ticket, billing_info)
    elif intent["category"] == "technical":
        # Different chain for technical issues
        diagnosis = diagnose_technical_issue(ticket)
        response = generate_technical_response(ticket, diagnosis)
    else:
        # Generic response for unknown intent
        response = generate_generic_response(ticket)
 
    return {
        "intent": intent,
        "response": response,
        "escalate": intent["confidence"] < 0.7
    }

Use cheaper, faster models for classification and routing; reserve expensive models for generation.
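
The helper functions in the chain (classify_intent and friends) are where that split happens. A minimal sketch of the classification step, reusing the Anthropic client and json import from earlier (the category list and JSON contract are illustrative):

python
def classify_intent(ticket: str) -> dict:
    """Cheap, fast routing step on a small model."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # classification doesn't need the big model
        max_tokens=150,
        system='Classify the support ticket. Respond with JSON only, e.g. '
               '{"category": "billing", "confidence": 0.92}. '
               'Valid categories: billing, technical, other.',
        messages=[{"role": "user", "content": ticket}],
    )
    return json.loads(response.content[0].text)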

Cost Control in Production

LLM costs grow faster than you expect. Three levers:

1. Cache identical requests. Use Redis with a TTL. The cache key should be a hash of your prompt + model parameters:

python
import hashlib
import json
import redis
 
cache = redis.Redis()
 
def cached_llm_call(prompt: str, model: str = "claude-opus-4-6", ttl: int = 3600) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"prompt": prompt, "model": model}).encode()
    ).hexdigest()
 
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
 
    response = make_llm_call(prompt, model)
    cache.setex(cache_key, ttl, response)
    return response

For document summarization workloads, the cache hit rate can reach 40–60%.

2. Route by complexity. Use a cheap model to assess if the request needs an expensive model:

python
def route_request(query: str) -> str:
    """Returns model name based on complexity assessment."""
    # Simple heuristics first
    if len(query) < 100 and "?" not in query:
        return "claude-haiku-4-5-20251001"
 
    complexity = assess_complexity(query)  # Fast call to Haiku
    if complexity["score"] > 0.8:
        return "claude-opus-4-6"
    elif complexity["score"] > 0.4:
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"

3. Set hard token limits. Always set max_tokens to the minimum you need for the task. Unbounded responses waste money and slow your pipeline.

Evaluation: Know When Your Prompts Break

This is the part everyone skips. Your prompt will break in production, and you won't know unless you have an eval framework.

python
import json
from pathlib import Path
from dataclasses import dataclass
 
@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str
 
class PromptEvaluator:
    def __init__(self, eval_dataset_path: str):
        self.cases = json.loads(Path(eval_dataset_path).read_text())
 
    def run(self, prompt_fn) -> dict:
        results = []
        for case in self.cases:
            output = prompt_fn(case["input"])
            score = self.score(output, case["expected_output"], case["checks"])
            results.append(score)
 
        return {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "avg_score": sum(r.score for r in results) / len(results),
            "failures": [r for r in results if not r.passed],
        }
 
    def score(self, output: dict, expected: dict, checks: list[str]) -> EvalResult:
        # Structural checks
        for check in checks:
            if check == "has_summary" and not output.get("summary"):
                return EvalResult(False, 0.0, "Missing summary field")
            if check == "valid_sentiment" and output.get("sentiment") not in ["positive", "negative", "neutral", "mixed"]:
                return EvalResult(False, 0.0, "Invalid sentiment value")
 
        # Semantic similarity check using embedding distance
        score = self.semantic_similarity(output.get("summary", ""), expected.get("summary", ""))
        return EvalResult(score > 0.8, score, f"Semantic score: {score:.2f}")
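
The semantic_similarity helper is left undefined above. A simple version is cosine similarity over embeddings, reusing whatever embedding function you already have from the few-shot section; pass it into the evaluator's constructor or import it:

python
import numpy as np

def semantic_similarity(a: str, b: str, embedding_fn) -> float:
    """Cosine similarity between two texts; higher means semantically closer."""
    va, vb = embedding_fn(a), embedding_fn(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))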

Run this eval in CI. If pass rate drops below a threshold, fail the build before the broken prompt reaches production.
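
For example, a small pytest gate might look like this (the threshold and dataset path are illustrative; tune them to your baseline):

python
# test_prompts.py -- fails the build when prompt quality regresses
PASS_RATE_THRESHOLD = 0.9  # illustrative value

def test_entity_extraction_prompt():
    evaluator = PromptEvaluator("evals/entity_extraction.json")
    report = evaluator.run(extract_entities)
    assert report["pass_rate"] >= PASS_RATE_THRESHOLD, (
        f"Pass rate {report['pass_rate']:.2f} fell below {PASS_RATE_THRESHOLD}: "
        f"{[f.details for f in report['failures']]}"
    )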

Production Checklist

Before shipping an LLM feature:

  • All LLM calls return structured output validated with Pydantic
  • Retry logic with exponential backoff for rate limits (a minimal sketch follows the checklist)
  • Token usage logged to your observability stack
  • Cache layer in front of expensive calls
  • Eval dataset with 50+ cases covering edge cases
  • Hard max_tokens limits on every call
  • Fallback behaviour when LLM call fails
  • Cost alerting when daily spend exceeds threshold
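
For the retry item, a minimal sketch with exponential backoff and jitter, assuming the Anthropic SDK's RateLimitError (swap in your provider's exception):

python
import random
import time

from anthropic import RateLimitError

def call_with_backoff(fn, max_retries: int = 5):
    """Retry an LLM call on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep((2 ** attempt) + random.random())

# Usage: call_with_backoff(lambda: make_llm_call(prompt, model))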

The gap between a demo and a production LLM feature is mostly this checklist.


Written by

Chetan Yamger

Cloud Engineer · AI Automation Architect · Blogger

Cloud Engineer and AI Automation Architect with deep expertise in Azure, Intune, PowerShell, and AI-driven workflows. I use ChatGPT, Gemini, and prompt engineering to build intelligent automation that improves productivity and decision-making in real IT environments.
