
LLM Prompt Engineering for Production Systems

Move beyond toy demos to production-ready LLM integration: structured outputs, few-shot prompting, prompt chaining, cost control, and evaluation frameworks that actually work.


Most prompt engineering content is written for demos. Production is different: you need consistent structured outputs, cost predictability, graceful degradation, and a way to know when your prompts stop working. Here's what I've learned running LLM pipelines in production.

The Foundation: Structured Outputs

The first luxury you lose in production is one demos never miss: free-form output. You can't parse free-form text reliably, so every production LLM call should return structured data.

python
from anthropic import Anthropic
import json
 
client = Anthropic()
 
EXTRACT_SCHEMA = {
    "type": "object",
    "required": ["entities", "sentiment", "summary"],
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "type", "confidence"],
                "properties": {
                    "name": {"type": "string"},
                    "type": {"enum": ["person", "org", "location", "product"]},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
                }
            }
        },
        "sentiment": {"enum": ["positive", "negative", "neutral", "mixed"]},
        "summary": {"type": "string", "maxLength": 200}
    }
}
 
def extract_entities(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are an entity extraction API. 
        Always respond with valid JSON matching the provided schema exactly.
        Do not include any text outside the JSON object.""",
        messages=[
            {
                "role": "user",
                "content": f"""Extract entities from this text. 
                Return JSON matching this schema: {json.dumps(EXTRACT_SCHEMA)}
                
                Text: {text}"""
            }
        ]
    )
 
    # Parse and validate
    result = json.loads(response.content[0].text)
    # Add schema validation here (jsonschema, pydantic, etc.)
    return result

Use Pydantic for validation

Wrap your JSON parsing in a Pydantic model. If the LLM returns a field with the wrong type, Pydantic will catch it before it causes a downstream error. This makes debugging much easier than tracing a TypeError through your pipeline.
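
A minimal sketch of what that looks like, assuming Pydantic v2 and mirroring the extraction schema above (the model and function names here are illustrative, not part of any SDK):

python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class Entity(BaseModel):
    name: str
    type: Literal["person", "org", "location", "product"]
    confidence: float = Field(ge=0, le=1)

class ExtractionResult(BaseModel):
    entities: list[Entity]
    sentiment: Literal["positive", "negative", "neutral", "mixed"]
    summary: str = Field(max_length=200)

def parse_extraction(raw_json: str) -> ExtractionResult:
    try:
        # Parses and validates in one step (Pydantic v2)
        return ExtractionResult.model_validate_json(raw_json)
    except ValidationError as exc:
        # Surface the validation error early instead of a TypeError downstream
        raise ValueError(f"LLM returned an invalid extraction payload: {exc}") from exc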

Few-Shot Prompting That Scales

The naive approach to few-shot prompting puts all examples in the system prompt. This works until you have 20 examples and you're paying for them on every call.

A better pattern: store the examples in a database and retrieve only the most relevant ones for each request using embeddings:

python
import numpy as np
from typing import Optional
 
class DynamicFewShotPrompter:
    def __init__(self, examples: list[dict], embedding_fn):
        self.examples = examples
        self.embedding_fn = embedding_fn
        # Pre-compute embeddings for all examples
        self.example_embeddings = np.array([
            embedding_fn(ex["input"])
            for ex in examples
        ])
 
    def get_relevant_examples(self, query: str, k: int = 3) -> list[dict]:
        query_embedding = self.embedding_fn(query)
 
        # Cosine similarity
        similarities = np.dot(self.example_embeddings, query_embedding) / (
            np.linalg.norm(self.example_embeddings, axis=1) *
            np.linalg.norm(query_embedding)
        )
 
        top_k_indices = np.argsort(similarities)[-k:][::-1]
        return [self.examples[i] for i in top_k_indices]
 
    def build_prompt(self, query: str) -> str:
        examples = self.get_relevant_examples(query)
        examples_text = "\n\n".join([
            f"Input: {ex['input']}\nOutput: {ex['output']}"
            for ex in examples
        ])
        return f"Examples:\n{examples_text}\n\nNow process:\n{query}"

This keeps your prompt size constant regardless of how many total examples you have.
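
Usage, continuing from the class above, might look like this; get_embedding is a stand-in for whichever embedding model or API you actually use:

python
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    """Placeholder: call your embedding model here and return a 1-D vector."""
    ...

examples = [
    {"input": "Refund my last invoice", "output": '{"category": "billing"}'},
    {"input": "App crashes on login", "output": '{"category": "technical"}'},
    # ...hundreds more, loaded from your database
]

prompter = DynamicFewShotPrompter(examples, embedding_fn=get_embedding)
prompt = prompter.build_prompt("I was charged twice this month")
# The prompt now carries only the 3 most relevant examples, not all of them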

Prompt Chaining for Complex Tasks

Don't try to do everything in one prompt. Break complex tasks into a chain:

python
def analyze_support_ticket(ticket: str) -> dict:
    # Step 1: Classify intent (cheap, fast model)
    intent = classify_intent(ticket)  # Uses claude-haiku-4-5
 
    if intent["category"] == "billing":
        # Step 2: Extract billing details
        billing_info = extract_billing_info(ticket)
        # Step 3: Generate response using billing context
        response = generate_billing_response(ticket, billing_info)
    elif intent["category"] == "technical":
        # Different chain for technical issues
        diagnosis = diagnose_technical_issue(ticket)
        response = generate_technical_response(ticket, diagnosis)
    else:
        # Generic response for unknown intent
        response = generate_generic_response(ticket)
 
    return {
        "intent": intent,
        "response": response,
        "escalate": intent["confidence"] < 0.7
    }

Use cheaper, faster models for classification and routing; reserve expensive models for generation.
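
The helper functions in the chain (classify_intent and friends) are where that split happens. A minimal sketch of the classification step, reusing the Anthropic client and json import from earlier (the category list and JSON contract are illustrative):

python
def classify_intent(ticket: str) -> dict:
    """Cheap, fast routing step on a small model."""
    response = client.messages.create(
        model="claude-haiku-4-5",  # classification doesn't need the big model
        max_tokens=150,
        system='Classify the support ticket. Respond with JSON only, e.g. '
               '{"category": "billing", "confidence": 0.92}. '
               'Valid categories: billing, technical, other.',
        messages=[{"role": "user", "content": ticket}],
    )
    return json.loads(response.content[0].text)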

Cost Control in Production

LLM costs grow faster than you expect. Three levers:

1. Cache identical requests. Use Redis with a TTL. The cache key should be a hash of your prompt + model parameters:

python
import hashlib
import json
import redis
 
cache = redis.Redis()
 
def cached_llm_call(prompt: str, model: str = "claude-opus-4-6", ttl: int = 3600) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"prompt": prompt, "model": model}).encode()
    ).hexdigest()
 
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()
 
    response = make_llm_call(prompt, model)
    cache.setex(cache_key, ttl, response)
    return response

For document summarization workloads, the cache hit rate can reach 40–60%.

2. Route by complexity. Use a cheap model to assess if the request needs an expensive model:

python
def route_request(query: str) -> str:
    """Returns model name based on complexity assessment."""
    # Simple heuristics first
    if len(query) < 100 and "?" not in query:
        return "claude-haiku-4-5-20251001"
 
    complexity = assess_complexity(query)  # Fast call to Haiku
    if complexity["score"] > 0.8:
        return "claude-opus-4-6"
    elif complexity["score"] > 0.4:
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"

3. Set hard token limits. Always set max_tokens to the minimum you need for the task. Unbounded responses waste money and slow your pipeline.

Evaluation: Know When Your Prompts Break

This is the part everyone skips. Your prompt will break in production, and you won't know unless you have an eval framework.

python
import json
from pathlib import Path
from dataclasses import dataclass
 
@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str
 
class PromptEvaluator:
    def __init__(self, eval_dataset_path: str):
        self.cases = json.loads(Path(eval_dataset_path).read_text())
 
    def run(self, prompt_fn) -> dict:
        results = []
        for case in self.cases:
            output = prompt_fn(case["input"])
            score = self.score(output, case["expected_output"], case["checks"])
            results.append(score)
 
        return {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "avg_score": sum(r.score for r in results) / len(results),
            "failures": [r for r in results if not r.passed],
        }
 
    def score(self, output: dict, expected: dict, checks: list[str]) -> EvalResult:
        # Structural checks
        for check in checks:
            if check == "has_summary" and not output.get("summary"):
                return EvalResult(False, 0.0, "Missing summary field")
            if check == "valid_sentiment" and output.get("sentiment") not in ["positive", "negative", "neutral", "mixed"]:
                return EvalResult(False, 0.0, "Invalid sentiment value")
 
        # Semantic similarity check using embedding distance
        score = self.semantic_similarity(output.get("summary", ""), expected.get("summary", ""))
        return EvalResult(score > 0.8, score, f"Semantic score: {score:.2f}")
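
The semantic_similarity helper is left undefined above. A simple version is cosine similarity over embeddings, reusing whatever embedding function you already have from the few-shot section; pass it into the evaluator's constructor or import it:

python
import numpy as np

def semantic_similarity(a: str, b: str, embedding_fn) -> float:
    """Cosine similarity between two texts; higher means semantically closer."""
    va, vb = embedding_fn(a), embedding_fn(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))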

Run this eval in CI. If pass rate drops below a threshold, fail the build before the broken prompt reaches production.
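
For example, a small pytest gate might look like this (the threshold and dataset path are illustrative; tune them to your baseline):

python
# test_prompts.py -- fails the build when prompt quality regresses
PASS_RATE_THRESHOLD = 0.9  # illustrative value

def test_entity_extraction_prompt():
    evaluator = PromptEvaluator("evals/entity_extraction.json")
    report = evaluator.run(extract_entities)
    assert report["pass_rate"] >= PASS_RATE_THRESHOLD, (
        f"Pass rate {report['pass_rate']:.2f} fell below {PASS_RATE_THRESHOLD}: "
        f"{[f.details for f in report['failures']]}"
    )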

Production Checklist

Before shipping an LLM feature:

  • All LLM calls return structured output validated with Pydantic
  • Retry logic with exponential backoff for rate limits (a minimal sketch follows the checklist)
  • Token usage logged to your observability stack
  • Cache layer in front of expensive calls
  • Eval dataset with 50+ cases covering edge cases
  • Hard max_tokens limits on every call
  • Fallback behaviour when LLM call fails
  • Cost alerting when daily spend exceeds threshold
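
For the retry item, a minimal sketch with exponential backoff and jitter, assuming the Anthropic SDK's RateLimitError (swap in your provider's exception):

python
import random
import time

from anthropic import RateLimitError

def call_with_backoff(fn, max_retries: int = 5):
    """Retry an LLM call on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep((2 ** attempt) + random.random())

# Usage: call_with_backoff(lambda: make_llm_call(prompt, model))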

The gap between a demo and a production LLM feature is mostly this checklist.


Written by

Chetan Yamger

Cloud Engineer · AI Automation Architect · Blogger

Cloud Engineer and AI Automation Architect with deep expertise in Azure, Intune, PowerShell, and AI-driven workflows. I use ChatGPT, Gemini, and prompt engineering to build intelligent automation that improves productivity and decision-making in real IT environments.
