LLM Prompt Engineering for Production Systems
Move beyond toy demos to production-ready LLM integration: structured outputs, few-shot prompting, prompt chaining, cost control, and evaluation frameworks that actually work.
Most prompt engineering content is written for demos. Production is different: you need consistent structured outputs, cost predictability, graceful degradation, and a way to know when your prompts stop working. Here's what I've learned running LLM pipelines in production.
The Foundation: Structured Outputs
The first thing that breaks in production, and that demos never exercise: you can't parse free-form text reliably. Every production LLM call should return structured data.
```python
from anthropic import Anthropic
import json

client = Anthropic()

EXTRACT_SCHEMA = {
    "type": "object",
    "required": ["entities", "sentiment", "summary"],
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "type", "confidence"],
                "properties": {
                    "name": {"type": "string"},
                    "type": {"enum": ["person", "org", "location", "product"]},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
                }
            }
        },
        "sentiment": {"enum": ["positive", "negative", "neutral", "mixed"]},
        "summary": {"type": "string", "maxLength": 200}
    }
}

def extract_entities(text: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        system="""You are an entity extraction API.
Always respond with valid JSON matching the provided schema exactly.
Do not include any text outside the JSON object.""",
        messages=[
            {
                "role": "user",
                "content": f"""Extract entities from this text.
Return JSON matching this schema: {json.dumps(EXTRACT_SCHEMA)}

Text: {text}"""
            }
        ]
    )

    # Parse and validate
    result = json.loads(response.content[0].text)
    # Add schema validation here (jsonschema, pydantic, etc.)
    return result
```

Use Pydantic for validation
Wrap your JSON parsing in a Pydantic model. If the LLM returns a field with the wrong type, Pydantic will catch it before it causes a downstream error. This makes debugging much easier than tracing a TypeError through your pipeline.
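A minimal sketch of that pattern, assuming Pydantic v2 (the model and field names here simply mirror the schema above):

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class EntityType(str, Enum):
    person = "person"
    org = "org"
    location = "location"
    product = "product"

class Entity(BaseModel):
    name: str
    type: EntityType          # unknown types are rejected, not passed through
    confidence: float = Field(ge=0, le=1)

class Extraction(BaseModel):
    entities: list[Entity]
    sentiment: str
    summary: str = Field(max_length=200)

# A well-formed response parses cleanly
raw = ('{"entities": [{"name": "Acme", "type": "org", "confidence": 0.9}], '
       '"sentiment": "neutral", "summary": "Acme mentioned."}')
result = Extraction.model_validate_json(raw)

# An out-of-range confidence is caught at the boundary,
# not three functions deeper in the pipeline
try:
    Extraction.model_validate_json(raw.replace("0.9", "1.5"))
    caught = False
except ValidationError:
    caught = True
```

The `Enum` field is doing real work here: if the model invents a new entity type, validation fails immediately instead of polluting downstream aggregates.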
Few-Shot Prompting That Scales
The naive approach to few-shot prompting puts all examples in the system prompt. This works until you have 20 examples and you're paying for them on every call.
A better pattern: store examples in a database, retrieve the most relevant ones for each request using embeddings:
```python
import numpy as np

class DynamicFewShotPrompter:
    def __init__(self, examples: list[dict], embedding_fn):
        self.examples = examples
        self.embedding_fn = embedding_fn
        # Pre-compute embeddings for all examples
        self.example_embeddings = np.array([
            embedding_fn(ex["input"])
            for ex in examples
        ])

    def get_relevant_examples(self, query: str, k: int = 3) -> list[dict]:
        query_embedding = self.embedding_fn(query)
        # Cosine similarity
        similarities = np.dot(self.example_embeddings, query_embedding) / (
            np.linalg.norm(self.example_embeddings, axis=1) *
            np.linalg.norm(query_embedding)
        )
        top_k_indices = np.argsort(similarities)[-k:][::-1]
        return [self.examples[i] for i in top_k_indices]

    def build_prompt(self, query: str) -> str:
        examples = self.get_relevant_examples(query)
        examples_text = "\n\n".join([
            f"Input: {ex['input']}\nOutput: {ex['output']}"
            for ex in examples
        ])
        return f"Examples:\n{examples_text}\n\nNow process:\n{query}"
```

This keeps your prompt size constant regardless of how many total examples you have.
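To sanity-check the retrieval math in isolation, a toy embedding is enough. The letter-frequency vector below is a deliberately crude stand-in for a real embedding model, but it exercises the same cosine-similarity ranking:

```python
import numpy as np

def toy_embedding(text: str) -> np.ndarray:
    # Hypothetical stand-in for a real embedding model:
    # a 26-dimensional letter-frequency vector.
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

examples = [
    {"input": "refund my last payment", "output": "billing"},
    {"input": "app crashes on login", "output": "technical"},
]
embs = np.array([toy_embedding(ex["input"]) for ex in examples])
q = toy_embedding("please refund the payment")

# The same cosine-similarity ranking the prompter uses
sims = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
best = examples[int(np.argmax(sims))]  # the billing example ranks first
```

Swapping `toy_embedding` for a real embedding endpoint changes nothing about the ranking code, which is the point of injecting `embedding_fn` as a dependency.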
Prompt Chaining for Complex Tasks
Don't try to do everything in one prompt. Break complex tasks into a chain:
```python
def analyze_support_ticket(ticket: str) -> dict:
    # Step 1: Classify intent (cheap, fast model)
    intent = classify_intent(ticket)  # Uses claude-haiku-4-5

    if intent["category"] == "billing":
        # Step 2: Extract billing details
        billing_info = extract_billing_info(ticket)
        # Step 3: Generate response using billing context
        response = generate_billing_response(ticket, billing_info)
    elif intent["category"] == "technical":
        # Different chain for technical issues
        diagnosis = diagnose_technical_issue(ticket)
        response = generate_technical_response(ticket, diagnosis)
    else:
        # Generic response for unknown intent
        response = generate_generic_response(ticket)

    return {
        "intent": intent,
        "response": response,
        "escalate": intent["confidence"] < 0.7
    }
```

Use cheaper/faster models for classification and routing, and reserve expensive models for generation.
Cost Control in Production
LLM costs grow faster than you expect. Three levers:
1. Cache identical requests. Use Redis with a TTL. The cache key should be a hash of your prompt + model parameters:
```python
import hashlib
import json
import redis

cache = redis.Redis()

def cached_llm_call(prompt: str, model: str = "claude-opus-4-6", ttl: int = 3600) -> str:
    cache_key = hashlib.sha256(
        json.dumps({"prompt": prompt, "model": model}).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    response = make_llm_call(prompt, model)
    cache.setex(cache_key, ttl, response)
    return response
```

For document summarization, the cache hit rate can be 40–60%.
2. Route by complexity. Use a cheap model to assess if the request needs an expensive model:
```python
def route_request(query: str) -> str:
    """Returns model name based on complexity assessment."""
    # Simple heuristics first
    if len(query) < 100 and "?" not in query:
        return "claude-haiku-4-5-20251001"

    complexity = assess_complexity(query)  # Fast call to Haiku
    if complexity["score"] > 0.8:
        return "claude-opus-4-6"
    elif complexity["score"] > 0.4:
        return "claude-sonnet-4-6"
    return "claude-haiku-4-5-20251001"
```

3. Set hard token limits. Always set max_tokens to the minimum you need for the task. Unbounded responses waste money and slow your pipeline.
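One lightweight way to enforce that third lever is a per-task budget table. The task names and numbers below are illustrative, not prescriptive; tune them to your own pipeline:

```python
# Hypothetical per-task token budgets; tune these to your own tasks.
MAX_TOKENS = {
    "classify": 64,         # a category label needs very few tokens
    "extract": 512,         # structured JSON, bounded by the schema
    "summarize": 300,       # sized to match the summary length cap upstream
    "generate_reply": 800,  # longest free-form output in the pipeline
}

def max_tokens_for(task: str, default: int = 256) -> int:
    """Look up the hard cap for a task, with a conservative default."""
    return MAX_TOKENS.get(task, default)
```

Passing `max_tokens=max_tokens_for(task)` on every call means a new task can never silently run unbounded; at worst it gets the conservative default.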
Evaluation: Know When Your Prompts Break
This is the part everyone skips. Your prompt will break in production, and you won't know unless you have an eval framework.
```python
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str

class PromptEvaluator:
    def __init__(self, eval_dataset_path: str):
        self.cases = json.loads(Path(eval_dataset_path).read_text())

    def run(self, prompt_fn) -> dict:
        results = []
        for case in self.cases:
            output = prompt_fn(case["input"])
            score = self.score(output, case["expected_output"], case["checks"])
            results.append(score)
        return {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "avg_score": sum(r.score for r in results) / len(results),
            "failures": [r for r in results if not r.passed],
        }

    def score(self, output: dict, expected: dict, checks: list[str]) -> EvalResult:
        # Structural checks
        for check in checks:
            if check == "has_summary" and not output.get("summary"):
                return EvalResult(False, 0.0, "Missing summary field")
            if check == "valid_sentiment" and output.get("sentiment") not in ["positive", "negative", "neutral", "mixed"]:
                return EvalResult(False, 0.0, "Invalid sentiment value")

        # Semantic similarity check using embedding distance
        # (semantic_similarity wraps an embedding comparison; not shown here)
        score = self.semantic_similarity(output.get("summary", ""), expected.get("summary", ""))
        return EvalResult(score > 0.8, score, f"Semantic score: {score:.2f}")
```

Run this eval in CI. If the pass rate drops below a threshold, fail the build before the broken prompt reaches production.
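To make that concrete: a single eval case needs only the three fields the evaluator reads, and the CI gate itself is a one-line threshold check. The case content and the 0.9 threshold below are illustrative:

```python
# One eval case in the shape PromptEvaluator expects
case = {
    "input": "Acme Corp shipped a well-received update to their app.",
    "expected_output": {"summary": "Acme Corp released a popular app update."},
    "checks": ["has_summary", "valid_sentiment"],
}

def gate(eval_summary: dict, min_pass_rate: float = 0.9) -> bool:
    """CI gate: True when the eval run clears the threshold."""
    return eval_summary["pass_rate"] >= min_pass_rate

# In CI, exit non-zero (or use a failing pytest assert) when the gate fails
healthy = gate({"pass_rate": 0.95, "avg_score": 0.88, "failures": []})
broken = gate({"pass_rate": 0.70, "avg_score": 0.61, "failures": ["case_12"]})
```

Keeping the gate as a pure function of the summary dict makes it trivial to unit-test the threshold logic separately from the (slow, paid) LLM calls.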
Production Checklist
Before shipping an LLM feature:
- All LLM calls return structured output validated with Pydantic
- Retry logic with exponential backoff for rate limits
- Token usage logged to your observability stack
- Cache layer in front of expensive calls
- Eval dataset with 50+ cases covering edge cases
- Hard max_tokens limits on every call
- Fallback behaviour when LLM call fails
- Cost alerting when daily spend exceeds threshold
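The retry item above can be sketched as follows. `RateLimitError` is a placeholder here; substitute the rate-limit exception your SDK actually raises:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your SDK's rate-limit exception."""

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn on rate limits with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # base, 2*base, 4*base, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

The jitter matters in production: without it, every client that hit the same rate limit retries at the same instant and hits it again.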
The gap between a demo and a production LLM feature is mostly this checklist.
Written by
Chetan Yamger
Cloud Engineer · AI Automation Architect · Blogger
Cloud Engineer and AI Automation Architect with deep expertise in Azure, Intune, PowerShell, and AI-driven workflows. I use ChatGPT, Gemini, and prompt engineering to build intelligent automation that improves productivity and decision-making in real IT environments.
