AKS Production Checklist: What Microsoft Doesn't Tell You
Deploy a production-ready Azure Kubernetes Service cluster with autoscaling, managed identity, private endpoints, and proper observability — lessons from running AKS at scale.
The Azure documentation will get you a running AKS cluster in 10 minutes. It will not get you a production-ready AKS cluster. This post is the checklist I run through on every AKS deployment.
The Minimal Viable Production Cluster
A production AKS cluster needs at minimum:
- Managed identity (not service principal) for node authentication
- Private API server or at least authorized IP ranges
- Multiple node pools — system and user separated
- Cluster autoscaler configured correctly
- Azure CNI (not kubenet) for enterprise networking
- Azure Policy add-on for governance
- Container Insights for observability
Terraform Configuration
resource "azurerm_kubernetes_cluster" "main" {
name = "aks-${var.environment}-${var.location_short}"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = "aks-${var.environment}"
kubernetes_version = var.kubernetes_version
# Use managed identity, not service principal
identity {
type = "SystemAssigned"
}
# Private cluster — no public API server
private_cluster_enabled = true
private_dns_zone_id = azurerm_private_dns_zone.aks.id
# System node pool — only system workloads
default_node_pool {
name = "system"
vm_size = "Standard_D4s_v5"
node_count = 3
min_count = 3
max_count = 5
enable_auto_scaling = true
os_disk_type = "Ephemeral"
vnet_subnet_id = azurerm_subnet.aks_nodes.id
node_labels = {
"kubernetes.azure.com/mode" = "system"
}
node_taints = ["CriticalAddonsOnly=true:NoSchedule"]
}
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
outbound_type = "userDefinedRouting" # Route through firewall
}
azure_active_directory_role_based_access_control {
managed = true
admin_group_object_ids = [var.aks_admin_group_id]
azure_rbac_enabled = true
}
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
microsoft_defender {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
workload_identity_enabled = true
oidc_issuer_enabled = true
lifecycle {
ignore_changes = [kubernetes_version]
}
}
# Separate user node pool for application workloads
resource "azurerm_kubernetes_cluster_node_pool" "user" {
name = "user"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D8s_v5"
min_count = 2
max_count = 20
enable_auto_scaling = true
os_disk_type = "Ephemeral"
vnet_subnet_id = azurerm_subnet.aks_nodes.id
mode = "User"
node_labels = {
"workload-type" = "application"
}
}Never use kubenet in enterprise
Azure CNI assigns real VNet IP addresses to pods, which enables network policies, private endpoints to pods, and proper integration with Azure Firewall. Kubenet breaks all of this. The trade-off (smaller addressable pod range) is worth it.
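With Calico as the policy engine (network_policy = "calico" in the config above), you can start each namespace with a default-deny posture and open only the paths you need. A minimal sketch; the namespace, app label, ingress controller namespace, and port are placeholders:

# Deny all ingress to the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: myapp
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Then allow the app to receive traffic only from the ingress controller namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080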
Workload Identity Setup
Service principals are a security liability. Use Workload Identity instead — it binds a Kubernetes service account to an Azure managed identity with no long-lived credentials:
# Create user-assigned managed identity for your app
az identity create \
  --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}"

# Get the client ID and object ID
CLIENT_ID=$(az identity show --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --query clientId -o tsv)
OBJECT_ID=$(az identity show --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --query principalId -o tsv)

# Federate with Kubernetes service account
OIDC_ISSUER=$(az aks show \
  --name "${AKS_NAME}" \
  --resource-group "${RG_NAME}" \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az identity federated-credential create \
  --name "fc-myapp-${ENVIRONMENT}" \
  --identity-name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:myapp:myapp-sa" \
  --audience "api://AzureADTokenExchange"

Then in your Kubernetes manifests:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: myapp
  annotations:
    azure.workload.identity/client-id: "${CLIENT_ID}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: myapp-sa

No secrets. No credential rotation. No service principal expiry surprises at 2 AM.
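If you want to verify the wiring, the workload identity webhook mutates any pod labeled azure.workload.identity/use: "true", injecting the client ID, tenant ID, and a projected token file that the Azure SDKs pick up automatically. A quick check, assuming the deployment above:

# Inspect the injected environment variables
kubectl exec -n myapp deploy/myapp -- env | grep '^AZURE_'
# Expect AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_AUTHORITY_HOST, AZURE_FEDERATED_TOKEN_FILE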
Autoscaler Gotchas
The cluster autoscaler has three common failure modes:
1. Node pool min/max bounds too tight. If your min is 2 and max is 5, and you get a traffic spike needing 8 nodes, the autoscaler will max out at 5 and your pods will sit pending. Set max generously — you're not charged for nodes that don't exist.
2. Pod disruption budgets blocking scale-down. A PDB of minAvailable: 100% prevents any pod from being evicted, which prevents node scale-down. Set minAvailable: 1 or maxUnavailable: 1 for most stateless workloads (see the example after this list).
3. Priority classes causing unexpected eviction. System pods use system-cluster-critical and system-node-critical priority classes. If your application pods don't have a priority class, they can be evicted to make room for system pods during node pressure — with no warning.
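For the second failure mode, a budget like this keeps disruptions bounded while still letting the autoscaler drain nodes (a sketch, assuming a stateless deployment labeled app: myapp):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: myapp
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp

For the third, give application pods an explicit priority class so they aren't silently preempted: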
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 1000
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      priorityClassName: high-priority-app

Observability Stack
Container Insights + Azure Monitor is the managed path, but for teams that want Prometheus/Grafana:
# Install kube-prometheus-stack via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Pair this with Azure Monitor for infrastructure metrics and you get full-stack visibility.
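On the application side, Prometheus in kube-prometheus-stack discovers scrape targets through ServiceMonitor resources, so each workload needs one alongside its Service. A sketch, assuming the app exposes /metrics on a port named http-metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
  labels:
    release: kube-prometheus-stack  # must match the Helm release name so Prometheus selects it
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s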
Upgrade Strategy
Never upgrade the control plane and node pools at the same time. The safe sequence:
- Upgrade control plane to N+1
- Verify control plane health (check API server logs in Container Insights)
- Upgrade system node pool (cordon, drain, upgrade in batches)
- Upgrade user node pools
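In az CLI terms, that sequence looks roughly like this (a sketch; the target version and pool names are placeholders):

# 1. Control plane only
az aks upgrade \
  --resource-group "${RG_NAME}" \
  --name "${AKS_NAME}" \
  --kubernetes-version "${NEW_VERSION}" \
  --control-plane-only

# 2. Then each node pool, system pool first
az aks nodepool upgrade \
  --resource-group "${RG_NAME}" \
  --cluster-name "${AKS_NAME}" \
  --name system \
  --kubernetes-version "${NEW_VERSION}"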
Use maintenance windows to control when AKS applies security patches:
az aks maintenanceconfiguration add \
  --name "default" \
  --resource-group "${RG_NAME}" \
  --cluster-name "${AKS_NAME}" \
  --weekday Saturday \
  --start-hour 2

This limits surprise upgrades to Saturday 2–3 AM — still not ideal, but better than Tuesday afternoon.