AKS Production Checklist: What Microsoft Doesn't Tell You
Deploy a production-ready Azure Kubernetes Service cluster with autoscaling, managed identity, private endpoints, and proper observability — lessons from running AKS at scale.
The Azure documentation will get you a running AKS cluster in 10 minutes. It will not get you a production-ready AKS cluster. This post is the checklist I run through on every AKS deployment.
The Minimal Viable Production Cluster
A production AKS cluster needs at minimum:
- Managed identity (not service principal) for node authentication
- Private API server or at least authorized IP ranges
- Multiple node pools — system and user separated
- Cluster autoscaler configured correctly
- Azure CNI (not kubenet) for enterprise networking
- Azure Policy add-on for governance
- Container Insights for observability
Terraform Configuration
resource "azurerm_kubernetes_cluster" "main" {
name = "aks-${var.environment}-${var.location_short}"
location = azurerm_resource_group.main.location
resource_group_name = azurerm_resource_group.main.name
dns_prefix = "aks-${var.environment}"
kubernetes_version = var.kubernetes_version
# Use managed identity, not service principal
identity {
type = "SystemAssigned"
}
# Private cluster — no public API server
private_cluster_enabled = true
private_dns_zone_id = azurerm_private_dns_zone.aks.id
# System node pool — only system workloads
default_node_pool {
name = "system"
vm_size = "Standard_D4s_v5"
node_count = 3
min_count = 3
max_count = 5
enable_auto_scaling = true
os_disk_type = "Ephemeral"
vnet_subnet_id = azurerm_subnet.aks_nodes.id
node_labels = {
"kubernetes.azure.com/mode" = "system"
}
node_taints = ["CriticalAddonsOnly=true:NoSchedule"]
}
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
outbound_type = "userDefinedRouting" # Route through firewall
}
azure_active_directory_role_based_access_control {
managed = true
admin_group_object_ids = [var.aks_admin_group_id]
azure_rbac_enabled = true
}
oms_agent {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
microsoft_defender {
log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id
}
workload_identity_enabled = true
oidc_issuer_enabled = true
lifecycle {
ignore_changes = [kubernetes_version]
}
}
# Separate user node pool for application workloads
resource "azurerm_kubernetes_cluster_node_pool" "user" {
name = "user"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D8s_v5"
min_count = 2
max_count = 20
enable_auto_scaling = true
os_disk_type = "Ephemeral"
vnet_subnet_id = azurerm_subnet.aks_nodes.id
mode = "User"
node_labels = {
"workload-type" = "application"
}
}Never use kubenet in enterprise
Azure CNI assigns real VNet IP addresses to pods, which enables network policies, private endpoints to pods, and proper integration with Azure Firewall. Kubenet breaks all of this. The trade-off (smaller addressable pod range) is worth it.
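With Calico as the policy engine (network_policy = "calico" in the config above), you can start each namespace with a default-deny posture and open only the paths you need. A minimal sketch; the namespace, app label, ingress controller namespace, and port are placeholders:

# Deny all ingress to the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: myapp
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Then allow the app to receive traffic only from the ingress controller namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080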
Workload Identity Setup
Service principals are a security liability. Use Workload Identity instead — it binds a Kubernetes service account to an Azure managed identity with no long-lived credentials:
# Create user-assigned managed identity for your app
az identity create \
  --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}"

# Get the client ID and object ID
CLIENT_ID=$(az identity show --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --query clientId -o tsv)
OBJECT_ID=$(az identity show --name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --query principalId -o tsv)

# Federate with Kubernetes service account
OIDC_ISSUER=$(az aks show \
  --name "${AKS_NAME}" \
  --resource-group "${RG_NAME}" \
  --query oidcIssuerProfile.issuerUrl -o tsv)

az identity federated-credential create \
  --name "fc-myapp-${ENVIRONMENT}" \
  --identity-name "id-myapp-${ENVIRONMENT}" \
  --resource-group "${RG_NAME}" \
  --issuer "${OIDC_ISSUER}" \
  --subject "system:serviceaccount:myapp:myapp-sa" \
  --audience "api://AzureADTokenExchange"

Then in your Kubernetes manifests:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: myapp
  annotations:
    azure.workload.identity/client-id: "${CLIENT_ID}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: myapp-sa

No secrets. No credential rotation. No service principal expiry surprises at 2 AM.
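If you want to verify the wiring, the workload identity webhook mutates any pod labeled azure.workload.identity/use: "true", injecting the client ID, tenant ID, and a projected token file that the Azure SDKs pick up automatically. A quick check, assuming the deployment above:

# Inspect the injected environment variables
kubectl exec -n myapp deploy/myapp -- env | grep '^AZURE_'
# Expect AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_AUTHORITY_HOST, AZURE_FEDERATED_TOKEN_FILE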
Autoscaler Gotchas
The cluster autoscaler has three common failure modes:
1. Node pool min/max bounds too tight. If your min is 2 and max is 5, and you get a traffic spike needing 8 nodes, the autoscaler will max out at 5 and your pods will sit pending. Set max generously — you're not charged for nodes that don't exist.
2. Pod disruption budgets blocking scale-down. A PDB of minAvailable: 100% prevents any pod from being evicted, which prevents node scale-down. Set minAvailable: 1 or maxUnavailable: 1 for most stateless workloads (see the example after this list).
3. Priority classes causing unexpected eviction. System pods use system-cluster-critical and system-node-critical priority classes. If your application pods don't have a priority class, they can be evicted to make room for system pods during node pressure — with no warning.
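For the second failure mode, a budget like this keeps disruptions bounded while still letting the autoscaler drain nodes (a sketch, assuming a stateless deployment labeled app: myapp):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: myapp
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp

For the third, give application pods an explicit priority class so they aren't silently preempted: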
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 1000
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      priorityClassName: high-priority-app

Observability Stack
Container Insights + Azure Monitor is the managed path, but for teams that want Prometheus/Grafana:
# Install kube-prometheus-stack via Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="${GRAFANA_PASSWORD}" \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Pair this with Azure Monitor for infrastructure metrics and you get full-stack visibility.
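On the application side, Prometheus in kube-prometheus-stack discovers scrape targets through ServiceMonitor resources, so each workload needs one alongside its Service. A sketch, assuming the app exposes /metrics on a port named http-metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
  labels:
    release: kube-prometheus-stack  # must match the Helm release name so Prometheus selects it
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s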
Upgrade Strategy
Never upgrade the control plane and node pools at the same time. The safe sequence:
- Upgrade control plane to N+1
- Verify control plane health (check API server logs in Container Insights)
- Upgrade system node pool (cordon, drain, upgrade in batches)
- Upgrade user node pools
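In az CLI terms, that sequence looks roughly like this (a sketch; the target version and pool names are placeholders):

# 1. Control plane only
az aks upgrade \
  --resource-group "${RG_NAME}" \
  --name "${AKS_NAME}" \
  --kubernetes-version "${NEW_VERSION}" \
  --control-plane-only

# 2. Then each node pool, system pool first
az aks nodepool upgrade \
  --resource-group "${RG_NAME}" \
  --cluster-name "${AKS_NAME}" \
  --name system \
  --kubernetes-version "${NEW_VERSION}"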
Use maintenance windows to control when AKS applies security patches:
az aks maintenanceconfiguration add \
  --name "default" \
  --resource-group "${RG_NAME}" \
  --cluster-name "${AKS_NAME}" \
  --weekday Saturday \
  --start-hour 2

This limits surprise upgrades to Saturday 2–3 AM — still not ideal, but better than Tuesday afternoon.