Running Kubernetes in production is vastly different from following a tutorial. When you manage 50 or more microservices across multiple clusters, every architectural decision compounds. At Nexis Limited, our platform engineering teams have accumulated hard-won experience operating Kubernetes at scale for enterprise clients in Bangladesh and internationally. Here are the lessons that matter most.

Namespace Strategy and Multi-Tenancy

Organizing workloads into namespaces is one of the first decisions you make, and getting it wrong creates long-term pain. We recommend namespace-per-team or namespace-per-environment rather than namespace-per-service. With 50+ microservices, a namespace-per-service approach becomes unmanageable. Apply ResourceQuotas and LimitRanges at the namespace level to prevent any single team from monopolizing cluster resources. Network policies should default to deny-all, with explicit allowlists for inter-service communication.
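As a sketch, a per-team namespace might pair a quota with a default-deny policy. The namespace name and quota values below are illustrative, not prescriptive:

```yaml
# Cap what one team's namespace can request (values are examples)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-payments-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Default deny: with an empty podSelector this matches every pod in the
# namespace and blocks all traffic until explicit allow policies are added
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Each permitted service-to-service path then gets its own narrowly scoped NetworkPolicy on top of this baseline.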

Resource Requests and Limits: The Silent Killer

Incorrect resource requests and limits are the number one cause of production incidents in Kubernetes. Setting requests too low overcommits nodes, leading to CPU throttling, memory pressure, and evictions. Setting requests too high strands capacity and inflates cloud bills, because requests, not actual usage, drive how many nodes you provision. We use the Vertical Pod Autoscaler in recommendation mode to continuously analyze actual resource usage and adjust requests accordingly. For CPU, we typically set requests at the P95 usage level and limits at 2-3x the request. For memory, requests should match the actual working set size, and limits should sit slightly above peak observed usage.
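Run in recommendation-only mode, the VPA reports suggested requests without evicting pods, so you can review its numbers before applying them. A minimal sketch, assuming the VPA controller is installed and targeting a hypothetical "checkout" Deployment:

```yaml
# VPA in recommendation mode: updateMode "Off" means it only publishes
# recommendations (visible via kubectl describe vpa) and never restarts pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: "Off"
```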

Horizontal Pod Autoscaling Done Right

HPA should scale on application-specific metrics, not just CPU utilization. Queue depth, request latency, and concurrent connections are far better scaling signals for most microservices. Use KEDA (Kubernetes Event-Driven Autoscaling) for event-driven workloads that need to scale to zero. Always set PodDisruptionBudgets to ensure minimum availability during scaling events and node maintenance.
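To make that concrete, here is a sketch of a KEDA ScaledObject scaling a hypothetical queue worker on queue depth, paired with a PodDisruptionBudget. The workload name, queue name, and connection string are placeholders, and the RabbitMQ trigger is just one example of an event source:

```yaml
# Scale order-worker on RabbitMQ queue depth, down to zero when idle
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker-scaler
spec:
  scaleTargetRef:
    name: order-worker          # the Deployment KEDA scales
  minReplicaCount: 0            # scale to zero between bursts
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders
        mode: QueueLength
        value: "50"             # target ~50 messages per replica
        host: amqp://guest:guest@rabbitmq:5672/
---
# Keep at least 2 replicas up during voluntary disruptions
# (node drains, cluster upgrades)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-worker-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-worker
```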

Service Mesh Considerations

A service mesh like Istio or Linkerd adds enormous value at scale but introduces operational complexity. We recommend adopting a service mesh only when you genuinely need mutual TLS between services, advanced traffic management, or fine-grained observability. For smaller deployments, Kubernetes native NetworkPolicies and ingress controllers handle most requirements without the overhead. When you do adopt a mesh, start with sidecar injection in a single namespace and expand gradually.
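With Istio, for example, scoping the rollout to one namespace is a single label; pods created in that namespace after the label is applied get the sidecar, while every other namespace is untouched:

```yaml
# Opt one namespace into automatic sidecar injection; existing pods
# must be restarted to pick up the proxy
apiVersion: v1
kind: Namespace
metadata:
  name: team-payments
  labels:
    istio-injection: enabled
```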

Observability at Scale

At 50+ microservices, you cannot debug production issues by reading individual pod logs. Invest in three pillars of observability: metrics with Prometheus and Grafana, logs with Loki or Elasticsearch, and distributed tracing with Jaeger or Tempo. Standardize on OpenTelemetry for instrumentation across all services. Custom dashboards per team showing golden signals (latency, traffic, errors, saturation) are essential. Alert on symptoms rather than causes: users care about latency and errors, not individual pod CPU usage.
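A symptom-based alert in this spirit might look like the following, assuming the Prometheus Operator's PrometheusRule CRD and a conventionally named request-counter metric (both assumptions; adapt to your instrumentation):

```yaml
# Page on the symptom users feel (5xx error rate), not on pod CPU
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo-alerts
spec:
  groups:
    - name: checkout.slo
      rules:
        - alert: CheckoutHighErrorRate
          expr: |
            sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.01
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Checkout 5xx error rate above 1% for 10 minutes"
```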

Deployment Strategy and Rollbacks

We enforce rolling deployments with readiness probes and maxUnavailable set to zero for critical services. Every deployment must pass automated smoke tests before the rollout progresses. Kubernetes' rollback capabilities are powerful, but they depend on retained deployment history: set revisionHistoryLimit to a reasonable number and practice rollback procedures regularly. Canary deployments using Flagger or Argo Rollouts add another safety layer by gradually shifting traffic to new versions.
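The rolling-update settings described above can be sketched in a Deployment like this one (the service name, image, and probe endpoint are illustrative):

```yaml
# Rolling update that never drops below desired capacity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 4
  revisionHistoryLimit: 10    # keep history so kubectl rollout undo works
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0       # never take a ready pod down early
      maxSurge: 1             # add one new pod at a time instead
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.4.2
          readinessProbe:     # gates traffic until the pod is truly ready
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

With maxUnavailable at zero, the rollout only proceeds as each surged pod passes its readiness probe, so a bad image stalls instead of degrading capacity.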

Cluster Upgrades and Maintenance

Kubernetes releases a new minor version every four months, and staying current is non-negotiable for security patches and feature access. We maintain staging clusters that mirror production and validate upgrades there before touching production. Node pool upgrades should use surge upgrades to maintain capacity. Test all admission webhooks, custom controllers, and CRDs against the new version in staging first.

Managing Kubernetes at scale demands continuous learning and operational discipline. Our teams at Nexis Limited bring this expertise to every engagement, from initial architecture design to ongoing operations. Explore our services for Kubernetes consulting, or contact us to discuss your container orchestration strategy.