To improve our services we have changed the way the Kubernetes nodepools are structured. Previously there was a single default nodepool running a mix of Kubernetes add-ons and application deployments, which made things more complex than they needed to be. We have therefore created a dedicated system nodepool on which all add-ons are scheduled. During this change we also took a closer look at the requested resources of all add-ons and made adjustments where needed. For most of our customer environments we've been able to reduce the cluster size by at least one node's worth of capacity. A handful are break-even for now, but we have further optimisations planned as follow-ups.
System nodepool
All system add-ons now run in a dedicated nodepool; a sketch of how add-ons are pinned to it follows the list below.
The benefits of this are:
- clear and transparent base cost of the platform
- better utilisation of EC2 instances
- easier capacity planning, as system and application workloads can have different requirements
- easier for Skyscrapers to roll out maintenance updates without affecting application workloads
- increased workload isolation
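To make this concrete, below is a minimal sketch of how an add-on can be pinned to the system nodepool with a nodeSelector and a matching toleration. The `nodepool: system` label, the `dedicated=system:NoSchedule` taint and the add-on itself are hypothetical placeholders; the actual names used in our clusters may differ.

```yaml
# Illustrative sketch only: label, taint and add-on names are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-addon
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-addon
  template:
    metadata:
      labels:
        app: example-addon
    spec:
      # Only schedule on nodes that belong to the system nodepool
      nodeSelector:
        nodepool: system
      # Tolerate the taint that keeps application workloads off these nodes
      tolerations:
        - key: dedicated
          operator: Equal
          value: system
          effect: NoSchedule
      containers:
        - name: example-addon
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
```

Application pods don't carry the toleration, so the taint keeps them off the system nodes, while the nodeSelector keeps the add-ons off the application nodes.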
Evaluation of add-on resources and scaling options
Historically we increased the resources of some components when they needed more memory. When adding new add-ons to our reference solution we sometimes defaulted to the upstream recommendations in order to guarantee stability, and when updating add-ons we didn't always re-evaluate whether their resource usage had dropped.
In combination with the rollout of the system nodepool we are also rolling out the revised resource requests. This has a big impact on the overall resource reservation of the cluster.
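Because the scheduler reserves capacity based on requests rather than actual usage, lowering over-sized requests directly frees up schedulable room on the nodes, even when real consumption stays the same. As a purely illustrative sketch of such a revision for a single add-on container (the numbers are made up for the example, not taken from a specific component):

```yaml
# Hypothetical before/after for one add-on's container resources.
resources:
  requests:
    cpu: 50m       # was 200m, the upstream default we initially adopted
    memory: 128Mi  # was 512Mi; observed usage stays well below this
  limits:
    memory: 256Mi  # memory limit kept as a safety net against runaway usage
```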
A concrete example:
The overall CPU reservation dropped from 55% to 32% and the memory reservation from 51% to 39%. This allowed us to go from a cluster of 3 m5.xlarge nodes to 3 m5.large nodes (half the vCPUs and memory per node), thereby halving the operational cost of the cluster.
Cluster usage before optimisations:

```
ip-10-12-166-51.eu-west-1.compute.internal    cpu     █████████████████████████░░░░░░░░░░ 72% (38 pods)  m5.xlarge  -  -  Ready
                                              memory  █████████████████████████░░░░░░░░░░ 73%
ip-10-12-147-220.eu-west-1.compute.internal   cpu     ████████████████████░░░░░░░░░░░░░░░ 56% (30 pods)  m5.xlarge  -  -  Ready
                                              memory  ███████████░░░░░░░░░░░░░░░░░░░░░░░░ 32%
ip-10-12-135-162.eu-west-1.compute.internal   cpu     █████████████░░░░░░░░░░░░░░░░░░░░░░ 36% (16 pods)  m5.xlarge  -  -  Ready
                                              memory  █████████████████░░░░░░░░░░░░░░░░░░ 48%
```
Cluster usage after optimisations:

```
ip-10-12-147-220.eu-west-1.compute.internal   cpu     ██████████████░░░░░░░░░░░░░░░░░░░░░ 41% (43 pods)  m5.xlarge  -  -  Ready
                                              memory  █████████████░░░░░░░░░░░░░░░░░░░░░░ 36%
ip-10-12-135-162.eu-west-1.compute.internal   cpu     █████████████░░░░░░░░░░░░░░░░░░░░░░ 36% (28 pods)  m5.xlarge  -  -  Ready
                                              memory  █████████████████░░░░░░░░░░░░░░░░░░ 47%
ip-10-12-168-169.eu-west-1.compute.internal   cpu     ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 17% (14 pods)  m5.xlarge  -  -  Ready
                                              memory  ████████████░░░░░░░░░░░░░░░░░░░░░░░ 35%
```