Migrating from OpenTofu to Flux: A leap forward in our Kubernetes management

In this post, we’re excited to announce our plans to shift the management of Kubernetes (K8s) add-ons from our existing OpenTofu-based approach (using Terragrunt and Concourse CI) to Flux. This change is part of our broader effort to simplify our platform management, improve reliability, and enhance visibility for both our internal teams and our customers. Below, we’ll cover what’s happening, why we’re making these changes, how you’ll benefit, and what to expect during and after the migration.

Some history

When we started, we managed our Kubernetes deployments via Terraform with manual applies. Over time, we introduced Concourse CI to automate these manual processes, but that automation eventually became a bottleneck:

  • Rollouts have become increasingly time-consuming and migrations increasingly painful
  • As we keep growing and managing more and more environments, scaling our CI/CD system has become increasingly hard
  • We find ourselves fighting our own automation to do things that should be straightforward (like upgrading Karpenter from v0.x to v1.x)
  • “State” is managed in several systems, which adds extra complexity and causes problems during migrations
  • Drift between what is running on-cluster and what is recorded in Terraform state has also become an issue whenever we do rollouts

Back in 2021 we had already identified some of these struggles and explored possible solutions, but lacked the time and a concrete plan to address them fully. In 2022–2023 we started on an ambitious plan to migrate our cluster lifecycle to Crossplane (and Flux); however, due to internal changes, progress on rebuilding our automation from scratch has been very slow.

Fast-forward to 2024, and we’ve spent a huge amount of time on a relatively “simple” migration (Karpenter v0.x to v1.x) precisely because our automation was so fragile. This migration made us realise that we needed a more robust, Kubernetes-native approach to managing cluster resources, and that we needed it fast: one that would simplify our workflows, reduce human error, and give us a healthier path toward future expansion.

Why Flux

We needed a system that we could start implementing as soon as possible, that addresses the pain points above, and that can evolve alongside the plans we still have for future platform iterations (Crossplane). Therefore, we landed on Flux:

  • Flux is natively integrated into the Kubernetes ecosystem.
  • Changes are “pulled” and rolled out by each cluster individually, instead of via increasingly heavy “push” operations from our CI/CD system. Flux is also a continuous auto-reconciliation system, constantly synchronizing the actual state with the intended state.
  • We can re-use a lot of what we already have, initially switching OpenTofu from a deployment mechanism to a templating engine for the Flux manifests (see the sketch right after this list). There’s no need to rewrite all the logic we have in place. We also still use OpenTofu for managing (AWS) cloud resources, like the VPC, EKS cluster, IAM policies, and bootstrapping Flux.
  • It can live alongside the Crossplane plans we still have, and even simplifies that migration considerably. On top of that, we already have quite some experience using Flux for internal purposes, and we’ve recently also been guiding customers in using GitOps and Flux for application workload CD.
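
To make the pull-based model a bit more concrete, below is a minimal sketch of the kind of Flux Kustomization our OpenTofu templating could render for a single add-on. The names and paths are purely illustrative, not our actual layout; the point is that the in-cluster controllers keep pulling and re-applying this on their own schedule, while OpenTofu only renders and commits the file.

    # Illustrative sketch: a Flux Kustomization that the in-cluster controllers
    # reconcile on their own schedule, pulling from Git instead of being pushed
    # to by a central CI system. Names and paths are hypothetical.
    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: karpenter                  # hypothetical add-on
      namespace: flux-system
    spec:
      interval: 10m                    # re-check Git and re-apply every 10 minutes
      path: ./flux/system/my-cluster/kube-system/karpenter
      prune: true                      # remove resources that disappear from Git
      sourceRef:
        kind: GitRepository
        name: flux-system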

However, since Flux needs to bootstrap and manage the whole cluster platform, including our autoscaler Karpenter, we’d run into a chicken-and-egg situation: Flux can’t run before it has deployed Karpenter to provision worker nodes, yet it needs somewhere to run in the first place. Therefore we’ve decided to run the Flux controllers on AWS Fargate, similar to how we run Karpenter on Fargate today.
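
For the curious, the sketch below shows one way this can be wired up (a simplified illustration, not our exact configuration): an EKS Fargate profile that selects the flux-system namespace (created via OpenTofu, not shown here), combined with a kustomize patch on the bootstrap manifests that sets explicit resource requests, since Fargate sizes and bills pods based on those requests.

    # flux-system/kustomization.yaml (illustrative): patch all Flux controllers
    # with explicit resource requests. The Fargate profile that matches the
    # flux-system namespace is assumed to be managed separately in OpenTofu.
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - gotk-components.yaml
      - gotk-sync.yaml
    patches:
      - target:
          kind: Deployment
          labelSelector: app.kubernetes.io/part-of=flux
        patch: |
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: all-flux-controllers   # placeholder; kustomize applies this to every matched Deployment
          spec:
            template:
              spec:
                containers:
                  - name: manager
                    resources:
                      requests:
                        cpu: 250m        # Fargate rounds requests up to the nearest supported pod size
                        memory: 512Mi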

Another hurdle was dealing with secrets and other sensitive data, considering all Flux configuration is committed to a (private) Git repository. For encrypting our OpenTofu “secrets” we’ve been relying on KMS data sources within the code. Increasingly we’ve also been leveraging SOPS (with the same KMS keys), as it integrates nicely with our Terragrunt workflows, and luckily SOPS integrates just as nicely with Flux. Choosing SOPS for secret encryption, over e.g. an external secrets system, was easy thanks to its simplicity and the fact that it integrates at no extra cost.
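
As a rough sketch of how the pieces fit together (two separate files shown below; the key ARN, paths and names are placeholders, not our real setup): SOPS encrypts the secret values in Git with a KMS key, and Flux’s kustomize-controller decrypts them at apply time using its own AWS credentials.

    # .sops.yaml (illustrative): encrypt Secret manifests in the repo with an
    # existing KMS key. The ARN below is a placeholder.
    creation_rules:
      - path_regex: .*\.enc\.yaml$
        encrypted_regex: ^(data|stringData)$
        kms: arn:aws:kms:eu-west-1:111122223333:key/00000000-0000-0000-0000-000000000000
    ---
    # The Flux Kustomization that applies those files simply enables SOPS decryption:
    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: example-addon              # hypothetical
      namespace: flux-system
    spec:
      interval: 10m
      path: ./flux/system/my-cluster/example-namespace/example-addon
      prune: true
      sourceRef:
        kind: GitRepository
        name: flux-system
      decryption:
        provider: sops                 # kustomize-controller decrypts using its KMS/IAM access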

What are the benefits?

  1. Continuous Auto-Reconciliation

    Instead of pushing changes from a central CI system, each cluster runs its own Flux controllers. They “pull” configuration changes from the Git repository, significantly reducing overhead on our central systems.

    If anyone (us or you) changes something manually, Flux will notice the drift and revert the configuration, unless reconciliation is intentionally paused for debugging or a migration (see the sketch after this list).

  2. Git as source of truth

    All Flux configurations live in your Git repository. This increases transparency about what is deployed to your environment(s). Your Git repository is your source of truth.

  3. Easier Off-Boarding

    If, for any reason, you choose to end our cooperation, you can continue using the same Flux manifests to run and maintain your cluster with minimal extra work.

  4. Increased Visibility

    With Git as the single source of truth, there’s a clear record of who changed what and when. This level of clarity makes auditing, debugging, and compliance checks far more straightforward.
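
To illustrate the “unless reconciliation is intentionally paused” part of point 1: pausing is just a field on the Flux object (the flux CLI offers an equivalent suspend/resume command). A minimal, hypothetical example:

    # Illustrative: temporarily pausing reconciliation for one add-on so manual
    # changes aren't reverted while we debug or migrate something.
    apiVersion: kustomize.toolkit.fluxcd.io/v1
    kind: Kustomization
    metadata:
      name: example-addon              # hypothetical add-on
      namespace: flux-system
    spec:
      suspend: true                    # Flux stops reconciling until this is set back to false
      interval: 10m
      path: ./flux/system/my-cluster/example-namespace/example-addon
      prune: true
      sourceRef:
        kind: GitRepository
        name: flux-system

Once the suspend field is removed (or set back to false), drift correction resumes automatically.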

What will you notice?

  1. Repository Structure Changes

    Our system will start creating or updating files in your Skyscrapers-managed Git repository. You’ll notice a new folder hierarchy for Flux, such as:

    flux/
    ├── system/
    │   └── <cluster_name>/
    │       ├── <namespace>/
    │       │   └── <addon_name>/...
    │       ├── DO_NOT_EDIT
    │       ├── flux-apps-alerts.yaml
    │       ├── flux-apps-ns.yaml
    │       └── flux-system-alerts.yaml
    └── clusters/
        └── <cluster_name>/
            ├── flux-system/...
            ├── DO_NOT_EDIT
            └── system.yaml
    

    Note: these files are auto-generated; direct manual changes to them won’t stick, as the files are regenerated automatically whenever our CI/CD pipelines trigger.

  2. Notifications & Slack Alerts

    During deployment, Flux can notify us (and optionally you) via Slack or other channels about the status of each reconciliation run (see the sketch after this list). You’ll be better informed about what’s happening on your clusters, and you’ll see fewer unpredictable manual changes.

  3. Expected Cost Change

    Running Flux on Fargate does incur a monthly cost, roughly $36/month per cluster (region-dependent). However, we believe the operational benefits in terms of reliability, simplicity, and minimized downtime far outweigh this expense.

  4. No Manual Adjustments in flux/system and flux/clusters folder

    As mentioned above, the Flux manifests are automatically (re)generated. While you can still manage your own workloads, the cluster infrastructure components managed by us are declared in the auto-generated Flux files. Modifying them manually will result in our automation overwriting your changes the next time we push an update to your Git repository. If you need a tweak, just let us know. Since making unaudited changes here can have disastrous results, we will also introduce main branch protection.
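
As a sketch of the Slack notifications mentioned under point 2: Flux’s notification-controller is configured with a Provider and one or more Alert objects. The names, channel and webhook secret below are illustrative, not necessarily what the generated flux-*-alerts.yaml files contain.

    # Illustrative: post failed reconciliations to Slack via the notification-controller.
    apiVersion: notification.toolkit.fluxcd.io/v1beta3
    kind: Provider
    metadata:
      name: slack
      namespace: flux-system
    spec:
      type: slack
      channel: platform-alerts          # hypothetical channel
      secretRef:
        name: slack-webhook-url         # Secret holding the webhook address (can be SOPS-encrypted)
    ---
    apiVersion: notification.toolkit.fluxcd.io/v1beta3
    kind: Alert
    metadata:
      name: example-alerts              # hypothetical
      namespace: flux-system
    spec:
      providerRef:
        name: slack
      eventSeverity: error              # only alert on failed reconciliations
      eventSources:
        - kind: Kustomization
          name: '*'
        - kind: HelmRelease
          name: '*'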

High-Level Roadmap

Our migration will follow a phased approach:

  • In the first phase (where we are now) we focus on pulling all Kubernetes resources out of OpenTofu, migrating them to Flux manifests, and consolidating what’s left to manage in OpenTofu (cleanup).
  • In the second phase we’ll evaluate whether to move away from OpenTofu to something better suited as a templating engine for rendering the Kubernetes manifests.
  • In the third phase we’ll continue with further platform iterations and our Crossplane roadmap.

Where are we today?

So far we’ve made all the preparations needed to start this transition. We have configured the Flux controllers to run on AWS Fargate, with appropriate label selectors and resource sizing for stable reconciliation at minimal cost. We have updated our CI/CD pipelines to generate manifests and commit them via pull requests to each cluster’s repository, which we then (for now) manually verify and merge; Flux takes care of the actual rollout. We have already tested this approach on clusters that had Flux enabled for managing application-level workloads.

Now that everything is ready, we can formally announce our plans (hence this post) and start rolling out to our complete customer base. We aim to have a fully operational Flux-based system (phase 1 completed) for all customers by May 2025, with a deadline before the summer vacation period, to make sure everything is stable and working.

During this period, we’ll keep you informed of our progress via our changelog posts.

Final Thoughts

Our goal is to give you faster, more reliable, and more transparent operations on the Kubernetes platforms we manage. By switching to Flux, we’re eliminating much of the complexity of our old OpenTofu + Concourse approach to managing Kubernetes add-ons. We firmly believe that GitOps, and Flux in particular, is a major leap forward for both us and our customers.

Although most of this happens behind the scenes without direct impact on our customers, we appreciate your support and patience as we make this transition. If you have any questions or concerns, please reach out to us. We’re excited to streamline our processes and continue delivering reliable platforms for your applications.

Stay tuned for more updates as we progress this roadmap!