We use Velero to back up complete K8s cluster workloads (both K8s resources and Persistent Volumes).
However, we discovered and resolved two bugs in our implementation that could cause K8s resource backups to fail:
- An error in the policies that enforce encryption of objects stored in S3, which caused uploads to fail
- Overly strict resource limits, which caused the Velero container to get OOMKilled during backup and/or cleanup
In both cases, backups of Persistent Volumes always completed successfully (via EBS snapshots). Only backups of K8s resources such as Deployments and ConfigMaps failed.
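
If you want to check whether a container was affected by the OOMKilled issue, the pod status is the place to look. The following is a minimal sketch, not part of our tooling: it assumes you have the Pod object as a parsed dictionary (for example from `kubectl get pod <name> -o json`), and the function name and example data are hypothetical. It relies only on the standard Pod status fields `containerStatuses[].state` and `containerStatuses[].lastState`.

```python
from __future__ import annotations


def oom_killed_containers(pod: dict) -> list[str]:
    """Return the names of containers whose current or last state is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A container that was killed and restarted reports the kill under
        # "lastState"; one that is still terminated reports it under "state".
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names


# Hypothetical, trimmed-down Pod status for illustration only.
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "velero",
                "state": {"running": {}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    }
}

print(oom_killed_containers(example_pod))
```

An empty list means none of the containers in that pod report an OOM kill in their current or previous state.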
Check out our documentation on how you can interact with Velero yourself to create and/or restore backups.
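
As a rough illustration of what such an interaction can look like, the sketch below builds the Velero CLI invocations for creating a backup and restoring from it. It only assembles and prints the commands; it does not run them. The backup name and namespace are hypothetical placeholders, the helper functions are ours for illustration, and our documentation remains the authoritative reference for the exact workflow on our clusters.

```python
from __future__ import annotations

import shlex


def backup_command(name: str, namespaces: list[str] | None = None) -> list[str]:
    """Build a `velero backup create` command, optionally limited to namespaces."""
    cmd = ["velero", "backup", "create", name]
    if namespaces:
        # Velero accepts a comma-separated list of namespaces here.
        cmd += ["--include-namespaces", ",".join(namespaces)]
    return cmd


def restore_command(backup_name: str) -> list[str]:
    """Build a `velero restore create` command that restores from a backup."""
    return ["velero", "restore", "create", "--from-backup", backup_name]


# Hypothetical names, for illustration only.
print(shlex.join(backup_command("manual-backup", ["my-namespace"])))
print(shlex.join(restore_command("manual-backup")))
```

The returned lists can be passed to `subprocess.run` as-is if you prefer to script the calls rather than type them.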