Adjusting the NodeFilesystemSpaceFillingUp Prometheus alert

You may have seen the NodeFilesystemSpaceFillingUp alert fire from time to time. It triggers when Prometheus predicts, based on the trend over the last few hours, that a node’s disk will run out of space.
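As a rough illustration of that kind of prediction, the sketch below fits a straight line through recent free-space samples and extrapolates it forward, similar in spirit to PromQL's predict_linear function. The function name, the sample data, and the 4-hour horizon are assumptions made for this example, not the alert's actual configuration.

```python
def predict_linear_sketch(samples, horizon_seconds):
    """Fit a least-squares line through (timestamp_seconds, value) samples
    and return the value extrapolated horizon_seconds past the newest sample.

    Simplified illustration only; this is not Prometheus's implementation.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    newest_t = max(t for t, _ in samples)
    return intercept + slope * (newest_t + horizon_seconds)


# Hypothetical data: free space (GiB) sampled hourly, dropping 10 GiB per hour.
samples = [(0, 50.0), (3600, 40.0), (7200, 30.0), (10800, 20.0)]
predicted_free = predict_linear_sketch(samples, 4 * 3600)
will_run_out = predicted_free < 0  # a negative prediction means "disk full within the horizon"
```

With these made-up samples, the extrapolated free space four hours ahead is negative, which is the kind of trend the alert reacts to.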

Usually, the alert fires when many new deployments roll out at the same time, or when a new node is launched. In both cases the node pulls all the Docker images it needs within a short period, which fills up a considerable amount of disk space.

This is not always a problem, however: the kubelet runs a garbage collector (GC) when disk usage goes over 85%, removing old Docker images that are no longer in use and consequently freeing up disk space.

Until now, the NodeFilesystemSpaceFillingUp (critical) alert could fire before the kubelet GC had a chance to run, even though there was no actual problem. We’ve adjusted the alert’s threshold so that it triggers after the kubelet GC runs, so if it fires now, it probably signals a real problem with the node’s disk space.
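To make the effect of the change concrete, here is a minimal sketch of an alert condition that combines a usage threshold with the fill-up prediction. The function name and both threshold values are hypothetical, since this post does not state the actual numbers. Only the 85% kubelet GC trigger comes from the text above.

```python
KUBELET_IMAGE_GC_PERCENT = 85  # disk usage above which the kubelet GC runs (from the text)

# Hypothetical thresholds, chosen only to illustrate the before/after behavior.
OLD_MIN_USED_PERCENT = 80                        # below the GC trigger: can fire too early
NEW_MIN_USED_PERCENT = KUBELET_IMAGE_GC_PERCENT  # at the GC trigger: GC gets to run first


def should_fire(used_percent, predicted_free_bytes, min_used_percent):
    """Fire only when the disk is already fuller than min_used_percent
    AND the trend predicts it will run out of space."""
    return used_percent > min_used_percent and predicted_free_bytes < 0


# A node at 82% usage with a steep fill-up trend (predicted free space below zero):
fired_before = should_fire(82, -1, OLD_MIN_USED_PERCENT)  # True: fires before GC could help
fires_now = should_fire(82, -1, NEW_MIN_USED_PERCENT)     # False: GC has not had its chance yet
```

If usage keeps climbing past the GC trigger and the trend still points to a full disk, the condition holds under the new threshold as well, which is the case worth investigating.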