Post Mortem - Loki log loss

After deploying our Grafana Loki refactor, several issues surfaced that cascaded into a loss of logs for up to 12 hours on 22/02 or 23/02, depending on the cluster. All environments using Loki as their main logging provider were affected. Environments logging to other systems such as CloudWatch Logs and Elasticsearch were not affected.

Below you will find more details on the events leading up to this problem and the actions we have taken.

Timeline

  • 20/02: We discovered that, on one of our clusters, the loki-write pods were failing to write the index entries for new chunks. Loki’s table-manager, the component responsible for rotating the DynamoDB index tables, was no longer creating new tables.
  • 21/02: We decided to move up our timeline for switching from DynamoDB to BoltDB-shipper for storing the Loki index. We deployed a hotfix to all clusters that moved the index cutover from 01/03 to 22/02 (see the configuration sketch after this timeline).
  • 22/02: We got alerts on some clusters that the loki-write persistent volume was filling up. Upon inspection we noticed that chunk data was still failing to be written to the index. We had to delete the current WAL (Write-Ahead Log), which held the failed, uncommitted transactions.
  • 23/02: We noticed that all clusters were affected and proceeded to remove the current WAL from all of them. This caused a loss of logs for a period of at most 12 hours.
  • 23/02: We confirmed that all clusters were processing logs correctly again and that there were no further problems replaying the WAL.
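
For reference, the cutover deployed on 21/02 amounts to adding a new period to Loki’s schema configuration and moving its start date forward. The sketch below is a minimal, hypothetical example of such a change, assuming S3 for chunk storage, schema v11, and a weekly DynamoDB index period; the prefixes, dates (the year is assumed), and store names should be checked against the actual deployment rather than copied from here.

```yaml
# Hypothetical sketch of the index cutover in the Loki schema config (values are illustrative).
schema_config:
  configs:
    # Old period: DynamoDB-backed index, rotated by the table-manager.
    - from: "2021-06-01"
      store: aws-dynamo          # DynamoDB index store
      object_store: s3           # chunks stored in S3
      schema: v11
      index:
        prefix: loki_index_
        period: 168h             # weekly index tables
    # New period: BoltDB-shipper index; the hotfix moved this date up from 01/03 to 22/02.
    - from: "2022-02-22"
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h              # boltdb-shipper works with 24h index periods
```

With a weekly period like the one sketched above, the table-manager is expected to pre-create the next index table (named from the prefix plus the period number, e.g. something like loki_index_2720 around late February) before the window starts; the incident began when that pre-creation stopped happening.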

Context

What went wrong

  • Loki’s DynamoDB index store uses a component called the table-manager to manage the DynamoDB tables that hold the log index. It uses a rolling-window mechanism: new tables are created periodically to hold the index for newer log entries. For a reason we have not yet determined, the table-manager did not create the latest DynamoDB table, so Loki was not able to index new logs. We initially detected this problem on just a few clusters, and instead of fixing it there, we decided to move up our timeline and switch to BoltDB-shipper sooner (22/02 instead of 01/03) on all our clusters, before any other cluster could run into the same table-manager situation.
  • Due to a bug, Loki did not handle the switch from the DynamoDB index to BoltDB-shipper as we expected. Therefore, from the time the DynamoDB index store stopped working until BoltDB-shipper took over, logs were not processed and were lost (at most 12 hours’ worth).
  • The Loki ingesters’ WAL and storage were not sized correctly, so the errors caused by the problem above cascaded into further problems. Since Loki was not able to write log entries because of the index-management problems, the WAL in the loki-write Pods kept growing until it filled up the persistent volume. In addition, every time the Pods restarted they tried (unsuccessfully) to replay the WAL. This replay operation uses considerable memory, roughly proportional to the WAL size, so some loki-write Pods were periodically OOMKilled while replaying the WAL, adding yet more stress to the system (see the WAL configuration sketch after this list).
  • Although we have several alerts in place for monitoring Loki, we did not initially catch the failed index writes (the cause). Only after several of the effects appeared did our monitoring alert us: dropped logs, full volumes, OOMKills, and so on.
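
The WAL part of this cascade is something we can mitigate directly through configuration, as noted in the follow-up actions below. The snippet below is a minimal sketch of the ingester WAL settings we are reviewing, assuming the WAL lives on the loki-write persistent volume mounted at /loki; the concrete values are illustrative, not our production settings.

```yaml
# Hypothetical sketch of the ingester WAL settings under review (values are illustrative).
ingester:
  wal:
    enabled: true
    dir: /loki/wal               # must sit on a volume sized for a worst-case backlog
    checkpoint_duration: 5m      # how often the WAL is checkpointed and truncated
    replay_memory_ceiling: 4GB   # during replay, flush to storage once this much memory
                                 # is in use instead of letting the Pod get OOMKilled
    flush_on_shutdown: true      # flush chunks on shutdown so restarts replay less data
```

In particular, replay_memory_ceiling is meant to break the restart loop described above: when a replay reaches the ceiling, the ingester flushes what it has replayed so far instead of holding it all in memory.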

Follow-up actions

  • We are revising the WAL configuration of Loki
  • We are revising our Loki monitoring and fixing our alerting on write errors (an example alert rule is sketched after this list)
  • As part of our short-term strategy, we are working to improve our component and platform roll-out processes, so that we can catch problems like this sooner and reduce the impact on our customers
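
As an illustration of the alerting follow-up, the rule below sketches one way to alert on Loki write-path errors, assuming Loki’s request metrics are scraped by Prometheus. The metric, route values, and thresholds are assumptions to validate against the deployed Loki version, and since the failed index writes in this incident surfaced asynchronously at flush time rather than at push time, a companion alert on the ingester’s flush/storage failure metrics is needed as well.

```yaml
# Hypothetical Prometheus alerting rule for Loki write-path errors
# (metric names, route labels and thresholds are assumptions, not a finished rule).
groups:
  - name: loki-write-errors
    rules:
      - alert: LokiWritePathErrors
        expr: |
          sum(rate(loki_request_duration_seconds_count{route=~"loki_api_v1_push|api_prom_push", status_code=~"5.."}[5m]))
            /
          sum(rate(loki_request_duration_seconds_count{route=~"loki_api_v1_push|api_prom_push"}[5m]))
            > 0.05
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: Loki write path is returning errors
          description: More than 5% of log push requests have failed over the last 15 minutes.
```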