After deploying our Grafana Loki refactor, several issues started appearing and cascaded into a loss of logs for a period of up to 12 hours, on 22/02 or 23/02 depending on the cluster. All environments using Loki as their main logging provider were affected. Environments logging to other systems, such as CloudWatch Logs and Elasticsearch, were not affected.
Below you will find more details on the events leading up to this problem and the actions we took.
Timeline
- 20/02: We discovered that the `loki-write` pods on one of our clusters were failing to store chunk data in their indices. Loki's `table-manager`, responsible for rotating the DynamoDB index tables, was no longer creating new tables.
- 21/02: We decided to move up our timeline for switching from DynamoDB to BoltDB-shipper for storing Loki indices. We deployed a hotfix to all clusters to move the index cutover to 22/02 instead of 1/03.
- 22/02: We got alerts on some clusters that the `loki-write` persistent volume was filling up. Upon inspection we noticed that chunk data was failing to be written to the indices. We had to delete the current WAL (Write Ahead Log), which contained the failed, uncommitted transactions.
- 23/02: We noticed that all clusters were affected, and we proceeded to remove the current WAL from all clusters. This caused a loss of logs for a period of up to 12 hours.
- 23/02: We confirmed that all clusters are now processing logs correctly and that there are no more problems processing the WAL.
Context
- Why we switched from DynamoDB to BoltDB-shipper: the table manager (used with the DynamoDB-shipper) is being phased out by Loki, and the BoltDB-shipper index management is a more robust and battle-tested system, so we expect problems like this to be less likely in the future. A sketch of how the index cutover is configured follows this list.
- Loki log ingesters use a Write Ahead Log (WAL): incoming log entries are first written to local disk, so they can be replayed after a crash or restart, before being flushed to the index and chunk store.
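For reference, the index cutover is driven by Loki's `schema_config`: a new period config with `store: boltdb-shipper` and a `from` date is appended, and Loki switches index stores on that date. The snippet below is a minimal sketch of that change; the dates, schema version, prefix, and object store are illustrative and not the exact values from our clusters.

```yaml
schema_config:
  configs:
    # Existing period: index in DynamoDB, tables rotated by table-manager
    - from: 2020-01-01
      store: aws
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 168h
    # Hotfix: cutover to BoltDB-shipper moved up to 22/02 (instead of 1/03)
    - from: 2021-02-22
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
```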
What went wrong
- Loki's DynamoDB-shipper uses a component called table manager to manage the DynamoDB tables used for the log indices. It uses a rolling-window mechanism, so new tables get created periodically to handle the indices for newer log entries (a simplified illustration of this naming scheme follows this list). For a reason we have not yet determined, table manager did not create the latest DynamoDB table, so Loki was not able to process new logs. We initially detected this problem on just a few clusters, and instead of fixing it, we decided to move up our timeline and switch to BoltDB-shipper sooner (22/02 instead of 1/03) on all our clusters, before any of the others could reach the same situation with table manager.
- Loki did not handle the switch from the DynamoDB index to the BoltDB-shipper index as we expected, due to a bug. Therefore, from the time the DynamoDB-shipper stopped working until the BoltDB-shipper took over, logs were not processed and were lost (up to 12 hours).
- The Loki ingesters' WAL and storage were not sized correctly, so the errors caused by the above problem cascaded into other problems. Since Loki was not able to write log entries due to the index management problems, the WAL in the `loki-write` Pods kept growing until it filled up the persistent volume. Also, every time the Pods restarted, they tried (unsuccessfully) to replay the WAL. This replay operation uses considerable memory, proportional to the WAL size. Because of this, some `loki-write` Pods were periodically OOMKilled while replaying the WAL, adding yet more stress to the system (see the WAL configuration sketch after this list).
- Although we have several alerts in place for monitoring Loki, we didn't initially catch the errored transactions (the cause). Only after several effects started popping up did our monitoring alert us: dropped logs, full volumes, OOMKills, …
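To make the rolling-window mechanism concrete: periodic index tables are named by appending a period number (the timestamp divided by the table period) to a configured prefix, so writes for a new period target a table that table manager must have created in advance. The sketch below is a simplified, hypothetical reconstruction of that naming scheme, not Loki's actual code; the prefix and period are illustrative values.

```python
from datetime import datetime, timezone

# Illustrative values; in Loki these come from schema_config's index settings.
PREFIX = "loki_index_"
PERIOD_SECONDS = 7 * 24 * 3600  # one index table per week

def active_table(ts: datetime) -> str:
    """Name of the periodic index table that covers the given timestamp."""
    period_number = int(ts.timestamp()) // PERIOD_SECONDS
    return f"{PREFIX}{period_number}"

# When table manager stops creating tables, writes for the new period target a
# table that does not exist yet, so every index write for new logs fails.
print(active_table(datetime(2021, 2, 20, tzinfo=timezone.utc)))  # -> loki_index_2668
```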
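For the WAL sizing issue, the relevant knobs live under the ingester's `wal` block (assuming a recent Loki 2.x release). The values below are a sketch, not our production configuration; in particular, `replay_memory_ceiling` bounds the memory used while replaying a large WAL after a restart, which is what was getting our `loki-write` Pods OOMKilled.

```yaml
ingester:
  wal:
    enabled: true
    dir: /loki/wal            # must sit on a persistent volume sized for a worst-case backlog
    flush_on_shutdown: true
    # Cap the memory used while replaying the WAL after a restart: once the
    # ceiling is hit, Loki flushes to storage and continues the replay instead
    # of growing memory proportionally to the WAL size.
    replay_memory_ceiling: 4GB
```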
Follow-up actions
- We are revising the WAL configuration of Loki
- We are revising our Loki monitoring and fixing our alerts on write errors, so that we catch the cause instead of only the effects (an example alert sketch follows this list)
- We are improving our component and platform roll-out processes, as part of our short-term strategy, so that we can catch problems like this sooner and reduce the impact on customers
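As an example of alerting on the cause rather than the effects, the sketch below shows the shape such Prometheus rules could take. It is a hypothetical starting point, not a drop-in rule: the failure counter (`cortex_dynamo_failures_total` here) depends on the Loki version and index store in use, and the PVC name pattern is an assumption.

```yaml
groups:
  - name: loki-write-path
    rules:
      # Cause: the write path is failing to store index entries / chunks.
      # Metric name is an assumption; use the failure counter your Loki
      # version exposes for the backing store you run.
      - alert: LokiIndexStoreWriteFailures
        expr: sum(rate(cortex_dynamo_failures_total[5m])) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Loki is failing to write index entries to the backing store"

      # Effect, caught early: the loki-write WAL volume is filling up.
      # The PVC name pattern below is illustrative.
      - alert: LokiWriteWALVolumeFillingUp
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-loki-write-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"data-loki-write-.*"} < 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "loki-write WAL persistent volume has less than 20% free space"
```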