Some of our customers have experienced performance-related limitations in our Loki & Fluent-bit setup, mainly on queries that need to scan a large volume of data. At the moment we run Loki in monolithic mode, so a single Loki Pod performs all the roles (ingester, querier, query-scheduler, frontend, …). This setup works well enough for most of our customers, but it shows its limits on large-volume queries, which require some brute force and would benefit from more parallelism.
The end goal is to move towards a more scalable Loki setup, where each of the roles runs separately and can scale based on demand. That way, a large-volume query would automatically be split into smaller chunks processed in parallel by multiple querier Pods, and would finish in a fraction of the time. This is on our roadmap and will be implemented in the future.
In the meantime, we’ve updated some of the Loki and Grafana settings so that large-volume queries can be processed without failing. They will take longer to run, but they should eventually return successfully.
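As a rough illustration, these are the kinds of timeouts involved; the exact keys and values depend on the Loki version and on your environment, so treat this as a sketch rather than our exact configuration:

```yaml
# loki-config.yaml (excerpt) -- illustrative values only
server:
  # Allow long-running HTTP requests so large queries are not cut off
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s

limits_config:
  # Upper bound on how long a single query may run
  query_timeout: 300s
```

On the Grafana side, the matching knob is usually the data proxy timeout (for example the `timeout` setting in the `[dataproxy]` section of `grafana.ini`), which also needs to be large enough for these queries to complete.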
In addition, we’ve made some of the Fluent-bit parameters configurable, to make the pipeline more customizable. For example, we can now enable merging of JSON log content into Fluent-bit record keys, so you can use data from within your JSON logs to add filters or additional Loki labels.
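In Fluent-bit terms, this is typically done through the Kubernetes filter’s `Merge_Log` options; the snippet below is a minimal sketch of that filter, not our exact pipeline definition:

```
[FILTER]
    Name            kubernetes
    Match           kube.*
    # Parse the "log" field as JSON and merge its keys into the record
    Merge_Log       On
    # Put the parsed keys under a dedicated key to avoid collisions
    Merge_Log_Key   log_processed
    # Drop the original raw "log" field once it has been parsed
    Keep_Log        Off
```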
Loki labels are the ones you use between curly brackets {...} in your queries, and it can be useful to derive additional ones from your log entries, so you can write simpler queries and/or scan through less data on each query. But be cautious, as adding some labels can be counterproductive, especially high-cardinality labels (the ones with a large number of distinct values). We recommend reading this post from Grafana about Loki labels, which is quite useful for understanding how they work.
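To make the difference concrete, here is a small LogQL sketch (the label and field names are made up for illustration). The first query has to parse JSON at query time to filter on a field, while the second uses a promoted label so Loki only reads the matching streams:

```logql
# Without a "level" label: parse JSON at query time and filter on a field
{namespace="payments", app="checkout"} | json | level="error"

# With "level" promoted to a Loki label: far less data is scanned
{namespace="payments", app="checkout", level="error"}
```

In this example, promoting `level` is reasonable because it only has a handful of values; promoting something like a request ID would create a huge number of streams and hurt performance.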
Don’t hesitate to contact us if you have any questions about this or if you’d like to explore any of these customizations.