Adding Prometheus monitoring for Elasticsearch on ECS

Our ECS monitoring solution now supports monitoring Elasticsearch clusters using Elasticsearch Exporter, Prometheus and AlertManager, so we get notified via Slack (critical and warning alerts) and via OpsGenie (critical alerts only) of any issues with Elasticsearch. This is similar to the functionality already available to customers running on the Kubernetes platform.
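The severity-based routing described above can be sketched in a few lines of Python. This is only an illustration of the rule (warnings go to Slack, criticals go to both Slack and OpsGenie), not our actual AlertManager configuration; the receiver names and the `receivers_for` helper are hypothetical.

```python
from typing import List

# Hypothetical receiver names; the real AlertManager receivers are not shown in this post.
SLACK = "slack"
OPSGENIE = "opsgenie"


def receivers_for(severity: str) -> List[str]:
    """Return the notification targets for a given alert severity."""
    severity = severity.lower()
    if severity == "critical":
        # Critical alerts go to Slack and also page via OpsGenie.
        return [SLACK, OPSGENIE]
    if severity == "warning":
        # Warnings are only posted to Slack.
        return [SLACK]
    # Any other severity is not routed in this sketch.
    return []


if __name__ == "__main__":
    for level in ("critical", "warning", "info"):
        print(level, "->", receivers_for(level))
```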

Move to the AWS-provided Kibana

We’re in the process of removing our own Kibana deployment from all staging clusters and replacing it with the AWS-provided Kibana that comes with the AWS Elasticsearch Service. Production clusters will follow.

More …

Increased monitoring alerts visibility

Over the coming days we’re going to roll out some changes to how Kubernetes monitoring notifications are delivered. From now on, all notifications coming from the production k8s monitoring system will be shown in our shared Slack channel, that is, the channel we share with each of our customers. The current notification channels will continue to work as before. Here’s an overview of how notifications will work:

More …

Upgrade Kubernetes components

We are in the process of upgrading the components of our staging Kubernetes clusters to the latest stable releases. Production clusters will follow in 1 to 2 weeks (to be announced), after we have confirmed there are no issues with our customers’ workloads.

More …

Improved monitoring alerts on Slack

We have updated the format of the monitoring Slack notifications. You might have already noticed that the monitoring messages in your Slack channels are now more structured and contain more useful information. We’ve already started rolling out the changes to staging clusters and we’ll start rolling them out to production clusters during this week.

More …

MongoDB monitoring and dashboards

We have updated the clusters to support MongoDB monitoring, alerts and dashboards. If you have a MongoDB cluster, you will see that there is now a MongoDB dashboard in Grafana and that we have added MongoDB-specific alert rules to Prometheus.

Improved etcd backups

We’ve upgraded all the k8s clusters with a new etcd backup implementation. The old backup solution relied on daily snapshots taken by a service running on the master nodes.

More …

Upgrade Vault to 1.0.1

A Vault upgrade for our setups was long overdue. We’ve upgraded our Vault installation tools from version 0.9.3 to 1.0.1, which is the latest Vault version available at the moment. Because Vault is set up in HA mode, the downtime during the upgrade will be minimal, normally between half a second and a couple of seconds, which is the time the fail-over takes. The upgrade procedure to achieve that minimal downtime is the following:

More …