A proper monitoring system is crucial in any environment, whether monolithic or microservice-based. We chose Prometheus for its versatility: we wanted to monitor both server resources and the services running on top of them, such as Kafka. The large set of metrics that Kafka makes available through JMX exporters by default made the decision even easier.
In simple terms, Prometheus is a metrics scraper and time-series store that provides quick insights into your environment. It records every Kafka metric exposed by the JMX exporters, and Grafana presents them as graphs. Beyond Kafka itself, the collected metrics cover CPU load, disk space usage, memory usage, server and service status, and much more.
Prometheus scrapes each target defined in its configuration file every minute (the interval is configurable) and saves the collected metrics locally. Kafka exposes an extensive list of metrics by default. One essential example is consumer lag, which indicates how many messages a consumer is behind the producer.
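To make the consumer lag metric concrete, here is a minimal Python sketch of the arithmetic behind it. It is not how the exporter computes the value internally; the function name, the offsets, and the choice to treat a partition with no committed offset as fully unread are all illustrative assumptions.

```python
from typing import Dict


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Return per-partition lag: newest offset in the log minus the committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition without a committed offset counts as fully unread.
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero because the two offsets are sampled at slightly different times.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 980, 2: 2100}
committed = {0: 1450, 1: 980, 2: 1800}

per_partition = consumer_lag(end_offsets, committed)
total_lag = sum(per_partition.values())
print(per_partition, total_lag)
```

A lag that keeps growing on one partition usually means a consumer is stuck or too slow, which is why this number is worth graphing and alerting on.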
Why Prometheus? The short answer: its power. It can collect hundreds of thousands of metrics per minute and store them for use in dashboards and alerting, and it does so quickly with modest computing resources. An invaluable addition to Prometheus is Alertmanager, developed by the same team and tightly integrated with it. Prometheus evaluates the gathered metrics against alerting rules and thresholds defined by the sysadmin, and when a threshold is exceeded it hands the alert to Alertmanager, which groups, routes, and delivers the notification. For example, if the Kafka service on a node goes down, Prometheus can no longer scrape its metrics, the corresponding rule fires, and Alertmanager promptly alerts the admin to investigate. Alerts can be forwarded to your alerting system to generate tickets, emails, or other notifications. You can define rules per service type and set up specific notifications accordingly. The best part is that no special agent is required on your client machines; as long as the metrics are exposed, Prometheus can scrape and store them.
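The "Kafka node goes down" case can be sketched in a few lines of Python. For every scrape target Prometheus records an `up` series that is 1 when the scrape succeeded and 0 when it failed, and its HTTP API exposes instant queries at `/api/v1/query`. The host name, port, job name, and sample response below are hypothetical, and in a real setup this condition would live in an alerting rule rather than in a script.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(response: dict, job: str) -> list:
    """Return the instances of `job` whose `up` sample is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # the sample value arrives as a string
        if labels.get("job") == job and float(value) == 0.0:
            down.append(labels.get("instance", "<unknown>"))
    return sorted(down)


# Hypothetical Prometheus address and job name.
url = instant_query_url("http://prometheus.example:9090", 'up{job="kafka"}')

# Hand-written response in the shape the API returns for an instant vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "kafka", "instance": "broker-1:7071"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "kafka", "instance": "broker-2:7071"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

kafka_down = down_instances(sample_response, "kafka")
print(url)
print(kafka_down)
```

Keeping the URL builder and the response parser separate means the parsing logic can be checked without a running Prometheus server.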
Grafana is a popular visualization tool that takes input from various data sources, such as Prometheus, InfluxDB, or SQL databases, and turns it into clear, easy-to-read graphs. In our project, we were fortunate that Confluent had already developed a dashboard collection for their Confluent Platform. This allowed us to enable JMX metric exports in our Ansible Playbooks for Confluent Platform Kafka, scrape the metrics with Prometheus, and then use Prometheus as a data source for Grafana. As a result, we have a dashboard that updates every minute, and we may shorten this interval in the future. Grafana also lets you export dashboards as JSON files, which makes them easy to share with others or import on new instances.
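Because an exported dashboard is plain JSON, it can be adjusted with a short script before it is imported elsewhere. The sketch below assumes the top-level `id`, `uid`, `title`, and `refresh` fields of Grafana's dashboard JSON model; the helper name and the sample dashboard are hypothetical and far smaller than a real export.

```python
import copy
import json


def prepare_for_import(dashboard: dict, refresh: str = "1m") -> dict:
    """Return a copy of an exported dashboard adjusted for import on another instance."""
    portable = copy.deepcopy(dashboard)
    # The numeric id is local to the Grafana instance that created the dashboard.
    portable["id"] = None
    # Auto-refresh interval of the dashboard, for example "1m" or "30s".
    portable["refresh"] = refresh
    return portable


# Hypothetical, heavily trimmed dashboard export.
exported = json.loads(
    '{"id": 42, "uid": "kafka-overview", "title": "Kafka Overview", '
    '"refresh": "1m", "panels": []}'
)

faster = prepare_for_import(exported, refresh="30s")
print(json.dumps(faster, indent=2))
```

Note that a shorter dashboard refresh only shows new data if Prometheus also scrapes the targets at least that often.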
In this short post, we may have oversimplified the use of Prometheus and Grafana, but their beauty lies in their simplicity and effectiveness. These tools do precisely what they are designed to do: provide outstanding monitoring.