Understanding the basics of the Prometheus / Grafana / Telegraf stack

In the last few days I tried to learn something that is totally new to me: monitoring. I haven’t drawn any deep conclusions yet, but here’s where I’m at.

Monitoring as a “thing”

Enough cha-cha-cha. Regarding monitoring:

We have many options: Prometheus, InfluxDB… (I can’t remember more, sorry; reading random Google findings isn’t all that didactic after all). Well, let’s pick one for now, stick with it, and try to understand how it works and what we can get out of it. In this post, we will be looking at Prometheus, Grafana, and Telegraf. I won’t show you a real-world example, because I haven’t actually dealt with and understood one completely yet. So let’s start with the BASICS. SIMPLE things.

Prometheus

Prometheus works primarily as a time-series database, saving timestamped metrics for you, each one identified by a name and a set of key/value label pairs. What do these metrics actually hold? What problems does it solve? Well, in the mainstream, people have been using it to hold mostly information about their systems (CPU, disk usage, I/O, etc.), but with enough tweaking you can also make your apps expose endpoints that serve custom metrics. You can also trigger alerts based on these metrics with Alertmanager, but that is a topic for another time. People don’t want to bother looking at screens all the time, hence the need for a robot friend to ping them (Slack, VictorOps, etc.). Like most databases, it can be queried, using its own query language, PromQL. It also ships with a modest web UI that isn’t really complete, which justifies the need for Grafana.
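Before moving on to Grafana, let’s make the “metrics” idea concrete: everything Prometheus scrapes is just plain text in its exposition format, a metric name, optional labels, and a value. Here is a made-up sketch of what an app’s /metrics endpoint could return (the metric name and labels are hypothetical):

# HELP orders_total Total number of orders processed
# TYPE orders_total counter
orders_total{status="ok"} 1027
orders_total{status="failed"} 3

Prometheus scrapes lines like these on an interval and stores each combination of name and labels as its own time series, ready to be queried.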

Grafana

Well, put it this way: it allows you to create dashboards using the metrics Prometheus allows it to read, giving you the opportunity to build the cockpit dashboard you probably won’t bother to look at as much as you should.

Telegraf

Our “producer”. Here are the basics: it’s highly customizable with plugins (input and output). You use input plugins to gather your system’s information, and output plugins to let some other “piece of software” scrape that information. In this case we will use Prometheus, although Telegraf is most commonly used with InfluxDB, another InfluxData product. It’s starting to get annoying having to deal with all these names, “tools”, rockstar companies and their .io websites. I miss the times when I naively thought all servers ran Linux and C was the language of the gods, but one has to grow, right? Let’s not lose focus.
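In telegraf.conf terms, the input/output split looks roughly like this. It is just a sketch of the idea, not the config we actually use (that one is shown further down):

# input plugin: gather CPU usage from the host
[[inputs.cpu]]

# output plugin: expose everything gathered on an HTTP endpoint for Prometheus to scrape
[[outputs.prometheus_client]]
  listen = ":9100"

One side collects, the other side serves; Telegraf is just the glue in between.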

Docker-compose stack:

version: "3"

services:
  prometheus:
    image: quay.io/prometheus/prometheus:v2.0.0
    volumes:
      - ./monitor/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command: --config.file=/etc/prometheus/prometheus.yml
    ports:
      - 9090:9090
    depends_on:
      - telegraf

  telegraf:
    image: telegraf:1.8
    volumes:
      - ./monitor/telegraf.conf:/etc/telegraf/telegraf.conf:ro
    ports:
      - 9100:9100

  grafana:
    image: grafana/grafana
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - 3000:3000
    depends_on:
      - prometheus

volumes:
  prometheus_data: {}
  grafana_data: {}

The same compose yada-yada as usual: exposing ports and using volumes to mount config files. You can find them here.

Run it with docker-compose -p Telegraf-Prometheus-Grafana up. Head over to Prometheus’s targets page (http://localhost:9090/targets). As you can see, Prometheus is collecting metrics from two places: our Telegraf service, and Prometheus itself. By default, it scrapes the /metrics endpoint of each target. Head over to the main page and execute a query based on some metric (provided by the input plugins). If you’re simply lazy, click here. Kinda basic, right? How did we manage to get Telegraf and Prometheus to communicate?

# monitor/prometheus.yml
scrape_configs:
  - job_name: telegraf
    scrape_interval: 15s
    static_configs:
      - targets: ['telegraf:9100']

# monitor/telegraf.conf
# Configuration for the Prometheus client output to spawn
[[outputs.prometheus_client]]
  # the address to serve metrics on; /metrics is exposed by default
  listen = "telegraf:9100"

As you can see, prometheus_client is an output plugin: it’s exposing information. If you dig into telegraf.conf you will also find the input plugins that allow Telegraf to read the system’s data, for example the cpu, disk, diskio, kernel and mem inputs.
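If you want something concrete to type into the Prometheus query box, the cpu input typically ends up exposed under names like cpu_usage_idle; that naming is how Telegraf’s prometheus_client output usually flattens measurement and field names, so treat it as an assumption and double-check your own /metrics output. For example:

# rough "how busy is this box" percentage, averaged over all cores
100 - cpu_usage_idle{cpu="cpu-total"}

If the expression returns data, the whole Telegraf to Prometheus pipe is working.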

Now, time to check Grafana. Log in with the usual two-word magic (admin / admin). Go here and add http://prometheus:9090 as a Prometheus data source. Now feel free to click buttons and such. Create your own dashboard; in the Metrics tab of the query thingy you will find the information that your Telegraf inputs are reading. It seems like magic.
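If you’d rather not click through the UI every time you recreate the containers, Grafana can also pick the data source up from a provisioning file. A minimal sketch, assuming you add a volume to the grafana service that mounts it under /etc/grafana/provisioning/datasources/ (the file name and host path are made up):

# monitor/grafana-datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true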

There is much more to talk about, and many more questions. How would this setup be built in the cloud, in a microservice-oriented environment? How would we deal with security groups, service discovery, etc.?

Well, this is my first approach to this topic. It’s a local, dummy, basic setup, but it’s a starting point, at least for me. Let’s see where we go from here.

I think this post has already gotten way bigger than I anticipated or wanted, so I will leave it here. I hope I was able to help someone. Now that you have this basic setup you can tweak it to your liking: experiment with Telegraf plugins, or try another “reporter” like node_exporter instead of Telegraf.
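If you want to try the node_exporter route, a sketch of the swap in the compose file could look something like this (the image and port are node_exporter’s defaults; you would also point the Prometheus job at node-exporter:9100 instead of telegraf:9100):

  # hypothetical replacement for the telegraf service; for real host metrics
  # you would also want to mount /proc and /sys into the container
  node-exporter:
    image: quay.io/prometheus/node-exporter
    ports:
      - 9100:9100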

Don’t forget to clean up those containers, ma’ men.
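Something like this should do it (the -v flag also removes the named volumes, so only use it if you don’t care about keeping the stored metrics and dashboards):

docker-compose -p Telegraf-Prometheus-Grafana down -v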
