Establishing a telegraf metric system

Telegraf is an agent for collecting, processing, aggregating, and writing metrics. With over 200 plugins, it provides all the metrics you need to track and understand your systems and applications effectively. This blog will demonstrate how to set up a telegraf metric stack.

To persist the data, we will store the telemetry data using InfluxDB. This efficient time-series database is well suited to store the data for subsequent analysis (e.g. visualization via Grafana).

Deploy A InfluxDB Service

There are two ways to deploy an InfluxDB database:

Docker Compose
Kubernetes

services:
  influxdb:
    image: influxdb:2.2 # image name
    container_name: influxdb # container name
    ports:
      - "8086:8086"
    volumes:
      - influxdb_conf:/etc/influxdb # conf volume
      - influxdb_data:/var/lib/influxdb # data volume
volumes:
  influxdb_data:
  influxdb_conf:

Below is the reference configuration file for deployment using K8S. Please note that NFS data volumes are used here and the service is exposed through Nginx Ingress, so the actual deployment needs to be modified according to the specific cluster configuration (the places that need to be modified have been labeled). Deployment type resources are used here, actually using StatefulSet would be better.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    workload.user.cattle.io/workloadselector: apps.deployment-<namespace>-influxdb # TO BE MODIFED
  name: influxdb
  namespace: <namespace> # TO BE MODIFED
spec:
  replicas: 1
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: apps.deployment-<namespace>-influxdb # TO BE MODIFED
  template:
    metadata:
      creationTimestamp: null
      labels:
        workload.user.cattle.io/workloadselector: apps.deployment-<namespace>-influxdb # TO BE MODIFED
    spec:
      containers:
        - image: influxdb:2.2
          imagePullPolicy: IfNotPresent
          name: container-0
          ports:
            - containerPort: 8086
              name: 8086tcp
              protocol: TCP
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: 50m
              memory: 128Mi
          volumeMounts:
            - mountPath: /var/lib/influxdb2
              name: <volume_name> # TO BE MODIFED
              subPath: db-data
            - mountPath: /etc/influxdb
              name: <volume_name> # TO BE MODIFED
              subPath: db-config
      restartPolicy: Always
      volumes:
        - name: <volume_name> # TO BE MODIFED
          nfs:
            path: /<volume_path> # TO BE MODIFED
            server: 0.0.0.0 # TO BE MODIFED
---
apiVersion: v1
kind: Service
metadata:
  name: influxdb
  namespace: <namespace> # TO BE MODIFED
spec:
  ports:
    - name: 8086tcp
      port: 8086
      protocol: TCP
      targetPort: 8086
  selector:
    workload.user.cattle.io/workloadselector: apps.deployment-<namespace>-influxdb # TO BE MODIFED
  type: ClusterIP
---
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx # TO BE MODIFED
  name: influxdb-ingress
  namespace: <namespace> # TO BE MODIFED
spec:
  rules:
    - host: influxdb.example.com # TO BE MODIFED
      http:
        paths:
          - backend:
              service:
                name: influxdb
                port:
                  number: 8086
            path: /
            pathType: Prefix
  tls:
    - hosts:
        - influxdb.example.top # TO BE MODIFED
      secretName: <tls_secret> # TO BE MODIFED

Deploy A Telegraf Collector

These are steps to deploy telegraf collector

Install telegraf package according to this guide.
Generate access token from influxdb’s webui.
Configure telegraf’s configuration /etc/telegraf.conf according to the access token.
Test the configuration.

You can use this command to generate the default configuration as a reference:

telegraf config > telegraf.conf

Here is a selection of useful telegraf configurations:

[global_tags]
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false
[[outputs.influxdb_v2]]
    urls = ["https://influxdb.example.com"]
    token = "xxx"
    organization = "<organization_name>"
    bucket = "telegraf"
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "loop"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.bond]]
[[inputs.net]]
[[inputs.nvidia_smi]]
    timeout = "25s"
[[inputs.smart]]
    interval = "600s"
    attributes = true
[[inputs.temp]]

[agent]

This configures how agent behaves, hostname defaults to system’s hostname.

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false

[outputs.influxdb_vs]

This section instructs telegraf to send metric to InfluxDB database, url, token, organization and bucket need to be properly set.

[[outputs.influxdb_v2]]
    urls = ["https://influxdb.example.com"]
    token = "xxx"
    organization = "<organization_name>"
    bucket = "telegraf"

[[inputs.cpu]]

This enables the collection of CPU usage.

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]

This enables the collection of disk and filesystem, except some types.

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "loop"]

[[inputs.diskio]]

This enables the collection of disk io.

[[inputs.diskio]]

[[inputs.kernel]]

This enables the collection of kernel info.

[[inputs.kernel]]

[[inputs.mem]]

This enables the collection of memory usage.

[[inputs.mem]]

[[inputs.processes]]

This enables the collection of processes.

[[inputs.processes]]

[[inputs.swap]]

This enables the collection of swap usage.

[[inputs.swap]]

[[inputs.system]]

This enables the collection of system.

[[inputs.system]]

[[inputs.bond]]

This enables the collection of traffic.

[[inputs.bond]]

[[inputs.net]]

This enables the collection of network.

[[inputs.net]]

[[inputs.nvidia_smi]]

This enables the collection of NVIDIA GPUs.

[[inputs.nvidia_smi]]
    timeout = "25s"

[[inputs.smart]]

This enables the collection of SMART information.

[[inputs.smart]]
    interval = "600s"
    attributes = true

[[inputs.temp]]

This enables the collection of thermo.

[[inputs.temp]]

After you have generated the configuration, run telegraf command to test it. If everything works fine, enable the telegraf system service and let it run in the background.