# Installing Prometheus + Grafana with Xen-Exporter for XCP-NG monitoring (+ extras)

## Hardware requirements:

- Processor: 1 core (single-core processes)
- RAM: ~512 MB (1 GB recommended for the first install)
- Disk: 10 GB (system + ~5 GB is enough for ~30 days of data at a 60-second scrape interval)

## Software used:

- OS: Ubuntu Server Minimal
- Prometheus -> scrapes VM metrics
- Grafana -> displays Prometheus metrics
- Xen-exporter -> gathers metrics from XCP-NG
- Node-exporter (optional) -> gathers metrics from any PC
- Alertmanager (optional) -> handles alerts triggered by Prometheus rules

## First step:

- Boot your VM with Ubuntu Server Minimal, install the system and reboot.

## Installation:

1. Update the system and install dependencies

`sudo apt update && sudo apt upgrade -y`

`sudo apt install -y wget curl tar git software-properties-common`

2. Install Prometheus

`cd /tmp`

`wget https://github.com/prometheus/prometheus/releases/download/v3.6.0/prometheus-3.6.0.linux-amd64.tar.gz`

`tar xvf prometheus-3.6.0.linux-amd64.tar.gz`

`sudo mv prometheus-3.6.0.linux-amd64 /usr/local/prometheus`

3. Create a Prometheus user

`sudo useradd -rs /bin/false prometheus`

`sudo mkdir /etc/prometheus`

`sudo mkdir /var/lib/prometheus`

`sudo chown prometheus:prometheus /usr/local/prometheus /etc/prometheus /var/lib/prometheus`

4. Create a basic Prometheus config

`sudo nano /etc/prometheus/prometheus.yml`

```
global:
  scrape_interval: 60s
  scrape_timeout: 10s
  evaluation_interval: 60s

# ----- This is the WebUI:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
```

5. Set proper ownership

`sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml`
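Before moving on, the "~5 GB for ~30 days" sizing from the hardware requirements can be sanity-checked with the usual rule of thumb: needed disk ≈ retention time × ingested samples per second × bytes per sample, with roughly 1–2 bytes per compressed sample. A minimal sketch; the target and series counts are made-up example numbers, so plug in your own:

```python
# Rough Prometheus TSDB sizing: retention_seconds * samples_per_second * bytes_per_sample.
# ~1-2 bytes/sample after compression is the usual ballpark; 2 is a safe upper bound.
def estimate_tsdb_bytes(targets, series_per_target, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    samples_per_second = targets * series_per_target / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample

# e.g. 8 exporters exposing ~1500 series each, scraped every 60s, kept for 30 days:
print(estimate_tsdb_bytes(8, 1500, 60, 30) / 2**30)  # ~0.97 GiB, well under the 5 GB cap
```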
6. Create a systemd service

`sudo nano /etc/systemd/system/prometheus.service`

```
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --web.console.templates=/usr/local/prometheus/consoles \
    --web.console.libraries=/usr/local/prometheus/console_libraries \
    --storage.tsdb.retention.time=30d \
    --storage.tsdb.retention.size=5GB
Restart=always

[Install]
WantedBy=multi-user.target
```

7. Start Prometheus

`sudo systemctl daemon-reload`

`sudo systemctl enable --now prometheus`

`sudo systemctl status prometheus`

The Prometheus web interface will be at http://PROMETHEUS-VM-IP:9090

8. Install Grafana (add the signing key first, then the repository)

`wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -`

`sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"`

`sudo apt update && sudo apt install -y grafana`
9. Check the systemd service for Grafana (default from install)

`/etc/systemd/system/grafana-server.service`

```
[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysql.service influxdb.service

[Service]
EnvironmentFile=/etc/default/grafana-server
User=grafana
Group=grafana
Type=simple
Restart=on-failure
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/share/grafana/bin/grafana server \
    --config=${CONF_FILE} \
    --pidfile=${PID_FILE_DIR}/grafana-server.pid \
    --packaging=deb \
    cfg:default.paths.logs=${LOG_DIR} \
    cfg:default.paths.data=${DATA_DIR} \
    cfg:default.paths.plugins=${PLUGINS_DIR} \
    cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}
LimitNOFILE=10000
TimeoutStopSec=20
CapabilityBoundingSet=
DeviceAllow=
LockPersonality=true
MemoryDenyWriteExecute=false
NoNewPrivileges=true
PrivateDevices=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectProc=invisible
ProtectSystem=full
RemoveIPC=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
SystemCallArchitectures=native
UMask=0027

[Install]
WantedBy=multi-user.target
```

10. Enable and start Grafana

`sudo systemctl daemon-reload`

`sudo systemctl enable --now grafana-server`

`sudo systemctl status grafana-server`

The Grafana web interface will be at http://GRAFANA-VM-IP:3000 (u: `admin` p: `admin`)

# Installing Xen-exporter without Docker

1. Clone the repo

`cd ~/`

`git clone https://github.com/MikeDombo/xen-exporter.git && cd xen-exporter`

2. Install the required Python packages

`sudo apt install python3-pip python3-venv -y`
3. Install the requirements in the virtual env

`python3 -m venv venv`

`source venv/bin/activate`

`pip install -r requirements.txt`

You can test it manually before wrapping it in a systemd service. Replace `YOUR-XCP-IP` and `YOUR-XCP-PASSWORD` with your values:

```
XEN_HOST="YOUR-XCP-IP" \
XEN_USER="root" \
XEN_PASSWORD="YOUR-XCP-PASSWORD" \
XEN_SSL_VERIFY="false" \
python3 xen-exporter.py
```

4. Create a systemd service

`sudo nano /etc/systemd/system/xen-exporter.service`

Replace `YOUR-USER`, `YOUR-GROUP` and the credentials with yours. (systemd does not allow comments at the end of a line, so they stay on their own lines.)

```
[Unit]
Description=Xen Exporter for Prometheus
After=network.target

[Service]
Type=simple
WorkingDirectory=/home/YOUR-USER/xen-exporter
ExecStart=/home/YOUR-USER/xen-exporter/venv/bin/python3 /home/YOUR-USER/xen-exporter/xen-exporter.py
# CHANGE XEN_HOST AND XEN_PASSWORD TO YOURS!
Environment="XEN_HOST=YOUR-XCP-IP"
Environment="XEN_USER=root"
Environment="XEN_PASSWORD=YOUR-XCP-PASSWORD"
Environment="XEN_SSL_VERIFY=false"
Restart=always
# CHANGE USER AND GROUP TO YOURS!
User=YOUR-USER
Group=YOUR-GROUP

[Install]
WantedBy=multi-user.target
```

5. Enable and start

`sudo systemctl daemon-reload`

`sudo systemctl enable --now xen-exporter`

6. Add it to Prometheus

`sudo nano /etc/prometheus/prometheus.yml`

Append this job under `scrape_configs:`:

```
  - job_name: 'xenserver'
    static_configs:
      - targets: ['dashboard-vm-ip:9100']
```

7. Restart Prometheus

`sudo systemctl restart prometheus`

## Configure Grafana to see Xen-exporter

1. Open Grafana: http://GRAFANA-VM-IP:3000
2. Login: admin/admin -> change password
3. Add data source:

```
Type: Prometheus
URL: http://PROMETHEUS_VM_IP:9090
```

4. Add the Dashboards

- Click `+` -> `Import`
- Enter dashboard **ID 16588** (Xen Prometheus)
- Select **Prometheus** data source -> `Import`

# Installing Node-exporter (optional)

### **IMPORTANT** Do these steps **on each VM** you want data exported from.

*Note: If you also want metrics from the VM running Prometheus itself, use a different port (e.g. 9111) when doing these steps on that specific VM so it doesn't clash with xen-exporter's port 9100. Ignore this note if you won't use xen-exporter at all.*
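A quick way to check for the port clash described in the note before picking an exporter port — a minimal Python sketch (the port numbers are just the examples from this guide):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on host:port (TCP)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

# e.g. on the Prometheus VM: 9100 may be taken by xen-exporter, so fall back to 9111
for candidate in (9100, 9111):
    if port_is_free(candidate):
        print(f"port {candidate} looks free")
        break
```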
1. Create a Node Exporter user on each VM

`sudo useradd -rs /bin/false node_exporter`

2. Download Node Exporter on each VM

`cd /tmp`

`wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz`

`tar xvf node_exporter-1.9.1.linux-amd64.tar.gz`

`sudo mv node_exporter-1.9.1.linux-amd64 /usr/local/node_exporter`

`sudo chown -R node_exporter:node_exporter /usr/local/node_exporter`

3. Create a systemd service for Node Exporter on each VM

`sudo nano /etc/systemd/system/node_exporter.service`

```
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
```

4. Enable and start the service on each VM

`sudo systemctl daemon-reload`

`sudo systemctl enable --now node_exporter`

`sudo systemctl status node_exporter`

Test with `curl http://localhost:9100/metrics` on each VM.

### *Back to the Prometheus VM!*

5. Configure Prometheus to scrape all VMs

`sudo nano /etc/prometheus/prometheus.yml`

Replace `VM_IP#` with each VM's IP or DDNS name:

```
global:
  scrape_interval: 60s
  scrape_timeout: 10s
  evaluation_interval: 60s

# ------- WebUI:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

# ------- XCP-NG: (if using xen-exporter)
  - job_name: 'xenserver'
    static_configs:
      - targets: ['dashboard-vm-ip:9100']

# ------- VMS:
  - job_name: 'node_exporters'
    static_configs:
      - targets:
          - 'VM1_IP:9100'
          - 'VM2_IP:9100'
          - 'localhost:9111' # e.g. for node-exporter on the Prometheus VM
```

6. Reload Prometheus

`sudo systemctl restart prometheus`

You can now verify the targets at: http://PROMETHEUS-VM-IP:9090/targets

## Configure Grafana to see Node-exporter

1. Open Grafana: http://GRAFANA-VM-IP:3000
2. Login: admin/admin -> change password
3. Add data source:

```
Type: Prometheus
URL: http://PROMETHEUS_VM_IP:9090
```
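If Grafana shows no data after adding the data source, it helps to ask Prometheus directly which scrape targets are down, instead of eyeballing the /targets page. A hedged sketch using only the standard library — `PROMETHEUS-VM-IP` is a placeholder, and the parsing follows the documented `/api/v1/query` vector response format:

```python
import json
import urllib.request

def down_targets(api_response):
    """Extract instance labels from a decoded /api/v1/query response for `up == 0`."""
    return [r["metric"].get("instance", "?") for r in api_response["data"]["result"]]

def fetch_down_targets(base_url="http://PROMETHEUS-VM-IP:9090"):
    url = base_url + "/api/v1/query?query=up%3D%3D0"  # `up == 0`, URL-encoded
    with urllib.request.urlopen(url) as resp:
        return down_targets(json.load(resp))

# Trimmed example of the vector result format the API returns:
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {"__name__": "up",
                                          "instance": "VM2_IP:9100",
                                          "job": "node_exporters"},
                               "value": [1700000000, "0"]}]}}
print(down_targets(sample))  # -> ['VM2_IP:9100']
```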
4. Add the Dashboards

- Click `+` -> `Import`
- Enter dashboard **ID 1860** (Node Exporter Full)
- Select **Prometheus** data source -> `Import`

# Configuring Alerts with Alertmanager (WhatsApp Webhook)

1. Download and install Alertmanager

`cd /tmp`

`wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz`

`tar xvf alertmanager-0.28.1.linux-amd64.tar.gz`

`sudo mkdir -p /usr/local/alertmanager`

`sudo mv alertmanager-0.28.1.linux-amd64/{alertmanager,amtool} /usr/local/alertmanager/`

`sudo mkdir -p /etc/alertmanager`

`sudo cp alertmanager-0.28.1.linux-amd64/alertmanager.yml /etc/alertmanager/alertmanager.yml`

2. Configure alertmanager.yml

`sudo nano /etc/alertmanager/alertmanager.yml`

**This config is based on a webhook.** There are other ways Prometheus can alert (email, etc.) not covered here.

```
route:
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:9095/'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

3. Create a systemd service for Alertmanager

`sudo nano /etc/systemd/system/alertmanager.service`

Replace `YOUR-USER` and `YOUR-GROUP`. (systemd does not allow comments at the end of a line, so they stay on their own lines.)

```
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
# CHANGE USER AND GROUP TO YOURS!
User=YOUR-USER
Group=YOUR-GROUP
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
```

4. Fix permissions (give ownership to the same user you set in the unit above, so Alertmanager can write its data)

`sudo mkdir -p /var/lib/alertmanager`

`sudo chown -R YOUR-USER:YOUR-GROUP /var/lib/alertmanager /etc/alertmanager`
5. Start the Alertmanager service

`sudo systemctl daemon-reload`

`sudo systemctl enable --now alertmanager`

`sudo systemctl status alertmanager`

Alertmanager should now be listening on port 9093. You can verify with: `ss -tlnp | grep 9093`

6. CallMeBot webhook bridge

This is a tiny Python HTTP server that listens on 127.0.0.1:9095 and invokes the `callmebot "message"` command for each alert received.

`sudo apt install python3-flask -y`

`sudo nano /usr/local/bin/whatsapp-webhook.py`

```
#!/usr/bin/env python3
from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route('/', methods=['POST'])
def webhook():
    data = request.json
    if not data:
        return "no data", 400
    alerts = data.get("alerts", [])
    for alert in alerts:
        summary = alert.get("annotations", {}).get("summary", "No summary")
        subprocess.Popen(["/usr/local/bin/callmebot", summary])
    return "ok", 200

if __name__ == '__main__':
    # Bind to localhost only: Alertmanager reaches it via 127.0.0.1
    app.run(host="127.0.0.1", port=9095)
```

Make it executable:

`sudo chmod +x /usr/local/bin/whatsapp-webhook.py`

7. Create a systemd service for the webhook

`sudo nano /etc/systemd/system/whatsapp-webhook.service`

Replace `YOUR-USER`:

```
[Unit]
Description=WhatsApp Webhook for Alertmanager
After=network.target

[Service]
# CHANGE USER TO YOURS!
User=YOUR-USER
ExecStart=/usr/bin/python3 /usr/local/bin/whatsapp-webhook.py
Restart=always

[Install]
WantedBy=multi-user.target
```

8. Enable and start the service

`sudo systemctl daemon-reload`

`sudo systemctl enable --now whatsapp-webhook`

`sudo systemctl status whatsapp-webhook`

Verify it listens only on localhost: `ss -tlnp | grep 9095` (should show 127.0.0.1:9095)

9. Wire Prometheus -> Alertmanager

`sudo nano /etc/prometheus/prometheus.yml`

```
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
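The bridge above only reads the `summary` annotation from each alert. For reference, Alertmanager POSTs JSON in its version-4 webhook format; here is a minimal sketch of extracting the summaries the same way the Flask handler does (the alert contents are example data):

```python
def extract_summaries(payload):
    """Pull the summary annotation out of each alert, like the Flask bridge does."""
    return [a.get("annotations", {}).get("summary", "No summary")
            for a in payload.get("alerts", [])]

# Trimmed example of Alertmanager's v4 webhook payload:
payload = {
    "version": "4",
    "status": "firing",
    "receiver": "web.hook",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "NodeDown", "severity": "critical"},
         "annotations": {"summary": "Instance pihole down"}},
    ],
}
print(extract_summaries(payload))  # -> ['Instance pihole down']
```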
10. Restart Prometheus

`sudo systemctl restart prometheus.service`

## Creating Alert rules

`sudo nano /etc/prometheus/alert.rules.yml`

```
groups:
  - name: basic-alerts
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been unreachable for 1 minute."
```

1. Tell Prometheus about the new rules

`sudo nano /etc/prometheus/prometheus.yml`

```
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/alert.rules.yml"
```

2. Restart Prometheus

`sudo systemctl restart prometheus.service`

# Final working config files from **my** setup

### These are the files on **my VM**, with the *user/group* `dash`

Use them for reference in case of errors, but remember to change users/groups and credentials.

- `/etc/prometheus/prometheus.yml`

```
global:
  scrape_interval: 60s
  scrape_timeout: 10s
  evaluation_interval: 60s

# ------ Alertmanager:
# Config file -> /etc/alertmanager/alertmanager.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"

rule_files:
  - "/etc/prometheus/alert.rules.yml"

# ------- WebUI:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

# ------- VMS:
  - job_name: 'node_exporters'
    static_configs:
      - targets: ['192.168.15.221:9100']
        labels:
          hostname: 'xo'
      - targets: ['192.168.15.222:9100']
        labels:
          hostname: 'pihole'
      - targets: ['192.168.15.223:9100']
        labels:
          hostname: 'copyparty'
      - targets: ['192.168.15.224:9100']
        labels:
          hostname: 'media'
      - targets: ['192.168.15.225:9100']
        labels:
          hostname: 'vaultwarden'
      - targets: ['192.168.15.226:9100']
        labels:
          hostname: 'minecraft'

# -------- XCP-NG Host: (xen-exporter on localhost)
  - job_name: 'xen_exporter'
    static_configs:
      - targets: ['192.168.15.227:9100']
        labels:
          hostname: 'XCP-NG'

# -------- Oracle VPS tunnel: (forward SSH tunnel on localhost)
  - job_name: 'vps_node_exporter'
    static_configs:
      - targets:
          - 'localhost:9101'
        labels:
          hostname: 'Oracle VPS'

# -------- Prometheus VM itself:
  - job_name: 'prometheus_vm'
    static_configs:
      - targets: ['localhost:9102']
        labels:
          hostname: 'Dashboards'
```

- `/etc/alertmanager/alertmanager.yml`

```
route:
  group_by: ['alertname']
  group_wait: 15s
  group_interval: 1m
  repeat_interval: 5m
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:9095/'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

- `/etc/prometheus/alert.rules.yml`

```
groups:
  - name: critical-alerts
    rules:
      # Node unreachable
      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.hostname }} down"
          description: "{{ $labels.hostname }} has been unreachable for 1 minute."

      # CPU over 90% for 10 minutes
      - alert: CPUPeak
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "CPU from instance {{ $labels.hostname }} peaking"
          description: "{{ $labels.hostname }} has been over 90% CPU for 10 minutes."

      # Disk low space (less than 10% free)
      - alert: DiskLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        # for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.hostname }}"
          description: "Less than 10% disk space available on {{ $labels.hostname }}."

      # RAM usage over 95%
      - alert: RAMHigh
        expr: (1 - ((node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes)) * 100 > 95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "RAM usage high on {{ $labels.hostname }}"
          description: "Memory usage is over 95% on {{ $labels.hostname }}."
  - name: warning-alerts
    rules:
      # CPU over 70% for 5 minutes
      - alert: CPUWarning
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage high on {{ $labels.hostname }}"
          description: "CPU usage is above 70% on {{ $labels.hostname }}."

      # Disk less than 20% free
      - alert: DiskWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        # for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage warning on {{ $labels.hostname }}"
          description: "Disk usage is above 80% on {{ $labels.hostname }}."

      # RAM over 85%
      - alert: RAMWarning
        expr: (1 - ((node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RAM usage warning on {{ $labels.hostname }}"
          description: "Memory usage is over 85% on {{ $labels.hostname }}."

      # Network: high incoming traffic (>80% of interface capacity)
      - alert: NetworkInHigh
        expr: rate(node_network_receive_bytes_total[5m]) * 8 > (0.8 * 1e9) # adjust: 1e9 = 1 Gbps link
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network IN traffic on {{ $labels.hostname }}"
          description: "Incoming network traffic exceeded 80% of interface capacity."

      # Network: high outgoing traffic (>80% of interface capacity)
      - alert: NetworkOutHigh
        expr: rate(node_network_transmit_bytes_total[5m]) * 8 > (0.8 * 1e9) # adjust: 1e9 = 1 Gbps link
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network OUT traffic on {{ $labels.hostname }}"
          description: "Outgoing network traffic exceeded 80% of interface capacity."

      # Load average over 5 (1-min load)
      - alert: LoadHigh
        expr: node_load1 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High load on {{ $labels.hostname }}"
          description: "1-minute load average is over 5."
      # Load average over 3 (5-min load)
      - alert: LoadHigh5
        expr: node_load5 > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High load on {{ $labels.hostname }}"
          description: "5-minute load average is over 3."

      # Load average over 2 (15-min load)
      - alert: LoadHigh15
        expr: node_load15 > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High load on {{ $labels.hostname }}"
          description: "15-minute load average is over 2."
```

- `/etc/systemd/system/alertmanager.service`

```
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=dash
Group=dash
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --storage.path=/var/lib/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
```

- `/etc/systemd/system/prometheus.service`

```
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --web.console.templates=/usr/local/prometheus/consoles \
    --web.console.libraries=/usr/local/prometheus/console_libraries \
    --storage.tsdb.retention.time=30d \
    --storage.tsdb.retention.size=5GB
Restart=always

[Install]
WantedBy=multi-user.target
```

- `/etc/systemd/system/whatsapp-webhook.service`

```
[Unit]
Description=WhatsApp Webhook for Alertmanager
After=network.target

[Service]
User=dash
ExecStart=/usr/bin/python3 /usr/local/bin/whatsapp-webhook.py
Restart=always

[Install]
WantedBy=multi-user.target
```

- `/etc/systemd/system/xen-exporter.service` (Change to your credentials!)
```
[Unit]
Description=Xen Exporter for Prometheus
After=network.target

[Service]
Type=simple
WorkingDirectory=/home/dash/xen-exporter
ExecStart=/home/dash/xen-exporter/venv/bin/python3 /home/dash/xen-exporter/xen-exporter.py
# CHANGE XEN_HOST AND XEN_PASSWORD TO YOURS!
Environment="XEN_HOST=myXCPngIP"
Environment="XEN_USER=root"
Environment="XEN_PASSWORD=myAwesomePassword"
Environment="XEN_SSL_VERIFY=false"
Restart=always
User=dash
Group=dash

[Install]
WantedBy=multi-user.target
```

- `/usr/local/bin/whatsapp-webhook.py`

```
#!/usr/bin/env python3
from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route('/', methods=['POST'])
def webhook():
    data = request.json
    if not data:
        return "no data", 400
    alerts = data.get("alerts", [])
    for alert in alerts:
        summary = alert.get("annotations", {}).get("summary", "No summary")
        subprocess.Popen(["/usr/local/bin/callmebot", summary])
    return "ok", 200

if __name__ == '__main__':
    app.run(host="127.0.0.1", port=9095)
```

- `/usr/local/bin/callmebot` (Change to your credentials!)

```
#!/bin/bash
#
# callmebot - Send WhatsApp messages via CallMeBot from the terminal
#
set +H

# Your phone number and API key
PHONE="5521YOURPHONENUMBER"   # CHANGE TO YOURS!
APIKEY="myAPIkey"             # CHANGE TO YOURS!

# Check if a message was passed
if [ $# -eq 0 ]; then
    echo "Usage: callmebot \"Your message here\""
    exit 1
fi

# Join all arguments into a single string
MESSAGE="$*"

# URL-encode only reserved characters, leave UTF-8 (like emojis) alone
rawurlencode() {
    local string="$1"
    local encoded=""
    local i c
    for (( i=0; i<${#string}; i++ )); do
        c="${string:$i:1}"
        case "$c" in
            [a-zA-Z0-9.~_-]) encoded+="$c" ;;
            ' ') encoded+='+' ;;
            *)
                # Encode only ASCII < 128
                if [[ "$c" =~ [[:cntrl:]] || $(LC_CTYPE=C printf '%d' "'$c") -lt 128 ]]; then
                    printf -v o '%%%02X' "'$c"
                    encoded+="$o"
                else
                    encoded+="$c"
                fi
                ;;
        esac
    done
    echo "$encoded"
}

# Encode the message safely
ENCODED_MESSAGE=$(rawurlencode "$MESSAGE")

# Send the message
curl -s "https://api.callmebot.com/whatsapp.php?phone=${PHONE}&text=${ENCODED_MESSAGE}&apikey=${APIKEY}" \
    > /dev/null

# Optional confirmation
echo "Message sent: $MESSAGE"
```

## Setup CallMeBot

1. Add the phone number **+34 684 783 708** to your phone contacts. (Name it as you wish.)
2. Send the message **"I allow callmebot to send me messages"** to the new contact (using WhatsApp, of course).
3. Wait until you receive the message "API Activated for your phone number. Your APIKEY is 123123" from the bot. *Note: If you don't receive the API key within 2 minutes, try again after 24 hours.*
4. The WhatsApp message from the bot contains the API key needed to send messages through the API.

# Tips and Tricks

### Remove all data to start from scratch:

`sudo systemctl stop prometheus`

`sudo rm -rf /var/lib/prometheus/*`

`sudo systemctl restart prometheus`

- This removes all gathered data, so Grafana will have nothing to display. It is useful after setting everything up, since renaming hosts or changing config files may duplicate data on dashboards.
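### Cross-check the callmebot URL encoding:

The `rawurlencode` helper in the callmebot script above does form-style encoding: spaces become `+`, reserved ASCII becomes `%XX`, and multi-byte UTF-8 passes through. Python's standard library handles the space/ASCII part the same way, so you can use it to spot-check how a message will look on the wire before blaming the API — a minimal sketch:

```python
from urllib.parse import quote_plus

# quote_plus: spaces -> '+', reserved ASCII -> %XX, like the bash helper
# (the bash version additionally leaves non-ASCII UTF-8 bytes unencoded).
msg = "Instance pihole down & unreachable"
print(quote_plus(msg))  # -> Instance+pihole+down+%26+unreachable
```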
### Change data retention period/size:

- Edit `/etc/systemd/system/prometheus.service` (the copy under `multi-user.target.wants/` is just a symlink to it).
- Change these lines to whatever you wish:

`--storage.tsdb.retention.time=30d \` keeps at most 30 days of data.

`--storage.tsdb.retention.size=5GB` keeps at most 5 GB of data.

- Restart the service: `sudo systemctl daemon-reload && sudo systemctl restart prometheus`

### Change `Nodename` in Grafana (defaults to the VM's /etc/hostname):

Enter the dashboard at http://GRAFANA-VM-IP:3000 and click `edit` -> `settings` -> `variables`. Click the `Nodename` variable and it can be changed to *uuid*, *hostname*, etc.

### Force hostnames on Prometheus metrics:

You can attach specific hostnames to machines in prometheus.yml

From:

```
# ------- VMS:
  - job_name: 'node_exporters'
    static_configs:
      - targets:
          - '192.168.15.221:9100'
          - '192.168.15.222:9100'
```

To:

```
# ------- VMS:
  - job_name: 'node_exporters'
    static_configs:
      - targets: ['192.168.15.221:9100']
        labels:
          hostname: 'xo'
      - targets: ['192.168.15.222:9100']
        labels:
          hostname: 'pihole'
```

### Alerts in general:

The `alert.rules.yml` and `alertmanager.yml` files are part of Prometheus's standard alerting system and have no direct relation to **my** WhatsApp webhook. **If you want to use a different alert type, keep these** and delete (or don't install) the whatsapp* related files/services. *So keep the alert services and configs.*

### Prometheus/Grafana/Alerts synchronization:

With this setup, scrapes happen every minute, so Grafana dashboards also refresh every minute and alerts wait at least 1 minute before firing. If you change the scrape interval in prometheus.yml, keep in mind you may need to adjust the matching timers everywhere (Grafana refresh rate, alert `for:` durations, Alertmanager wait intervals) so no work is wasted.
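As a back-of-envelope example of how those timers stack up, here is the worst-case delay from a VM dying to the first WhatsApp message with the values used in this guide (a rough sketch; real timing also depends on how the intervals happen to align):

```python
# Values from prometheus.yml, alert.rules.yml and alertmanager.yml above (in seconds)
scrape_interval = 60  # how often `up` is sampled
for_duration    = 60  # NodeDown must stay true this long (for: 1m)
eval_interval   = 60  # how often alert rules are re-evaluated
group_wait      = 15  # Alertmanager waits this long before the first notification

# The failure can land just after a scrape, the `for:` window must elapse, the
# next rule evaluation must notice it, and Alertmanager adds group_wait on top.
worst_case = scrape_interval + for_duration + eval_interval + group_wait
print(worst_case)  # -> 195 seconds, i.e. a bit over 3 minutes
```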

Also keep in mind that reducing the scrape interval GREATLY increases disk usage, so plan accordingly.

### Grafana's own alerts *vs* Prometheus Alertmanager:

Grafana has its own alert system. It can be useful in very specific cases, I guess, but I'm not using it. You can play with Grafana's alerts within the UI itself via the `...` menu on each dashboard panel.

Prometheus alerts, on the other hand, fire for all VMs (unless excluded), so they don't depend on Grafana or on a particular dashboard — usually the more "correct" and robust way to do this.