If you've been running Netdata or basic monitoring scripts on your Ubuntu VPS, you've probably hit a wall. Maybe you need metrics from multiple services correlated on a single dashboard. Maybe you need alerting that goes beyond "disk is full." Maybe you need to retain weeks of historical data to spot trends. That's when it's time to graduate to Prometheus and Grafana — the industry-standard open-source monitoring stack used by companies from startups to Fortune 500s.
This guide walks you through deploying a production-ready monitoring stack on an Ubuntu VPS using Docker Compose. By the end, you'll have Prometheus scraping metrics from your system, your web server, your database, and your containers — all visualized in Grafana dashboards with alerting configured.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
Deploy a self-managed VPS — from $1.99/mo
Need dedicated resources? — from $19.80/mo
Want fully managed hosting? — we handle everything
When to Move Beyond Basic Monitoring
Tools like Netdata (covered in our VPS monitoring setup guide) are excellent for real-time system observation. But as your infrastructure grows, you'll need capabilities that basic tools can't provide:
- Multi-service correlation — overlay Nginx request rates with database query times and CPU usage on a single dashboard
- Custom metrics — instrument your own applications and expose business-level metrics
- Powerful query language — PromQL lets you compute rates, aggregates, percentiles, and predictions
- Long-term retention — store weeks or months of historical data for capacity planning
- Flexible alerting — alert on complex conditions like "error rate above 5% for 10 minutes" rather than simple thresholds
- Dashboard sharing — Grafana dashboards can be shared with your team, embedded, or exported
Architecture Overview
The Prometheus + Grafana stack follows a pull-based architecture:
┌───────────────┐  scrapes
│ node_exporter ├───────────────►┌──────────────────┐
└───────────────┘                │                  │
┌───────────────┐  scrapes       │    Prometheus    │  queries  ┌─────────┐
│nginx_exporter ├───────────────►│ (time-series DB) ├─────────►│ Grafana │
└───────────────┘                │                  │          │  (UI)   │
┌───────────────┐  scrapes       │                  │          └─────────┘
│mysqld_exporter├───────────────►│                  │
└───────────────┘                └──────────────────┘
┌───────────────┐  scrapes               ▲
│   cAdvisor    ├────────────────────────┘
└───────────────┘
Prometheus periodically scrapes HTTP endpoints (called "targets") exposed by exporters. Each exporter translates service-specific metrics into Prometheus's text format. Prometheus stores these metrics as time-series data. Grafana connects to Prometheus as a data source and lets you build dashboards and alerts using PromQL queries.
Prerequisites
You'll need an Ubuntu VPS with at least 2 vCPU and 4 GB RAM. The full stack (Prometheus, Grafana, four exporters) uses approximately 500–800 MB of RAM at idle and scales with the number of metrics and retention period.
Docker and Docker Compose must be installed. If you haven't set those up yet, follow our Docker installation guide first.
Verify your Docker installation:
docker --version
# Docker version 27.x.x
docker compose version
# Docker Compose version v2.x.x
The full stack runs on a Cloud VPS with 2 vCPU / 4 GB RAM. Prometheus stores time-series data — scale storage independently as your retention grows.
Project Structure
Create the directory structure for your monitoring stack:
mkdir -p ~/monitoring/{prometheus,grafana/{provisioning/datasources,provisioning/dashboards}}
cd ~/monitoring
The final structure will look like this:
~/monitoring/
├── docker-compose.yml
├── prometheus/
│   └── prometheus.yml
└── grafana/
    └── provisioning/
        ├── datasources/
        │   └── prometheus.yml
        └── dashboards/
            └── dashboards.yml
Docker Compose: The Full Stack
Create the docker-compose.yml file that defines all services:
cat > ~/monitoring/docker-compose.yml << 'EOF'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme_strong_password_here
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://monitoring.yourdomain.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_SMTP_USER=alerts@yourdomain.com
      - GF_SMTP_PASSWORD=your_app_password
      - GF_SMTP_FROM_ADDRESS=alerts@yourdomain.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3000:3000"
    depends_on:
      - prometheus
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node_exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.netclass.ignored-devices=^(veth.*|br-.*|docker.*)$$'
    volumes:
      - /:/host:ro,rslave
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    pid: host
    networks:
      - monitoring

  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:1.1
    container_name: nginx_exporter
    restart: unless-stopped
    command:
      - '--nginx.scrape-uri=http://host.docker.internal:8080/nginx_status'
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  mysqld_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysqld_exporter
    restart: unless-stopped
    # v0.15+ dropped DATA_SOURCE_NAME in favor of flags plus a password env var
    command:
      - '--mysqld.address=host.docker.internal:3306'
      - '--mysqld.username=exporter'
    environment:
      - MYSQLD_EXPORTER_PASSWORD=exporter_password
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
EOF
Key decisions in this configuration:
- Ports bound to 127.0.0.1 — Prometheus (9090) and Grafana (3000) are only accessible locally. You'll use an Nginx reverse proxy or SSH tunnel for external access.
- Retention settings — Prometheus keeps data for 30 days or up to 10 GB, whichever limit is hit first.
- `--web.enable-lifecycle` — allows reloading Prometheus configuration via HTTP POST without restarting the container.
- Named volumes — `prometheus_data` and `grafana_data` persist data across container restarts.
Prometheus Configuration
Create the Prometheus configuration file that defines what to scrape and how often:
cat > ~/monitoring/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

# Uncomment once you create alert_rules.yml and mount it into the container —
# Prometheus refuses to start if a listed rule file is missing.
# rule_files:
#   - /etc/prometheus/alert_rules.yml

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus'

  # System metrics from node_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: 'vps-01'

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx_exporter:9113']
        labels:
          instance: 'nginx-main'

  # MySQL/MariaDB metrics
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld_exporter:9104']
        labels:
          instance: 'mysql-main'

  # Docker container metrics from cAdvisor
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: 'docker-host'
EOF
The scrape_interval of 15 seconds is the standard for most deployments. Lower intervals (5s) increase storage requirements proportionally. Higher intervals (30s-60s) may miss short spikes.
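The storage math behind that trade-off is simple: each series produces 86400 ÷ interval samples per day. A quick sketch:

```shell
# Samples per day per series at common scrape intervals
# (plain arithmetic; nothing needs to be running)
for interval in 5 15 30; do
  samples=$(( 86400 / interval ))
  echo "interval=${interval}s -> ${samples} samples/day per series"
done
# interval=5s -> 17280 samples/day per series
# interval=15s -> 5760 samples/day per series
# interval=30s -> 2880 samples/day per series
```

Halving the interval doubles both ingestion and storage, which is why 15s is the usual sweet spot.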
Understanding Scrape Configuration
Each job_name groups related targets. The static_configs block lists the endpoints to scrape. Since all our exporters run in the same Docker network, we use container names as hostnames. The labels section adds custom metadata that you can use in PromQL queries and Grafana dashboards to filter and group metrics.
Node Exporter: System Metrics
The node_exporter service defined in our Docker Compose file exposes hundreds of system-level metrics. Here are the key metric families:
| Metric Prefix | What It Measures | Example Metrics |
|---|---|---|
| `node_cpu_seconds_total` | CPU time per core and mode | Time spent in user, system, idle, iowait |
| `node_memory_*` | Memory allocation | MemTotal, MemAvailable, Buffers, Cached |
| `node_filesystem_*` | Disk space and inodes | Size, free, available per mount point |
| `node_disk_*` | Disk I/O operations | Reads/writes completed, bytes, time |
| `node_network_*` | Network interface stats | Bytes received/transmitted, errors, drops |
| `node_load1/5/15` | System load averages | 1, 5, and 15-minute load averages |
Once the stack is running, you can verify node_exporter metrics directly:
curl -s http://localhost:9100/metrics | head -20
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 283974.22
node_cpu_seconds_total{cpu="0",mode="iowait"} 142.87
node_cpu_seconds_total{cpu="0",mode="system"} 4821.63
node_cpu_seconds_total{cpu="0",mode="user"} 12487.91
...
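Each of those lines is just `metric_name{labels} value`, so the format is trivial to post-process with standard tools. As a sketch (the sample values are made up), summing idle CPU seconds across cores with awk:

```shell
# Sum the values of all samples whose labels match mode="idle".
# The heredoc stands in for real /metrics output.
cat << 'METRICS' | awk '/mode="idle"/ { sum += $2 } END { print sum }'
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="1",mode="idle"} 2000.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.0
METRICS
# -> 3001
```

This is essentially what PromQL's `sum()` aggregation does, applied across label sets.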
Adding Nginx Exporter
The nginx_exporter reads from Nginx's stub_status module. First, enable stub_status on your Nginx server. Add a location block to your Nginx configuration:
sudo nano /etc/nginx/conf.d/status.conf
Add the following content:
server {
    listen 8080;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # Docker network range
        deny all;
    }
}
Test and reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
# Verify stub_status works
curl http://localhost:8080/nginx_status
# Expected output:
# Active connections: 3
# server accepts handled requests
# 1542 1542 8923
# Reading: 0 Writing: 1 Waiting: 2
The Nginx exporter translates this into Prometheus metrics like nginx_connections_active, nginx_http_requests_total, and nginx_connections_reading/writing/waiting.
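The translation the exporter performs is mechanical. As a toy sketch of the same idea (the metric names match the exporter's, but the awk here is only an illustration):

```shell
# Convert stub_status text into Prometheus-style metric lines.
# The heredoc mirrors the sample output shown above.
cat << 'STATUS' | awk '
  /Active connections/ { print "nginx_connections_active " $3 }
  /^Reading/           { print "nginx_connections_reading " $2 }'
Active connections: 3
server accepts handled requests
 1542 1542 8923
Reading: 0 Writing: 1 Waiting: 2
STATUS
# -> nginx_connections_active 3
# -> nginx_connections_reading 0
```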
Adding MySQL/MariaDB Exporter
The mysqld_exporter needs a dedicated MySQL user with limited privileges. If you're running MySQL or MariaDB on the host (see our MySQL/MariaDB tuning guide for optimization), create the exporter user:
sudo mysql -u root << 'EOF'
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
EOF
Security note: Use a strong password in production. The `exporter` user only needs read-only access. The `PROCESS` privilege allows reading query statistics, and `REPLICATION CLIENT` allows reading replication status.
Update the exporter password in your docker-compose.yml to match the one you created. Key metrics exposed by the MySQL exporter include:
- `mysql_global_status_queries` — total query count (use `rate()` for queries per second)
- `mysql_global_status_threads_connected` — current connection count
- `mysql_global_status_slow_queries` — slow query count
- `mysql_global_status_innodb_buffer_pool_reads` — buffer pool cache misses
- `mysql_global_status_connections` — total connection attempts
Docker Metrics via cAdvisor
cAdvisor (Container Advisor) exposes per-container resource usage metrics. This is essential if you run your applications in Docker containers. The key metrics include:
- `container_cpu_usage_seconds_total` — CPU usage per container
- `container_memory_usage_bytes` — memory usage per container
- `container_network_receive_bytes_total` — network I/O per container
- `container_fs_usage_bytes` — filesystem usage per container
cAdvisor automatically discovers all running containers and labels metrics with the container name, image, and ID — no additional configuration needed.
Grafana Data Source Provisioning
Instead of configuring the Prometheus data source manually through the Grafana UI, we'll provision it automatically via a YAML file:
cat > ~/monitoring/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      httpMethod: POST
EOF
This tells Grafana to connect to Prometheus using the Docker network hostname. The timeInterval matches our Prometheus scrape interval, which helps Grafana choose appropriate step sizes for graph queries.
Dashboard Provisioning
Configure Grafana to look for dashboard JSON files in a specific directory:
cat > ~/monitoring/grafana/provisioning/dashboards/dashboards.yml << 'EOF'
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Provisioned'
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
EOF
Launch the Stack
Start all services:
cd ~/monitoring
docker compose up -d
# Verify all containers are running
docker compose ps
Expected output:
NAME              IMAGE                                 STATUS         PORTS
cadvisor          gcr.io/cadvisor/cadvisor:v0.49.1      Up 2 minutes
grafana           grafana/grafana:11.1.0                Up 2 minutes   127.0.0.1:3000->3000/tcp
mysqld_exporter   prom/mysqld-exporter:v0.15.1          Up 2 minutes
nginx_exporter    nginx/nginx-prometheus-exporter:1.1   Up 2 minutes
node_exporter     prom/node-exporter:v1.8.1             Up 2 minutes
prometheus        prom/prometheus:v2.53.0               Up 2 minutes   127.0.0.1:9090->9090/tcp
Verify Prometheus is scraping all targets:
# Check Prometheus targets status
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"health"'
All targets should show "health": "up". If any show "health": "down", check the container logs:
# Check logs for a specific exporter
docker compose logs nginx_exporter --tail 20
Importing Pre-Built Community Dashboards
Grafana has thousands of community-created dashboards on grafana.com/grafana/dashboards. Here are the most useful ones for this stack:
| Dashboard | ID | What It Shows |
|---|---|---|
| Node Exporter Full | 1860 | Complete system metrics (CPU, memory, disk, network) |
| Docker Container Monitoring | 893 | Per-container resource usage via cAdvisor |
| Nginx Exporter | 12708 | Nginx connections and request rates |
| MySQL Overview | 7362 | MySQL queries, connections, buffer pool |
| Prometheus Stats | 2 | Prometheus self-monitoring |
To import a dashboard:
1. Access Grafana at `http://localhost:3000` (or via your reverse proxy)
2. Log in with the admin credentials from your `docker-compose.yml`
3. Navigate to Dashboards → New → Import
4. Enter the dashboard ID (e.g., `1860`) and click Load
5. Select Prometheus as the data source and click Import
The Node Exporter Full dashboard (ID 1860) is particularly comprehensive. It shows over 30 panels covering CPU utilization by mode, memory usage breakdown, disk I/O rates, network traffic, system load, and more.
Creating Custom Dashboards
Community dashboards are a great starting point, but you'll want custom dashboards tailored to your specific stack. Here's how to build them.
Essential PromQL Queries
Before creating panels, understand the key PromQL patterns:
# CPU usage percentage (all cores combined)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage for root filesystem
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Network received bytes per second
rate(node_network_receive_bytes_total{device="eth0"}[5m])
# Nginx requests per second
rate(nginx_http_requests_total[5m])
# MySQL queries per second
rate(mysql_global_status_queries[5m])
# Container CPU usage percentage
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
# Container memory usage in MB
container_memory_usage_bytes{name!=""} / 1024 / 1024
# Disk I/O wait time (indicates storage bottleneck)
rate(node_cpu_seconds_total{mode="iowait"}[5m]) * 100
# Filesystem predicted full (linear prediction, 4 hours ahead)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
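It helps to remember that `rate()` over a counter is approximately (last sample − first sample) ÷ window. A sketch with two hypothetical counter samples taken 60 seconds apart:

```shell
# rate() by hand: delta of a monotonically increasing counter over time
v1=12487.91   # counter value at t=0s
v2=12493.91   # counter value at t=60s
awk -v v1="$v1" -v v2="$v2" 'BEGIN { printf "%.2f CPU-seconds/sec\n", (v2 - v1) / 60 }'
# -> 0.10 CPU-seconds/sec
```

For a CPU counter, that per-second rate times 100 is the familiar utilization percentage (here, 10% of one core).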
Building a VPS Overview Dashboard
Create a dashboard with these panels for a single-VPS overview:
Row 1 — Stat panels (single values):
- CPU Usage % — gauge visualization, thresholds at 70% (yellow) and 90% (red)
- Memory Usage % — gauge visualization, same thresholds
- Disk Usage % — gauge visualization, thresholds at 80% and 95%
- Uptime — stat visualization showing `time() - node_boot_time_seconds`
Row 2 — Time-series panels:
- CPU Usage Over Time — stacked area chart by mode (user, system, iowait)
- Memory Usage Over Time — stacked area chart (used, buffers, cached, free)
Row 3 — Time-series panels:
- Network Traffic — two series (received and transmitted) on the same panel
- Disk I/O — read and write bytes per second
Row 4 — Application panels:
- Nginx Requests/sec — time-series from nginx exporter
- MySQL Queries/sec — time-series from MySQL exporter
Using Dashboard Variables
Dashboard variables (also called template variables) let you create reusable dashboards. For example, add a variable to select the network interface dynamically:
1. Open Dashboard Settings → Variables → New Variable
2. Name: `interface`
3. Type: Query
4. Query: `label_values(node_network_receive_bytes_total, device)`
5. Regex: `/^(eth|ens|enp).*/` (filters out virtual interfaces)
Then use $interface in your PromQL queries:
rate(node_network_receive_bytes_total{device="$interface"}[5m])
This adds a dropdown at the top of the dashboard where you can select which network interface to display.
Alert Rules and Notification Channels
Grafana's built-in alerting system can send notifications when metrics cross thresholds. Here's how to set up production-ready alerts.
Configure Contact Points
In Grafana, navigate to Alerting → Contact points. You can configure multiple notification channels:
Email — Uses the SMTP settings from our Docker Compose environment variables. Add a contact point of type "Email" and specify recipient addresses.
Slack — Create an incoming webhook in your Slack workspace, then add a contact point of type "Slack" with the webhook URL:
# Slack webhook URL format
https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
Telegram — Create a bot via @BotFather, get the bot token and chat ID, then configure a Telegram contact point with those values.
Essential Alert Rules
Create these alert rules under Alerting → Alert rules → New alert rule:
High CPU Usage:
# Query A (PromQL)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Condition: Is above 85
# For: 10m (sustained for 10 minutes before firing)
# Summary: CPU usage above 85% for 10 minutes
Memory Running Low:
# Query A (PromQL)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Condition: Is above 90
# For: 5m
# Summary: Memory usage above 90% for 5 minutes
Disk Space Critical:
# Query A (PromQL)
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Condition: Is above 90
# For: 5m
# Summary: Root filesystem above 90% capacity
Disk Space Prediction (fills in 4 hours):
# Query A (PromQL)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
# Condition: Is equal to 1 (the expression itself returns 1 when true)
# For: 30m
# Summary: Disk predicted to fill within 4 hours
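`predict_linear()` fits a least-squares line through the samples in the window and extrapolates it forward. With just two points that collapses to current + slope × horizon, which is easy to sanity-check in shell (the free-space numbers are hypothetical):

```shell
# Will the disk fill within 4 hours, given the last 6 hours of shrinkage?
free_6h_ago=$(( 24 * 1024 * 1024 * 1024 ))   # 24 GiB free six hours ago
free_now=$((    6 * 1024 * 1024 * 1024 ))    #  6 GiB free now
slope=$(( (free_now - free_6h_ago) / (6 * 3600) ))   # bytes/sec, negative here
predicted=$(( free_now + slope * 4 * 3600 ))
echo "predicted free bytes in 4h: $predicted"
if [ "$predicted" -lt 0 ]; then
  echo "alert: disk predicted to fill within 4 hours"
fi
```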
High I/O Wait:
# Query A (PromQL)
avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
# Condition: Is above 20
# For: 10m
# Summary: High I/O wait - possible storage bottleneck
Nginx Connections Spike:
# Query A (PromQL)
nginx_connections_active
# Condition: Is above 500
# For: 5m
# Summary: Nginx active connections above 500 for 5 minutes
MySQL Too Many Connections:
# Query A (PromQL)
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
# Condition: Is above 80
# For: 5m
# Summary: MySQL connection usage above 80%
Container Restart Loop:
# Query A (PromQL)
changes(container_start_time_seconds{name!=""}[10m]) > 2
# Condition: Is above 0 (the expression itself encodes the threshold)
# For: 0m (a restart loop should fire immediately)
# Summary: Container restarted more than twice in 10 minutes
Notification Policies
Under Alerting → Notification policies, configure how alerts route to contact points:
- Critical alerts (disk full, OOM) → Telegram + Email (immediate)
- Warning alerts (high CPU, connection spikes) → Slack channel (grouped, every 5 minutes)
- Set a group wait of 30 seconds (time to wait before sending the first notification for a new group)
- Set a group interval of 5 minutes (time between repeated notifications for the same group)
- Set a repeat interval of 4 hours (time before re-sending a notification that has already been sent)
Securing Access
Since Prometheus and Grafana are bound to 127.0.0.1, they're not directly accessible from the internet. You have two options for external access:
Option 1: Nginx Reverse Proxy with SSL
server {
    listen 443 ssl http2;
    server_name monitoring.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    # Grafana
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # WebSocket support for Grafana Live
    location /api/live/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
Option 2: SSH Tunnel
For quick access without configuring a reverse proxy, use an SSH tunnel from your local machine:
# Forward Grafana to your local port 3000
ssh -L 3000:127.0.0.1:3000 user@your-vps-ip
# Then open http://localhost:3000 in your browser
For persistent secure remote access without ad-hoc tunnels, see our WireGuard VPN guide.
Storage Management for Time-Series Data
Prometheus stores all scraped metrics as time-series data on disk. Understanding storage growth is essential for long-term operation.
Estimating Storage Requirements
The formula for Prometheus storage:
storage ≈ num_series × samples_per_day × bytes_per_sample × retention_days
Approximate:
- bytes_per_sample ≈ 1-2 bytes (after compression)
- samples_per_day at 15s interval = 5,760
For our stack with ~1,500 active time series (node_exporter ~700, cAdvisor ~400, MySQL ~200, Nginx ~50, Prometheus ~150):
1,500 series × 5,760 samples/day × 2 bytes × 30 days ≈ 500 MB
With more containers, more MySQL schemas, or shorter scrape intervals, this grows proportionally.
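That estimate is worth sanity-checking with the same numbers in shell arithmetic:

```shell
# Back-of-envelope Prometheus storage estimate
series=1500
samples_per_day=$(( 86400 / 15 ))   # 5760 samples at a 15s scrape interval
bytes_per_sample=2
days=30
total_bytes=$(( series * samples_per_day * bytes_per_sample * days ))
echo "$(( total_bytes / 1024 / 1024 )) MB"
# -> 494 MB (roughly the 500 MB quoted above)
```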
Monitoring Prometheus Storage
Add these PromQL queries to a "Prometheus Health" dashboard panel:
# Current TSDB block storage size in GB (excludes the write-ahead log)
prometheus_tsdb_storage_blocks_bytes / 1024 / 1024 / 1024
# Number of active time series
prometheus_tsdb_head_series
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Predicted storage size 30 days out, based on the last 7 days of growth
predict_linear(prometheus_tsdb_storage_blocks_bytes[7d], 30*24*3600)
Compaction and Cleanup
Prometheus automatically compacts data blocks over time. You can also manage retention actively:
# Check current disk usage
docker exec prometheus du -sh /prometheus
# Current retention settings are in docker-compose.yml:
# --storage.tsdb.retention.time=30d
# --storage.tsdb.retention.size=10GB
# To change retention, update docker-compose.yml and restart:
docker compose up -d prometheus
If you need to free space immediately, reduce the retention time and Prometheus will clean up old blocks at the next compaction cycle (runs approximately every 2 hours).
Reloading Configuration Without Downtime
When you modify prometheus.yml (for example, to add new scrape targets), you don't need to restart the container:
# Reload Prometheus configuration via API
curl -X POST http://localhost:9090/-/reload
# Verify the reload was successful
curl -s http://localhost:9090/api/v1/status/config | python3 -m json.tool | head -5
This works because we enabled --web.enable-lifecycle in the Prometheus startup command.
Adding More Exporters
The Prometheus ecosystem has exporters for almost everything. Common additions:
| Service | Exporter | Default Port |
|---|---|---|
| PostgreSQL | postgres_exporter | 9187 |
| Redis | redis_exporter | 9121 |
| PHP-FPM | php-fpm_exporter | 9253 |
| Blackbox (URL probing) | blackbox_exporter | 9115 |
| SSL Certificate | ssl_exporter | 9219 |
To add a new exporter, add the container to docker-compose.yml and a corresponding job_name block in prometheus.yml, then reload both services.
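For example, wiring in Redis might look like the sketch below. The `oliver006/redis_exporter` image is the commonly used community exporter; the tag and address here are illustrative, so adjust them to your setup:

```yaml
# Append under services: in docker-compose.yml
redis_exporter:
  image: oliver006/redis_exporter:v1.58.0
  container_name: redis_exporter
  restart: unless-stopped
  environment:
    - REDIS_ADDR=redis://host.docker.internal:6379
  extra_hosts:
    - "host.docker.internal:host-gateway"
  networks:
    - monitoring

# And add under scrape_configs: in prometheus.yml
# - job_name: 'redis'
#   static_configs:
#     - targets: ['redis_exporter:9121']
```

After `docker compose up -d redis_exporter` and a POST to Prometheus's `/-/reload` endpoint, the new target appears on the targets page.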
Troubleshooting Common Issues
Target Shows "Down" in Prometheus
# Check if the exporter is running
docker compose ps
# Check exporter logs
docker compose logs node_exporter --tail 50
# Test connectivity from Prometheus container
docker exec prometheus wget -qO- http://node_exporter:9100/metrics | head -5
Grafana Shows "No Data"
- Verify the data source is configured correctly: Configuration → Data Sources → Prometheus → Test
- Check the time range selector — make sure it covers a period when data was being collected
- Open the query inspector (panel edit → Query Inspector) to see the raw PromQL query and response
High Memory Usage by Prometheus
# Check Prometheus memory usage
docker stats prometheus --no-stream
# If memory is high, check the number of active series
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool
# Reduce cardinality by dropping unnecessary labels in prometheus.yml:
# metric_relabel_configs:
# - source_labels: [__name__]
# regex: 'go_.*'
# action: drop
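Expanded into context, that relabeling sketch sits inside the scrape job it should apply to — for example the `node` job from this guide (the `go_.*` regex is illustrative; drop whichever metric families you never query):

```yaml
# In prometheus.yml — metric_relabel_configs runs after the scrape but
# before samples are stored, so dropped series never reach the TSDB
- job_name: 'node'
  static_configs:
    - targets: ['node_exporter:9100']
      labels:
        instance: 'vps-01'
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'go_.*'
      action: drop
```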
Prefer Managed Monitoring?
Setting up and maintaining a monitoring stack requires ongoing attention: updating container images, managing storage growth, tuning alert thresholds, and responding to incidents. If you'd rather focus on your application while experts handle the infrastructure monitoring, MassiveGRID's fully managed dedicated hosting includes 24/7 monitoring, proactive alerting, and incident response — all handled by our team.
PromQL queries are CPU-intensive for complex dashboards. If you're running dashboards with many panels, heavy aggregations, or long time ranges, dedicated CPU resources ensure consistent query performance without contention from other workloads.
Summary
You now have a production-grade monitoring stack running on your Ubuntu VPS:
- Prometheus — scraping metrics every 15 seconds from four exporters, with 30-day retention
- node_exporter — CPU, memory, disk, network system metrics
- nginx_exporter — web server connection and request metrics
- mysqld_exporter — database query and connection metrics
- cAdvisor — per-container resource usage metrics
- Grafana — dashboards with community and custom panels, alerting to Slack/email/Telegram
This stack scales well. You can add exporters for any new service, create dashboards for different audiences (ops team vs developers), and extend alerting rules as your application grows. The pull-based architecture means adding a new target is just a configuration change — no agents to install on the monitored service.
For ongoing disk management as your Prometheus data grows, check our disk space management guide. And if your monitoring needs grow beyond a single VPS, consider running Prometheus with remote storage backends like Thanos or Cortex for multi-server aggregation.