If you've been running Netdata or basic monitoring scripts on your Ubuntu VPS, you've probably hit a wall. Maybe you need metrics from multiple services correlated on a single dashboard. Maybe you need alerting that goes beyond "disk is full." Maybe you need to retain weeks of historical data to spot trends. That's when it's time to graduate to Prometheus and Grafana — the industry-standard open-source monitoring stack used by companies from startups to Fortune 500s.
This guide walks you through deploying a production-ready monitoring stack on an Ubuntu VPS using Docker Compose. By the end, you'll have Prometheus scraping metrics from your system, your web server, your database, and your containers — all visualized in Grafana dashboards with alerting configured.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
Deploy a self-managed VPS — from $1.99/mo
Need dedicated resources? — from $19.80/mo
Want fully managed hosting? — we handle everything
When to Move Beyond Basic Monitoring
Tools like Netdata (covered in our VPS monitoring setup guide) are excellent for real-time system observation. But as your infrastructure grows, you'll need capabilities that basic tools can't provide:
- Multi-service correlation — overlay Nginx request rates with database query times and CPU usage on a single dashboard
- Custom metrics — instrument your own applications and expose business-level metrics
- Powerful query language — PromQL lets you compute rates, aggregates, percentiles, and predictions
- Long-term retention — store weeks or months of historical data for capacity planning
- Flexible alerting — alert on complex conditions like "error rate above 5% for 10 minutes" rather than simple thresholds
- Dashboard sharing — Grafana dashboards can be shared with your team, embedded, or exported
Architecture Overview
The Prometheus + Grafana stack follows a pull-based architecture:
┌───────────────┐  scrapes
│ node_exporter ├───────────────►┌──────────────────┐
└───────────────┘                │                  │
┌───────────────┐  scrapes       │    Prometheus    │  queries  ┌─────────┐
│nginx_exporter ├───────────────►│ (time-series DB) ├─────────►│ Grafana │
└───────────────┘                │                  │          │  (UI)   │
┌───────────────┐  scrapes       │                  │          └─────────┘
│mysqld_exporter├───────────────►│                  │
└───────────────┘                └──────────────────┘
┌───────────────┐  scrapes               ▲
│   cAdvisor    ├────────────────────────┘
└───────────────┘
Prometheus periodically scrapes HTTP endpoints (called "targets") exposed by exporters. Each exporter translates service-specific metrics into Prometheus's text format. Prometheus stores these metrics as time-series data. Grafana connects to Prometheus as a data source and lets you build dashboards and alerts using PromQL queries.
Prerequisites
You'll need an Ubuntu VPS with at least 2 vCPU and 4 GB RAM. The full stack (Prometheus, Grafana, four exporters) uses approximately 500–800 MB of RAM at idle and scales with the number of metrics and retention period.
Docker and Docker Compose must be installed. If you haven't set those up yet, follow our Docker installation guide first.
Verify your Docker installation:
docker --version
# Docker version 27.x.x
docker compose version
# Docker Compose version v2.x.x
The full stack runs on a Cloud VPS with 2 vCPU / 4 GB RAM. Prometheus stores time-series data — scale storage independently as your retention grows.
Project Structure
Create the directory structure for your monitoring stack:
mkdir -p ~/monitoring/{prometheus,grafana/{provisioning/datasources,provisioning/dashboards}}
cd ~/monitoring
The final structure will look like this:
~/monitoring/
├── docker-compose.yml
├── prometheus/
│   └── prometheus.yml
└── grafana/
    └── provisioning/
        ├── datasources/
        │   └── prometheus.yml
        └── dashboards/
            └── dashboards.yml
Docker Compose: The Full Stack
Create the docker-compose.yml file that defines all services:
cat > ~/monitoring/docker-compose.yml << 'EOF'
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    restart: unless-stopped
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "127.0.0.1:9090:9090"
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme_strong_password_here
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://monitoring.yourdomain.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_SMTP_USER=alerts@yourdomain.com
      - GF_SMTP_PASSWORD=your_app_password
      - GF_SMTP_FROM_ADDRESS=alerts@yourdomain.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "127.0.0.1:3000:3000"
    depends_on:
      - prometheus
    networks:
      - monitoring

  node_exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node_exporter
    restart: unless-stopped
    command:
      - '--path.rootfs=/host'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
      - '--collector.netclass.ignored-devices=^(veth.*|br-.*|docker.*)$$'
    volumes:
      - /:/host:ro,rslave
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    pid: host
    networks:
      - monitoring

  nginx_exporter:
    image: nginx/nginx-prometheus-exporter:1.1
    container_name: nginx_exporter
    restart: unless-stopped
    command:
      - '--nginx.scrape-uri=http://host.docker.internal:8080/nginx_status'
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  mysqld_exporter:
    image: prom/mysqld-exporter:v0.15.1
    container_name: mysqld_exporter
    restart: unless-stopped
    # v0.15+ dropped DATA_SOURCE_NAME in favor of flags plus a password env var
    command:
      - '--mysqld.address=host.docker.internal:3306'
      - '--mysqld.username=exporter'
    environment:
      - MYSQLD_EXPORTER_PASSWORD=exporter_password
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
EOF
Key decisions in this configuration:
- Ports bound to 127.0.0.1 — Prometheus (9090) and Grafana (3000) are only accessible locally. You'll use an Nginx reverse proxy or SSH tunnel for external access.
- Retention settings — Prometheus keeps data for 30 days or up to 10 GB, whichever limit is hit first.
- `--web.enable-lifecycle` — allows reloading Prometheus configuration via HTTP POST without restarting the container.
- Named volumes — `prometheus_data` and `grafana_data` persist data across container restarts.
Prometheus Configuration
Create the Prometheus configuration file that defines what to scrape and how often:
cat > ~/monitoring/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

# Uncomment once you create alert_rules.yml and mount it into the container —
# Prometheus refuses to start if a listed rule file is missing.
# rule_files:
#   - /etc/prometheus/alert_rules.yml

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus'

  # System metrics from node_exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: 'vps-01'

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx_exporter:9113']
        labels:
          instance: 'nginx-main'

  # MySQL/MariaDB metrics
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysqld_exporter:9104']
        labels:
          instance: 'mysql-main'

  # Docker container metrics from cAdvisor
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
        labels:
          instance: 'docker-host'
EOF
The scrape_interval of 15 seconds is the standard for most deployments. Lower intervals (5s) increase storage requirements proportionally. Higher intervals (30s-60s) may miss short spikes.
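The storage math behind that trade-off is simple: each series produces 86400 ÷ interval samples per day. A quick sketch:

```shell
# Samples per day per series at common scrape intervals
# (plain arithmetic; nothing needs to be running)
for interval in 5 15 30; do
  samples=$(( 86400 / interval ))
  echo "interval=${interval}s -> ${samples} samples/day per series"
done
# interval=5s -> 17280 samples/day per series
# interval=15s -> 5760 samples/day per series
# interval=30s -> 2880 samples/day per series
```

Halving the interval doubles both ingestion and storage, which is why 15s is the usual sweet spot.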
Understanding Scrape Configuration
Each job_name groups related targets. The static_configs block lists the endpoints to scrape. Since all our exporters run in the same Docker network, we use container names as hostnames. The labels section adds custom metadata that you can use in PromQL queries and Grafana dashboards to filter and group metrics.
Node Exporter: System Metrics
The node_exporter service defined in our Docker Compose file exposes hundreds of system-level metrics. Here are the key metric families:
| Metric Prefix | What It Measures | Example Metrics |
|---|---|---|
| `node_cpu_seconds_total` | CPU time per core and mode | Time spent in user, system, idle, iowait |
| `node_memory_*` | Memory allocation | MemTotal, MemAvailable, Buffers, Cached |
| `node_filesystem_*` | Disk space and inodes | Size, free, available per mount point |
| `node_disk_*` | Disk I/O operations | Reads/writes completed, bytes, time |
| `node_network_*` | Network interface stats | Bytes received/transmitted, errors, drops |
| `node_load1/5/15` | System load averages | 1, 5, and 15-minute load averages |
Once the stack is running, you can verify node_exporter metrics directly:
curl -s http://localhost:9100/metrics | head -20
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 283974.22
node_cpu_seconds_total{cpu="0",mode="iowait"} 142.87
node_cpu_seconds_total{cpu="0",mode="system"} 4821.63
node_cpu_seconds_total{cpu="0",mode="user"} 12487.91
...
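Each of those lines is just `metric_name{labels} value`, so the format is trivial to post-process with standard tools. As a sketch (the sample values are made up), summing idle CPU seconds across cores with awk:

```shell
# Sum the values of all samples whose labels match mode="idle".
# The heredoc stands in for real /metrics output.
cat << 'METRICS' | awk '/mode="idle"/ { sum += $2 } END { print sum }'
node_cpu_seconds_total{cpu="0",mode="idle"} 1000.5
node_cpu_seconds_total{cpu="1",mode="idle"} 2000.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.0
METRICS
# -> 3001
```

This is essentially what PromQL's `sum()` aggregation does, applied across label sets.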
Adding Nginx Exporter
The nginx_exporter reads from Nginx's stub_status module. First, enable stub_status on your Nginx server. Add a location block to your Nginx configuration:
sudo nano /etc/nginx/conf.d/status.conf
Add the following content:
server {
    listen 8080;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # Docker network range
        deny all;
    }
}
Test and reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
# Verify stub_status works
curl http://localhost:8080/nginx_status
# Expected output:
# Active connections: 3
# server accepts handled requests
# 1542 1542 8923
# Reading: 0 Writing: 1 Waiting: 2
The Nginx exporter translates this into Prometheus metrics like nginx_connections_active, nginx_http_requests_total, and nginx_connections_reading/writing/waiting.
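The translation the exporter performs is mechanical. As a toy sketch of the same idea (the metric names match the exporter's, but the awk here is only an illustration):

```shell
# Convert stub_status text into Prometheus-style metric lines.
# The heredoc mirrors the sample output shown above.
cat << 'STATUS' | awk '
  /Active connections/ { print "nginx_connections_active " $3 }
  /^Reading/           { print "nginx_connections_reading " $2 }'
Active connections: 3
server accepts handled requests
 1542 1542 8923
Reading: 0 Writing: 1 Waiting: 2
STATUS
# -> nginx_connections_active 3
# -> nginx_connections_reading 0
```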
Adding MySQL/MariaDB Exporter
The mysqld_exporter needs a dedicated MySQL user with limited privileges. If you're running MySQL or MariaDB on the host (see our MySQL/MariaDB tuning guide for optimization), create the exporter user:
sudo mysql -u root << 'EOF'
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter_password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
EOF
Security note: Use a strong password in production. The `exporter` user only needs read-only access. The `PROCESS` privilege allows reading query statistics, and `REPLICATION CLIENT` allows reading replication status.
Update the exporter password in your docker-compose.yml to match the one you created. Key metrics exposed by the MySQL exporter include:
- `mysql_global_status_queries` — total query count (use `rate()` for queries per second)
- `mysql_global_status_threads_connected` — current connection count
- `mysql_global_status_slow_queries` — slow query count
- `mysql_global_status_innodb_buffer_pool_reads` — buffer pool cache misses
- `mysql_global_status_connections` — total connection attempts
Docker Metrics via cAdvisor
cAdvisor (Container Advisor) exposes per-container resource usage metrics. This is essential if you run your applications in Docker containers. The key metrics include:
- `container_cpu_usage_seconds_total` — CPU usage per container
- `container_memory_usage_bytes` — memory usage per container
- `container_network_receive_bytes_total` — network I/O per container
- `container_fs_usage_bytes` — filesystem usage per container
cAdvisor automatically discovers all running containers and labels metrics with the container name, image, and ID — no additional configuration needed.
Grafana Data Source Provisioning
Instead of configuring the Prometheus data source manually through the Grafana UI, we'll provision it automatically via a YAML file:
cat > ~/monitoring/grafana/provisioning/datasources/prometheus.yml << 'EOF'
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      httpMethod: POST
EOF
This tells Grafana to connect to Prometheus using the Docker network hostname. The timeInterval matches our Prometheus scrape interval, which helps Grafana choose appropriate step sizes for graph queries.
Dashboard Provisioning
Configure Grafana to look for dashboard JSON files in a specific directory:
cat > ~/monitoring/grafana/provisioning/dashboards/dashboards.yml << 'EOF'
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Provisioned'
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: false
EOF
Launch the Stack
Start all services:
cd ~/monitoring
docker compose up -d
# Verify all containers are running
docker compose ps
Expected output:
NAME              IMAGE                                 STATUS         PORTS
cadvisor          gcr.io/cadvisor/cadvisor:v0.49.1      Up 2 minutes
grafana           grafana/grafana:11.1.0                Up 2 minutes   127.0.0.1:3000->3000/tcp
mysqld_exporter   prom/mysqld-exporter:v0.15.1          Up 2 minutes
nginx_exporter    nginx/nginx-prometheus-exporter:1.1   Up 2 minutes
node_exporter     prom/node-exporter:v1.8.1             Up 2 minutes
prometheus        prom/prometheus:v2.53.0               Up 2 minutes   127.0.0.1:9090->9090/tcp
Verify Prometheus is scraping all targets:
# Check Prometheus targets status
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"health"'
All targets should show "health": "up". If any show "health": "down", check the container logs:
# Check logs for a specific exporter
docker compose logs nginx_exporter --tail 20
Importing Pre-Built Community Dashboards
Grafana has thousands of community-created dashboards on grafana.com/grafana/dashboards. Here are the most useful ones for this stack:
| Dashboard | ID | What It Shows |
|---|---|---|
| Node Exporter Full | 1860 | Complete system metrics (CPU, memory, disk, network) |
| Docker Container Monitoring | 893 | Per-container resource usage via cAdvisor |
| Nginx Exporter | 12708 | Nginx connections and request rates |
| MySQL Overview | 7362 | MySQL queries, connections, buffer pool |
| Prometheus Stats | 2 | Prometheus self-monitoring |
To import a dashboard:
1. Access Grafana at `http://localhost:3000` (or via your reverse proxy)
2. Log in with the admin credentials from your `docker-compose.yml`
3. Navigate to Dashboards → New → Import
4. Enter the dashboard ID (e.g., `1860`) and click Load
5. Select Prometheus as the data source and click Import
The Node Exporter Full dashboard (ID 1860) is particularly comprehensive. It shows over 30 panels covering CPU utilization by mode, memory usage breakdown, disk I/O rates, network traffic, system load, and more.
Creating Custom Dashboards
Community dashboards are a great starting point, but you'll want custom dashboards tailored to your specific stack. Here's how to build them.
Essential PromQL Queries
Before creating panels, understand the key PromQL patterns:
# CPU usage percentage (all cores combined)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage for root filesystem
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Network received bytes per second
rate(node_network_receive_bytes_total{device="eth0"}[5m])
# Nginx requests per second
rate(nginx_http_requests_total[5m])
# MySQL queries per second
rate(mysql_global_status_queries[5m])
# Container CPU usage percentage
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
# Container memory usage in MB
container_memory_usage_bytes{name!=""} / 1024 / 1024
# Disk I/O wait time (indicates storage bottleneck)
rate(node_cpu_seconds_total{mode="iowait"}[5m]) * 100
# Filesystem predicted full (linear prediction, 4 hours ahead)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
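It helps to remember that `rate()` over a counter is approximately (last sample − first sample) ÷ window. A sketch with two hypothetical counter samples taken 60 seconds apart:

```shell
# rate() by hand: delta of a monotonically increasing counter over time
v1=12487.91   # counter value at t=0s
v2=12493.91   # counter value at t=60s
awk -v v1="$v1" -v v2="$v2" 'BEGIN { printf "%.2f CPU-seconds/sec\n", (v2 - v1) / 60 }'
# -> 0.10 CPU-seconds/sec
```

For a CPU counter, that per-second rate times 100 is the familiar utilization percentage (here, 10% of one core).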
Building a VPS Overview Dashboard
Create a dashboard with these panels for a single-VPS overview:
Row 1 — Stat panels (single values):
- CPU Usage % — gauge visualization, thresholds at 70% (yellow) and 90% (red)
- Memory Usage % — gauge visualization, same thresholds
- Disk Usage % — gauge visualization, thresholds at 80% and 95%
- Uptime — stat visualization showing `time() - node_boot_time_seconds`
Row 2 — Time-series panels:
- CPU Usage Over Time — stacked area chart by mode (user, system, iowait)
- Memory Usage Over Time — stacked area chart (used, buffers, cached, free)
Row 3 — Time-series panels:
- Network Traffic — two series (received and transmitted) on the same panel
- Disk I/O — read and write bytes per second
Row 4 — Application panels:
- Nginx Requests/sec — time-series from nginx exporter
- MySQL Queries/sec — time-series from MySQL exporter
Using Dashboard Variables
Dashboard variables (also called template variables) let you create reusable dashboards. For example, add a variable to select the network interface dynamically:
1. Open Dashboard Settings → Variables → New Variable
2. Name: `interface`
3. Type: Query
4. Query: `label_values(node_network_receive_bytes_total, device)`
5. Regex: `/^(eth|ens|enp).*/` (filters out virtual interfaces)
Then use $interface in your PromQL queries:
rate(node_network_receive_bytes_total{device="$interface"}[5m])
This adds a dropdown at the top of the dashboard where you can select which network interface to display.
Alert Rules and Notification Channels
Grafana's built-in alerting system can send notifications when metrics cross thresholds. Here's how to set up production-ready alerts.
Configure Contact Points
In Grafana, navigate to Alerting → Contact points. You can configure multiple notification channels:
Email — Uses the SMTP settings from our Docker Compose environment variables. Add a contact point of type "Email" and specify recipient addresses.
Slack — Create an incoming webhook in your Slack workspace, then add a contact point of type "Slack" with the webhook URL:
# Slack webhook URL format
https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
Telegram — Create a bot via @BotFather, get the bot token and chat ID, then configure a Telegram contact point with those values.
Essential Alert Rules
Create these alert rules under Alerting → Alert rules → New alert rule:
High CPU Usage:
# Query A (PromQL)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Condition: Is above 85
# For: 10m (sustained for 10 minutes before firing)
# Summary: CPU usage above 85% for 10 minutes
Memory Running Low:
# Query A (PromQL)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Condition: Is above 90
# For: 5m
# Summary: Memory usage above 90% for 5 minutes
Disk Space Critical:
# Query A (PromQL)
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Condition: Is above 90
# For: 5m
# Summary: Root filesystem above 90% capacity
Disk Space Prediction (fills in 4 hours):
# Query A (PromQL)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
# Condition: Is equal to 1 (the expression itself returns 1 when true)
# For: 30m
# Summary: Disk predicted to fill within 4 hours
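`predict_linear()` fits a least-squares line through the samples in the window and extrapolates it forward. With just two points that collapses to current + slope × horizon, which is easy to sanity-check in shell (the free-space numbers are hypothetical):

```shell
# Will the disk fill within 4 hours, given the last 6 hours of shrinkage?
free_6h_ago=$(( 24 * 1024 * 1024 * 1024 ))   # 24 GiB free six hours ago
free_now=$((    6 * 1024 * 1024 * 1024 ))    #  6 GiB free now
slope=$(( (free_now - free_6h_ago) / (6 * 3600) ))   # bytes/sec, negative here
predicted=$(( free_now + slope * 4 * 3600 ))
echo "predicted free bytes in 4h: $predicted"
if [ "$predicted" -lt 0 ]; then
  echo "alert: disk predicted to fill within 4 hours"
fi
```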
High I/O Wait:
# Query A (PromQL)
avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
# Condition: Is above 20
# For: 10m
# Summary: High I/O wait - possible storage bottleneck
Nginx Connections Spike:
# Query A (PromQL)
nginx_connections_active
# Condition: Is above 500
# For: 5m
# Summary: Nginx active connections above 500 for 5 minutes
MySQL Too Many Connections:
# Query A (PromQL)
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
# Condition: Is above 80
# For: 5m
# Summary: MySQL connection usage above 80%
Container Restart Loop:
# Query A (PromQL)
changes(container_start_time_seconds{name!=""}[10m]) > 2
# Condition: Is above 0 (the expression itself encodes the threshold)
# For: 0m (a restart loop should fire immediately)
# Summary: Container restarted more than twice in 10 minutes
Notification Policies
Under Alerting → Notification policies, configure how alerts route to contact points:
- Critical alerts (disk full, OOM) → Telegram + Email (immediate)
- Warning alerts (high CPU, connection spikes) → Slack channel (grouped, every 5 minutes)
- Set a group wait of 30 seconds (time to wait before sending the first notification for a new group)
- Set a group interval of 5 minutes (time between repeated notifications for the same group)
- Set a repeat interval of 4 hours (time before re-sending a notification that has already been sent)
Securing Access
Since Prometheus and Grafana are bound to 127.0.0.1, they're not directly accessible from the internet. You have two options for external access:
Option 1: Nginx Reverse Proxy with SSL
server {
    listen 443 ssl http2;
    server_name monitoring.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/monitoring.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/monitoring.yourdomain.com/privkey.pem;

    # Grafana
    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # WebSocket support for Grafana Live
    location /api/live/ {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
Option 2: SSH Tunnel
For quick access without configuring a reverse proxy, use an SSH tunnel from your local machine:
# Forward Grafana to your local port 3000
ssh -L 3000:127.0.0.1:3000 user@your-vps-ip
# Then open http://localhost:3000 in your browser
For persistent secure remote access without ad-hoc tunnels, see our WireGuard VPN guide.
Storage Management for Time-Series Data
Prometheus stores all scraped metrics as time-series data on disk. Understanding storage growth is essential for long-term operation.
Estimating Storage Requirements
The formula for Prometheus storage:
storage ≈ num_series × samples_per_day × bytes_per_sample × retention_days
Approximate:
- bytes_per_sample ≈ 1-2 bytes (after compression)
- samples_per_day at 15s interval = 5,760
For our stack with ~1,500 active time series (node_exporter ~700, cAdvisor ~400, MySQL ~200, Nginx ~50, Prometheus ~150):
1,500 series × 5,760 samples/day × 2 bytes × 30 days ≈ 500 MB
With more containers, more MySQL schemas, or shorter scrape intervals, this grows proportionally.
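That estimate is worth sanity-checking with the same numbers in shell arithmetic:

```shell
# Back-of-envelope Prometheus storage estimate
series=1500
samples_per_day=$(( 86400 / 15 ))   # 5760 samples at a 15s scrape interval
bytes_per_sample=2
days=30
total_bytes=$(( series * samples_per_day * bytes_per_sample * days ))
echo "$(( total_bytes / 1024 / 1024 )) MB"
# -> 494 MB (roughly the 500 MB quoted above)
```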
Monitoring Prometheus Storage
Add these PromQL queries to a "Prometheus Health" dashboard panel:
# Current TSDB block storage size in GB (excludes the write-ahead log)
prometheus_tsdb_storage_blocks_bytes / 1024 / 1024 / 1024
# Number of active time series
prometheus_tsdb_head_series
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Predicted storage size 30 days out, based on the last 7 days of growth
predict_linear(prometheus_tsdb_storage_blocks_bytes[7d], 30*24*3600)
Compaction and Cleanup
Prometheus automatically compacts data blocks over time. You can also manage retention actively:
# Check current disk usage
docker exec prometheus du -sh /prometheus
# Current retention settings are in docker-compose.yml:
# --storage.tsdb.retention.time=30d
# --storage.tsdb.retention.size=10GB
# To change retention, update docker-compose.yml and restart:
docker compose up -d prometheus
If you need to free space immediately, reduce the retention time and Prometheus will clean up old blocks at the next compaction cycle (runs approximately every 2 hours).
Reloading Configuration Without Downtime
When you modify prometheus.yml (for example, to add new scrape targets), you don't need to restart the container:
# Reload Prometheus configuration via API
curl -X POST http://localhost:9090/-/reload
# Verify the reload was successful
curl -s http://localhost:9090/api/v1/status/config | python3 -m json.tool | head -5
This works because we enabled --web.enable-lifecycle in the Prometheus startup command.
Adding More Exporters
The Prometheus ecosystem has exporters for almost everything. Common additions:
| Service | Exporter | Default Port |
|---|---|---|
| PostgreSQL | postgres_exporter | 9187 |
| Redis | redis_exporter | 9121 |
| PHP-FPM | php-fpm_exporter | 9253 |
| Blackbox (URL probing) | blackbox_exporter | 9115 |
| SSL Certificate | ssl_exporter | 9219 |
To add a new exporter, add the container to docker-compose.yml and a corresponding job_name block in prometheus.yml, then reload both services.
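For example, wiring in Redis might look like the sketch below. The `oliver006/redis_exporter` image is the commonly used community exporter; the tag and address here are illustrative, so adjust them to your setup:

```yaml
# Append under services: in docker-compose.yml
redis_exporter:
  image: oliver006/redis_exporter:v1.58.0
  container_name: redis_exporter
  restart: unless-stopped
  environment:
    - REDIS_ADDR=redis://host.docker.internal:6379
  extra_hosts:
    - "host.docker.internal:host-gateway"
  networks:
    - monitoring

# And add under scrape_configs: in prometheus.yml
# - job_name: 'redis'
#   static_configs:
#     - targets: ['redis_exporter:9121']
```

After `docker compose up -d redis_exporter` and a POST to Prometheus's `/-/reload` endpoint, the new target appears on the targets page.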
Troubleshooting Common Issues
Target Shows "Down" in Prometheus
# Check if the exporter is running
docker compose ps
# Check exporter logs
docker compose logs node_exporter --tail 50
# Test connectivity from Prometheus container
docker exec prometheus wget -qO- http://node_exporter:9100/metrics | head -5
Grafana Shows "No Data"
- Verify the data source is configured correctly: Configuration → Data Sources → Prometheus → Test
- Check the time range selector — make sure it covers a period when data was being collected
- Open the query inspector (panel edit → Query Inspector) to see the raw PromQL query and response
High Memory Usage by Prometheus
# Check Prometheus memory usage
docker stats prometheus --no-stream
# If memory is high, check the number of active series
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool
# Reduce cardinality by dropping unnecessary labels in prometheus.yml:
# metric_relabel_configs:
# - source_labels: [__name__]
# regex: 'go_.*'
# action: drop
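Expanded into context, that relabeling sketch sits inside the scrape job it should apply to — for example the `node` job from this guide (the `go_.*` regex is illustrative; drop whichever metric families you never query):

```yaml
# In prometheus.yml — metric_relabel_configs runs after the scrape but
# before samples are stored, so dropped series never reach the TSDB
- job_name: 'node'
  static_configs:
    - targets: ['node_exporter:9100']
      labels:
        instance: 'vps-01'
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'go_.*'
      action: drop
```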
Prefer Managed Monitoring?
Setting up and maintaining a monitoring stack requires ongoing attention: updating container images, managing storage growth, tuning alert thresholds, and responding to incidents. If you'd rather focus on your application while experts handle the infrastructure monitoring, MassiveGRID's fully managed dedicated hosting includes 24/7 monitoring, proactive alerting, and incident response — all handled by our team.
PromQL queries are CPU-intensive for complex dashboards. If you're running dashboards with many panels, heavy aggregations, or long time ranges, dedicated CPU resources ensure consistent query performance without contention from other workloads.
Summary
You now have a production-grade monitoring stack running on your Ubuntu VPS:
- Prometheus — scraping metrics every 15 seconds from four exporters, with 30-day retention
- node_exporter — CPU, memory, disk, network system metrics
- nginx_exporter — web server connection and request metrics
- mysqld_exporter — database query and connection metrics
- cAdvisor — per-container resource usage metrics
- Grafana — dashboards with community and custom panels, alerting to Slack/email/Telegram
This stack scales well. You can add exporters for any new service, create dashboards for different audiences (ops team vs developers), and extend alerting rules as your application grows. The pull-based architecture means adding a new target is just a configuration change — no agents to install on the monitored service.
For ongoing disk management as your Prometheus data grows, check our disk space management guide. And if your monitoring needs grow beyond a single VPS, consider running Prometheus with remote storage backends like Thanos or Cortex for multi-server aggregation.