Docker Swarm occupies a sweet spot in container orchestration that Kubernetes often overshoots. If you're running 2 to 10 servers, need service discovery and rolling updates, but don't want to operate etcd clusters, manage Helm charts, or dedicate three nodes just to the control plane, Swarm delivers orchestration with the tooling you already know. It's built into Docker, uses Compose files for deployment, and a three-node cluster can be running in under ten minutes.
This guide walks through deploying a Docker Swarm cluster across multiple Ubuntu VPS instances — from initializing the swarm and joining workers through overlay networking, service deployment, rolling updates, secrets management, and monitoring. If you've outgrown single-server Docker but aren't ready for Kubernetes complexity, Swarm is exactly where you should be.
MassiveGRID Ubuntu VPS includes: Ubuntu 24.04 LTS pre-installed · Proxmox HA cluster with automatic failover · Ceph 3x replicated NVMe storage · Independent CPU/RAM/storage scaling · 12 Tbps DDoS protection · 4 global datacenter locations · 100% uptime SLA · 24/7 human support rated 9.5/10
Deploy a self-managed VPS — from $1.99/mo
Need dedicated resources? — from $19.80/mo
Want fully managed hosting? — we handle everything
When Docker Swarm Makes Sense
Docker Swarm fits a specific niche in the orchestration landscape:
- 2-10 nodes — large enough that deploying containers manually across servers is error-prone, small enough that Kubernetes is overkill
- Teams already using Docker Compose — Swarm deploys Compose files natively with docker stack deploy, so your existing Compose files work with minimal changes
- Applications needing basic orchestration — rolling updates, health checks, service discovery, load balancing, and placement constraints without YAML sprawl
- Limited ops capacity — a single engineer can operate a Swarm cluster, while Kubernetes typically demands a dedicated platform team
Swarm is not the right choice if you need auto-scaling based on metrics, advanced traffic management (canary deploys, traffic splitting), custom resource types, or if your cluster will grow beyond 10-15 nodes. At that scale, the investment in Kubernetes or Nomad pays off.
Docker Swarm vs Kubernetes vs Nomad
An honest comparison for VPS users running small to medium infrastructure:
- Docker Swarm — zero additional installation (built into Docker), Compose-native, minimal learning curve, limited ecosystem. Best for: teams that want orchestration without a new platform to learn.
- Kubernetes (K3s/K0s) — massive ecosystem, infinite extensibility, complex operations, heavy resource overhead (even lightweight distributions need 512MB+ RAM per node for system components). Best for: teams with platform engineering capacity and growing infrastructure.
- HashiCorp Nomad — single binary, supports containers and non-container workloads, simpler than Kubernetes but more capable than Swarm, integrates with the HashiCorp ecosystem (Consul, Vault). Best for: mixed workloads or teams already using HashiCorp tools.
For most VPS-based deployments running web applications, APIs, and background workers across a handful of servers, Swarm provides the best ratio of capability to operational complexity.
Prerequisites
You need 2-3 Ubuntu 24.04 VPS instances with Docker installed; Swarm requires a minimum of two nodes. Deploy Cloud VPS instances in different datacenters for geographic distribution and fault isolation.
If you haven't installed Docker yet, follow the Docker installation guide for Ubuntu VPS on each node.
For this guide, we'll use three nodes:
- manager1 — 203.0.113.10 (Swarm manager + worker)
- worker1 — 203.0.113.11 (worker node)
- worker2 — 203.0.113.12 (worker node)
Open the following ports between all Swarm nodes:
# On all nodes — allow Swarm traffic
sudo ufw allow 2377/tcp # Swarm management (manager only, but open on all for promotion)
sudo ufw allow 7946/tcp # Node communication
sudo ufw allow 7946/udp # Node communication
sudo ufw allow 4789/udp # Overlay network (VXLAN)
# Verify
sudo ufw status
Initializing the Swarm
On the manager node, initialize the swarm. The --advertise-addr flag specifies the IP that other nodes use to reach this manager:
# On manager1 (203.0.113.10)
docker swarm init --advertise-addr 203.0.113.10
Docker outputs a docker swarm join command with a token. Save this — you'll run it on worker nodes. You can also retrieve the join tokens later:
# Get worker join token
docker swarm join-token worker
# Get manager join token (for adding additional managers)
docker swarm join-token manager
Verify the swarm is active:
docker node ls
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
# abc123 * manager1 Ready Active Leader 27.x.x
Joining Worker Nodes
On each worker node, run the join command from the init output:
# On worker1 (203.0.113.11)
docker swarm join --token SWMTKN-1-abc123... 203.0.113.10:2377
# On worker2 (203.0.113.12)
docker swarm join --token SWMTKN-1-abc123... 203.0.113.10:2377
Back on the manager, verify all nodes are joined:
docker node ls
# ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
# abc123 * manager1 Ready Active Leader 27.x.x
# def456 worker1 Ready Active 27.x.x
# ghi789 worker2 Ready Active 27.x.x
For production clusters, run 3 or 5 manager nodes for high availability. Swarm uses the Raft consensus algorithm — it tolerates (N-1)/2 manager failures. Three managers tolerate one failure; five tolerate two.
# Promote a worker to manager (run on existing manager)
docker node promote worker1
# Or add a node directly as manager using the manager join token
docker swarm join --token SWMTKN-1-manager-token... 203.0.113.10:2377
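The quorum arithmetic above can be sketched in a few lines. This is a hypothetical helper for reasoning about manager counts, not part of any Docker tooling:

```python
def raft_fault_tolerance(managers: int) -> int:
    """Return how many manager failures a swarm with this many managers survives.

    Raft needs a strict majority (quorum) of managers to agree on cluster
    state, so a cluster of N managers tolerates (N - 1) // 2 failures.
    """
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return (managers - 1) // 2

for n in (1, 3, 5, 7):
    quorum = n // 2 + 1
    print(f"{n} managers -> quorum {quorum}, tolerates {raft_fault_tolerance(n)} failure(s)")
```

Note that even numbers buy nothing: four managers still have a quorum of three and tolerate only one failure, which is why manager counts are always odd.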
Overlay Networking
Swarm overlay networks let containers running on different nodes communicate, with optional encryption of data traffic. Unlike bridge networks (single host), overlay networks span the entire cluster:
# Create an encrypted overlay network
docker network create \
--driver overlay \
--attachable \
--opt encrypted \
app-network
The --attachable flag allows standalone containers (not just Swarm services) to join the network. The --opt encrypted flag enables IPsec encryption for all inter-node traffic on this network — essential when nodes communicate over the public internet.
Containers on the same overlay network can reach each other by service name. Swarm provides built-in DNS-based service discovery — no Consul or external service registry needed:
# Container in "web" service can reach "api" service at:
# http://api:3000
# DNS resolves to the virtual IP load-balanced across all "api" task replicas
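Because that discovery is plain DNS, ordinary resolver calls work from application code with no Swarm-specific client library. A small sketch; the service name api comes from the example above, and the lookup only succeeds inside a container attached to the overlay network:

```python
import socket

def resolve_service(name: str) -> list[str]:
    """Resolve a Swarm service name to its IP address(es) via ordinary DNS.

    Inside a container on the overlay network, resolving a service name like
    "api" returns the service's virtual IP; outside the cluster the name does
    not exist and socket.gaierror is raised.
    """
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# Inside a container on app-network this would return the "api" virtual IP:
# resolve_service("api")
print(resolve_service("localhost"))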
Deploying Services
A Swarm service is the unit of deployment — it defines which image to run, how many replicas, which network to attach, and how to update:
# Deploy a single-replica service
docker service create \
--name api \
--network app-network \
--publish 3000:3000 \
--replicas 1 \
myregistry/api:latest
# Deploy a scaled service (3 replicas across the cluster)
docker service create \
--name web \
--network app-network \
--publish 80:8080 \
--replicas 3 \
--update-delay 10s \
--update-parallelism 1 \
myregistry/web:latest
# List services
docker service ls
# See where replicas are running
docker service ps web
When you publish a port with --publish, Swarm creates an ingress routing mesh. Any node in the cluster accepts connections on that port and routes them to a healthy replica, regardless of which node the replica runs on. Hit port 80 on any node's IP and reach one of the three web replicas.
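One way to verify the mesh from outside the cluster is to request the published port on every node's IP and confirm each one answers. A hypothetical probe helper, using only the standard library; the node IPs are the ones from the example cluster:

```python
import urllib.request

def probe_mesh(node_ips, port=80, path="/", timeout=3):
    """Request the published port on every node in the swarm.

    With the ingress routing mesh, every node should answer on the published
    port even when no replica of the service runs locally.
    """
    results = {}
    for ip in node_ips:
        url = f"http://{ip}:{port}{path}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[ip] = resp.status
        except OSError as exc:
            results[ip] = f"unreachable ({exc})"
    return results

# Against the example cluster you would expect 200 from all three nodes:
# probe_mesh(["203.0.113.10", "203.0.113.11", "203.0.113.12"], port=80)
```

A node reporting "unreachable" while the others answer usually points at a firewall blocking the ingress ports (7946, 4789) rather than at the service itself.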
Docker Stack Deploys with Compose Files
For real applications, define your services in a Compose file and deploy as a stack. This is the Swarm-native approach to multi-service deployments:
# docker-compose.prod.yml
version: "3.8"

services:
  web:
    image: myregistry/web:${TAG:-latest}
    networks:
      - app-network
    ports:
      - "80:8080"
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 15s
        failure_action: rollback
        order: start-first
      rollback_config:
        parallelism: 0
        order: stop-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          cpus: "0.5"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 256M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s

  api:
    image: myregistry/api:${TAG:-latest}
    networks:
      - app-network
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/app
      - REDIS_URL=redis://redis:6379
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      resources:
        limits:
          cpus: "1.0"
          memory: 1G

  redis:
    image: redis:7-alpine
    networks:
      - app-network
    volumes:
      - redis-data:/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager

  db:
    image: postgres:16-alpine
    networks:
      - app-network
    volumes:
      - pg-data:/var/lib/postgresql/data
    environment:
      - POSTGRES_DB=app
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.labels.storage == ssd

networks:
  app-network:
    driver: overlay
    attachable: true

volumes:
  redis-data:
  pg-data:

secrets:
  db_password:
    external: true
Deploy the stack:
# Deploy or update the stack
TAG=v1.2.3 docker stack deploy -c docker-compose.prod.yml myapp
# List stacks
docker stack ls
# List services in a stack
docker stack services myapp
# View logs for a service
docker service logs -f myapp_web
# Remove the stack
docker stack rm myapp
Rolling Updates with Health Checks
Swarm performs rolling updates by default — replacing one replica at a time, waiting for the new replica to pass its health check before proceeding:
# Update the image for a service
docker service update --image myregistry/web:v1.3.0 myapp_web
# Watch the update progress
docker service ps myapp_web
# If something goes wrong, rollback
docker service rollback myapp_web
The update_config in the Compose file controls update behavior:
- parallelism: 1 — update one replica at a time
- delay: 15s — wait 15 seconds between replicas
- failure_action: rollback — automatically roll back if a replica fails to start
- order: start-first — start the new replica before stopping the old one (ensures capacity during updates)
The health check is critical. Without it, Swarm considers a container healthy as soon as it starts. With a health check, Swarm waits for the health check to pass before routing traffic and before updating the next replica.
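The /health endpoint that the Compose healthcheck curls can be as small as a standard-library HTTP handler. A minimal sketch; the path and port match the example above, and the readiness logic is a placeholder:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # A real service would also verify its dependencies here
            # (database connection, cache reachability) before answering 200.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep container logs free of per-probe noise

# In the container's entrypoint you would run:
# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Returning 200 only when dependencies are actually reachable is what makes start-first updates safe: a replica that starts but cannot reach the database never receives traffic.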
Placement Constraints
Placement constraints pin services to specific nodes based on node labels, roles, or hostnames:
# Add labels to nodes
docker node update --label-add storage=ssd worker1
docker node update --label-add storage=ssd worker2
docker node update --label-add region=us-east manager1
docker node update --label-add region=eu-west worker1
# Constrain database to SSD-labeled nodes
docker service create \
--name db \
--constraint 'node.labels.storage == ssd' \
postgres:16
# Run monitoring only on manager nodes
docker service create \
--name monitoring \
--constraint 'node.role == manager' \
--mode global \
prom/node-exporter
# Spread replicas across regions
docker service create \
--name web \
--replicas 4 \
--placement-pref 'spread=node.labels.region' \
myregistry/web:latest
The --mode global flag runs exactly one replica on every node matching the constraint — useful for monitoring agents, log collectors, and network tools.
Secrets Management
Swarm secrets hold sensitive data (passwords, API keys, TLS certificates) encrypted in the Raft log; the swarm mounts each secret as a file inside containers at /run/secrets/:
# Create a secret from stdin
echo "my_db_password_here" | docker secret create db_password -
# Create from a file
docker secret create tls_cert /path/to/cert.pem
# List secrets
docker secret ls
# Inspect (shows metadata, not the value)
docker secret inspect db_password
# Use in a service
docker service create \
--name api \
--secret db_password \
--secret tls_cert \
myregistry/api:latest
Inside the container, read secrets from the filesystem:
# In your application code (Python example)
import pathlib

def get_secret(name):
    secret_path = pathlib.Path(f"/run/secrets/{name}")
    if secret_path.exists():
        return secret_path.read_text().strip()
    raise ValueError(f"Secret {name} not found")

db_password = get_secret("db_password")
Secrets are never written to disk on worker nodes — they exist only in memory (tmpfs). When a service is removed or a secret is rotated, it's immediately cleaned up from all nodes.
To rotate a secret:
# Create new version
echo "new_password" | docker secret create db_password_v2 -
# Update the service to use the new secret
docker service update \
--secret-rm db_password \
--secret-add db_password_v2 \
myapp_api
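Because rotation changes the secret's name, application code can simply try the versioned names in order. A hypothetical reader with fallback; the names come from the rotation example above, and the base parameter exists only so the helper can be exercised outside a container:

```python
import pathlib

def read_secret(*names: str, base: str = "/run/secrets") -> str:
    """Return the first secret that exists, trying names newest-first.

    During a rotation a task may have either the new or the old secret
    mounted, so callers list both:
    read_secret("db_password_v2", "db_password")
    """
    for name in names:
        path = pathlib.Path(base) / name
        if path.exists():
            return path.read_text().strip()
    raise FileNotFoundError(f"none of {names} found under {base}")
```

Once every service has been updated to the new secret, the old one can be deleted with docker secret rm and the fallback name dropped from the code.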
Monitoring a Swarm
Two tools complement each other for Swarm monitoring:
Portainer provides a web UI for managing the Swarm — deploying stacks, viewing service logs, inspecting containers, and managing volumes. See installing Portainer on Ubuntu VPS for setup details. Deploy it as a Swarm service:
docker service create \
--name portainer \
--publish 9443:9443 \
--constraint 'node.role == manager' \
--mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
--mount type=volume,src=portainer_data,dst=/data \
portainer/portainer-ce:latest
Prometheus + Grafana provides metrics collection and visualization for deeper performance analysis. See Prometheus and Grafana on Ubuntu VPS for the full setup. Deploy node-exporter as a global service to collect system metrics from every node:
# Deploy node-exporter on every node
docker service create \
--name node-exporter \
--mode global \
--network app-network \
--mount type=bind,src=/proc,dst=/host/proc,readonly \
--mount type=bind,src=/sys,dst=/host/sys,readonly \
--mount type=bind,src=/,dst=/rootfs,readonly \
prom/node-exporter:latest \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs
# Deploy cAdvisor for container-level metrics
docker service create \
--name cadvisor \
--mode global \
--network app-network \
--mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock,readonly \
--mount type=bind,src=/,dst=/rootfs,readonly \
--mount type=bind,src=/sys,dst=/sys,readonly \
--mount type=bind,src=/var/lib/docker/,dst=/var/lib/docker,readonly \
gcr.io/cadvisor/cadvisor:latest
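On the Prometheus side, one way to scrape every node-exporter and cAdvisor task is Swarm's DNS: resolving tasks.&lt;service-name&gt; returns the IP of each individual task rather than the service's virtual IP. A sketch of the relevant prometheus.yml scrape jobs, assuming Prometheus itself runs as a service on the same app-network (ports 9100 and 8080 are the exporters' defaults):

```yaml
scrape_configs:
  - job_name: "node-exporter"
    dns_sd_configs:
      - names: ["tasks.node-exporter"]   # one A record per running task
        type: A
        port: 9100

  - job_name: "cadvisor"
    dns_sd_configs:
      - names: ["tasks.cadvisor"]
        type: A
        port: 8080
```

Because both services run in global mode, new nodes show up in Prometheus automatically as soon as their exporter task starts.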
Manager Nodes Need Consistent Performance
Swarm manager nodes run the Raft consensus algorithm, which is sensitive to latency. Manager nodes must exchange heartbeats and reach agreement on cluster state within tight timeouts. If a manager is slow to respond — due to CPU contention from noisy neighbors — the cluster can trigger unnecessary leader elections, causing brief service disruptions.
Run Swarm manager nodes on Dedicated VPS instances where CPU and memory resources are guaranteed. Raft consensus decisions aren't delayed by resource contention, and leader elections happen only when they should — during actual node failures, not performance dips.
Scaling: Adding and Removing Nodes
Swarm makes horizontal scaling straightforward:
# Scale a service up
docker service scale myapp_web=5
# Scale multiple services at once
docker service scale myapp_web=5 myapp_api=3
# Add a new worker node
# 1. Deploy a new VPS and install Docker
# 2. Get the join token from the manager
docker swarm join-token worker
# 3. Run the join command on the new node
docker swarm join --token SWMTKN-1-abc... 203.0.113.10:2377
# Swarm automatically schedules new replicas on the new node when you scale up
To remove a node gracefully:
# 1. Drain the node (reschedules all tasks to other nodes)
docker node update --availability drain worker2
# 2. Wait for tasks to migrate
docker node ps worker2 # Should show all tasks as "Shutdown"
# 3. On the worker node, leave the swarm
docker swarm leave
# 4. On the manager, remove the node
docker node rm worker2
Draining before removal ensures all containers are rescheduled to healthy nodes with no service interruption.
Prefer Managed Orchestration?
A multi-node Swarm cluster means managing OS updates, Docker engine upgrades, network security, TLS certificate rotation, and monitoring across every server in the cluster. Each node is an attack surface that needs patching, and a Docker upgrade on one node can introduce incompatibilities with others.
With MassiveGRID managed servers, the operations team handles infrastructure orchestration, OS maintenance, security patching, and Docker lifecycle management across your cluster — so you deploy your application stack while the platform team ensures the underlying infrastructure stays healthy, secure, and up to date.