A single-server XWiki deployment works until it does not. The moment your wiki becomes the authoritative source for operational procedures, compliance documentation, or customer-facing knowledge, a single point of failure becomes an unacceptable business risk. High availability for XWiki means running multiple application nodes behind a load balancer, sharing a common database and filesystem, and ensuring that the failure of any single component does not take the wiki offline. This guide covers the architecture, configuration, and operational considerations for deploying XWiki in a highly available clustered configuration -- from application node setup and session management to database replication and cache invalidation.

Before implementing clustering, ensure your base XWiki installation is production-ready. Our XWiki production installation guide covers server sizing, database configuration, JVM tuning, and security hardening for single-node deployments. Clustering builds on that foundation -- every optimization and security measure from the single-node guide applies to each node in the cluster.

XWiki Clustering Architecture Overview

An XWiki cluster consists of multiple application nodes running identical XWiki instances that share a common PostgreSQL database and a shared filesystem for attachments and extension data. A load balancer distributes incoming requests across the application nodes, and a cache invalidation mechanism ensures that content changes on one node are visible to users connected to other nodes. The architecture looks like this:

Users connect to a load balancer (Nginx or HAProxy), which forwards requests to one of several XWiki application nodes (each running Tomcat with XWiki). All nodes connect to the same PostgreSQL database (which may itself be replicated for HA) and mount the same shared filesystem (NFS, GlusterFS, or Ceph) for attachment storage. JGroups handles inter-node communication for cache invalidation and event distribution.

Core Components

The load balancer is the entry point for all user traffic. It distributes requests across healthy application nodes, terminates SSL, and performs health checks to remove failed nodes from the rotation. Nginx and HAProxy are both viable choices, with HAProxy offering more sophisticated load balancing algorithms and health check options.

Each application node runs a standard XWiki installation on Tomcat, configured identically with the same database connection, the same filesystem mount, and the same XWiki configuration files. The nodes are stateless from a content perspective -- all persistent state lives in the database and shared filesystem -- but they maintain local caches for performance.

The shared database stores all wiki content, user data, permissions, and metadata. PostgreSQL is the recommended database for clustered XWiki deployments due to its robust replication capabilities and consistent behavior under concurrent write loads from multiple application nodes.

The shared filesystem stores attachments, extension data, and temporary files that need to be accessible from all nodes. The choice of shared filesystem depends on your infrastructure: NFS is the simplest option for small clusters, GlusterFS provides replicated distributed storage, and Ceph offers the most resilient option with self-healing and automatic rebalancing.

Load Balancing with Nginx and HAProxy

Nginx Configuration

Nginx functions as a reverse proxy and load balancer in front of the XWiki nodes. The upstream block defines the application nodes, and the server block routes traffic to them. Sticky sessions (IP hash or cookie-based) are recommended to ensure that a user's requests consistently reach the same node during a session, which reduces cache misses and provides a more consistent editing experience.

upstream xwiki_cluster {
    ip_hash;
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}

server {
    listen 443 ssl http2;
    server_name wiki.example.com;

    ssl_certificate /etc/letsencrypt/live/wiki.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/wiki.example.com/privkey.pem;

    client_max_body_size 100m;

    location / {
        proxy_pass http://xwiki_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;
    }
}

HAProxy Configuration

HAProxy offers more granular control over load balancing behavior, which matters in larger clusters. Use the balance source directive for session affinity, configure TCP or HTTP health checks against XWiki's REST API endpoint, and set appropriate timeouts for long-running wiki operations like large page renders and export operations.

frontend xwiki_front
    bind *:443 ssl crt /etc/haproxy/certs/wiki.example.com.pem
    mode http
    http-request set-header X-Forwarded-Proto https
    default_backend xwiki_back

backend xwiki_back
    mode http
    balance source
    option httpchk GET /rest/
    http-check expect status 200
    server node1 10.0.1.10:8080 check inter 10s fall 3 rise 2
    server node2 10.0.1.11:8080 check inter 10s fall 3 rise 2
    server node3 10.0.1.12:8080 check inter 10s fall 3 rise 2

The health check (option httpchk) polls each node's REST API endpoint at 10-second intervals. If a node fails three consecutive checks (fall 3), it is removed from the rotation. It must pass two consecutive checks (rise 2) before being reintroduced. This prevents flapping during brief transient issues.

Session Replication and Cache Invalidation with JGroups

The most critical aspect of XWiki clustering is ensuring that all nodes have a consistent view of the wiki's content. When a user edits a page on Node A, users connected to Node B and Node C must see the updated content without delay. XWiki uses JGroups for inter-node communication to handle cache invalidation and event distribution.

Configuring JGroups

JGroups is a toolkit for reliable group communication in Java applications. XWiki uses it to form a cluster group where all nodes can exchange messages. When a page is saved on one node, a cache invalidation message is broadcast to all other nodes via JGroups, causing them to evict the stale page from their local caches. The next request for that page fetches the updated version from the database.

The JGroups configuration is set in XWiki's xwiki.properties file. The key properties are the cluster node identification, the JGroups channel configuration, and the discovery mechanism. For nodes on the same network, UDP multicast discovery is the simplest option. For nodes across different subnets or in cloud environments where multicast is not available, TCP with a static initial membership list or a cloud-native discovery protocol (such as JDBC_PING for database-based discovery) is required.
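As a sketch, enabling the cluster channel in xwiki.properties looks like the following (property names are from XWiki's remote observation module; the channel name refers to a JGroups XML file such as tcp.xml under WEB-INF/observation/remote/jgroups/, which is where the discovery mechanism itself is configured):

```properties
# xwiki.properties -- enable the remote observation manager so that
# events (page saves, deletions, renames) are broadcast to other nodes
observation.remote.enabled = true

# Name of the JGroups channel configuration to use: "tcp" maps to
# WEB-INF/observation/remote/jgroups/tcp.xml (static membership),
# "udp" maps to udp.xml (multicast discovery)
observation.remote.channels = tcp
```

For TCP-based clusters, edit the TCPPING initial_hosts list in the referenced channel file so that every node knows the addresses of its peers.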

Cache Configuration for Clustered Environments

Each XWiki node maintains local caches for rendered pages, user data, permission computations, and extension metadata. In a clustered environment, these caches must be invalidated consistently across all nodes. XWiki's cache implementation supports cluster-aware invalidation through JGroups events. However, cache sizing requires careful tuning -- too small and you lose the performance benefit of caching, too large and JGroups invalidation traffic becomes excessive.

Start with the same cache sizes recommended in our XWiki performance tuning guide and monitor cache hit rates and invalidation traffic. If invalidation messages become a bottleneck, consider increasing the TTL (time-to-live) for caches that contain infrequently changing data and reducing TTL for caches with high churn.
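For reference, the per-node document cache capacity is set in xwiki.cfg. The property name below is the standard one; treat the value as a placeholder to be tuned against observed hit rates:

```properties
# xwiki.cfg -- number of documents held in each node's local cache;
# example value only -- follow the sizing from the performance tuning
# guide and adjust based on hit rates and invalidation traffic
xwiki.store.cache.capacity=1000
```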

Session Management

Sticky sessions at the load balancer level are the recommended approach for session management in XWiki clusters. When a user's requests consistently reach the same node, the session state remains local to that node and does not need to be replicated. If the node fails, the user's session is lost and they must re-authenticate, but this is a minor inconvenience compared to the complexity and overhead of full session replication across all nodes.

For organizations that require session persistence across node failures -- for example, to avoid losing unsaved edits during a failover -- Tomcat's session replication via the DeltaManager or BackupManager can be configured. The DeltaManager replicates all sessions to all nodes (suitable for small clusters of 2-4 nodes), while the BackupManager replicates each session to exactly one backup node (more efficient for larger clusters). Both use a multicast or TCP channel for session data transfer.
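A minimal sketch of DeltaManager-based replication in Tomcat's conf/server.xml (the class names are Tomcat's standard clustering classes; the surrounding Channel, Valve, and Membership elements are left at their defaults here, and each webapp must also be marked `<distributable/>` in its web.xml for sessions to replicate):

```xml
<!-- conf/server.xml, inside the <Engine> element on every node -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
  <!-- DeltaManager replicates every session to every node; suitable
       for small clusters (2-4 nodes). For larger clusters, swap in
       org.apache.catalina.ha.session.BackupManager. -->
  <Manager className="org.apache.catalina.ha.session.DeltaManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true"/>
</Cluster>
```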

Database Clustering with PostgreSQL

The shared database is the most critical component in an XWiki cluster. If the database fails, all application nodes lose access to content, and the entire wiki goes offline regardless of how many application nodes are running. PostgreSQL replication provides the database-level high availability that completes the HA architecture.

Streaming Replication

PostgreSQL's built-in streaming replication creates one or more standby replicas that receive a continuous stream of write-ahead log (WAL) records from the primary server. The standby replicas maintain a near-real-time copy of the database and can be promoted to primary in the event of a failure. Streaming replication is asynchronous by default (the primary does not wait for replicas to confirm writes) but can be configured for synchronous operation where data consistency requirements demand it.

Configure the primary server with wal_level = replica, create a replication user, and add the standby servers to pg_hba.conf. On each standby, use pg_basebackup to create an initial copy of the database and configure the standby to connect to the primary for WAL streaming. Monitor replication lag to ensure that standby servers stay current.
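An illustrative sequence, with placeholder addresses, role names, and data paths (the pg_basebackup flags are standard; -R writes standby.signal and primary_conninfo automatically on PostgreSQL 12 and later):

```shell
# --- on the primary: postgresql.conf ---
#   wal_level = replica
#   max_wal_senders = 10

# --- on the primary: create a replication role, then allow the
# --- standby subnet in pg_hba.conf ---
#   psql:      CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD '...';
#   pg_hba.conf: host replication replicator 10.0.1.0/24 scram-sha-256

# --- on each standby: take a base backup and start streaming ---
pg_basebackup -h 10.0.1.20 -U replicator \
    -D /var/lib/postgresql/16/main \
    -X stream -P -R
```

After starting the standby, check pg_stat_replication on the primary to confirm streaming is active and to monitor lag.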

Automatic Failover

Streaming replication alone does not provide automatic failover -- if the primary server fails, a manual intervention is required to promote a standby. Tools like Patroni, repmgr, and pg_auto_failover automate this process. Patroni, backed by a distributed consensus store (etcd or Consul), is the most robust option for production environments. It monitors the primary, detects failures, promotes the most up-to-date standby, and reconfigures the remaining standbys to follow the new primary. Update XWiki's database connection to use a virtual IP or a connection proxy (like PgBouncer) that automatically routes traffic to the current primary.
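A trimmed patroni.yml sketch for one node, assuming an etcd v3 store at placeholder addresses (the key names follow Patroni's configuration schema; a real deployment also needs the bootstrap and pg_hba sections, and one such file per node with unique name and connect_address values):

```yaml
scope: xwiki-pg            # cluster name, shared by all members
name: pg-node1             # unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.20:8008

etcd3:
  hosts: 10.0.1.30:2379,10.0.1.31:2379,10.0.1.32:2379

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.20:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: change-me    # placeholder
```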

Connection Pooling

Multiple XWiki nodes connecting to a single PostgreSQL server can exhaust the database's connection limit. PgBouncer, a lightweight connection pooler for PostgreSQL, sits between the application nodes and the database, multiplexing many application connections onto a smaller number of database connections. Transaction pooling mode gives the best connection efficiency, but it is incompatible with session-level features such as server-side prepared statements unless your PgBouncer version tracks them (1.21 and later, via max_prepared_statements); if you cannot guarantee that, session pooling mode is the safer default for XWiki's JDBC-based access patterns.
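A minimal pgbouncer.ini sketch, with placeholder addresses and pool sizes:

```ini
; /etc/pgbouncer/pgbouncer.ini
[databases]
xwiki = host=10.0.1.20 port=5432 dbname=xwiki

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction   ; use "session" if prepared statements are an issue
max_client_conn = 300     ; total client connections across all XWiki nodes
default_pool_size = 25    ; server connections per database/user pair
```

Point each node's JDBC URL at port 6432 instead of PostgreSQL directly; the pool sizes above are starting points to tune against actual connection usage.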

Shared Filesystem Options

XWiki stores attachments, extension data, and temporary files on the filesystem. In a clustered deployment, all nodes must access the same filesystem to ensure that an attachment uploaded through one node is accessible from all others.

NFS is the simplest option and suitable for small clusters (2-3 nodes) where the NFS server is reliable and low-latency. Configure NFSv4 with Kerberos authentication for security and use resilient mount options such as hard and noatime (the legacy intr option is still accepted but has been a no-op on Linux kernels since 2.6.25).
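A sketch of the corresponding /etc/fstab entry on each application node (server address and paths are placeholders; adjust the mount point to wherever your XWiki permanent directory lives):

```shell
# /etc/fstab -- shared attachment/extension storage for all nodes
10.0.1.40:/export/xwiki  /var/lib/xwiki/data  nfs4  hard,noatime,_netdev  0  0
```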

GlusterFS provides replicated distributed storage where data is automatically replicated across multiple storage nodes. It eliminates the single point of failure that NFS introduces and provides better read performance through distributed reads. GlusterFS is well-suited for medium-sized clusters.

Ceph is the most resilient option, offering self-healing, automatic rebalancing, and no single point of failure. Ceph's RADOS Block Device (RBD) or CephFS provides the shared filesystem layer, and the CRUSH algorithm ensures that data is distributed across failure domains. MassiveGRID's infrastructure uses Ceph distributed storage, which is one of the reasons the managed hosting platform achieves its high-availability guarantees.

Why High Availability Matters for Enterprise Wikis

The business case for XWiki HA is straightforward. When the wiki contains operational procedures that teams consult during incidents, compliance documentation that auditors access during reviews, onboarding materials that new hires depend on during their first weeks, and customer-facing knowledge bases that support teams reference during every interaction, downtime has a direct and measurable impact on organizational productivity.

A single-server wiki with a 99.9% uptime target allows for approximately 8.7 hours of downtime per year. A properly configured HA cluster with redundant application nodes, database replication, and shared storage can target 99.99% (52 minutes per year) or better. The difference is not academic -- it is the difference between a wiki that is occasionally unavailable during business hours and one that teams can rely on unconditionally.
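The arithmetic behind those downtime budgets is simple enough to check:

```python
# Downtime budget implied by an availability target, per 365-day year.
def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of allowed downtime per year for a given availability %."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

for target in (99.9, 99.99):
    print(f"{target}%: {downtime_minutes_per_year(target):.1f} min/year")
# 99.9% allows ~525.6 minutes (~8.8 hours); 99.99% allows ~52.6 minutes
```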

For organizations with distributed teams across time zones, high availability is even more critical. There is no maintenance window when your team spans New York, London, and Singapore. Every hour is business hours for someone, and wiki downtime anywhere affects the entire organization. For comprehensive data protection alongside HA, see our XWiki backup and disaster recovery guide.

Managed High-Availability XWiki with MassiveGRID

Building and maintaining an XWiki HA cluster requires expertise in load balancing, JGroups clustering, PostgreSQL replication, shared storage, and ongoing monitoring of all components. For organizations that need high availability but prefer to focus their engineering resources on their core business rather than wiki infrastructure, MassiveGRID's managed XWiki hosting provides enterprise-grade HA out of the box.

MassiveGRID deploys XWiki on Proxmox HA clusters with Ceph distributed storage, providing automatic virtual machine failover, self-healing storage, and no single point of failure at any layer of the stack. The managed service includes pre-tuned PostgreSQL with replication, load-balanced application delivery, automated daily backups with tested restore procedures, and 24/7 monitoring with proactive incident response. The 100% uptime SLA backs this architecture with a contractual guarantee.

With data centers in Frankfurt, London, New York, and Singapore, MassiveGRID enables you to deploy highly available XWiki instances close to your users while meeting data residency requirements. Whether you are running a single high-availability instance or a multi-region deployment for a globally distributed team, MassiveGRID handles the infrastructure complexity so your team can focus on building and using the knowledge base.

Written by MassiveGRID -- As an official XWiki hosting partner, MassiveGRID provides managed XWiki hosting on high-availability infrastructure across data centers in Frankfurt, London, New York, and Singapore.