[ClusterLabs] Setting up an Active/Active Pacemaker cluster for a Postfix/Dovecot cluster, using a DRBD backend for the data storage
Raphael DUBOIS-LISKI
raphael.dubois-liski at soget.fr
Wed Dec 6 02:32:23 EST 2023
Hi,
Thank you for your quick response.
I am indeed using a diskless watchdog.
I have already looked into setting up a device-dependent (shared-disk) watchdog, but wouldn't that create a single point of failure in case the common drive becomes unavailable?
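For reference, the kind of device-based setup I was considering looks roughly like this; the device paths are placeholders, and whether several shared LUNs really remove the single point of failure is exactly what I am unsure about:

# /etc/sysconfig/sbd (sketch only, paths are placeholders)
SBD_DEVICE="/dev/disk/by-id/shared-lun-1;/dev/disk/by-id/shared-lun-2;/dev/disk/by-id/shared-lun-3"
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
# each device would be initialised once, e.g.: sbd -d /dev/disk/by-id/shared-lun-1 create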
Raphael DUBOIS-LISKI
Systems and Network Engineer
+33 2 35 19 25 54
SOGET SA • 4, rue des Lamaneurs • 76600 Le Havre, FR
From: Damiano Giuliani <damianogiuliani87 at gmail.com>
Sent: Tuesday, December 5, 2023 18:30
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Setting up an Active/Active Pacemaker cluster for a Postfix/Dovecot cluster, using a DRBD backend for the data storage
Could it be the watchdog? Are you using a diskless watchdog? Two nodes are not supported in diskless mode.
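A quick way to check both points (a sketch, adjust to your setup):

pcs stonith sbd config       # an empty SBD_DEVICE means SBD is running diskless
corosync-quorumtool -s       # votes and quorum state of the two-node cluster
grep -A4 '^quorum' /etc/corosync/corosync.conf   # two_node / wait_for_all settings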
On Tue, Dec 5, 2023, 5:40 PM Raphael DUBOIS-LISKI <raphael.dubois-liski at soget.fr> wrote:
Hello,
I am seeking help with the setup of an Active/Active Pacemaker cluster that relies on DRBD as the data storage backend. The solution runs on two RHEL 9 VMs, and the file system used is GFS2.
Attached is a PDF of the infrastructure I am currently experimenting with.
For context, this is my Pacemaker cluster config:
Cluster Name: mycluster
Corosync Nodes:
Node1 Node2
Pacemaker Nodes:
Node1 Node2
Resources:
Clone: Data-clone
Meta Attributes: Data-clone-meta_attributes
clone-max=2
clone-node-max=1
notify=true
promotable=true
promoted-max=2
promoted-node-max=1
Resource: Data (class=ocf provider=linbit type=drbd)
Attributes: Data-instance_attributes
drbd_resource=drbd0
Operations:
demote: Data-demote-interval-0s
interval=0s timeout=90
monitor: Data-monitor-interval-60s
interval=60s
notify: Data-notify-interval-0s
interval=0s timeout=90
promote: Data-promote-interval-0s
interval=0s timeout=90
reload: Data-reload-interval-0s
interval=0s timeout=30
start: Data-start-interval-0s
interval=0s timeout=240
stop: Data-stop-interval-0s
interval=0s timeout=100
Clone: dlm-clone
Meta Attributes: dlm-clone-meta_attributes
clone-max=2
clone-node-max=1
Resource: dlm (class=ocf provider=pacemaker type=controld)
Operations:
monitor: dlm-monitor-interval-60s
interval=60s
start: dlm-start-interval-0s
interval=0s timeout=90s
stop: dlm-stop-interval-0s
interval=0s timeout=100s
Clone: FS-clone
Resource: FS (class=ocf provider=heartbeat type=Filesystem)
Attributes: FS-instance_attributes
device=/dev/drbd0
directory=/home/vusers
fstype=gfs2
Operations:
monitor: FS-monitor-interval-20s
interval=20s timeout=40s
start: FS-start-interval-0s
interval=0s timeout=60s
stop: FS-stop-interval-0s
interval=0s timeout=60s
Clone: smtp_postfix-clone
Meta Attributes: smtp_postfix-clone-meta_attributes
clone-max=2
clone-node-max=1
Resource: smtp_postfix (class=ocf provider=heartbeat type=postfix)
Operations:
monitor: smtp_postfix-monitor-interval-60s
interval=60s timeout=20s
reload: smtp_postfix-reload-interval-0s
interval=0s timeout=20s
start: smtp_postfix-start-interval-0s
interval=0s timeout=20s
stop: smtp_postfix-stop-interval-0s
interval=0s timeout=20s
Clone: WebSite-clone
Resource: WebSite (class=ocf provider=heartbeat type=apache)
Attributes: WebSite-instance_attributes
configfile=/etc/httpd/conf/httpd.conf
statusurl=http://localhost/server-status
Operations:
monitor: WebSite-monitor-interval-1min
interval=1min
start: WebSite-start-interval-0s
interval=0s timeout=40s
stop: WebSite-stop-interval-0s
interval=0s timeout=60s
Colocation Constraints:
resource 'FS-clone' with Promoted resource 'Data-clone' (id: colocation-FS-Data-clone-INFINITY)
score=INFINITY
resource 'WebSite-clone' with resource 'FS-clone' (id: colocation-WebSite-FS-INFINITY)
score=INFINITY
resource 'FS-clone' with resource 'dlm-clone' (id: colocation-FS-dlm-clone-INFINITY)
score=INFINITY
resource 'FS-clone' with resource 'smtp_postfix-clone' (id: colocation-FS-clone-smtp_postfix-clone-INFINITY)
score=INFINITY
Order Constraints:
promote resource 'Data-clone' then start resource 'FS-clone' (id: order-Data-clone-FS-mandatory)
start resource 'FS-clone' then start resource 'WebSite-clone' (id: order-FS-WebSite-mandatory)
start resource 'dlm-clone' then start resource 'FS-clone' (id: order-dlm-clone-FS-mandatory)
start resource 'FS-clone' then start resource 'smtp_postfix-clone' (id: order-FS-clone-smtp_postfix-clone-mandatory)
Resources Defaults:
Meta Attrs: build-resource-defaults
resource-stickiness=1 (id: build-resource-stickiness)
Operations Defaults:
Meta Attrs: op_defaults-meta_attributes
timeout=240s (id: op_defaults-meta_attributes-timeout)
Cluster Properties: cib-bootstrap-options
cluster-infrastructure=corosync
cluster-name=mycluster
dc-version=2.1.6-9.el9-6fdc9deea29
have-watchdog=true
last-lrm-refresh=1701787695
no-quorum-policy=ignore
stonith-enabled=true
stonith-watchdog-timeout=10
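If it is easier to read, the main resources and constraints above were created with pcs commands roughly like the following (reconstructed from the config, so treat it as a sketch rather than my exact shell history; the WebSite and smtp_postfix clones were added the same way):

pcs resource create Data ocf:linbit:drbd drbd_resource=drbd0 op monitor interval=60s promotable promoted-max=2 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource create dlm ocf:pacemaker:controld op monitor interval=60s clone clone-max=2 clone-node-max=1
pcs resource create FS ocf:heartbeat:Filesystem device=/dev/drbd0 directory=/home/vusers fstype=gfs2 clone
pcs constraint order start dlm-clone then start FS-clone
pcs constraint order promote Data-clone then start FS-clone
pcs constraint colocation add FS-clone with Promoted Data-clone INFINITY
pcs constraint colocation add FS-clone with dlm-clone INFINITY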
And this is my DRBD configuration:
global {
usage-count no;
}
common {
disk {
resync-rate 100M;
al-extents 257;
}
}
resource drbd0 {
protocol C;
handlers {
pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt";
fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
split-brain "/usr/lib/drbd/notify-split-brain.sh root";
out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
}
startup {
wfc-timeout 1;
degr-wfc-timeout 1;
become-primary-on both;
}
net {
# The following lines define the automatic split-brain recovery
# policies (i.e., what to do after the nodes lose contact and later reconnect)
after-sb-0pri discard-zero-changes; # If neither node was primary, keep the data of the host that made changes and overwrite the one that made none
after-sb-1pri discard-secondary; # If one node was primary and one secondary, trust the primary node and discard the secondary's changes
after-sb-2pri disconnect;
allow-two-primaries yes;
verify-alg sha1;
}
disk {
on-io-error detach;
}
options {
auto-promote yes;
}
on fradevtestmail1 {
device /dev/drbd0;
disk /dev/rootvg/drbdlv;
address X.X.X.X:7788;
flexible-meta-disk internal;
}
on fradevtestmail2 {
device /dev/drbd0;
disk /dev/rootvg/drbdlv;
address X.X.X.X:7788;
flexible-meta-disk internal;
}
}
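For completeness, the DRBD resource itself was brought up and formatted in the usual way before handing it over to Pacemaker (a sketch; the GFS2 lock table name is a placeholder):

drbdadm create-md drbd0          # on both nodes
drbdadm up drbd0                 # on both nodes
drbdadm primary --force drbd0    # on one node only, to start the initial sync
drbdadm status drbd0             # wait until both sides report UpToDate
mkfs.gfs2 -p lock_dlm -t mycluster:vusers -j 2 /dev/drbd0   # cluster name must match, two journals for two nodes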
Knowing all this:
The cluster works exactly as expected when both nodes are up, but a problem arises when I put the cluster into a degraded state by killing one of the nodes improperly (to simulate an unexpected crash).
This causes the remaining node to reboot and restart the cluster. The resource start process goes well until it is time to mount the file system, where it times out and fails.
Would you have any idea why this behaviour happens, and how I could fix it so that the cluster remains usable with only one node, until we can get the second node back up and running after an unexpected crash?
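In case it helps, this is roughly what I look at on the surviving node when the mount times out (command list only, output omitted):

drbdadm status drbd0        # is the local disk UpToDate and the peer marked Outdated/Unknown?
pcs status --full           # which resources failed, and whether a fence action is pending
pcs constraint config       # looking for a drbd-fence-by-handler-* location constraint left by crm-fence-peer.9.sh
journalctl -b -u pacemaker  # Filesystem and controld messages around the mount timeout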
Many thanks for your help,
Have a nice day,
BR,
Raphael DUBOIS-LISKI
Systems and Network Engineer
+33 2 35 19 25 54
SOGET SA • 4, rue des Lamaneurs • 76600 Le Havre, FR