[ClusterLabs] Issue with DRBD + a systemd resource

Wed Dec 13 14:53:41 EST 2017

Hello,

It's my first post on this mailing list, so please excuse any rookie 
mistakes I may make in this thread.

We currently have clusters deployed using corosync/pacemaker that manage 
DRBD + a couple of systemd services.

My colleague Derek previously emailed the list about it but has left the 
company since then:
http://lists.clusterlabs.org/pipermail/users/2017-November/006796.html

I'm hoping to continue his work in order to fix it once and for all.

I looked into the Q&A that was done in that thread and have managed to 
track it down to the following:
- If I reboot the server that is running as the primary (DRBD + systemd 
resources started), then there is a split-brain once it finishes rebooting
- If I stop pacemaker (systemctl stop pacemaker), then reboot that 
primary server, then it comes back online without any issues and no 
split-brain
- If I reboot the server that doesn't have the running resources, all 
goes well
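
For reference, the sequence from the second case can be wrapped in a 
small shell helper. This is only a sketch of the manual workaround, not 
a fix; the function name safe_reboot and the DRY_RUN variable are made up 
for this example.

```shell
#!/bin/sh
# safe_reboot: hypothetical helper (not part of pacemaker or systemd).
# It stops pacemaker explicitly, then reboots, mirroring the manual
# sequence that did not produce a split-brain.
safe_reboot() {
    # With DRY_RUN set to a non-empty value, only print the commands.
    run=${DRY_RUN:+echo}
    # Let pacemaker stop its resources while systemd is still fully up.
    $run systemctl stop pacemaker || return 1
    # Reboot only if pacemaker stopped cleanly.
    $run systemctl reboot
}
```

Running "DRY_RUN=1 safe_reboot" prints the two commands without 
executing them.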

Following those observations, my guess is that the way the pacemaker 
services are stopped during a systemd shutdown is causing the issue.
It seems that pacemaker isn't stopping the systemd resources in that 
case, and therefore isn't unmounting the DRBD partition and demoting it 
to secondary before DRBD is stopped, which results in the split-brain.

Here is the interesting bit I found in the logs:
Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error: Could not connect to 
System DBus: Did not receive a reply. Possible causes include: the 
remote application did not send a reply, the message bus security policy 
blocked the reply, the reply timeout expired, or the network connection 
was broken.
Dec 13 14:09:40 act-pass-2 lrmd[1133]:    error: systemd_unit_exec: 
Triggered fatal assert at systemd.c:730 : systemd_init()
Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: Managed process 
1133 (lrmd) dumped core
Dec 13 14:09:40 act-pass-2 pacemakerd[1083]:    error: The lrmd process 
(1133) terminated with signal 6 (core=1)
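
To pull only lines like these out of a longer journal dump, a small 
filter along the following lines could be used. It is a sketch: 
pcmk_errors is a made-up name, and the pattern only assumes the 
"daemon[pid]:    error:" layout seen in the lines above.

```shell
#!/bin/sh
# pcmk_errors: hypothetical filter, reads journal text on stdin and keeps
# only error lines logged by lrmd or pacemakerd.
pcmk_errors() {
    grep -E '(lrmd|pacemakerd)\[[0-9]+\]: +error:'
}

# Example use on the node (assumes the journal is kept across reboots):
#   journalctl -b -1 | pcmk_errors
```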

Here is a pastebin of the full journald output during the shutdown:
https://pastebin.com/CB38BiwC

I'm not sure where to go from here. It may be a dependency on another 
systemd resource, but it seems more like a problem connecting to systemd 
itself to stop the cluster's systemd resources (that's a wild guess), 
because systemd isn't accepting commands while it is shutting down. At 
this point, this goes beyond my knowledge of systemd, so I'd like some 
guidance on any required adjustments or further troubleshooting.
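
One check that might help narrow down the dependency question: systemd 
stops units in the reverse of their start ordering, so it may be worth 
seeing whether pacemaker.service is ordered after the D-Bus service. A 
minimal sketch follows; after_contains is a made-up helper, and 
dbus.service is an assumed unit name that may differ per distribution.

```shell
#!/bin/sh
# after_contains: hypothetical helper. Reads the "After=..." line printed
# by `systemctl show -p After UNIT` on stdin and succeeds if the unit
# named in $1 appears in that list.
after_contains() {
    # Put every unit name on its own line, then look for an exact match.
    tr ' =' '\n\n' | grep -qxF "$1"
}

# Example use on the node:
#   systemctl show -p After pacemaker.service | after_contains dbus.service \
#       || echo "pacemaker.service is not ordered after dbus.service"
```

If the check fails, that would only suggest an ordering gap to 
investigate; it does not by itself confirm the cause of the lrmd crash.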

Best Regards,

-- 
Julien Semaan
jsemaan at inverse.ca   ::  +1 (866) 353-6153 *155  ::www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)
