[Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.

Thu Aug 25 16:22:38 EDT 2011

Hello again,

Replying to my own message with a "for the archives" post, my issue with
services being started concurrently after a node reboot came down to the
fact that I'm using the VirtualDomain RA, but by default CentOS 6.0 and
Scientific Linux 6.1 (and presumably RHEL6 as well) start libvirtd as one of
the very last services, after pacemaker has already been fired up.  The
VirtualDomain RA does some initial monitoring checks when pacemaker starts,
but gets a "connection refused" error since libvirtd isn't running yet.
This appears to be what leaves the empty state files in
/var/run/heartbeat/rsctmp/VirtualDomain-<name>.state, and also what triggers
the starting of the guest VMs even though they're actually running on the
other node.
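
For anyone who wants to check the boot ordering on their own box, here is a
rough sketch that reads the SysV start priority (the NN in SNNname) from a
runlevel directory. The function names start_priority and starts_before are
made up for this example, and /etc/rc3.d as the default runlevel directory is
an assumption; adjust for your runlevel.

```shell
#!/bin/sh
# Hypothetical helpers (not part of pacemaker or libvirt).

# Print the two-digit start priority of service $1 in runlevel dir $2.
# Defaults to /etc/rc3.d, which is an assumption about the runlevel in use.
start_priority() {
    svc=$1
    rcdir=${2:-/etc/rc3.d}
    for link in "$rcdir"/S[0-9][0-9]"$svc"; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$link" ] || [ -L "$link" ] || continue
        name=${link##*/}        # strip the directory part
        prio=${name#S}          # strip the leading "S"
        echo "${prio%"$svc"}"   # strip the service name, leaving NN
        return 0
    done
    return 1
}

# Return 0 if service $1 has a lower start priority number than service $2
# (i.e. starts earlier), 1 if not, 2 if either one has no start link.
starts_before() {
    a=$(start_priority "$1" "$3") || return 2
    b=$(start_priority "$2" "$3") || return 2
    [ "$a" -lt "$b" ]
}
```

Something like `starts_before libvirtd pacemaker || echo "libvirtd is not up first"`
would then flag the ordering described above.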

So, you go from a good state:
--------------------------------------------------------------------------------------------------
============
Last updated: Thu Aug 25 14:35:49 2011
Stack: cman
Current DC: kvm2 - partition WITHOUT quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ kvm2 ]
OFFLINE: [ kvm1 ]

fw1     (ocf::heartbeat:VirtualDomain): Started kvm2
 Clone Set: shared_volgrp
     Started: [ kvm2 ]
     Stopped: [ vgClusterDisk:0 ]
apache1 (ocf::heartbeat:VirtualDomain): Started kvm2
--------------------------------------------------------------------------------------------------

To a not-good state for about a minute:
--------------------------------------------------------------------------------------------------
============
Last updated: Thu Aug 25 14:38:06 2011
Stack: cman
Current DC: kvm2 - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ kvm1 kvm2 ]

fw1     (ocf::heartbeat:VirtualDomain) Started [        kvm1    kvm2 ]
 Clone Set: shared_volgrp
     Started: [ kvm1 kvm2 ]
apache1 (ocf::heartbeat:VirtualDomain) Started [        kvm1    kvm2 ]

Failed actions:
    fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error
    fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error
    apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error
    apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error
--------------------------------------------------------------------------------------------------

And then it finally settles on this state, where it thinks the VMs are
running on the freshly booted node and unmanaged, but they're actually dead
everywhere:
--------------------------------------------------------------------------------------------------
============
Last updated: Thu Aug 25 14:38:13 2011
Stack: cman
Current DC: kvm2 - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ kvm1 kvm2 ]

fw1     (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED
 Clone Set: shared_volgrp
     Started: [ kvm1 kvm2 ]
apache1 (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED

Failed actions:
    fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error
    fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error
    apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error
    apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error
--------------------------------------------------------------------------------------------------

At this point, starting the VMs is impossible regardless of any attempts you
make with 'resource cleanup' or 'resource manage'.  You have to manually
echo the name of the domain into its state file, then do a cleanup, and
everything will start.
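
As a rough sketch of that manual recovery: the helper below writes the domain
name into an empty state file and leaves a non-empty one alone. The name
restore_state_file is made up for this example, and it assumes the <name> in
the state file path is the pacemaker resource name, which in my setup is the
same as the libvirt domain name (fw1, apache1). Check that on your own system
before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the VirtualDomain RA).
#   $1 = resource name used in the state file name (assumption, see above)
#   $2 = libvirt domain name to write (defaults to $1)
#   $3 = state directory (defaults to the path mentioned in this post)
restore_state_file() {
    rsc=$1
    domain=${2:-$1}
    dir=${3:-/var/run/heartbeat/rsctmp}
    state="$dir/VirtualDomain-$rsc.state"

    # Only touch the file if it is missing or empty.
    if [ -s "$state" ]; then
        echo "$state is not empty, leaving it alone" >&2
        return 0
    fi
    echo "$domain" > "$state"
}
```

Usage would be along the lines of `restore_state_file fw1 && crm resource cleanup fw1`,
repeated for each affected resource.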

So, I disabled pacemaker from starting via the normal route and just added a
line to /etc/rc.local that starts it, since that's the absolute last thing
done at boot.  I didn't want to mess with the chkconfig settings in the init
script and get bitten by this down the line somewhere after an update that
replaced the init script.  Now libvirtd is there for pacemaker and things
behave as expected, at least after three reboot tests in a row.
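
For reference, a sketch of what that can look like. The wait_for helper is
not something I actually used (I only added a plain start line); it is an
optional guard made up for this example, and using `virsh list` as the
"is libvirtd answering yet" check is an assumption on my part.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it succeeds
# or the attempt budget ($1) runs out. Output of the command is discarded.
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        "$@" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# One-time setup, run by hand as root, so init no longer starts pacemaker:
#   chkconfig pacemaker off
#
# Lines for the end of /etc/rc.local (the 30-attempt budget is arbitrary):
#   wait_for 30 virsh list && service pacemaker start
```

Keeping the change in rc.local rather than in the init script header means a
package update that replaces the init script won't silently undo it.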

I suppose it may also work to make libvirtd a pacemaker resource, with an
order constraint so it's started before any VMs are ever probed/started.
That'd take away easy/painless restarts of libvirtd, though.  I'll have to
do some further digging to see what makes the most sense.

Anyhow, sorry for the noise on the list, but I always hate it when someone
posts a problem, then either disappears forever or replies back to the list
with, "Nevermind, fixed it!" and no explanation.

Regards,
Mark

--- 8< -- snipped everything else, this is too long as it is ---