<div>Hello again,</div><div><br></div>Replying to my own message with a "for the archives" post: my issue with services being started concurrently after a node reboot came down to the fact that I'm using the VirtualDomain RA, while by default CentOS 6.0 and Scientific Linux 6.1 (and presumably RHEL6 as well) start libvirtd as one of the very last services, after pacemaker has already been fired up. The VirtualDomain RA does some initial monitoring probes when pacemaker starts, but gets a "connection refused" error since libvirtd isn't running yet. That appears to be what leaves the empty state files in /var/run/heartbeat/rsctmp/VirtualDomain-<name>.state, and also what triggers starting the guest VMs even though they're actually running on the other node.<div>
<br></div><div>So, you go from a good state:</div><div>--------------------------------------------------------------------------------------------------</div>============<br>Last updated: Thu Aug 25 14:35:49 2011<br>Stack: cman<br>
Current DC: kvm2 - partition WITHOUT quorum<br>Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe<br>2 Nodes configured, unknown expected votes<br>3 Resources configured.<br>============<br><br>Online: [ kvm2 ]<br>OFFLINE: [ kvm1 ]<br>
<br>fw1 (ocf::heartbeat:VirtualDomain): Started kvm2<br> Clone Set: shared_volgrp<br> Started: [ kvm2 ]<br> Stopped: [ vgClusterDisk:0 ]<br><div><div>apache1 (ocf::heartbeat:VirtualDomain): Started kvm2</div></div>
<div><div>--------------------------------------------------------------------------------------------------</div><div><br></div><div><br></div><div><br></div><div>To a not-good state for about a minute:</div><div>--------------------------------------------------------------------------------------------------</div>
<div><div>============</div><div>Last updated: Thu Aug 25 14:38:06 2011</div><div>Stack: cman</div><div>Current DC: kvm2 - partition with quorum</div><div>Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe</div><div>
2 Nodes configured, unknown expected votes</div><div>3 Resources configured.</div><div>============</div><div><br></div><div>Online: [ kvm1 kvm2 ]</div><div><br></div><div>fw1 (ocf::heartbeat:VirtualDomain) Started [ kvm1 kvm2 ]</div>
<div> Clone Set: shared_volgrp</div><div> Started: [ kvm1 kvm2 ]</div><div>apache1 (ocf::heartbeat:VirtualDomain) Started [ kvm1 kvm2 ]</div><div><br></div><div>Failed actions:</div><div> fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error</div>
<div> fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error</div><div> apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error</div><div> apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error</div>
</div><div>--------------------------------------------------------------------------------------------------</div><div><br></div><div><br></div><div>And then it finally settles on this state, where it thinks the VMs are running unmanaged on the freshly booted node, but they're actually dead everywhere:</div>
<div>--------------------------------------------------------------------------------------------------</div><div><div>============</div><div>Last updated: Thu Aug 25 14:38:13 2011</div><div>Stack: cman</div><div>Current DC: kvm2 - partition with quorum</div>
<div>Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe</div><div>2 Nodes configured, unknown expected votes</div><div>3 Resources configured.</div><div>============</div><div><br></div><div>Online: [ kvm1 kvm2 ]</div>
<div><br></div><div>fw1 (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED</div><div> Clone Set: shared_volgrp</div><div> Started: [ kvm1 kvm2 ]</div><div>apache1 (ocf::heartbeat:VirtualDomain): Started kvm1 (unmanaged) FAILED</div>
<div><br></div><div>Failed actions:</div><div> fw1_monitor_0 (node=kvm1, call=2, rc=1, status=complete): unknown error</div><div> fw1_stop_0 (node=kvm1, call=5, rc=1, status=complete): unknown error</div><div> apache1_monitor_0 (node=kvm1, call=4, rc=1, status=complete): unknown error</div>
<div> apache1_stop_0 (node=kvm1, call=6, rc=1, status=complete): unknown error</div></div><div>--------------------------------------------------------------------------------------------------</div><div><br></div><div>
<br></div><div>At this point, starting the VMs is impossible no matter what you attempt with 'resource cleanup' or 'resource manage'. You have to manually echo the name of the domain into its state file, then do a cleanup, and everything will start.</div>
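<div>For the record, the manual recovery looks roughly like this (a sketch, not a polished procedure; "fw1" stands in for whichever domain is stuck, and it has to be run on the affected node against the live cluster):</div>

```shell
# Recovery sketch for a stuck VirtualDomain resource. "fw1" is an example
# domain name from my setup; substitute your own.
domain=fw1
statefile="/var/run/heartbeat/rsctmp/VirtualDomain-$domain.state"

# The RA's state file must contain the domain name; after the failed
# probes it exists but is empty, so put the name back:
echo "$domain" > "$statefile"

# Then clear the failure history so pacemaker re-probes and starts the VM:
crm resource cleanup "$domain"
```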
<div><br></div><div>So, I disabled pacemaker from starting via the normal route and just added a line to /etc/rc.local that starts it, since that's the absolute last thing done at boot. I didn't want to mess with the chkconfig settings in the init script and get bitten by this down the line somewhere after an update that replaced the init script. Now libvirtd is there for pacemaker and things behave as expected, at least after three reboot tests in a row.</div>
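<div>Concretely, the workaround amounts to this (sketch; the chkconfig/service names are the stock ones from the RHEL6-family init scripts):</div>

```shell
# Take pacemaker out of the normal runlevel ordering...
chkconfig pacemaker off

# ...and start it from /etc/rc.local instead, which runs dead last at
# boot, well after libvirtd is up:
echo "service pacemaker start" >> /etc/rc.local
```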
<div><br></div><div>I suppose it may also work to make libvirtd a pacemaker resource, with an order constraint so it's started before any VMs are ever probed/started. That'd take away easy/painless restarts of libvirtd, though. I'll have to do some further digging to see what makes the most sense.</div>
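<div>If I do go the cluster-resource route, I imagine the configuration would look something like this (an untested sketch; it assumes the distribution's lsb:libvirtd init script, and the p_/cl_/o_ resource and constraint names are just placeholders):</div>

```shell
# Manage libvirtd as a clone so it runs on every node:
crm configure primitive p_libvirtd lsb:libvirtd op monitor interval="30s"
crm configure clone cl_libvirtd p_libvirtd

# Make sure libvirtd is up before the VMs are ever probed/started:
crm configure order o_libvirtd_before_fw1 inf: cl_libvirtd fw1
crm configure order o_libvirtd_before_apache1 inf: cl_libvirtd apache1
```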
<div><br></div><div>Anyhow, sorry for the noise on the list, but I always hate it when someone posts a problem, then either disappears forever or replies back to the list with, "Nevermind, fixed it!" and no explanation.</div>
<div><br></div><div>Regards,</div><div>Mark</div><div><br></div><div>--- 8< -- snipped everything else, this is too long as it is ---</div></div>