[Pacemaker] reboot of non-vm host results in VM restart -- of chickens and eggs and VMs

Bob Haxo bhaxo at sgi.com
Wed Jan 8 00:14:37 EST 2014


Hi Andrew,

With the configuration fumble, err, test, that brought about this "of
chickens and eggs and VMs" request, the situation is that rebooting the
non-VM-host server restarts the VM running on the host server.

From an earlier [Pacemaker] thread:


> From: Tom Fernandes <anyaddress at gmx.net>
> Subject: [Pacemaker] chicken-egg-problem with libvirtd and a VM within
> cluster
> Date: Thu, 11 Oct 2012 18:09:30 +0200 (09:09 PDT)
> ...
> I observed that when I stop and start corosync on one of the nodes,
> pacemaker (when starting corosync again) wants to check the status of
> the vm before starting libvirtd. This check fails as libvirtd needs to
> be running for this check. After trying for 20s libvirtd starts. The
> vm gets restarted after those 20s and then runs on one of the nodes. I
> am left with a monitoring-error to cleanup and my vm has rebooted.


And the same issue raised by myself earlier:


> From: Bob Haxo <bhaxo at sgi.com>
> Subject: [Pacemaker] GFS2 with Pacemaker on RHEL6.3 restarts with
> reboot
> Date: Wed, 8 Aug 2012 19:14:31 -0700
> ...
> 
> Problem: When the non-VM-host is rebooted, then when Pacemaker
> restarts, the gfs2 filesystem gets restarted on the VM host, which
> causes the stop and start of the VirtualDomain. The gfs2 filesystem
> also gets restarted even without the VirtualDomain resource included.


The cluster configured for the "chicken and egg and VMs" test is no
longer available, although the output of "crm configure show" may have
been saved.

Regarding the "chicken and egg and VMs" question, I now avoid the
issue ... somehow, and have moved on to new issues.

Please see the thread: [Pacemaker] "stonith_admin -F node" results in a
pair of reboots.  In particular the Tue, 7 Jan 2014 09:21:54 +0100
response from Fabio Di Nitto. 

The information from Fabio was very helpful. I currently seem to have
arrived at a RHEL 6.5 HA virtual server solution: no "chicken and egg
and VMs" problem, no "fencing of both servers when only one was
explicitly fenced", and no "clvmd startup timed out" resulting in
"clvmd:pid blocked for more than 120 seconds", but instead a working VM,
working live migration, and a correct response to a manual fence
command. Tomorrow I will add the results of today's work to that thread.
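For the archives: the usual way to express the intended startup order is
to chain the filesystem, libvirtd, and the VM with ordering and
colocation constraints. The following crm shell fragment is only a
sketch using the resource names from the configuration quoted below, not
the configuration actually deployed here; note also that constraints
alone do not suppress the initial probe, which is why the probe
behaviour Andrew raises below still matters.

```shell
# Hypothetical crm shell constraints -- a sketch of the commonly
# recommended ordering, not this cluster's actual configuration.
crm configure <<'EOF'
# Start the shared image filesystem, then libvirtd, then the VM.
order o_fs_before_libvirtd inf: p_fs_images p_libvirtd
order o_libvirtd_before_virt inf: p_libvirtd virt
# Keep the VM on a node where libvirtd and the filesystem are active.
colocation c_libvirtd_with_fs inf: p_libvirtd p_fs_images
colocation c_virt_with_libvirtd inf: virt p_libvirtd
EOF
```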

Regards,
Bob Haxo

On Wed, 2014-01-08 at 10:32 +1100, Andrew Beekhof wrote:

> On 20 Dec 2013, at 5:30 am, Bob Haxo <bhaxo at sgi.com> wrote:
> 
> > Hello,
> > 
> > Earlier emails related to this topic:
> > [pacemaker] chicken-egg-problem with libvirtd and a VM within cluster
> > [pacemaker] VirtualDomain problem after reboot of one node
> > 
> > 
> > My configuration:
> > 
> > RHEL6.5/CMAN/gfs2/Pacemaker/crmsh
> > 
> > pacemaker-libs-1.1.10-14.el6_5.1.x86_64
> > pacemaker-cli-1.1.10-14.el6_5.1.x86_64
> > pacemaker-1.1.10-14.el6_5.1.x86_64
> > pacemaker-cluster-libs-1.1.10-14.el6_5.1.x86_64
> > 
> > Two node HA VM cluster using real shared drive, not drbd.
> > 
> > Resources (relevant to this discussion):
> > primitive p_fs_images ocf:heartbeat:Filesystem \
> > primitive p_libvirtd lsb:libvirtd \
> > primitive virt ocf:heartbeat:VirtualDomain \
> > 
> > services chkconfig on: cman, clvmd, pacemaker
> > services chkconfig off: corosync, gfs2, libvirtd
> > 
> > Observation:
> > 
> 
> > Rebooting the NON-host system results in the restart of the VM
> > merrily running on the host system.
> 
> I'm still bootstrapping after the break, but I'm not following this.
> Can you rephrase? 
> 
> 
> > 
> > Apparent cause:
> > 
> 
> > Upon startup, Pacemaker apparently checks the status of configured
> > resources. However, the status request for the virt
> > (ocf:heartbeat:VirtualDomain) resource fails with:
> 
> > 
> > Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: child_timeout_callback:        virt_monitor_0 process (PID 4158) timed out
> > Dec 18 12:19:30 [4147] mici-admin2       lrmd:  warning: operation_finished:    virt_monitor_0:4158 - timed out after 200000ms
> > Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to reconnect to the hypervisor ]
> > Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: no valid connection ]
> > Dec 18 12:19:30 [4147] mici-admin2       lrmd:   notice: operation_finished:    virt_monitor_0:4158:stderr [ error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory ]
> 
> Sounds like the agent should perhaps be returning OCF_NOT_RUNNING in this case.
> 
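To illustrate the point above: a probe guard along these lines
(hypothetical sketch, not the shipped VirtualDomain agent) would report
OCF_NOT_RUNNING when the libvirtd socket is absent, instead of letting
virsh hang until the monitor operation times out.

```shell
# Sketch of a monitor guard, not the shipped VirtualDomain agent.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
LIBVIRT_SOCK="/var/run/libvirt/libvirt-sock"

virtualdomain_monitor() {
    # Without libvirtd's socket, virsh cannot connect at all, so the
    # domain is certainly not running under this node's libvirtd:
    # report "not running" instead of hanging until the probe times out.
    if [ ! -S "$LIBVIRT_SOCK" ]; then
        return "$OCF_NOT_RUNNING"
    fi
    # ... the normal "virsh domstate" check would follow here ...
    return "$OCF_SUCCESS"
}
```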
> 
> > 
> > 
> > This failure then snowballs into an "orphan" situation in which the
> > running VM is restarted.
> > 
> > There was the suggestion of "chkconfig libvirtd on" (and presumably
> > deleting the libvirtd resource) so that /var/run/libvirt/libvirt-sock
> > is created by the libvirtd service. With libvirtd started by the
> > system, there is no un-needed reboot of the VM.
> > 
> > However, it may be that removing libvirtd from Pacemaker control
> > leaves the VM vdisk filesystem susceptible to corruption during a
> > reboot-induced failover.
> > 
> > Question:
> > 
> > Is there an accepted Pacemaker configuration such that the un-needed
> > restart of the VM does not occur when the non-host system is
> > rebooted?
> 
> > 
> > Regards,
> > Bob Haxo
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 