[Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource?

Wed Jun 20 21:40:37 UTC 2012

Hi Lars, 

Thank you for clarifying this for me. I have increased the number of ping attempts to 8 so that the ping test will only fail if a host is unavailable for a longer amount of time: 

primitive p_ping ocf:pacemaker:ping \ 
params name="p_ping" host_list="192.168.1.25 192.168.1.26" multiplier="1000" attempts="8" debug="true" \ 
op start interval="0" timeout="60" \ 
op monitor interval="10s" timeout="60" 

Thus, a network hiccup that makes a host unreachable for less than roughly 8 seconds should not trigger a failover, but a longer outage will. 
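
To illustrate the effect of attempts="8", here is a minimal Python sketch. It assumes that a host counts as reachable as soon as any one of its ping attempts is answered, and that the p_ping value is the multiplier times the number of reachable hosts. The function names and the sample reply patterns are made up for illustration; they are not part of the ping resource agent.

```python
def host_reachable(replies):
    """replies holds one boolean per ping attempt; True means a reply arrived.

    Assumption: a single answered attempt is enough to call the host reachable.
    """
    return any(replies)


def ping_score(replies_per_host, multiplier=1000):
    """Assumed scoring: multiplier times the number of reachable hosts."""
    reachable = sum(1 for replies in replies_per_host.values()
                    if host_reachable(replies))
    return multiplier * reachable


# Short hiccup: 192.168.1.26 drops the first 5 of 8 pings, then answers again.
hiccup = {
    "192.168.1.25": [True] * 8,
    "192.168.1.26": [False] * 5 + [True] * 3,
}

# Longer outage: 192.168.1.26 answers none of the 8 pings.
outage = {
    "192.168.1.25": [True] * 8,
    "192.168.1.26": [False] * 8,
}

print(ping_score(hiccup))  # 2000, both hosts still count as reachable
print(ping_score(outage))  # 1000, only one host counts as reachable
```

Under these assumptions the hiccup leaves p_ping at 2000, while the full outage drops it to 1000.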

Thanks again, 

Andrew 
----- Original Message -----

From: "Lars Ellenberg" <lars.ellenberg at linbit.com> 
To: pacemaker at oss.clusterlabs.org 
Sent: Tuesday, June 19, 2012 5:33:46 PM 
Subject: Re: [Pacemaker] Why Did Pacemaker Restart this VirtualDomain Resource? 

On Tue, Jun 19, 2012 at 09:38:50AM -0500, Andrew Martin wrote: 
> Hello, 
> 
> 
> I have a 3 node Pacemaker+Heartbeat cluster (two real nodes and one "standby" quorum node) with Ubuntu 10.04 LTS on the nodes and using the Pacemaker+Heartbeat packages from the Ubuntu HA Team PPA ( https://launchpad.net/~ubuntu-ha-maintainers/+archive/ppa ). I have configured 3 DRBD resources, a filesystem mount, and a KVM-based virtual machine (using the VirtualDomain resource). I have constraints in place so that the DRBD devices must become primary and the filesystem must be mounted before the VM can start: 

> location loc_run_on_most_connected g_vm \ 
> rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 

This is the rule 

> This has been working well, however last week Pacemaker all of a 
> sudden stopped the p_vm_myvm resource and then started it up again. I 
> have attached the relevant section of /var/log/daemon.log - I am 
> unable to determine what caused Pacemaker to restart this resource. 
> Based on the log, could you tell me what event triggered this? 
> 
> 
> Thanks, 
> 
> 
> Andrew 

> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: rsc:p_sysadmin_notify:0 monitor[18] (pid 3661) 
> Jun 14 15:25:00 vmhost1 lrmd: [3853]: info: operation monitor[18] on p_sysadmin_notify:0 for client 3856: pid 3661 exited with return code 0 
> Jun 14 15:26:42 vmhost1 cib: [3852]: info: cib_stats: Processed 219 operations (182.00us average, 0% utilization) in the last 10min 
> Jun 14 15:32:43 vmhost1 lrmd: [3853]: info: operation monitor[22] on p_ping:0 for client 3856: pid 10059 exited with return code 0 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_vmstore:0 monitor[55] (pid 12323) 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount2:0 monitor[53] (pid 12324) 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[55] on p_drbd_vmstore:0 for client 3856: pid 12323 exited with return code 8 
> Jun 14 15:35:27 vmhost1 lrmd: [3853]: info: operation monitor[53] on p_drbd_mount2:0 for client 3856: pid 12324 exited with return code 8 
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: rsc:p_drbd_mount1:0 monitor[54] (pid 12396) 
> Jun 14 15:35:31 vmhost1 lrmd: [3853]: info: operation monitor[54] on p_drbd_mount1:0 for client 3856: pid 12396 exited with return code 8 
> Jun 14 15:36:42 vmhost1 cib: [3852]: info: cib_stats: Processed 220 operations (272.00us average, 0% utilization) in the last 10min 
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm monitor[57] (pid 14061) 
> Jun 14 15:37:34 vmhost1 lrmd: [3853]: info: operation monitor[57] on p_vm_myvm for client 3856: pid 14061 exited with return code 0 

> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000) 
> Jun 14 15:42:35 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 163: p_ping=1000 

And here the score on the location constraint changes for this node. 

You asked for "run on most connected", and your pingd resource 
determined that "the other" one was "better" connected. 

> Jun 14 15:42:36 vmhost1 crmd: [3856]: info: do_lrm_rsc_op: Performing key=136:2351:0:7f6d66f7-cfe5-4820-8289-0e47d8c9102b op=p_vm_myvm_stop_0 ) 
> Jun 14 15:42:36 vmhost1 lrmd: [3853]: info: rsc:p_vm_myvm stop[58] (pid 18174) 

... 

> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (2000) 
> Jun 14 15:43:32 vmhost1 attrd: [3855]: notice: attrd_perform_update: Sent update 165: p_ping=2000 

And there it is back on 2000 again ... 
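
As a rough illustration of why the drop from 2000 to 1000 moved the VM, the following Python sketch models the location rule "p_ping: defined p_ping" as giving each node a score equal to its own p_ping value. It is a simplification that ignores resource stickiness and all other constraints. The node name "vmhost2" and the function names are hypothetical; only vmhost1 and the values 1000 and 2000 appear in the log above.

```python
def rule_score(p_ping):
    """Score from the rule 'p_ping: defined p_ping' for one node.

    Simplified model: the attribute value is the score when the attribute
    is defined; the rule contributes nothing (0) when it is not.
    """
    return p_ping if p_ping is not None else 0


def preferred_nodes(p_ping_by_node):
    """Return the node(s) with the highest rule score, sorted by name."""
    scores = {node: rule_score(v) for node, v in p_ping_by_node.items()}
    best = max(scores.values())
    return sorted(node for node, s in scores.items() if s == best)


# "vmhost2" is a placeholder for the other real node, which the log does not name.
during_drop = {"vmhost1": 1000, "vmhost2": 2000}
after_recovery = {"vmhost1": 2000, "vmhost2": 2000}

print(preferred_nodes(during_drop))     # ['vmhost2']
print(preferred_nodes(after_recovery))  # ['vmhost1', 'vmhost2']
```

In this model the other node is strictly preferred while vmhost1 sits at 1000, and the two nodes tie again once vmhost1 is back at 2000.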

Lars 

-- 
: Lars Ellenberg 
: LINBIT | Your Way to High Availability 
: DRBD/HA support and consulting http://www.linbit.com 

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. 

_______________________________________________ 
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker 

Project Home: http://www.clusterlabs.org 
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
Bugs: http://bugs.clusterlabs.org 
