[Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

Jake Smith jsmith at argotec.com
Mon Aug 27 14:47:25 EDT 2012


----- Original Message -----
> From: "Andrew Martin" <amartin at xes-inc.com>
> To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, August 27, 2012 1:01:54 PM
> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
> 
> Jake,
> 
> 
> Attached is the log from the same period for node2. If I am reading
> this correctly, it looks like there was a 7 second difference
> between when node1 set its score to 1000 and when node2 set its
> score to 1000?

I agree, and (I think) more importantly, this is what caused the issue - to the best of my knowledge, not necessarily fact ;-)

At 10:40:43 node1 updates its pingd to 1000, causing the policy engine to recalculate node preference.
At 10:40:44 transition 760 is initiated to move everything to the more preferred node2, because its pingd value is 2000.
At 10:40:50 node2's pingd value drops to 1000. The policy engine doesn't stop or change the in-process transition - node1 and node2 are equal now, but the transition is in progress and node1 isn't more preferred, so it continues.
At 10:41:02 ping is back on node1 and ready to update pingd to 2000.
At 10:41:07, after the dampen, node1 updates pingd to 2000, which is greater than node2's value.
At 10:41:08 the cluster recognizes a change in pingd value that requires a recalculation of node preference and aborts the in-process transition (760). I believe the cluster then waits for all in-flight actions to complete, so that it is in a known state to recalculate.
At 10:42:10, I'm guessing, the shutdown timeout is reached without completing, so the VirtualDomain is forcibly shut down.
Once all of that is done, transition 760 finishes stopping/aborting, with some actions completed and some not:

Aug 22 10:42:13 node1 crmd: [4403]: notice: run_graph: Transition 760 (Complete=20, Pending=0, Fired=0, Skipped=39, Incomplete=30, Source=/var/lib/pengine/pe-input-2952.bz2): Stopped
Then the cluster recalculates the node preference and restarts those services that are stopped on node1: the pingd scores between node1 and node2 are equal, so there is a preference to stay on node1, where some services are still active (drbd or such, I'm guessing, is still running on node1).
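
If you want to see exactly how those scores came out, you should be able to replay the pe-input file from the run_graph line above with crm_simulate (flags here are from my memory of the tool - check crm_simulate --help on your build):

# Replay the input the policy engine used for transition 760 and
# show the allocation scores it computed for each resource/node:
crm_simulate -S -s -x /var/lib/pengine/pe-input-2952.bz2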


> Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked:
> attrd_updater -n p_ping -v 1000 -d 5s

Before this is the ping failure:

Aug 22 10:40:31 node1 ping[1668]: [1823]: WARNING: 192.168.0.128 is inactive: PING 192.168.0.128 (192.168.0.128) 56(84) bytes of data.#012#012--- 192.168.0.128 ping statistics ---#0128 packets transmitted, 0 received, 100% packet loss, time 7055ms

Then you get the 7-second delay to do the 8 attempts, I believe, and then the 5-second dampen (-d 5s) brings us to:
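
(For reference, the ping RA is essentially wrapping the system ping - something roughly like the line below per host on every monitor, which is where the ~7 seconds for 8 attempts comes from. The agent's exact options may differ, and the -W value here is assumed from the agent's timeout parameter:

# Approximately what ocf:pacemaker:ping runs for each host in
# host_list - 8 back-to-back attempts matches the "8 packets
# transmitted ... time 7055ms" in the log above:
ping -n -q -W 2 -c 8 192.168.0.128
)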

> Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: p_ping (1000)
> Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update:
> Sent update 265: p_ping=1000
> 

Same thing on node2 - it fails at 10:40:38, and then 7 seconds later:

> Aug 22 10:40:45 node2 attrd_updater: [27245]: info: Invoked:
> attrd_updater -n p_ping -v 1000 -d 5s

Then the 5s dampen:

> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: p_ping (1000)
> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_perform_update:
> Sent update 122: p_ping=1000
> 
> I had changed the attempts value to 8 (from the default 2) to address
> this same issue - to avoid resource migration based on brief
> connectivity problems with these IPs - however if we can get dampen
> configured correctly I'll set it back to the default.
> 

Well, after looking through both more closely, I'm not sure dampen is what you'll need to fix the deeper problem.  The time between failure and return was 10:40:31 to 10:41:02, or 32 seconds (31 on node2).  I believe that if you had a dampen value greater than the monitor interval plus the time failed, then nothing would have happened (dampen > 10 + 32).  However, I'm not sure I would call 32 seconds a blip in connectivity - that's up to you.  And since the dampen applies to all of the ping clones equally, with a ping failure longer than your dampen value you would still have the same problem.  For example, assuming a dampen of 45 seconds:
Node1 fails at 1:01, node2 fails at 1:08.
Node1 will still update its pingd value at 1:52 - 7 seconds before node2 will - and the transition will still happen, even though in reality both nodes have the same connectivity.
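
Laid out as a timeline (same hypothetical times as above, dampen=45s):

1:01  node1's pings start failing
1:08  node2's pings start failing (7 seconds behind)
1:52  node1's dampen expires, pingd=1000 is written, and the policy
      engine recalculates and starts moving resources
1:59  node2's dampen would expire - 7 seconds too late, the
      transition is already running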

I guess what I'm saying, in the end, is that dampen is there to prevent movement for a momentary outage/blip in the pings, the idea being that the pings will return before the dampen expires.  It isn't going to wait out the dampen on the other node(s) before making a decision.  You would need to be able to add something like a sleep 10s in there AFTER the pingd value is updated but BEFORE the node preference scoring is evaluated!

So in the end I don't have a fix for you, except maybe to set dampen in the 45-60 second range if you expect ~30-second outages that you want to ride out without moving to be commonplace in your setup.  However, that would extend the wait before failover in the case of a complete failure of pings on one node only.
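
Something like this, i.e. your existing primitive with just the dampen raised and attempts back at the default (untested, adjust to taste):

primitive p_ping ocf:pacemaker:ping \
params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
dampen="60s" multiplier="1000" attempts="2" debug="true" \
op start interval="0" timeout="60" \
op monitor interval="10s" timeout="60"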
 
:-(

Jake

> 
> Thanks,
> 
> 
> Andrew
> 
> ----- Original Message -----
> 
> From: "Jake Smith" <jsmith at argotec.com>
> To: "The Pacemaker cluster resource manager"
> <pacemaker at oss.clusterlabs.org>
> Sent: Monday, August 27, 2012 9:39:30 AM
> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
> resources to restart?
> 
> 
> ----- Original Message -----
> > From: "Andrew Martin" <amartin at xes-inc.com>
> > To: "The Pacemaker cluster resource manager"
> > <pacemaker at oss.clusterlabs.org>
> > Sent: Thursday, August 23, 2012 7:36:26 PM
> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
> > resources to restart?
> >
> > Hi Florian,
> >
> >
> > Thanks for the suggestion. I gave it a try, but even with a dampen
> > value greater than 2* the monitoring interval the same behavior
> > occurred (pacemaker restarted the resources on the same node). Here
> > are my current ocf:pacemaker:ping settings:
> >
> > primitive p_ping ocf:pacemaker:ping \
> > params name="p_ping" host_list="192.168.0.128 192.168.0.129"
> > dampen="25s" multiplier="1000" attempts="8" debug="true" \
> > op start interval="0" timeout="60" \
> > op monitor interval="10s" timeout="60"
> >
> >
> > Any other ideas on what is causing this behavior? My understanding
> > is
> > the above config tells the cluster to attempt 8 pings to each of
> > the
> > IPs, and will assume that an IP is down if none of the 8 come back.
> > Thus, an IP would have to be down for more than 8 seconds to be
> > considered down. The dampen parameter tells the cluster to wait
> > before making any decision, so that if the IP comes back online
> > within the dampen period then no action is taken. Is this correct?
> >
> >
> 
> I'm no expert on this either, but I believe the dampen isn't long
> enough - I think what you say above is correct, but not only does
> the IP need to come back online, the cluster must also attempt to
> ping it successfully. I would suggest trying a dampen greater than
> 3* the monitor value.
> 
> I don't think it's a problem but why change the attempts from the
> default 2 to 8?
> 
> > Thanks,
> >
> >
> > Andrew
> >
> >
> > ----- Original Message -----
> >
> > From: "Florian Crouzat" <gentoo at floriancrouzat.net>
> > To: pacemaker at oss.clusterlabs.org
> > Sent: Thursday, August 23, 2012 3:57:02 AM
> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
> > resources to restart?
> >
> > On 22/08/2012 at 18:23, Andrew Martin wrote:
> > > Hello,
> > >
> > >
> > > I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and
> > > 1
> > > quorum node that cannot run resources) running on Ubuntu 12.04
> > > Server amd64. This cluster has a DRBD resource that it mounts and
> > > then runs a KVM virtual machine from. I have configured the
> > > cluster to use ocf:pacemaker:ping with two other devices on the
> > > network (192.168.0.128, 192.168.0.129), and set constraints to
> > > move the resources to the most well-connected node (whichever
> > > node
> > > can see more of these two devices):
> > >
> > > primitive p_ping ocf:pacemaker:ping \
> > > params name="p_ping" host_list="192.168.0.128 192.168.0.129"
> > > multiplier="1000" attempts="8" debug="true" \
> > > op start interval="0" timeout="60" \
> > > op monitor interval="10s" timeout="60"
> > > ...
> > >
> > > clone cl_ping p_ping \
> > > meta interleave="true"
> > >
> > > ...
> > > location loc_run_on_most_connected g_vm \
> > > rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping
> > >
> > >
> > > Today, 192.168.0.128's network cable was unplugged for a few
> > > seconds and then plugged back in. During this time, pacemaker
> > > recognized that it could not ping 192.168.0.128 and restarted all
> > > of the resources, but left them on the same node. My
> > > understanding
> > > was that since neither node could ping 192.168.0.128 during this
> > > period, pacemaker would do nothing with the resources (leave them
> > > running). It would only migrate or restart the resources if for
> > > example node2 could ping 192.168.0.128 but node1 could not (move
> > > the resources to where things are better-connected). Is this
> > > understanding incorrect? If so, is there a way I can change my
> > > configuration so that it will only restart/migrate resources if
> > > one node is found to be better connected?
> > >
> > > Can you tell me why these resources were restarted? I have
> > > attached
> > > the syslog as well as my full CIB configuration.
> > >
> 
> As was said already, the log shows node1 changed its value for pingd
> to 1000, waited the 5 seconds of dampening, and then started actions
> to move the resources. In the midst of stopping everything, ping ran
> again successfully and the value increased back to 2000. This caused
> the policy engine to recalculate scores for all resources (before
> they had the chance to start on node2). I'm no scoring expert, but I
> know there is additional value given to keeping resources collocated
> with their partners that are already running, plus resource
> stickiness to not move. So in this situation the score to stay/run
> on node1, once pingd was back at 2000, was greater than the score to
> move, so things that were stopped or stopping restarted on node1.
> So increasing the dampen value should help/fix this.
> 
> Unfortunately you didn't include the log from node2, so we can't
> correlate node2's pingd values with node1's at the same times.
> I believe if you look at the pingd values and the times that
> movement started between the nodes, you will be able to make a
> better guess at how high a dampen value would ensure the nodes had
> the same pingd value *before* the dampen time ran out, and that
> should prevent movement.
> 
> HTH
> 
> Jake
> 
> > > Thanks,
> > >
> > > Andrew Martin
> > >
> >
> > This is an interesting question and I'm also interested in answers.
> >
> > I had the same observations, and there is also the case where the
> > monitor() calls aren't synced across all nodes, so: "Node 1 issues
> > a monitor() on the ping resource and finds the ping node dead;
> > node2 hasn't pinged yet, so node1 moves things to node2, but node2
> > now issues a monitor() and also finds the ping node dead."
> >
> > The only solution I found was to adjust the dampen parameter to at
> > least 2*monitor interval, so that I can be *sure* all nodes have
> > issued a monitor() and have all decreased their scores, so that
> > when a decision occurs, nothing moves.
> >
> > It's been a long time since I last tested this; my cluster is
> > very, very stable. I guess I should retry to validate that it's
> > still a working trick.
> >
> > ====
> >
> > dampen (integer, [5s]): Dampening interval
> > The time to wait (dampening) for further changes to occur
> >
> > Eg:
> >
> > primitive ping-nq-sw-swsec ocf:pacemaker:ping \
> > params host_list="192.168.10.1 192.168.2.11 192.168.2.12"
> > dampen="35s" attempts="2" timeout="2" multiplier="100" \
> > op monitor interval="15s"
> >
> >
> >
> >
> > --
> > Cheers,
> > Florian Crouzat
> >
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



