[Pacemaker] Long failover

Wed Jan 7 22:41:09 EST 2015

I need to see logs from both nodes that relate to the same instance of the issue.

Why are the dates so crazy?
One is from a year ago and the other is in the (at the time) future.
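
For comparison, the crmd lines quoted further down in this thread can be lined up with a few lines of Python. This is only a rough sketch: the constant and function names are made up for illustration, and the offset calculation assumes the slave's process_lrm_event line is the result of the action the master initiated, which the logs suggest but do not prove.

```python
from datetime import datetime

# Timestamps copied from the crmd log lines quoted below.
MASTER_INITIATED = "2013-12-21T07:04:12.230827+04:00"  # te_rsc_command on master
MASTER_TIMED_OUT = "2013-12-21T07:05:14.232029+04:00"  # print_synapse error on master
SLAVE_RESULT = "2013-12-21T05:45:09+04:00"             # process_lrm_event on slave


def parse(stamp):
    """Parse an ISO 8601 timestamp with a UTC offset (Python 3.7+)."""
    return datetime.fromisoformat(stamp)


# Both stamps come from the master's clock, so this difference is reliable.
elapsed = parse(MASTER_TIMED_OUT) - parse(MASTER_INITIATED)

# Assumption: the slave line answers the action the master initiated,
# so both lines were written at roughly the same real time.
apparent_offset = parse(MASTER_INITIATED) - parse(SLAVE_RESULT)

print("wait on master before timeout:", elapsed.total_seconds(), "s")
print("apparent master/slave clock offset:", apparent_offset)
```

On those lines the master waited about 62 seconds before declaring the timeout, and the two clocks appear to be roughly 1 hour 19 minutes apart.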

> On 2 Dec 2014, at 7:04 pm, Dmitry Matveichev <d.matveichev at mfisoft.ru> wrote:
> 
> Hello,
> Any thoughts about this issue? It still affects our cluster. 
> 
> ------------------------
> Kind regards,
> Dmitriy Matveichev. 
> 
> 
> -----Original Message-----
> From: Dmitry Matveichev 
> Sent: Monday, November 17, 2014 12:32 PM
> To: The Pacemaker cluster resource manager
> Subject: RE: [Pacemaker] Long failover
> 
> Hello,
> 
> Debug logs from the slave are attached. I hope they help.
> 
> ------------------------
> Kind regards,
> Dmitriy Matveichev. 
> 
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net] 
> Sent: Monday, November 17, 2014 10:48 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Long failover
> 
> 
>> On 17 Nov 2014, at 6:17 pm, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>> 
>> On Mon, Nov 17, 2014 at 9:34 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>> 
>>>> On 14 Nov 2014, at 10:57 pm, Dmitry Matveichev <d.matveichev at mfisoft.ru> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> We have a cluster configured via pacemaker+corosync+crm. The configuration is:
>>>> 
>>>> node master
>>>> node slave
>>>> primitive HA-VIP1 IPaddr2 \
>>>>       params ip=192.168.22.71 nic=bond0 \
>>>>       op monitor interval=1s
>>>> primitive HA-variator lsb:variator \
>>>>       op monitor interval=1s \
>>>>       meta migration-threshold=1 failure-timeout=1s
>>>> group HA-Group HA-VIP1 HA-variator
>>>> property cib-bootstrap-options: \
>>>>       dc-version=1.1.10-14.el6-368c726 \
>>>>       cluster-infrastructure="classic openais (with plugin)" \
>>> 
>>> General advice, don't use the plugin. See:
>>> 
>>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
>>> http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/
>>> 
>>>>       expected-quorum-votes=2 \
>>>>       stonith-enabled=false \
>>>>       no-quorum-policy=ignore \
>>>>       last-lrm-refresh=1383871087
>>>> rsc_defaults rsc-options: \
>>>>       resource-stickiness=100
>>>> 
>>>> First I take the variator service down on the master node (I actually delete the service binary and kill the variator process, so variator fails to restart). The resources move to the slave node very quickly, as expected. Then I restore the binary on the master and restart the variator service. Next I do the same thing with the binary and the service on the slave node. The crm status command quickly shows HA-variator (lsb:variator): Stopped, but it takes too long (for us) before the resources are switched to the master node (around 1 min).
>>> 
>>> I see what you mean:
>>> 
>>> 2013-12-21T07:04:12.230827+04:00 master crmd[14267]:   notice: te_rsc_command: Initiating action 2: monitor HA-variator_monitor_1000 on slave.mfisoft.ru
>>> 2013-12-21T05:45:09+04:00 slave crmd[7086]:   notice: process_lrm_event: slave.mfisoft.ru-HA-variator_monitor_1000:106 [ variator.x is stopped\n ]
>>> 
>>> (1 minute goes by)
>>> 
>>> 2013-12-21T07:05:14.232029+04:00 master crmd[14267]:    error: print_synapse: [Action    2]: In-flight rsc op HA-variator_monitor_1000 on slave.mfisoft.ru (priority: 0, waiting: none)
>>> 2013-12-21T07:05:14.232102+04:00 master crmd[14267]:  warning: cib_action_update: rsc_op 2: HA-variator_monitor_1000 on slave.mfisoft.ru timed out
>>> 
>> 
>> Is it possible that pacemaker is confused by time difference on master 
>> and slave?
> 
> Timeouts are all calculated locally. So it shouldn't be an issue (aside from trying to read the logs)
> 
>> 
>>> Is there a corosync log file configured?  That would have more detail on slave.
>>> 
>>>> Then the line
>>>> Failed actions:
>>>>   HA-variator_monitor_1000 on slave 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Sat Dec 21 03:59:45 2013', queued=0ms, exec=0ms
>>>> appears in the crm status output and the resources are switched.
>>>> 
>>>> What is that timeout? Where can I change it?
>>>> 
>>>> ------------------------
>>>> Kind regards,
>>>> Dmitriy Matveichev.
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org