[Pacemaker] Failed-over incomplete

Thu Dec 4 04:23:01 EST 2014

On Thu, Dec 4, 2014 at 9:52 AM, Teerapatr Kittiratanachai
<maillist.tk at gmail.com> wrote:
> Dear Andrei,
>
> Since the failover did not complete, not all of the resources were
> failed over to the other node.
>
> I think this happened because res.vBKN went into the unmanaged state.
>

There is no resource named res.vBKN in the logs or in the configuration
snippet you have shown.
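
One way to cross-check resource names is sketched below. It is only a rough illustration: the configuration text is the snippet from the original message pasted in as a string, and the helper name `defined_resources` is made up for this example (it is not part of Pacemaker or crmsh). It collects the IDs of `primitive` definitions and the members listed on `group` lines, then looks for the name in question.

```python
# Configuration snippet as posted in the original message (crm shell syntax).
# A raw string keeps the trailing backslashes exactly as they appear.
CONFIG = r'''
primitive res.vBKN6 IPv6addr \
        params ipv6addr="2001:db8:0:f::61a" cidr_netmask=64 nic=eth0 \
        op monitor interval=10s

primitive res.vDMZ6 IPv6addr \
        params ipv6addr="2001:db8:0:9::61a" cidr_netmask=64 nic=eth1 \
        op monitor interval=10s

group gr.mainService res.vDMZ4 res.vDMZ6 res.vBKN4 res.vBKN6 res.http res.ftp
'''


def defined_resources(config):
    """Hypothetical helper: return (primitive IDs, group member IDs).

    Only lines that start with 'primitive' or 'group' are inspected;
    continuation lines (params, op, ...) are ignored.
    """
    primitives = set()
    members = set()
    for line in config.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == "primitive" and len(words) > 1:
            primitives.add(words[1])
        elif words[0] == "group":
            # words[1] is the group ID; the rest are its members.
            members.update(w for w in words[2:] if w != "\\")
    return primitives, members


primitives, members = defined_resources(CONFIG)
known = primitives | members

print("res.vBKN" in known)                                   # False
print(sorted(n for n in known if n.startswith("res.vBKN")))  # ['res.vBKN4', 'res.vBKN6']
```

With the posted snippet, the exact name res.vBKN is not found; only res.vBKN4 and res.vBKN6 appear.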

> But why? No configuration was changed.
>
> --teenigma
>
> On Thu, Dec 4, 2014 at 1:41 PM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>> On Thu, Dec 4, 2014 at 4:56 AM, Teerapatr Kittiratanachai
>> <maillist.tk at gmail.com> wrote:
>>> Dear List,
>>>
>>> We are using Pacemaker and Corosync with CMAN as our HA software, in
>>> the versions below.
>>>
>>>     OS:              CentOS release 6.5 (Final) 64-bit
>>>     Pacemaker:       pacemaker.x86_64         1.1.10-14.el6_5.3
>>>     Corosync:        corosync.x86_64          1.4.1-17.el6_5.1
>>>     CMAN:            cman.x86_64              3.0.12.1-59.el6_5.2
>>>     Resource-Agent:  resource-agents.x86_64   3.9.5-3.12
>>>
>>>     Topology:        2 nodes, Active/Standby model (MySQL is
>>>                      Active/Active via a clone)
>>>
>>> All packages are installed from the official CentOS repository, except
>>> the Resource-Agent package, which is installed from the openSUSE
>>> repository
>>> (http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/).
>>>
>>> The system worked normally for a few months until yesterday morning,
>>> around 03:35 UTC+0700, when we found that one of the resources had
>>> gone into the UNMANAGED state without any configuration change. After
>>> another resource failed, Pacemaker tried to fail over the resources to
>>> the other node, but the failover did not complete once it reached this
>>> resource.
>>>
>>> The configuration of some of the resources is below, and the log from
>>> the event is in the attached file.
>>>
>>
>> The log only covers the resource monitor failure and the stopping of
>> resources. It does not contain any event related to starting resources
>> on another node.
>>
>> You would need to collect a crm_report with a start time before the
>> resource failed and a stop time after the resources were started on
>> the other node.
>>
>>> primitive res.vBKN6 IPv6addr \
>>>         params ipv6addr="2001:db8:0:f::61a" cidr_netmask=64 nic=eth0 \
>>>         op monitor interval=10s
>>>
>>> primitive res.vDMZ6 IPv6addr \
>>>         params ipv6addr="2001:db8:0:9::61a" cidr_netmask=64 nic=eth1 \
>>>         op monitor interval=10s
>>>
>>> group gr.mainService res.vDMZ4 res.vDMZ6 res.vBKN4 res.vBKN6 res.http res.ftp
>>>
>>> rsc_defaults rsc_defaults-options: \
>>>         migration-threshold=1
>>>
>>> Please help me to solve this problem.
>>>
>>> --teenigma
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>