[ClusterLabs] Behavior of corosync kill

Rohit Saini rohitsaini111.forum at gmail.com
Tue Aug 25 10:45:47 EDT 2020


Thanks Ken. Let me check resource-stickiness property at my end.

Regards,
Rohit

On Tue, Aug 25, 2020 at 8:07 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2020-08-25 at 12:28 +0530, Rohit Saini wrote:
> > Hi All,
> > I am seeing the following behavior. Can someone clarify whether this is
> > intended behavior? If yes, then why? Please let me know if logs
> > are needed for better clarity.
> >
> > 1. Without Stonith:
> > Continuously killing corosync on the master causes a switchover and
> > makes another node the master. But as soon as corosync recovers on that
> > node, it becomes master again. Shouldn't it now become the slave?
>
> Where resources are active, and where they take on the master role,
> depends on the cluster configuration, not on past node issues.
>
> You may be interested in the resource-stickiness property:
>
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes
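>
> As a minimal sketch (not from this thread; the resource name below is
> made up), with the pcs tool shipped on CentOS 7 you could set stickiness
> cluster-wide or per resource, and then check the resulting placement
> scores:
>
>     pcs resource defaults resource-stickiness=100                   # cluster-wide default
>     pcs resource meta my-ms-resource resource-stickiness=INFINITY   # hypothetical master/slave resource
>     crm_simulate -s -L      # show allocation scores against the live cluster
>
> With a positive stickiness, the promoted instance will generally stay
> where it failed over to, unless master scores from the resource agent
> or location constraints outweigh it.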
>
>
> > 2. With Stonith:
> > Sometimes, on corosync kill, that node gets shot by stonith, but
> > sometimes not. I am not able to understand this fluctuating behavior.
> > Does it have anything to do with faster recovery of corosync, which
> > stonith fails to detect?
>
> Stonith isn't failing to detect the kill; the cluster is recovering
> satisfactorily without fencing.
>
> At any given time, one of the cluster nodes is elected the designated
> controller (DC). When new events occur, such as a node leaving the
> corosync ring unexpectedly, the DC runs pacemaker's scheduler to see
> what needs to be done about it. In the case of a lost node, it will
> also erase the node's resource history, to indicate that the state of
> resources on the node is no longer accurately known.
>
> If no further events happened during that time, the scheduler would
> schedule fencing, and the cluster would carry it out.
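>
> If you want to confirm whether fencing was actually executed for a
> given kill, one option (a sketch, assuming pacemaker 1.1's
> stonith_admin) is to check the fencing history:
>
>     stonith_admin --history node1    # 'node1' is a placeholder; use '*' for all nodes
>
> If nothing is recorded around the time of the test, the fence action
> was never carried out for that event.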
>
> However, systemd monitors corosync and will restart it if it dies. If
> systemd respawns corosync fast enough (often in under a second), the
> node rejoins the cluster before the scheduler completes its
> calculations and fencing is initiated. Rejoining the cluster includes
> re-syncing the node's resource history with the other nodes.
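>
> To check how corosync is supervised on your nodes, and (on a test
> cluster only) to stop systemd from respawning it so the fencing path
> becomes deterministic, a sketch using standard systemd tooling:
>
>     systemctl cat corosync      # shows the unit file and drop-ins, incl. any Restart= setting
>     systemctl edit corosync     # add a drop-in such as:
>                                 #   [Service]
>                                 #   Restart=no
>
> With restarts disabled, a killed corosync stays down long enough for
> the scheduler run to finish, so the node should be fenced every time.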
>
> The node join is considered new information, so the previous scheduler
> run is cancelled (the "transition" is "aborted") and a new one is
> started. Since the node is now happily part of the cluster, and its
> resource history tells us the state of all resources on the node, no
> fencing is needed.
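>
> One way to observe this race on a test node (hypothetical commands;
> adjust unit and node names to your environment):
>
>     date; pkill -9 corosync        # simulate the failure and note the time
>     journalctl -u corosync -n 20   # see how quickly systemd respawned it
>     crm_mon -1                     # confirm the node rejoined without being fenced
>
> If the respawn ever takes longer than the scheduler run, you should see
> the node fenced instead of quietly rejoining, which matches the
> fluctuating behavior you describe.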
>
>
> > I am using
> > corosync-2.4.5-4.el7.x86_64
> > pacemaker-1.1.19-8.el7.x86_64
> > centos 7.6.1810
> >
> > Thanks,
> > Rohit
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

