[ClusterLabs] why is node fenced ?

Mon Aug 12 19:19:54 EDT 2019

On Mon, 2019-08-12 at 18:09 +0200, Lentes, Bernd wrote:
> Hi,
> 
> Last Friday (9th of August) I had to install patches on my two-node
> cluster.
> I put one of the nodes (ha-idg-2) into standby (crm node standby
> ha-idg-2), patched it, rebooted,
> started the cluster (systemctl start pacemaker) again, put the node
> online again, and everything was fine.
> 
> Then I wanted to do the same procedure with the other node
> (ha-idg-1).
> I put it in standby, patched it, rebooted, started pacemaker again.
> But then ha-idg-1 fenced ha-idg-2, saying the node was unclean.
> I know that unclean nodes need to be shut down, that's
> logical.
> 
> But I don't know where the conclusion that the node is unclean
> comes from, or why it is unclean.
> I searched the logs and didn't find any hint.

The key messages are:

Aug 09 17:43:27 [6326] ha-idg-1       crmd:     info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (20000ms)
Aug 09 17:43:27 [6326] ha-idg-1       crmd:  warning: do_log:   Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped

That indicates the newly rebooted node didn't hear from the other node
within 20s, and so assumed it was dead.

The new node had quorum, but never saw the other node's corosync, so
I'm guessing you have two_node and/or wait_for_all disabled in
corosync.conf, and/or you have no-quorum-policy=ignore in pacemaker.

I'd recommend two_node: 1 in corosync.conf, with no explicit
wait_for_all or no-quorum-policy setting. That would ensure a
rebooted/restarted node doesn't get initial quorum until it has seen
the other node.
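
As a rough sketch, the quorum section of corosync.conf would then look
something like the following. This assumes the votequorum provider is
in use and that the totem and nodelist sections stay as they are in
your existing file; adapt it rather than copying it verbatim.

```
quorum {
    provider: corosync_votequorum
    # Two-node mode: the cluster stays quorate when one node is lost.
    two_node: 1
    # wait_for_all is deliberately not set here; two_node: 1 enables
    # it by default, so a freshly started node has to see its peer
    # once before it becomes quorate.
}
```

On the pacemaker side, the matching change would be to remove any
explicit no-quorum-policy setting from the cluster properties so the
default applies.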

> I put the syslog and the pacemaker log on a seafile share; I'd be
> very thankful if you would have a look.
> https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/
> 
> Here the cli history of the commands:
> 
> 17:03:04  crm node standby ha-idg-2
> 17:07:15  zypper up (install Updates on ha-idg-2)
> 17:17:30  systemctl reboot
> 17:25:21  systemctl start pacemaker.service
> 17:25:47  crm node online ha-idg-2
> 17:26:35  crm node standby ha-idg-1
> 17:30:21  zypper up (install Updates on ha-idg-1)
> 17:37:32  systemctl reboot
> 17:43:04  systemctl start pacemaker.service
> 17:44:00  ha-idg-2 is fenced
> 
> Thanks.
> 
> Bernd
> 
> OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>