[ClusterLabs] Corosync quorum vs. pacemaker quorum confusion
    Andrei Borzenkov 
    arvidjaar at gmail.com
       
    Sun Dec  3 06:03:17 EST 2017
    
    
  
I assumed that with corosync 2.x, quorum is maintained by corosync and
pacemaker simply gets a yes/no answer. Apparently it is more complicated
than that. Here is a trivial test on a two-node cluster (two_node is
intentionally not set, to simulate "normal" majority-quorum behavior).
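For reference, the relevant quorum section of corosync.conf would look
roughly like this (a sketch -- the actual file was not included here,
and the key point is only that two_node is absent):

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    # deliberately NO "two_node: 1" line, so ordinary
    # majority quorum (2 of 2 votes) applies
}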
ha1:~ # crm configure show
node 1084752129: ha1
node 1084752130: ha2
primitive stonith-sbd stonith:external/sbd \
	params pcmk_delay_max=30s
property cib-bootstrap-options: \
	have-watchdog=true \
	dc-version=1.1.17-3.3-36d2962a8 \
	cluster-infrastructure=corosync \
	cluster-name=hacluster \
	stonith-enabled=true \
	placement-strategy=balanced \
	stonith-timeout=172 \
	no-quorum-policy=suicide
rsc_defaults rsc-options: \
	resource-stickiness=1 \
	migration-threshold=3
op_defaults op-options: \
	timeout=600 \
	record-pending=true
I boot one node.
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec  3 13:44:55 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          No
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:
Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition WITHOUT quorum
Last updated: Sun Dec  3 13:48:46 2017
Last change: Sun Dec  3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
 stonith-sbd	(stonith:external/sbd):	Stopped
Migration Summary:
* Node ha1:
So far that's expected: we are out of quorum, so nothing happens. The
first surprise, though, was this message (which confirmed past
empirical observations):
Dec 03 13:44:57 [1632] ha1    pengine:   notice: stage6:	Cannot fence
unclean nodes until quorum is attained (or no-quorum-policy is set to
ignore)
I assume this is intentional behavior, in which case it would be really
good to have it mentioned in the documentation as well. So far I have not
seen a comprehensive explanation of pacemaker's startup logic (what it
decides to do, and when).
OK, let's pretend we have quorum.
ha1:~ # corosync-cmapctl -s quorum.expected_votes u32 1
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec  3 13:52:19 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          Yes
Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate
Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
So corosync apparently believes we are in quorum now. What does pacemaker do?
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Sun Dec  3 13:53:22 2017
Last change: Sun Dec  3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
 stonith-sbd	(stonith:external/sbd):	Stopped
Migration Summary:
* Node ha1:
Nothing really changed. Even though crm_mon quite clearly reports that
we are in quorum, pacemaker still won't start any resources or attempt
to fence the other node, although the logs say
Dec 03 13:52:07 [1633] ha1       crmd:   notice:
pcmk_quorum_notification:	Quorum acquired | membership=240 members=1
Dec 03 13:52:07 [1626] ha1 pacemakerd:   notice:
pcmk_quorum_notification:	Quorum acquired | membership=240 members=1
There is still *no* attempt to do anything.
This may be related to an earlier message:
Dec 03 13:44:57 [1629] ha1 stonith-ng:   notice: unpack_config:
Resetting no-quorum-policy to 'stop': cluster has never had quorum
This raises the question: where can I see this temporary value of
no-quorum-policy? It is not present in the CIB, so how can I query the
"effective" value of the property?
Still, even though pacemaker does not actually attempt to start
resources, it apparently believes it was in quorum, because as soon as I
increase the number of expected votes back to 2, the node immediately
resets (due to no-quorum-policy=suicide).
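(That is, presumably via the inverse of the earlier cmapctl command:

ha1:~ # corosync-cmapctl -s quorum.expected_votes u32 2

after which the single node is again 1 vote out of an expected 2, loses
quorum, and the suicide policy kicks in.)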
Confused ... is this intentional behavior or a bug?