[ClusterLabs] Corosync quorum vs. pacemaker quorum confusion
    Andrei Borzenkov 
    arvidjaar at gmail.com
       
    Sun Dec  3 06:03:17 EST 2017
    
    
  
I assumed that with corosync 2.x, quorum is maintained by corosync and
pacemaker simply gets a yes/no answer. Apparently it is more complicated
than that. Here is a trivial test on a two-node cluster (two_node is
intentionally not set, to simulate "normal" majority-quorum behavior).
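For reference, the relevant quorum section of corosync.conf would look
roughly like this (a sketch -- the actual file was not included here,
and the key point is only that two_node is absent):

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    # deliberately NO "two_node: 1" line, so ordinary
    # majority quorum (2 of 2 votes) applies
}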
ha1:~ # crm configure show
node 1084752129: ha1
node 1084752130: ha2
primitive stonith-sbd stonith:external/sbd \
	params pcmk_delay_max=30s
property cib-bootstrap-options: \
	have-watchdog=true \
	dc-version=1.1.17-3.3-36d2962a8 \
	cluster-infrastructure=corosync \
	cluster-name=hacluster \
	stonith-enabled=true \
	placement-strategy=balanced \
	stonith-timeout=172 \
	no-quorum-policy=suicide
rsc_defaults rsc-options: \
	resource-stickiness=1 \
	migration-threshold=3
op_defaults op-options: \
	timeout=600 \
	record-pending=true
I boot one node.
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec  3 13:44:55 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          No
Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:
Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition WITHOUT quorum
Last updated: Sun Dec  3 13:48:46 2017
Last change: Sun Dec  3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
 stonith-sbd	(stonith:external/sbd):	Stopped
Migration Summary:
* Node ha1:
So far that's expected: we are out of quorum, so nothing happens. The
first surprise, though, was this message (which confirmed past
empirical observations):
Dec 03 13:44:57 [1632] ha1    pengine:   notice: stage6:	Cannot fence
unclean nodes until quorum is attained (or no-quorum-policy is set to
ignore)
I assume this is intentional behavior, in which case it would be really
good to have it mentioned in the documentation as well. So far I have not
seen a comprehensive explanation of pacemaker's startup logic (what it
decides to do, and when).
OK, let's pretend we have quorum.
ha1:~ # corosync-cmapctl -s quorum.expected_votes u32 1
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec  3 13:52:19 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          Yes
Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate
Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
So corosync apparently believes we are in quorum now. What does pacemaker do?
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Sun Dec  3 13:53:22 2017
Last change: Sun Dec  3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
 stonith-sbd	(stonith:external/sbd):	Stopped
Migration Summary:
* Node ha1:
Nothing really changed. Even though crm_mon quite clearly reports that
we are in quorum, pacemaker still won't start any resources or attempt
to fence the other node, although the logs say
Dec 03 13:52:07 [1633] ha1       crmd:   notice:
pcmk_quorum_notification:	Quorum acquired | membership=240 members=1
Dec 03 13:52:07 [1626] ha1 pacemakerd:   notice:
pcmk_quorum_notification:	Quorum acquired | membership=240 members=1
There is still *no* attempt to do anything.
This may be related to an earlier message:
Dec 03 13:44:57 [1629] ha1 stonith-ng:   notice: unpack_config:
Resetting no-quorum-policy to 'stop': cluster has never had quorum
This raises the question: where can I see this temporary value of
no-quorum-policy? It is not present in the CIB, so how can I query the
"effective" value of the property?
Still, even though pacemaker does not actually attempt to start
resources, it apparently believes it was in quorum, because as soon as I
increase the number of expected votes back to 2, the node immediately
resets (due to no-quorum-policy=suicide).
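(That is, presumably via the inverse of the earlier cmapctl command:

ha1:~ # corosync-cmapctl -s quorum.expected_votes u32 2

after which the single node is again 1 vote out of an expected 2, loses
quorum, and the suicide policy kicks in.)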
Confused ... is this intentional behavior or a bug?