[ClusterLabs] Corosync quorum vs. pacemaker quorum confusion
Andrei Borzenkov
arvidjaar at gmail.com
Sun Dec 3 06:03:17 EST 2017
I assumed that with corosync 2.x, quorum is maintained by corosync and
pacemaker simply gets a yes/no answer. Apparently it is more complicated
than that. This is a trivial two-node test cluster (two_node is
intentionally not set, to simulate "normal" quorum behavior).
ha1:~ # crm configure show
node 1084752129: ha1
node 1084752130: ha2
primitive stonith-sbd stonith:external/sbd \
        params pcmk_delay_max=30s
property cib-bootstrap-options: \
        have-watchdog=true \
        dc-version=1.1.17-3.3-36d2962a8 \
        cluster-infrastructure=corosync \
        cluster-name=hacluster \
        stonith-enabled=true \
        placement-strategy=balanced \
        stonith-timeout=172 \
        no-quorum-policy=suicide
rsc_defaults rsc-options: \
        resource-stickiness=1 \
        migration-threshold=3
op_defaults op-options: \
        timeout=600 \
        record-pending=true
I boot one node.
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec 3 13:44:55 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           2 Activity blocked
Flags:

Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition WITHOUT quorum
Last updated: Sun Dec 3 13:48:46 2017
Last change: Sun Dec 3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
stonith-sbd (stonith:external/sbd): Stopped
Migration Summary:
* Node ha1:
So far that's expected: we are out of quorum, so nothing happens. The
first surprise, though, was this message (which confirmed past
empirical observations):
Dec 03 13:44:57 [1632] ha1 pengine: notice: stage6: Cannot fence
unclean nodes until quorum is attained (or no-quorum-policy is set to
ignore)
I assume this is intentional behavior, in which case it would be really
good to have it mentioned in the documentation as well. So far I have not
seen a comprehensive explanation of pacemaker's startup logic (what it
decides to do, and when).
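As an aside, pacemaker's own view of quorum can also be checked directly;
if I remember the tool's options correctly, this prints 1 when the local
partition has quorum and 0 otherwise:

# prints 1 if the local partition has quorum, 0 if not
crm_node -q

Here it should simply mirror what corosync-quorumtool reports; the
interesting part is what pacemaker then does with that information.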
OK, let's pretend we have quorum.
ha1:~ # corosync-cmapctl -s quorum.expected_votes u32 1
ha1:~ # corosync-quorumtool
Quorum information
------------------
Date:             Sun Dec 3 13:52:19 2017
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          1084752129
Ring ID:          1084752129/240
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   1
Highest expected: 1
Total votes:      1
Quorum:           1
Flags:            Quorate

Membership information
----------------------
    Nodeid      Votes Name
1084752129          1 ha1 (local)
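(As far as I know, the same change could also have been made with
corosync-quorumtool itself:

# equivalent, as far as I know, to the cmapctl call above
corosync-quorumtool -e 1

but the effect on votequorum should be the same either way.)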
So corosync apparently believes we are in quorum now. What does pacemaker do?
ha1:~ # crm_mon -1rf
Stack: corosync
Current DC: ha1 (version 1.1.17-3.3-36d2962a8) - partition with quorum
Last updated: Sun Dec 3 13:53:22 2017
Last change: Sun Dec 3 12:09:19 2017 by root via cibadmin on ha1
2 nodes configured
1 resource configured
Node ha2: UNCLEAN (offline)
Online: [ ha1 ]
Full list of resources:
stonith-sbd (stonith:external/sbd): Stopped
Migration Summary:
* Node ha1:
Nothing really changed. Although it quite clearly says we are in
quorum, pacemaker still won't start any resources or attempt to fence the
other node. And although the logs say
Dec 03 13:52:07 [1633] ha1 crmd: notice:
pcmk_quorum_notification: Quorum acquired | membership=240 members=1
Dec 03 13:52:07 [1626] ha1 pacemakerd: notice:
pcmk_quorum_notification: Quorum acquired | membership=240 members=1
There is still *no* attempt to do anything.
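(For anyone trying to reproduce this: running the policy engine against
the live CIB should show whether any actions are being computed at all; I
have not pasted its output here.

# run the policy engine against the live cluster state
crm_simulate -L
)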
This may be related to an earlier message:
Dec 03 13:44:57 [1629] ha1 stonith-ng: notice: unpack_config:
Resetting no-quorum-policy to 'stop': cluster has never had quorum
Which raises a question: where can I see this temporary value of
no-quorum-policy? It is not present in the CIB, so how can I query the
"effective" value of the property?
Still, even though pacemaker does not attempt to actually start
resources, it apparently believes it was in quorum, because as soon as I
increase the number of expected votes back to 2, the node immediately
resets (due to no-quorum-policy=suicide).
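(To be explicit, "back to 2" means reverting the earlier change, along
the lines of:

corosync-cmapctl -s quorum.expected_votes u32 2
)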
Confused ... is this intentional behavior or a bug?