<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
I'm attempting to upgrade a two-node cluster with no quorum
requirement to a three-node cluster with a two-member quorum
requirement. Each node is running CentOS 7, Pacemaker 1.1.12-22 and
Corosync 2.3.4-4.<br>
<br>
If a node that's running resources loses quorum, then I want it to
stop all of its resources. The goal was partially accomplished by
setting the following in corosync.conf:<br>
<br>
quorum {<br>
provider: corosync_votequorum<br>
two_node: 1<br>
}<br>
<br>
...and updating Pacemaker's configuration with:<br>
<br>
pcs property set no-quorum-policy=stop<br>
<br>
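For reference, the property can be double-checked directly with
either of the following (the exact pcs subcommand may differ between
pcs versions; crm_attribute is the lower-level equivalent):<br>
<br>
pcs property show no-quorum-policy<br>
crm_attribute --type crm_config --name no-quorum-policy --query<br>
<br>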
With the above configuration, two failure scenarios work as I would
expect:<br>
<br>
1. If I power up a single node, it sees that there is no quorum, and
refuses to start any resources until it sees a second node come up.<br>
<br>
2. If there are two nodes running, and I power down a node that's
running resources, the other node sees that it lost quorum, and
refuses to start any resources.<br>
<br>
However, a third failure scenario does not work as I would expect:<br>
<br>
3. If there are two nodes running, and I power down a node that's
not running resources, the node that is running resources notes in
its log that it lost quorum, but does not actually shut down any of
its running services. (A quorum check on the surviving node is
sketched below.)<br>
<br>
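The quorum check referred to above is just corosync-quorumtool run
on the surviving node; the exact output varies a bit between corosync
versions, but the "Quorate" line and the votequorum flags are the
relevant parts:<br>
<br>
corosync-quorumtool -s<br>
<br>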
Any ideas on what the problem may be would be greatly appreciated.
In case it helps, I've included the output of "pcs status" and "pcs
config show", the contents of corosync.conf, and the Pacemaker and
Corosync logs from the period during which resources were not
stopped.<br>
<br>
<b>"pcs status" shows the resources still running after quorum is
lost:</b><br>
<br>
Cluster name:<br>
Last updated: Thu May 28 10:27:47 2015<br>
Last change: Thu May 28 10:03:05 2015<br>
Stack: corosync<br>
Current DC: node1 (1) - partition WITHOUT quorum<br>
Version: 1.1.12-a14efad<br>
3 Nodes configured<br>
12 Resources configured<br>
<br>
<br>
Node node3 (3): OFFLINE (standby)<br>
Online: [ node1 ]<br>
OFFLINE: [ node2 ]<br>
<br>
Full list of resources:<br>
<br>
Resource Group: primary<br>
virtual_ip_primary (ocf::heartbeat:IPaddr2): Started
node1<br>
GreenArrowFS (ocf::heartbeat:Filesystem): Started node1<br>
GreenArrow (ocf::drh:greenarrow): Started node1<br>
virtual_ip_1 (ocf::heartbeat:IPaddr2): Started node1<br>
virtual_ip_2 (ocf::heartbeat:IPaddr2): Started node1<br>
Resource Group: secondary<br>
virtual_ip_secondary (ocf::heartbeat:IPaddr2): Stopped<br>
GreenArrow-Secondary (ocf::drh:greenarrow-secondary):
Stopped<br>
Clone Set: ping-clone [ping]<br>
Started: [ node1 ]<br>
Stopped: [ node2 node3 ]<br>
Master/Slave Set: GreenArrowDataClone [GreenArrowData]<br>
Masters: [ node1 ]<br>
Stopped: [ node2 node3 ]<br>
<br>
PCSD Status:<br>
node1: Online<br>
node2: Offline<br>
node3: Offline<br>
<br>
Daemon Status:<br>
corosync: active/enabled<br>
pacemaker: active/enabled<br>
pcsd: active/enabled<br>
<br>
<b>"pcs config show"</b><b> shows that the "no-quorum-policy: stop"
setting is in place:</b><br>
<br>
Cluster Name:<br>
Corosync Nodes:<br>
node1 node2 node3<br>
Pacemaker Nodes:<br>
node1 node2 node3<br>
<br>
Resources:<br>
Group: primary<br>
Resource: virtual_ip_primary (class=ocf provider=heartbeat
type=IPaddr2)<br>
Attributes: ip=10.10.10.1 cidr_netmask=32<br>
Operations: start interval=0s timeout=20s
(virtual_ip_primary-start-timeout-20s)<br>
stop interval=0s timeout=20s
(virtual_ip_primary-stop-timeout-20s)<br>
monitor interval=30s
(virtual_ip_primary-monitor-interval-30s)<br>
Resource: GreenArrowFS (class=ocf provider=heartbeat
type=Filesystem)<br>
Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs
options=noatime,discard<br>
Operations: start interval=0s timeout=60
(GreenArrowFS-start-timeout-60)<br>
stop interval=0s timeout=60
(GreenArrowFS-stop-timeout-60)<br>
monitor interval=20 timeout=40
(GreenArrowFS-monitor-interval-20)<br>
Resource: GreenArrow (class=ocf provider=drh type=greenarrow)<br>
Operations: start interval=0s timeout=30
(GreenArrow-start-timeout-30)<br>
stop interval=0s timeout=240
(GreenArrow-stop-timeout-240)<br>
monitor interval=10 timeout=20
(GreenArrow-monitor-interval-10)<br>
Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)<br>
Attributes: ip=64.21.76.51 cidr_netmask=32<br>
Operations: start interval=0s timeout=20s
(virtual_ip_1-start-timeout-20s)<br>
stop interval=0s timeout=20s
(virtual_ip_1-stop-timeout-20s)<br>
monitor interval=30s
(virtual_ip_1-monitor-interval-30s)<br>
Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)<br>
Attributes: ip=64.21.76.63 cidr_netmask=32<br>
Operations: start interval=0s timeout=20s
(virtual_ip_2-start-timeout-20s)<br>
stop interval=0s timeout=20s
(virtual_ip_2-stop-timeout-20s)<br>
monitor interval=30s
(virtual_ip_2-monitor-interval-30s)<br>
Group: secondary<br>
Resource: virtual_ip_secondary (class=ocf provider=heartbeat
type=IPaddr2)<br>
Attributes: ip=10.10.10.4 cidr_netmask=32<br>
Operations: start interval=0s timeout=20s
(virtual_ip_secondary-start-timeout-20s)<br>
stop interval=0s timeout=20s
(virtual_ip_secondary-stop-timeout-20s)<br>
monitor interval=30s
(virtual_ip_secondary-monitor-interval-30s)<br>
Resource: GreenArrow-Secondary (class=ocf provider=drh
type=greenarrow-secondary)<br>
Operations: start interval=0s timeout=30
(GreenArrow-Secondary-start-timeout-30)<br>
stop interval=0s timeout=240
(GreenArrow-Secondary-stop-timeout-240)<br>
monitor interval=10 timeout=20
(GreenArrow-Secondary-monitor-interval-10)<br>
Clone: ping-clone<br>
Resource: ping (class=ocf provider=pacemaker type=ping)<br>
Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1<br>
Operations: start interval=0s timeout=60 (ping-start-timeout-60)<br>
stop interval=0s timeout=20 (ping-stop-timeout-20)<br>
monitor interval=10 timeout=60
(ping-monitor-interval-10)<br>
Master: GreenArrowDataClone<br>
Meta Attrs: master-max=1 master-node-max=1 clone-max=2
clone-node-max=1 notify=true<br>
Resource: GreenArrowData (class=ocf provider=linbit type=drbd)<br>
Attributes: drbd_resource=r0<br>
Operations: start interval=0s timeout=240
(GreenArrowData-start-timeout-240)<br>
promote interval=0s timeout=90
(GreenArrowData-promote-timeout-90)<br>
demote interval=0s timeout=90
(GreenArrowData-demote-timeout-90)<br>
stop interval=0s timeout=100
(GreenArrowData-stop-timeout-100)<br>
monitor interval=60s
(GreenArrowData-monitor-interval-60s)<br>
<br>
Stonith Devices:<br>
Fencing Levels:<br>
<br>
Location Constraints:<br>
Resource: primary<br>
Enabled on: node1 (score:INFINITY)
(id:location-primary-node1-INFINITY)<br>
Constraint: location-primary<br>
Rule: score=-INFINITY boolean-op=or
(id:location-primary-rule)<br>
Expression: pingd lt 1 (id:location-primary-rule-expr)<br>
Expression: not_defined pingd
(id:location-primary-rule-expr-1)<br>
Ordering Constraints:<br>
promote GreenArrowDataClone then start GreenArrowFS
(kind:Mandatory)
(id:order-GreenArrowDataClone-GreenArrowFS-mandatory)<br>
stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory)
(id:order-GreenArrowFS-GreenArrowDataClone-mandatory)<br>
Colocation Constraints:<br>
GreenArrowFS with GreenArrowDataClone (score:INFINITY)
(with-rsc-role:Master)
(id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)<br>
virtual_ip_secondary with GreenArrowDataClone (score:INFINITY)
(with-rsc-role:Slave)
(id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)<br>
virtual_ip_primary with GreenArrowDataClone (score:INFINITY)
(with-rsc-role:Master)
(id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)<br>
<br>
Cluster Properties:<br>
cluster-infrastructure: corosync<br>
cluster-name: cluster_greenarrow<br>
dc-version: 1.1.12-a14efad<br>
have-watchdog: false<br>
no-quorum-policy: stop<br>
stonith-enabled: false<br>
Node Attributes:<br>
node3: standby=on<br>
<br>
<b>Here's what was logged</b>:<br>
<br>
May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node
node3 for shutdown<br>
May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message:
Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2<br>
May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete<br>
May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]<br>
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback:
do_shutdown of node3 (op 64) is complete<br>
May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state:
attrd_peer_change_cb: Node node3[3] - state is now lost (was member)<br>
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove:
Removing all node3 attributes for attrd_peer_change_cb<br>
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb:
Lost attribute writer node3<br>
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list
contains incorrect address. This is sign of misconfiguration between
nodes!<br>
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership
(64.21.76.61:25740) was formed. Members left: 3<br>
May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within
the non-primary component and will NOT provide any services.<br>
May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1<br>
May 28 10:19:51 node1 corosync[1040]: [MAIN ] Completed service
synchronization, ready to provide service.<br>
May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification:
Membership 25740: quorum lost (1)<br>
May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node3[3] - state is now lost (was
member)<br>
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback:
do_shutdown of node3 (op 64) is complete<br>
May 28 10:19:51 node1 pacemakerd[1254]: notice:
pcmk_quorum_notification: Membership 25740: quorum lost (1)<br>
May 28 10:19:51 node1 pacemakerd[1254]: notice:
crm_update_peer_state: pcmk_quorum_notification: Node node3[3] -
state is now lost (was member)<br>
May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically
recovered ring 1<br>
<br>
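If it's useful, the transition files referenced in the log (e.g.
pe-input-992.bz2) can be replayed offline with crm_simulate to see
what the policy engine decided at each step; I believe the usual
invocation is something like:<br>
<br>
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-992.bz2<br>
<br>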
<b>Here's corosync.conf:</b><br>
<br>
totem {<br>
version: 2<br>
secauth: off<br>
cluster_name: cluster_greenarrow<br>
rrp_mode: passive<br>
transport: udpu<br>
}<br>
<br>
nodelist {<br>
node {<br>
ring0_addr: node1<br>
ring1_addr: 10.10.10.2<br>
nodeid: 1<br>
}<br>
node {<br>
ring0_addr: node2<br>
ring1_addr: 10.10.10.3<br>
nodeid: 2<br>
}<br>
node {<br>
ring0_addr: node3<br>
nodeid: 3<br>
}<br>
}<br>
<br>
quorum {<br>
provider: corosync_votequorum<br>
two_node: 0<br>
}<br>
<br>
logging {<br>
to_syslog: yes<br>
}<br>
<br>
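One more thing I can check if it helps: the quorum settings corosync
actually loaded at runtime (as opposed to the file above) should be
visible via corosync-cmapctl, along the lines of the following (key
names may differ slightly between versions):<br>
<br>
corosync-cmapctl | grep -E 'quorum|two_node|expected_votes'<br>
<br>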
Thanks,<br>
<br>
Matt<br>
</body>
</html>