[ClusterLabs] Resources not always stopped when quorum lost

Matt Rideout mrideout at windserve.com
Thu May 28 15:39:27 UTC 2015


I'm attempting to upgrade a two-node cluster with no quorum requirement 
to a three-node cluster with a two-member quorum requirement. Each node 
is running CentOS 7, Pacemaker 1.1.12-22 and Corosync 2.3.4-4.

If a node that's running resources loses quorum, then I want it to stop 
all of its resources.  The goal was partially accomplished by setting 
the following in corosync.conf:

quorum {
   provider: corosync_votequorum
   two_node: 1
}

...and updating Pacemaker's configuration with:

pcs property set no-quorum-policy=stop
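
For reference, here's roughly how I check that both settings actually 
took effect (just a sketch, assuming the stock corosync 2.x and pcs 
command-line tools):

# runtime quorum settings as corosync sees them
corosync-quorumtool -s              # shows Quorate, Expected votes, Flags
corosync-cmapctl | grep -i quorum   # dumps the quorum.* keys that were loaded

# confirm the Pacemaker property is in the CIB
pcs property show no-quorum-policy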

With the above configuration, two failure scenarios work as I would expect:

1. If I power up a single node, it sees that there is no quorum, and 
refuses to start any resources until it sees a second node come up.

2. If there are two nodes running, and I power down a node that's 
running resources, the other node sees that it lost quorum, and refuses 
to start any resources.

However, a third failure scenario does not work as I would expect:

3. If there are two nodes running, and I power down a node that's not 
running resources, the node that is running resources notes in its log 
that it lost quorum, but does not actually shut down any of its 
running services (the commands I use to check this are sketched below).
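
To double-check scenario 3, this is roughly what I run on the surviving 
node after powering down the idle one (a sketch; the tools and log path 
assume a stock CentOS 7 Corosync/Pacemaker install):

corosync-quorumtool -s     # reports "Quorate: No" once the other node is gone
crm_mon -1                 # yet the resources are still reported as Started
grep -i quorum /var/log/messages | tail    # the "quorum lost" messages quoted below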

Any ideas on what the problem may be would be greatly appreciated. In 
case it helps, I've included the output of "pcs status" and "pcs config 
show", the contents of corosync.conf, and the Pacemaker and Corosync 
logs from the period during which resources were not stopped.

*"pcs status" shows the resources still running after quorum is lost:*

Cluster name:
Last updated: Thu May 28 10:27:47 2015
Last change: Thu May 28 10:03:05 2015
Stack: corosync
Current DC: node1 (1) - partition WITHOUT quorum
Version: 1.1.12-a14efad
3 Nodes configured
12 Resources configured


Node node3 (3): OFFLINE (standby)
Online: [ node1 ]
OFFLINE: [ node2 ]

Full list of resources:

  Resource Group: primary
      virtual_ip_primary    (ocf::heartbeat:IPaddr2):    Started node1
      GreenArrowFS    (ocf::heartbeat:Filesystem):    Started node1
      GreenArrow    (ocf::drh:greenarrow):    Started node1
      virtual_ip_1    (ocf::heartbeat:IPaddr2):    Started node1
      virtual_ip_2    (ocf::heartbeat:IPaddr2):    Started node1
  Resource Group: secondary
      virtual_ip_secondary    (ocf::heartbeat:IPaddr2):    Stopped
      GreenArrow-Secondary    (ocf::drh:greenarrow-secondary): Stopped
  Clone Set: ping-clone [ping]
      Started: [ node1 ]
      Stopped: [ node2 node3 ]
  Master/Slave Set: GreenArrowDataClone [GreenArrowData]
      Masters: [ node1 ]
      Stopped: [ node2 node3 ]

PCSD Status:
   node1: Online
   node2: Offline
   node3: Offline

Daemon Status:
   corosync: active/enabled
   pacemaker: active/enabled
   pcsd: active/enabled

*"pcs config show"**shows that the "no-quorum-policy: stop" setting is 
in place:*

Cluster Name:
Corosync Nodes:
  node1 node2 node3
Pacemaker Nodes:
  node1 node2 node3

Resources:
  Group: primary
   Resource: virtual_ip_primary (class=ocf provider=heartbeat type=IPaddr2)
    Attributes: ip=10.10.10.1 cidr_netmask=32
    Operations: start interval=0s timeout=20s 
(virtual_ip_primary-start-timeout-20s)
                stop interval=0s timeout=20s 
(virtual_ip_primary-stop-timeout-20s)
                monitor interval=30s 
(virtual_ip_primary-monitor-interval-30s)
   Resource: GreenArrowFS (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs 
options=noatime,discard
    Operations: start interval=0s timeout=60 (GreenArrowFS-start-timeout-60)
                stop interval=0s timeout=60 (GreenArrowFS-stop-timeout-60)
                monitor interval=20 timeout=40 
(GreenArrowFS-monitor-interval-20)
   Resource: GreenArrow (class=ocf provider=drh type=greenarrow)
    Operations: start interval=0s timeout=30 (GreenArrow-start-timeout-30)
                stop interval=0s timeout=240 (GreenArrow-stop-timeout-240)
                monitor interval=10 timeout=20 
(GreenArrow-monitor-interval-10)
   Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)
    Attributes: ip=64.21.76.51 cidr_netmask=32
    Operations: start interval=0s timeout=20s 
(virtual_ip_1-start-timeout-20s)
                stop interval=0s timeout=20s (virtual_ip_1-stop-timeout-20s)
                monitor interval=30s (virtual_ip_1-monitor-interval-30s)
   Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)
    Attributes: ip=64.21.76.63 cidr_netmask=32
    Operations: start interval=0s timeout=20s 
(virtual_ip_2-start-timeout-20s)
                stop interval=0s timeout=20s (virtual_ip_2-stop-timeout-20s)
                monitor interval=30s (virtual_ip_2-monitor-interval-30s)
  Group: secondary
   Resource: virtual_ip_secondary (class=ocf provider=heartbeat 
type=IPaddr2)
    Attributes: ip=10.10.10.4 cidr_netmask=32
    Operations: start interval=0s timeout=20s 
(virtual_ip_secondary-start-timeout-20s)
                stop interval=0s timeout=20s 
(virtual_ip_secondary-stop-timeout-20s)
                monitor interval=30s 
(virtual_ip_secondary-monitor-interval-30s)
   Resource: GreenArrow-Secondary (class=ocf provider=drh 
type=greenarrow-secondary)
    Operations: start interval=0s timeout=30 
(GreenArrow-Secondary-start-timeout-30)
                stop interval=0s timeout=240 
(GreenArrow-Secondary-stop-timeout-240)
                monitor interval=10 timeout=20 
(GreenArrow-Secondary-monitor-interval-10)
  Clone: ping-clone
   Resource: ping (class=ocf provider=pacemaker type=ping)
    Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1
    Operations: start interval=0s timeout=60 (ping-start-timeout-60)
                stop interval=0s timeout=20 (ping-stop-timeout-20)
                monitor interval=10 timeout=60 (ping-monitor-interval-10)
  Master: GreenArrowDataClone
   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true
   Resource: GreenArrowData (class=ocf provider=linbit type=drbd)
    Attributes: drbd_resource=r0
    Operations: start interval=0s timeout=240 
(GreenArrowData-start-timeout-240)
                promote interval=0s timeout=90 
(GreenArrowData-promote-timeout-90)
                demote interval=0s timeout=90 
(GreenArrowData-demote-timeout-90)
                stop interval=0s timeout=100 
(GreenArrowData-stop-timeout-100)
                monitor interval=60s (GreenArrowData-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
   Resource: primary
     Enabled on: node1 (score:INFINITY) (id:location-primary-node1-INFINITY)
     Constraint: location-primary
       Rule: score=-INFINITY boolean-op=or (id:location-primary-rule)
         Expression: pingd lt 1  (id:location-primary-rule-expr)
         Expression: not_defined pingd (id:location-primary-rule-expr-1)
Ordering Constraints:
   promote GreenArrowDataClone then start GreenArrowFS (kind:Mandatory) 
(id:order-GreenArrowDataClone-GreenArrowFS-mandatory)
   stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory) 
(id:order-GreenArrowFS-GreenArrowDataClone-mandatory)
Colocation Constraints:
   GreenArrowFS with GreenArrowDataClone (score:INFINITY) 
(with-rsc-role:Master) 
(id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)
   virtual_ip_secondary with GreenArrowDataClone (score:INFINITY) 
(with-rsc-role:Slave) 
(id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)
   virtual_ip_primary with GreenArrowDataClone (score:INFINITY) 
(with-rsc-role:Master) 
(id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)

Cluster Properties:
  cluster-infrastructure: corosync
  cluster-name: cluster_greenarrow
  dc-version: 1.1.12-a14efad
  have-watchdog: false
  no-quorum-policy: stop
  stonith-enabled: false
Node Attributes:
  node3: standby=on

*Here's what was logged:*

May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node 
node3 for shutdown
May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message: 
Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2
May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7 
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete
May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: 
do_shutdown of node3 (op 64) is complete
May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state: 
attrd_peer_change_cb: Node node3[3] - state is now lost (was member)
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove: Removing 
all node3 attributes for attrd_peer_change_cb
May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb: Lost 
attribute writer node3
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list 
contains incorrect address. This is sign of misconfiguration between nodes!
May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership 
(64.21.76.61:25740) was formed. Members left: 3
May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within the 
non-primary component and will NOT provide any services.
May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1
May 28 10:19:51 node1 corosync[1040]: [MAIN  ] Completed service 
synchronization, ready to provide service.
May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification: 
Membership 25740: quorum lost (1)
May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state: 
pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: 
do_shutdown of node3 (op 64) is complete
May 28 10:19:51 node1 pacemakerd[1254]: notice: 
pcmk_quorum_notification: Membership 25740: quorum lost (1)
May 28 10:19:51 node1 pacemakerd[1254]: notice: crm_update_peer_state: 
pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically recovered 
ring 1

*Here's corosync.conf:*

totem {
   version: 2
   secauth: off
   cluster_name: cluster_greenarrow
   rrp_mode: passive
   transport: udpu
}

nodelist {
   node {
     ring0_addr: node1
     ring1_addr: 10.10.10.2
     nodeid: 1
   }
   node {
     ring0_addr: node2
     ring1_addr: 10.10.10.3
     nodeid: 2
   }
   node {
     ring0_addr: node3
     nodeid: 3
   }
}

quorum {
   provider: corosync_votequorum
   two_node: 0
}

logging {
   to_syslog: yes
}

Thanks,

Matt