[ClusterLabs] Resources not always stopped when quorum lost
Andrew Beekhof
andrew at beekhof.net
Tue Jun 2 00:34:15 UTC 2015
> On 29 May 2015, at 4:22 am, Matt Rideout <mrideout at windserve.com> wrote:
>
> It turns out that if I wait, the node that has resources already started when quorum is lost does stop its resources after 15 minutes. I repeated the test and saw the same 15-minute delay.
>
> cluster-recheck-interval is set to 15 minutes by default, so I dropped it to 1 minute with:
>
> pcs property set cluster-recheck-interval="60"
>
> This successfully reduced the delay to 1 minute.
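>
> For reference, the active value can be read back out of the CIB with Pacemaker's crm_attribute tool (cluster properties live in the crm_config section):
>
>     crm_attribute --type crm_config --name cluster-recheck-interval --query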
>
> Is it normal for Pacemaker to wait for cluster-recheck-interval before shutting down resources that were already running at the time quorum was lost?
It may be normal, but it's not expected.
I’ll make a note to investigate.
>
> Thanks,
>
> Matt
>
> On 5/28/15 11:39 AM, Matt Rideout wrote:
>> I'm attempting to upgrade a two-node cluster with no quorum requirement to a three-node cluster with a two-member quorum requirement. Each node is running CentOS 7, Pacemaker 1.1.12-22, and Corosync 2.3.4-4.
>>
>> If a node that's running resources loses quorum, then I want it to stop all of its resources. The goal was partially accomplished by setting the following in corosync.conf:
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 0
>> }
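>>
>> (With three single-vote nodes and two_node disabled, votequorum requires floor(3/2)+1 = 2 votes, which is exactly the two-member quorum requirement described above. The computed threshold can be confirmed on a running node with:
>>
>>     corosync-quorumtool -s
>>
>> ...which prints the expected votes and the quorum value.)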
>>
>> ...and updating Pacemaker's configuration with:
>>
>> pcs property set no-quorum-policy=stop
>>
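>> The property can be read back to confirm it took effect (assuming this pcs release supports showing a single property by name):
>>
>>     pcs property show no-quorum-policy
>>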
>> With the above configuration, two failure scenarios work as I would expect:
>>
>> 1. If I power up a single node, it sees that there is no quorum, and refuses to start any resources until it sees a second node come up.
>>
>> 2. If there are two nodes running, and I power down a node that's running resources, the other node sees that it lost quorum, and refuses to start any resources.
>>
>> However, a third failure scenario does not work as I would expect:
>>
>> 3. If there are two nodes running, and I power down a node that's not running resources, the node that is running resources notes in its log that it lost quorum, but does not actually shut down any of its running services.
>>
>> Any ideas on what the problem may be would be greatly appreciated. In case it helps, I've included the output of "pcs status" and "pcs config show", the contents of "corosync.conf", and the Pacemaker and Corosync logs from the period during which resources were not stopped.
>>
>> "pcs status" shows the resources still running after quorum is lost:
>>
>> Cluster name:
>> Last updated: Thu May 28 10:27:47 2015
>> Last change: Thu May 28 10:03:05 2015
>> Stack: corosync
>> Current DC: node1 (1) - partition WITHOUT quorum
>> Version: 1.1.12-a14efad
>> 3 Nodes configured
>> 12 Resources configured
>>
>>
>> Node node3 (3): OFFLINE (standby)
>> Online: [ node1 ]
>> OFFLINE: [ node2 ]
>>
>> Full list of resources:
>>
>> Resource Group: primary
>>     virtual_ip_primary (ocf::heartbeat:IPaddr2): Started node1
>>     GreenArrowFS (ocf::heartbeat:Filesystem): Started node1
>>     GreenArrow (ocf::drh:greenarrow): Started node1
>>     virtual_ip_1 (ocf::heartbeat:IPaddr2): Started node1
>>     virtual_ip_2 (ocf::heartbeat:IPaddr2): Started node1
>> Resource Group: secondary
>>     virtual_ip_secondary (ocf::heartbeat:IPaddr2): Stopped
>>     GreenArrow-Secondary (ocf::drh:greenarrow-secondary): Stopped
>> Clone Set: ping-clone [ping]
>>     Started: [ node1 ]
>>     Stopped: [ node2 node3 ]
>> Master/Slave Set: GreenArrowDataClone [GreenArrowData]
>>     Masters: [ node1 ]
>>     Stopped: [ node2 node3 ]
>>
>> PCSD Status:
>>   node1: Online
>>   node2: Offline
>>   node3: Offline
>>
>> Daemon Status:
>>   corosync: active/enabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> "pcs config show" shows that the "no-quorum-policy: stop" setting is in place:
>>
>> Cluster Name:
>> Corosync Nodes:
>> node1 node2 node3
>> Pacemaker Nodes:
>> node1 node2 node3
>>
>> Resources:
>>   Group: primary
>>     Resource: virtual_ip_primary (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=10.10.10.1 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_primary-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_primary-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_primary-monitor-interval-30s)
>>     Resource: GreenArrowFS (class=ocf provider=heartbeat type=Filesystem)
>>       Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs options=noatime,discard
>>       Operations: start interval=0s timeout=60 (GreenArrowFS-start-timeout-60)
>>                   stop interval=0s timeout=60 (GreenArrowFS-stop-timeout-60)
>>                   monitor interval=20 timeout=40 (GreenArrowFS-monitor-interval-20)
>>     Resource: GreenArrow (class=ocf provider=drh type=greenarrow)
>>       Operations: start interval=0s timeout=30 (GreenArrow-start-timeout-30)
>>                   stop interval=0s timeout=240 (GreenArrow-stop-timeout-240)
>>                   monitor interval=10 timeout=20 (GreenArrow-monitor-interval-10)
>>     Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=64.21.76.51 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_1-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_1-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_1-monitor-interval-30s)
>>     Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=64.21.76.63 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_2-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_2-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_2-monitor-interval-30s)
>>   Group: secondary
>>     Resource: virtual_ip_secondary (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=10.10.10.4 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_secondary-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_secondary-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_secondary-monitor-interval-30s)
>>     Resource: GreenArrow-Secondary (class=ocf provider=drh type=greenarrow-secondary)
>>       Operations: start interval=0s timeout=30 (GreenArrow-Secondary-start-timeout-30)
>>                   stop interval=0s timeout=240 (GreenArrow-Secondary-stop-timeout-240)
>>                   monitor interval=10 timeout=20 (GreenArrow-Secondary-monitor-interval-10)
>>   Clone: ping-clone
>>     Resource: ping (class=ocf provider=pacemaker type=ping)
>>       Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1
>>       Operations: start interval=0s timeout=60 (ping-start-timeout-60)
>>                   stop interval=0s timeout=20 (ping-stop-timeout-20)
>>                   monitor interval=10 timeout=60 (ping-monitor-interval-10)
>>   Master: GreenArrowDataClone
>>     Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>     Resource: GreenArrowData (class=ocf provider=linbit type=drbd)
>>       Attributes: drbd_resource=r0
>>       Operations: start interval=0s timeout=240 (GreenArrowData-start-timeout-240)
>>                   promote interval=0s timeout=90 (GreenArrowData-promote-timeout-90)
>>                   demote interval=0s timeout=90 (GreenArrowData-demote-timeout-90)
>>                   stop interval=0s timeout=100 (GreenArrowData-stop-timeout-100)
>>                   monitor interval=60s (GreenArrowData-monitor-interval-60s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>>   Resource: primary
>>     Enabled on: node1 (score:INFINITY) (id:location-primary-node1-INFINITY)
>>     Constraint: location-primary
>>       Rule: score=-INFINITY boolean-op=or (id:location-primary-rule)
>>         Expression: pingd lt 1 (id:location-primary-rule-expr)
>>         Expression: not_defined pingd (id:location-primary-rule-expr-1)
>> Ordering Constraints:
>>   promote GreenArrowDataClone then start GreenArrowFS (kind:Mandatory) (id:order-GreenArrowDataClone-GreenArrowFS-mandatory)
>>   stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory) (id:order-GreenArrowFS-GreenArrowDataClone-mandatory)
>> Colocation Constraints:
>>   GreenArrowFS with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)
>>   virtual_ip_secondary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Slave) (id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)
>>   virtual_ip_primary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)
>>
>> Cluster Properties:
>>   cluster-infrastructure: corosync
>>   cluster-name: cluster_greenarrow
>>   dc-version: 1.1.12-a14efad
>>   have-watchdog: false
>>   no-quorum-policy: stop
>>   stonith-enabled: false
>> Node Attributes:
>>   node3: standby=on
>>
>> Here's what was logged:
>>
>> May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node node3 for shutdown
>> May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2
>> May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete
>> May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
>> May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state: attrd_peer_change_cb: Node node3[3] - state is now lost (was member)
>> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove: Removing all node3 attributes for attrd_peer_change_cb
>> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb: Lost attribute writer node3
>> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list contains incorrect address. This is sign of misconfiguration between nodes!
>> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership (64.21.76.61:25740) was formed. Members left: 3
>> May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1
>> May 28 10:19:51 node1 corosync[1040]: [MAIN ] Completed service synchronization, ready to provide service.
>> May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
>> May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
>> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
>> May 28 10:19:51 node1 pacemakerd[1254]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
>> May 28 10:19:51 node1 pacemakerd[1254]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
>> May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically recovered ring 1
>>
>> Here's corosync.conf:
>>
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster_greenarrow
>>     rrp_mode: passive
>>     transport: udpu
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: node1
>>         ring1_addr: 10.10.10.2
>>         nodeid: 1
>>     }
>>     node {
>>         ring0_addr: node2
>>         ring1_addr: 10.10.10.3
>>         nodeid: 2
>>     }
>>     node {
>>         ring0_addr: node3
>>         nodeid: 3
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 0
>> }
>>
>> logging {
>>     to_syslog: yes
>> }
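>>
>> (A side note: rrp_mode is set to passive, but node3 is the only node without a ring1_addr, which may be what triggers the TOTEM "Membership left list contains incorrect address" warning in the logs above. If node3 were given a second interface on the 10.10.10.x network, its entry would presumably become something like the following, where 10.10.10.5 is a made-up placeholder address:)
>>
>>     node {
>>         ring0_addr: node3
>>         ring1_addr: 10.10.10.5
>>         nodeid: 3
>>     }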
>>
>> Thanks,
>>
>> Matt
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org