[ClusterLabs] Resources not always stopped when quorum lost
Andrew Beekhof
andrew at beekhof.net
Tue Jun 2 00:34:15 UTC 2015
> On 29 May 2015, at 4:22 am, Matt Rideout <mrideout at windserve.com> wrote:
>
> It turns out that if I wait, the node that has resources already started when quorum is lost does stop its resources after 15 minutes. I repeated the test and saw the same 15-minute delay.
>
> cluster-recheck-interval is set to 15 minutes by default, so I dropped it to 1 minute with:
>
> pcs property set cluster-recheck-interval="60"
>
> This successfully reduced the delay to 1 minute.
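>
> For reference, the active value can be read back out of the CIB with Pacemaker's crm_attribute tool (cluster properties live in the crm_config section):
>
>     crm_attribute --type crm_config --name cluster-recheck-interval --query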
>
> Is it normal for Pacemaker to wait for cluster-recheck-interval before shutting down resources that were already running at the time quorum was lost?
It may be normal, but it's not expected.
I’ll make a note to investigate.
>
> Thanks,
>
> Matt
>
> On 5/28/15 11:39 AM, Matt Rideout wrote:
>> I'm attempting to upgrade a two-node cluster with no quorum requirement to a three-node cluster with a two-member quorum requirement. Each node is running CentOS 7, Pacemaker 1.1.12-22, and Corosync 2.3.4-4.
>>
>> If a node that's running resources loses quorum, then I want it to stop all of its resources. The goal was partially accomplished by setting the following in corosync.conf:
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 0
>> }
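>>
>> (With three single-vote nodes and two_node disabled, votequorum requires floor(3/2)+1 = 2 votes, which is exactly the two-member quorum requirement described above. The computed threshold can be confirmed on a running node with:
>>
>>     corosync-quorumtool -s
>>
>> ...which prints the expected votes and the quorum value.)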
>>
>> ...and updating Pacemaker's configuration with:
>>
>> pcs property set no-quorum-policy=stop
>>
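>> The property can be read back to confirm it took effect (assuming this pcs release supports showing a single property by name):
>>
>>     pcs property show no-quorum-policy
>>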
>> With the above configuration, two failure scenarios work as I would expect:
>>
>> 1. If I power up a single node, it sees that there is no quorum, and refuses to start any resources until it sees a second node come up.
>>
>> 2. If there are two nodes running, and I power down a node that's running resources, the other node sees that it lost quorum, and refuses to start any resources.
>>
>> However, a third failure scenario does not work as I would expect:
>>
>> 3. If there are two nodes running, and I power down a node that's not running resources, the node that is running resources notes in its log that it lost quorum, but does not actually shut down any of its running services.
>>
>> Any ideas on what the problem may be would be greatly appreciated. In case it helps, I've included the output of "pcs status" and "pcs config show", the contents of "corosync.conf", and the Pacemaker and Corosync logs from the period during which resources were not stopped.
>>
>> "pcs status" shows the resources still running after quorum is lost:
>>
>> Cluster name:
>> Last updated: Thu May 28 10:27:47 2015
>> Last change: Thu May 28 10:03:05 2015
>> Stack: corosync
>> Current DC: node1 (1) - partition WITHOUT quorum
>> Version: 1.1.12-a14efad
>> 3 Nodes configured
>> 12 Resources configured
>>
>>
>> Node node3 (3): OFFLINE (standby)
>> Online: [ node1 ]
>> OFFLINE: [ node2 ]
>>
>> Full list of resources:
>>
>> Resource Group: primary
>>     virtual_ip_primary (ocf::heartbeat:IPaddr2): Started node1
>>     GreenArrowFS (ocf::heartbeat:Filesystem): Started node1
>>     GreenArrow (ocf::drh:greenarrow): Started node1
>>     virtual_ip_1 (ocf::heartbeat:IPaddr2): Started node1
>>     virtual_ip_2 (ocf::heartbeat:IPaddr2): Started node1
>> Resource Group: secondary
>>     virtual_ip_secondary (ocf::heartbeat:IPaddr2): Stopped
>>     GreenArrow-Secondary (ocf::drh:greenarrow-secondary): Stopped
>> Clone Set: ping-clone [ping]
>>     Started: [ node1 ]
>>     Stopped: [ node2 node3 ]
>> Master/Slave Set: GreenArrowDataClone [GreenArrowData]
>>     Masters: [ node1 ]
>>     Stopped: [ node2 node3 ]
>>
>> PCSD Status:
>>   node1: Online
>>   node2: Offline
>>   node3: Offline
>>
>> Daemon Status:
>>   corosync: active/enabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> "pcs config show" shows that the "no-quorum-policy: stop" setting is in place:
>>
>> Cluster Name:
>> Corosync Nodes:
>> node1 node2 node3
>> Pacemaker Nodes:
>> node1 node2 node3
>>
>> Resources:
>>   Group: primary
>>     Resource: virtual_ip_primary (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=10.10.10.1 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_primary-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_primary-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_primary-monitor-interval-30s)
>>     Resource: GreenArrowFS (class=ocf provider=heartbeat type=Filesystem)
>>       Attributes: device=/dev/drbd1 directory=/media/drbd1 fstype=xfs options=noatime,discard
>>       Operations: start interval=0s timeout=60 (GreenArrowFS-start-timeout-60)
>>                   stop interval=0s timeout=60 (GreenArrowFS-stop-timeout-60)
>>                   monitor interval=20 timeout=40 (GreenArrowFS-monitor-interval-20)
>>     Resource: GreenArrow (class=ocf provider=drh type=greenarrow)
>>       Operations: start interval=0s timeout=30 (GreenArrow-start-timeout-30)
>>                   stop interval=0s timeout=240 (GreenArrow-stop-timeout-240)
>>                   monitor interval=10 timeout=20 (GreenArrow-monitor-interval-10)
>>     Resource: virtual_ip_1 (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=64.21.76.51 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_1-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_1-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_1-monitor-interval-30s)
>>     Resource: virtual_ip_2 (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=64.21.76.63 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_2-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_2-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_2-monitor-interval-30s)
>>   Group: secondary
>>     Resource: virtual_ip_secondary (class=ocf provider=heartbeat type=IPaddr2)
>>       Attributes: ip=10.10.10.4 cidr_netmask=32
>>       Operations: start interval=0s timeout=20s (virtual_ip_secondary-start-timeout-20s)
>>                   stop interval=0s timeout=20s (virtual_ip_secondary-stop-timeout-20s)
>>                   monitor interval=30s (virtual_ip_secondary-monitor-interval-30s)
>>     Resource: GreenArrow-Secondary (class=ocf provider=drh type=greenarrow-secondary)
>>       Operations: start interval=0s timeout=30 (GreenArrow-Secondary-start-timeout-30)
>>                   stop interval=0s timeout=240 (GreenArrow-Secondary-stop-timeout-240)
>>                   monitor interval=10 timeout=20 (GreenArrow-Secondary-monitor-interval-10)
>>   Clone: ping-clone
>>     Resource: ping (class=ocf provider=pacemaker type=ping)
>>       Attributes: dampen=30s multiplier=1000 host_list=64.21.76.1
>>       Operations: start interval=0s timeout=60 (ping-start-timeout-60)
>>                   stop interval=0s timeout=20 (ping-stop-timeout-20)
>>                   monitor interval=10 timeout=60 (ping-monitor-interval-10)
>>   Master: GreenArrowDataClone
>>     Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>     Resource: GreenArrowData (class=ocf provider=linbit type=drbd)
>>       Attributes: drbd_resource=r0
>>       Operations: start interval=0s timeout=240 (GreenArrowData-start-timeout-240)
>>                   promote interval=0s timeout=90 (GreenArrowData-promote-timeout-90)
>>                   demote interval=0s timeout=90 (GreenArrowData-demote-timeout-90)
>>                   stop interval=0s timeout=100 (GreenArrowData-stop-timeout-100)
>>                   monitor interval=60s (GreenArrowData-monitor-interval-60s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>>   Resource: primary
>>     Enabled on: node1 (score:INFINITY) (id:location-primary-node1-INFINITY)
>>     Constraint: location-primary
>>       Rule: score=-INFINITY boolean-op=or (id:location-primary-rule)
>>         Expression: pingd lt 1 (id:location-primary-rule-expr)
>>         Expression: not_defined pingd (id:location-primary-rule-expr-1)
>> Ordering Constraints:
>>   promote GreenArrowDataClone then start GreenArrowFS (kind:Mandatory) (id:order-GreenArrowDataClone-GreenArrowFS-mandatory)
>>   stop GreenArrowFS then demote GreenArrowDataClone (kind:Mandatory) (id:order-GreenArrowFS-GreenArrowDataClone-mandatory)
>> Colocation Constraints:
>>   GreenArrowFS with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-GreenArrowFS-GreenArrowDataClone-INFINITY)
>>   virtual_ip_secondary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Slave) (id:colocation-virtual_ip_secondary-GreenArrowDataClone-INFINITY)
>>   virtual_ip_primary with GreenArrowDataClone (score:INFINITY) (with-rsc-role:Master) (id:colocation-virtual_ip_primary-GreenArrowDataClone-INFINITY)
>>
>> Cluster Properties:
>>   cluster-infrastructure: corosync
>>   cluster-name: cluster_greenarrow
>>   dc-version: 1.1.12-a14efad
>>   have-watchdog: false
>>   no-quorum-policy: stop
>>   stonith-enabled: false
>> Node Attributes:
>>   node3: standby=on
>>
>> Here's what was logged:
>>
>> May 28 10:19:51 node1 pengine[1296]: notice: stage6: Scheduling Node node3 for shutdown
>> May 28 10:19:51 node1 pengine[1296]: notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-992.bz2
>> May 28 10:19:51 node1 crmd[1297]: notice: run_graph: Transition 7 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-992.bz2): Complete
>> May 28 10:19:51 node1 crmd[1297]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
>> May 28 10:19:51 node1 attrd[1295]: notice: crm_update_peer_state: attrd_peer_change_cb: Node node3[3] - state is now lost (was member)
>> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_remove: Removing all node3 attributes for attrd_peer_change_cb
>> May 28 10:19:51 node1 attrd[1295]: notice: attrd_peer_change_cb: Lost attribute writer node3
>> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] Membership left list contains incorrect address. This is sign of misconfiguration between nodes!
>> May 28 10:19:51 node1 corosync[1040]: [TOTEM ] A new membership (64.21.76.61:25740) was formed. Members left: 3
>> May 28 10:19:51 node1 corosync[1040]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> May 28 10:19:51 node1 corosync[1040]: [QUORUM] Members[1]: 1
>> May 28 10:19:51 node1 corosync[1040]: [MAIN ] Completed service synchronization, ready to provide service.
>> May 28 10:19:51 node1 crmd[1297]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
>> May 28 10:19:51 node1 crmd[1297]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
>> May 28 10:19:51 node1 crmd[1297]: notice: peer_update_callback: do_shutdown of node3 (op 64) is complete
>> May 28 10:19:51 node1 pacemakerd[1254]: notice: pcmk_quorum_notification: Membership 25740: quorum lost (1)
>> May 28 10:19:51 node1 pacemakerd[1254]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node3[3] - state is now lost (was member)
>> May 28 10:19:52 node1 corosync[1040]: [TOTEM ] Automatically recovered ring 1
>>
>> Here's corosync.conf:
>>
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster_greenarrow
>>     rrp_mode: passive
>>     transport: udpu
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: node1
>>         ring1_addr: 10.10.10.2
>>         nodeid: 1
>>     }
>>     node {
>>         ring0_addr: node2
>>         ring1_addr: 10.10.10.3
>>         nodeid: 2
>>     }
>>     node {
>>         ring0_addr: node3
>>         nodeid: 3
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 0
>> }
>>
>> logging {
>>     to_syslog: yes
>> }
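>>
>> (A side note: rrp_mode is set to passive, but node3 is the only node without a ring1_addr, which may be what triggers the TOTEM "Membership left list contains incorrect address" warning in the logs above. If node3 were given a second interface on the 10.10.10.x network, its entry would presumably become something like the following, where 10.10.10.5 is a made-up placeholder address:)
>>
>>     node {
>>         ring0_addr: node3
>>         ring1_addr: 10.10.10.5
>>         nodeid: 3
>>     }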
>>
>> Thanks,
>>
>> Matt
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org