[ClusterLabs] Pacemaker quorum behavior

Tomas Jelinek tojeline at redhat.com
Thu Sep 29 08:35:25 EDT 2016


On 29.9.2016 at 00:14, Ken Gaillot wrote:
> On 09/28/2016 03:57 PM, Scott Greenlese wrote:
>> A quick addendum...
>>
>> After sending this post, I decided to stop pacemaker on the single,
>> Online node in the cluster,
>> and this effectively killed the corosync daemon:
>>
>> [root at zs93kl VD]# date;pcs cluster stop
>> Wed Sep 28 16:39:22 EDT 2016
>> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
>
> Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.
>
>> [root at zs93kl VD]# date;ps -ef |grep coro|grep -v grep
>> Wed Sep 28 16:46:19 EDT 2016
>
> Totally irrelevant, but a little trick I picked up somewhere: when
> grepping for a process, square-bracketing a character lets you avoid the
> "grep -v", e.g. "ps -ef | grep cor[o]"
>
> It's nice when I remember to use it ;)
>
>> [root at zs93kl VD]#
>>
>>
>>
>> Next, I went to a node in "Pending" state, and sure enough... the pcs
>> cluster stop killed the daemon there, too:
>>
>> [root at zs95kj VD]# date;pcs cluster stop
>> Wed Sep 28 16:48:15 EDT 2016
>> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
>>
>> [root at zs95kj VD]# date;ps -ef |grep coro |grep -v grep
>> Wed Sep 28 16:48:38 EDT 2016
>> [root at zs95kj VD]#
>>
>> So, this answers my own question... cluster stop should kill corosync.
>> So why is `pcs cluster stop --all` failing to
>> kill corosync?
>
> It should. At least you've narrowed it down :)

This is a bug in pcs. Thanks for spotting it and providing a detailed 
description. I filed the bug here: 
https://bugzilla.redhat.com/show_bug.cgi?id=1380372

Regards,
Tomas

>
>> Thanks...
>>
>>
>> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl at us.ibm.com
>>
>>
>>
>>
>> From: Scott Greenlese/Poughkeepsie/IBM
>> To: kgaillot at redhat.com, Cluster Labs - All topics related to
>> open-source clustering welcomed <users at clusterlabs.org>
>> Date: 09/28/2016 04:30 PM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> ------------------------------------------------------------------------
>>
>>
>> Hi folks..
>>
>> I have some follow-up questions about corosync daemon status after
>> cluster shutdown.
>>
>> Basically, what should happen to corosync on a cluster node when
>> pacemaker is shut down on that node?
>> On my 5 node cluster, when I do a global shutdown, the pacemaker
>> processes exit, but corosync processes remain active.
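>> (For reference: after a clean stop I would expect to confirm that both
>> daemons are down with something like the following, assuming both are
>> managed as systemd units here:
>>
>>   pcs cluster stop
>>   systemctl is-active corosync pacemaker   # expect "inactive" for both
>> )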
>>
>> Here's an example of where this led me into some trouble...
>>
>> My cluster is still configured to use the "symmetric" resource
>> distribution. I don't have any location constraints in place, so
>> pacemaker tries to evenly distribute resources across all Online nodes.
>>
>> With one cluster node (KVM host) powered off, I did the global cluster
>> stop:
>>
>> [root at zs90KP VD]# date;pcs cluster stop --all
>> Wed Sep 28 15:07:40 EDT 2016
>> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
>> zs90kppcs1: Stopping Cluster (pacemaker)...
>> zs95KLpcs1: Stopping Cluster (pacemaker)...
>> zs95kjpcs1: Stopping Cluster (pacemaker)...
>> zs93kjpcs1: Stopping Cluster (pacemaker)...
>> Error: unable to stop all nodes
>> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
>>
>> Note: The "No route to host" messages are expected because that node /
>> LPAR is powered down.
>>
>> (I don't show it here, but the corosync daemon is still running on the 4
>> active nodes. I do show it later).
>>
>> I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
>> have quorum when it comes up and activates
>> pacemaker, which is enabled to autostart at boot time on all 5 cluster
>> nodes. At this point, only 1 out of 5
>> nodes should be Online to the cluster, and therefore ... no quorum.
>>
>> I log in to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
>> Online, and "partition with quorum":
>
> Corosync determines quorum, pacemaker just uses it. If corosync is
> running, the node contributes to quorum.
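>
> For example, corosync's own view can be checked on any node with
> something like:
>
>   corosync-quorumtool -s
>
> which reports the expected votes, total votes, and whether the partition
> is quorate.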
>
>> [root at zs93kl ~]# date;pcs status |less
>> Wed Sep 28 15:25:13 EDT 2016
>> Cluster name: test_cluster_2
>> Last updated: Wed Sep 28 15:25:13 2016 Last change: Mon Sep 26 16:15:08
>> 2016 by root via crm_resource on zs95kjpcs1
>> Stack: corosync
>> Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>> partition with quorum
>> 106 nodes and 304 resources configured
>>
>> Node zs90kppcs1: pending
>> Node zs93kjpcs1: pending
>> Node zs95KLpcs1: pending
>> Node zs95kjpcs1: pending
>> Online: [ zs93KLpcs1 ]
>>
>> Full list of resources:
>>
>> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> .
>> .
>> .
>>
>>
>> Here you can see that corosync is up on all 5 nodes:
>>
>> [root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync
>> |grep -v grep"; done
>> Wed Sep 28 15:22:21 EDT 2016
>> zs90KP
>> root 155374 1 0 Sep26 ? 00:10:17 corosync
>> zs95KL
>> root 22933 1 0 11:51 ? 00:00:54 corosync
>> zs95kj
>> root 19382 1 0 Sep26 ? 00:10:15 corosync
>> zs93kj
>> root 129102 1 0 Sep26 ? 00:12:10 corosync
>> zs93kl
>> root 21894 1 0 15:19 ? 00:00:00 corosync
>>
>>
>> But pacemaker is only running on the one online node:
>>
>> [root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd
>> |grep -v grep"; done
>> Wed Sep 28 15:23:29 EDT 2016
>> zs90KP
>> zs95KL
>> zs95kj
>> zs93kj
>> zs93kl
>> root 23005 1 0 15:19 ? 00:00:00 /usr/sbin/pacemakerd -f
>> [root at zs95kj VD]#
>>
>>
>> This situation wreaks havoc on my VirtualDomain resources, as the
>> majority of them are in FAILED or Stopped state, and to my
>> surprise... many of them show as Started:
>>
>> [root at zs93kl VD]# date;pcs resource show |grep zs93KL
>> Wed Sep 28 15:55:29 EDT 2016
>> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109064_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109077_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109078_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109083_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109085_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109090_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109092_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109095_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109096_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109097_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109102_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg109104_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110067_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110103_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110093_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110094_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110095_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110097_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110100_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110098_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110105_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110107_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110109_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110110_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110111_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110113_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110114_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110116_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110117_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110119_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110120_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110121_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110122_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110123_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110125_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110126_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110128_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110129_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110131_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110132_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110133_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110134_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110135_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110137_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110138_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110140_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110141_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110143_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110144_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110145_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110146_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110149_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110150_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110154_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110156_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110159_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110160_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
>> zs95kjg110165_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>> zs95kjg110166_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>
>>
>> Pacemaker is attempting to activate all VirtualDomain resources on the
>> one cluster node.
>>
>> So back to my original question... what should happen when I do a
>> cluster stop?
>> If corosync should also be deactivating, what would prevent this?
>>
>> Also, I have tried simulating a failed cluster node (to trigger a
>> STONITH action) by killing the
>> corosync daemon on one node, but all that does is respawn the daemon ...
>
> Probably systemd
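>
> (One way to check is whether the unit is configured to restart itself,
> e.g.:
>
>   systemctl show corosync -p Restart
> )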
>
>> causing a temporary / transient
>> failure condition, and no fence takes place. Is there a way to kill
>> corosync in such a way
>> that it stays down? Is there a best practice for STONITH testing?
>
> It would be nice to collect something like that for the clusterlabs
> wiki. Best practice is to think up as many realistic failure scenarios
> as possible and try them all.
>
> Simplest are pulling the power, and pulling the network cables. Other
> common tests are unloading the network driver kernel module, and using
> the local firewall to block all corosync traffic (inbound *AND* more
> importantly outbound to port 5405 or whatever you're using). Tests
> specific to your applications and hardware setup are desirable, too.
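>
> For the firewall test, something along these lines usually does it
> (adjust to your totem port; 5405/udp is the common default):
>
>   iptables -I INPUT  -p udp --dport 5405 -j DROP
>   iptables -I OUTPUT -p udp --dport 5405 -j DROP
>
> and delete the rules again (iptables -D ...) once the node has been fenced.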
>
>> As usual, thanks in advance for your advice.
>>
>> Scott Greenlese ... IBM KVM on System Z - Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl at us.ibm.com
>>
>>
>>
>>
>>
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: users at clusterlabs.org
>> Date: 09/09/2016 06:23 PM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>> ------------------------------------------------------------------------
>>
>>
>>
>> On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
>>> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>>>
>>>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>>>
>>>> Please see my answers nested below (sections entitled, "Scott's
>>>> Reply"). Thanks!
>>>>
>>>> - Scott
>>>>
>>>>
>>>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>>>> INTERNET: swgreenl at us.ibm.com
>>>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>>>
>>>>
>>>>
>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>> To: users at clusterlabs.org
>>>> Date: 09/08/2016 10:59 AM
>>>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>>>>>
>>>>> Hi all...
>>>>>
>>>>> I have a few very basic questions for the group.
>>>>>
>>>>> I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>>>>> VirtualDomain pacemaker-remote nodes
>>>>> plus 100 "opaque" VirtualDomain resources. The cluster is configured
>>>>> to be 'symmetric' and I have no
>>>>> location constraints on the 200 VirtualDomain resources (other than to
>>>>> prevent the opaque guests
>>>>> from running on the pacemaker remote node resources). My quorum is set
>>>>> as:
>>>>>
>>>>> quorum {
>>>>> provider: corosync_votequorum
>>>>> }
>>>>>
>>>>> As an experiment, I powered down one LPAR in the cluster, leaving 4
>>>>> powered up with the pcsd service running on those 4 survivors
>>>>> but corosync/pacemaker down (pcs cluster stop --all) on them.
>>>>> I then started pacemaker/corosync on a single cluster
>>>>>
>>>>
>>>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
>>>> did you check the status of the individual services?
>>>>
>>>> Scott's reply:
>>>>
>>>> No, I only assumed that pacemaker was down because I got this back on
>>>> my pcs status
>>>> command from each cluster node:
>>>>
>>>> [root at zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>>>> zs93kjpcs1 ; do ssh $host pcs status; done
>>>> Wed Sep 7 15:49:27 EDT 2016
>>>> Error: cluster is not currently running on this node
>>>> Error: cluster is not currently running on this node
>>>> Error: cluster is not currently running on this node
>>>> Error: cluster is not currently running on this node
>>
>> In my experience, this is sufficient to say that pacemaker and corosync
>> aren't running.
>>
>>>>
>>>> What else should I check? The pcsd service was still up,
>>>> since I did not stop that
>>>> anywhere. Should I have done  ps -ef |grep -e pacemaker -e corosync
>>>> to check the state before
>>>> assuming it was really down?
>>>>
>>>>
>>> Guess the answer from Poki should guide you well here ...
>>>>
>>>>
>>>>> node (pcs cluster start), and this resulted in the 200 VirtualDomain
>>>>> resources activating on the single node.
>>>>> This was not what I was expecting. I assumed that no resources would
>>>>> activate / start on any cluster nodes
>>>>> until 3 out of the 5 total cluster nodes had pacemaker/corosync
>> running.
>>
>> Your expectation is correct; I'm not sure what happened in this case.
>> There are some obscure corosync options (e.g. last_man_standing,
>> allow_downscale) that could theoretically lead to this, but I don't get
>> the impression you're using anything unusual.
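>>
>> (If they were set, they would appear in the quorum { } section of
>> corosync.conf, e.g. something like:
>>
>>   quorum {
>>       provider: corosync_votequorum
>>       last_man_standing: 1
>>   }
>>
>> -- mentioned only to show where to look, not as a recommendation.)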
>>
>>>>> After starting pacemaker/corosync on the single host (zs95kjpcs1),
>>>>> this is what I see :
>>>>>
>>>>> [root at zs95kj VD]# date;pcs status |less
>>>>> Wed Sep 7 15:51:17 EDT 2016
>>>>> Cluster name: test_cluster_2
>>>>> Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
>>>>> 2016 by hacluster via crmd on zs93kjpcs1
>>>>> Stack: corosync
>>>>> Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>>>>> partition with quorum
>>>>> 106 nodes and 304 resources configured
>>>>>
>>>>> Node zs93KLpcs1: pending
>>>>> Node zs93kjpcs1: pending
>>>>> Node zs95KLpcs1: pending
>>>>> Online: [ zs95kjpcs1 ]
>>>>> OFFLINE: [ zs90kppcs1 ]
>>>>>
>>>>> .
>>>>> .
>>>>> .
>>>>> PCSD Status:
>>>>> zs93kjpcs1: Online
>>>>> zs95kjpcs1: Online
>>>>> zs95KLpcs1: Online
>>>>> zs90kppcs1: Offline
>>>>> zs93KLpcs1: Online
>>
>> FYI the Online/Offline above refers only to pcsd, which doesn't have any
>> effect on the cluster itself -- just the ability to run pcs commands.
>>
>>>>> So, what exactly constitutes an "Online" vs. "Offline" cluster node
>>>>> w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
>>>>> nodes,
>>>>> so where does that fall? And why "pending"? What does that mean?
>>
>> "pending" means that the node has joined the corosync cluster (which
>> allows it to contribute to quorum), but it has not yet completed the
>> pacemaker join process (basically a handshake with the DC).
>>
>> I think the corosync and pacemaker detail logs would be essential to
>> figuring out what's going on. Check the logs on the "pending" nodes to
>> see whether corosync somehow started up by this point, and check the
>> logs on this node to see what the most recent references to the pending
>> nodes were.
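>>
>> For example, something like:
>>
>>   grep -i -e corosync -e pacemakerd /var/log/messages
>>
>> on each "pending" node (or wherever your cluster logs go, e.g.
>> /var/log/cluster/corosync.log) should show whether and when corosync
>> came back up there.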
>>
>>>>> Also, what exactly is the cluster's expected reaction to quorum loss?
>>>>> Cluster resources will be stopped or something else?
>>>>>
>>>> Depends on how you configure it using cluster property no-quorum-policy
>>>> (default: stop).
>>>>
>>>> Scott's reply:
>>>>
>>>> This is how the policy is configured:
>>>>
>>>> [root at zs95kj VD]# date;pcs config |grep quorum
>>>> Thu Sep  8 13:18:33 EDT 2016
>>>>  no-quorum-policy: stop
>>>>
>>>> What should I expect with the 'stop' setting?
>>>>
>>>>
>>>>>
>>>>>
>>>>> Where can I find this documentation?
>>>>>
>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/
>>>>
>>>> Scott's reply:
>>>>
>>>> OK, I'll keep looking through this doc, but I don't easily find the
>>>> no-quorum-policy explained.
>>>>
>>> Well, the index leads you to:
>>>
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html
>>> where you find an exhaustive description of the option.
>>>
>>> In short:
>>> you are running the default and that leads to all resources being
>>> stopped in a partition without quorum
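>>>
>>> For example, to inspect or change it:
>>>
>>>   pcs property show no-quorum-policy
>>>   pcs property set no-quorum-policy=freeze   # or ignore / stop / suicide
>>>
>>> (freeze keeps resources that are already running in the quorum-less
>>> partition but won't recover anything else; stop is the default.)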
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



