[ClusterLabs] Pacemaker quorum behavior

Ken Gaillot kgaillot at redhat.com
Thu Sep 29 00:14:13 CEST 2016


On 09/28/2016 03:57 PM, Scott Greenlese wrote:
> A quick addendum...
> 
> After sending this post, I decided to stop pacemaker on the single,
> Online node in the cluster,
> and this effectively killed the corosync daemon:
> 
> [root at zs93kl VD]# date;pcs cluster stop
> Wed Sep 28 16:39:22 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

Correct, "pcs cluster stop" tries to stop both pacemaker and corosync.

> [root at zs93kl VD]# date;ps -ef |grep coro|grep -v grep
> Wed Sep 28 16:46:19 EDT 2016

Totally irrelevant, but a little trick I picked up somewhere: when
grepping for a process, square-bracketing one of the characters lets you
avoid the "grep -v grep", e.g. "ps -ef | grep cor[o]" -- the grep
process's own command line contains the literal "cor[o]", which the
pattern (matching "coro") doesn't hit.

It's nice when I remember to use it ;)

> [root at zs93kl VD]#
> 
> 
> 
> Next, I went to a node in "Pending" state, and sure enough... the pcs
> cluster stop killed the daemon there, too:
> 
> [root at zs95kj VD]# date;pcs cluster stop
> Wed Sep 28 16:48:15 EDT 2016
> Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...
> 
> [root at zs95kj VD]# date;ps -ef |grep coro |grep -v grep
> Wed Sep 28 16:48:38 EDT 2016
> [root at zs95kj VD]#
> 
> So, this answers my own question... cluster stop should kill corosync.
> So, why is `pcs cluster stop --all` failing to kill corosync?

It should. At least you've narrowed it down :)

> Thanks...
> 
> 
> Scott Greenlese ... IBM KVM on System Z Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
> 
> 
> 
> 
> From: Scott Greenlese/Poughkeepsie/IBM
> To: kgaillot at redhat.com, Cluster Labs - All topics related to
> open-source clustering welcomed <users at clusterlabs.org>
> Date: 09/28/2016 04:30 PM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
> 
> ------------------------------------------------------------------------
> 
> 
> Hi folks..
> 
> I have some follow-up questions about corosync daemon status after
> cluster shutdown.
> 
> Basically, what should happen to corosync on a cluster node when
> pacemaker is shut down on that node?
> On my 5 node cluster, when I do a global shutdown, the pacemaker
> processes exit, but corosync processes remain active.
> 
> Here's an example of where this led me into some trouble...
> 
> My cluster is still configured to use the "symmetric" resource
> distribution. I don't have any location constraints in place, so
> pacemaker tries to evenly distribute resources across all Online nodes.
> 
> With one cluster node (KVM host) powered off, I did the global cluster
> stop:
> 
> [root at zs90KP VD]# date;pcs cluster stop --all
> Wed Sep 28 15:07:40 EDT 2016
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
> zs90kppcs1: Stopping Cluster (pacemaker)...
> zs95KLpcs1: Stopping Cluster (pacemaker)...
> zs95kjpcs1: Stopping Cluster (pacemaker)...
> zs93kjpcs1: Stopping Cluster (pacemaker)...
> Error: unable to stop all nodes
> zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
> 
> Note: The "No route to host" messages are expected because that node /
> LPAR is powered down.
> 
> (I don't show it here, but the corosync daemon is still running on the 4
> active nodes. I do show it later).
> 
> I then powered on the one zs93KLpcs1 LPAR, so in theory I should not
> have quorum when it comes up and activates
> pacemaker, which is enabled to autostart at boot time on all 5 cluster
> nodes. At this point, only 1 out of 5
> nodes should be Online to the cluster, and therefore ... no quorum.
> 
> I log in to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending'
> Online, and "partition with quorum":

Corosync determines quorum; pacemaker just uses it. If corosync is
running, the node contributes to quorum.
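
If you want to see corosync's own view of things, independent of
pacemaker, something like this on any node where corosync is up should
show the member list and whether that partition is quorate:

  # corosync's membership and quorum status; the "Quorate" line here is
  # what pacemaker's "partition with quorum" ultimately reflects
  corosync-quorumtool -s

That would be a quick way to confirm whether the "pending" nodes are
really contributing votes at the corosync level.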

> [root at zs93kl ~]# date;pcs status |less
> Wed Sep 28 15:25:13 EDT 2016
> Cluster name: test_cluster_2
> Last updated: Wed Sep 28 15:25:13 2016 Last change: Mon Sep 26 16:15:08
> 2016 by root via crm_resource on zs95kjpcs1
> Stack: corosync
> Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
> partition with quorum
> 106 nodes and 304 resources configured
> 
> Node zs90kppcs1: pending
> Node zs93kjpcs1: pending
> Node zs95KLpcs1: pending
> Node zs95kjpcs1: pending
> Online: [ zs93KLpcs1 ]
> 
> Full list of resources:
> 
> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> .
> .
> .
> 
> 
> Here you can see that corosync is up on all 5 nodes:
> 
> [root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
> zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync
> |grep -v grep"; done
> Wed Sep 28 15:22:21 EDT 2016
> zs90KP
> root 155374 1 0 Sep26 ? 00:10:17 corosync
> zs95KL
> root 22933 1 0 11:51 ? 00:00:54 corosync
> zs95kj
> root 19382 1 0 Sep26 ? 00:10:15 corosync
> zs93kj
> root 129102 1 0 Sep26 ? 00:12:10 corosync
> zs93kl
> root 21894 1 0 15:19 ? 00:00:00 corosync
> 
> 
> But, pacemaker is only running on the one, online node:
> 
> [root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
> zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd
> |grep -v grep"; done
> Wed Sep 28 15:23:29 EDT 2016
> zs90KP
> zs95KL
> zs95kj
> zs93kj
> zs93kl
> root 23005 1 0 15:19 ? 00:00:00 /usr/sbin/pacemakerd -f
> You have new mail in /var/spool/mail/root
> [root at zs95kj VD]#
> 
> 
> This situation wreaks havoc on my VirtualDomain resources, as the
> majority of them are in FAILED or Stopped state, and to my
> surprise... many of them show as Started:
> 
> [root at zs93kl VD]# date;pcs resource show |grep zs93KL
> Wed Sep 28 15:55:29 EDT 2016
> zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109064_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109077_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109078_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109083_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109085_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109090_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109092_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109095_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109096_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109097_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109102_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg109104_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110067_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110103_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110093_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110094_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110095_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110097_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110100_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110098_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110105_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110107_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110108_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110109_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110110_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110111_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110113_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110114_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110116_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110117_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110119_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110120_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110121_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110122_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110123_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110125_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110126_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110128_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110129_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110130_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110131_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110132_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110133_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110134_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110135_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110137_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110138_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110140_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110141_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110143_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110144_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110145_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110146_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110149_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110150_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110154_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110155_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110156_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110159_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110160_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110164_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
> zs95kjg110165_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> zs95kjg110166_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
> 
> 
> Pacemaker is attempting to activate all VirtualDomain resources on the
> one cluster node.
> 
> So back to my original question... what should happen to corosync when
> I do a cluster stop?
> If it should be deactivating, what could prevent that from happening?
> 
> Also, I have tried simulating a failed cluster node (to trigger a
> STONITH action) by killing the
> corosync daemon on one node, but all that does is respawn the daemon ...

Probably systemd
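
I haven't checked your corosync unit file, but you can see whether
systemd is set to respawn the daemon with something like:

  # show the Restart= setting of the corosync unit; anything other than
  # "Restart=no" means systemd will bring the daemon back after a kill
  systemctl show corosync.service -p Restart

If it is being respawned, killing the process will only ever produce a
brief membership blip rather than a lasting node failure.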

> causing a temporary / transient
> failure condition, and no fence takes place. Is there a way to kill
> corosync in such a way
> that it stays down? Is there a best practice for STONITH testing?

It would be nice to collect something like that for the clusterlabs
wiki. Best practice is to think up as many realistic failure scenarios
as possible and try them all.

The simplest are pulling the power and pulling the network cables. Other
common tests are unloading the network driver kernel module, and using
the local firewall to block all corosync traffic (inbound *AND*, more
importantly, outbound to port 5405 or whatever you're using). Tests
specific to your applications and hardware setup are desirable, too.
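
For the firewall variant, a rough sketch with iptables (assuming the
default corosync port of 5405/udp) would be:

  # isolate this node from the cluster: drop corosync traffic both ways
  iptables -A INPUT  -p udp --dport 5405 -j DROP
  iptables -A OUTPUT -p udp --dport 5405 -j DROP

The other nodes should then declare this node lost and fence it. Just
remember to remove the rules (iptables -D ..., or reload the firewall)
after the test.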

> As usual, thanks in advance for your advice.
> 
> Scott Greenlese ... IBM KVM on System Z - Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
> 
> 
> 
> 
> 
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Date: 09/09/2016 06:23 PM
> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
> ------------------------------------------------------------------------
> 
> 
> 
> On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
>> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>>
>>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>>
>>> Please see my answers nested below (sections entitled, "Scott's
>>> Reply"). Thanks!
>>>
>>> - Scott
>>>
>>>
>>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>>> INTERNET: swgreenl at us.ibm.com
>>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>>
>>>
>>>
>>> From: Klaus Wenninger <kwenning at redhat.com>
>>> To: users at clusterlabs.org
>>> Date: 09/08/2016 10:59 AM
>>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>>> >
>>> > Hi all...
>>> >
>>> > I have a few very basic questions for the group.
>>> >
>>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>>> > VirtualDomain pacemaker-remote nodes
>>> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
>>> > to be 'symmetric' and I have no
>>> > location constraints on the 200 VirtualDomain resources (other than to
>>> > prevent the opaque guests
>>> > from running on the pacemaker remote node resources). My quorum is set
>>> > as:
>>> >
>>> > quorum {
>>> > provider: corosync_votequorum
>>> > }
>>> >
>>> > As an experiment, I powered down one LPAR in the cluster, leaving 4
>>> > powered up with the pcsd service up on the 4 survivors
>>> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
>>> > survivors. I then started pacemaker/corosync on a single cluster
>>> >
>>>
>>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
>>> did you check the status of the individual services?
>>>
>>> Scott's reply:
>>>
>>> No, I only assumed that pacemaker was down because I got this back on
>>> my pcs status
>>> command from each cluster node:
>>>
>>> [root at zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>>> zs93kjpcs1 ; do ssh $host pcs status; done
>>> Wed Sep 7 15:49:27 EDT 2016
>>> Error: cluster is not currently running on this node
>>> Error: cluster is not currently running on this node
>>> Error: cluster is not currently running on this node
>>> Error: cluster is not currently running on this node
> 
> In my experience, this is sufficient to say that pacemaker and corosync
> aren't running.
> 
>>>
>>> What else should I check? The pcsd service was still up, since I
>>> didn't stop that anywhere. Should I have done "ps -ef | grep -e
>>> pacemaker -e corosync" to check the state before assuming it was
>>> really down?
>>>
>>>
>> Guess the answer from Poki should guide you well here ...
>>>
>>>
>>> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
>>> > resources activating on the single node.
>>> > This was not what I was expecting. I assumed that no resources would
>>> > activate / start on any cluster nodes
>>> > until 3 out of the 5 total cluster nodes had pacemaker/corosync
> running.
> 
> Your expectation is correct; I'm not sure what happened in this case.
> There are some obscure corosync options (e.g. last_man_standing,
> allow_downscale) that could theoretically lead to this, but I don't get
> the impression you're using anything unusual.
> 
>>> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
>>> > this is what I see :
>>> >
>>> > [root at zs95kj VD]# date;pcs status |less
>>> > Wed Sep 7 15:51:17 EDT 2016
>>> > Cluster name: test_cluster_2
>>> > Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
>>> > 2016 by hacluster via crmd on zs93kjpcs1
>>> > Stack: corosync
>>> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>>> > partition with quorum
>>> > 106 nodes and 304 resources configured
>>> >
>>> > Node zs93KLpcs1: pending
>>> > Node zs93kjpcs1: pending
>>> > Node zs95KLpcs1: pending
>>> > Online: [ zs95kjpcs1 ]
>>> > OFFLINE: [ zs90kppcs1 ]
>>> >
>>> > .
>>> > .
>>> > .
>>> > PCSD Status:
>>> > zs93kjpcs1: Online
>>> > zs95kjpcs1: Online
>>> > zs95KLpcs1: Online
>>> > zs90kppcs1: Offline
>>> > zs93KLpcs1: Online
> 
> FYI the Online/Offline above refers only to pcsd, which doesn't have any
> effect on the cluster itself -- just the ability to run pcs commands.
> 
>>> > So, what exactly constitutes an "Online" vs. "Offline" cluster node
>>> > w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
>>> > nodes,
>>> > so where does that fall? Any why "pending"? What does that mean?
> 
> "pending" means that the node has joined the corosync cluster (which
> allows it to contribute to quorum), but it has not yet completed the
> pacemaker join process (basically a handshake with the DC).
> 
> I think the corosync and pacemaker detail logs would be essential to
> figuring out what's going on. Check the logs on the "pending" nodes to
> see whether corosync somehow started up by this point, and check the
> logs on this node to see what the most recent references to the pending
> nodes were.
> 
>>> > Also, what exactly is the cluster's expected reaction to quorum loss?
>>> > Cluster resources will be stopped or something else?
>>> >
>>> Depends on how you configure it using cluster property no-quorum-policy
>>> (default: stop).
>>>
>>> Scott's reply:
>>>
>>> This is how the policy is configured:
>>>
>>> [root at zs95kj VD]# date;pcs config |grep quorum
>>> Thu Sep  8 13:18:33 EDT 2016
>>>  no-quorum-policy: stop
>>>
>>> What should I expect with the 'stop' setting?
>>>
>>>
>>> >
>>> >
>>> > Where can I find this documentation?
>>> >
>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/
>>>
>>> Scott's reply:
>>>
>>> OK, I'll keep looking thru this doc, but I don't easily find the
>>> no-quorum-policy explained.
>>>
>> Well, the index leads you to:
>>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html
>> where you find an exhaustive description of the option.
>>
>> In short:
>> you are running the default and that leads to all resources being
>> stopped in a partition without quorum


