[ClusterLabs] Pacemaker quorum behavior

Scott Greenlese swgreenl at us.ibm.com
Wed Sep 28 20:57:16 UTC 2016


A quick addendum...

After sending this post, I decided to stop pacemaker on the single Online
node in the cluster, and this also stopped the corosync daemon:

[root at zs93kl VD]# date;pcs cluster stop
Wed Sep 28 16:39:22 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...


[root at zs93kl VD]# date;ps -ef |grep coro|grep -v grep
Wed Sep 28 16:46:19 EDT 2016
[root at zs93kl VD]#



Next, I went to a node in "Pending" state, and sure enough... the pcs
cluster stop killed the daemon there, too:

[root at zs95kj VD]# date;pcs cluster stop
Wed Sep 28 16:48:15 EDT 2016
Stopping Cluster (pacemaker)... Stopping Cluster (corosync)...

[root at zs95kj VD]# date;ps -ef |grep coro |grep -v grep
Wed Sep 28 16:48:38 EDT 2016
[root at zs95kj VD]#

So, this answers my own question... a per-node `pcs cluster stop` does stop
corosync. Which leaves the question: why did `pcs cluster stop --all` fail to
stop corosync on the surviving nodes?
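
For reference, here is a quick way to confirm what is actually left running
on every node right after the stop (a minimal sketch, assuming the same
passwordless ssh setup used in the loops later in this post, and that systemd
manages both daemons):

  for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1; do
      # print the node name, then the systemd state of each daemon
      ssh $host 'hostname; systemctl is-active corosync pacemaker'
  done

An "active" result for corosync on any node after `pcs cluster stop --all`
would confirm the behavior described above.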

Thanks...


Scott Greenlese ... IBM KVM on System Z Test,  Poughkeepsie, N.Y.
  INTERNET:  swgreenl at us.ibm.com





From:	Scott Greenlese/Poughkeepsie/IBM
To:	kgaillot at redhat.com, Cluster Labs - All topics related to
            open-source clustering welcomed <users at clusterlabs.org>
Date:	09/28/2016 04:30 PM
Subject:	Re: [ClusterLabs] Pacemaker quorum behavior


Hi folks..

I have some follow-up questions about corosync daemon status after cluster
shutdown.

Basically, what should happen to corosync on a cluster node when pacemaker
is shut down on that node?
On my 5-node cluster, when I do a global shutdown (pcs cluster stop --all),
the pacemaker processes exit, but the corosync processes remain active.
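
One way to see how the two services are tied together on a given node is to
ask systemd directly (a sketch; the exact properties reported depend on the
packaged unit files):

  # How the units depend on and order against each other
  systemctl show pacemaker.service -p Requires -p After
  systemctl show corosync.service -p Requires -p After

  # What is still active once pacemaker has been stopped
  systemctl is-active corosync pacemaker

With the usual packaging, pacemaker requires corosync but not the other way
around, so stopping pacemaker alone would not be expected to take corosync
down with it; `pcs cluster stop`, on the other hand, is expected to stop both
explicitly.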

Here's an example of where this led me into some trouble...

My cluster is still configured to use the "symmetric" resource
distribution.   I don't have any location constraints in place, so
pacemaker tries to evenly distribute resources across all Online nodes.
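
A quick way to double-check both of those settings, as a sketch using the
pcs 0.9.x syntax this cluster appears to be running:

  # symmetric-cluster defaults to true; --all shows defaults as well as
  # explicitly set properties
  pcs property list --all | grep symmetric-cluster

  # list whatever location constraints do exist
  pcs constraint location show

The only location constraints expected here are the ones keeping the opaque
guests off the pacemaker-remote nodes, as described in the original post
below.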

With one cluster node (KVM host) powered off, I did the global cluster
stop:

[root at zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note:  The "No route to host" messages are expected because that node /
LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4
active nodes. I do show it later).

I then powered the zs93KLpcs1 LPAR back on.  Pacemaker is enabled to
autostart at boot time on all 5 cluster nodes, so when the LPAR comes up and
starts pacemaker, only 1 out of 5
nodes should be Online to the cluster, and therefore ... no quorum.
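
With five votes, votequorum needs floor(5/2)+1 = 3 of them, so a lone member
should come up inquorate. A minimal check from the freshly booted node, as a
sketch (output details vary a little by corosync version):

  # Membership and quorum as corosync sees them right now;
  # look at "Expected votes", "Total votes" and the "Quorate:" line
  corosync-quorumtool -s

  # What pacemaker will do with resources while quorum is missing
  pcs property list --all | grep no-quorum-policy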

I log in to zs93KLpcs1, and pcs status shows the other 4 nodes as 'pending'
and the cluster as a "partition with quorum":

[root at zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016          Last change: Mon Sep 26
16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition
with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

 zs95kjg109062_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
.
.
.


Here you can see that corosync is up on all 5 nodes:

[root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep
-v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root     155374      1  0 Sep26 ?        00:10:17 corosync
zs95KL
root      22933      1  0 11:51 ?        00:00:54 corosync
zs95kj
root      19382      1  0 Sep26 ?        00:10:15 corosync
zs93kj
root     129102      1  0 Sep26 ?        00:12:10 corosync
zs93kl
root      21894      1  0 15:19 ?        00:00:00 corosync


But pacemaker is only running on the one Online node:

[root at zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1
zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd |
grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root      23005      1  0 15:19 ?        00:00:00 /usr/sbin/pacemakerd -f
[root at zs95kj VD]#


This situation wreaks havoc on my VirtualDomain resources, as the majority
of them are in FAILED or Stopped state, and to my
surprise... many of them show as Started:

[root at zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
 zs95kjg109062_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109064_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109065_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109066_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109068_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109069_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109070_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109071_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109072_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109073_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109074_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109075_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109076_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109077_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109078_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109079_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109080_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109081_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109082_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109083_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109084_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109085_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109086_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109087_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109088_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109089_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109090_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109092_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109095_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109096_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109097_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109101_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109102_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109104_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110063_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110065_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110066_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110067_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110068_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110069_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110070_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110071_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110072_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110073_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110074_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110075_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110076_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110079_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110080_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110081_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110082_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110084_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110086_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110087_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110088_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110089_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110103_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110104_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110093_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110094_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110095_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110097_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110099_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110100_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110101_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110102_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110098_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110105_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110106_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110107_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110108_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110109_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110110_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110111_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110112_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110113_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110114_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110115_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110116_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110117_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110118_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110119_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110120_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110121_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110122_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110123_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110124_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110125_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110126_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110128_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110129_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110130_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110131_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110132_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110133_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110134_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110135_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110137_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110138_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110139_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110140_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110141_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110142_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110143_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110144_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110145_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110146_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110148_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110149_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110150_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110152_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110154_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110155_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110156_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110159_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110160_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110161_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110164_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110165_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110166_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1


Pacemaker is attempting to activate all VirtualDomain resources on the one
cluster node.

So, back to my original question... what should happen to corosync when I do
a cluster stop?
If it should be stopped along with pacemaker, what could prevent that from
happening?
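
One thing that might be worth ruling out is systemd restarting corosync
behind the scenes, or a local drop-in changing its behavior (a sketch; a
stock corosync unit would not normally have Restart set, but overrides are
easy to forget about):

  # Is systemd configured to restart corosync when it exits?
  systemctl show corosync.service -p Restart

  # Show the unit file plus any drop-in fragments that override it
  systemctl cat corosync.service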

Also, I have tried simulating a failed cluster node (to trigger a STONITH
action) by killing the corosync daemon on one node, but all that does is
respawn the daemon, causing only a transient failure condition, and no fence
takes place.   Is there a way to kill corosync in such a way that it stays
down?   Is there a best practice for STONITH testing?
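
Two common ways to exercise fencing without relying on a corosync kill,
shown here as a sketch (zs90kppcs1 is only an example target, and the sysrq
crash is destructive, so use it only on a node you are prepared to lose):

  # Ask the cluster to fence a specific node and watch the result
  pcs stonith fence zs90kppcs1
  pcs status | grep -i -e fence -e unclean

  # Or crash the victim node itself so corosync genuinely stops responding
  # (run these ON the node to be fenced; requires root and sysrq support)
  echo 1 > /proc/sys/kernel/sysrq
  echo c > /proc/sysrq-trigger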

As usual, thanks in advance for your advice.

Scott Greenlese ... IBM KVM on System Z -  Solutions Test,  Poughkeepsie,
N.Y.
  INTERNET:  swgreenl at us.ibm.com






From:	Ken Gaillot <kgaillot at redhat.com>
To:	users at clusterlabs.org
Date:	09/09/2016 06:23 PM
Subject:	Re: [ClusterLabs] Pacemaker quorum behavior



On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl at us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>>
>>
>> From: Klaus Wenninger <kwenning at redhat.com>
>> To: users at clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomain pacemaker-remote nodes
>> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
>> > to be 'symmetric' and I have no
>> > location constraints on the 200 VirtualDomain resources (other than to
>> > prevent the opaque guests
>> > from running on the pacemaker remote node resources). My quorum is set
>> > as:
>> >
>> > quorum {
>> > provider: corosync_votequorum
>> > }
>> >
>> > As an experiment, I powered down one LPAR in the cluster, leaving 4
>> > powered up with the pcsd service up on the 4 survivors
>> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
>> > survivors. I then started pacemaker/corosync on a single cluster
>> >
>>
>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster
but
>> did you check the status of the individual services?
>>
>> Scott's reply:
>>
>> No, I only assumed that pacemaker was down because I got this back on
>> my pcs status
>> command from each cluster node:
>>
>> [root at zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 ; do ssh $host pcs status; done
>> Wed Sep 7 15:49:27 EDT 2016
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node

In my experience, this is sufficient to say that pacemaker and corosync
aren't running.

>>
>> What else should I check?  The pcsd service was still up, since I didn't
>> stop it anywhere.  Should I have run  ps -ef |grep -e pacemaker -e corosync
>> to check the state before assuming it was really down?
>>
>>
> Guess the answer from Poki should guide you well here ...
>>
>>
>> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
>> > resources activating on the single node.
>> > This was not what I was expecting. I assumed that no resources would
>> > activate / start on any cluster nodes
>> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.

Your expectation is correct; I'm not sure what happened in this case.
There are some obscure corosync options (e.g. last_man_standing,
allow_downscale) that could theoretically lead to this, but I don't get
the impression you're using anything unusual.
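
For completeness, a quick way to check whether any of those options are
actually in effect, as a sketch to be run on any member node:

  # Static configuration in the config file
  grep -E 'last_man_standing|allow_downscale|wait_for_all|two_node' /etc/corosync/corosync.conf

  # The quorum keys votequorum is actually running with
  corosync-cmapctl | grep -i quorum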

>> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
>> > this is what I see :
>> >
>> > [root at zs95kj VD]# date;pcs status |less
>> > Wed Sep 7 15:51:17 EDT 2016
>> > Cluster name: test_cluster_2
>> > Last updated: Wed Sep 7 15:51:18 2016 Last change: Wed Sep 7 15:30:12
>> > 2016 by hacluster via crmd on zs93kjpcs1
>> > Stack: corosync
>> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>> > partition with quorum
>> > 106 nodes and 304 resources configured
>> >
>> > Node zs93KLpcs1: pending
>> > Node zs93kjpcs1: pending
>> > Node zs95KLpcs1: pending
>> > Online: [ zs95kjpcs1 ]
>> > OFFLINE: [ zs90kppcs1 ]
>> >
>> > .
>> > .
>> > .
>> > PCSD Status:
>> > zs93kjpcs1: Online
>> > zs95kjpcs1: Online
>> > zs95KLpcs1: Online
>> > zs90kppcs1: Offline
>> > zs93KLpcs1: Online

FYI the Online/Offline above refers only to pcsd, which doesn't have any
effect on the cluster itself -- just the ability to run pcs commands.

>> > So, what exactly constitutes an "Online" vs. "Offline" cluster node
>> > w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
>> > nodes,
>> > so where does that fall? Any why "pending"? What does that mean?

"pending" means that the node has joined the corosync cluster (which
allows it to contribute to quorum), but it has not yet completed the
pacemaker join process (basically a handshake with the DC).

I think the corosync and pacemaker detail logs would be essential to
figuring out what's going on. Check the logs on the "pending" nodes to
see whether corosync somehow started up by this point, and check the
logs on this node to see what the most recent references to the pending
nodes were.
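
A couple of commands that make that check easier, as a sketch (the detail
log file name depends on the logging settings in corosync.conf;
/var/log/cluster/corosync.log and /var/log/pacemaker.log are the common
locations):

  # On a "pending" node: was corosync already running, and when did it last
  # (re)start?
  journalctl -u corosync -u pacemaker --since "2016-09-28 15:00"

  # On the Online node: recent membership/quorum traffic in the detail log
  grep -i -e membership -e quorum /var/log/cluster/corosync.log | tail -n 50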

>> > Also, what exactly is the cluster's expected reaction to quorum loss?
>> > Cluster resources will be stopped or something else?
>> >
>> Depends on how you configure it using cluster property no-quorum-policy
>> (default: stop).
>>
>> Scott's reply:
>>
>> This is how the policy is configured:
>>
>> [root at zs95kj VD]# date;pcs config |grep quorum
>> Thu Sep  8 13:18:33 EDT 2016
>>  no-quorum-policy: stop
>>
>> What should I expect with the 'stop' setting?
>>
>>
>> >
>> >
>> > Where can I find this documentation?
>> >
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/
>>
>> Scott's reply:
>>
>> OK, I'll keep looking thru this doc, but I don't easily find the
>> no-quorum-policy explained.
>>
> Well, the index leads you to:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html
> where you find an exhaustive description of the option.
>
> In short:
> you are running the default and that leads to all resources being
> stopped in a partition without quorum
>
>> Thanks..
>>
>>
>> >
>> >
>> > Thanks!
>> >
>> > Scott Greenlese - IBM Solution Test Team.
>> >
>> >
>> >
>> > Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> > INTERNET: swgreenl at us.ibm.com
>> > PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966

_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



