Hi folks,

I have some follow-up questions about corosync daemon status after cluster shutdown.

Basically, what should happen to corosync on a cluster node when pacemaker is shut down on that node?
On my 5-node cluster, when I do a global shutdown, the pacemaker processes exit, but the corosync processes remain active.

Here's an example of where this led me into some trouble...

My cluster is still configured for "symmetric" resource distribution. I don't have any location constraints in place, so pacemaker tries to evenly distribute resources across all Online nodes.

With one cluster node (KVM host) powered off, I did the global cluster stop:

[root@zs90KP VD]# date;pcs cluster stop --all
Wed Sep 28 15:07:40 EDT 2016
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)
zs90kppcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs93kjpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs93KLpcs1: Unable to connect to zs93KLpcs1 ([Errno 113] No route to host)

Note: The "No route to host" messages are expected because that node / LPAR is powered down.

(I don't show it here, but the corosync daemon is still running on the 4 active nodes. I do show it later.)

I then powered the zs93KLpcs1 LPAR back on. In theory there should be no quorum when it comes up and starts pacemaker, which is enabled to autostart at boot time on all 5 cluster nodes. At this point, only 1 out of 5 nodes should be Online to the cluster, and therefore ... no quorum.

I log in to zs93KLpcs1, and pcs status shows those 4 nodes as 'pending' Online, and "partition with quorum":

[root@zs93kl ~]# date;pcs status |less
Wed Sep 28 15:25:13 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep 28 15:25:13 2016   Last change: Mon Sep 26 16:15:08 2016 by root via crm_resource on zs95kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
106 nodes and 304 resources configured

Node zs90kppcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Node zs95kjpcs1: pending
Online: [ zs93KLpcs1 ]

Full list of resources:

 zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 .
 .
 .

Here you can see that corosync is up on all 5 nodes:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep corosync |grep -v grep"; done
Wed Sep 28 15:22:21 EDT 2016
zs90KP
root 155374 1 0 Sep26 ? 00:10:17 corosync
zs95KL
root 22933 1 0 11:51 ? 00:00:54 corosync
zs95kj
root 19382 1 0 Sep26 ? 00:10:15 corosync
zs93kj
root 129102 1 0 Sep26 ? 00:12:10 corosync
zs93kl
root 21894 1 0 15:19 ? 00:00:00 corosync

But pacemaker is only running on the one online node:

[root@zs95kj VD]# date;for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do ssh $host "hostname;ps -ef |grep pacemakerd |grep -v grep"; done
Wed Sep 28 15:23:29 EDT 2016
zs90KP
zs95KL
zs95kj
zs93kj
zs93kl
root 23005 1 0 15:19 ? 00:00:00 /usr/sbin/pacemakerd -f
[root@zs95kj VD]#
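(A per-node check of both daemons plus the local quorum view, assuming the standard corosync/pacemaker systemd units and the corosync-quorumtool utility that ships with corosync 2.x, would be something like the following sketch:)

# Report whether the corosync and pacemaker services are active on each node:
for host in zs90kppcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 zs93KLpcs1 ; do
    ssh $host 'hostname; systemctl is-active corosync pacemaker'
done

# Show this node's quorum view (votes, expected votes, quorate or not):
corosync-quorumtool -s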
This situation wreaks havoc on my VirtualDomain resources, as the majority of them are in FAILED or Stopped state, and to my surprise... many of them show as Started:

[root@zs93kl VD]# date;pcs resource show |grep zs93KL
Wed Sep 28 15:55:29 EDT 2016
 zs95kjg109062_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109064_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109077_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109078_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109083_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109085_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109090_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109092_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109095_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109096_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109097_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109102_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109104_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110063_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110065_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110066_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110067_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110068_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110069_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110070_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110071_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110072_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110073_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110074_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110075_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110076_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110080_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110081_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110082_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110084_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110086_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110087_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110088_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110089_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110103_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110093_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110094_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110095_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110097_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110100_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110101_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110098_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110105_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110107_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110108_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110109_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110110_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110111_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110113_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110114_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110116_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110117_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110119_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110120_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110121_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110122_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110123_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110125_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110126_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110128_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110129_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110130_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110131_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110132_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110133_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110134_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110135_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110137_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110138_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110140_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110141_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110143_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110144_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110145_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110146_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110149_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110150_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110154_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110155_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110156_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110159_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110160_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110164_res (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg110165_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110166_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1

Pacemaker is attempting to activate all VirtualDomain resources on the one cluster node.
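(For reference, the two cluster properties that should govern this behavior can be listed with something like the following; a sketch only, assuming stock pcs, since both properties come up elsewhere in this thread:)

# symmetric-cluster (default: true) lets resources run on any node;
# no-quorum-policy (default: stop) controls what happens without quorum.
pcs property list --all | grep -E 'symmetric-cluster|no-quorum-policy'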
So, back to my original question... what should happen when I do a cluster stop?
If the resources should be stopping once quorum is lost, what would prevent that from happening here?

Also, I have tried simulating a failed cluster node (to trigger a STONITH action) by killing the corosync daemon on one node, but all that does is respawn the daemon, causing a temporary / transient failure condition, and no fence takes place. Is there a way to kill corosync in such a way that it stays down? Is there a best practice for STONITH testing?
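For example, would requesting a fence directly be the preferred approach? Something along these lines (a sketch only, using the pcs and stonith_admin commands as I understand them, with zs90kppcs1 as an arbitrary example target):

# Ask the cluster to fence a specific node outright:
pcs stonith fence zs90kppcs1

# Or request a reboot of the node through the fencer directly:
stonith_admin --reboot zs90kppcs1

# Review recent fencing actions for that node afterwards:
stonith_admin --history zs90kppcs1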
As usual, thanks in advance for your advice.

Scott Greenlese ... IBM KVM on System Z - Solutions Test, Poughkeepsie, N.Y.
INTERNET: swgreenl@us.ibm.com


From: Ken Gaillot <kgaillot@redhat.com>
To: users@clusterlabs.org
Date: 09/09/2016 06:23 PM
Subject: Re: [ClusterLabs] Pacemaker quorum behavior

On 09/09/2016 04:27 AM, Klaus Wenninger wrote:
> On 09/08/2016 07:31 PM, Scott Greenlese wrote:
>>
>> Hi Klaus, thanks for your prompt and thoughtful feedback...
>>
>> Please see my answers nested below (sections entitled, "Scott's
>> Reply"). Thanks!
>>
>> - Scott
>>
>> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl@us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>> From: Klaus Wenninger <kwenning@redhat.com>
>> To: users@clusterlabs.org
>> Date: 09/08/2016 10:59 AM
>> Subject: Re: [ClusterLabs] Pacemaker quorum behavior
>>
>> ------------------------------------------------------------------------
>>
>> On 09/08/2016 03:55 PM, Scott Greenlese wrote:
>> >
>> > Hi all...
>> >
>> > I have a few very basic questions for the group.
>> >
>> > I have a 5 node (Linux on Z LPARs) pacemaker cluster with 100
>> > VirtualDomain pacemaker-remote nodes
>> > plus 100 "opaque" VirtualDomain resources. The cluster is configured
>> > to be 'symmetric' and I have no
>> > location constraints on the 200 VirtualDomain resources (other than to
>> > prevent the opaque guests
>> > from running on the pacemaker remote node resources). My quorum is set
>> > as:
>> >
>> > quorum {
>> >     provider: corosync_votequorum
>> > }
>> >
>> > As an experiment, I powered down one LPAR in the cluster, leaving 4
>> > powered up with the pcsd service up on the 4 survivors
>> > but corosync/pacemaker down (pcs cluster stop --all) on the 4
>> > survivors. I then started pacemaker/corosync on a single cluster
>> >
>>
>> "pcs cluster stop" shuts down pacemaker & corosync on my test-cluster but
>> did you check the status of the individual services?
>>
>> Scott's reply:
>>
>> No, I only assumed that pacemaker was down because I got this back on
>> my pcs status
>> command from each cluster node:
>>
>> [root@zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1
>> zs93kjpcs1 ; do ssh $host pcs status; done
>> Wed Sep 7 15:49:27 EDT 2016
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node
>> Error: cluster is not currently running on this node

In my experience, this is sufficient to say that pacemaker and corosync
aren't running.

>>
>> What else should I check? The pcsd.service service was still up,
>> since I didn't stop that
>> anywhere. Should I have done, ps -ef |grep -e pacemaker -e corosync
>> to check the state before
>> assuming it was really down?
>>
> Guess the answer from Poki should guide you well here ...
>>
>> > node (pcs cluster start), and this resulted in the 200 VirtualDomain
>> > resources activating on the single node.
>> > This was not what I was expecting. I assumed that no resources would
>> > activate / start on any cluster nodes
>> > until 3 out of the 5 total cluster nodes had pacemaker/corosync running.

Your expectation is correct; I'm not sure what happened in this case.
There are some obscure corosync options (e.g. last_man_standing,
allow_downscale) that could theoretically lead to this, but I don't get
the impression you're using anything unusual.

>> > After starting pacemaker/corosync on the single host (zs95kjpcs1),
>> > this is what I see:
>> >
>> > [root@zs95kj VD]# date;pcs status |less
>> > Wed Sep 7 15:51:17 EDT 2016
>> > Cluster name: test_cluster_2
>> > Last updated: Wed Sep 7 15:51:18 2016   Last change: Wed Sep 7 15:30:12
>> > 2016 by hacluster via crmd on zs93kjpcs1
>> > Stack: corosync
>> > Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>> > partition with quorum
>> > 106 nodes and 304 resources configured
>> >
>> > Node zs93KLpcs1: pending
>> > Node zs93kjpcs1: pending
>> > Node zs95KLpcs1: pending
>> > Online: [ zs95kjpcs1 ]
>> > OFFLINE: [ zs90kppcs1 ]
>> >
>> > .
>> > .
>> > .
>> > PCSD Status:
>> > zs93kjpcs1: Online
>> > zs95kjpcs1: Online
>> > zs95KLpcs1: Online
>> > zs90kppcs1: Offline
>> > zs93KLpcs1: Online

FYI the Online/Offline above refers only to pcsd, which doesn't have any
effect on the cluster itself -- just the ability to run pcs commands.

>> > So, what exactly constitutes an "Online" vs. "Offline" cluster node
>> > w.r.t. quorum calculation? Seems like in my case, it's "pending" on 3
>> > nodes,
>> > so where does that fall? And why "pending"? What does that mean?

"pending" means that the node has joined the corosync cluster (which
allows it to contribute to quorum), but it has not yet completed the
pacemaker join process (basically a handshake with the DC).

I think the corosync and pacemaker detail logs would be essential to
figuring out what's going on. Check the logs on the "pending" nodes to
see whether corosync somehow started up by this point, and check the
logs on this node to see what the most recent references to the pending
nodes were.
>> > Also, what exactly is the cluster's expected reaction to quorum loss?
>> > Cluster resources will be stopped or something else?
>> >
>> Depends on how you configure it using cluster property no-quorum-policy
>> (default: stop).
>>
>> Scott's reply:
>>
>> This is how the policy is configured:
>>
>> [root@zs95kj VD]# date;pcs config |grep quorum
>> Thu Sep 8 13:18:33 EDT 2016
>> no-quorum-policy: stop
>>
>> What should I expect with the 'stop' setting?
>>
>> >
>> > Where can I find this documentation?
>> >
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/
>>
>> Scott's reply:
>>
>> OK, I'll keep looking through this doc, but I don't easily find the
>> no-quorum-policy explained.
>>
> Well, the index leads you to:
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-cluster-options.html
> where you find an exhaustive description of the option.
>
> In short:
> you are running the default and that leads to all resources being
> stopped in a partition without quorum
>
>> Thanks..
>>
>> >
>> > Thanks!
>> >
>> > Scott Greenlese - IBM Solution Test Team.
>> >
>> > Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
>> > INTERNET: swgreenl@us.ibm.com
>> > PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org