[ClusterLabs] Pacemaker quorum behavior

Scott Greenlese swgreenl at us.ibm.com
Fri Sep 9 20:13:00 CEST 2016


Poki,

Once again, I must apologize for presenting you and the users group with
some misinformation.  After triple-checking my note log, it seems that I
described the two actions to you backwards: it was the kill, not the gentle
shutdown, that I had issues with, and I had done them in the reverse order.

In reality, I had attempted the individual `pcs cluster kill` commands
first, and it was the kills that were ineffective... and by "ineffective",
I mean that the resources were not stopping (I impatiently only waited
about 4 minutes before making that determination).
I then ran `pcs cluster stop --all`, which seemed to work... and by "work",
I mean that subsequent attempts to issue pcs status returned the message:
"Error: cluster is not currently running on this node".

You are probably wondering why I would choose the controversial, last-resort
"kill" method over the gentler stop method as my first attempt to stop the
cluster.  I do not usually do this first, but so many times when I have tried
'stop' with resources in 'failed' state, it has taken up to 20 minutes for the
stop to quiesce and complete.  So in this case I thought I'd expedite things,
try the kill first, and see what happened.  I'm wondering now whether having
orphaned virtual domains left running is the expected behavior after a kill?
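
(For what it's worth, a quick check like the following would have told me what
the kill actually left behind; my assumption here is that `pcs cluster kill`
only terminates the cluster daemons and does not stop the resources they were
managing:)

ps -ef | grep -E 'corosync|pacemakerd' | grep -v grep   # expect no output after a successful kill
virsh list                                              # any guests still listed here are now unmanaged/orphaned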

Full disclosure / with time stamps ...  (and a bit of a digression)

What led up to shutting down the cluster in the first place was that I had
numerous VirtualDomain resources and their domains running on multiple hosts
concurrently, a disastrous situation that resulted in corruption of many of
my virtual image volumes.

For example:

 zs95kjg109082_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs93kjpcs1 ]
 zs95kjg109083_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109084_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109085_res      (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109086_res      (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109087_res      (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109088_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109089_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs93kjpcs1 ]
 zs95kjg109090_res      (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109091_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109092_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109094_res      (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109095_res      (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109096_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]
 zs95kjg109097_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109099_res      (ocf::heartbeat:VirtualDomain): Started[ zs95kjpcs1 zs93kjpcs1 ]
 zs95kjg109100_res      (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg109101_res      (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109102_res      (ocf::heartbeat:VirtualDomain): Started zs93KLpcs1
 zs95kjg109104_res      (ocf::heartbeat:VirtualDomain): Started zs95KLpcs1
 zs95kjg109105_res      (ocf::heartbeat:VirtualDomain): Started zs93kjpcs1
 zs95kjg110061_res      (ocf::heartbeat:VirtualDomain): Started[ zs95KLpcs1 zs95kjpcs1 ]


I also had numerous FAILED resources, which I strongly suspect were due to
corruption of the virtual system image volumes; I later had to recover those
volumes via fsck (a rough sketch of that recovery follows the list below).

 zs95kjg110099_res      (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
 zs95kjg110100_res      (ocf::heartbeat:VirtualDomain): FAILED zs95kjpcs1
 zs95kjg110101_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110102_res      (ocf::heartbeat:VirtualDomain): FAILED zs95KLpcs1
 zs95kjg110098_res      (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
 WebSite        (ocf::heartbeat:apache):        FAILED zs95kjg110090
 fence_S90HMC1  (stonith:fence_ibmz):   Started zs95kjpcs1
 zs95kjg110105_res      (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
 zs95kjg110106_res      (ocf::heartbeat:VirtualDomain): FAILED zs95KLpcs1
 zs95kjg110107_res      (ocf::heartbeat:VirtualDomain): FAILED zs95kjpcs1
 zs95kjg110108_res      (ocf::heartbeat:VirtualDomain): FAILED[ zs95kjpcs1 zs93kjpcs1 ]
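
Roughly what that recovery looks like for one guest (a sketch only; the device
path below is a placeholder, not our real volume name):

virsh destroy zs95kjg110099              # make absolutely sure no host still has the guest running
fsck -y /dev/<guest_image_volume>        # repair the corrupted guest filesystem
pcs resource cleanup zs95kjg110099_res   # clear the FAILED state once the image is repaired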


The pacemaker logs were jam-packed with this message for many of my
VirtualDomain resources:

Sep 07 15:10:50 [32366] zs93kl crm_resource: (    native.c:97    )   debug: native_add_running: zs95kjg110195_res is active on 2 nodes including zs93kjpcs1: attempting recovery
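
(For anyone who wants to scan their own logs for the same condition, something
like this should find it; adjust the path to wherever pacemaker logs on your
nodes:)

grep "is active on 2 nodes" /var/log/pacemaker.log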

I reported an earlier occurrence of this in a previous thread,
subject: "[ClusterLabs] "VirtualDomain is active on 2 nodes" due to
transient network failure".
This happens to be a more recent occurrence of that issue, which we suspect
was caused by rebooting a cluster node without first stopping pacemaker on
that host.
We typically put the node in 'cluster standby', wait for resources to move
away from the node, and then issue 'reboot'.
The reboot action on our LPARs is configured to perform a halt
(deactivate), and then activate.  It is not a graceful system shutdown.
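
Roughly, that sequence is (the node name below is just an example):

pcs cluster standby zs95kjpcs1   # ask pacemaker to migrate resources off the node
pcs status                       # repeat until no resources remain on the node
reboot                           # LPAR-level deactivate/activate; note there is no 'pcs cluster stop' first,
                                 # which is likely what bit us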
(end of digression).

Anyway, in an attempt to stabilize and recover from this mess, I did the
cluster kills, followed by the `pcs cluster stop --all`, as follows:

[root at zs95kj VD]# date;pcs cluster kill
Wed Sep  7 15:28:26 EDT 2016
[root at zs95kj VD]#

[root at zs93kl VD]# date;pcs cluster kill
Wed Sep  7 15:28:44 EDT 2016

[root at zs95kj VD]# date;pcs cluster kill
Wed Sep  7 15:29:06 EDT 2016
[root at zs95kj VD]#

[root at zs95KL VD]# date;pcs cluster kill
Wed Sep  7 15:29:30 EDT 2016
[root at zs95KL VD]#

[root at zs93kj ~]# date;pcs cluster kill
Wed Sep  7 15:30:06 EDT 2016
[root at zs93kj ~]#

[root at zs95kj VD]# pcs status |less
Cluster name: test_cluster_2
Last updated: Wed Sep  7 15:31:24 2016          Last change: Wed Sep  7 15:14:07 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs93KLpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum
106 nodes and 304 resources configured

Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]


As I said, I waited about 4 minutes, and it didn't look like the resources
were stopping, nor did it look like the nodes were going offline (note:
zs90kppcs1 was shut down, so of course it was offline).  So, impatiently,
I then did the cluster stop, which surprisingly completed very quickly...

[root at zs95kj VD]# date;pcs cluster stop --all
Wed Sep  7 15:32:27 EDT 2016
zs90kppcs1: Unable to connect to zs90kppcs1 ([Errno 113] No route to host)
zs93kjpcs1: Stopping Cluster (pacemaker)...
zs95kjpcs1: Stopping Cluster (pacemaker)...
zs95KLpcs1: Stopping Cluster (pacemaker)...
zs93KLpcs1: Stopping Cluster (pacemaker)...
Error: unable to stop all nodes
zs90kppcs1: Unable to connect to zs90kppcs1 ([Errno 113] No route to host)
You have new mail in /var/spool/mail/root


This was when I discovered that the virtual domains themselves were still
running on all the hosts, so I copied this script to each host over ssh and
ran it to "destroy" (forcibly shut down) them...

[root at zs95kj VD]# cat destroyall.sh
# forcibly stop every libvirt domain currently running on this host
for guest in $(virsh list | grep running | awk '{print $2}'); do
  virsh destroy "$guest"
done

[root at zs95kj VD]# ./destroyall.sh
Domain zs95kjg110190 destroyed

Domain zs95kjg110211 destroyed

.
. (omitted dozens of Domain xxx destroyed messages)
.

and then checked them:

[root at zs95kj VD]# for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 ; do ssh $host virsh list; done
 Id    Name                           State
----------------------------------------------------

 Id    Name                           State
----------------------------------------------------

 Id    Name                           State
----------------------------------------------------

 Id    Name                           State
----------------------------------------------------


Next, I decided to run this quorum test...which resulted in the unexpected
behavior (as originally reported in this thread):

TEST:  With pacemaker initially down on all nodes,  start cluster on one
cluster node at a time, and see what happens when we reach quorum at 3.
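
To watch the vote count as each node came up, the plan was something along
these lines (corosync-quorumtool is standard with corosync 2.x; with 5
expected votes, the quorum threshold should be floor(5/2)+1 = 3):

corosync-quorumtool -s   # shows Expected votes, Total votes, Quorum, and whether this partition is quorate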

[root at zs95kj VD]# date;for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1 ; do ssh $host pcs status; done
Wed Sep  7 15:49:27 EDT 2016
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node
Error: cluster is not currently running on this node


[root at zs95kj VD]# date;pcs cluster start
Wed Sep  7 15:50:00 EDT 2016
Starting Cluster...
[root at zs95kj VD]#


[root at zs95kj VD]# while true; do date;./ckrm.sh; sleep 10; done
Wed Sep  7 15:50:09 EDT 2016

 ### VirtualDomain Resource Statistics: ###

"_res" Virtual Domain resources:
  Started on zs95kj: 0
  Started on zs93kj: 0
  Started on zs95KL: 0
  Started on zs93KL: 0
  Started on zs90KP: 0
  Total Started: 0
  Total NOT Started: 200
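
(ckrm.sh is just a local counting script; it is roughly equivalent to the
following, included only so the counts above make sense. This is not the
actual script:)

# count the "_res" VirtualDomain resources that pcs reports Started on each host
for node in zs95kj zs93kj zs95KL zs93KL zs90KP; do
  echo "  Started on ${node}: $(pcs status | grep "_res" | grep VirtualDomain | grep -c "Started ${node}")"
done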



To my surprise, the resources are starting up on zs95kj.  Apparently, I
have quorum?
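
In hindsight, the thing to have checked right here was whether corosync had
really died on the other three powered-on nodes after the kills, since one
freshly started node out of 5 expected votes should not be quorate on its own.
Something like this (a sketch, not what I actually ran at the time):

corosync-quorumtool -s   # on zs95kj: shows Total votes and the membership it can see
for host in zs93KLpcs1 zs95KLpcs1 zs93kjpcs1; do
  ssh $host 'pidof corosync || echo corosync not running'
done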


[root at zs95kj VD]# date;pcs status |less
Wed Sep  7 15:51:17 EDT 2016
Cluster name: test_cluster_2
Last updated: Wed Sep  7 15:51:18 2016          Last change: Wed Sep  7 15:30:12 2016 by hacluster via crmd on zs93kjpcs1
Stack: corosync
Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) - partition with quorum  <<< WHY DO I HAVE QUORUM?
106 nodes and 304 resources configured

Node zs93KLpcs1: pending
Node zs93kjpcs1: pending
Node zs95KLpcs1: pending
Online: [ zs95kjpcs1 ]
OFFLINE: [ zs90kppcs1 ]

Full list of resources:

 zs95kjg109062_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109063_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109064_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109065_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109066_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109067_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
 zs95kjg109068_res      (ocf::heartbeat:VirtualDomain): Started zs95kjpcs1
.
.
.
PCSD Status:
  zs93kjpcs1: Online
  zs95kjpcs1: Online
  zs95KLpcs1: Online
  zs90kppcs1: Offline
  zs93KLpcs1: Online


Check resources again:

Wed Sep  7 16:09:52 EDT 2016

 ### VirtualDomain Resource Statistics: ###

"_res" Virtual Domain resources:
  Started on zs95kj: 199
  Started on zs93kj: 0
  Started on zs95KL: 0
  Started on zs93KL: 0
  Started on zs90KP: 0
  Total Started: 199
  Total NOT Started: 1


I have since isolated all the corrupted virtual domain images and disabled
their VirtualDomain resources.
We already rebooted all five cluster nodes, after installing a new KVM
driver on them.

Now, the quorum calculation and behavior seem to be working exactly as
expected.

I started pacemaker on the nodes, one at a time... and after 3 of the 5
nodes had pacemaker "Online", resources activated and were evenly distributed
across them.

In summary, a lesson learned here is to check the status of the pcs process,
to be certain pacemaker and corosync are indeed "offline" and that all threads
of that process have terminated.
You had mentioned this command:

pstree -p | grep -A5 $(pidof -x pcs)

I'm not quite sure what the $(pidof -x pcs) part represents?

On an "Online" cluster node, I see:

[root at zs93kj ~]# ps -ef |grep pcs |grep -v grep
root      18876      1  0 Sep07 ?        00:00:00 /bin/sh /usr/lib/pcsd/pcsd start
root      18905  18876  0 Sep07 ?        00:00:00 /bin/bash -c ulimit -S -c 0 >/dev/null 2>&1 ; /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
root      18906  18905  0 Sep07 ?        00:04:22 /usr/bin/ruby -I/usr/lib/pcsd /usr/lib/pcsd/ssl.rb
[root at zs93kj ~]#

If I use the 18876 PID on a healthy node, I get..

[root at zs93kj ~]# pstree -p |grep -A5 18876
           |-pcsd(18876)---bash(18905)---ruby(18906)-+-{ruby}(19102)
           |                                         |-{ruby}(20212)
           |                                         `-{ruby}(224258)
           |-pkcsslotd(18851)
           |-polkitd(19091)-+-{polkitd}(19100)
           |                |-{polkitd}(19101)


Is this what you meant for me to do?    If so, I'll be sure to do that next
time I suspect processes are not exiting on cluster kill or stop.
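
For the record, the check I plan to run next time is something like this
(a sketch; the daemon names are the pacemaker 1.1 / corosync 2.x ones on our
nodes):

for host in zs93KLpcs1 zs95KLpcs1 zs95kjpcs1 zs93kjpcs1; do
  echo "== $host =="
  ssh $host 'ps -ef | grep -E "corosync|pacemakerd|crmd|cib|lrmd|stonithd|attrd|pengine" | grep -v grep'
  # plus 'pstree -p | grep -A5 $(pidof -x pcs)' on any node where a pcs command appears hung
done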

Thanks


Scott Greenlese ... IBM z/BX Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgreenl at us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966




From:	Jan Pokorný <jpokorny at redhat.com>
To:	Cluster Labs - All topics related to open-source clustering
            welcomed <users at clusterlabs.org>
Cc:	Si Bo Niu <niusibo at cn.ibm.com>, Scott
            Loveland/Poughkeepsie/IBM at IBMUS, Michael
            Tebolt/Poughkeepsie/IBM at IBMUS
Date:	09/08/2016 02:43 PM
Subject:	Re: [ClusterLabs] Pacemaker quorum behavior



On 08/09/16 10:20 -0400, Scott Greenlese wrote:
> Correction...
>
> When I stopped pacemaker/corosync on the four (powered on / active)
> cluster node hosts,  I was having an issue with the gentle method of
> stopping the cluster (pcs cluster stop --all),

Can you elaborate on what went wrong with this gentle method, please?

If it seemed to have stuck, you can perhaps run some diagnostics like:

  pstree -p | grep -A5 $(pidof -x pcs)

across the nodes to see what process(es) pcs waits on, next time.

> so I ended up doing individual (pcs cluster kill <cluster_node>) on
> each of the four cluster nodes.   I then had to stop the virtual
> domains manually via 'virsh destroy <guestname>' on each host.
> Perhaps there was some residual node status affecting my quorum?

Hardly if corosync processes were indeed dead.

--
Jan (Poki)

