[ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

Fri Sep 9 21:56:46 UTC 2016

On 09/09/2016 02:47 PM, Scott Greenlese wrote:
> Hi Ken,
> 
> Below where you commented,
> 
> "It's considered good practice to stop
> pacemaker+corosync before rebooting a node intentionally (for even more
> safety, you can put the node into standby first)."
> 
> .. is this something that we document anywhere?

Not in any official documentation that I'm aware of; it's more a general
custom than a strong recommendation.

> Our 'reboot' action performs a halt (deactivate lpar) and then activate.
> Do I run the risk
> of guest instances running on multiple hosts in my case? I'm performing
> various recovery
> scenarios and want to avoid this procedure (reboot without first
> stopping cluster), if it's not supported.

By "intentionally" I mean via normal system administration, not fencing.
When fencing, it's always acceptable (and desirable) to do an immediate
cutoff, without any graceful stopping of anything.

When doing a graceful reboot/shutdown, the OS typically asks all running
processes to terminate, then waits a while for them to do so. There's
nothing really wrong with pacemaker still running at that point -- as
long as everything goes well.

If the OS gets impatient and terminates pacemaker before it finishes
stopping, the rest of the cluster will want to fence the node. Also, if
something goes wrong when resources are stopping, it might be harder to
troubleshoot, if the whole system is shutting down at the same time. So,
stopping pacemaker first makes sure that all the resources stop cleanly,
and that the cluster will ignore the node.

Putting the node in standby is less important. I would say the main
benefit is that it comes back up in standby when it rejoins, so you have more
control over when resources start being placed back on it. You can bring
up the node and start pacemaker, and make sure everything is good before
allowing resources back on it (especially helpful if you just upgraded
pacemaker or any of its dependencies, changed the host's network
configuration, etc.).
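
As a rough illustration of the sequence described above (standby, clean stop, reboot, start, check, unstandby), here is a small Python sketch that only builds the list of commands an admin might run; it executes nothing. Treat it as an outline under assumptions, not an official procedure: `planned_reboot_steps` is a hypothetical helper name, the node name is borrowed from this thread, and the `pcs cluster standby` / `pcs cluster unstandby` spellings are the ones from the pcs 0.9 series that ships alongside pacemaker 1.1 (newer pcs releases spell these `pcs node standby` / `pcs node unstandby`).

```python
def planned_reboot_steps(node):
    """Return (before, after) command lists for an intentional reboot of `node`.

    Nothing is executed here; the lists are meant to be read and reviewed.
    """
    before = [
        # Optional extra safety: move resources off the node and keep them off.
        ["pcs", "cluster", "standby", node],
        # Stop pacemaker and corosync cleanly, so resources stop in an
        # orderly way and the rest of the cluster ignores the node.
        ["pcs", "cluster", "stop", node],
        # Only now reboot via normal system administration (run on the node).
        ["reboot"],
    ]
    after = [
        # Bring the cluster services back; the node rejoins still in standby.
        ["pcs", "cluster", "start", node],
        # Check that everything looks good before allowing resources back.
        ["pcs", "status"],
        # Let resources be placed on the node again.
        ["pcs", "cluster", "unstandby", node],
    ]
    return before, after


if __name__ == "__main__":
    before, after = planned_reboot_steps("zs95kjpcs1")
    for cmd in before + after:
        print(" ".join(cmd))
```

Keeping the steps as data rather than running them directly makes it easy to review the order, which is the point here: the cluster stop comes before the reboot, and unstandby comes last.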

There shouldn't be any chance of multiple-active instances if fencing is
configured. Pacemaker shouldn't recover the resource elsewhere until it
confirms that either the resource stopped successfully on the node, or
the node was fenced.
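
The rule in the paragraph above can be written down as a tiny decision function. This is only a sketch of the reasoning, not Pacemaker's actual scheduler code; `may_recover_elsewhere` and its two arguments are made-up names for illustration.

```python
def may_recover_elsewhere(stop_confirmed, node_fenced):
    """Return True when it is safe to start the resource on another node.

    stop_confirmed: the resource reported a successful stop on its old node.
    node_fenced:    the old node has been confirmed fenced.
    """
    return stop_confirmed or node_fenced


# A node that merely dropped off the network satisfies neither condition,
# so starting the guest elsewhere would risk two active copies.
assert may_recover_elsewhere(False, False) is False
# A confirmed fence is enough, even if the stop result is unknown.
assert may_recover_elsewhere(False, True) is True
# So is a clean stop on a node that is still reachable.
assert may_recover_elsewhere(True, False) is True
```

Read this way, a guest running on two hosts means one of the two inputs was believed true when it was not, which is why the stop and fence results in the logs are worth checking.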

> 
> By the way, I always put the node in cluster standby before an
> intentional reboot.
> 
> Thanks!
> 
> Scott Greenlese ... IBM Solutions Test, Poughkeepsie, N.Y.
> INTERNET: swgreenl at us.ibm.com
> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
> 
> 
> From: Ken Gaillot <kgaillot at redhat.com>
> To: users at clusterlabs.org
> Date: 09/02/2016 10:01 AM
> Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
> transient network failure
> 
> ------------------------------------------------------------------------
> 
> 
> 
> On 09/01/2016 09:39 AM, Scott Greenlese wrote:
>> Andreas,
>>
>> You wrote:
>>
>> /"Would be good to see your full cluster configuration (corosync.conf
>> and cib) - but first guess is: no fencing at all .... and what is your
>> "no-quorum-policy" in Pacemaker?/
>>
>> /Regards,/
>> /Andreas"/
>>
>> Thanks for your interest. I actually do have a stonith device configured
>> which maps all 5 cluster nodes in the cluster:
>>
>> [root at zs95kj ~]# date;pcs stonith show fence_S90HMC1
>> Thu Sep 1 10:11:25 EDT 2016
>> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)
>> Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic
>> pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER
>> pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"
>> pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off
>> pcmk_reboot_timeout=600
>> Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)
>>
>> This fencing device works, too well actually. It seems extremely
>> sensitive to node "failures", and I'm not sure how to tune that. Stonith
>> reboot action is 'off', and the general stonith action (cluster config)
>> is also 'off'. In fact, often if I reboot a cluster node (i.e. reboot
>> command) that is an active member in the cluster... stonith will power
>> off that node while it's on its way back up. (perhaps requires a
>> separate issue thread on this forum?).
> 
> That depends on what a reboot does in your OS ... if it shuts down the
> cluster services cleanly, you shouldn't get a fence, but if it kills
> anything still running, then the cluster will see the node as failed,
> and fencing is appropriate. It's considered good practice to stop
> pacemaker+corosync before rebooting a node intentionally (for even more
> safety, you can put the node into standby first).
> 
>>
>> My no-quorum-policy is: no-quorum-policy: stop
>>
>> I don't think I should have lost quorum, only two of the five cluster
>> nodes lost their corosync ring connection.
> 
> Those two nodes lost quorum, so they should have stopped all their
> resources. And the three remaining nodes should have fenced them.
> 
> I'd check the logs around the time of the incident. Do the two affected
> nodes detect the loss of quorum? Do they attempt to stop their
> resources? Do those stops succeed? Do the other three nodes detect the
> loss of the two nodes? Does the DC attempt to fence them? Do the fence
> attempts succeed?
> 
>> Here's the full configuration:
>>
>>
>> [root at zs95kj ~]# cat /etc/corosync/corosync.conf
>> totem {
>> version: 2
>> secauth: off
>> cluster_name: test_cluster_2
>> transport: udpu
>> }
>>
>> nodelist {
>> node {
>> ring0_addr: zs93kjpcs1
>> nodeid: 1
>> }
>>
>> node {
>> ring0_addr: zs95kjpcs1
>> nodeid: 2
>> }
>>
>> node {
>> ring0_addr: zs95KLpcs1
>> nodeid: 3
>> }
>>
>> node {
>> ring0_addr: zs90kppcs1
>> nodeid: 4
>> }
>>
>> node {
>> ring0_addr: zs93KLpcs1
>> nodeid: 5
>> }
>> }
>>
>> quorum {
>> provider: corosync_votequorum
>> }
>>
>> logging {
>> #Log to a specified file
>> to_logfile: yes
>> logfile: /var/log/corosync/corosync.log
>> #Log timestamp as well
>> timestamp: on
>>
>> #Facility in syslog
>> syslog_facility: daemon
>>
>> logger_subsys {
>> #Enable debug for this logger.
>>
>> debug: off
>>
>> #This specifies the subsystem identity (name) for which logging is specified
>>
>> subsys: QUORUM
>>
>> }
>> #Log to syslog
>> to_syslog: yes
>>
>> #Whether or not turning on the debug information in the log
>> debug: on
>> }
>> [root at zs95kj ~]#
>>
>>
>>
>> The full CIB (see attachment)
>>
>> [root at zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out
>>
>> /(See attached file: scotts_cib_Sep1_2016.out)/
>>
>>
>> A few excerpts from the CIB:
>>
>> [root at zs95kj ~]# pcs cluster cib |less
>> <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.3" epoch="2804"
>> num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31
>> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"
>> update-user="root" have-quorum="1" dc-uuid="2">
>> <configuration>
>> <crm_config>
>> <cluster_property_set id="cib-bootstrap-options">
>> <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog"
>> value="false"/>
>> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>> value="1.1.13-10.el7_2.ibm.1-44eb2dd"/>
>> <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>> name="cluster-infrastructure" value="corosync"/>
>> <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name"
>> value="test_cluster_2"/>
>> <nvpair id="cib-bootstrap-options-no-quorum-policy"
>> name="no-quorum-policy" value="stop"/>
>> <nvpair id="cib-bootstrap-options-last-lrm-refresh"
>> name="last-lrm-refresh" value="1472595716"/>
>> <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action"
>> value="off"/>
>> </cluster_property_set>
>> </crm_config>
>> <nodes>
>> <node id="1" uname="zs93kjpcs1">
>> <instance_attributes id="nodes-1"/>
>> </node>
>> <node id="2" uname="zs95kjpcs1">
>> <instance_attributes id="nodes-2"/>
>> </node>
>> <node id="3" uname="zs95KLpcs1">
>> <instance_attributes id="nodes-3"/>
>> </node>
>> <node id="4" uname="zs90kppcs1">
>> <instance_attributes id="nodes-4"/>
>> </node>
>> <node id="5" uname="zs93KLpcs1">
>> <instance_attributes id="nodes-5"/>
>> </node>
>> </nodes>
>> <primitive class="ocf" id="zs95kjg109062_res" provider="heartbeat"
>> type="VirtualDomain">
>> <instance_attributes id="zs95kjg109062_res-instance_attributes">
>> <nvpair id="zs95kjg109062_res-instance_attributes-config" name="config"
>> value="/guestxml/nfs1/zs95kjg109062.xml"/>
>> <nvpair id="zs95kjg109062_res-instance_attributes-hypervisor"
>> name="hypervisor" value="qemu:///system"/>
>> <nvpair id="zs95kjg109062_res-instance_attributes-migration_transport"
>> name="migration_transport" value="ssh"/>
>> </instance_attributes>
>> <meta_attributes id="zs95kjg109062_res-meta_attributes">
>> <nvpair id="zs95kjg109062_res-meta_attributes-allow-migrate"
>> name="allow-migrate" value="true"/>
>> </meta_attributes>
>> <operations>
>> <op id="zs95kjg109062_res-start-interval-0s" interval="0s" name="start"
>> timeout="90"/>
>> <op id="zs95kjg109062_res-stop-interval-0s" interval="0s" name="stop"
>> timeout="90"/>
>> <op id="zs95kjg109062_res-monitor-interval-30s" interval="30s"
>> name="monitor"/>
>> <op id="zs95kjg109062_res-migrate-from-interval-0s" interval="0s"
>> name="migrate-from" timeout="1200"/>
>> </operations>
>> <utilization id="zs95kjg109062_res-utilization">
>> <nvpair id="zs95kjg109062_res-utilization-cpu" name="cpu" value="2"/>
>> <nvpair id="zs95kjg109062_res-utilization-hv_memory" name="hv_memory"
>> value="2048"/>
>> </utilization>
>> </primitive>
>>
>> ( I OMITTED THE OTHER, SIMILAR 199 VIRTUALDOMAIN PRIMITIVE ENTRIES FOR
>> THE SAKE OF SPACE, BUT IF THEY ARE OF
>> INTEREST, I CAN ADD THEM)
>>
>> .
>> .
>> .
>>
>> <constraints>
>> <rsc_location id="location-zs95kjg109062_res" rsc="zs95kjg109062_res">
>> <rule id="location-zs95kjg109062_res-rule" score="-INFINITY">
>> <expression attribute="#kind" id="location-zs95kjg109062_res-rule-expr"
>> operation="eq" value="container"/>
>> </rule>
>> </rsc_location>
>>
>> (I DEFINED THIS LOCATION CONSTRAINT RULE TO PREVENT OPAQUE GUEST VIRTUAL
>> DOMAIN RESOURCES FROM BEING
>> ASSIGNED TO REMOTE NODE VIRTUAL DOMAIN RESOURCES. I ALSO OMITTED THE
>> NUMEROUS, SIMILAR ENTRIES BELOW).
>>
>> .
>> .
>> .
>>
>> (I ALSO OMITTED THE NUMEROUS RESOURCE STATUS STANZAS)
>> .
>> .
>> .
>> </node_state>
>> <node_state remote_node="true" id="zs95kjg110117" uname="zs95kjg110117"
>> crm-debug-origin="do_state_transition" node_fenced="0">
>> <transient_attributes id="zs95kjg110117">
>> <instance_attributes id="status-zs95kjg110117"/>
>> </transient_attributes>
>>
>> (OMITTED NUMEROUS SIMILAR NODE STATUS ENTRIES)
>> .
>> .
>> .
>>
>>
>> If there's anything important I left out in the CIB output, please refer
>> to the email attachment "scotts_cib_Sep1_2016.out". Thanks!
>>
>>
>> Scott G.
>>
>>
>> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
>> INTERNET: swgreenl at us.ibm.com
>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>>
>> From: Andreas Kurz <andreas.kurz at gmail.com>
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>
>> Date: 08/30/2016 05:06 PM
>> Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
>> transient network failure
>>
>> ------------------------------------------------------------------------
>>
>>
>>
>> Hi,
>>
>> On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese <swgreenl at us.ibm.com>
>> wrote:
>>
>>     Added an appropriate subject line (was blank). Thanks...
>>
>>
>>     Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
>>     INTERNET: swgreenl at us.ibm.com
>>     PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
>>
>>     ----- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016
>>     03:59 PM -----
>>
>>     From: Scott Greenlese/Poughkeepsie/IBM at IBMUS
>>     To: Cluster Labs - All topics related to open-source clustering
>>     welcomed <users at clusterlabs.org>
>>     Date: 08/29/2016 06:36 PM
>>     Subject: [ClusterLabs] (no subject)
>>     ------------------------------------------------------------------------
>>
>>
>>
>>     Hi folks,
>>
>>     I'm assigned to system test Pacemaker/Corosync on the KVM on System
>>     Z platform
>>     with pacemaker-1.1.13-10 and corosync-2.3.4-7 .
>>
>>
>> Would be good to see your full cluster configuration (corosync.conf and
>> cib) - but first guess is: no fencing at all .... and what is your
>> "no-quorum-policy" in Pacemaker?
>>
>> Regards,
>> Andreas
>>  
>>
>>
>>     I have a cluster with 5 KVM hosts, and a total of 200
>>     ocf:heartbeat:VirtualDomain resources defined to run
>>     across the 5 cluster nodes (symmetrical is true for this cluster).
>>
>>     The heartbeat network is communicating over vlan1293, which is hung
>>     off a network device, 0230 .
>>
>>     In general, pacemaker does a good job of distributing my virtual
>>     guest resources evenly across the hypervisors
>>     in the cluster. These resources are a mixed bag:
>>
>>     - "opaque" and remote "guest nodes" managed by the cluster.
>>     - allow-migrate=false and allow-migrate=true
>>     - qcow2 (file based) guests and LUN based guests
>>     - SLES and Ubuntu OS
>>
>>     [root at zs95kj ]# pcs status |less
>>     Cluster name: test_cluster_2
>>     Last updated: Mon Aug 29 17:02:08 2016 Last change: Mon Aug 29
>>     16:37:31 2016 by root via crm_resource on zs93kjpcs1
>>     Stack: corosync
>>     Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
>>     partition with quorum
>>     103 nodes and 300 resources configured
>>
>>     Node zs90kppcs1: standby
>>     Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]
>>
>>     This morning, our system admin team performed a "non-disruptive"
>>     (concurrent) microcode code load on the OSA, which
>>     (to our surprise) dropped the network connection for 13 seconds on
>>     the S93 CEC, from 11:18:34am to 11:18:47am , to be exact.
>>     This temporary outage caused the two cluster nodes on S93
>>     (zs93kjpcs1 and zs93KLpcs1) to drop out of the cluster,
>>     as expected.
>>
>>     However, pacemaker didn't handle this too well. The end result was
>>     numerous VirtualDomain resources in FAILED state:
>>
>>     [root at zs95kj log]# date;pcs status |grep VirtualD |grep zs93 |grep
>>     FAILED
>>     Mon Aug 29 12:33:32 EDT 2016
>>     zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>     zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>     zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>     zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>     zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
>>     zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>     zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
>>
>>
>>     There were also several VirtualDomain resources showing "Started" on two
>>     cluster nodes:
>>
>>     zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[
>>     zs93kjpcs1 zs93KLpcs1 ]
>>     zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[
>>     zs93kjpcs1 zs93KLpcs1 ]
>>     zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[
>>     zs93kjpcs1 zs93KLpcs1 ]
>>     zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[
>>     zs93kjpcs1 zs93KLpcs1 ]
>>     zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[
>>     zs93kjpcs1 zs93KLpcs1 ]
>>
>>
>>     The virtual machines themselves were, in fact, "running" on both
>>     hosts. For example:
>>
>>     [root at zs93kl ~]# virsh list |grep zs95kjg110079
>>     70 zs95kjg110079 running
>>
>>     [root at zs93kj cli]# virsh list |grep zs95kjg110079
>>     18 zs95kjg110079 running
>>
>>
>>     On this particular VM, there was file corruption of this file-based
>>     qcow2 guest's image, such that you could not ping or ssh,
>>     and if you open a virsh console, you get an "initramfs" prompt.
>>
>>     To recover, we had to mount the volume on another VM and then run
>>     fsck to recover it.
>>
>>     I walked through the system log on the two S93 hosts to see how
>>     zs95kjg110079 ended up running
>>     on two cluster nodes. (some entries were omitted, I saved logs for
>>     future reference):
>>     zs93kjpcs1 system log - (shows membership changes after the network
>>     failure at 11:18:34)
>>
>>     Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: The qeth device driver
>>     failed to recover an error on the device
>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000000: 00 c2 40 17 01 51
>>     90 38 00 04 00 00 00 00 00 00 .. at ..Q.8........
>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00
>>     00 00 00 00 00 00 00 00 00 00 ................
>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00
>>     00 00 00 00 00 00 00 00 00 00 ................
>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00
>>     00 00 00 00 00 34 00 1f 00 07 ...........4....
>>     Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: A recovery process has
>>     been started for the device
>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] The token was lost
>>     in the OPERATIONAL state.
>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] A processor failed,
>>     forming new configuration.
>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] entering GATHER
>>     state from 2(The token was lost in the OPERATIONAL state.).
>>     Aug 29 11:18:34 zs93kl kernel: qeth 0.0.0230: The qeth device driver
>>     failed to recover an error on the device
>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000000: 00 00 11 01 00 00
>>     00 00 00 04 00 00 00 00 00 00 ................
>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00
>>     00 00 00 00 00 00 00 00 00 00 ................
>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00
>>     00 00 00 00 00 00 00 00 00 00 ................
>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00
>>     00 00 00 00 00 00 00 00 00 00 ................
>>
>>
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:
>>     Node zs95kjpcs1[2] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95kjpcs1
>>     attributes for attrd_peer_change_cb
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:
>>     Node zs95kjpcs1[2] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95kjpcs1/2
>>     from the membership list
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=2
>>     and/or uname=zs95kjpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95kjpcs1/2
>>     from the membership list
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:
>>     Node zs95KLpcs1[3] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with
>>     id=2 and/or uname=zs95kjpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95KLpcs1/3
>>     from the membership list
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:
>>     Node zs95KLpcs1[3] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=3
>>     and/or uname=zs95KLpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:
>>     Node zs93KLpcs1[5] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs93KLpcs1/5
>>     from the membership list
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering GATHER
>>     state from 0(consensus timeout).
>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=5
>>     and/or uname=zs93KLpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Creating commit
>>     token because I am the rep.
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Saving state aru 32
>>     high seq received 32
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Storing new sequence
>>     id for ring 300
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering COMMIT
>>     state.
>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: Membership 768: quorum
>>     lost (1)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95KLpcs1
>>     attributes for attrd_peer_change_cb
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95KLpcs1/3
>>     from the membership list
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering RECOVERY
>>     state.
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] TRANS [0] member
>>     10.20.93.11:
>>     Aug 29 11:18:37 zs93kj pacemakerd[21143]: notice: Membership 768:
>>     quorum lost (1)
>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice:
>>     crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was
>>     member)
>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes:
>>     Node zs95KLpcs1[3] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown
>>     action on 3
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with
>>     id=3 and/or uname=zs95KLpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Removing
>>     zs95kjpcs1/2 from the membership list
>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: Stonith/shutdown of
>>     zs95KLpcs1 not matched
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] position [0] member
>>     10.20.93.11:
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:
>>     Node zs93KLpcs1[5] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Purged 1 peers
>>     with id=2 and/or uname=zs95kjpcs1 from the membership cache
>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes:
>>     Node zs95kjpcs1[2] - state is now lost (was member)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] previous ring seq
>>     2fc rep 10.20.93.11
>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs93KLpcs1
>>     attributes for attrd_peer_change_cb
>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice:
>>     crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was
>>     member)
>>     Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown
>>     action on 2
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] aru 32 high
>>     delivered 32 received flag 1
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Did not need to
>>     originate any messages in recovery.
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Sending initial ORF
>>     token
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag
>>     is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0
>>     high seq received 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag
>>     is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0
>>     high seq received 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag
>>     is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0
>>     high seq received 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag
>>     is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0
>>     high seq received 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] retrans flag count
>>     4 token aru 0 install seq 0 aru 0 0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Resetting old ring
>>     state
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] recovery to
>>     regular 1-0
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
>>     10.20.93.12 inactive
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
>>     10.20.93.13 inactive
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
>>     10.20.93.14 inactive
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)
>>     ip(10.20.93.12)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)
>>     ip(10.20.93.13)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)
>>     ip(10.20.93.14)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] waiting_trans_ack
>>     changed to 1
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering
>>     OPERATIONAL state.
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] A new membership
>>     (10.20.93.11:768) was formed. Members left: 2 5 3
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Failed to receive
>>     the leave message. failed: 2 5 3
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [SYNC ] Committing
>>     synchronization for corosync configuration map access
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CMAP ] Not first sync -> no
>>     action
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] comparing: sender
>>     r(0) ip(10.20.93.11) ; members(old:4 left:3)
>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] chosen downlist:
>>     sender r(0) ip(10.20.93.11) ; members(old:4 left:3)
>>
>>
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
>>     10.20.93.12 active
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
>>     10.20.93.14 active
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0)
>>     ip(10.20.93.12)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0)
>>     ip(10.20.93.14)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] entering
>>     OPERATIONAL state.
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] A new membership
>>     (10.20.93.11:772) was formed. Members joined: 2 3
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
>>     synchronization for corosync configuration map access
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CMAP ] Not first sync -> no
>>     action
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message
>>     from node 0x1
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message
>>     from node 0x2
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender
>>     r(0) ip(10.20.93.14) ; members(old:2 left:0)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender
>>     r(0) ip(10.20.93.12) ; members(old:2 left:0)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender
>>     r(0) ip(10.20.93.11) ; members(old:1 left:0)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] chosen downlist:
>>     sender r(0) ip(10.20.93.12) ; members(old:2 left:0)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message
>>     from node 0x3
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
>>     synchronization for corosync cluster closed process group service
>>     v1.01
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[0]
>>     group:crmd\x00, ip:r(0) ip(10.20.93.14) , pid:21491
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[1]
>>     group:attrd\x00, ip:r(0) ip(10.20.93.14) , pid:21489
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[2]
>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.14) , pid:21487
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[3]
>>     group:cib\x00, ip:r(0) ip(10.20.93.14) , pid:21486
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[4]
>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.14) , pid:21485
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[5]
>>     group:crmd\x00, ip:r(0) ip(10.20.93.12) , pid:24499
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[6]
>>     group:attrd\x00, ip:r(0) ip(10.20.93.12) , pid:24497
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[7]
>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.12) , pid:24495
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[8]
>>     group:cib\x00, ip:r(0) ip(10.20.93.12) , pid:24494
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[9]
>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.12) , pid:24491
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[10]
>>     group:crmd\x00, ip:r(0) ip(10.20.93.11) , pid:21402
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[11]
>>     group:attrd\x00, ip:r(0) ip(10.20.93.11) , pid:21400
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[12]
>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.11) , pid:21398
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[13]
>>     group:cib\x00, ip:r(0) ip(10.20.93.11) , pid:21397
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[14]
>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.11) , pid:21143
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] flags: quorate: No
>>     Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No
>>     QdeviceCastVote: No QdeviceMasterWins: No
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] IPC credentials
>>     authenticated (20562-21400-28)
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] connecting to client
>>     [21400]
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;
>>     real_size:1052672; rb->word_size:263168
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;
>>     real_size:1052672; rb->word_size:263168
>>     Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: Membership 772:
>>     quorum acquired (3)
>>
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] quorum regained,
>>     resuming activity
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] got nodeinfo
>>     message from cluster node 3
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] nodeinfo
>>     message[0]: votes: 0, expected: 0 flags: 0
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
>>     synchronization for corosync vote quorum service v1.0
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] total_votes=3,
>>     expected_votes=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 1 state=1,
>>     votes=1, expected=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 2 state=1,
>>     votes=1, expected=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 3 state=1,
>>     votes=1, expected=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 4 state=2,
>>     votes=1, expected=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 5 state=2,
>>     votes=1, expected=5
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] lowest node id: 1
>>     us: 1
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] highest node id: 3
>>     us: 1
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] This node is within
>>     the primary component and will provide service.
>>     Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice:
>>     pcmk_quorum_notification: Node zs95KLpcs1[3] - state is now member
>>     (was lost)
>>     Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc:
>>     Node zs95KLpcs1[3] - state is now member (was (null))
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] Members[3]: 1 2 3
>>     Aug 29 11:18:43 zs93kj stonith-ng[21398]: warning: Node names with
>>     capitals are discouraged, consider changing 'zs95KLpcs1' to
>>     something else
>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Completed service
>>     synchronization, ready to provide service.
>>     Aug 29 11:18:43 zs93kj stonith-ng[21398]: notice:
>>     crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was
>>     (null))
>>     Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc:
>>     Node zs95kjpcs1[2] - state is now member (was (null))
>>
>>
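The votequorum lines above are consistent with plain majority arithmetic: with expected_votes=5, three votes are needed, and the three members present supply exactly that, hence "quorum regained". A minimal sketch of that check, assuming the default majority rule (the flags line shows no Qdevice and no WFA); the function names are illustrative, not part of corosync:

```python
def majority_quorum(expected_votes: int) -> int:
    """Votes needed for quorum under a simple majority rule (assumption)."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """True if the partition holds at least a majority of the expected votes."""
    return total_votes >= majority_quorum(expected_votes)


# Values taken from the log: total_votes=3, expected_votes=5
print(majority_quorum(5), is_quorate(3, 5))
```

With only two of the five votes present, the same check would report the partition as not quorate.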
>>     The story of zs95kjg110079 starts on ZS93KL when it seemed to be
>>     already running on ZS93KJ.
>>
>>     System log on zs93KLpcs1:
>>
>>     Aug 29 11:20:58 zs93kl pengine[19997]: notice: Start
>>     zs95kjg110079_res#011(zs93KLpcs1)
>>
>>     Aug 29 11:21:56 zs93kl crmd[20001]: notice: Initiating action 520:
>>     start zs95kjg110079_res_start_0 on zs93KLpcs1 (local)
>>
>>     Aug 29 11:21:56 zs93kl systemd-machined: New machine
>>     qemu-70-zs95kjg110079.
>>     Aug 29 11:21:56 zs93kl systemd: Started Virtual Machine
>>     qemu-70-zs95kjg110079.
>>     Aug 29 11:21:56 zs93kl systemd: Starting Virtual Machine
>>     qemu-70-zs95kjg110079.
>>
>>     Aug 29 11:21:59 zs93kl crmd[20001]: notice: Operation
>>     zs95kjg110079_res_start_0: ok (node=zs93KLpcs1, call=1036, rc=0,
>>     cib-update=735, confirmed=true)
>>
>>     Aug 29 11:22:07 zs93kl crmd[20001]: warning: Action 238
>>     (zs95kjg110079_res_monitor_0) on zs93kjpcs1 failed (target: 7 vs.
>>     rc: 0): Error
>>     Aug 29 11:22:07 zs93kl crmd[20001]: notice: Transition aborted by
>>     zs95kjg110079_res_monitor_0 'create' on zs93kjpcs1: Event failed
>>     (magic=0:0;238:13:7:236d078a-9063-4092-9660-cfae048f3627,
>>     cib=0.2437.3212, source=match_graph_event:381, 0)
>>
>>     Aug 29 11:22:15 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:22:15 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:22:15 zs93kl pengine[19997]: notice: Restart
>>     zs95kjg110079_res#011(Started zs93kjpcs1)
>>
>>     Aug 29 11:22:23 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:22:23 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:22:23 zs93kl pengine[19997]: notice: Restart
>>     zs95kjg110079_res#011(Started zs93kjpcs1)
>>
>>
>>     Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:30:31 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:30:31 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>
>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>
>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>
>>
>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>
>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:03:24 zs93kl pengine[19997]: notice: Restart
>>     zs95kjg110079_res#011(Started zs93kjpcs1)
>>
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource
>>     zs95kjg110210_res (ocf::VirtualDomain) is active on 2 nodes
>>     attempting recovery
>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
>>     http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
>>     information.
>>     Aug 29 12:36:27 zs93kl pengine[19997]: notice: Restart
>>     zs95kjg110079_res#011(Started zs93kjpcs1)
>>
>>
>>     Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition 84
>>     (Complete=108, Pending=0, Fired=0, Skipped=0, Incomplete=77,
>>     Source=/var/lib/pacemaker/pengine/pe-error-106.bz2): Terminated
>>     Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition failed:
>>     terminated
>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: Graph 84 with 185
>>     actions: batch-limit=185 jobs, network-delay=0ms
>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 410]: Pending
>>     rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0,
>>     waiting: 409)
>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 409]: Pending
>>     rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0,
>>     waiting: 408)
>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 408]: Pending
>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 439
>>     470 496 521 546)
>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 407]: Completed
>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: none)
>>
>>     Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 428:
>>     stop zs95kjg110079_res_stop_0 on zs93kjpcs1
>>     Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 495:
>>     stop zs95kjg110108_res_stop_0 on zs93kjpcs1
>>     Aug 29 12:59:44 zs93kl crmd[20001]: notice: Initiating action 660:
>>     stop zs95kjg110186_res_stop_0 on zs93kjpcs1
>>
>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 431]: Pending
>>     rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0,
>>     waiting: 430)
>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 430]: Pending
>>     rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0,
>>     waiting: 429)
>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 429]: Pending
>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 460
>>     491 517 542 567)
>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 428]: Completed
>>     rsc op zs95kjg110079_res_stop_0 on zs93kjpcs1 (priority: 0, waiting:
>>     none)
>>
>>
>>     System log on zs93kjpcs1:
>>
>>
>>     Aug 29 11:20:48 zs93kj crmd[21402]: notice: Recurring action
>>     zs95kjg110079_res:817 (zs95kjg110079_res_monitor_30000) incomplete
>>     at shutdown
>>
>>     Aug 29 11:22:07 zs93kj crmd[259639]: notice: Operation
>>     zs95kjg110079_res_monitor_0: ok (node=zs93kjpcs1, call=1223, rc=0,
>>     cib-update=104, confirmed=true)
>>
>>     Aug 29 12:59:42 zs93kj VirtualDomain(zs95kjg110079_res)[9148]: INFO:
>>     Issuing graceful shutdown request for domain zs95kjg110079.
>>
>>     Finally zs95kjg110079 shuts down on ZS93KJ at 12:59.
>>
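To see how the set of affected resources grew over time (one at 11:22, six by 12:36), the pengine errors can be tallied per timestamp. A minimal sketch, assuming the syslog format shown above with each message on a single line, as in the raw log rather than the wrapped quote; the regex and function name are illustrative only:

```python
import re
from collections import defaultdict

# Matches lines such as:
# "Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
#  (ocf::VirtualDomain) is active on 2 nodes attempting recovery"
DUAL_ACTIVE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+pengine\[\d+\]:\s+error:\s+"
    r"Resource\s+(\S+)\s+\(\S+\)\s+is active on \d+ nodes"
)


def tally_dual_active(lines):
    """Map each timestamp to the set of resources reported as multiply active."""
    by_time = defaultdict(set)
    for line in lines:
        match = DUAL_ACTIVE.match(line.strip())
        if match:
            by_time[match.group(1)].add(match.group(2))
    return dict(by_time)
```

Feeding it the lines of the system log from zs93KLpcs1 and printing the size of each set gives a quick view of whether recovery is converging or, as here, more resources are falling out.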
>>
>>     ===================
>>
>>     Does this "active on two nodes" recovery process look right?
>>
>>     What is the recommended procedure to "undo" the resource failures
>>     and dual host assignments? It took several hours (short of
>>     stopping/starting the entire cluster)
>>     to recover them... resource disable, cleanup, enable was the basis
>>     ... but it seemed that I would fix one resource and two more would
>>     fall out.
>>
>>     This seems to be one of the pitfalls of configuring resources in
>>     symmetrical mode.
>>
>>     I would appreciate any best practice guidelines you have to offer. I
>>     saved the system logs on all hosts in case anyone needs more
>>     detailed information.
>>     I also have pacemaker.log logs.
>>
>>     Thanks in advance!
>>
>>
>>
>>     Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
>>     INTERNET: swgreenl at us.ibm.com
>>     PHONE: 8/293-7301 (845-433-7301) M/S: POK
>>     42HA/P966