<html><body><p>Hi Ken,<br><br>In your reply below, you commented:<br><br>"<tt>It's considered good practice to stop<br>pacemaker+corosync before rebooting a node intentionally (for even more<br>safety, you can put the node into standby first).</tt>"<br><br>Is this documented anywhere?<br><br>Our 'reboot' action performs a halt (deactivate LPAR) and then an activate. Do I run the risk<br>of guest instances running on multiple hosts in my case? I'm performing various recovery<br>scenarios and want to avoid this procedure (rebooting without first stopping the cluster) if it's not supported.<br><br>By the way, I always put the node in cluster standby before an intentional reboot.<br><br>Thanks!<br><br>Scott Greenlese ... IBM Solutions Test,  Poughkeepsie, N.Y.<br>  INTERNET:  swgreenl@us.ibm.com  <br>  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966<br><br><br><font size="2" color="#5F5F5F">From:        </font><font size="2">Ken Gaillot <kgaillot@redhat.com></font><br><font size="2" color="#5F5F5F">To:        </font><font size="2">users@clusterlabs.org</font><br><font size="2" color="#5F5F5F">Date:        </font><font size="2">09/02/2016 10:01 AM</font><br><font size="2" color="#5F5F5F">Subject:        </font><font size="2">Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure</font><br><hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br><br><br><tt>On 09/01/2016 09:39 AM, Scott Greenlese wrote:<br>> Andreas,<br>> <br>> You wrote:<br>> <br>> /"Would be good to see your full cluster configuration (corosync.conf<br>> and cib) - but 
first guess is: no fencing at all .... and what is your<br>> "no-quorum-policy" in Pacemaker?/<br>> <br>> /Regards,/<br>> /Andreas"/<br>> <br>> Thanks for your interest. I actually do have a stonith device configured<br>> which maps all 5 cluster nodes in the cluster:<br>> <br>> [root@zs95kj ~]# date;pcs stonith show fence_S90HMC1<br>> Thu Sep 1 10:11:25 EDT 2016<br>> Resource: fence_S90HMC1 (class=stonith type=fence_ibmz)<br>> Attributes: ipaddr=9.12.35.134 login=stonith passwd=lnx4ltic<br>> pcmk_host_map=zs95KLpcs1:S95/KVL;zs93KLpcs1:S93/KVL;zs93kjpcs1:S93/KVJ;zs95kjpcs1:S95/KVJ;zs90kppcs1:S90/PACEMAKER<br>> pcmk_host_list="zs95KLpcs1 zs93KLpcs1 zs93kjpcs1 zs95kjpcs1 zs90kppcs1"<br>> pcmk_list_timeout=300 pcmk_off_timeout=600 pcmk_reboot_action=off<br>> pcmk_reboot_timeout=600<br>> Operations: monitor interval=60s (fence_S90HMC1-monitor-interval-60s)<br>> <br>> This fencing device works, too well actually. It seems extremely<br>> sensitive to node "failures", and I'm not sure how to tune that. Stonith<br>> reboot action is 'off', and the general stonith action (cluster config)<br>> is also 'off'. In fact, often if I reboot a cluster node (i.e. reboot<br>> command) that is an active member in the cluster... stonith will power<br>> off that node while it's on its way back up. (perhaps requires a<br>> separate issue thread on this forum?).<br><br>That depends on what a reboot does in your OS ... if it shuts down the<br>cluster services cleanly, you shouldn't get a fence, but if it kills<br>anything still running, then the cluster will see the node as failed,<br>and fencing is appropriate. 
It's considered good practice to stop<br>pacemaker+corosync before rebooting a node intentionally (for even more<br>safety, you can put the node into standby first).<br><br>> <br>> My no-quorum-policy is: no-quorum-policy: stop<br>> <br>> I don't think I should have lost quorum, only two of the five cluster<br>> nodes lost their corosync ring connection.<br><br>Those two nodes lost quorum, so they should have stopped all their<br>resources. And the three remaining nodes should have fenced them.<br><br>I'd check the logs around the time of the incident. Do the two affected<br>nodes detect the loss of quorum? Do they attempt to stop their<br>resources? Do those stops succeed? Do the other three nodes detect the<br>loss of the two nodes? Does the DC attempt to fence them? Do the fence<br>attempts succeed?<br><br>> Here's the full configuration:<br>> <br>> <br>> [root@zs95kj ~]# cat /etc/corosync/corosync.conf<br>> totem {<br>> version: 2<br>> secauth: off<br>> cluster_name: test_cluster_2<br>> transport: udpu<br>> }<br>> <br>> nodelist {<br>> node {<br>> ring0_addr: zs93kjpcs1<br>> nodeid: 1<br>> }<br>> <br>> node {<br>> ring0_addr: zs95kjpcs1<br>> nodeid: 2<br>> }<br>> <br>> node {<br>> ring0_addr: zs95KLpcs1<br>> nodeid: 3<br>> }<br>> <br>> node {<br>> ring0_addr: zs90kppcs1<br>> nodeid: 4<br>> }<br>> <br>> node {<br>> ring0_addr: zs93KLpcs1<br>> nodeid: 5<br>> }<br>> }<br>> <br>> quorum {<br>> provider: corosync_votequorum<br>> }<br>> <br>> logging {<br>> #Log to a specified file<br>> to_logfile: yes<br>> logfile: /var/log/corosync/corosync.log<br>> #Log timestamp as well<br>> timestamp: on<br>> <br>> #Facility in syslog<br>> syslog_facility: daemon<br>> <br>> logger_subsys {<br>> #Enable debug for this logger.<br>> <br>> debug: off<br>> <br>> #This specifies the subsystem identity (name) for which logging is specified<br>> <br>> subsys: QUORUM<br>> <br>> }<br>> #Log to syslog<br>> to_syslog: yes<br>> <br>> #Whether or not turning on the debug information in the 
log<br>> debug: on<br>> }<br>> [root@zs95kj ~]#<br>> <br>> <br>> <br>> The full CIB (see attachment)<br>> <br>> [root@zs95kj ~]# pcs cluster cib > /tmp/scotts_cib_Sep1_2016.out<br>> <br>> /(See attached file: scotts_cib_Sep1_2016.out)/<br>> <br>> <br>> A few excerpts from the CIB:<br>> <br>> [root@zs95kj ~]# pcs cluster cib |less<br>> <cib crm_feature_set="3.0.10" validate-with="pacemaker-2.3" epoch="2804"<br>> num_updates="19" admin_epoch="0" cib-last-written="Wed Aug 31 15:59:31<br>> 2016" update-origin="zs93kjpcs1" update-client="crm_resource"<br>> update-user="root" have-quorum="1" dc-uuid="2"><br>> <configuration><br>> <crm_config><br>> <cluster_property_set id="cib-bootstrap-options"><br>> <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog"<br>> value="false"/><br>> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"<br>> value="1.1.13-10.el7_2.ibm.1-44eb2dd"/><br>> <nvpair id="cib-bootstrap-options-cluster-infrastructure"<br>> name="cluster-infrastructure" value="corosync"/><br>> <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name"<br>> value="test_cluster_2"/><br>> <nvpair id="cib-bootstrap-options-no-quorum-policy"<br>> name="no-quorum-policy" value="stop"/><br>> <nvpair id="cib-bootstrap-options-last-lrm-refresh"<br>> name="last-lrm-refresh" value="1472595716"/><br>> <nvpair id="cib-bootstrap-options-stonith-action" name="stonith-action"<br>> value="off"/><br>> </cluster_property_set><br>> </crm_config><br>> <nodes><br>> <node id="1" uname="zs93kjpcs1"><br>> <instance_attributes id="nodes-1"/><br>> </node><br>> <node id="2" uname="zs95kjpcs1"><br>> <instance_attributes id="nodes-2"/><br>> </node><br>> <node id="3" uname="zs95KLpcs1"><br>> <instance_attributes id="nodes-3"/><br>> </node><br>> <node id="4" uname="zs90kppcs1"><br>> <instance_attributes id="nodes-4"/><br>> </node><br>> <node id="5" uname="zs93KLpcs1"><br>> <instance_attributes id="nodes-5"/><br>> </node><br>> </nodes><br>> <primitive class="ocf" 
id="zs95kjg109062_res" provider="heartbeat"<br>> type="VirtualDomain"><br>> <instance_attributes id="zs95kjg109062_res-instance_attributes"><br>> <nvpair id="zs95kjg109062_res-instance_attributes-config" name="config"<br>> value="/guestxml/nfs1/zs95kjg109062.xml"/><br>> <nvpair id="zs95kjg109062_res-instance_attributes-hypervisor"<br>> name="hypervisor" value="qemu:///system"/><br>> <nvpair id="zs95kjg109062_res-instance_attributes-migration_transport"<br>> name="migration_transport" value="ssh"/><br>> </instance_attributes><br>> <meta_attributes id="zs95kjg109062_res-meta_attributes"><br>> <nvpair id="zs95kjg109062_res-meta_attributes-allow-migrate"<br>> name="allow-migrate" value="true"/><br>> </meta_attributes><br>> <operations><br>> <op id="zs95kjg109062_res-start-interval-0s" interval="0s" name="start"<br>> timeout="90"/><br>> <op id="zs95kjg109062_res-stop-interval-0s" interval="0s" name="stop"<br>> timeout="90"/><br>> <op id="zs95kjg109062_res-monitor-interval-30s" interval="30s"<br>> name="monitor"/><br>> <op id="zs95kjg109062_res-migrate-from-interval-0s" interval="0s"<br>> name="migrate-from" timeout="1200"/><br>> </operations><br>> <utilization id="zs95kjg109062_res-utilization"><br>> <nvpair id="zs95kjg109062_res-utilization-cpu" name="cpu" value="2"/><br>> <nvpair id="zs95kjg109062_res-utilization-hv_memory" name="hv_memory"<br>> value="2048"/><br>> </utilization><br>> </primitive><br>> <br>> ( I OMITTED THE OTHER, SIMILAR 199 VIRTUALDOMAIN PRIMITIVE ENTRIES FOR<br>> THE SAKE OF SPACE, BUT IF THEY ARE OF<br>> INTEREST, I CAN ADD THEM)<br>> <br>> .<br>> .<br>> .<br>> <br>> <constraints><br>> <rsc_location id="location-zs95kjg109062_res" rsc="zs95kjg109062_res"><br>> <rule id="location-zs95kjg109062_res-rule" score="-INFINITY"><br>> <expression attribute="#kind" id="location-zs95kjg109062_res-rule-expr"<br>> operation="eq" value="container"/><br>> </rule><br>> </rsc_location><br>> <br>> (I DEFINED THIS LOCATION CONSTRAINT RULE TO PREVENT OPAQUE GUEST 
VIRTUAL<br>> DOMAIN RESOURCES FROM BEING<br>> ASSIGNED TO REMOTE NODE VIRTUAL DOMAIN RESOURCES. I ALSO OMITTED THE<br>> NUMEROUS, SIMILAR ENTRIES BELOW).<br>> <br>> .<br>> .<br>> .<br>> <br>> (I ALSO OMITTED THE NUMEROUS RESOURCE STATUS STANZAS)<br>> .<br>> .<br>> .<br>> </node_state><br>> <node_state remote_node="true" id="zs95kjg110117" uname="zs95kjg110117"<br>> crm-debug-origin="do_state_transition" node_fenced="0"><br>> <transient_attributes id="zs95kjg110117"><br>> <instance_attributes id="status-zs95kjg110117"/><br>> </transient_attributes><br>> <br>> (OMITTED NUMEROUS SIMILAR NODE STATUS ENTRIES)<br>> .<br>> .<br>> .<br>> <br>> <br>> If there's anything important I left out in the CIB output, please refer<br>> to the email attachment "scotts_cib_Sep1_2016.out". Thanks!<br>> <br>> <br>> Scott G.<br>> <br>> <br>> Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.<br>> INTERNET: swgreenl@us.ibm.com<br>> PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966<br>> <br>> <br>> From: Andreas Kurz <andreas.kurz@gmail.com><br>> To: Cluster Labs - All topics related to open-source clustering welcomed<br>> <users@clusterlabs.org><br>> Date: 08/30/2016 05:06 PM<br>> Subject: Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to<br>> transient network failure<br>> <br>> ------------------------------------------------------------------------<br>> <br>> <br>> <br>> Hi,<br>> <br>> On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese <_swgreenl@us.ibm.com_<br>> <</tt><tt><a href="mailto:swgreenl@us.ibm.com">mailto:swgreenl@us.ibm.com</a></tt><tt>>> wrote:<br>> <br>>     Added an appropriate subject line (was blank). Thanks...<br>> <br>> <br>>     Scott Greenlese ... 
IBM z/BX Solutions Test, Poughkeepsie, N.Y.<br>>     INTERNET: _swgreenl@us.ibm.com_ <</tt><tt><a href="mailto:swgreenl@us.ibm.com">mailto:swgreenl@us.ibm.com</a></tt><tt>><br>>     PHONE: 8/293-7301 (845-433-7301) M/S: POK<br>>     42HA/P966<br>> <br>>     ----- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016<br>>     03:59 PM -----<br>> <br>>     From: Scott Greenlese/Poughkeepsie/IBM@IBMUS<br>>     To: Cluster Labs - All topics related to open-source clustering<br>>     welcomed <_users@clusterlabs.org_ <</tt><tt><a href="mailto:users@clusterlabs.org">mailto:users@clusterlabs.org</a></tt><tt>>><br>>     Date: 08/29/2016 06:36 PM<br>>     Subject: [ClusterLabs] (no subject)<br>>     ------------------------------------------------------------------------<br>> <br>> <br>> <br>>     Hi folks,<br>> <br>>     I'm assigned to system test Pacemaker/Corosync on the KVM on System<br>>     Z platform<br>>     with pacemaker-1.1.13-10 and corosync-2.3.4-7.<br>> <br>> <br>> Would be good to see your full cluster configuration (corosync.conf and<br>> cib) - but first guess is: no fencing at all .... and what is your<br>> "no-quorum-policy" in Pacemaker?<br>> <br>> Regards,<br>> Andreas<br>>   <br>> <br>> <br>>     I have a cluster with 5 KVM hosts, and a total of 200<br>>     ocf:heartbeat:VirtualDomain resources defined to run<br>>     across the 5 cluster nodes (symmetrical is true for this cluster).<br>> <br>>     The heartbeat network is communicating over vlan1293, which is hung<br>>     off a network device, 0230.<br>> <br>>     In general, pacemaker does a good job of distributing my virtual<br>>     guest resources evenly across the hypervisors<br>>     in the cluster. 
These resources are a mixed bag:<br>> <br>>     - "opaque" and remote "guest nodes" managed by the cluster.<br>>     - allow-migrate=false and allow-migrate=true<br>>     - qcow2 (file based) guests and LUN based guests<br>>     - SLES and Ubuntu OS<br>> <br>>     [root@zs95kj ]# pcs status |less<br>>     Cluster name: test_cluster_2<br>>     Last updated: Mon Aug 29 17:02:08 2016 Last change: Mon Aug 29<br>>     16:37:31 2016 by root via crm_resource on zs93kjpcs1<br>>     Stack: corosync<br>>     Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -<br>>     partition with quorum<br>>     103 nodes and 300 resources configured<br>> <br>>     Node zs90kppcs1: standby<br>>     Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]<br>> <br>>     This morning, our system admin team performed a "non-disruptive"<br>>     (concurrent) microcode load on the OSA, which<br>>     (to our surprise) dropped the network connection for 13 seconds on<br>>     the S93 CEC, from 11:18:34am to 11:18:47am, to be exact.<br>>     This temporary outage caused the two cluster nodes on S93<br>>     (zs93kjpcs1 and zs93KLpcs1) to drop out of the cluster,<br>>     as expected.<br>> <br>>     However, pacemaker didn't handle this too well. 
The end result was<br>>     numerous VirtualDomain resources in FAILED state:<br>> <br>>     [root@zs95kj log]# date;pcs status |grep VirtualD |grep zs93 |grep<br>>     FAILED<br>>     Mon Aug 29 12:33:32 EDT 2016<br>>     zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br>>     zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br>>     zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br>>     zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br>>     zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1<br>>     zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     
zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>>     zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1<br>> <br>> <br>>     In addition, several VirtualDomain resources showed "Started" on two<br>>     cluster nodes:<br>> <br>>     zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[<br>>     zs93kjpcs1 zs93KLpcs1 ]<br>>     zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[<br>>     zs93kjpcs1 zs93KLpcs1 ]<br>>     zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[<br>>     zs93kjpcs1 zs93KLpcs1 ]<br>>     zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[<br>>     zs93kjpcs1 zs93KLpcs1 ]<br>>     zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[<br>>     zs93kjpcs1 zs93KLpcs1 ]<br>> <br>> <br>>     The virtual machines themselves were, in fact, "running" on both<br>>     hosts. For example:<br>> <br>>     [root@zs93kl ~]# virsh list |grep zs95kjg110079<br>>     70 zs95kjg110079 running<br>> <br>>     [root@zs93kj cli]# virsh list |grep zs95kjg110079<br>>     18 zs95kjg110079 running<br>> <br>> <br>>     On this particular VM, there was file corruption of this file-based<br>>     qcow2 guest's image, such that you could not ping or ssh,<br>>     and if you open a virsh console, you get an "initramfs" prompt.<br>> <br>>     To recover, we had to mount the volume on another VM and then run<br>>     fsck to recover it.<br>> <br>>     I walked through the system log on the two S93 hosts to see how<br>>     zs95kjg110079 ended up running<br>>     on two cluster nodes. 
(some entries were omitted, I saved logs for<br>>     future reference):<br>>     *<br>> <br>>     zs93kjpcs1 *system log - (shows membership changes after the network<br>>     failure at 11:18:34)<br>> <br>>     Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: The qeth device driver<br>>     failed to recover an error on the device<br>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000000: 00 c2 40 17 01 51<br>>     90 38 00 04 00 00 00 00 00 00 ..@..Q.8........<br>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00<br>>     00 00 00 00 00 00 00 00 00 00 ................<br>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00<br>>     00 00 00 00 00 00 00 00 00 00 ................<br>>     Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00<br>>     00 00 00 00 00 34 00 1f 00 07 ...........4....<br>>     Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: A recovery process has<br>>     been started for the device<br>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] The token was lost<br>>     in the OPERATIONAL state.<br>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] A processor failed,<br>>     forming new configuration.<br>>     Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] entering GATHER<br>>     state from 2(The token was lost in the OPERATIONAL state.).<br>>     Aug 29 11:18:34 zs93kl kernel: qeth 0.0.0230: The qeth device driver<br>>     failed to recover an error on the device<br>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000000: 00 00 11 01 00 00<br>>     00 00 00 04 00 00 00 00 00 00 ................<br>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00<br>>     00 00 00 00 00 00 00 00 00 00 ................<br>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00<br>>     00 00 00 00 00 00 00 00 00 00 ................<br>>     Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00<br>>     00 00 00 00 00 00 00 00 00 00 
................<br>> <br>> <br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:<br>>     Node zs95kjpcs1[2] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95kjpcs1<br>>     attributes for attrd_peer_change_cb<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:<br>>     Node zs95kjpcs1[2] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95kjpcs1/2<br>>     from the membership list<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=2<br>>     and/or uname=zs95kjpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95kjpcs1/2<br>>     from the membership list<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:<br>>     Node zs95KLpcs1[3] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with<br>>     id=2 and/or uname=zs95kjpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95KLpcs1/3<br>>     from the membership list<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:<br>>     Node zs95KLpcs1[3] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=3<br>>     and/or uname=zs95KLpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc:<br>>     Node zs93KLpcs1[5] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs93KLpcs1/5<br>>     from the membership list<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering GATHER<br>>     state from 0(consensus timeout).<br>>     Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=5<br>>     and/or uname=zs93KLpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Creating 
commit<br>>     token because I am the rep.<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Saving state aru 32<br>>     high seq received 32<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Storing new sequence<br>>     id for ring 300<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering COMMIT state.<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: Membership 768: quorum<br>>     lost (1)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95KLpcs1<br>>     attributes for attrd_peer_change_cb<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95KLpcs1/3<br>>     from the membership list<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering RECOVERY<br>>     state.<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] TRANS [0] member<br>>     _10.20.93.11_ <</tt><tt><a href="http://10.20.93.11/">http://10.20.93.11/</a></tt><tt>>:<br>>     Aug 29 11:18:37 zs93kj pacemakerd[21143]: notice: Membership 768:<br>>     quorum lost (1)<br>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice:<br>>     crm_update_peer_proc: Node zs95kjpcs1[2] - state is now lost (was<br>>     member)<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes:<br>>     Node zs95KLpcs1[3] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown<br>>     action on 3<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with<br>>     id=3 and/or uname=zs95KLpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Removing<br>>     zs95kjpcs1/2 from the membership list<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: Stonith/shutdown of<br>>     zs95KLpcs1 not matched<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] position [0] member<br>>     _10.20.93.11_ <</tt><tt><a 
href="http://10.20.93.11/">http://10.20.93.11/</a></tt><tt>>:<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc:<br>>     Node zs93KLpcs1[5] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Purged 1 peers<br>>     with id=2 and/or uname=zs95kjpcs1 from the membership cache<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes:<br>>     Node zs95kjpcs1[2] - state is now lost (was member)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] previous ring seq<br>>     2fc rep 10.20.93.11<br>>     Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs93KLpcs1<br>>     attributes for attrd_peer_change_cb<br>>     Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice:<br>>     crm_update_peer_proc: Node zs95KLpcs1[3] - state is now lost (was<br>>     member)<br>>     Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown<br>>     action on 2<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] aru 32 high<br>>     delivered 32 received flag 1<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Did not need to<br>>     originate any messages in recovery.<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Sending initial ORF<br>>     token<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag<br>>     is 0 my set retrans flag0 retrans queue empty 1 count 0, aru 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0<br>>     high seq received 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag<br>>     is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0<br>>     high seq received 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag<br>>     is 0 my set retrans flag0 retrans queue empty 1 
count 2, aru 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0<br>>     high seq received 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag<br>>     is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0<br>>     high seq received 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] retrans flag count<br>>     4 token aru 0 install seq 0 aru 0 0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Resetting old ring<br>>     state<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] recovery to regular 1-0<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member<br>>     10.20.93.12 inactive<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member<br>>     10.20.93.13 inactive<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member<br>>     10.20.93.14 inactive<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)<br>>     ip(10.20.93.12)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)<br>>     ip(10.20.93.13)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0)<br>>     ip(10.20.93.14)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] waiting_trans_ack<br>>     changed to 1<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering<br>>     OPERATIONAL state.<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] A new membership<br>>     (_10.20.93.11:768_ <</tt><tt><a href="http://10.20.93.11:768/">http://10.20.93.11:768/</a></tt><tt>>) was formed. Members<br>>     left: 2 5 3<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Failed to receive<br>>     the leave message. 
failed: 2 5 3<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [SYNC ] Committing<br>>     synchronization for corosync configuration map access<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CMAP ] Not first sync -> no<br>>     action<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] comparing: sender<br>>     r(0) ip(10.20.93.11) ; members(old:4 left:3)<br>>     Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] chosen downlist:<br>>     sender r(0) ip(10.20.93.11) ; members(old:4 left:3)<br>> <br>> <br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member<br>>     10.20.93.12 active</tt><br><tt>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member<br>>     10.20.93.14 active<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0)<br>>     ip(10.20.93.12)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0)<br>>     ip(10.20.93.14)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] entering<br>>     OPERATIONAL state.<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] A new membership<br>>     (_10.20.93.11:772_ <</tt><tt><a href="http://10.20.93.11:772/">http://10.20.93.11:772/</a></tt><tt>>) was formed. 
Members<br>>     joined: 2 3<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing<br>>     synchronization for corosync configuration map access<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CMAP ] Not first sync -> no<br>>     action<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message<br>>     from node 0x1<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message<br>>     from node 0x2<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender<br>>     r(0) ip(10.20.93.14) ; members(old:2 left:0)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender<br>>     r(0) ip(10.20.93.12) ; members(old:2 left:0)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender<br>>     r(0) ip(10.20.93.11) ; members(old:1 left:0)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] chosen downlist:<br>>     sender r(0) ip(10.20.93.12) ; members(old:2 left:0)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message<br>>     from node 0x3<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing<br>>     synchronization for corosync cluster closed process group service v1.01<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[0]<br>>     group:crmd\x00, ip:r(0) ip(10.20.93.14) , pid:21491<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[1]<br>>     group:attrd\x00, ip:r(0) ip(10.20.93.14) , pid:21489<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[2]<br>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.14) , pid:21487<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[3]<br>>     group:cib\x00, ip:r(0) ip(10.20.93.14) , pid:21486<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[4]<br>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.14) , pid:21485<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] 
joinlist_messages[5]<br>>     group:crmd\x00, ip:r(0) ip(10.20.93.12) , pid:24499<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[6]<br>>     group:attrd\x00, ip:r(0) ip(10.20.93.12) , pid:24497<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[7]<br>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.12) , pid:24495<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[8]<br>>     group:cib\x00, ip:r(0) ip(10.20.93.12) , pid:24494<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[9]<br>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.12) , pid:24491<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[10]<br>>     group:crmd\x00, ip:r(0) ip(10.20.93.11) , pid:21402<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[11]<br>>     group:attrd\x00, ip:r(0) ip(10.20.93.11) , pid:21400<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[12]<br>>     group:stonith-ng\x00, ip:r(0) ip(10.20.93.11) , pid:21398<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[13]<br>>     group:cib\x00, ip:r(0) ip(10.20.93.11) , pid:21397<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[14]<br>>     group:pacemakerd\x00, ip:r(0) ip(10.20.93.11) , pid:21143<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] flags: quorate: No<br>>     Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No<br>>     QdeviceCastVote: No QdeviceMasterWins: No<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] IPC credentials<br>>     authenticated (20562-21400-28)<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] connecting to client<br>>     [21400]<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;<br>>     real_size:1052672; rb->word_size:263168<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;<br>>     real_size:1052672; 
rb->word_size:263168<br>>     Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: Membership 772:<br>>     quorum acquired (3)<br>> <br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] quorum regained,<br>>     resuming activity<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] got nodeinfo<br>>     message from cluster node 3<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] nodeinfo<br>>     message[0]: votes: 0, expected: 0 flags: 0<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing<br>>     synchronization for corosync vote quorum service v1.0<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] total_votes=3,<br>>     expected_votes=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 1 state=1,<br>>     votes=1, expected=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 2 state=1,<br>>     votes=1, expected=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 3 state=1,<br>>     votes=1, expected=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 4 state=2,<br>>     votes=1, expected=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 5 state=2,<br>>     votes=1, expected=5<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] lowest node id: 1 us: 1<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] highest node id: 3<br>>     us: 1<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] This node is within<br>>     the primary component and will provide service.<br>>     Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice:<br>>     pcmk_quorum_notification: Node zs95KLpcs1[3] - state is now member<br>>     (was lost)<br>>     Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc:<br>>     Node zs95KLpcs1[3] - state is now member (was (null))<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] Members[3]: 1 2 3<br>>     Aug 29 11:18:43 zs93kj stonith-ng[21398]: warning: Node names with<br>>     capitals are discouraged, consider changing 'zs95KLpcs1' to<br>>     something else<br>>     Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Completed service<br>>     synchronization, ready to provide service.<br>>     Aug 29 11:18:43 zs93kj stonith-ng[21398]: notice:<br>>     crm_update_peer_proc: Node zs95KLpcs1[3] - state is now member (was<br>>     (null))<br>>     Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc:<br>>     Node zs95kjpcs1[2] - state is now member (was (null))<br>> <br>> <br>> <br>>     The story of zs95kjg110079 starts on ZS93KL when it seemed to be<br>>     already running on ZS93KJ.<br>> <br>>     System log on zs93KLpcs1:<br>> <br>>     Aug 29 11:20:58 zs93kl pengine[19997]: notice: Start<br>>     zs95kjg110079_res#011(zs93KLpcs1)<br>> <br>>     Aug 29 11:21:56 zs93kl crmd[20001]: notice: Initiating action 520:<br>>     start zs95kjg110079_res_start_0 on zs93KLpcs1 (local)<br>> <br>>     Aug 29 11:21:56 zs93kl systemd-machined: New machine<br>>     qemu-70-zs95kjg110079.<br>>     Aug 29 11:21:56 zs93kl systemd: Started Virtual Machine<br>>     qemu-70-zs95kjg110079.<br>>     Aug 29 11:21:56 zs93kl systemd: Starting Virtual Machine<br>>     qemu-70-zs95kjg110079.<br>> <br>>     Aug 29 11:21:59 zs93kl crmd[20001]: notice: Operation<br>>     zs95kjg110079_res_start_0: ok (node=zs93KLpcs1, call=1036, rc=0,<br>>     cib-update=735, confirmed=true)<br>> <br>>     Aug 29 11:22:07 zs93kl crmd[20001]: warning: Action 238<br>>     (zs95kjg110079_res_monitor_0) on zs93kjpcs1 failed (target: 7 vs.<br>>     rc: 0): Error<br>>     Aug 29 11:22:07 zs93kl crmd[20001]: notice: Transition aborted by<br>>     zs95kjg110079_res_monitor_0 
'create' on zs93kjpcs1: Event failed<br>>     (magic=0:0;238:13:7:236d078a-9063-4092-9660-cfae048f3627,<br>>     cib=0.2437.3212, source=match_graph_event:381, 0)<br>> <br>>     Aug 29 11:22:15 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:22:15 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:22:15 zs93kl pengine[19997]: notice: Restart<br>>     zs95kjg110079_res#011(Started zs93kjpcs1)<br>> <br>>     Aug 29 11:22:23 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:22:23 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:22:23 zs93kl pengine[19997]: notice: Restart<br>>     zs95kjg110079_res#011(Started zs93kjpcs1)<br>> <br>> <br>>     Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:30:31 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:30:31 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>> <br>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for 
more<br>>     information.<br>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:55:41 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>> <br>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 11:58:53 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>> <br>> <br>>     Aug 29 12:00:00 zs93kl 
pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:00:00 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>> <br>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res 
(ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:03:24 zs93kl pengine[19997]: notice: Restart<br>>     zs95kjg110079_res#011(Started zs93kjpcs1)<br>> <br>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110079_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110108_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: 
error: Resource<br>>     zs95kjg110186_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110188_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110198_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource<br>>     zs95kjg110210_res (ocf::VirtualDomain) is active on 2 nodes<br>>     attempting recovery<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: warning: See<br>>     _http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active_ for more<br>>     information.<br>>     Aug 29 12:36:27 zs93kl pengine[19997]: notice: Restart<br>>     zs95kjg110079_res#011(Started zs93kjpcs1)<br>> <br>> <br>>     Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition 84<br>>     (Complete=108, Pending=0, Fired=0, Skipped=0, Incomplete=77,<br>>     Source=/var/lib/pacemaker/pengine/pe-error-106.bz2): Terminated<br>>     Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition failed:<br>>     terminated<br>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: Graph 84 with 185<br>>     actions: batch-limit=185 jobs, network-delay=0ms<br>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 410]: Pending<br>>     rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0,<br>>     waiting: 409)<br>>     Aug 29 12:44:41 zs93kl 
crmd[20001]: notice: [Action 409]: Pending<br>>     rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0,<br>>     waiting: 408)<br>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 408]: Pending<br>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 439<br>>     470 496 521 546)<br>>     Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 407]: Completed<br>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: none)<br>> <br>>     Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 428:<br>>     stop zs95kjg110079_res_stop_0 on zs93kjpcs1<br>>     Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 495:<br>>     stop zs95kjg110108_res_stop_0 on zs93kjpcs1<br>>     Aug 29 12:59:44 zs93kl crmd[20001]: notice: Initiating action 660:<br>>     stop zs95kjg110186_res_stop_0 on zs93kjpcs1<br>> <br>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 431]: Pending<br>>     rsc op zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0,<br>>     waiting: 430)<br>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 430]: Pending<br>>     rsc op zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0,<br>>     waiting: 429)<br>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 429]: Pending<br>>     pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 460<br>>     491 517 542 567)<br>>     Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 428]: Completed<br>>     rsc op zs95kjg110079_res_stop_0 on zs93kjpcs1 (priority: 0, waiting:<br>>     none)<br>> <br>> <br>> <br>>     System log on zs93kjpcs1:<br>> <br>> <br>>     Aug 29 11:20:48 zs93kj crmd[21402]: notice: Recurring action<br>>     zs95kjg110079_res:817 (zs95kjg110079_res_monitor_30000) incomplete<br>>     at shutdown<br>> <br>>     Aug 29 11:22:07 zs93kj crmd[259639]: notice: Operation<br>>     zs95kjg110079_res_monitor_0: ok (node=zs93kjpcs1, call=1223, rc=0,<br>>     cib-update=104, confirmed=true)<br>> 
<br>>     Aug 29 12:59:42 zs93kj VirtualDomain(zs95kjg110079_res)[9148]: INFO:<br>>     Issuing graceful shutdown request for domain zs95kjg110079.<br>> <br>>     Finally, zs95kjg110079 shuts down on ZS93KJ at 12:59.<br>> <br>> <br>>     ===================<br>> <br>>     Does this "active on two nodes" recovery process look right?<br>> <br>>     What is the recommended procedure to "undo" the resource failures<br>>     and dual host assignments? Short of stopping and restarting the<br>>     entire cluster, it took several hours to recover them. The basic<br>>     approach was resource disable, cleanup, then enable, but it seemed<br>>     that whenever I fixed one resource, two more would fall out.<br>> <br>>     This seems to be one of the pitfalls of configuring resources in<br>>     symmetrical mode.<br>> <br>>     I would appreciate any best practice guidelines you have to offer. I<br>>     saved the system logs on all hosts in case anyone needs more<br>>     detailed information.<br>>     I also have the pacemaker.log files.<br>> <br>>     Thanks in advance!<br>> <br>> <br>> <br>>     Scott Greenlese ... 
IBM z/BX Solutions Test, Poughkeepsie, N.Y.<br>>     INTERNET: swgreenl@us.ibm.com<br>>     PHONE: 8/293-7301 (845-433-7301) M/S: POK<br>>     42HA/P966<br><br></tt><br><br><BR>

</body></html>