[ClusterLabs] "VirtualDomain is active on 2 nodes" due to transient network failure

Scott Greenlese swgreenl at us.ibm.com
Wed Aug 31 15:35:10 UTC 2016


Andreas Kurz,

I see that you replied to my post, but I can't find any text from you.
Could you please re-post your reply, or message me offline at swgreenl at
us.ibm.com? I'm new to the process, so perhaps I did something wrong in
my post.

Thanks for your guidance.

Scott G.


Scott Greenlese ... IBM z/BX Solutions Test,  Poughkeepsie, N.Y.
  INTERNET:  swgreenl at us.ibm.com
  PHONE:  8/293-7301 (845-433-7301)    M/S:  POK 42HA/P966




From:	Andreas Kurz <andreas.kurz at gmail.com>
To:	Cluster Labs - All topics related to open-source clustering
            welcomed <users at clusterlabs.org>
Date:	08/30/2016 05:06 PM
Subject:	Re: [ClusterLabs] "VirtualDomain is active on 2 nodes" due to
            transient network failure



Hi,

On Tue, Aug 30, 2016 at 10:03 PM, Scott Greenlese <swgreenl at us.ibm.com>
wrote:
  Added an appropriate subject line (was blank). Thanks...


  Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
  INTERNET: swgreenl at us.ibm.com
  PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966

  ----- Forwarded by Scott Greenlese/Poughkeepsie/IBM on 08/30/2016 03:59
  PM -----

  From: Scott Greenlese/Poughkeepsie/IBM at IBMUS
  To: Cluster Labs - All topics related to open-source clustering welcomed
  <users at clusterlabs.org>
  Date: 08/29/2016 06:36 PM
  Subject: [ClusterLabs] (no subject)



  Hi folks,

  I'm assigned to system test Pacemaker/Corosync on the KVM on System Z
  platform with pacemaker-1.1.13-10 and corosync-2.3.4-7.

It would be good to see your full cluster configuration (corosync.conf and
CIB), but my first guess is: no fencing at all... And what is your
"no-quorum-policy" in Pacemaker?
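For reference, both settings can be checked quickly on a pcs-managed cluster. These are illustrative admin commands against a live cluster, not output from the cluster in this thread:

```shell
# Show cluster-wide properties; look for no-quorum-policy and
# stonith-enabled. Properties not listed fall back to defaults
# (stonith-enabled=true, no-quorum-policy=stop).
pcs property

# List configured fence devices; empty output means no fencing is set up.
pcs stonith show

# Without working fencing, freezing resource management on quorum loss
# is a common interim mitigation (not a substitute for fencing):
pcs property set no-quorum-policy=freeze
```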

Regards,
Andreas


  I have a cluster with 5 KVM hosts and a total of 200
  ocf:heartbeat:VirtualDomain resources defined to run
  across the 5 cluster nodes (symmetric-cluster is true for this cluster).

  The heartbeat network communicates over vlan1293, which hangs off
  network device 0230.

  In general, pacemaker does a good job of distributing my virtual guest
  resources evenly across the hypervisors
  in the cluster. These resources are a mixed bag:

  - "opaque" and remote "guest nodes" managed by the cluster.
  - allow-migrate=false and allow-migrate=true
  - qcow2 (file based) guests and LUN based guests
  - Sles and Ubuntu OS
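For context, resources like these are typically defined along the following lines. The resource name, config path, and options here are hypothetical examples, not taken from the actual cluster configuration:

```shell
# Define a libvirt-backed guest as a Pacemaker resource.
# config= points at the domain XML; allow-migrate=true enables live
# migration instead of stop/start on node moves.
pcs resource create zs95kjg110079_res ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/zs95kjg110079.xml \
    hypervisor="qemu:///system" \
    migration_transport=ssh \
    meta allow-migrate=true \
    op monitor interval=30s
```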

  [root at zs95kj ]# pcs status |less
  Cluster name: test_cluster_2
  Last updated: Mon Aug 29 17:02:08 2016 Last change: Mon Aug 29 16:37:31
  2016 by root via crm_resource on zs93kjpcs1
  Stack: corosync
  Current DC: zs95kjpcs1 (version 1.1.13-10.el7_2.ibm.1-44eb2dd) -
  partition with quorum
  103 nodes and 300 resources configured

  Node zs90kppcs1: standby
  Online: [ zs93KLpcs1 zs93kjpcs1 zs95KLpcs1 zs95kjpcs1 ]

  This morning, our system admin team performed a "non-disruptive"
  (concurrent) microcode load on the OSA, which (to our surprise) dropped
  the network connection for 13 seconds on the S93 CEC, from 11:18:34 to
  11:18:47, to be exact. This temporary outage caused the two cluster
  nodes on S93 (zs93kjpcs1 and zs93KLpcs1) to drop out of the cluster, as
  expected.

  However, pacemaker didn't handle this too well. The end result was
  numerous VirtualDomain resources in FAILED state:

  [root at zs95kj log]# date;pcs status |grep VirtualD |grep zs93 |grep FAILED
  Mon Aug 29 12:33:32 EDT 2016
  zs95kjg110104_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110092_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
  zs95kjg110099_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110102_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110106_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
  zs95kjg110112_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110115_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110118_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
  zs95kjg110124_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110127_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110130_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
  zs95kjg110136_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110139_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110142_res (ocf::heartbeat:VirtualDomain): FAILED zs93KLpcs1
  zs95kjg110148_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110152_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110155_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110161_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110164_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110167_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110173_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110176_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110179_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg110185_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1
  zs95kjg109106_res (ocf::heartbeat:VirtualDomain): FAILED zs93kjpcs1


  In addition, several VirtualDomain resources showed "Started" on two
  cluster nodes:

  zs95kjg110079_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
  zs93KLpcs1 ]
  zs95kjg110108_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
  zs93KLpcs1 ]
  zs95kjg110186_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
  zs93KLpcs1 ]
  zs95kjg110188_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
  zs93KLpcs1 ]
  zs95kjg110198_res (ocf::heartbeat:VirtualDomain): Started[ zs93kjpcs1
  zs93KLpcs1 ]


  The virtual machines themselves were, in fact, running on both hosts.
  For example:

  [root at zs93kl ~]# virsh list |grep zs95kjg110079
  70 zs95kjg110079 running

  [root at zs93kj cli]# virsh list |grep zs95kjg110079
  18 zs95kjg110079 running


  On this particular VM, there was file corruption of the file-based
  qcow2 guest's image, such that you could not ping or ssh to it, and if
  you opened a virsh console, you got an "initramfs" prompt.

  To recover, we had to mount the volume on another VM and then run fsck
  to repair it.
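For anyone hitting the same kind of image corruption: the repair can also be done directly on the host without attaching the volume to another VM, by exposing the qcow2 file as a block device. A sketch, assuming the guest is fully stopped everywhere first; the image path and partition number are illustrative:

```shell
# Load the network block device driver with partition support, then
# attach the qcow2 image to /dev/nbd0.
modprobe nbd max_part=8
qemu-nbd --connect=/dev/nbd0 /var/lib/libvirt/images/guest.qcow2

# Check and repair the guest's root filesystem (partition varies).
fsck -y /dev/nbd0p1

# Detach the image when done.
qemu-nbd --disconnect /dev/nbd0
```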

  I walked through the system logs on the two S93 hosts to see how
  zs95kjg110079 ended up running on two cluster nodes. (Some entries are
  omitted; I saved the logs for future reference.)


  zs93kjpcs1 system log - (shows membership changes after the network
  failure at 11:18:34)

  Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: The qeth device driver
  failed to recover an error on the device
  Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000000: 00 c2 40 17 01 51 90
  38 00 04 00 00 00 00 00 00 .. at ..Q.8........
  Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 ................
  Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 ................
  Aug 29 11:18:33 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00
  00 00 00 00 34 00 1f 00 07 ...........4....
  Aug 29 11:18:33 zs93kl kernel: qeth 0.0.0230: A recovery process has been
  started for the device
  Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] The token was lost in
  the OPERATIONAL state.
  Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] A processor failed,
  forming new configuration.
  Aug 29 11:18:33 zs93kl corosync[19281]: [TOTEM ] entering GATHER state
  from 2(The token was lost in the OPERATIONAL state.).
  Aug 29 11:18:34 zs93kl kernel: qeth 0.0.0230: The qeth device driver
  failed to recover an error on the device
  Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000000: 00 00 11 01 00 00 00
  00 00 04 00 00 00 00 00 00 ................
  Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000010: 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 ................
  Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000020: 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 ................
  Aug 29 11:18:34 zs93kl kernel: qeth: irb 00000030: 00 00 00 00 00 00 00
  00 00 00 00 00 00 00 00 00 ................


  Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node
  zs95kjpcs1[2] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95kjpcs1
  attributes for attrd_peer_change_cb
  Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node
  zs95kjpcs1[2] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95kjpcs1/2 from the
  membership list
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=2
  and/or uname=zs95kjpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95kjpcs1/2 from
  the membership list
  Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node
  zs95KLpcs1[3] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=2
  and/or uname=zs95kjpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs95KLpcs1/3 from the
  membership list
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node
  zs95KLpcs1[3] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=3
  and/or uname=zs95KLpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj cib[21397]: notice: crm_update_peer_proc: Node
  zs93KLpcs1[5] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Removing zs93KLpcs1/5 from the
  membership list
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering GATHER state
  from 0(consensus timeout).
  Aug 29 11:18:37 zs93kj cib[21397]: notice: Purged 1 peers with id=5
  and/or uname=zs93KLpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Creating commit token
  because I am the rep.
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Saving state aru 32 high
  seq received 32
  Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Storing new sequence id
  for ring 300
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering COMMIT state.
  Aug 29 11:18:37 zs93kj crmd[21402]: notice: Membership 768: quorum lost
  (1)
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs95KLpcs1
  attributes for attrd_peer_change_cb
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing zs95KLpcs1/3 from
  the membership list
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering RECOVERY state.
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] TRANS [0] member
  10.20.93.11:
  Aug 29 11:18:37 zs93kj pacemakerd[21143]: notice: Membership 768: quorum
  lost (1)
  Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc:
  Node zs95kjpcs1[2] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node
  zs95KLpcs1[3] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action
  on 3
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Purged 1 peers with id=3
  and/or uname=zs95KLpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Removing zs95kjpcs1/2
  from the membership list
  Aug 29 11:18:37 zs93kj crmd[21402]: notice: Stonith/shutdown of
  zs95KLpcs1 not matched
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] position [0] member
  10.20.93.11:
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node
  zs93KLpcs1[5] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: Purged 1 peers with
  id=2 and/or uname=zs95kjpcs1 from the membership cache
  Aug 29 11:18:37 zs93kj crmd[21402]: notice: crm_reap_unseen_nodes: Node
  zs95kjpcs1[2] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] previous ring seq 2fc
  rep 10.20.93.11
  Aug 29 11:18:37 zs93kj attrd[21400]: notice: Removing all zs93KLpcs1
  attributes for attrd_peer_change_cb
  Aug 29 11:18:37 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc:
  Node zs95KLpcs1[3] - state is now lost (was member)
  Aug 29 11:18:37 zs93kj crmd[21402]: warning: No match for shutdown action
  on 2
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] aru 32 high delivered 32
  received flag 1
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Did not need to
  originate any messages in recovery.
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] got commit token
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Sending initial ORF
  token
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0
  my set retrans flag0 retrans queue empty 1 count 0, aru 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high
  seq received 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0
  my set retrans flag0 retrans queue empty 1 count 1, aru 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high
  seq received 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0
  my set retrans flag0 retrans queue empty 1 count 2, aru 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high
  seq received 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] token retrans flag is 0
  my set retrans flag0 retrans queue empty 1 count 3, aru 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] install seq 0 aru 0 high
  seq received 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] retrans flag count 4
  token aru 0 install seq 0 aru 0 0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Resetting old ring state
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] recovery to regular 1-0
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
  10.20.93.12 inactive
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
  10.20.93.13 inactive
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
  10.20.93.14 inactive
  Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip
  (10.20.93.12)
  Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip
  (10.20.93.13)
  Aug 29 11:18:37 zs93kj corosync[20562]: [MAIN ] Member left: r(0) ip
  (10.20.93.14)
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] waiting_trans_ack
  changed to 1
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL
  state.
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] A new membership (
  10.20.93.11:768) was formed. Members left: 2 5 3
  Aug 29 11:18:37 zs93kj corosync[20562]: [TOTEM ] Failed to receive the
  leave message. failed: 2 5 3
  Aug 29 11:18:37 zs93kj corosync[20562]: [SYNC ] Committing
  synchronization for corosync configuration map access
  Aug 29 11:18:37 zs93kj corosync[20562]: [CMAP ] Not first sync -> no
  action
  Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip
  (10.20.93.11) ; members(old:4 left:3)
  Aug 29 11:18:37 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r
  (0) ip(10.20.93.11) ; members(old:4 left:3)


  Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
  10.20.93.12 active
  Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] Marking UDPU member
  10.20.93.14 active
  Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip
  (10.20.93.12)
  Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Member joined: r(0) ip
  (10.20.93.14)
  Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] entering OPERATIONAL
  state.
  Aug 29 11:18:43 zs93kj corosync[20562]: [TOTEM ] A new membership (
  10.20.93.11:772) was formed. Members joined: 2 3
  Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
  synchronization for corosync configuration map access
  Aug 29 11:18:43 zs93kj corosync[20562]: [CMAP ] Not first sync -> no
  action
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from
  node 0x1
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from
  node 0x2
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip
  (10.20.93.14) ; members(old:2 left:0)
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip
  (10.20.93.12) ; members(old:2 left:0)
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] comparing: sender r(0) ip
  (10.20.93.11) ; members(old:1 left:0)
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] chosen downlist: sender r
  (0) ip(10.20.93.12) ; members(old:2 left:0)
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] got joinlist message from
  node 0x3
  Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
  synchronization for corosync cluster closed process group service v1.01
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[0]
  group:crmd\x00, ip:r(0) ip(10.20.93.14) , pid:21491
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[1]
  group:attrd\x00, ip:r(0) ip(10.20.93.14) , pid:21489
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[2]
  group:stonith-ng\x00, ip:r(0) ip(10.20.93.14) , pid:21487
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[3]
  group:cib\x00, ip:r(0) ip(10.20.93.14) , pid:21486
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[4]
  group:pacemakerd\x00, ip:r(0) ip(10.20.93.14) , pid:21485
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[5]
  group:crmd\x00, ip:r(0) ip(10.20.93.12) , pid:24499
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[6]
  group:attrd\x00, ip:r(0) ip(10.20.93.12) , pid:24497
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[7]
  group:stonith-ng\x00, ip:r(0) ip(10.20.93.12) , pid:24495
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[8]
  group:cib\x00, ip:r(0) ip(10.20.93.12) , pid:24494
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[9]
  group:pacemakerd\x00, ip:r(0) ip(10.20.93.12) , pid:24491
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[10]
  group:crmd\x00, ip:r(0) ip(10.20.93.11) , pid:21402
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[11]
  group:attrd\x00, ip:r(0) ip(10.20.93.11) , pid:21400
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[12]
  group:stonith-ng\x00, ip:r(0) ip(10.20.93.11) , pid:21398
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[13]
  group:cib\x00, ip:r(0) ip(10.20.93.11) , pid:21397
  Aug 29 11:18:43 zs93kj corosync[20562]: [CPG ] joinlist_messages[14]
  group:pacemakerd\x00, ip:r(0) ip(10.20.93.11) , pid:21143
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] flags: quorate: No
  Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No
  QdeviceCastVote: No QdeviceMasterWins: No
  Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] IPC credentials
  authenticated (20562-21400-28)
  Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] connecting to client
  [21400]
  Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;
  real_size:1052672; rb->word_size:263168
  Aug 29 11:18:43 zs93kj corosync[20562]: [QB ] shm size:1048589;
  real_size:1052672; rb->word_size:263168
  Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice: Membership 772: quorum
  acquired (3)

  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] quorum regained,
  resuming activity
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] got nodeinfo message
  from cluster node 3
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] nodeinfo message[0]:
  votes: 0, expected: 0 flags: 0
  Aug 29 11:18:43 zs93kj corosync[20562]: [SYNC ] Committing
  synchronization for corosync vote quorum service v1.0
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] total_votes=3,
  expected_votes=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 1 state=1, votes=1,
  expected=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 2 state=1, votes=1,
  expected=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 3 state=1, votes=1,
  expected=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 4 state=2, votes=1,
  expected=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] node 5 state=2, votes=1,
  expected=5
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] lowest node id: 1 us: 1
  Aug 29 11:18:43 zs93kj corosync[20562]: [VOTEQ ] highest node id: 3 us: 1
  Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] This node is within the
  primary component and will provide service.
  Aug 29 11:18:43 zs93kj pacemakerd[21143]: notice:
  pcmk_quorum_notification: Node zs95KLpcs1[3] - state is now member (was
  lost)
  Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node
  zs95KLpcs1[3] - state is now member (was (null))
  Aug 29 11:18:43 zs93kj corosync[20562]: [QUORUM] Members[3]: 1 2 3
  Aug 29 11:18:43 zs93kj stonith-ng[21398]: warning: Node names with
  capitals are discouraged, consider changing 'zs95KLpcs1' to something
  else
  Aug 29 11:18:43 zs93kj corosync[20562]: [MAIN ] Completed service
  synchronization, ready to provide service.
  Aug 29 11:18:43 zs93kj stonith-ng[21398]: notice: crm_update_peer_proc:
  Node zs95KLpcs1[3] - state is now member (was (null))
  Aug 29 11:18:43 zs93kj attrd[21400]: notice: crm_update_peer_proc: Node
  zs95kjpcs1[2] - state is now member (was (null))




  The story of zs95kjg110079 starts on ZS93KL, while it was apparently
  already running on ZS93KJ -

  System log on zs93KLpcs1:

  Aug 29 11:20:58 zs93kl pengine[19997]: notice: Start
  zs95kjg110079_res#011(zs93KLpcs1)

  Aug 29 11:21:56 zs93kl crmd[20001]: notice: Initiating action 520: start
  zs95kjg110079_res_start_0 on zs93KLpcs1 (local)

  Aug 29 11:21:56 zs93kl systemd-machined: New machine
  qemu-70-zs95kjg110079.
  Aug 29 11:21:56 zs93kl systemd: Started Virtual Machine
  qemu-70-zs95kjg110079.
  Aug 29 11:21:56 zs93kl systemd: Starting Virtual Machine
  qemu-70-zs95kjg110079.

  Aug 29 11:21:59 zs93kl crmd[20001]: notice: Operation
  zs95kjg110079_res_start_0: ok (node=zs93KLpcs1, call=1036, rc=0,
  cib-update=735, confirmed=true)

  Aug 29 11:22:07 zs93kl crmd[20001]: warning: Action 238
  (zs95kjg110079_res_monitor_0) on zs93kjpcs1 failed (target: 7 vs. rc: 0):
  Error
  Aug 29 11:22:07 zs93kl crmd[20001]: notice: Transition aborted by
  zs95kjg110079_res_monitor_0 'create' on zs93kjpcs1: Event failed
  (magic=0:0;238:13:7:236d078a-9063-4092-9660-cfae048f3627,
  cib=0.2437.3212, source=match_graph_event:381, 0)

  Aug 29 11:22:15 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:22:15 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:22:15 zs93kl pengine[19997]: notice: Restart
  zs95kjg110079_res#011(Started zs93kjpcs1)

  Aug 29 11:22:23 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:22:23 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:22:23 zs93kl pengine[19997]: notice: Restart
  zs95kjg110079_res#011(Started zs93kjpcs1)


  Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:30:31 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:30:31 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:30:31 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.

  Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:55:41 zs93kl pengine[19997]: error: Resource zs95kjg110186_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:55:41 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.

  Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110186_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 11:58:53 zs93kl pengine[19997]: error: Resource zs95kjg110188_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 11:58:53 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.


  Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110186_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110188_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:00:00 zs93kl pengine[19997]: error: Resource zs95kjg110198_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:00:00 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.

  Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110186_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110188_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:03:24 zs93kl pengine[19997]: error: Resource zs95kjg110198_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:03:24 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:03:24 zs93kl pengine[19997]: notice: Restart
  zs95kjg110079_res#011(Started zs93kjpcs1)

  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110079_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110108_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110186_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110188_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110198_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: error: Resource zs95kjg110210_res
  (ocf::VirtualDomain) is active on 2 nodes attempting recovery
  Aug 29 12:36:27 zs93kl pengine[19997]: warning: See
  http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more
  information.
  Aug 29 12:36:27 zs93kl pengine[19997]: notice: Restart
  zs95kjg110079_res#011(Started zs93kjpcs1)


  Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition 84 (Complete=108,
  Pending=0, Fired=0, Skipped=0, Incomplete=77,
  Source=/var/lib/pacemaker/pengine/pe-error-106.bz2): Terminated
  Aug 29 12:44:41 zs93kl crmd[20001]: warning: Transition failed:
  terminated
  Aug 29 12:44:41 zs93kl crmd[20001]: notice: Graph 84 with 185 actions:
  batch-limit=185 jobs, network-delay=0ms
  Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 410]: Pending rsc op
  zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 409)
  Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 409]: Pending rsc op
  zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 408)
  Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 408]: Pending pseudo
  op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 439 470 496 521
  546)
  Aug 29 12:44:41 zs93kl crmd[20001]: notice: [Action 407]: Completed
  pseudo op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: none)

  Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 428: stop
  zs95kjg110079_res_stop_0 on zs93kjpcs1
  Aug 29 12:59:42 zs93kl crmd[20001]: notice: Initiating action 495: stop
  zs95kjg110108_res_stop_0 on zs93kjpcs1
  Aug 29 12:59:44 zs93kl crmd[20001]: notice: Initiating action 660: stop
  zs95kjg110186_res_stop_0 on zs93kjpcs1

  Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 431]: Pending rsc op
  zs95kjg110079_res_monitor_30000 on zs93kjpcs1 (priority: 0, waiting: 430)
  Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 430]: Pending rsc op
  zs95kjg110079_res_start_0 on zs93kjpcs1 (priority: 0, waiting: 429)
  Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 429]: Pending pseudo
  op zs95kjg110079_res_stop_0 on N/A (priority: 0, waiting: 460 491 517 542
  567)
  Aug 29 13:00:04 zs93kl crmd[20001]: notice: [Action 428]: Completed rsc
  op zs95kjg110079_res_stop_0 on zs93kjpcs1 (priority: 0, waiting: none)




  System log on zs93kjpcs1:


  Aug 29 11:20:48 zs93kj crmd[21402]: notice: Recurring action
  zs95kjg110079_res:817 (zs95kjg110079_res_monitor_30000) incomplete at
  shutdown

  Aug 29 11:22:07 zs93kj crmd[259639]: notice: Operation
  zs95kjg110079_res_monitor_0: ok (node=zs93kjpcs1, call=1223, rc=0,
  cib-update=104, confirmed=true)

  Aug 29 12:59:42 zs93kj VirtualDomain(zs95kjg110079_res)[9148]: INFO:
  Issuing graceful shutdown request for domain zs95kjg110079.

  Finally, zs95kjg110079 shuts down on ZS93KJ at 12:59.


  ===================

  Does this "active on two nodes" recovery process look right?

  What is the recommended procedure to "undo" the resource failures and
  dual host assignments? Short of stopping and starting the entire
  cluster, it took several hours to recover them... resource disable,
  cleanup, and enable was the basic approach, but it seemed that each
  time I fixed one resource, two more would fall out.
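The disable/cleanup/enable cycle can be looped over the affected resources; a rough sketch, not verified against this cluster, and assuming the failed and doubly-active resources appear in `pcs status` as shown earlier. Before re-enabling, the guest's state should be confirmed with `virsh list` on each host:

```shell
# For each VirtualDomain resource reported FAILED or Started on two
# nodes, disable it, clear its failure history, then re-enable it.
for res in $(pcs status | awk '/VirtualDomain.*(FAILED|Started\[)/ {print $1}'); do
    pcs resource disable "$res"
    pcs resource cleanup "$res"
    pcs resource enable "$res"
done
```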

  This seems to be one of the pitfalls of configuring resources in
  symmetric mode.

  I would appreciate any best-practice guidelines you have to offer. I
  saved the system logs on all hosts in case anyone needs more detailed
  information; I also have the pacemaker.log files.

  Thanks in advance!



  Scott Greenlese ... IBM z/BX Solutions Test, Poughkeepsie, N.Y.
  INTERNET: swgreenl at us.ibm.com
  PHONE: 8/293-7301 (845-433-7301) M/S: POK 42HA/P966
  _______________________________________________
  Users mailing list: Users at clusterlabs.org
  http://clusterlabs.org/mailman/listinfo/users

  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org



