[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Ken Gaillot kgaillot at redhat.com
Mon Mar 27 18:10:36 EDT 2017


On 03/27/2017 03:54 PM, Seth Reid wrote:
> 
> 
> 
> On Fri, Mar 24, 2017 at 2:10 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>     On 03/24/2017 03:52 PM, Digimer wrote:
>     > On 24/03/17 04:44 PM, Seth Reid wrote:
>     >> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
>     >> production yet because I'm having a problem during fencing. When I
>     >> disable the network interface of any one machine, the disabled machine
>     >> is properly fenced, leaving me, briefly, with a two node cluster. A
>     >> second node is then fenced off immediately, and the remaining node
>     >> appears to try to fence itself off. This leaves two nodes with
>     >> corosync/pacemaker stopped, and the remaining machine still in the
>     >> cluster but showing an offline node and an UNCLEAN node. What could be
>     >> causing this behavior?
>     >
>     > It looks like the fence attempt failed, leaving the cluster hung. When
>     > you say all nodes were fenced, did all nodes actually reboot? Or did the
>     > two surviving nodes just lock up? If the latter, then that is the proper
>     > response to a failed fence (DLM stays blocked).
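
For anyone following along: dlm_tool on a surviving node will show whether a
lockspace is blocked waiting for fencing. A rough sketch, assuming the dlm
userland tools are installed:

    # Inspect DLM lockspaces; one stuck waiting for fencing means DLM
    # stays blocked until the fence of the lost node is confirmed
    dlm_tool ls
    dlm_tool status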
> 
>     See comments inline ...
> 
>     >
>     >> Each machine has a dedicated network interface for the cluster, and
>     >> there is a vlan on the switch devoted to just these interfaces.
>     >> In the following, I disabled the interface on node id 2 (b014). Node 1
>     >> (b013) is fenced as well. Node 3 (b015) is still up.
>     >>
>     >> Logs from b013:
>     >> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
>     >> /dev/null && debian-sa1 1 1)
>     >> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor
>     >> failed, forming new configuration.
>     >> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed,
>     >> forming new configuration.
>     >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new
>     >> membership (192.168.100.13:576) was formed. Members left: 2
>     >> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to
>     >> receive the leave message. failed: 2
>     >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership
>     >> (192.168.100.13:576) was formed. Members left: 2
>     >> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the
>     >> leave message. failed: 2
>     >> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
>     >> b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
>     >> membership list
>     >> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2
>     >> and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2
>     >> from the membership list
>     >> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2
>     >> and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing
>     >> b014-cl/2 from the membership list
>     >> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with
>     >> id=2 and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid
>     >> 19223 nodedown time 1490387717 fence_all dlm_stonith
>     >> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing
>     >> connection to node 2
>     >> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0
>     >> entries for 2/(null): 0 in progress, 0 completed
>     >> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
>     >> b014-cl by b015-cl for stonith-api.19223 at b013-cl.7aeb2ffb: OK
>     >> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node
>     >> 2/(null) kicked: reboot
> 
>     It looks like the fencing of b014-cl is reported as successful above ...
> 
>     >> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to
>     >> node 3
>     >> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to
>     >> node 1
>     >> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
>     >> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
>     >> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
>     >> lockspaces
>     >> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
>     >> code=exited, status=255/n/a
> 
>     ... but then DLM and corosync exit on this node. Pacemaker can only
>     exit, and the node gets fenced.
> 
>     What does your fencing configuration look like?
> 
> 
> This is the command I used. b013-cl, for example, is a hosts file entry
> so that the cluster uses only the dedicated cluster interface.
> 
> pcs stonith create fence_wh fence_scsi
> debug="/var/log/cluster/fence-debug.log" vgs_path="/sbin/vgs"
> sg_persist_path="/usr/bin/sg_persist" sg_turs_path="/usr/bin/sg_turs"
> pcmk_reboot_action="off" pcmk_host_list="b013-cl b014-cl b015-cl"
> pcmk_monitor_action="metadata" meta provides="unfencing" --force
> 
> I got the pcmk_monitor_action, pcmk_host_list, pcmk_reboot_action, and
> --force from various Red Hat articles. I've tried getting fencing to
> start without these, and it doesn't work.

It looks good to me. Not sure what's going wrong.
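
To double-check that the device is registered the way you intend, a quick
sketch (assuming pcs 0.9 and the device name fence_wh from your command):

    # Show the full stonith resource definition as the cluster sees it
    pcs stonith show fence_wh --full

    # Ask stonith-ng what it has registered and query the device's status
    stonith_admin --list-registered
    stonith_admin --query fence_wh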

The big question is why DLM and corosync are exiting after another node
is fenced. Pacemaker is reacting properly once that happens.
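
If it were my cluster, I'd start by checking whether dlm_controld is
configured to request its own fencing alongside Pacemaker's, and what the
quorum state looks like when the node drops out. A rough sketch, assuming
stock file locations:

    # dlm_controld fencing options; enable_fencing=0 would leave fencing
    # entirely to Pacemaker
    cat /etc/dlm/dlm.conf

    # current quorum state and corosync quorum settings
    corosync-quorumtool -s
    grep -A5 quorum /etc/corosync/corosync.conf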

>     >> Mar 24 16:35:18 b013 cib[2220]:    error: Connection to the CPG API
>     >> failed: Library error (2)
>     >> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered
>     >> failed state.
>     >> Mar 24 16:35:18 b013 attrd[2223]:    error: Connection to cib_rw
>     >> failed
>     >> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
>     >> 'exit-code'.
>     >> Mar 24 16:35:18 b013 attrd[2223]:    error: Connection to
>     >> cib_rw[0x560754147990] closed (I/O condition=17)
>     >> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process
>     >> exited, code=exited, status=107/n/a
>     >> Mar 24 16:35:18 b013 pacemakerd[2187]:    error: Connection to
>     >> the CPG API failed: Library error (2)
>     >> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered
>     >> failed state.
>     >> Mar 24 16:35:18 b013 attrd[2223]:   notice: Disconnecting client
>     >> 0x560754149000, pid=2227...
>     >> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Failed with
>     >> result 'exit-code'.
>     >> Mar 24 16:35:18 b013 lrmd[2222]:  warning: new_event_notification
>     >> (2222-2227-8): Bad file descriptor (9)
>     >> Mar 24 16:35:18 b013 stonith-ng[2221]:    error: Connection to
>     >> cib_rw failed
>     >> Mar 24 16:35:18 b013 stonith-ng[2221]:    error: Connection to
>     >> cib_rw[0x5579c03ecdd0] closed (I/O condition=17)
>     >> Mar 24 16:35:18 b013 lrmd[2222]:    error: Connection to
>     >> stonith-ng failed
>     >> Mar 24 16:35:18 b013 lrmd[2222]:    error: Connection to
>     >> stonith-ng[0x55888c8ef820] closed (I/O condition=17)
>     >> Mar 24 16:37:02 b013 kernel: [ 3196.469475] dlm: node 0: socket error
>     >> sending to node 2, port 21064, sk_err=113/113
>     >> Mar 24 16:37:02 b013 kernel: [ 3196.470675] dlm: node 0: socket error
>     >> sending to node 2, port 21064, sk_err=113/113
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.833544] INFO: task
>     >> gfs2_quotad:3054 blocked for more than 120 seconds.
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.834565]       Not tainted
>     >> 4.4.0-66-generic #87-Ubuntu
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.835413] "echo 0 >
>     >> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836656] gfs2_quotad     D
>     >> ffff880fd747fa38     0  3054      2 0x00000000
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836663]  ffff880fd747fa38
>     >> 00000001d8144018 ffff880fd975f2c0 ffff880fd7a972c0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836666]  ffff880fd7480000
>     >> ffff887fd81447b8 ffff887fd81447d0 ffff881fd7af00b0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836669]  0000000000000004
>     >> ffff880fd747fa50 ffffffff818384d5 ffff880fd7a972c0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836672] Call Trace:
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836688]  [<ffffffff818384d5>]
>     >> schedule+0x35/0x80
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836695]  [<ffffffff8183b380>]
>     >> rwsem_down_read_failed+0xe0/0x140
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836701]  [<ffffffff81406574>]
>     >> call_rwsem_down_read_failed+0x14/0x30
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836704]  [<ffffffff8183a920>] ?
>     >> down_read+0x20/0x30
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836726]  [<ffffffffc0583324>]
>     >> dlm_lock+0x84/0x1f0 [dlm]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836731]  [<ffffffff810b57e3>] ?
>     >> check_preempt_wakeup+0x193/0x220
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836755]  [<ffffffffc06a5da0>] ?
>     >> gdlm_recovery_result+0x130/0x130 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836764]  [<ffffffffc06a5050>] ?
>     >> gdlm_cancel+0x30/0x30 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836769]  [<ffffffff810ab579>] ?
>     >> ttwu_do_wakeup+0x19/0xe0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836779]  [<ffffffffc06a5499>]
>     >> gdlm_lock+0x1d9/0x300 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836788]  [<ffffffffc06a5050>] ?
>     >> gdlm_cancel+0x30/0x30 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836798]  [<ffffffffc06a5da0>] ?
>     >> gdlm_recovery_result+0x130/0x130 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836807]  [<ffffffffc0686e5f>]
>     >> do_xmote+0x16f/0x290 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836816]  [<ffffffffc068705c>]
>     >> run_queue+0xdc/0x2d0 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836824]  [<ffffffffc06875ef>]
>     >> gfs2_glock_nq+0x20f/0x410 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836834]  [<ffffffffc06a2006>]
>     >> gfs2_statfs_sync+0x76/0x1c0 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836841]  [<ffffffff810ed018>] ?
>     >> del_timer_sync+0x48/0x50
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836851]  [<ffffffffc06a1ffc>] ?
>     >> gfs2_statfs_sync+0x6c/0x1c0 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836861]  [<ffffffffc0697fe3>]
>     >> quotad_check_timeo.part.18+0x23/0x80 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836871]  [<ffffffffc069ad01>]
>     >> gfs2_quotad+0x241/0x2d0 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836876]  [<ffffffff810c41e0>] ?
>     >> wake_atomic_t_function+0x60/0x60
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836886]  [<ffffffffc069aac0>] ?
>     >> gfs2_wake_up_statfs+0x40/0x40 [gfs2]
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836890]  [<ffffffff810a0ba8>]
>     >> kthread+0xd8/0xf0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836893]  [<ffffffff810a0ad0>] ?
>     >> kthread_create_on_node+0x1e0/0x1e0
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836897]  [<ffffffff8183c98f>]
>     >> ret_from_fork+0x3f/0x70
>     >> Mar 24 16:37:46 b013 kernel: [ 3240.836900]  [<ffffffff810a0ad0>] ?
>     >> kthread_create_on_node+0x1e0/0x1e0
>     >>
>     >> Logs from b015:
>     >> Mar 24 16:35:01 b015 CRON[19781]: (root) CMD (command -v debian-sa1 >
>     >> /dev/null && debian-sa1 1 1)
>     >> Mar 24 16:35:13 b015 corosync[2105]: notice  [TOTEM ] A processor
>     >> failed, forming new configuration.
>     >> Mar 24 16:35:13 b015 corosync[2105]:  [TOTEM ] A processor failed,
>     >> forming new configuration.
>     >> Mar 24 16:35:17 b015 corosync[2105]: notice  [TOTEM ] A new
>     >> membership (192.168.100.13:576) was formed. Members left: 2
>     >> Mar 24 16:35:17 b015 corosync[2105]: notice  [TOTEM ] Failed to
>     >> receive the leave message. failed: 2
>     >> Mar 24 16:35:17 b015 corosync[2105]:  [TOTEM ] A new membership
>     >> (192.168.100.13:576) was formed. Members left: 2
>     >> Mar 24 16:35:17 b015 corosync[2105]:  [TOTEM ] Failed to receive the
>     >> leave message. failed: 2
>     >> Mar 24 16:35:17 b015 attrd[2253]:   notice: crm_update_peer_proc:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b015 attrd[2253]:   notice: Removing b014-cl/2
>     >> from the membership list
>     >> Mar 24 16:35:17 b015 attrd[2253]:   notice: Purged 1 peers with id=2
>     >> and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b015 stonith-ng[2251]:   notice: crm_update_peer_proc:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b015 stonith-ng[2251]:   notice: Removing
>     >> b014-cl/2 from the membership list
>     >> Mar 24 16:35:17 b015 cib[2249]:   notice: crm_update_peer_proc: Node
>     >> b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b015 crmd[2255]:   notice: State transition S_IDLE ->
>     >> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>     >> origin=abort_transition_graph ]
>     >> Mar 24 16:35:17 b015 kernel: [ 3478.622093] dlm: closing
>     >> connection to node 2
>     >> Mar 24 16:35:17 b015 stonith-ng[2251]:   notice: Purged 1 peers with
>     >> id=2 and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b015 cib[2249]:   notice: Removing b014-cl/2 from the
>     >> membership list
>     >> Mar 24 16:35:17 b015 cib[2249]:   notice: Purged 1 peers with id=2
>     >> and/or uname=b014-cl from the membership cache
>     >> Mar 24 16:35:17 b015 crmd[2255]:   notice: crm_reap_unseen_nodes:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:17 b015 pacemakerd[2159]:   notice: crm_reap_unseen_nodes:
>     >> Node b014-cl[2] - state is now lost (was member)
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 stonith-ng[2251]:  warning: fence_scsi[19818]
>     >> stderr: [ WARNING:root:Parse error: Ignoring unknown option
>     >> 'port=b014-cl' ]
>     >> Mar 24 16:35:18 b015 stonith-ng[2251]:  warning: fence_scsi[19818]
>     >> stderr: [  ]
>     >> Mar 24 16:35:18 b015 stonith-ng[2251]:   notice: Operation 'reboot'
>     >> [19818] (call 2 from stonith-api.19223) for host 'b014-cl' with
>     >> device 'fence_wh' returned: 0 (OK)
>     >> Mar 24 16:35:18 b015 stonith-ng[2251]:   notice: Operation reboot of
>     >> b014-cl by b015-cl for stonith-api.19223 at b013-cl.7aeb2ffb: OK
>     >> Mar 24 16:35:18 b015 dlm_controld[2656]: 3479 fence request 2 pid
>     >> 19880 nodedown time 1490387717 fence_all dlm_stonith
>     >> Mar 24 16:35:18 b015 dlm_controld[2656]: 3479 tell corosync to remove
>     >> nodeid 1 from cluster
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-scsi\x2d36782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 systemd[1]:
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device: Dev
>     >> dev-disk-by\x2did-wwn\x2d0x6782bcb0007085a70000081958aee1ff.device
>     >> appeared twice with different sysfs paths
>     >> /sys/devices/pci0000:00/0000:00:03.0/0000:08:00.0/host7/port-7:0/end_device-7:0/target7:0:0/7:0:0:0/block/sdc
>     >> and /sys/devices/virtual/block/dm-0
>     >> Mar 24 16:35:18 b015 dlm_controld[2656]: 3479 tell corosync to remove
>     >> nodeid 1 from cluster
>     >> Mar 24 16:35:18 b015 dlm_stonith: stonith_api_time: Found 2
>     >> entries for 2/(null): 0 in progress, 2 completed
>     >> Mar 24 16:35:18 b015 dlm_stonith: stonith_api_time: Node 2/(null)
>     >> last kicked at: 1490387718
>     >> Mar 24 16:35:18 b015 kernel: [ 3479.266118] dlm: closing
>     >> connection to node 1
>     >> Mar 24 16:35:18 b015 kernel: [ 3479.266270] dlm: closing
>     >> connection to node 3
>     >> Mar 24 16:35:18 b015 dlm_controld[2656]: 3479 abandoned lockspace
>     >> share_data
>     >> Mar 24 16:35:18 b015 dlm_controld[2656]: 3479 abandoned lockspace
>     >> clvmd
>     >> Mar 24 16:35:18 b015 kernel: [ 3479.268325] dlm: dlm user daemon
>     >> left 2 lockspaces
>     >> Mar 24 16:35:21 b015 corosync[2105]: notice  [TOTEM ] A processor
>     >> failed, forming new configuration.
>     >> Mar 24 16:35:21 b015 corosync[2105]:  [TOTEM ] A processor failed,
>     >> forming new configuration.
>     >> Mar 24 16:35:26 b015 corosync[2105]: notice  [TOTEM ] A new
>     >> membership (192.168.100.15:580) was formed. Members left: 1
>     >> Mar 24 16:35:26 b015 corosync[2105]: notice  [TOTEM ] Failed to
>     >> receive the leave message. failed: 1
>     >> Mar 24 16:35:26 b015 corosync[2105]:  [TOTEM ] A new membership
>     >> (192.168.100.15:580) was formed. Members left: 1
>     >> Mar 24 16:35:26 b015 corosync[2105]:  [TOTEM ] Failed to receive the
>     >> leave message. failed: 1
>     >> Mar 24 16:35:26 b015 attrd[2253]:   notice: crm_update_peer_proc:
>     >> Node b013-cl[1] - state is now lost (was member)
>     >> Mar 24 16:35:26 b015 attrd[2253]:   notice: Removing b013-cl/1
>     >> from the membership list
>     >> Mar 24 16:35:26 b015 stonith-ng[2251]:   notice: crm_update_peer_proc:
>     >> Node b013-cl[1] - state is now lost (was member)
>     >> Mar 24 16:35:26 b015 attrd[2253]:   notice: Purged 1 peers with id=1
>     >> and/or uname=b013-cl from the membership cache
>     >> Mar 24 16:35:26 b015 stonith-ng[2251]:   notice: Removing
>     >> b013-cl/1 from the membership list
>     >> Mar 24 16:35:26 b015 pacemakerd[2159]:   notice: Membership 580:
>     >> quorum lost (1)
>     >> Mar 24 16:35:26 b015 cib[2249]:   notice: crm_update_peer_proc: Node
>     >> b013-cl[1] - state is now lost (was member)
>     >> Mar 24 16:35:26 b015 stonith-ng[2251]:   notice: Purged 1 peers with
>     >> id=1 and/or uname=b013-cl from the membership cache
>     >> Mar 24 16:35:26 b015 pacemakerd[2159]:   notice: crm_reap_unseen_nodes:
>     >> Node b013-cl[1] - state is now lost (was member)
>     >> Mar 24 16:35:26 b015 cib[2249]:   notice: Removing b013-cl/1 from the
>     >> membership list
>     >> Mar 24 16:35:26 b015 cib[2249]:   notice: Purged 1 peers with id=1
>     >> and/or uname=b013-cl from the membership cache
>     >> Mar 24 16:35:26 b015 crmd[2255]:   notice: Membership 580: quorum
>     >> lost (1)
>     >> Mar 24 16:35:26 b015 crmd[2255]:   notice: crm_reap_unseen_nodes:
>     >> Node b013-cl[1] - state is now lost (was member)
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: We do not have quorum -
>     >> fencing and resource management disabled
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Node b013-cl is unclean
>     >> because the node is no longer part of the cluster
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Node b013-cl is unclean
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action dlm:1_stop_0 on
>     >> b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action dlm:1_stop_0 on
>     >> b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action
>     >> clvmd:1_stop_0 on b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action
>     >> clvmd:1_stop_0 on b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action
>     >> gfs2share:1_stop_0 on b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Action
>     >> gfs2share:1_stop_0 on b013-cl is unrunnable (offline)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Node b013-cl is
>     >> unclean!
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: Cannot fence
>     >> unclean nodes until quorum is attained (or no-quorum-policy is set
>     >> to ignore)
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: Start
>     >> fence_wh#011(b015-cl - blocked)
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: Stop
>     >> dlm:1#011(b013-cl - blocked)
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: Stop
>     >> clvmd:1#011(b013-cl - blocked)
>     >> Mar 24 16:35:26 b015 pengine[2254]:   notice: Stop
>     >> gfs2share:1#011(b013-cl - blocked)
>     >> Mar 24 16:35:26 b015 pengine[2254]:  warning: Calculated
>     >> Transition 9: /var/lib/pacemaker/pengine/pe-warn-2669.bz2
>     >> Mar 24 16:35:26 b015 crmd[2255]:   notice: Transition 9 (Complete=6,
>     >> Pending=0, Fired=0, Skipped=0, Incomplete=0,
>     >> Source=/var/lib/pacemaker/pengine/pe-warn-2669.bz2): Complete
>     >> Mar 24 16:35:26 b015 crmd[2255]:   notice: State transition
>     >> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
>     >> origin=notify_crmd ]
>     >> Mar 24 16:35:31 b015 controld(dlm)[20000]: ERROR: Uncontrolled
>     >> lockspace exists, system must reboot. Executing suicide fencing
>     >> Mar 24 16:35:31 b015 fence_scsi: Failed: keys cannot be same. You can
>     >> not fence yourself.
>     >> Mar 24 16:35:31 b015 fence_scsi: Please use '-h' for usage
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [ WARNING:root:Parse error: Ignoring unknown option
>     >> 'port=b015-cl' ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [  ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [ ERROR:root:Failed: keys cannot be same. You can not fence
>     >> yourself. ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [  ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [ Failed: keys cannot be same. You can not fence yourself. ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [  ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [ ERROR:root:Please use '-h' for usage ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [  ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [ Please use '-h' for usage ]
>     >> Mar 24 16:35:31 b015 stonith-ng[2251]:  warning: fence_scsi[20020]
>     >> stderr: [  ]
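
The "keys cannot be same" error above is fence_scsi refusing to drop the
node's own SCSI reservation key, i.e. a node cannot self-fence with this
agent. To see which keys are actually registered on the shared device, a
sketch (with /dev/sdc standing in for whatever device backs the cluster):

    # List registered SCSI-3 persistent reservation keys; each cluster
    # node should hold its own key after unfencing
    sg_persist --no-inquiry --in --read-keys --device=/dev/sdc
    # and show the current reservation
    sg_persist --no-inquiry --in --read-reservation --device=/dev/sdc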
>     >>
>     >>
>     >>
>     >> Software versions:
>     >> corosync          2.3.5-3ubuntu1
>     >> pacemaker-common  1.1.14-2ubuntu1.1
>     >> pcs               0.9.149-1ubuntu1
>     >> libqb0:amd64      1.0-1ubuntu1
>     >> gfs2-utils        3.1.6-0ubuntu3
>     >>
>     >>
>     >> -------
>     >> Seth Reid
>     >> System Operations Engineer
>     >> Vendini, Inc.
>     >> 415.349.7736
>     >> sreid at vendini.com
>     >> www.vendini.com



