[ClusterLabs] Antw: corosync/dlm fencing?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Jul 16 02:33:46 EDT 2018
Hi!
I don't run SLES 12, but in SLES 11 some RAs tried to unload kernel modules
during stop, and if kernel updates had been installed in the meantime, that
unload failed and the cluster panicked (i.e. it started to fence the node). In
your case the logs should make it clear why the cluster wants to fence nodes.
Try "grep -i fence /var/log/messages" (or the equivalent in pacemaker's log).
Regards,
Ulrich
>>> Philipp Achmüller <philipp.achmueller at arz.at> schrieb am 15.07.2018 um
15:39 in
Nachricht <OF678C848F.04D4334F-ONC12582CB.00473374-C12582CB.004AFD3B at arz.at>:
> Hi!
>
> I have a 4-node cluster running on SLES12 SP3
> - pacemaker-1.1.16-4.8.x86_64
> - corosync-2.3.6-9.5.1.x86_64
>
> following configuration:
>
> Stack: corosync
> Current DC: sitea-2 (version 1.1.16-4.8-77ea74d) - partition with quorum
> Last updated: Sun Jul 15 15:00:55 2018
> Last change: Sat Jul 14 18:54:50 2018 by root via crm_resource on sitea-1
>
> 4 nodes configured
> 23 resources configured
>
> Node sitea-1: online
> 1 (ocf::pacemaker:controld): Active
> 1 (ocf::lvm2:clvmd): Active
> 1 (ocf::pacemaker:SysInfo): Active
> 5 (ocf::heartbeat:VirtualDomain): Active
> 1 (ocf::heartbeat:LVM): Active
> Node siteb-1: online
> 1 (ocf::pacemaker:controld): Active
> 1 (ocf::lvm2:clvmd): Active
> 1 (ocf::pacemaker:SysInfo): Active
> 1 (ocf::heartbeat:VirtualDomain): Active
> 1 (ocf::heartbeat:LVM): Active
> Node sitea-2: online
> 1 (ocf::pacemaker:controld): Active
> 1 (ocf::lvm2:clvmd): Active
> 1 (ocf::pacemaker:SysInfo): Active
> 3 (ocf::heartbeat:VirtualDomain): Active
> 1 (ocf::heartbeat:LVM): Active
> Node siteb-2: online
> 1 (ocf::pacemaker:ClusterMon): Active
> 3 (ocf::heartbeat:VirtualDomain): Active
> 1 (ocf::pacemaker:SysInfo): Active
> 1 (stonith:external/sbd): Active
> 1 (ocf::lvm2:clvmd): Active
> 1 (ocf::heartbeat:LVM): Active
> 1 (ocf::pacemaker:controld): Active
> ----
> and these ordering/colocation constraints:
> ...
> group base-group dlm clvm vg1
> clone base-clone base-group \
> meta interleave=true target-role=Started ordered=true
> colocation colocation-VM-base-clone-INFINITY inf: VM base-clone
> order order-base-clone-VM-mandatory base-clone:start VM:start
> ...
>
> For maintenance I would like to put 1 or 2 nodes from "sitea" into standby so
> that all resources move off those two machines.
> Everything works fine until dlm stops as the last resource on these nodes;
> then dlm_controld sends a fence request - sometimes to the remaining online
> nodes, so that only 1 node is left online in the cluster....
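> A rough sketch of what I mean, assuming crmsh and the node names from the
> status above:
>
> crm node standby sitea-1
> crm node standby sitea-2
> # ... maintenance ...
> crm node online sitea-1
> crm node online sitea-2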
>
> messages:
>
> ....
> 2018-07-14T14:38:56.441157+02:00 siteb-1 dlm_controld[39725]: 678 fence
> request 3 pid 54428 startup time 1531571371 fence_all dlm_stonith
> 2018-07-14T14:38:56.445284+02:00 siteb-1 dlm_stonith: stonith_api_time:
> Found 0 entries for 3/(null): 0 in progress, 0 completed
> 2018-07-14T14:38:56.446033+02:00 siteb-1 stonith-ng[8085]: notice:
> Client stonith-api.54428.ee6a7e02 wants to fence (reboot) '3' with device
> '(any)'
> 2018-07-14T14:38:56.446294+02:00 siteb-1 stonith-ng[8085]: notice:
> Requesting peer fencing (reboot) of sitea-2
> ...
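> (The '3' in those messages is a corosync nodeid; a quick way to map it to a
> node name, assuming the standard pacemaker/corosync tools:)
>
> crm_node -l                        # nodeid, node name and membership state
> corosync-cmapctl | grep nodelist   # nodeids from corosync.conf, if a nodelist is configured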
>
> # dlm_tool dump_config
> daemon_debug=0
> foreground=0
> log_debug=0
> timewarn=0
> protocol=detect
> debug_logfile=0
> enable_fscontrol=0
> enable_plock=1
> plock_debug=0
> plock_rate_limit=0
> plock_ownership=0
> drop_resources_time=10000
> drop_resources_count=10
> drop_resources_age=10000
> post_join_delay=30
> enable_fencing=1
> enable_concurrent_fencing=0
> enable_startup_fencing=0
> repeat_failed_fencing=1
> enable_quorum_fencing=1
> enable_quorum_lockspace=1
> help=-1
> version=-1
>
> How can I find out what is happening here?
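> A sketch of the dlm-side commands that might show more detail, assuming the
> stock dlm userspace tools:
>
> dlm_tool dump     # dlm_controld's internal debug buffer (includes the fence requests)
> dlm_tool ls       # lockspaces with their members and change/recovery state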