[ClusterLabs] Antw: SBD fencing and crashkernel question
kwenning at redhat.com
Mon Oct 21 05:59:30 EDT 2019
On 10/21/19 8:31 AM, Ulrich Windl wrote:
>>>> Strahil Nikolov <hunter86_bg at yahoo.com> wrote on 20.10.2019 at 01:03 in
> message <1223585818.2655058.1571526232579 at mail.yahoo.com>:
>> Hello Community,
>> I have a question about the stack in newer versions compared to our SLES 11
>> openais stack. Can someone clarify whether a node with SBD will invoke a
>> crashkernel before killing itself?
>> According to my tests on SLES 11, when another node kills the unresponsive
>> one, the crashkernel is invoked and a dump is present in /var/crash, but if the
>> node gets stuck for some reason (naughty admin), there is no sign of a crash
>> (checked on the iLO to be sure).
Can't help with SLES specifics here, but the difference between
the two cases you describe is probably this: in one case sbd-daemon
is still alive enough to call a reboot, write to sysrq-trigger, or do
whatever else is configured. (Are you using poison-pill? You can configure
what should happen when sbd-daemon triggers the timeout-action; with current
sbd this is handled consistently as long as sbd-daemon is alive.)
In the other case it is probably a hardware watchdog kicking in, which
would reset the node without the kernel getting a chance to write a dump.
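
As a rough illustration of the "configure what should happen" part: a minimal Python sketch that reads a sysconfig-style file and reports whether the timeout-action asks for a crash dump. It assumes the installed sbd supports the SBD_TIMEOUT_ACTION setting (a comma-separated combination such as flush,crashdump) and that the default is flush,reboot; please check the sbd man page for your version. The helper names are made up for this example.

```python
def parse_sysconfig(text):
    """Parse KEY=VALUE lines from sysconfig-style text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def timeout_action_requests_crashdump(text, default="flush,reboot"):
    """Return True if SBD_TIMEOUT_ACTION contains 'crashdump'.

    Assumption: when the variable is unset, sbd behaves like 'flush,reboot'.
    """
    action = parse_sysconfig(text).get("SBD_TIMEOUT_ACTION", default)
    return "crashdump" in [part.strip() for part in action.split(",")]
```

One would feed it the contents of the sbd sysconfig file (commonly /etc/sysconfig/sbd). Note that this only covers the case where sbd-daemon itself acts; if the hardware watchdog fires, this setting does not come into play.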
>> I'm not sure whether this behaviour is the same on newer software versions (SLES
>> 12/15) and whether I can work around it, as we still struggle to find the reason
>> why our clusters fence in a very specific situation (the clusters are using
>> MDADM RAID1 arrays in a dual-DC environment instead of SAN replication) where the
>> remote DC is unavailable for 20-30s until SAN/network is rerouted. We have
>> enabled crashdump on some of the systems, but we are pending a reboot and
>> then a real DC<->DC connectivity outage to gather valuable info, as corosync is
>> using dual rings and is not affected, SBD is using survive on pacemaker, and
>> we suspect that the nodes suicide.
>> Best Regards, Strahil Nikolov
> So basically you want to know why your node is fenced? I couldn't quite understand the environment you set up, nor what types of problems you are seeing.
> Actually, in times of many gigabytes of RAM I see little sense in crash dumps, because they just take a lot of time to complete.