[ClusterLabs] Antw: SBD fencing and crashkernel question

Mon Oct 21 05:59:30 EDT 2019

On 10/21/19 8:31 AM, Ulrich Windl wrote:
>>>> Strahil Nikolov <hunter86_bg at yahoo.com> wrote on 20.10.2019 at 01:03 in
> message <1223585818.2655058.1571526232579 at mail.yahoo.com>:
>> Hello Community,
>> I have a question about the stack in newer versions compared to our SLES 11
>> openais stack. Can someone clarify whether a node with SBD will invoke a
>> crashkernel before killing itself?
>> According to my tests on SLES 11, when another node kills the unresponsive
>> one, the crashkernel is invoked and a dump is present in /var/crash, but if the
>> node gets stuck for some reason (naughty admin), there is no sign of a crash
>> (checked on the iLO to be sure).
Can't help with SLES specifics here, but the difference between
the 2 cases you describe is probably that in one case sbd-daemon
is still alive enough to call a reboot, write to sysrq-trigger, or do whatever
else is configured. (Are you using poison-pill? You can configure what should
happen when sbd-daemon triggers the timeout-action; with current
sbd this works consistently as long as sbd-daemon is alive.)
In the other case it is probably a hardware watchdog kicking in, which
typically resets the machine without giving the kernel a chance to write a dump.
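
To make the first case concrete, here is a minimal Python sketch (not part of sbd) that checks the two preconditions for getting a dump when sbd-daemon itself acts: the timeout-action must request a crashdump, and a crash kernel must be reserved on the kernel command line. It assumes an sbd version that reads SBD_TIMEOUT_ACTION from /etc/sysconfig/sbd; whether your SLES release ships such a version is an assumption you would need to verify. The function names are made up for illustration.

```python
import shlex


def parse_sysconfig(text):
    """Parse KEY=value lines of a shell-style sysconfig file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex strips surrounding quotes and trailing "# ..." comments.
        parts = shlex.split(raw, comments=True)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def crashdump_possible(sysconfig_text, cmdline_text):
    """Return True if sbd is asked to crashdump AND a crash kernel is reserved.

    Assumption: SBD_TIMEOUT_ACTION holds a comma-separated list such as
    "flush,crashdump" or "flush,reboot". This says nothing about the
    hardware-watchdog case, where the daemon never gets to act.
    """
    action = parse_sysconfig(sysconfig_text).get("SBD_TIMEOUT_ACTION", "")
    wants_crashdump = "crashdump" in action.split(",")
    has_crashkernel = any(
        arg.startswith("crashkernel=") for arg in cmdline_text.split()
    )
    return wants_crashdump and has_crashkernel
```

On a real node you would pass in the contents of /etc/sysconfig/sbd and /proc/cmdline. A False result means that even a healthy sbd-daemon would reboot without leaving anything in /var/crash.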

Regards,
Klaus


>> I'm not sure if this behaviour is the same on newer software versions (SLES
>> 12/15) and whether I can work around it, as we still struggle to find the reason
>> why our clusters fence in a very specific situation (the clusters are using
>> MDADM raid1-s in a dual-DC environment instead of SAN replication) where the
>> remote DC is unavailable for 20-30s until SAN/network is rerouted. We have
>> enabled crashdump on some of the systems, but we are pending a reboot and
>> then a real DC<->DC connectivity outage to gather valuable info. Corosync is
>> using dual rings and is not affected, SBD is using survive on pacemaker, and
>> we suspect that the nodes suicide.
>> Best Regards,
>> Strahil Nikolov
> So basically you want to know why your node is fenced? I couldn't quite understand the environment you set up, nor what types of problems you are seeing.
> Actually, in times of many gigabytes of RAM I see little sense in crash dumps, because they just take a lot of time to complete.
>
> Regards,
> Ulrich