[ClusterLabs] SBD fencing and crashkernel question

Mon Oct 21 06:09:35 EDT 2019

On 10/20/19 7:03 AM, Strahil Nikolov wrote:
> Hello Community,
> 
> I have a question about the stack in newer versions compared to our SLES 
> 11 openais stack.
> Can someone clarify whether a node with SBD will invoke the crashkernel 
> before killing itself?
> 
> According to my tests on SLES 11, when another node kills the 
> unresponsive one, the crashkernel is invoked and a dump is present in 
> /var/crash, but if the node gets stuck for some reason (naughty admin), 
> there is no sign of a crash (checked on the iLO to be sure).
> 

"crashdump" is an SBD option that has to be configured explicitly.

See `man sbd` for the "-r" option, or set "SBD_TIMEOUT_ACTION" in 
/etc/sysconfig/sbd.
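
As a rough sketch of what that could look like (the value below is an 
assumption based on the option's documented form, a comma-separated 
combination of noflush|flush plus reboot|crashdump|off; please check 
`man sbd` on your release, since older sbd versions may not support it):

```shell
# /etc/sysconfig/sbd (fragment; illustrative, not a complete file)
# Take a crash dump instead of a plain reboot when the sbd timeout action fires.
# Assumes kdump is already set up (crashkernel= reserved at boot); otherwise
# no dump will be written.
SBD_TIMEOUT_ACTION=flush,crashdump
```

As far as I know the setting is only read when sbd starts, so it takes 
effect after the cluster stack is restarted on that node.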

> I'm not sure if this behaviour is the same on newer software versions 
> (SLES 12/15) and if I can work around it, as we still struggle to find 
> the reason why our clusters fence in a very specific situation (the 
> clusters are using MDADM raid1-s on a dual-DC environment instead of SAN 
> replication) where the remote DC is unavailable for 20-30s until 
> SAN/Network is rerouted.

Do you mean cluster-md raid1 here?
If so, you might refer to page 18 of [1](
https://github.com/zzhou1/ks/blob/master/2018-06.%E5%BB%B6%E4%BC%B8Linux%E5%85%B3%E9%94%AE%E4%B8%9A%E5%8A%A1%E5%88%B0%E5%8F%8C%E6%B4%BBNVMe-oF%E5%AD%98%E5%82%A8.OpenInfra18.v8.pdf)

> We have enabled crashdump on some of the systems, but we 
> are still waiting for a reboot and then a real DC<->DC connectivity 
> outage to gather valuable info. Corosync is using dual rings and is not 
> affected, SBD is using survive on pacemaker, and we suspect that the 
> nodes self-fence.
> 

I can't quite follow all of this; you might want to rephrase it with a 
bit more detail.

Cheers,
Roger

> Best Regards,
> Strahil Nikolov