[ClusterLabs] Stonith

Mon Mar 20 12:44:42 UTC 2017

Hello guys,

It looks like I'm missing something obvious, but I just don't
understand what happened.

I've got a number of stonith-enabled clusters running on my big POWER
boxes. My stonith devices are two HMCs (Hardware Management Consoles),
one per datacenter. These are separate servers from IBM that can reboot
individual LPARs (logical partitions) within the POWER boxes.

So my definition for stonith devices was pretty straightforward:

primitive st_dc2_hmc stonith:ibmhmc \
params ipaddr=10.1.2.9
primitive st_dc1_hmc stonith:ibmhmc \
params ipaddr=10.1.2.8
clone cl_st_dc2_hmc st_dc2_hmc
clone cl_st_dc1_hmc st_dc1_hmc

Everything was OK when we tested failover. But today, during a power
outage, we lost one DC completely. Shortly after that, the cluster
simply hung while trying to reboot the node that was no longer there.
No failover occurred. The lost node was marked OFFLINE UNCLEAN, and its
resources were marked "Started UNCLEAN" on that node.
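
To make my suspicion concrete, here is a toy Python model of the
situation. It is not Pacemaker code, only a sketch of the rule that a
lost node's resources are recovered only after fencing is confirmed. It
assumes that each HMC can fence only the LPARs in its own datacenter and
becomes unreachable when that datacenter loses power. The names
FenceDevice, can_fence and node_state_after_loss, and the choice of
"dc2" as the lost datacenter, are made up for illustration.

```python
# Toy model only: NOT Pacemaker code, just the decision rule described above.
from dataclasses import dataclass


@dataclass
class FenceDevice:
    name: str
    datacenter: str  # datacenter in which the HMC itself lives

    def can_fence(self, target_dc: str, lost_dcs: set) -> bool:
        # Assumption: an HMC manages only LPARs in its own datacenter
        # and is unreachable when that datacenter is down.
        return self.datacenter == target_dc and self.datacenter not in lost_dcs


def node_state_after_loss(node_dc: str, devices: list, lost_dcs: set) -> str:
    """Return the state a lost node ends up in under this toy model."""
    fenced = any(d.can_fence(node_dc, lost_dcs) for d in devices)
    # Resources are recovered only once fencing has been confirmed.
    if fenced:
        return "OFFLINE (fenced, resources recovered)"
    return "OFFLINE UNCLEAN (no failover)"


devices = [FenceDevice("st_dc1_hmc", "dc1"), FenceDevice("st_dc2_hmc", "dc2")]

# Failover test: a node in dc2 dies, but the dc2 HMC is still reachable.
print(node_state_after_loss("dc2", devices, lost_dcs=set()))

# Power outage: all of dc2 is gone, including its HMC (hypothetical choice).
print(node_state_after_loss("dc2", devices, lost_dcs={"dc2"}))
```

Under these assumptions the failover test succeeds, while the full
outage leaves the node UNCLEAN, which matches what I saw.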

UNCLEAN seems to indicate a problem with the stonith configuration. So
my question is: how can I avoid this behaviour?

Thank you!

-- 
Regards,
Alexander