[ClusterLabs] SBD Watchdog Question

Thu Mar 14 17:53:32 EDT 2019

Hi,

I'm testing a two-node cluster using SBD 1.4.0, Corosync 2.4.2, and
Pacemaker 1.1.16. For testing, I have one shared block storage device
between the two nodes, and each node has an "IPMI watchdog" device
available at '/dev/watchdog'.

Up until today, I've been testing the SBD fencing functionality with
"stonith-action" set to the default value of "reboot", and this has
worked well so far.

Then I wanted to test using "stonith-action" set to "off" to power off
the node when fencing, instead of rebooting it. I set this, then used
stonith_admin to fence one of the nodes, and I was surprised that it
did not turn off...

I see this in the SBD daemon logs:
--snip--
Mar 14 19:38:02 testnode-2 sbd[2114]:
/dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504:   notice:
servant: Received command off from testnode-1 on disk
/dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504
Mar 14 19:38:02 testnode-2 sbd[2108]:  warning: inquisitor_child:
/dev/disk/by-id/nvme-eui.0000000000000005000cca0b01592504 requested a
shutoff
Mar 14 19:38:02 testnode-2 sbd[2108]:    emerg: do_exit: Rebooting system: off
Mar 14 19:38:02 testnode-2 sbd[2108]:     info: sysrq_trigger: sysrq-trigger: o
Mar 14 19:38:12 testnode-2 sbd[2116]:       pcmk:    error:
crm_ipc_read: Connection to cib_ro failed
Mar 14 19:38:12 testnode-2 sbd[2116]:       pcmk:    error:
mainloop_gio_callback: Connection to cib_ro[0x1272fa0] closed (I/O
condition=1)
Mar 14 19:38:12 testnode-2 sbd[2116]:       pcmk:  warning:
set_servant_health: Disconnected from CIB
Mar 14 19:38:13 testnode-2 sbd[2116]:       pcmk:  warning:
mon_timer_reconnect: CIB reconnect failed: -107
Mar 14 19:38:14 testnode-2 sbd[2116]:       pcmk:  warning:
set_servant_health: Node state: pending
Mar 14 19:38:14 testnode-2 sbd[2116]:       pcmk:  warning:
mon_timer_reconnect: CIB reconnect failed: -107
--snip--

But writing 'o' to '/proc/sysrq-trigger' didn't power off the node... I
can see the attempt in the kernel logs:
[ 1902.935589] sysrq: SysRq : Power Off

But it just didn't go... it turns out there is a bug in another driver,
and the task is hung, which is why using 'o' doesn't work correctly on
this system. That's another issue for me to work through...
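
For reference, the sysrq step that sbd logged amounts to roughly the
following. This is only a minimal Python sketch of the mechanism, not
sbd's actual code; the function name and the "trigger_path" parameter
are made up for illustration.

```python
def request_sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the kernel's trigger file.

    'o' asks the kernel to power off, 'b' asks for an immediate reboot.
    The write only submits the request; as seen above, it returning
    does not mean the machine actually went down.
    """
    if len(command) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


# On a real node (as root) this would request a power off:
#   request_sysrq("o")
```

Pointing "trigger_path" at a scratch file makes it possible to try the
sketch without touching the running system.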

My question is: Why didn't the watchdog device reboot the system? The
"off" operation didn't work, so I was expecting the watchdog to stop
getting poked, time out, and then reset the node.

I saw this message at the point where the IPMI watchdog device appears
to have been closed (by sbd, I assume):
[ 1902.933851] IPMI Watchdog: Unexpected close, not stopping watchdog!
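
From what I've read so far, that message relates to the "magic close"
convention in the Linux watchdog API: if the character 'V' is written
just before the device is closed (and "nowayout" is not set), the
driver disarms the timer; otherwise it treats the close as unexpected
and leaves the timer running. Below is a minimal Python sketch of the
two cases. The function and parameter names are my own, and I have not
checked how sbd itself closes the device, so treat it as an
illustration of the API only.

```python
import os

MAGIC_CLOSE = b"V"  # watchdog API: written before close() to disarm


def pet_then_close(device_path="/dev/watchdog", disarm=False):
    """Open the watchdog, send one keepalive, then close the device.

    With disarm=True the magic character is written before closing, so
    the driver may stop the timer. With disarm=False the close is
    "unexpected" and the timer is expected to keep counting down.

    WARNING: on a real watchdog device, disarm=False will normally end
    in a reset once the timeout expires.
    """
    fd = os.open(device_path, os.O_WRONLY)
    try:
        os.write(fd, b"\0")  # any write counts as a keepalive
        if disarm:
            os.write(fd, MAGIC_CLOSE)
    finally:
        os.close(fd)
```

If that reading is right, the "not stopping watchdog" message would
mean the timer was still armed after the close, which is why I expected
a reset.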

I'm still reading up on watchdog devices, but I guess I'm looking for
guidance to focus my search: Should the node have been reset via the
watchdog device? Is that the expected behavior? Or is the watchdog
device not expected to reset the system in this scenario?

(Note: I confirmed the watchdog device does work by using 'sbd test-watchdog'.)

Any help or tips would be greatly appreciated.

Thanks,

Marc