[ClusterLabs] RES: STONITH on a vmWare server I dont have control

Carlos Xavier cbastos at connection.com.br
Fri May 29 16:37:39 EDT 2015


Thank you very much for the fast answer.

> From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de]
> > 
> > After this I rebooted the host and ended with a infinite reboot :-(
> Did you inspect the logs for the reason? Most likely everybody else knows less about the reason than your logs...

Five days after I lost control of the machine, it came back to life and I could check the logs.
It seems that the machine hung on some pending process during shutdown, from 05/21 to 05/27.

2015-05-21T16:17:24.079661-03:00 apolo systemd[1]: Unmounting /var/lib/ntp/proc...
2015-05-21T16:17:24.089312-03:00 apolo systemd[1]: Deactivating swap /dev/dm-6...
2015-05-27T14:01:45.736854-03:00 apolo rsyslogd: [origin software="rsyslogd" swVersion="7.2.7" x-pid="1020" x-info="http://www.rsyslog.com"] start
2015-05-27T14:01:45.736905-03:00 apolo systemd-modules-load[385]: Inserted module 'softdog'
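The length of the gap between the last shutdown message and the first boot message can be worked out from the two timestamps above. A minimal Python sketch, assuming Python 3.7 or later (for datetime.fromisoformat); the variable names are only illustrative:

```python
from datetime import datetime

# Timestamps copied from the log excerpt above:
# the last message before the hang and the first message after the reboot.
last_before_hang = datetime.fromisoformat("2015-05-21T16:17:24.089312-03:00")
first_after_boot = datetime.fromisoformat("2015-05-27T14:01:45.736854-03:00")

# Subtracting two timezone-aware datetimes gives a timedelta.
gap = first_after_boot - last_before_hang
print("No log entries for:", gap)
```

This only measures how long nothing was logged; it does not say what the machine was waiting on.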

> From: Kai Dupke [mailto:kdupke at suse.com]
> > Can someone please shade some light on this issue?
> SBD needs a shared storage, even if it is a virtual one like an VMware file.
> Please make sure all nodes using the same virtual disk, the cache for this disk is disabled and the
> disk controller is set to sharable, too.

Yes, the nodes are using the same shared disk, /dev/sdc. A first partition of 1 MB was created on this disk to be used as the sbd
device. Below is a dump of how it is configured:

apolo:~ # sbd -d /dev/sdc1 dump
==Dumping header on disk /dev/sdc1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/sdc1 is dumped
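The timeouts in this header can be checked against each other mechanically. The sketch below parses the dump text and applies the commonly cited rule of thumb that msgwait should be at least twice the watchdog timeout; that rule is an assumption here, not something stated in this thread, and parse_sbd_dump is a hypothetical helper, not part of sbd:

```python
import re

# Output of "sbd -d /dev/sdc1 dump", copied from above.
DUMP = """\
==Dumping header on disk /dev/sdc1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10
==Header on disk /dev/sdc1 is dumped
"""

def parse_sbd_dump(text):
    """Return {field name: integer value} for each 'name : value' line."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?)\s*:\s*(\d+)\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

header = parse_sbd_dump(DUMP)
watchdog = header["Timeout (watchdog)"]
msgwait = header["Timeout (msgwait)"]

# Assumed rule of thumb: msgwait >= 2 * watchdog.
timeouts_ok = msgwait >= 2 * watchdog
print("watchdog:", watchdog, "msgwait:", msgwait, "ok:", timeouts_ok)
```

With the values shown (5 and 10) the check passes, so under that assumption the timeouts themselves do not look like the problem.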

primitive stonith_sbd stonith:external/sbd \
        meta target-role="Started"
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1432898438" \
        maintenance-mode="false"
rsc_defaults $id="rsc-options" \

Since the server was up and running again, I could force STONITH to trigger by killing the "sbd: inquisitor" process. However,
if we take a look at the logs, we can see some weird messages:

2015-05-27T14:01:50.171531-03:00 apolo sbd: [2491]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.206550-03:00 apolo sbd: [2494]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.207960-03:00 apolo sbd: [2494]: info: Starting servant for device /dev/sdc1
2015-05-27T14:01:50.208949-03:00 apolo sbd: [2496]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.210039-03:00 apolo sbd: [2496]: info: Servant starting for device /dev/sdc1
2015-05-27T14:01:50.211388-03:00 apolo sbd: [2496]: info: apolo owns slot 0
2015-05-27T14:01:50.212257-03:00 apolo sbd: [2496]: info: Monitoring slot 0 on disk /dev/sdc1
2015-05-27T14:01:50.942909-03:00 apolo kernel: [   19.092844] hpet_rtc_timer_reinit: 54 callbacks suppressed
2015-05-27T14:01:50.942973-03:00 apolo kernel: [   19.092844] hpet1: lost 6 rtc interrupts
2015-05-27T14:01:51.629181-03:00 apolo sbd: [2494]: notice: Using watchdog device: /dev/watchdog
2015-05-27T14:01:51.629697-03:00 apolo sbd: [2494]: info: Set watchdog timeout to 5 seconds.

2015-05-27T14:01:59.956700-03:00 apolo lrmd: [2542]: info: rsc:stonith_sbd probe[5] (pid 2939)
2015-05-27T14:01:59.957179-03:00 apolo cib: [2540]: debug: acl_enabled: CIB ACL is disabled
2015-05-27T14:01:59.966243-03:00 apolo cib: last message repeated 2 times
2015-05-27T14:01:59.966096-03:00 apolo lrm-stonith: [2939]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on:
2015-05-27T14:01:59.966663-03:00 apolo lrm-stonith: [2939]: debug: get_stonith_token: Obtained registration token:
2015-05-27T14:01:59.967191-03:00 apolo lrm-stonith: [2939]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on:
2015-05-27T14:01:59.967676-03:00 apolo lrm-stonith: [2939]: debug: get_stonith_token: Obtained registration token:
2015-05-27T14:01:59.968156-03:00 apolo lrm-stonith: [2939]: debug: stonith_api_signon: Connection to STONITH successful
2015-05-27T14:01:59.968582-03:00 apolo stonith-ng: [2541]: debug: stonith_command: Processing register from lrmd (               0)
2015-05-27T14:01:59.969050-03:00 apolo stonith-ng: [2541]: debug: stonith_command: Processing st_execute from lrmd (
2015-05-27T14:01:59.969482-03:00 apolo stonith-ng: [2541]: notice: stonith_device_action: Device stonith_sbd not found
2015-05-27T14:01:59.969998-03:00 apolo stonith-ng: [2541]: info: stonith_command: Processed st_execute from lrmd: rc=-12
2015-05-27T14:01:59.970432-03:00 apolo lrm-stonith: [2939]: debug: execra: stonith_sbd_monitor returned -12
2015-05-27T14:01:59.970899-03:00 apolo lrm-stonith: [2939]: debug: stonith_api_signoff: Signing out of the STONITH Service
2015-05-27T14:01:59.971376-03:00 apolo lrmd: [2542]: WARN: Managed stonith_sbd:monitor process 2939 exited with return code 7.
2015-05-27T14:01:59.971810-03:00 apolo lrmd: [2542]: info: operation monitor[5] on stonith_sbd for client 2545: pid 2939 exited with return code 7
2015-05-27T14:01:59.972308-03:00 apolo crmd: [2545]: debug: create_operation_update: do_update_resource: Updating resouce stonith_sbd after complete monitor op (interval=0)
2015-05-27T14:02:00.029149-03:00 apolo crmd: [2545]: info: process_lrm_event: LRM operation stonith_sbd_monitor_0 (call=5, rc=7, cib-update=8, confirmed=true) not running
2015-05-27T14:02:00.029918-03:00 apolo crmd: [2545]: debug: update_history_cache: Appending monitor op to history for 'stonith_sbd'

