[ClusterLabs] RES: STONITH on a vmWare server I dont have control
Carlos Xavier
cbastos at connection.com.br
Fri May 29 20:37:39 UTC 2015
Hi.
Thank you very much for the fast answer.
> From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de]
> >
> > After this I rebooted the host and ended with a infinite reboot :-(
>
> Did you inspect the logs for the reason? Most likely everybody else knows less about the reason than your logs...
> >
Five days after I lost control of the machine, it came back to life and I could check the logs.
It seems that the machine was hung on some pending process from 05/21 to 05/27:
2015-05-21T16:17:24.079661-03:00 apolo systemd[1]: Unmounting /var/lib/ntp/proc...
2015-05-21T16:17:24.089312-03:00 apolo systemd[1]: Deactivating swap /dev/dm-6...
2015-05-27T14:01:45.736854-03:00 apolo rsyslogd: [origin software="rsyslogd" swVersion="7.2.7" x-pid="1020" x-info="http://www.rsyslog.com"] start
2015-05-27T14:01:45.736905-03:00 apolo systemd-modules-load[385]: Inserted module 'softdog'
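To put a number on that silence, a small sketch like the one below could scan a syslog excerpt and report the largest pause between consecutive timestamps. It assumes every entry starts with an ISO 8601 timestamp as in the lines above; the function name `largest_gap` is only illustrative.

```python
from datetime import datetime


def largest_gap(lines):
    """Return (gap, before, after) for the longest pause between timestamped lines.

    Lines whose first token is not an ISO 8601 timestamp (for example wrapped
    continuation lines) are skipped. Returns None if fewer than two
    timestamps are found.
    """
    stamps = []
    for line in lines:
        token = line.split(" ", 1)[0]
        try:
            stamps.append(datetime.fromisoformat(token))
        except ValueError:
            continue  # not a timestamped entry
    best = None
    for prev, cur in zip(stamps, stamps[1:]):
        gap = cur - prev
        if best is None or gap > best[0]:
            best = (gap, prev, cur)
    return best


# Excerpt taken from the log lines quoted above.
sample = [
    "2015-05-21T16:17:24.079661-03:00 apolo systemd[1]: Unmounting /var/lib/ntp/proc...",
    "2015-05-21T16:17:24.089312-03:00 apolo systemd[1]: Deactivating swap /dev/dm-6...",
    "2015-05-27T14:01:45.736905-03:00 apolo systemd-modules-load[385]: Inserted module 'softdog'",
]

gap, before, after = largest_gap(sample)
print(f"No log entries for {gap} (from {before} to {after})")
```

On a full log file the same function could be fed `open(path)` directly, since it only looks at the first token of each line.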
> From: Kai Dupke [mailto:kdupke at suse.com]
> > Can someone please shade some light on this issue?
>
> SBD needs a shared storage, even if it is a virtual one like an VMware file.
>
> Please make sure all nodes using the same virtual disk, the cache for this disk is disabled and the
> disk controller is set to sharable, too.
>
Yes, the nodes are using the same shared disk, /dev/sdc. A first partition of 1 MB was created on this disk to be used as the sbd
device. Below is the output showing how it is configured:
apolo:~ # sbd -d /dev/sdc1 dump
==Dumping header on disk /dev/sdc1
Header version : 2
Number of slots : 255
Sector size : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 10
==Header on disk /dev/sdc1 is dumped
primitive stonith_sbd stonith:external/sbd \
    meta target-role="Started"
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-61a079313275f3e9d0e85671f62c721d32ce3563" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="true" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1432898438" \
    maintenance-mode="false" \
    stonith-timeout="30s"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
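As a sanity check on these numbers, the sketch below parses the timeouts from the dump and compares them with stonith-timeout. The rule of thumb it applies (msgwait at least twice the watchdog timeout, stonith-timeout larger than msgwait) is an assumption taken from common SBD guidance, not something stated in this thread, and the function names are only illustrative.

```python
import re

# Output of "sbd -d /dev/sdc1 dump" as shown above.
DUMP = """\
==Dumping header on disk /dev/sdc1
Header version : 2
Number of slots : 255
Sector size : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop) : 1
Timeout (msgwait) : 10
==Header on disk /dev/sdc1 is dumped
"""


def parse_sbd_timeouts(dump_text):
    """Extract the 'Timeout (name) : value' pairs from sbd dump output."""
    pattern = re.compile(r"Timeout \((\w+)\)\s*:\s*(\d+)")
    return {name: int(value) for name, value in pattern.findall(dump_text)}


def check_timeouts(timeouts, stonith_timeout):
    """Return a list of rule-of-thumb violations (empty if none are found)."""
    problems = []
    if timeouts["msgwait"] < 2 * timeouts["watchdog"]:
        problems.append("msgwait is less than twice the watchdog timeout")
    if stonith_timeout <= timeouts["msgwait"]:
        problems.append("stonith-timeout does not exceed msgwait")
    return problems


timeouts = parse_sbd_timeouts(DUMP)
problems = check_timeouts(timeouts, stonith_timeout=30)  # stonith-timeout="30s"
print(timeouts)
print(problems or "timeouts look consistent with the rule of thumb")
```

With the values above the check reports no violations, so under that assumption the timeouts themselves do not look like the source of the problem.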
Since the server was up and running again, I could trigger STONITH by killing the "sbd: inquisitor" process. However,
if we take a look at the logs we can see some weird messages:
2015-05-27T14:01:50.171531-03:00 apolo sbd: [2491]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.206550-03:00 apolo sbd: [2494]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.207960-03:00 apolo sbd: [2494]: info: Starting servant for device /dev/sdc1
2015-05-27T14:01:50.208949-03:00 apolo sbd: [2496]: ERROR: Unable to set scheduler parameters.: Operation not permitted
2015-05-27T14:01:50.210039-03:00 apolo sbd: [2496]: info: Servant starting for device /dev/sdc1
2015-05-27T14:01:50.211388-03:00 apolo sbd: [2496]: info: apolo owns slot 0
2015-05-27T14:01:50.212257-03:00 apolo sbd: [2496]: info: Monitoring slot 0 on disk /dev/sdc1
2015-05-27T14:01:50.942909-03:00 apolo kernel: [ 19.092844] hpet_rtc_timer_reinit: 54 callbacks suppressed
2015-05-27T14:01:50.942973-03:00 apolo kernel: [ 19.092844] hpet1: lost 6 rtc interrupts
2015-05-27T14:01:51.629181-03:00 apolo sbd: [2494]: notice: Using watchdog device: /dev/watchdog
2015-05-27T14:01:51.629697-03:00 apolo sbd: [2494]: info: Set watchdog timeout to 5 seconds.
2015-05-27T14:01:59.956700-03:00 apolo lrmd: [2542]: info: rsc:stonith_sbd probe[5] (pid 2939)
2015-05-27T14:01:59.957179-03:00 apolo cib: [2540]: debug: acl_enabled: CIB ACL is disabled
2015-05-27T14:01:59.966243-03:00 apolo cib: last message repeated 2 times
2015-05-27T14:01:59.966096-03:00 apolo lrm-stonith: [2939]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_command
2015-05-27T14:01:59.966663-03:00 apolo lrm-stonith: [2939]: debug: get_stonith_token: Obtained registration token: 2f281f50-22b8-4d31-961d-32f9782c9cbc
2015-05-27T14:01:59.967191-03:00 apolo lrm-stonith: [2939]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/st_callback
2015-05-27T14:01:59.967676-03:00 apolo lrm-stonith: [2939]: debug: get_stonith_token: Obtained registration token: d9ae0e91-5ad8-4ff2-8f33-414e4ad83a83
2015-05-27T14:01:59.968156-03:00 apolo lrm-stonith: [2939]: debug: stonith_api_signon: Connection to STONITH successful
2015-05-27T14:01:59.968582-03:00 apolo stonith-ng: [2541]: debug: stonith_command: Processing register from lrmd ( 0)
2015-05-27T14:01:59.969050-03:00 apolo stonith-ng: [2541]: debug: stonith_command: Processing st_execute from lrmd ( 1000)
2015-05-27T14:01:59.969482-03:00 apolo stonith-ng: [2541]: notice: stonith_device_action: Device stonith_sbd not found
2015-05-27T14:01:59.969998-03:00 apolo stonith-ng: [2541]: info: stonith_command: Processed st_execute from lrmd: rc=-12
2015-05-27T14:01:59.970432-03:00 apolo lrm-stonith: [2939]: debug: execra: stonith_sbd_monitor returned -12
2015-05-27T14:01:59.970899-03:00 apolo lrm-stonith: [2939]: debug: stonith_api_signoff: Signing out of the STONITH Service
2015-05-27T14:01:59.971376-03:00 apolo lrmd: [2542]: WARN: Managed stonith_sbd:monitor process 2939 exited with return code 7.
2015-05-27T14:01:59.971810-03:00 apolo lrmd: [2542]: info: operation monitor[5] on stonith_sbd for client 2545: pid 2939 exited with return code 7
2015-05-27T14:01:59.972308-03:00 apolo crmd: [2545]: debug: create_operation_update: do_update_resource: Updating resouce stonith_sbd after complete monitor op (interval=0)
2015-05-27T14:02:00.029149-03:00 apolo crmd: [2545]: info: process_lrm_event: LRM operation stonith_sbd_monitor_0 (call=5, rc=7, cib-update=8, confirmed=true) not running
2015-05-27T14:02:00.029918-03:00 apolo crmd: [2545]: debug: update_history_cache: Appending monitor op to history for 'stonith_sbd'
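Most of the excerpt above is debug and info output. To pull out only the entries that stand out, a rough filter like the following could be used. It assumes the "timestamp host process: [pid]: LEVEL: message" layout seen in these lines; the names `ENTRY`, `NOISY` and `interesting` are only illustrative.

```python
import re

# timestamp host process: [pid]: LEVEL: message
ENTRY = re.compile(r"^(\S+) (\S+) ([^:]+): \[(\d+)\]: (\w+): (.*)$")

# Severities treated as routine noise in this sketch.
NOISY = {"debug", "info"}


def interesting(lines):
    """Yield (timestamp, process, level, message) for entries above info level.

    Lines that do not match the expected layout (kernel messages, wrapped
    continuation lines) are ignored.
    """
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match is None:
            continue
        stamp, _host, process, _pid, level, message = match.groups()
        if level.lower() not in NOISY:
            yield stamp, process, level, message


# Three entries copied from the log above.
sample = [
    "2015-05-27T14:01:50.171531-03:00 apolo sbd: [2491]: ERROR: Unable to set scheduler parameters.: Operation not permitted",
    "2015-05-27T14:01:50.211388-03:00 apolo sbd: [2496]: info: apolo owns slot 0",
    "2015-05-27T14:01:59.971376-03:00 apolo lrmd: [2542]: WARN: Managed stonith_sbd:monitor process 2939 exited with return code 7.",
]

found = list(interesting(sample))
for stamp, process, level, message in found:
    print(f"{stamp} {process} {level}: {message}")
```

Applied to the whole excerpt, this would leave the sbd scheduler errors, the stonith-ng notice about the device not being found, and the lrmd warning about the monitor operation.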
Regards,
Carlos.