[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
kgaillot at redhat.com
Mon May 18 12:33:55 EDT 2015
On 05/18/2015 11:08 AM, Shaheedur Haque (shahhaqu) wrote:
> Thanks for clarifying.
> Is it a fair assumption that not much has changed on-the-wire between the components from 1.1.10 to 1.1.12/13, and that I could just replace the crmd that comes with 14.04 with one I built locally? If so, I could test that myself and then report back here or to Canonical as needed. (Changing more than one binary is clearly also an option, so if changing more/all is needed, that would be good to know too. Apologies for sounding like a noob; this is all very new to me.)
The components need to be from the same source code revision; the
internal APIs do change. 1.1.13rc3 just came out and is a good choice if
you're compiling from scratch.
If you want to get the blackbox utility working, compile corosync too,
otherwise you can set PCMK_debug=yes (in /etc/sysconfig/pacemaker for
rpm distros, maybe in /etc/default/pacemaker for ubuntu?) to get a very
I personally haven't seen an issue like this. Seeing your config might help.
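While gathering logs, it can help to confirm mechanically that each timeout refers to a call that stonith-ng had already reported as complete. The following is only a rough sketch, not part of Pacemaker: it assumes the syslog line format shown in the excerpt quoted below, a single crmd client (so call ids are unique within the log), and that the log does not cross a year boundary. The names `RESULT`, `TIMEOUT`, `parse_ts` and `late_timeouts` are made up for illustration.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt quoted later in this thread.
LOG = """\
May 8 18:28:03 octl-03 stonith-ng: notice: log_operation: Operation 'reboot'  (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
May 8 18:30:15 octl-03 crmd: error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
"""

STAMP = r"(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)"
# stonith-ng reporting the result of a fencing operation requested by crmd
RESULT = re.compile(
    STAMP + r".*stonith-ng:.*\(call (?P<call>\d+) from crmd\.\d+\)"
    r".*returned: (?P<rc>-?\d+)"
)
# crmd reporting that its own timer for the same call expired
TIMEOUT = re.compile(
    STAMP + r".*crmd:.*stonith_async_timeout_handler: "
    r"Async call (?P<call>\d+) timed out after (?P<ms>\d+)ms"
)


def parse_ts(text, year=2015):
    # Syslog timestamps carry no year, so one is assumed here.
    return datetime.strptime(f"{year} {' '.join(text.split())}", "%Y %b %d %H:%M:%S")


def late_timeouts(lines):
    """Return (call id, return code, seconds between result and timeout,
    timeout in ms) for every timeout that fired after a logged result."""
    results = {}  # call id -> (timestamp of result, return code)
    found = []
    for line in lines:
        m = RESULT.search(line)
        if m:
            results[m["call"]] = (parse_ts(m["ts"]), int(m["rc"]))
            continue
        m = TIMEOUT.search(line)
        if m and m["call"] in results:
            done_at, rc = results[m["call"]]
            gap = (parse_ts(m["ts"]) - done_at).total_seconds()
            found.append((int(m["call"]), rc, gap, int(m["ms"])))
    return found


for call, rc, gap, ms in late_timeouts(LOG.splitlines()):
    print(f"call {call}: rc={rc}, timeout fired {gap:.0f}s after the result ({ms}ms timer)")
```

If the gap always equals the full timer length, that would be consistent with the timer never being cancelled when the result arrives, though the script by itself cannot prove where the result was lost.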
> -----Original Message-----
> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
> Sent: 18 May 2015 15:25
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
> On Mon, May 18, 2015 at 08:58:30AM +0000, Shaheedur Haque (shahhaqu) wrote:
>> I'm a little uncertain what the suggestion is; if this sounds like a
>> bug that has been fixed, then presumably it would be best if I
>> could point Canonical to the upstream fix (all I could come up with
>> was dbbb6a6, in possibly the same area, but to my untrained eye, it is
>> hard to guess if this could be the same thing). If it is thought to be
>> a new bug, then presumably I am better off working with upstream?
> I'm also not sure if it's a new bug. It's just that there were a number of changes since 1.1.10.
>> Either way, if a new bug is needed, it seems I should start with a bug here...
> Well, I'd suggest the other way around, as the Ubuntu maintainers should know better how to handle it, where to check if it's a new bug or not, etc. Though it is of course fine to inquire here.
>> -----Original Message-----
>> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
>> Sent: 15 May 2015 19:00
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] STONITH error:
>> stonith_async_timeout_handler despite successful fence
>> On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
>>> I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
>>> - The device responds "OK" to the reboot.
>>> - 132s later, crmd complains about the timeout.
>> It seems you can open a bug report for this. It could be, however, that the bug has already been fixed in the meantime, so it is best to file the bug with Ubuntu.
>>> I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
>>> Thanks, Shaheed
>>> -----Original Message-----
>>> From: Shaheedur Haque (shahhaqu)
>>> Sent: 09 May 2015 07:23
>>> To: users at clusterlabs.org
>>> Subject: RE: STONITH error: stonith_async_timeout_handler despite
>>> successful fence
>>> I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
>>> May 8 18:28:03 octl-03 stonith-ng: notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
>>> May 8 18:28:03 octl-03 stonith-ng: notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
>>> May 8 18:28:03 octl-03 stonith-ng: notice: log_operation: Operation 'reboot'  (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
>>> May 8 18:28:03 octl-03 stonith: : info: Host null-reset: octl-01
>>> May 8 18:30:15 octl-03 crmd: error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
>>> May 8 18:30:15 octl-03 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>> May 8 18:30:15 octl-03 crmd: notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
>>> May 8 18:30:15 octl-03 crmd: notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
>>> May 8 18:30:15 octl-03 crmd: notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
>>> May 8 18:30:15 octl-03 crmd: notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
>>> Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue.
>>> FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
>>> Thanks, Shaheed