[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
andrew at beekhof.net
Wed May 20 06:01:06 CEST 2015
> On 19 May 2015, at 2:33 am, Ken Gaillot <kgaillot at redhat.com> wrote:
> On 05/18/2015 11:08 AM, Shaheedur Haque (shahhaqu) wrote:
>> Thanks for clarifying.
>> Is it a fair assumption that not much has changed on-the-wire between the components from 1.1.10 to 1.1.12/13, and that I could just replace the crmd that comes with 14.04 with one I built locally? If so, I could test that myself and report back here or to Canonical as needed (changing more than one binary is clearly also an option, so if changing more/all is needed, that would be good to know too... apologies for sounding like a noob, this is all very new to me).
> The components need to be from the same source code revision; the
> internal APIs do change. 1.1.13rc3 just came out and is a good choice if
> you're compiling from scratch.
> If you want to get the blackbox utility working, compile corosync too,
> otherwise you can set PCMK_debug=yes
Never do that unless you’ve exhausted all other options; you’ll fill your disks faster than you can type rm -rf.
http://blog.clusterlabs.org/blog/2013/pacemaker-logging/ has some details on other options to try first.
> (in /etc/sysconfig/pacemaker for
> rpm distros, maybe in /etc/default/pacemaker for ubuntu?) to get a very
> noisy /var/log/pacemaker.log.
> I personally haven't seen an issue like this. Seeing your config might help.
>> -----Original Message-----
>> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
>> Sent: 18 May 2015 15:25
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> Subject: Re: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence
>> On Mon, May 18, 2015 at 08:58:30AM +0000, Shaheedur Haque (shahhaqu) wrote:
>>> I'm a little uncertain what the suggestion is; if this sounds like a
>>> bug that has been fixed, then presumably it would be best if I
>>> could point Canonical to the upstream fix (all I could come up with
>>> was dbbb6a6, in possibly the same area, but to my untrained eye, it is
>>> hard to guess if this could be the same thing). If it is thought to be
>>> a new bug, then presumably I am better off working with upstream?
>> I'm also not sure if it's a new bug. It's just that there were a number of changes since 1.1.10.
>>> Either way, if a new bug is needed, it seems I should start with a bug here...
>> Well, I'd suggest the other way around, as the Ubuntu maintainers should know better how to handle it, where to check whether it's a new bug or not, etc. Though it is of course fine to inquire here.
>>> -----Original Message-----
>>> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
>>> Sent: 15 May 2015 19:00
>>> To: users at clusterlabs.org
>>> Subject: Re: [ClusterLabs] STONITH error:
>>> stonith_async_timeout_handler despite successful fence
>>> On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
>>>> I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
>>>> - The device responds "OK" to the reboot.
>>>> - 132s later, crmd complains about the timeout.
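A logging do-nothing device of the kind the poster describes can be sketched as an external/stonith-style shell agent. This is an illustrative assumption, not the poster's actual code; the agent name, log format, and hostlist default are all invented:

```shell
#!/bin/sh
# Minimal sketch of an external/stonith-style agent: every fencing request is
# reported as successful without touching any host, so the agent's own output
# shows exactly which calls stonithd made. The hostlist default and message
# format are illustrative assumptions.
hostlist="${hostlist:-octl-01 octl-02 octl-03}"   # stonithd passes params via env

handle() {
    action="$1" target="$2"
    case "$action" in
        gethosts)       echo "$hostlist" ;;                  # nodes we claim to fence
        reset|off|on)   echo "null-$action: $target" >&2 ;;  # log it, pretend success
        status)         : ;;                                 # device is "reachable"
        getconfignames) echo "hostlist" ;;
        *)              return 1 ;;                          # unsupported action
    esac
}

if [ $# -gt 0 ]; then
    handle "$@"
fi
```

An agent like this makes it easy to confirm, as the poster did, that the device returned OK and that no further calls arrived before the timeout fired.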
>>> It seems you can open a bug report for this. It could be, however, that the bug has already been fixed in the meantime, so best to file the bug with ubuntu.
>>>> I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
>>>> Thanks, Shaheed
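The viewer in question is libqb's qb-blackbox utility, which is why Ken suggests compiling corosync (and its libqb dependency) when the distro doesn't ship it. A sketch of the flight-recorder workflow, per the blog post Andrew links above; the dump filename pattern is an assumption, and this needs a live cluster node:

```shell
#!/bin/sh
# Hypothetical helper: ask a Pacemaker daemon to dump its in-memory blackbox
# ring buffer, then decode the newest dump with libqb's qb-blackbox reader.
# Prerequisite: pacemaker was started with PCMK_blackbox=crmd (or "yes") so
# the ring buffer exists. Run as root on the affected node.
dump_blackbox() {
    daemon="${1:-crmd}"
    killall -TRAP "$daemon"                  # the TRAP signal triggers a dump
    sleep 1                                  # give the daemon a moment to write it
    latest=$(ls -t /var/lib/pacemaker/blackbox/"$daemon"-* 2>/dev/null | head -1)
    [ -n "$latest" ] && qb-blackbox "$latest"    # decode the dump to stdout
}
```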
>>>> -----Original Message-----
>>>> From: Shaheedur Haque (shahhaqu)
>>>> Sent: 09 May 2015 07:23
>>>> To: users at clusterlabs.org
>>>> Subject: RE: STONITH error: stonith_async_timeout_handler despite
>>>> successful fence
>>>> I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
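For reference, a per-node null device of the sort implied by the setup can be configured roughly as follows. This is crm-shell syntax and an assumption about the poster's configuration (the resource names follow the log excerpt; the location constraint keeping a device off the node it fences is a common convention, not something the poster states):

```shell
# Hypothetical crm-shell configuration for a per-node "null" fence device;
# stonith:null ships with cluster-glue and only pretends to fence.
crm configure primitive stonith-octl-01 stonith:null \
    params hostlist="octl-01"
# keep the device that fences octl-01 off octl-01 itself
crm configure location loc-stonith-octl-01 stonith-octl-01 -inf: octl-01
```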
>>>> May 8 18:28:03 octl-03 stonith-ng: notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
>>>> May 8 18:28:03 octl-03 stonith-ng: notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
>>>> May 8 18:28:03 octl-03 stonith-ng: notice: log_operation: Operation 'reboot'  (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
>>>> May 8 18:28:03 octl-03 stonith: : info: Host null-reset: octl-01
>>>> May 8 18:30:15 octl-03 crmd: error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
>>>> May 8 18:30:15 octl-03 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>> May 8 18:30:15 octl-03 crmd: notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
>>>> May 8 18:30:15 octl-03 crmd: notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
>>>> May 8 18:30:15 octl-03 crmd: notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
>>>> May 8 18:30:15 octl-03 crmd: notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
>>>> Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue.
>>>> FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
>>>> Thanks, Shaheed
> Users mailing list: Users at clusterlabs.org
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org