[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

Mon May 18 11:08:54 EDT 2015

Thanks for clarifying.

Is it a fair assumption that not much has changed on the wire between the components from 1.1.10 to 1.1.12/13, so that I could just replace the crmd that comes with 14.04 with one I built locally? If so, I could test that myself and then report back here or to Canonical as needed. (Changing more than one binary is clearly also an option, so if changing more or all of them is needed, that would be good to know too. Apologies for sounding like a noob; this is all very new to me.)
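
For anyone comparing a rebuilt crmd against the stock one, the symptom in the log quoted further down (stonith-ng reports OK for a call, then crmd reports a timeout for the same call) can be checked mechanically. The following is only a rough sketch: the two sample lines are copied from that log, while the regular expressions, the function names and the assumed year are my own and are not part of Pacemaker.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt quoted later in this thread.
LOG = """\
May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
"""

# Hypothetical patterns, written against the message formats seen above only.
STAMP = r"(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d)"
OK_RE = re.compile(
    STAMP + r".*log_operation: .*\(call (?P<call>\d+) from crmd\.\d+\)"
    r".*returned: 0 \(OK\)"
)
TIMEOUT_RE = re.compile(
    STAMP + r".*stonith_async_timeout_handler: "
    r"Async call (?P<call>\d+) timed out after (?P<ms>\d+)ms"
)


def parse_ts(text, year=2015):
    """Parse a syslog timestamp; syslog omits the year, so it is assumed."""
    normalized = " ".join(text.split())  # collapse the padding in "May  8"
    return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")


def find_late_timeouts(lines):
    """Return (call_id, gap_seconds, timeout_ms) for each call that was
    logged as OK by stonith-ng and later reported as timed out by crmd."""
    ok_at = {}
    findings = []
    for line in lines:
        m = OK_RE.search(line)
        if m:
            ok_at[m["call"]] = parse_ts(m["ts"])
            continue
        m = TIMEOUT_RE.search(line)
        if m and m["call"] in ok_at:
            gap = (parse_ts(m["ts"]) - ok_at[m["call"]]).total_seconds()
            findings.append((int(m["call"]), gap, int(m["ms"])))
    return findings


if __name__ == "__main__":
    for call_id, gap, timeout_ms in find_late_timeouts(LOG.splitlines()):
        print(f"call {call_id}: OK logged {gap:.0f}s before a "
              f"{timeout_ms}ms timeout was reported")
```

Run against a full syslog, this should list every call that shows the same pattern. It matches on the call number alone, so numbers reused after a crmd restart could produce false pairs.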

-----Original Message-----
From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm] 
Sent: 18 May 2015 15:25
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

On Mon, May 18, 2015 at 08:58:30AM +0000, Shaheedur Haque (shahhaqu) wrote:
> I'm a little uncertain what the suggestion is; if this sounds like a 
> bug that has been fixed, then presumably it would be best if I 
> could point Canonical to the upstream fix (all I could come up with 
> was dbbb6a6, in possibly the same area, but to my untrained eye, it is 
> hard to guess if this could be the same thing). If it is thought to be 
> a new bug, then presumably I am better off working with upstream?

I'm also not sure if it's a new bug. It's just that there were a number of changes since 1.1.10.

> Either way, if a new bug is needed, it seems I should start with a bug here...

Well, I'd suggest the other way around, as the Ubuntu maintainers should know better how to handle it, where to check if it's a new bug or not, etc. Though it is of course fine to inquire here.

Thanks,

Dejan

> -----Original Message-----
> From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
> Sent: 15 May 2015 19:00
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] STONITH error: 
> stonith_async_timeout_handler despite successful fence
> 
> Hi,
> 
> On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
> > I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
> > 
> > - The device responds "OK" to the reboot.
> > - 132s later, crmd complains about the timeout.
> 
> It seems you can open a bug report for this. It could be, however, that the bug has already been fixed in the meantime, so it is best to file the bug with Ubuntu.
> 
> Thanks,
> 
> Dejan
> 
> > I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
> > 
> > Thanks, Shaheed
> > 
> > -----Original Message-----
> > From: Shaheedur Haque (shahhaqu)
> > Sent: 09 May 2015 07:23
> > To: users at clusterlabs.org
> > Subject: RE: STONITH error: stonith_async_timeout_handler despite 
> > successful fence
> > 
> > Hi,
> > 
> > I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
> > 
> > ============
> > May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
> > May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
> > May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
> > May  8 18:28:03 octl-03 stonith: [16995]: info: Host null-reset: octl-01
> > May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
> > May  8 18:30:15 octl-03 crmd[15635]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> > May  8 18:30:15 octl-03 crmd[15635]:   notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> > May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
> > May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
> > May  8 18:30:15 octl-03 crmd[15635]:   notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
> > ============
> > 
> > Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue. 
> > 
> > FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
> > 
> > Thanks, Shaheed
> > 
> > 
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org 
> > http://clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org Getting started: 
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 