[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

Tue Jun 16 10:47:08 CEST 2015

Just to close the loop on this issue...

The long silence on my part was because our cluster was hitting other problems, some of them self-inflicted :-), and disentangling them has taken a while. In the end, I was left with two problems:

- the one mentioned in this thread (which is essentially an n-way cluster problem from my POV)
- another, related to failover in a 2-way cluster, in which the replacement for the failed node failed to join the cluster

On a hunch, I replaced heartbeat with corosync and to my astonishment, both problems disappeared.

Thanks for all the help, Shaheed

-----Original Message-----
From: Shaheedur Haque (shahhaqu) 
Sent: 18 May 2015 09:58
To: users at clusterlabs.org
Subject: RE: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

I'm a little uncertain what the suggestion is. If this sounds like a bug that has already been fixed, then presumably it would be best if I could point Canonical to the upstream fix (all I could come up with was dbbb6a6, which is possibly in the same area, but to my untrained eye it is hard to tell whether it is the same thing). If it is thought to be a new bug, then presumably I am better off working with upstream?

Either way, if a new bug is needed, it seems I should start with a bug here...

-----Original Message-----
From: Dejan Muhamedagic [mailto:dejanmm at fastmail.fm]
Sent: 15 May 2015 19:00
To: users at clusterlabs.org
Subject: Re: [ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

Hi,

On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
> I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
> 
> - The device responds "OK" to the reboot.
> - 132s later, crmd complains about the timeout.

It seems you can open a bug report for this. It could be, however, that the bug has already been fixed in the meantime, so it is best to file the bug with Ubuntu.

Thanks,

Dejan

> I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
> 
> Thanks, Shaheed
> 
> -----Original Message-----
> From: Shaheedur Haque (shahhaqu)
> Sent: 09 May 2015 07:23
> To: users at clusterlabs.org
> Subject: RE: STONITH error: stonith_async_timeout_handler despite successful fence
> 
> Hi,
> 
> I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
> 
> ============
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
> May  8 18:28:03 octl-03 stonith: [16995]: info: Host null-reset: octl-01
> May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
> May  8 18:30:15 octl-03 crmd[15635]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May  8 18:30:15 octl-03 crmd[15635]:   notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
> May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
> May  8 18:30:15 octl-03 crmd[15635]:   notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
> ============
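
As a cross-check on the excerpt above, here is a minimal Python sketch that measures the gap between the "returned: 0 (OK)" line from stonith-ng and the timeout line from crmd for the same call number. The helper names (parse_stamp, fence_gap) and the year are assumptions made for illustration only; syslog lines carry no year, and the timestamps only have one-second resolution, so the result is approximate.

```python
import re
from datetime import datetime

# Two lines copied from the log excerpt above.
LOG = """\
May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
"""

STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2015):
    """Parse the leading syslog timestamp; the year is an assumption."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError("no syslog timestamp in line: %r" % line)
    month, day, clock = m.groups()
    return datetime.strptime(
        "%d %s %s %s" % (year, month, day, clock), "%Y %b %d %H:%M:%S"
    )


def fence_gap(log, call_id):
    """Return (seconds from fence OK to timeout, timeout reported in ms)."""
    ok_at = timeout_at = timeout_ms = None
    timeout_re = re.compile(r"Async call %d timed out after (\d+)ms" % call_id)
    for line in log.splitlines():
        if "(call %d from" % call_id in line and "returned: 0 (OK)" in line:
            ok_at = parse_stamp(line)
        m = timeout_re.search(line)
        if m:
            timeout_at = parse_stamp(line)
            timeout_ms = int(m.group(1))
    if ok_at is None or timeout_at is None:
        raise ValueError("log does not contain both events for this call")
    return (timeout_at - ok_at).total_seconds(), timeout_ms


if __name__ == "__main__":
    gap_s, timeout_ms = fence_gap(LOG, 51)
    print("gap: %.0fs, reported timeout: %dms" % (gap_s, timeout_ms))
```

For call 51, the measured gap is the same as the timeout crmd reports, to within the one-second resolution of the log. That is what one would expect if the timer was started at about the time of the fence request and was never cancelled when the OK result came back, but the sketch only compares timestamps and cannot confirm the cause.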
> 
> Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue. 
> 
> FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
> 
> Thanks, Shaheed
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
