[ClusterLabs] STONITH error: stonith_async_timeout_handler despite successful fence

Fri May 15 17:59:39 UTC 2015

Hi,

On Tue, May 12, 2015 at 08:28:51AM +0000, Shaheedur Haque (shahhaqu) wrote:
> I ended up writing my own STONITH device so I could clearly log/see what was going on, and I can confirm that I see no unexpected calls to the device but the behaviour remains the same:
> 
> - The device responds "OK" to the reboot.
> - 132s later, crmd complains about the timeout.

This looks like it is worth a bug report. It could be, however,
that the bug has already been fixed in the meantime, so it is
best to file the bug with Ubuntu.

Thanks,

Dejan

> I am convinced at this point that somehow, crmd is losing track of the timer it started to protect the call to stonith-ng. Is there any logging etc. I could gather to help diagnose the problem? (I tried the blackbox stuff, but Ubuntu seems not to build/ship the viewer utility :-().
> 
> Thanks, Shaheed
> 
> -----Original Message-----
> From: Shaheedur Haque (shahhaqu) 
> Sent: 09 May 2015 07:23
> To: users at clusterlabs.org
> Subject: RE: STONITH error: stonith_async_timeout_handler despite successful fence
> 
> Hi,
> 
> I am working in a virtualised environment where, for now at least, I am simply deleting a clustered VM and then expecting the rest of the cluster to recover using the "null" STONITH device. As far as I can see from the log, the (simulated) reboot returned OK, but the timeout fired anyway:
> 
> ============
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-01 can fence octl-01: dynamic-list
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: can_fence_host_with_device: stonith-octl-02 can not fence octl-01: dynamic-list
> May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' with device 'stonith-octl-01' returned: 0 (OK)
> May  8 18:28:03 octl-03 stonith: [16995]: info: Host null-reset: octl-01
> May  8 18:30:15 octl-03 crmd[15635]:    error: stonith_async_timeout_handler: Async call 51 timed out after 132000ms
> May  8 18:30:15 octl-03 crmd[15635]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May  8 18:30:15 octl-03 crmd[15635]:   notice: run_graph: Transition 158 (Complete=3, Pending=0, Fired=0, Skipped=25, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51 for octl-01 failed (Timer expired): aborting transition.
> May  8 18:30:15 octl-03 crmd[15635]:   notice: tengine_stonith_callback: Stonith operation 51/45:158:0:6f5821b3-2644-40c1-8bbc-cfcdf049656b: Timer expired (-62)
> May  8 18:30:15 octl-03 crmd[15635]:   notice: too_many_st_failures: Too many failures to fence octl-01 (50), giving up
> ============
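
The excerpt above can be checked mechanically: the gap between the "returned: 0 (OK)" line from stonith-ng and the timeout line from crmd should equal the timeout that crmd reports. The following is only a minimal sketch under stated assumptions. The two log lines are copied from the excerpt, the helper names (parse_stamp, gap_seconds, reported_timeout_ms) are made up for illustration, the year is assumed to be 2015 because syslog timestamps omit it, and month names are assumed to parse under an English/C locale.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the log excerpt above.
LOG_LINES = [
    "May  8 18:28:03 octl-03 stonith-ng[15633]:   notice: log_operation: "
    "Operation 'reboot' [16994] (call 51 from crmd.15635) for host 'octl-01' "
    "with device 'stonith-octl-01' returned: 0 (OK)",
    "May  8 18:30:15 octl-03 crmd[15635]:    error: "
    "stonith_async_timeout_handler: Async call 51 timed out after 132000ms",
]

# Classic syslog prefix: month name, day (possibly space-padded), HH:MM:SS.
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2})")
TIMEOUT = re.compile(r"timed out after (\d+)ms")


def parse_stamp(line, year=2015):
    """Return the timestamp of a syslog line (the year is an assumption)."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("no syslog timestamp in line: %r" % line)
    month, day, clock = match.groups()
    return datetime.strptime(
        "%d %s %s %s" % (year, month, day, clock), "%Y %b %d %H:%M:%S"
    )


def gap_seconds(first_line, second_line):
    """Seconds elapsed between two syslog lines."""
    return (parse_stamp(second_line) - parse_stamp(first_line)).total_seconds()


def reported_timeout_ms(line):
    """Extract the timeout that crmd reports in its error message."""
    match = TIMEOUT.search(line)
    if match is None:
        raise ValueError("no 'timed out after' value in line: %r" % line)
    return int(match.group(1))


if __name__ == "__main__":
    gap = gap_seconds(LOG_LINES[0], LOG_LINES[1])
    timeout_ms = reported_timeout_ms(LOG_LINES[1])
    print("gap between OK and timeout: %.0f s" % gap)
    print("timeout reported by crmd:   %d ms" % timeout_ms)
```

For these two lines the gap matches the reported timeout, which fits the observation that the timer ran to completion even though the fence operation had already returned OK. It does not show why the result failed to cancel the timer.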
> 
> Any thoughts on whether I might be doing something wrong or if this is a new issue? I've seen some other fixes in this area in the relatively recent past such as https://github.com/beekhof/pacemaker/commit/dbbb6a6, but it is not clear to me if this is the same thing or a different issue. 
> 
> FWIW, I am on Ubuntu Trusty (the change log is here: https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.3), but I cannot seem to tell just what fixes from 1.1.11 or 1.1.12 have been backported.
> 
> Thanks, Shaheed
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org