[Pacemaker] hangs pending

Thu Jan 9 20:22:25 EST 2014

On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>> 
>>>  Hi, ALL.
>>> 
>>>  I'm still trying to cope with the fact that, after being fenced, a node hangs in "pending".
>> 
>> Please define "pending".  Where did you see this?
> In crm_mon:
> ......
> Node dev-cluster2-node2 (172793105): pending
> ......
> 
> 
> The experiment was like this:
> Four nodes in the cluster.
> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
> After that, the remaining nodes keep rebooting it under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
> Then "Too many failures ...." appeared in the log.
> All this time, the status in crm_mon is "pending".
> Depending on the wind direction, it changed to "UNCLEAN".
> Much time has passed and I cannot describe the behavior accurately...
> 
> Now I am in the following state:
> I tried to locate the problem, and this is what I came up with.
> I set a large value for the property stonith-timeout="600s".
> And got the following behavior:
> 1. pkill -4 corosync
> 2. The DC node calls my fence agent "sshbykey".
> 3. It sends a reboot to the victim and waits until it comes back to life.

Hmmm.... what version of pacemaker?
This sounds like a timing issue that we fixed a while back.

>   Once the script has made sure that the victim has rebooted and is available via ssh again, it exits with 0.
>   All commands are logged on both the victim and the killer - all correct.
> 4. A little later, the status of the (victim) node in crm_mon changes to online.
> 5. BUT... not one resource starts! Despite the fact that "crm_simulate -sL" shows the correct resource to start:
>   * Start   pingCheck:3  (dev-cluster2-node2)
> 6. We spend the next 600 seconds in this state.
>   When this timeout expires, another node (not the DC) decides to kill our victim again.
>   All commands are again logged on both the victim and the killer - all documented :)
> 7. NOW all resources start in the right sequence.
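
To make the agent behaviour in steps 2-3 concrete, here is a minimal sketch in Python of a "reboot, then block until the victim is back" loop. It is not the actual "sshbykey" script, which is not shown in this thread; reboot_and_wait and the reboot / is_reachable callables are hypothetical stand-ins for whatever the agent does over ssh.

```python
import time


def reboot_and_wait(reboot, is_reachable, timeout_s, poll_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Hypothetical sketch of a fence agent's wait loop.

    reboot       -- callable that triggers the reboot of the victim
    is_reachable -- callable returning True when the victim answers
    Returns 0 once the victim has been seen down and then up again,
    or 1 if that did not happen within timeout_s seconds.
    """
    reboot()
    deadline = clock() + timeout_s
    seen_down = False
    while clock() < deadline:
        if not is_reachable():
            # The victim really went away, so the reboot took effect.
            seen_down = True
        elif seen_down:
            # Down earlier, reachable now: the reboot has completed.
            return 0
        sleep(poll_s)
    return 1
```

The point of the sketch is only that the agent does not return until the victim is reachable again, so the time it takes counts against whatever timeout the caller applies.
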
> 
> I am almost happy, but I do not like the two reboots and 10 minutes of waiting ;)
> And if something happens on another node, this behavior is superimposed on the old one, and no resources start until the last node has been rebooted twice.
> 
> I tried to understand this behavior.
> As I understand it:
> 1. Ultimately, internal_stonith_action_execute() is called in ./lib/fencing/st_client.c.
> 2. It forks and sets up pipes to the child.
> 3. It makes an async call to mainloop_child_add with stonith_action_async_done as the callback.
> 4. It adds a timeout with g_timeout_add for the TERM and KILL signals.
> 
> If all goes well, stonith_action_async_done should be called and the timeout removed.
> For some reason this does not happen. I sit and think ....
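
The flow in the quoted steps (spawn a child, wait for it, escalate to TERM and then KILL if it overruns) can be sketched roughly as below. This is Python rather than the C/glib code in st_client.c, it waits synchronously instead of using a main loop callback, and run_with_escalation and its parameters are names made up for the illustration.

```python
import subprocess


def run_with_escalation(argv, timeout_s, grace_s=1.0):
    """Run argv as a child process and return (returncode, timed_out).

    If the child does not exit within timeout_s seconds it is sent
    SIGTERM; if it is still alive grace_s seconds later it is sent
    SIGKILL. Illustrative only, not Pacemaker's implementation.
    """
    child = subprocess.Popen(argv, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    try:
        # Normal path: the child finishes in time, so no signal is sent.
        child.communicate(timeout=timeout_s)
        return child.returncode, False
    except subprocess.TimeoutExpired:
        child.terminate()  # SIGTERM on POSIX
        try:
            child.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            child.kill()  # SIGKILL on POSIX
            child.communicate()
        return child.returncode, True
```

In this sketch the "done" path and the timeout path are mutually exclusive: once the child has been reaped in time, nothing is left that could fire later.
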
> 
> 
> 
> 
>>>  During this time, there are constant re-elections.
>>>  Also, I noticed a difference in how pacemaker starts.
>>>  At normal startup:
>>>  * corosync
>>>  * pacemakerd
>>>  * attrd
>>>  * pengine
>>>  * lrmd
>>>  * crmd
>>>  * cib
>>> 
>>>  When it hangs at startup:
>>>  * corosync
>>>  * pacemakerd
>>>  * attrd
>>>  * pengine
>>>  * crmd
>>>  * lrmd
>>>  * cib.
>> 
>> Are you referring to the order of the daemons here?
>> The cib should not be at the bottom in either case.
>> 
>>>  Does anyone know what starts lrmd?
>> 
>> Pacemakerd.
>> 
>>>  _______________________________________________
>>>  Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>>  Project Home: http://www.clusterlabs.org
>>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>  Bugs: http://bugs.clusterlabs.org
>> 
> 
