[Pacemaker] hangs pending

Andrey Groshev greenx at yandex.ru
Thu Jan 9 07:11:31 EST 2014



08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  Hi, ALL.
>>
>>  I'm still trying to cope with the fact that, after a fence, the node hangs in "pending".
>
> Please define "pending".  Where did you see this?
In crm_mon:
......
Node dev-cluster2-node2 (172793105): pending
......


The experiment was like this:
Four nodes in the cluster.
On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
After that, the remaining nodes constantly reboot it under various pretexts ("softly whistling", "flying low", "not a cluster member!" ...).
Then "Too many failures ...." appears in the log.
All this time the node's status in crm_mon is "pending".
Depending on the wind direction, it changes to "UNCLEAN".
Much time has passed, so I can no longer describe the behavior precisely...

Now I am in the following state:
I tried to locate the problem and ended up here.
I set a large value for the stonith-timeout property ("600s").
And got the following behavior:
1. pkill -4 corosync
2. The node holding the DC calls my fence agent, "sshbykey".
3. It reboots the victim and waits until it comes back to life.
   Once the script confirms that the victim has rebooted and is reachable again via ssh, it exits with 0.
   All commands are logged on both the victim and the killer - everything looks right.
4. A little later, the victim node's status in crm_mon changes to online.
5. BUT... not a single resource starts! Even though "crm_simulate -sL" shows the resource that should correctly start:
   * Start   pingCheck:3  (dev-cluster2-node2)
6. We spend the next 600 seconds in this state.
   When this timeout expires, another node (not the DC) decides to kill our victim again.
   All commands are again logged on both the victim and the killer - everything documented :)
7. NOW all resources start in the right order.

I am almost happy, but I do not like it: two reboots and 10 minutes of waiting ;)
And if something happens on another node, this behavior is superimposed on the old one, and no resources start at all until the last node has rebooted twice.

I tried to understand this behavior.
As I understand it:
1. Ultimately, ./lib/fencing/st_client.c calls internal_stonith_action_execute().
2. It forks the agent and sets up a pipe to it.
3. It asynchronously calls mainloop_child_add() with stonith_action_async_done() as the callback.
4. It adds timeouts via g_timeout_add() for the TERM and KILL signals.

If everything goes right, stonith_action_async_done() should be called and the timeout removed.
For some reason this does not happen. I sit and think ....
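
For reference, the pattern looks roughly like the following minimal standalone GLib sketch. This is NOT the real Pacemaker code: the child command ("sleep 2"), the 5-second timeout and the callback names are only illustrative, and plain g_child_watch_add()/g_timeout_add() stand in for Pacemaker's mainloop_child_add() wrapper. The child watch plays the role of stonith_action_async_done() and removes the timeout source; the timeout fires only if that callback never runs.

/* build: gcc sketch.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static GMainLoop *loop = NULL;
static guint timeout_id = 0;

/* Plays the role of stonith_action_async_done(): the child finished in time. */
static void
child_done(GPid pid, gint status, gpointer user_data)
{
    printf("child %d exited with status %d\n", (int) pid, status);
    if (timeout_id) {
        g_source_remove(timeout_id);   /* cancel the pending timeout */
        timeout_id = 0;
    }
    g_spawn_close_pid(pid);
    g_main_loop_quit(loop);
}

/* Fires only if the child did not exit before the timeout: escalate. */
static gboolean
child_timeout(gpointer user_data)
{
    GPid pid = GPOINTER_TO_INT(user_data);

    fprintf(stderr, "child %d timed out, sending SIGTERM\n", (int) pid);
    kill(pid, SIGTERM);                /* Pacemaker escalates TERM, then KILL */
    timeout_id = 0;
    return G_SOURCE_REMOVE;
}

int
main(void)
{
    GPid pid = 0;
    gchar *argv[] = { "sleep", "2", NULL };   /* stands in for the fence agent */

    loop = g_main_loop_new(NULL, FALSE);

    /* Spawn the child; G_SPAWN_DO_NOT_REAP_CHILD is required so the
     * child-watch source can collect the exit status itself. */
    if (!g_spawn_async(NULL, argv, NULL,
                       G_SPAWN_SEARCH_PATH | G_SPAWN_DO_NOT_REAP_CHILD,
                       NULL, NULL, &pid, NULL)) {
        return EXIT_FAILURE;
    }

    g_child_watch_add(pid, child_done, NULL);
    timeout_id = g_timeout_add(5000 /* ms */, child_timeout,
                               GINT_TO_POINTER(pid));

    g_main_loop_run(loop);
    return EXIT_SUCCESS;
}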




>>  At this time, there are constant re-elections.
>>  Also, I noticed a difference in how pacemaker starts up.
>>  At normal startup:
>>  * corosync
>>  * pacemakerd
>>  * attrd
>>  * pengine
>>  * lrmd
>>  * crmd
>>  * cib
>>
>>  When the start hangs:
>>  * corosync
>>  * pacemakerd
>>  * attrd
>>  * pengine
>>  * crmd
>>  * lrmd
>>  * cib.
>
> Are you referring to the order of the daemons here?
> The cib should not be at the bottom in either case.
>
>>  Who knows who runs lrmd?
>
> Pacemakerd.
>



