[Pacemaker] hangs pending

Andrey Groshev greenx at yandex.ru
Fri Jan 10 05:23:12 EST 2014



10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>   On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>    08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>    On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>     Hi, ALL.
>>>>>>
>>>>>>     I'm still trying to cope with the fact that, after being fenced, a node hangs in "pending".
>>>>>    Please define "pending".  Where did you see this?
>>>>    In crm_mon:
>>>>    ......
>>>>    Node dev-cluster2-node2 (172793105): pending
>>>>    ......
>>>>
>>>>    The experiment was like this:
>>>>    Four nodes in cluster.
>>>>    On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>>>>    After that, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>    Then "Too many failures ...." appeared in the log.
>>>>    All this time the node's status in crm_mon is "pending".
>>>>    Depending on the wind direction, it changed to "UNCLEAN".
>>>>    Much time has passed and I cannot describe the behavior accurately...
>>>>
>>>>    Now I am in the following state:
>>>>    I tried to locate the problem and came up with this.
>>>>    I set a large value for the property stonith-timeout="600s".
>>>>    And got the following behavior:
>>>>    1. pkill -4 corosync
>>>>    2. The DC node calls my fence agent "sshbykey".
>>>>    3. It sends a reboot to the victim and waits until it comes back to life.
>>>   Hmmm.... what version of pacemaker?
>>>   This sounds like a timing issue that we fixed a while back
>>  It was version 1.1.11 from December 3.
>>  Now I will do a full update and retest.
>
> That should be recent enough.  Can you create a crm_report the next time you reproduce?
>

Of course, yes. With a little delay... :)

......
cc1: warnings being treated as errors
upstart.c: In function ‘upstart_job_property’:
upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
upstart.c:264: error: assignment makes pointer from integer without a cast
gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/ha/pacemaker/lib'
make: *** [core] Error 1

I'm trying to solve this problem.
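In case it helps anyone hitting the same build break: g_variant_lookup_value() only appeared in GLib 2.28, so on older distributions upstart.c fails as above. Below is only a rough sketch of the kind of fallback I am experimenting with, not the actual upstream fix; unlike the real function it also skips the expected-type check:

/* Rough compatibility sketch for GLib < 2.28, where
 * g_variant_lookup_value() does not exist yet.  Walks an "a{sv}"
 * dictionary by hand and returns a new reference to the value stored
 * under `key`, or NULL if the key is not present. */
#include <glib.h>

#if !GLIB_CHECK_VERSION(2, 28, 0)
static GVariant *
compat_variant_lookup_value(GVariant *dictionary, const gchar *key)
{
    GVariantIter iter;
    gchar *name = NULL;
    GVariant *value = NULL;

    g_variant_iter_init(&iter, dictionary);
    while (g_variant_iter_next(&iter, "{sv}", &name, &value)) {
        if (g_strcmp0(name, key) == 0) {
            g_free(name);
            return value;        /* caller owns this reference */
        }
        g_free(name);
        g_variant_unref(value);
        name = NULL;
        value = NULL;
    }
    return NULL;
}
#endif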


>>>>      Once the script makes sure that the victim has rebooted and is reachable again via ssh, it exits with 0.
>>>>      All commands are logged on both the victim and the killer - all right.
>>>>    4. A little later, the status of the victim node in crm_mon changes to online.
>>>>    5. BUT... not one resource starts! This is despite the fact that "crm_simulate -sL" shows the correct resource to start:
>>>>      * Start   pingCheck:3  (dev-cluster2-node2)
>>>>    6. We spend the next 600 seconds in this state.
>>>>      When this timeout expires, another node (not the DC) decides to kill our victim again.
>>>>      All commands are again logged on both the victim and the killer - all documented :)
>>>>    7. NOW all resources start in the right sequence.
>>>>
>>>>    I'm almost happy, but I don't like it: two reboots and 10 minutes of waiting ;)
>>>>    And if something happens on another node, this behavior is superimposed on the old one, and no resources start until the last node has rebooted twice.
>>>>
>>>>    I tried to understand this behavior.
>>>>    As I understand it:
>>>>    1. Ultimately, ./lib/fencing/st_client.c calls internal_stonith_action_execute().
>>>>    2. It forks and creates a pipe to the child.
>>>>    3. Asynchronously, it calls mainloop_child_add() with a callback to stonith_action_async_done().
>>>>    4. It adds a timeout with g_timeout_add() that escalates to the TERM and KILL signals.
>>>>
>>>>    If all goes right, stonith_action_async_done() should be called and the timeout removed.
>>>>    For some reason this does not happen. I sit and think ....
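To make the flow above easier to follow, here is a simplified, self-contained sketch of the same pattern using plain GLib (g_spawn_async() / g_child_watch_add() / g_timeout_add_seconds()) rather than Pacemaker's actual mainloop_child_add() wrapper; the type and function names and the timeout value are made up for the illustration:

/* Illustration only, not Pacemaker's real code. */
#include <glib.h>
#include <signal.h>

typedef struct {
    GPid  pid;        /* fence-agent child process */
    guint timer_id;   /* escalation timer, 0 once fired */
} agent_ctx_t;

/* Fired if the agent runs past the timeout: escalate with SIGTERM
 * (a real implementation would follow up with SIGKILL). */
static gboolean
agent_timed_out(gpointer data)
{
    agent_ctx_t *ctx = data;

    ctx->timer_id = 0;
    kill(ctx->pid, SIGTERM);
    return FALSE;                 /* one-shot timer */
}

/* Fired when the agent exits.  This is the step that seems to be
 * skipped in my case: the done-callback never runs, so the timer is
 * never removed and the 600s stonith-timeout has to expire instead. */
static void
agent_done(GPid pid, gint status, gpointer data)
{
    agent_ctx_t *ctx = data;

    if (ctx->timer_id != 0) {
        g_source_remove(ctx->timer_id);
    }
    g_spawn_close_pid(pid);
    g_free(ctx);
}

static void
run_agent_async(const gchar *agent_path, guint timeout_secs)
{
    gchar *argv[] = { (gchar *) agent_path, NULL };
    agent_ctx_t *ctx = g_new0(agent_ctx_t, 1);

    /* 1-2. fork/exec the agent; the child must not be auto-reaped,
     *      otherwise the child watch cannot collect its exit status */
    if (!g_spawn_async(NULL, argv, NULL, G_SPAWN_DO_NOT_REAP_CHILD,
                       NULL, NULL, &ctx->pid, NULL)) {
        g_free(ctx);
        return;
    }
    /* 4. arm the escalation timer */
    ctx->timer_id = g_timeout_add_seconds(timeout_secs, agent_timed_out, ctx);
    /* 3. watch the child from the mainloop; agent_done() should remove
     *    the timer as soon as the agent finishes */
    g_child_watch_add(ctx->pid, agent_done, ctx);
}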
>>>>>>     At this time, there are constant re-elections.
>>>>>>     Also, I noticed a difference in how pacemaker starts up.
>>>>>>     At normal startup:
>>>>>>     * corosync
>>>>>>     * pacemakerd
>>>>>>     * attrd
>>>>>>     * pengine
>>>>>>     * lrmd
>>>>>>     * crmd
>>>>>>     * cib
>>>>>>
>>>>>>     When the start hangs:
>>>>>>     * corosync
>>>>>>     * pacemakerd
>>>>>>     * attrd
>>>>>>     * pengine
>>>>>>     * crmd
>>>>>>     * lrmd
>>>>>>     * cib.
>>>>>    Are you referring to the order of the daemons here?
>>>>>    The cib should not be at the bottom in either case.
>>>>>>     Who knows who runs lrmd?
>>>>>    Pacemakerd.



