[ClusterLabs] [Problem] The crmd fails to connect with pengine.
renayama19661014 at ybb.ne.jp
Sat Jan 5 20:38:34 UTC 2019
Hi Jan,
Hi Ken,
Thanks for your comments.
I am going to look into the libqb problem a little more.
Many thanks,
Hideo Yamauchi.
----- Original Message -----
> From: Ken Gaillot <kgaillot at redhat.com>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2019/1/3, Thu 01:26
> Subject: Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
>
> On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
>> On 28/12/18 05:51 +0900, renayama19661014 at ybb.ne.jp wrote:
>> > This problem occurred with our users.
>> >
>> > The following problem occurred in a two-node cluster that does not
>> > set STONITH.
>> >
>> > The problem seems to have occurred in the following procedure.
>> >
>> > Step 1) Configure the cluster with 2 nodes. The DC node is the
>> > second node.
>> > Step 2) Several resources are running on the first node.
>> > Step 3) Both nodes are stopped at almost the same time, in the order
>> > 2nd node, then 1st node.
>>
>> Do I read the above correctly: the cluster is scheduled for shutdown
>> (fully independently, node by node, or through a single trigger from
>> a high-level management tool?) and proceeds in a serial manner,
>> shutting down the 2nd node, the original DC, first?
>>
>> > Step 4) After the second node stops, the first node tries to
>> > calculate a state transition in order to stop its resources.
>> >
>> > However, crmd fails to connect to pengine and does not calculate
>> > any state transition.
>> >
>> > -----
>> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
>> > connection failed, not adding channel to mainloop
>> > -----
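>>
>> As a rough illustration of what that setup step amounts to (a minimal
>> sketch only; the IPC endpoint name "pengine" and the buffer size are
>> assumptions here, and crmd's real code path goes through its own IPC
>> wrappers), the client side of such a connection looks something like:
>>
>> -----
>> /* Minimal sketch: connect to a Pacemaker daemon's libqb IPC socket.
>>  * The endpoint name and buffer size are illustrative assumptions. */
>> #include <errno.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <qb/qbipcc.h>
>>
>> int main(void)
>> {
>>     /* qb_ipcc_connect() returns NULL and sets errno on failure
>>      * (e.g. ECONNREFUSED if the server is gone, ENOMEM, ...). */
>>     qb_ipcc_connection_t *conn = qb_ipcc_connect("pengine", 128 * 1024);
>>
>>     if (conn == NULL) {
>>         fprintf(stderr, "IPC setup failed: %s\n", strerror(errno));
>>         return 1;
>>     }
>>
>>     qb_ipcc_disconnect(conn);
>>     return 0;
>> }
>> -----
>>
>> Presumably a failure at (or right after) this stage is what gets
>> reported as "Setup of client connection failed" before the channel is
>> ever added to the mainloop.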
>>
>> Sadly, it looks like the details of why this happened would only
>> have been retained if debug/trace verbosity had been enabled for the
>> log messages, which likely wasn't the case.
>>
>> Anyway, perhaps providing wider context from the log messages on
>> this first node might shed some light on this.
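>>
>> For example, assuming the packages ship the usual sysconfig file (the
>> exact path varies by distribution), the verbosity of the affected
>> daemons could be raised with something like:
>>
>> -----
>> # /etc/sysconfig/pacemaker (hypothetical excerpt; restart Pacemaker
>> # afterwards so the daemons pick the settings up)
>> PCMK_debug=crmd,pengine
>> PCMK_logfile=/var/log/pacemaker.log
>> -----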
>
> Agreed, that's probably the only hope.
>
> This would have to be a low-level issue like an out-of-memory error, or
> something at the libqb level.
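>
> To make that concrete (a sketch only, under the assumption that the
> client fd is watched through GLib; the helper and callback names are
> hypothetical, while qb_ipcc_connect(), qb_ipcc_fd_get(),
> g_io_channel_unix_new(), g_io_add_watch() and g_main_loop_run() are
> the real APIs), the "add channel to mainloop" step boils down to
> something like:
>
> -----
> /* Sketch: attach a libqb IPC client connection to a GLib main loop.
>  * Helper names are hypothetical; the endpoint name is assumed. */
> #include <stdio.h>
> #include <glib.h>
> #include <qb/qbipcc.h>
>
> static gboolean on_ipc_event(GIOChannel *src, GIOCondition cond,
>                              gpointer data)
> {
>     /* Read and dispatch replies here; return FALSE to drop the watch. */
>     return TRUE;
> }
>
> static int attach_to_mainloop(qb_ipcc_connection_t *conn)
> {
>     int32_t fd = -1;
>
>     /* Fails if the connection was never fully established. */
>     if (qb_ipcc_fd_get(conn, &fd) != 0 || fd < 0) {
>         fprintf(stderr, "no usable IPC fd\n");
>         return -1;
>     }
>
>     /* Wrap the fd and register it with the main loop. */
>     GIOChannel *channel = g_io_channel_unix_new(fd);
>
>     g_io_add_watch(channel, G_IO_IN | G_IO_HUP | G_IO_ERR,
>                    on_ipc_event, conn);
>     return 0;
> }
>
> int main(void)
> {
>     GMainLoop *loop = g_main_loop_new(NULL, FALSE);
>     qb_ipcc_connection_t *conn = qb_ipcc_connect("pengine", 128 * 1024);
>
>     if (conn == NULL || attach_to_mainloop(conn) != 0) {
>         /* This is the shape of the situation in the log above:
>          * setup failed, so nothing is added to the mainloop. */
>         fprintf(stderr, "Setup of client connection failed\n");
>         return 1;
>     }
>
>     g_main_loop_run(loop);
>     return 0;
> }
> -----
>
> If anything on that path fails (the connect itself, or retrieving a
> usable fd), the channel never reaches the mainloop, which would be
> consistent with the warning above; at default verbosity the logs don't
> say which step it was.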
>
>> > As a result, Pacemaker stops without stopping the resources.
>>
>> This might have serious consequences in some scenarios, unless
>> perhaps a watchdog-based solution (SBD?) was in use as the fencing
>> of choice, since the watchdog would not get defused precisely because
>> the resource wasn't stopped, I think...
>
> Yep, this is unavoidable in this situation. If the last node standing
> has an unrecoverable problem, there's no other node remaining to fence
> it and recover.
>
>> > The problem seems to have occurred in the following environment.
>> >
>> > - libqb 1.0
>> > - corosync 2.4.1
>> > - Pacemaker 1.1.15
>> >
>> > I tried to reproduce this problem, but so far I have not been able
>> > to reproduce it.
>> >
>> > Do you know the cause of this problem?
>>
>> No idea at this point.
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>