[ClusterLabs] [Problem] The crmd fails to connect with pengine.
Ken Gaillot
kgaillot at redhat.com
Wed Jan 2 11:26:56 EST 2019
On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
> On 28/12/18 05:51 +0900, renayama19661014 at ybb.ne.jp wrote:
> > This problem occurred with our users.
> >
> > The following problem occurred in a two-node cluster that does
> > not have STONITH configured.
> >
> > The problem seems to have occurred in the following procedure.
> >
> > Step 1) Configure the cluster with 2 nodes. The DC node is the
> > second node.
> > Step 2) Several resources are running on the first node.
> > Step 3) The nodes are stopped at almost the same time, the second
> > node first, then the first node.
>
> Do I read the above correctly that the cluster is scheduled for
> shutdown (either fully independently, node by node, or through a
> single trigger from a high-level management tool?) and proceeds in
> a serial manner, shutting down the 2nd node ~ original DC first?
>
> > Step 4) After the second node stops, the first node tries to
> > calculate the state transition for the resource stop.
> >
> > However, crmd fails to connect to pengine and does not calculate
> > state transitions.
> >
> > -----
> > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
> > connection failed, not adding channel to mainloop
> > -----
>
> Sadly, it looks like the details of why this happened would only
> have been retained if debug/trace verbosity of the log messages
> had been enabled, which likely wasn't the case.
>
> Anyway, perhaps providing a wider context of the log messages
> from this first node might shed some light on this.
Agreed, that's probably the only hope.
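If this can be made to recur, debug output from the daemons involved
can be enabled ahead of time via /etc/sysconfig/pacemaker, roughly
something like this (a sketch; the exact variables and paths vary by
version and distribution):
-----
# /etc/sysconfig/pacemaker (path varies by distribution)
# turn on debug logging for the daemons of interest
PCMK_debug=crmd,pengine
# optionally keep the detailed log in a dedicated file
PCMK_logfile=/var/log/pacemaker-debug.log
-----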
This would have to be a low-level issue like an out-of-memory error, or
something at the libqb level.
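To illustrate what the libqb level means here (a rough sketch, not the
actual crmd code): the client end of a libqb IPC connection comes down
to a call like the one below, and on failure the caller typically has
little more than errno (ENOMEM, EMFILE, ECONNREFUSED, ...) to go on,
which fits how little that warning is able to say.
-----
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <qb/qbipcc.h>   /* link with -lqb */

int main(void)
{
    /* "pengine" stands in for the scheduler's IPC server name;
     * purely illustrative here */
    qb_ipcc_connection_t *conn = qb_ipcc_connect("pengine", 128 * 1024);

    if (conn == NULL) {
        /* all the caller learns about the failure is errno,
         * e.g. ENOMEM, EMFILE or ECONNREFUSED */
        fprintf(stderr, "IPC connect failed: %s\n", strerror(errno));
        return 1;
    }

    qb_ipcc_disconnect(conn);
    return 0;
}
-----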
> > As a result, Pacemaker stops without stopping the resources.
>
> This might have serious consequences in some scenarios, perhaps
> unless some watchdog-based solution (SBD?) was used as the fencing
> of choice, since the watchdog would not get defused precisely
> because the resources weren't stopped, I think...
Yep, this is unavoidable in this situation. If the last node standing
has an unrecoverable problem, there's no other node remaining to fence
it and recover.
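(Since SBD came up: on a node with a hardware or softdog watchdog,
enabling watchdog self-fencing roughly amounts to the following. This
is only a sketch; the paths, the timeout value and the pcs syntax vary
by version and distribution.)
-----
# /etc/sysconfig/sbd (sketch; watchdog-only / diskless mode)
SBD_WATCHDOG_DEV=/dev/watchdog

# have sbd start together with the cluster stack
systemctl enable sbd

# let Pacemaker rely on the watchdog for self-fencing
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=10s
-----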
> > The problem seems to have occurred in the following environment.
> >
> > - libqb 1.0
> > - corosync 2.4.1
> > - Pacemaker 1.1.15
> >
> > I tried to reproduce this problem, but so far I have not been
> > able to reproduce it.
> >
> > Do you know the cause of this problem?
>
> No idea at this point.
--
Ken Gaillot <kgaillot at redhat.com>