[Pacemaker] crm_resource -L not trustable right after restart

Brian J. Murrell (brian) brian at interlinx.bc.ca
Wed Jan 15 14:53:50 EST 2014


On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
> 
> Consider any long running action, such as starting a database.
> We do not update the CIB until after actions have completed, so there can and will be times when the status section is out of date to one degree or another.

But that is the opposite of what I am reporting, and that case is
acceptable.  It's fine for a resource that is in the process of starting
to be reported as stopped, because it has not yet started.

What I am seeing is resources being reported as stopped when they are in
fact started/running and have been for a long time.

> At node startup is another point at which the status could potentially be behind.

Right.  Which is the case I am talking about.

> It sounds to me like you're trying to second guess the cluster, which is a dangerous path.

No, not trying to second guess at all.  I'm just trying to ask the
cluster what the state is and not getting the truth.  I am willing to
believe whatever state the cluster says it's in as long as what I am
getting is the truth.

> What if it's the first node to start up?

I'd think a timeout comes into play here.

> There'd be no fresh copy to arrive in that case.

I can't say that I know how the CIB works internally/entirely, but I'd
imagine that when a cluster node starts up it tries to see if there is a
fresher CIB out there in the cluster.  Maybe this is part of the
process of choosing/discovering a DC.  But ultimately if the node is the
first one up, it will eventually figure that out so that it can nominate
itself as the DC.  Or it finds out that there is a DC already (and gets
a fresh CIB from it?).  It's during that window that I propose that
crm_resource should not be asserting anything and should just admit that
it does not (yet) know.

> If it had enough information to know it was out of date, it wouldn't be out of date.

But surely it understands whether it is in the process of joining a
cluster, and therefore knows enough to know that it can't yet tell
whether it's out of date, only that it could be.

> As above, there are situations when you'd never get an answer.

I should have added to my proposal "or has determined that there is
nothing to refresh its CIB from and that its local copy is
authoritative for the whole cluster."
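
To make the proposal concrete, here is a minimal sketch of the reporting
rule I have in mind.  It is plain Python, not Pacemaker code: the names
JoinState, join_state() and report(), and the idea of a single timeout
value, are all hypothetical and only illustrate the logic.  I am assuming
a node can tell whether it has found a DC and how long it has been
looking.

```python
from enum import Enum


class JoinState(Enum):
    """Hypothetical view of where a starting node is in joining the cluster."""
    JOINING = "joining"        # still looking for a DC or a fresher CIB
    SYNCED = "synced"          # found a DC and refreshed the CIB from it
    FIRST_NODE = "first_node"  # gave up waiting; local CIB is authoritative


def join_state(dc_found: bool, elapsed_s: float, timeout_s: float) -> JoinState:
    """Classify the node. timeout_s is an assumed, illustrative setting."""
    if dc_found:
        return JoinState.SYNCED
    if elapsed_s >= timeout_s:
        # Nothing answered in time, so there is nothing to refresh from.
        return JoinState.FIRST_NODE
    return JoinState.JOINING


def report(resource_status: str, state: JoinState) -> str:
    """Return the status to show, admitting ignorance while still joining."""
    if state is JoinState.JOINING:
        return "unknown"
    return resource_status
```

With this rule, a node that restarted a second ago and has not found a DC
would answer "unknown" rather than "Stopped"; once it has either synced
or timed out as the first node up, it reports whatever its CIB says.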

b.






