[Pacemaker] pacemaker and spanning tree in the network between the nodes

Wed Dec 23 08:20:32 EST 2009

Hi,

As mentioned, I updated another cluster, this time a three-node cluster, and 
took some time to test a bit with one XEN resource configured.
The XEN resource also had the pingd constraint defined.

I observed the following there; see my inline comments below:

On Monday 21 December 2009 12:44:17 pm Dejan Muhamedagic wrote:
> Hi,
>

> >
> > I wonder about some things:
> > 1. why three of the pingd resources had no description shown after
> > leaving the maintenance mode.
I have seen something similar too, but did not observe anything unusual that 
would explain it to me.

> >
> > 2. why all XEN resources were shut down after leaving the maintenance
> > mode. Here I have a theory: In maintenance mode, the pingd attribute did
> > not get updated, and because heartbeat was restarted on each node, the
> > attribute was not set. Therefore when leaving the maintenance mode,
> > pacemaker decided to shut down the XEN resources, because the pingd
> > attribute was not set.
>
> Sounds like a plausible explanation.
That seems to be the case. I tested:
1. I put the whole cluster into maintenance mode. All resources then went 
into maintenance mode.
2. I did not restart heartbeat on any of the nodes.
3. I disabled maintenance mode again. Everything stayed fine, and the XEN 
resource was still running.
The pingd=1 attribute for the node on which the XEN resource was running was 
still there.

Then I tested again:
1. I put the whole cluster into maintenance mode. All resources then went 
into maintenance mode.
2. This time I restarted heartbeat on each of the nodes.
3. I disabled maintenance mode again, and the XEN resource was shut down 
immediately.
4. I waited, and waited, and waited, but the pingd resources did not update 
the pingd=1 attribute.
5. I stopped the pingd clone and saw that the pingd attribute was set to 0.
6. I started the pingd clone and saw that the pingd attribute was set to 1.
7. After a while, the XEN resource started.
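
To make the two test runs above easier to compare, here is a minimal Python 
sketch of how I assume the location constraint is evaluated. The rule shape 
("ban the node if pingd is not defined or is less than or equal to 0") is an 
assumption on my part, since the actual constraint is not shown in this mail, 
and the function name xen_may_run is purely hypothetical. It is only a model 
of the decision, not Pacemaker code.

```python
from typing import Mapping


def xen_may_run(node_attrs: Mapping[str, str]) -> bool:
    """Model of an ASSUMED location rule:
    ban the node if 'pingd' is not defined or its value is <= 0."""
    value = node_attrs.get("pingd")
    if value is None:
        # Attribute missing, e.g. lost after heartbeat was restarted.
        return False
    return int(value) > 0


# Attribute states as observed in the two test runs described above.
observed_states = [
    ("maintenance mode, heartbeat not restarted", {"pingd": "1"}),
    ("maintenance mode, heartbeat restarted", {}),
    ("pingd clone stopped", {"pingd": "0"}),
    ("pingd clone started again", {"pingd": "1"}),
]

for label, attrs in observed_states:
    verdict = "may run" if xen_may_run(attrs) else "is banned"
    print(f"{label}: XEN resource {verdict}")
```

Under that assumption, the missing attribute after the heartbeat restart is 
enough to explain why the XEN resource was stopped as soon as the cluster 
started managing it again.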

Therefore, my workaround for the problem, in case I need to restart 
heartbeat, is now:
1. Put each resource that depends on the pingd attribute into maintenance 
mode separately.
2. Stop pingd.
3. Restart heartbeat on each of the cluster nodes.
4. Wait until heartbeat is back again, and then start the pingd resource 
again.
5. Watch the logfiles until the pingd attribute is set to 1 for each node.
6. Put each resource separately back into maintained (managed) mode.
7. Everything is fine then!
So I wonder whether it is by design that pingd does not update the attribute 
when transitioning from maintenance to maintained mode.
Or could this be considered a bug, or something for an enhancement request?
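
The workaround steps above can be written down as a dry-run plan. The Python 
sketch below only builds the ordered list of commands and does not execute 
anything. The resource and node names are hypothetical, running the heartbeat 
init script over ssh is an assumption about the setup, and I am assuming the 
crm shell subcommands "resource unmanage/manage/stop/start" are available in 
the installed version.

```python
from typing import List, Sequence


def workaround_plan(dependent_resources: Sequence[str],
                    pingd_clone: str,
                    nodes: Sequence[str]) -> List[str]:
    """Return the workaround as an ordered list of shell lines (dry run)."""
    plan: List[str] = []
    # 1. Take every pingd-dependent resource out of cluster control.
    for rsc in dependent_resources:
        plan.append(f"crm resource unmanage {rsc}")
    # 2. Stop pingd.
    plan.append(f"crm resource stop {pingd_clone}")
    # 3. Restart heartbeat on each node (ssh + init script are assumptions).
    for node in nodes:
        plan.append(f"ssh {node} /etc/init.d/heartbeat restart")
    # 4. Manual step, then start pingd again.
    plan.append("# wait until heartbeat is back on all nodes")
    plan.append(f"crm resource start {pingd_clone}")
    # 5. Manual step: confirm the attribute before handing control back.
    plan.append("# watch the logfiles until pingd=1 is set for every node")
    # 6. Put the resources back under cluster control.
    for rsc in dependent_resources:
        plan.append(f"crm resource manage {rsc}")
    return plan


# Hypothetical names, for illustration only.
for line in workaround_plan(["xen-vm1"], "pingd-clone", ["node1", "node2", "node3"]):
    print(line)
```

The important point is the ordering: the dependent resources only return to 
managed mode after pingd has been restarted and the attribute is set again.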

>
> > 3. Why the pingd attribute was not set immediately after pingd started
> > up, and was able to ping the ping node. After the pingd was started, then
> > it waited 60 seconds (the timeout value) to set the attribute so that
> > then the XEN resources were able to start, due to their location
> > constraint.
I must have observed this incorrectly. The pingd attribute was set only a few 
seconds after pingd was started on the nodes. However, the dependent XEN 
resources were started only about a minute after that happened.
Is there any parameter I can use to shorten that delay from a minute to a few 
seconds?

> >
> > 4. Maybe the answers to the other questions will answer this already:
> > Why the cluster behaved so strangely at all with the large timeout values
> > set in ha.cf.
I also tested with larger values for deadtime and initdead in the 
/etc/ha.d/ha.cf file and did not observe any strange behaviour. So I guess 
that observation was just a coincidence caused by the above.

> >
> > I could also send a cluster-report in case it may help to figure out what
> > was wrong here, I just did not want to send a large attachment to the
> > list in the first place.
>
> Probably the best to open a bugzilla and attach there the report.
> I guess that special care is necessary on setting resources to
> the unmanaged mode in case there are constraints which depend on
> pingd attributes.
Given my further observations, there is no real need to open a bug report 
anymore.

Thanks, and a happy Christmas,
Sebastian