[Pacemaker] pacemaker and spanning tree in the network between the nodes

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Dec 23 10:34:52 EST 2009


Hi,

On Wed, Dec 23, 2009 at 02:20:32PM +0100, Sebastian Reitenbach wrote:
> Hi,
> 
> as said, I updated another cluster, this time a three-node cluster, and I
> took some time to test a bit with one XEN resource configured.
> The XEN resource also had the pingd constraint defined.
> 
> There I observed the following things, see below:
> 
> On Monday 21 December 2009 12:44:17 pm Dejan Muhamedagic wrote:
> > Hi,
> >
> 
> 
> > >
> > > I wonder about some things:
> > > 1. why three of the pingd resources had no description shown after
> > > leaving the maintenance mode.
> I have seen something similar too, but did not observe anything strange that
> would explain it to me.
> 
> > >
> > > 2. why all XEN resources were shut down after leaving the maintenance
> > > mode. Here I have a theory: In maintenance mode, the pingd attribute did
> > > not get updated, and because heartbeat was restarted on each node, the
> > > attribute was not set. Therefore when leaving the maintenance mode,
> > > pacemaker decided to shut down the XEN resources, because the pingd
> > > attribute was not set.
> >
> > Sounds like a plausible explanation.
> That seems to be the case, I tested:
> 1. put the whole cluster into maintenance mode. Then all resources went into 
> maintenance mode.
> 2. I did not restart heartbeat on any of the nodes
> 3. I disabled the maintenance mode again, everything stayed fine, the
> XEN resource was still running.
> The pingd=1 attribute for the node on which the XEN resource was running was
> still there.
> 
> Then I tested again:
> 1. put the whole cluster into maintenance mode. Then all resources went into 
> maintenance mode.
> 2. I now restarted heartbeat on each of the nodes
> 3. I disabled the maintenance mode again, and the XEN resource was shut down 
> immediately
> 4. I waited, and waited and waited, but the pingd resources did not update
> the pingd=1 attribute
> 5. I stopped the pingd clone and saw the pingd attribute was set to 0
> 6. I started the pingd clone and saw the pingd attribute was set to 1
> 7. After a while, the XEN resource was starting
> 
> Therefore, my workaround to the problem, in case I need to restart heartbeat, 
> now is:
> 1. Put each resource depending on the pingd attribute into maintenance mode
> separately
> 2. Stop pingd
> 3. Restart heartbeat on each of the cluster nodes
> 4. Wait until heartbeat is back again, and then start the pingd resource
> again
> 5. Watch the logfiles until the pingd attribute for each node gets set to 1
> 6. Put each resource separately into maintained mode again
> 7. Everything is fine then!
> So I wonder whether it is by design that pingd doesn't update the
> attribute when transitioning from maintenance to maintained mode?
> Or could this be considered a bug, or something for an enhancement request?

I'd say it's a bug. Not sure where though, in pingd or the RA.
Is there anything in the logs?
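
For reference, the workaround quoted above could be scripted roughly as
follows. This is only a sketch, not something taken from the setup in
question: the resource, clone, and node names are made up, restarting
heartbeat over ssh is an assumption, and "maintenance mode" for a single
resource is taken to mean `crm resource unmanage`. By default it only
prints the commands (DRY_RUN=1) so it can be reviewed before anything is
run.

```shell
#!/bin/sh
# Sketch of the quoted workaround. All names below are hypothetical
# placeholders; override them from the environment.
PINGD_CLONE=${PINGD_CLONE:-pingd-clone}   # assumed id of the pingd clone
DEPENDENT_RSCS=${DEPENDENT_RSCS:-xen-vm1} # resources constrained on pingd
NODES=${NODES:-"node1 node2 node3"}       # cluster nodes
DRY_RUN=${DRY_RUN:-1}                     # 1 = print commands, do not run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: workaround_step prepare|start-pingd|remanage
workaround_step() {
    case "$1" in
    prepare)
        # Steps 1-3: unmanage dependants, stop pingd, restart heartbeat.
        for rsc in $DEPENDENT_RSCS; do
            run crm resource unmanage "$rsc"
        done
        run crm resource stop "$PINGD_CLONE"
        for node in $NODES; do
            run ssh "$node" /etc/init.d/heartbeat restart
        done
        ;;
    start-pingd)
        # Step 4: run this only once heartbeat is back on all nodes.
        run crm resource start "$PINGD_CLONE"
        ;;
    remanage)
        # Step 6: run this only after pingd=1 is set for the nodes (step 5).
        for rsc in $DEPENDENT_RSCS; do
            run crm resource manage "$rsc"
        done
        ;;
    *)
        echo "usage: workaround_step prepare|start-pingd|remanage" >&2
        return 1
        ;;
    esac
}
```

The three phases are kept separate on purpose: the waiting in steps 4 and
5 is left to the administrator, e.g. `workaround_step prepare`, then
`workaround_step start-pingd` once heartbeat is back, then
`workaround_step remanage` once the attribute is set.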

> > > 3. Why the pingd attribute was not set immediately after pingd started
> > > up, and was able to ping the ping node. After the pingd was started, then
> > > it waited 60 seconds (the timeout value) to set the attribute so that
> > > then the XEN resources were able to start, due to their location
> > > constraint.
> I must have observed this incorrectly. The pingd attribute was set only a
> few seconds after pingd was started on the nodes. However, the dependent
> XEN resources were only started about a minute after that happened.
> Is there any parameter I can use to shorten that time frame from a minute to
> a few seconds?

Don't think so. I don't understand why it waited.

> > > 4. Maybe the answers to the other questions will answer this already:
> > > Why the cluster behaved so strangely at all with the large timeout values
> > > set in ha.cf.
> I also tested here with larger values for deadtime and initdead in the
> /etc/ha.d/ha.cf file, and did not observe any strange behaviour. So I guess
> that observation was just a coincidence caused by the above....
> 
> > >
> > > I could also send a cluster-report in case it may help to figure out what
> > > was wrong here, I just did not want to send a large attachment to the
> > > list in the first place.
> >
> > Probably best to open a bugzilla and attach the report there.
> > I guess that special care is necessary on setting resources to
> > the unmanaged mode in case there are constraints which depend on
> > pingd attributes.
> Due to my further observations, no real need to open a bug report anymore.

It's not clear what happened to the pingd attribute: was it
updated immediately or not? A Xen resource also started later than
expected, which should be investigated too.
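
One way to see when the attribute actually changes, instead of inferring
it from the logs, would be to poll the status section of the CIB. A
minimal sketch, assuming the pingd value shows up there as an nvpair
element named "pingd" with the id, name, and value attributes on one
line (the exact XML layout and ids may differ between versions):

```shell
#!/bin/sh
# Print "<nvpair-id> <value>" for every pingd attribute found in CIB
# status XML read from stdin. Assumes one nvpair element per line with
# the id attribute appearing before the value attribute.
pingd_values() {
    grep 'name="pingd"' |
        sed -n 's/.*id="\([^"]*\)".*value="\([^"]*\)".*/\1 \2/p'
}
```

On a cluster node this would be used as
`cibadmin -Q -o status | pingd_values`, repeated every few seconds while
pingd is starting.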

Cheers,

Dejan

> thanks and a happy Christmas,
> Sebastian
> 
> 



