[Pacemaker] pacemaker and spanning tree in the network between the nodes

Tue Dec 22 08:53:20 UTC 2009

Hi,
On Monday 21 December 2009 12:44:17 pm Dejan Muhamedagic wrote:
> Hi,
>
> On Fri, Dec 18, 2009 at 03:44:11PM +0100, Sebastian Reitenbach wrote:
> > Hi,
> >
> > I have a 4-node cluster managing some XEN resources. The XEN resources
> > have location constraints defined, based on pingd. On each node, a pingd
> > clone is running. XEN resources are only started when pingd is able
> > to ping the ping node. The xen nodes also have a preferred and fallback
> > location defined. The pingd resources have a timeout of 60 seconds
> > defined.
> > The cluster nodes run on SLES11, x86_64, with these RPMs installed:
> > heartbeat-3.0.0-33.2
> > pacemaker-1.0.5-4.1
> > libpacemaker3-1.0.5-4.1
> > pacemaker-mgmt-client-1.99.2-7.1
> > pacemaker-mgmt-1.99.2-7.1
> > openais-0.80.3-26.1
> > libopenais2-0.80.3-26.1
> >
> > I want to switch to a redundant network layout, using spanning tree
> > between the switches. In case of a spanning tree recalculation because of
> > a path failure or any other reason, I don't want nodes to be
> > declared dead because they cannot send heartbeats to each other during
> > that time.
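
As a rough sanity check on the timing concern above, the sketch below compares a heartbeat deadtime against the worst-case outage of a spanning tree reconvergence. It assumes classic IEEE 802.1D STP with default timers (max age 20 s, forward delay 15 s); the thread does not say which STP variant or timers the switches use, and RSTP converges much faster. The function names are hypothetical, and the 15 s and 70 s values are the old and new ha.cf deadtime settings from this thread.

```python
# Assumed classic 802.1D STP default timers in seconds. The thread does not
# state the real switch configuration, so treat these as placeholders.
STP_MAX_AGE = 20
STP_FORWARD_DELAY = 15


def stp_worst_case_outage(max_age=STP_MAX_AGE, forward_delay=STP_FORWARD_DELAY):
    """Upper bound for classic STP: max age expiry plus listening and learning."""
    return max_age + 2 * forward_delay


def deadtime_margin(deadtime, outage):
    """Seconds of headroom between the heartbeat deadtime and the outage."""
    return deadtime - outage


if __name__ == "__main__":
    outage = stp_worst_case_outage()
    # 15 and 70 are the old and new deadtime values mentioned in this thread.
    for label, deadtime in (("old", 15), ("new", 70)):
        margin = deadtime_margin(deadtime, outage)
        verdict = "rides out the outage" if margin > 0 else "nodes may be declared dead"
        print(f"{label} deadtime {deadtime}s, outage up to {outage}s: {verdict}")
```

Under these assumed timers the deadtime has to exceed roughly 50 seconds, which is consistent with choosing a value such as 70.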
> >
> > Therefore I tried to prepare pacemaker on the cluster nodes.
> > I put the whole cluster in maintenance mode via the hb_gui.
> >
> > Then I reconfigured /etc/ha.d/ha.cf and defined deadtime 70 and initdead
> > 100. Then I restarted heartbeat on each cluster node. I waited until all
> > cluster members were marked green/online in the GUI again. Then I turned
> > off the maintenance mode.
> > All XEN resources were shut down immediately.
>
> Oops.
>
> > Then
>
> A sentence missing?
>
> > In the hb_gui, the pingd resources looked a bit "strange". After leaving
> > maintenance mode, only one pingd resource showed the description
> > ocf::pacemaker:pingd in hb_gui under Management. They were green and
> > were shown as running on ['<server>'].
> >
> > Then I tried to restart the XEN resources manually, but the cluster only
> > tried to start them on one host, not on the preferred or fallback
> > location.
> >
> > Then I shut down heartbeat on all 4 cluster nodes again, put back
> > the old ha.cf file with deadtime 15 and initdead 40, and restarted
> > heartbeat. After the cluster was running, the pingd resources were also
> > started up. Then, after the 60 seconds, the ping attribute was set
> > and the XEN resources were started up on all hosts.
> >
> > I wonder about some things:
> > 1. Why three of the pingd resources had no description shown after
> > leaving maintenance mode.
> >
> > 2. Why all XEN resources were shut down after leaving maintenance
> > mode. Here I have a theory: in maintenance mode, the pingd attribute did
> > not get updated, and because heartbeat was restarted on each node, the
> > attribute was not set. Therefore, when leaving maintenance mode,
> > pacemaker decided to shut down the XEN resources because the pingd
> > attribute was not set.
>
> Sounds like a plausible explanation.
>
> > 3. Why the pingd attribute was not set immediately after pingd started
> > up and was able to ping the ping node. After pingd was started, it
> > waited 60 seconds (the timeout value) before setting the attribute, so
> > only then were the XEN resources able to start, due to their location
> > constraint.
> >
> > 4. Maybe the answers to the other questions will answer this already:
> > why the cluster behaved so strangely at all with the large timeout values
> > set in ha.cf.
> >
> > I could also send a cluster-report in case it may help to figure out what
> > was wrong here; I just did not want to send a large attachment to the
> > list in the first place.
>
> Probably best to open a bugzilla and attach the report there.
> I guess that special care is necessary when setting resources to
> unmanaged mode in case there are constraints which depend on
> pingd attributes.
I'm just updating another cluster to SLES11. I will try to reproduce the
problem there and create a bug report with hb_report attached.
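
To illustrate the theory from question 2, here is a minimal sketch of how a pingd-based location rule could treat a node whose attribute is missing. The actual constraints are not shown in this thread, so the rule modeled here ("-INFINITY if pingd is not defined or is 0 or less") is an assumption based on the common connectivity pattern, and the function names are hypothetical.

```python
import math

NEG_INF = -math.inf  # stands in for Pacemaker's -INFINITY score


def location_score(node_attrs, attr="pingd"):
    """Model of an assumed rule: -INFINITY when attr is undefined or <= 0."""
    value = node_attrs.get(attr)
    if value is None or int(value) <= 0:
        return NEG_INF
    return 0


def may_run(node_attrs):
    """A resource may run on the node only if the rule does not ban it."""
    return location_score(node_attrs) != NEG_INF


if __name__ == "__main__":
    # Node right after a heartbeat restart: the attribute has not been set yet.
    print("attribute missing:", may_run({}))
    # Node once pingd has reached the ping node and written its value.
    print("attribute set:", may_run({"pingd": "100"}))
```

With such a rule, a node without the attribute is banned just like a node that cannot reach the ping node, which would match the XEN resources being stopped as soon as the cluster started managing them again.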

thanks
Sebastian


>
> Thanks,
>
> Dejan
>
> > regards,
> > Sebastian
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker