[Pacemaker] Mail notification for fencing action

mark - pacemaker list m+pacemaker at nerdish.us
Wed Jun 15 18:32:10 EDT 2011


On Wed, Jun 15, 2011 at 4:20 PM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:

> On Wed, Jun 15, 2011 at 03:26:56PM -0500, mark - pacemaker list wrote:
> > On Wed, Jun 15, 2011 at 12:24 PM, imnotpc <imnotpc at rock3d.net> wrote:
> >
> > >
> > > What I was thinking is that the DC is never fenced
> >
> >
> > Is this actually the case?
>
> In a way it is true. Only the DC can order fencing, and there is
> always exactly one DC in a partition. On split brain, each
> partition elects a DC, and if the DC has quorum it can try to
> fence nodes in the other partitions. That's why in two-node
> clusters there's always a shoot-out. But note that the old DC
> (before split brain), if it loses quorum, gets fenced by a new
> DC from another partition.
>
> > It would sure explain the one "gotcha" I've never been able to
> > work around in a three-node cluster with stonith/SBD.  If you
> > unplug the network cable from the DC (but it and the other nodes
> > all still see the SBD disk via their other NIC(s)), the DC of
> > course becomes completely isolated.  It will fence
>
> Fence? It won't fence anything unless it has quorum. Do you have
> no-quorum-policy=ignore?
>

I have no-quorum-policy=freeze.
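
For reference, that property is set cluster-wide through the crm
shell, along these lines (stock 1.0.x shell syntax):

    # freeze (rather than stop) resources when a partition loses quorum
    crm configure property no-quorum-policy=freeze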


With this status:

============
Last updated: Wed Jun 15 16:48:57 2011
Stack: Heartbeat
Current DC: cn1.testlab.local (814b426f-ab10-445c-9158-a1765d82395e) - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, unknown expected votes
5 Resources configured.
============

Online: [ cn2.testlab.local cn3.testlab.local cn1.testlab.local ]

 Resource Group: MySQL-history
     iscsi_mysql_history (ocf::heartbeat:iscsi): Started cn1.testlab.local
     volgrp_mysql_history (ocf::heartbeat:LVM): Started cn1.testlab.local
     fs_mysql_history (ocf::heartbeat:Filesystem): Started cn1.testlab.local
     ip_mysql_history (ocf::heartbeat:IPaddr2): Started cn1.testlab.local
     mysql_history (ocf::heartbeat:mysql): Started cn1.testlab.local
     mail_alert_history (ocf::heartbeat:MailTo): Started cn1.testlab.local
 Resource Group: MySQL-hsa
     iscsi_mysql_hsa (ocf::heartbeat:iscsi): Started cn2.testlab.local
     volgrp_mysql_hsa (ocf::heartbeat:LVM): Started cn2.testlab.local
     fs_mysql_hsa (ocf::heartbeat:Filesystem): Started cn2.testlab.local
     ip_mysql_hsa (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
     mysql_hsa (ocf::heartbeat:mysql): Started cn2.testlab.local
     mail_alert_hsa (ocf::heartbeat:MailTo): Started cn2.testlab.local
 Resource Group: MySQL-livedata
     iscsi_mysql_livedata (ocf::heartbeat:iscsi): Started cn3.testlab.local
     volgrp_mysql_livedata (ocf::heartbeat:LVM): Started cn3.testlab.local
     fs_mysql_livedata (ocf::heartbeat:Filesystem): Started cn3.testlab.local
     ip_mysql_livedata (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
     mysql_livedata (ocf::heartbeat:mysql): Started cn3.testlab.local
     mail_alert_livedata (ocf::heartbeat:MailTo): Started cn3.testlab.local
 stonith_sbd (stonith:external/sbd): Started cn2.testlab.local
 Resource Group: Cluster_Status
     cluster_status_ip (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
     cluster_status_page (ocf::heartbeat:apache): Started cn3.testlab.local

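The mail_alert_* resources above are plain ocf:heartbeat:MailTo
primitives, one per group, defined roughly along these lines (the
address here is illustrative):

    # mail a notification whenever this resource starts or stops on a
    # node; "email" is required, "subject" is optional
    crm configure primitive mail_alert_history ocf:heartbeat:MailTo \
        params email="admin@testlab.local" subject="MySQL-history"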

I isolated cn1 (the DC, but stonith_sbd was running on cn2).  In this case,
one of the two good nodes became DC and cn1 was fenced, so things worked as
I'd expect.  The outage for cn1's resources was quite short.
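
For what it's worth, putting stonith_sbd on a particular node for a
test run is just a manual migrate in the crm shell (node names as in
the status output):

    # pin the fencing resource to cn1 for the test...
    crm resource migrate stonith_sbd cn1.testlab.local
    # ...and drop the generated constraint again afterwards
    crm resource unmigrate stonith_sbd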

However, with *this* status, everything is the same as above except that
the stonith_sbd resource is also located on cn1, making cn1 both the DC
and the node running stonith_sbd:

============
Last updated: Wed Jun 15 16:58:49 2011
Stack: Heartbeat
Current DC: cn1.testlab.local (814b426f-ab10-445c-9158-a1765d82395e) - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, unknown expected votes
5 Resources configured.
============

Online: [ cn2.testlab.local cn3.testlab.local cn1.testlab.local ]

 Resource Group: MySQL-history
     iscsi_mysql_history (ocf::heartbeat:iscsi): Started cn1.testlab.local
     volgrp_mysql_history (ocf::heartbeat:LVM): Started cn1.testlab.local
     fs_mysql_history (ocf::heartbeat:Filesystem): Started cn1.testlab.local
     ip_mysql_history (ocf::heartbeat:IPaddr2): Started cn1.testlab.local
     mysql_history (ocf::heartbeat:mysql): Started cn1.testlab.local
     mail_alert_history (ocf::heartbeat:MailTo): Started cn1.testlab.local
 Resource Group: MySQL-hsa
     iscsi_mysql_hsa (ocf::heartbeat:iscsi): Started cn2.testlab.local
     volgrp_mysql_hsa (ocf::heartbeat:LVM): Started cn2.testlab.local
     fs_mysql_hsa (ocf::heartbeat:Filesystem): Started cn2.testlab.local
     ip_mysql_hsa (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
     mysql_hsa (ocf::heartbeat:mysql): Started cn2.testlab.local
     mail_alert_hsa (ocf::heartbeat:MailTo): Started cn2.testlab.local
 Resource Group: MySQL-livedata
     iscsi_mysql_livedata (ocf::heartbeat:iscsi): Started cn3.testlab.local
     volgrp_mysql_livedata (ocf::heartbeat:LVM): Started cn3.testlab.local
     fs_mysql_livedata (ocf::heartbeat:Filesystem): Started cn3.testlab.local
     ip_mysql_livedata (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
     mysql_livedata (ocf::heartbeat:mysql): Started cn3.testlab.local
     mail_alert_livedata (ocf::heartbeat:MailTo): Started cn3.testlab.local
 stonith_sbd (stonith:external/sbd): Started cn1.testlab.local
 Resource Group: Cluster_Status
     cluster_status_ip (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
     cluster_status_page (ocf::heartbeat:apache): Started cn2.testlab.local



... when I isolated cn1, it almost immediately fenced cn3.  About 30
seconds later cn2 promoted itself to DC, as it was the only surviving node
with network connectivity, but of course cn3 was still rebooting and not
yet participating.  At that point I had two nodes that each thought they
were the DC, neither with quorum.  That is what led me to change
no-quorum-policy to freeze: with the default policy (stop), all services
would shut down completely at this point, whereas with freeze at least the
services on the surviving good node stay up.

Once cn3 finishes booting, Pacemaker starts, cn2 and cn3 form a quorum,
cn1 finally gets fenced, and all resources are able to start on machines
with network connectivity.  The outage in this case is of course quite a
bit longer than in the first scenario.
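
During these tests, the easiest way to follow the DC election and the
fencing is a one-shot crm_mon on any node; its "Current DC:" line shows
which node would order any fencing:

    # print the cluster status once and exit
    crm_mon -1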

Regards,
Mark