[Pacemaker] Mail notification for fencing action

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Jun 16 09:09:50 EDT 2011


On Wed, Jun 15, 2011 at 05:32:10PM -0500, mark - pacemaker list wrote:
> On Wed, Jun 15, 2011 at 4:20 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>wrote:
> 
> > On Wed, Jun 15, 2011 at 03:26:56PM -0500, mark - pacemaker list wrote:
> > > On Wed, Jun 15, 2011 at 12:24 PM, imnotpc <imnotpc at rock3d.net> wrote:
> > >
> > > >
> > > > What I was thinking is that the DC is never fenced
> > >
> > >
> > > Is this actually the case?
> >
> > In a way it is true. Only DC can order fencing and there is
> > always exactly one DC in a partition. On split brain, each
> > partition elects a DC and if the DC has quorum it can try to
> > fence nodes in other partitions. That's why in two-node clusters
> > there's always a shoot-out. But note that the old DC (before
> > split brain), if it loses quorum, gets fenced by a new DC from
> > another partition.
> >
> > > It would sure explain the one "gotcha" I've
> > > never been able to work around in a three node cluster with stonith/SBD.
> > > If
> > > you unplug the network cable from the DC (but it and the other nodes all
> > > still see the SBD disk via their other NIC(s)), the DC of course becomes
> > > completely isolated.  It will fence
> >
> > Fence? It won't fence anything unless it has quorum. Do you have
> > no-quorum-policy=ignore?
> >
> 
> I have no-quorum-policy=freeze.

OK. It seems that freeze freezes only resources, while fencing
requests are still generated. That really shouldn't be
happening. Could you please file a bugzilla report?
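For reference, the policy in question is a cluster-wide property. A minimal
sketch of setting and checking it with the crm shell and crm_attribute
(assuming the property lives in the default crm_config section):

```shell
# Freeze, rather than stop, resources when this partition loses quorum.
# Running resources stay up, but no new resource actions are scheduled.
crm configure property no-quorum-policy=freeze

# Query the current value (prints "freeze" once set):
crm_attribute --type crm_config --name no-quorum-policy --query
```

Per the report above, fencing requests appear to be generated even under
freeze, which is the suspected bug.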

Cheers,

Dejan

> With this status:
> 
> ============
> Last updated: Wed Jun 15 16:48:57 2011
> Stack: Heartbeat
> Current DC: cn1.testlab.local (814b426f-ab10-445c-9158-a1765d82395e) -
> partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, unknown expected votes
> 5 Resources configured.
> ============
> 
> Online: [ cn2.testlab.local cn3.testlab.local cn1.testlab.local ]
> 
>  Resource Group: MySQL-history
>      iscsi_mysql_history (ocf::heartbeat:iscsi): Started cn1.testlab.local
>      volgrp_mysql_history (ocf::heartbeat:LVM): Started cn1.testlab.local
>      fs_mysql_history (ocf::heartbeat:Filesystem): Started cn1.testlab.local
>      ip_mysql_history (ocf::heartbeat:IPaddr2): Started cn1.testlab.local
>      mysql_history (ocf::heartbeat:mysql): Started cn1.testlab.local
>      mail_alert_history (ocf::heartbeat:MailTo): Started cn1.testlab.local
>  Resource Group: MySQL-hsa
>      iscsi_mysql_hsa (ocf::heartbeat:iscsi): Started cn2.testlab.local
>      volgrp_mysql_hsa (ocf::heartbeat:LVM): Started cn2.testlab.local
>      fs_mysql_hsa (ocf::heartbeat:Filesystem): Started cn2.testlab.local
>      ip_mysql_hsa (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
>      mysql_hsa (ocf::heartbeat:mysql): Started cn2.testlab.local
>      mail_alert_hsa (ocf::heartbeat:MailTo): Started cn2.testlab.local
>  Resource Group: MySQL-livedata
>      iscsi_mysql_livedata (ocf::heartbeat:iscsi): Started cn3.testlab.local
>      volgrp_mysql_livedata (ocf::heartbeat:LVM): Started cn3.testlab.local
>      fs_mysql_livedata (ocf::heartbeat:Filesystem): Started
> cn3.testlab.local
>      ip_mysql_livedata (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
>      mysql_livedata (ocf::heartbeat:mysql): Started cn3.testlab.local
>      mail_alert_livedata (ocf::heartbeat:MailTo): Started cn3.testlab.local
>  stonith_sbd (stonith:external/sbd): Started cn2.testlab.local
>  Resource Group: Cluster_Status
>      cluster_status_ip (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
>      cluster_status_page (ocf::heartbeat:apache): Started cn3.testlab.local
> 
> 
> I isolated cn1 (the DC, but stonith_sbd was running on cn2).  In this
> case, one of the two good nodes became DC and cn1 was fenced, so things
> worked as I'd expect.  The outage for cn1's resources was quite short.
> 
> However, with *this* status, where everything is the same as above except
> that the stonith_sbd resource is now located on cn1, so that it is both
> the DC and the node running stonith_sbd:
> 
> ============
> Last updated: Wed Jun 15 16:58:49 2011
> Stack: Heartbeat
> Current DC: cn1.testlab.local (814b426f-ab10-445c-9158-a1765d82395e) -
> partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, unknown expected votes
> 5 Resources configured.
> ============
> 
> Online: [ cn2.testlab.local cn3.testlab.local cn1.testlab.local ]
> 
>  Resource Group: MySQL-history
>      iscsi_mysql_history (ocf::heartbeat:iscsi): Started cn1.testlab.local
>      volgrp_mysql_history (ocf::heartbeat:LVM): Started cn1.testlab.local
>      fs_mysql_history (ocf::heartbeat:Filesystem): Started cn1.testlab.local
>      ip_mysql_history (ocf::heartbeat:IPaddr2): Started cn1.testlab.local
>      mysql_history (ocf::heartbeat:mysql): Started cn1.testlab.local
>      mail_alert_history (ocf::heartbeat:MailTo): Started cn1.testlab.local
>  Resource Group: MySQL-hsa
>      iscsi_mysql_hsa (ocf::heartbeat:iscsi): Started cn2.testlab.local
>      volgrp_mysql_hsa (ocf::heartbeat:LVM): Started cn2.testlab.local
>      fs_mysql_hsa (ocf::heartbeat:Filesystem): Started cn2.testlab.local
>      ip_mysql_hsa (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
>      mysql_hsa (ocf::heartbeat:mysql): Started cn2.testlab.local
>      mail_alert_hsa (ocf::heartbeat:MailTo): Started cn2.testlab.local
>  Resource Group: MySQL-livedata
>      iscsi_mysql_livedata (ocf::heartbeat:iscsi): Started cn3.testlab.local
>      volgrp_mysql_livedata (ocf::heartbeat:LVM): Started cn3.testlab.local
>      fs_mysql_livedata (ocf::heartbeat:Filesystem): Started
> cn3.testlab.local
>      ip_mysql_livedata (ocf::heartbeat:IPaddr2): Started cn3.testlab.local
>      mysql_livedata (ocf::heartbeat:mysql): Started cn3.testlab.local
>      mail_alert_livedata (ocf::heartbeat:MailTo): Started cn3.testlab.local
>  stonith_sbd (stonith:external/sbd): Started cn1.testlab.local
>  Resource Group: Cluster_Status
>      cluster_status_ip (ocf::heartbeat:IPaddr2): Started cn2.testlab.local
>      cluster_status_page (ocf::heartbeat:apache): Started cn2.testlab.local
> 
> 
> 
> ... when I isolated cn1, it almost immediately fenced cn3.  Approximately
> 30 seconds later cn2 promoted itself to DC, as it was the only surviving
> node with network connectivity, but cn3 was of course still coming back
> up after its reboot and wasn't participating yet.  So I had two nodes
> that thought they were DC, neither with quorum.  That's when I decided to
> change no-quorum-policy to freeze, because with the previous policy all
> services would shut down completely at this point.  With freeze, at least
> the services on the surviving good node stay up.
> 
> Once cn3 finished booting and Pacemaker started, cn2 and cn3 formed a
> quorum, cn1 finally got fenced, and all resources were able to start on
> machines with network connectivity.  The outage in this case was of
> course quite a bit longer than the previous one.
> 
> Regards,
> Mark
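When diagnosing a case like the one above, the message slots on the shared
SBD device show whether a fencing message was written for a node and what
it was. A hedged sketch; the device path below is a placeholder, not taken
from this thread:

```shell
# List the message slots on the shared SBD device; a slot holding
# "reset" or "off" indicates a pending or delivered fencing request.
# /dev/disk/by-id/my-sbd-disk is a placeholder device path.
sbd -d /dev/disk/by-id/my-sbd-disk list

# Clear a node's slot after investigating, so it can rejoin cleanly:
sbd -d /dev/disk/by-id/my-sbd-disk message cn3.testlab.local clear
```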

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
