[ClusterLabs] ip clustering strange behaviour

Klaus Wenninger kwenning at redhat.com
Mon Sep 5 13:29:42 UTC 2016


On 09/05/2016 03:02 PM, Gabriele Bulfon wrote:
> I read the docs; it looks like sbd fencing is more about iSCSI/FC-exposed
> storage resources.
> Here I have real shared disks (seen from Solaris with the format
> utility as normal SAS disks, but on both nodes).
> They are all JBOD disks that ZFS organizes in raidz/mirror pools, so
> I have 5 disks in one pool on one node, and the other 5 disks in
> another pool on the other node.
> How can sbd work in this situation? Has it already been used/tested in
> a Solaris environment with ZFS?

You wouldn't have to have discs at all with sbd. You can just use it to
have pacemaker monitored by a hardware watchdog.
But if you want to add discs, it shouldn't really matter how they are
accessed, as long as both nodes can concurrently read/write the block
devices. Configuration of caching in the controllers might be an issue
as well.
For example, I'm currently testing with a simple KVM setup using the
following virsh config for the shared block device:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='SHARED_IMAGE_FILE'/>
  <target dev='vdb' bus='virtio'/>
  <shareable/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x15' function='0x0'/>
</disk>
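
The image behind it is nothing special either - just a raw file created
beforehand with something like (name and size are only examples):

  qemu-img create -f raw SHARED_IMAGE_FILE 10M

and then attached to both guests with the <shareable/> flag as above.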

I don't know about test coverage for sbd on Solaris. It should be
independent of which filesystem you are using, though, since sbd anyway
uses a raw partition without any filesystem on it.
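
If you want to try the disc-based variant on one of your shared discs, the
setup would roughly look like this (the device path is just an example,
and I can't say how well the sbd tooling builds/behaves on illumos):

  # initialize the sbd header on a small dedicated partition (wipes it!)
  sbd -d /dev/rdsk/c0t0d0s0 create
  # check that both nodes see the same header and slots
  sbd -d /dev/rdsk/c0t0d0s0 dump
  sbd -d /dev/rdsk/c0t0d0s0 list

and then point the sbd daemon at that device on both nodes (e.g.
SBD_DEVICE= in /etc/sysconfig/sbd, or wherever your init system reads it).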

>
> BTW, is there any other possibility besides sbd?
>

Probably - see Ken's suggestions.
Excuse me for thinking a little one-dimensionally at the moment; I'm
working on some sbd issue ;-)
But if you don't have a proper fencing device, a watchdog is the last
resort for getting something that works reliably. And pacemaker's way of
doing watchdog fencing is sbd...
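
Roughly, a watchdog-only setup means running sbd with just a watchdog
device and telling pacemaker about it, something like this (paths and
timeout are only examples, and illumos packaging may differ):

  # in /etc/sysconfig/sbd - no SBD_DEVICE set means watchdog-only mode
  SBD_WATCHDOG_DEV=/dev/watchdog

  # let pacemaker assume a lost node has killed itself after this timeout
  crm configure property stonith-watchdog-timeout=10s
  crm configure property stonith-enabled=true

sbd has to be up before pacemaker starts for this to be of any use.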

> Last but not least, is there any way to have ssh-fencing considered good?
> At the moment, with ssh-fencing, if I shut down the second node, all of
> the second node's resources end up in UNCLEAN state and are not taken
> over by the first one.
> If I reboot the second node, it just comes back online, but its
> resources remain stopped.

Strange... What do the logs say about the fencing action? Was it reported
as successful or not?
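
Something like the following usually shows whether the fencing agent was
actually invoked and what it returned (log locations depend on how you
built/configured pacemaker on illumos):

  grep -iE 'stonith|fenc' /var/log/pacemaker.log
  # or, if your pacemaker is recent enough, ask stonithd directly:
  stonith_admin --history xstorage2
  # and to test the agent on its own:
  stonith_admin --reboot xstorage2

If the fence action never reports success, the resources of the lost node
will stay where they are - which would explain what you are seeing.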

>
> I remember that in my tests with heartbeat it reacted differently (a halt
> would move everything to node1, and everything came back on restart).
>
> Gabriele
>
> ----------------------------------------------------------------------------------------
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
> ----------------------------------------------------------------------------------
>
> From: Klaus Wenninger <kwenning at redhat.com>
> To: users at clusterlabs.org
> Date: 5 September 2016 12.21.25 CEST
> Subject: Re: [ClusterLabs] ip clustering strange behaviour
>
>     On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
>     > The dual machine is equipped with a syncro controller LSI 3008
>     MPT SAS3.
>     > Both nodes can see the same jbod disks (10 at the moment, up to 24).
>     > Systems are XStreamOS / illumos, with ZFS.
>     > Each system has one ZFS pool of 5 disks, with different pool names
>     > (data1, data2).
>     > When in active / active, the two machines run different zones and
>     > services on their pools, on their networks.
>     > I have custom resource agents (tested on pacemaker/heartbeat, now
>     > porting to pacemaker/corosync) for ZFS pools and zones migration.
>     > When I was testing pacemaker/heartbeat, when ssh-fencing discovered
>     > the other node to be down (clean or abrupt halt), it was
>     > automatically using IPaddr and our ZFS agents to take control of
>     > everything, mounting the other pool and running any configured
>     zone in it.
>     > I would like to do the same with pacemaker/corosync.
>     > The two nodes of the dual machine have an internal LAN connecting
>     > them, a 100Mb ethernet: maybe this is reliable enough to trust
>     > ssh-fencing?
>     > Or is there anything I can do to ensure at the controller level that
>     > the pool is not in use on the other node?
>
>     It is not just the reliability of the networking connection that can
>     make ssh-fencing suboptimal. Something in the IP-stack configuration
>     (dynamic due to moving resources) might have gone wrong, or resources
>     might be hanging in a way that prevents the node from being brought
>     down gracefully. Hence my suggestion to add a watchdog (if available)
>     via sbd.
>
>     >
>     > Gabriele
>     >
>     >
>     ----------------------------------------------------------------------------------
>     >
>     > From: Ken Gaillot <kgaillot at redhat.com>
>     > To: gbulfon at sonicle.com Cluster Labs - All topics related to
>     > open-source clustering welcomed <users at clusterlabs.org>
>     > Date: 1 September 2016 15.49.04 CEST
>     > Subject: Re: [ClusterLabs] ip clustering strange behaviour
>     >
>     > On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
>     > > Thanks, got it.
>     > > So, is it better to use "two_node: 1" or, as suggested elsewhere,
>     > > "no-quorum-policy=stop"?
>     >
>     > I'd prefer "two_node: 1" and letting pacemaker's options
>     default. But
>     > see the votequorum(5) man page for what two_node implies -- most
>     > importantly, both nodes have to be available when the cluster starts
>     > before it will start any resources. Node failure is handled fine
>     once
>     > the cluster has started, but at start time, both nodes must be up.
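>     >
>     > For reference, with corosync 2 that's just the quorum section of
>     > corosync.conf, something like:
>     >
>     >     quorum {
>     >         provider: corosync_votequorum
>     >         two_node: 1
>     >     }
>     >
>     > (two_node implicitly enables wait_for_all, which is where the
>     > "both nodes must be up at start time" behavior comes from).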
>     >
>     > > About fencing, the machine I'm going to implement the 2-node
>     > > cluster on is a dual machine with a shared-disk backend.
>     > > Each node has two 10Gb ethernets dedicated to the public IP and
>     > > the admin console.
>     > > Then there is a third 100Mb ethernet connecting the two machines
>     > > internally.
>     > > I was going to use this last one for fencing via ssh, but it looks
>     > > like this way I'm not going to have IP/pool/zone movements if one
>     > > of the nodes freezes or halts without shutting down pacemaker
>     > > cleanly.
>     > > What should I use instead?
>     >
>     > I'm guessing as a dual machine, they share a power supply, so that
>     > rules
>     > out a power switch. If the box has IPMI that can individually power
>     > cycle each host, you can use fence_ipmilan. If the disks are
>     > shared via
>     > iSCSI, you could use fence_scsi. If the box has a hardware watchdog
>     > device that can individually target the hosts, you could use sbd. If
>     > none of those is an option, probably the best you could do is
>     run the
>     > cluster nodes as VMs on each host, and use fence_xvm.
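>     >
>     > With fence_ipmilan, for example, each node would get something
>     > roughly like this (addresses and credentials invented):
>     >
>     >     primitive xstorage1-ipmi stonith:fence_ipmilan \
>     >         params pcmk_host_list="xstorage1" ipaddr="10.0.0.101" \
>     >         login="admin" passwd="secret" lanplus="1" \
>     >         op monitor interval="60s"
>     >     location xstorage1-ipmi-pref xstorage1-ipmi -inf: xstorage1
>     >
>     > i.e. each stonith resource targets one host via that host's IPMI
>     > address and is kept off the node it is meant to kill.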
>     >
>     > > Thanks for your help,
>     > > Gabriele
>     > >
>     > >
>     >
>     >
>     ----------------------------------------------------------------------------------
>     > >
>     > > From: Ken Gaillot <kgaillot at redhat.com>
>     > > To: users at clusterlabs.org
>     > > Date: 31 August 2016 17.25.05 CEST
>     > > Subject: Re: [ClusterLabs] ip clustering strange behaviour
>     > >
>     > > On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
>     > > > Sorry for reiterating, but my main question was:
>     > > >
>     > > > why does node 1 remove its own IP if I shut down node 2
>     > > > abruptly?
>     > > > I understand that it does not take over node 2's IP (because
>     > > > ssh-fencing has no clue about what happened on the 2nd node),
>     > > > but I wouldn't expect it to shut down its own IP... this would
>     > > > kill any service on both nodes... what am I getting wrong?
>     > >
>     > > Assuming you're using corosync 2, be sure you have "two_node:
>     1" in
>     > > corosync.conf. That will tell corosync to pretend there is always
>     > > quorum, so pacemaker doesn't need any special quorum settings.
>     > See the
>     > > votequorum(5) man page for details. Of course, you need fencing
>     > in this
>     > > setup, to handle when communication between the nodes is broken
>     > but both
>     > > are still up.
>     > >
>     > > >
>     > >
>     >
>     >
>     ------------------------------------------------------------------------
>     > > >
>     > > >
>     > > > From: Gabriele Bulfon <gbulfon at sonicle.com>
>     > > > To: kwenning at redhat.com Cluster Labs - All topics related to
>     > > > open-source clustering welcomed <users at clusterlabs.org>
>     > > > Date: 29 August 2016 17.37.36 CEST
>     > > > Subject: Re: [ClusterLabs] ip clustering strange behaviour
>     > > >
>     > > >
>     > > > Ok, got it, I hadn't gracefully shut pacemaker on node2.
>     > > > Now I restarted, everything was up, stopped pacemaker service on
>     > > > host2 and I got host1 with both IPs configured. ;)
>     > > >
>     > > > But, though I understand that if I halt host2 without a graceful
>     > > > shutdown of pacemaker, it will not move IP2 to host1, I don't
>     > > > expect host1 to lose its own IP! Why?
>     > > >
>     > > > Gabriele
>     > > >
>     > > >
>     > >
>     >
>     > >
>     >
>     ----------------------------------------------------------------------------------
>     > > >
>     > > > From: Klaus Wenninger <kwenning at redhat.com>
>     > > > To: users at clusterlabs.org
>     > > > Date: 29 August 2016 17.26.49 CEST
>     > > > Subject: Re: [ClusterLabs] ip clustering strange behaviour
>     > > >
>     > > > On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
>     > > > > Hi,
>     > > > >
>     > > > > now that I have IPaddr working, I see a strange behaviour on my
>     > > > > test setup of 2 nodes. Here is my configuration:
>     > > > >
>     > > > > ===STONITH/FENCING===
>     > > > >
>     > > > > primitive xstorage1-stonith stonith:external/ssh-sonicle op monitor interval="25" timeout="25" start-delay="25" params hostlist="xstorage1"
>     > > > >
>     > > > > primitive xstorage2-stonith stonith:external/ssh-sonicle op monitor interval="25" timeout="25" start-delay="25" params hostlist="xstorage2"
>     > > > >
>     > > > > location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
>     > > > > location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
>     > > > >
>     > > > > property stonith-action=poweroff
>     > > > >
>     > > > >
>     > > > >
>     > > > > ===IP RESOURCES===
>     > > > >
>     > > > >
>     > > > > primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4" cidr_netmask="255.255.255.0" nic="e1000g1"
>     > > > > primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5" cidr_netmask="255.255.255.0" nic="e1000g1"
>     > > > >
>     > > > > location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
>     > > > > location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
>     > > > >
>     > > > > ===================
>     > > > >
>     > > > > So I plumbed e1000g1 with no IP configured on both machines and
>     > > > > started corosync/pacemaker, and after some time both nodes were
>     > > > > online and started, with the IPs configured as virtual
>     > > > > interfaces (e1000g1:1 and e1000g1:2), one on host1 and one on
>     > > > > host2.
>     > > > >
>     > > > > Then I halted host2, and I expected to end up with host1 having
>     > > > > both IPs configured.
>     > > > > Instead, host1 had its own IP stopped and removed (only e1000g1,
>     > > > > unconfigured), and host2 was shown as stopped with its IP still
>     > > > > started (!?).
>     > > > > Not exactly what I expected...
>     > > > > What's wrong?
>     > > >
>     > > > How did you stop host2? Graceful shutdown of pacemaker? If not...
>     > > > Anyway, ssh-fencing only works if the machine is still running,
>     > > > so the node will stay unclean, and pacemaker has to assume that
>     > > > the IP might still be running on it. So this is actually the
>     > > > expected behavior.
>     > > > You might add a watchdog via sbd if you don't have other fencing
>     > > > hardware at hand ...
>     > > > >
>     > > > > Here is the crm status after I stopped host 2:
>     > > > >
>     > > > > 2 nodes and 4 resources configured
>     > > > >
>     > > > > Node xstorage2: UNCLEAN (offline)
>     > > > > Online: [ xstorage1 ]
>     > > > >
>     > > > > Full list of resources:
>     > > > >
>     > > > > xstorage1-stonith (stonith:external/ssh-sonicle): Started xstorage2 (UNCLEAN)
>     > > > > xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
>     > > > > xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
>     > > > > xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Started xstorage2 (UNCLEAN)
>     > > > >
>     > > > >
>     > > > > Gabriele
>     > > > >
>     > > > >
>     > > >
>     > >
>     >
>
>
>     _______________________________________________
>     Users mailing list: Users at clusterlabs.org
>     http://clusterlabs.org/mailman/listinfo/users
>
>     Project Home: http://www.clusterlabs.org
>     Getting started:
>     http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>     Bugs: http://bugs.clusterlabs.org
>
>


