[ClusterLabs] ip clustering strange behaviour
Klaus Wenninger
kwenning at redhat.com
Mon Sep 5 13:29:42 UTC 2016
On 09/05/2016 03:02 PM, Gabriele Bulfon wrote:
> I read the docs; it looks like sbd fencing is more about iSCSI/FC-exposed
> storage resources.
> Here I have real shared disks (seen from Solaris with the format
> utility as normal SAS disks, but on both nodes).
> They are all JBOD disks, which ZFS organizes into raidz/mirror pools, so
> I have 5 disks in one pool on one node, and the other 5 disks in
> another pool on the other node.
> How can sbd work in this situation? Has it already been used/tested in
> a Solaris environment with ZFS?
You don't need disks at all with sbd; you can use it just to have
pacemaker monitored by a hardware watchdog.
But if you want to add disks, it shouldn't really matter how they are
accessed, as long as both nodes can concurrently read/write the block
devices. Configuration of caching in the controllers might be an issue
as well.
For example, I'm currently testing with a simple KVM setup, using the
following virsh config for the shared block device:
  <disk type='file' device='disk'>
    <driver name='qemu' type='raw' cache='none'/>
    <source file='SHARED_IMAGE_FILE'/>
    <target dev='vdb' bus='virtio'/>
    <shareable/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x15' function='0x0'/>
  </disk>
I don't know about test coverage for sbd on Solaris. It should be
independent of which file system you are using anyway, since sbd works
on a raw partition without any file system on it.
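As a rough sketch (assuming the shared device shows up as /dev/vdb in
the guests; the device path will of course differ on real hardware),
initializing and inspecting the sbd device looks like this:

  # write the sbd header and message slots to the shared device
  # (this overwrites whatever is at the start of that device)
  sbd -d /dev/vdb create
  # show the on-disk header, including the configured timeouts
  sbd -d /dev/vdb dump
  # list the per-node message slots and any pending messages
  sbd -d /dev/vdb list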
>
> BTW, is there any possibility other than sbd?
>
Probably - see Ken's suggestions.
Excuse me for thinking a little one-dimensionally at the moment, I'm
working on an sbd issue ;-)
But when you don't have a proper fencing device, a watchdog is the last
resort for getting something that works reliably. And pacemaker's way
of using a watchdog is sbd...
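As a minimal sketch of that watchdog-only mode, assuming a distribution
that configures sbd via /etc/sysconfig/sbd (the path and the watchdog
device will differ on other platforms):

  # /etc/sysconfig/sbd - no SBD_DEVICE set, so sbd only feeds the watchdog
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5

  # tell pacemaker that an unresponsive node will have self-fenced
  # via the watchdog within the given time
  crm configure property stonith-watchdog-timeout=10s
  crm configure property stonith-enabled=true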
> Last but not least, is there any way to make ssh-fencing be considered
> good?
> At the moment, with ssh-fencing, if I shut down the second node, all
> of the second node's resources end up in the UNCLEAN state and are not
> taken over by the first one.
> If I reboot the second node, I only get the node online again, but its
> resources remain stopped.
Strange... What do the logs say about whether the fencing action was
successful?
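One way to check, assuming a reasonably recent pacemaker and that your
cluster logs end up in /var/log/messages (the path will differ on your
platform), would be:

  # show the fencing actions pacemaker has attempted, for all nodes
  stonith_admin --history '*' --verbose
  # look for the outcome of the stonith operation in the logs
  grep -i -e stonith -e fence /var/log/messages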
>
> I remember that my tests with heartbeat reacted differently (a halt
> would move everything to node1, and everything came back on restart)
>
> Gabriele
>
> ----------------------------------------------------------------------------------------
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
> ----------------------------------------------------------------------------------
>
> From: Klaus Wenninger <kwenning at redhat.com>
> To: users at clusterlabs.org
> Date: 5 September 2016 12.21.25 CEST
> Subject: Re: [ClusterLabs] ip clustering strange behaviour
>
> On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
> > The dual machine is equipped with a syncro controller, an LSI 3008
> > MPT SAS3.
> > Both nodes can see the same JBOD disks (10 at the moment, up to 24).
> > Systems are XStreamOS / illumos, with ZFS.
> > Each system has one ZFS pool of 5 disks, with different pool names
> > (data1, data2).
> > When in active/active, the two machines run different zones and
> > services on their own pools and networks.
> > I have custom resource agents (tested on pacemaker/heartbeat, now
> > being ported to pacemaker/corosync) for ZFS pool and zone migration.
> > When I was testing pacemaker/heartbeat, once ssh-fencing discovered
> > the other node to be down (clean shutdown or abrupt halt), it
> > automatically used IPaddr and our ZFS agents to take control of
> > everything, mounting the other pool and running any zone configured
> > in it.
> > I would like to do the same with pacemaker/corosync.
> > The two nodes of the dual machine have an internal LAN connecting
> > them, a 100Mb Ethernet: maybe this is reliable enough to trust
> > ssh-fencing?
> > Or is there anything I can do to ensure at the controller level that
> > the pool is not in use on the other node?
>
> It is not just the reliability of the network connection that makes
> ssh-fencing suboptimal. Something in the IP-stack configuration
> (dynamic, due to moving resources) might have gone wrong, or resources
> might be hanging in a way that prevents the node from being brought
> down gracefully. Hence my suggestion to add a watchdog (if available)
> via sbd.
>
> >
> > Gabriele
> >
> >
> ----------------------------------------------------------------------------------
> >
> > From: Ken Gaillot <kgaillot at redhat.com>
> > To: gbulfon at sonicle.com; Cluster Labs - All topics related to
> > open-source clustering welcomed <users at clusterlabs.org>
> > Date: 1 September 2016 15.49.04 CEST
> > Subject: Re: [ClusterLabs] ip clustering strange behaviour
> >
> > On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
> > > Thanks, got it.
> > > So, is it better to use "two_node: 1" or, as suggested elsewhere,
> > > "no-quorum-policy=stop"?
> >
> > I'd prefer "two_node: 1" and letting pacemaker's options default. But
> > see the votequorum(5) man page for what two_node implies -- most
> > importantly, both nodes have to be available when the cluster starts
> > before it will start any resources. Node failure is handled fine once
> > the cluster has started, but at start time, both nodes must be up.
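> > For reference, a minimal sketch of the corresponding quorum section
> > in corosync.conf (corosync 2.x assumed; adjust to your existing file):
> >
> >   quorum {
> >       # use the votequorum service
> >       provider: corosync_votequorum
> >       # two-node special case: a single surviving node keeps quorum,
> >       # relying on fencing to resolve split-brain
> >       two_node: 1
> >   }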
> >
> > > About fencing: the machine I'm going to implement the 2-node
> > > cluster on is a dual machine with a shared-disk backend.
> > > Each node has two 10Gb Ethernet ports dedicated to the public IP
> > > and the admin console.
> > > Then there is a third 100Mb Ethernet connecting the two machines
> > > internally.
> > > I was going to use this last one for fencing via ssh, but it looks
> > > like that way I won't get IP/pool/zone movements if one of the
> > > nodes freezes or halts without shutting down pacemaker cleanly.
> > > What should I use instead?
> >
> > I'm guessing that, as a dual machine, the two hosts share a power
> > supply, so that rules out a power switch. If the box has IPMI that
> > can individually power-cycle each host, you can use fence_ipmilan.
> > If the disks are shared via iSCSI, you could use fence_scsi. If the
> > box has a hardware watchdog device that can individually target the
> > hosts, you could use sbd. If none of those is an option, probably
> > the best you could do is run the cluster nodes as VMs on each host,
> > and use fence_xvm.
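> > As a rough sketch in crm syntax (hypothetical IPMI addresses and
> > credentials; parameter names may vary with your fence-agents
> > version), an IPMI fencing pair could look like:
> >
> >   primitive xstorage1-ipmi stonith:fence_ipmilan \
> >     params pcmk_host_list="xstorage1" ipaddr="10.0.0.11" \
> >       login="admin" passwd="secret" lanplus="1" \
> >     op monitor interval="60s"
> >   location xstorage1-ipmi-pref xstorage1-ipmi -inf: xstorage1
> >   # ... plus an equivalent xstorage2-ipmi primitive pinned away from xstorage2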
> >
> > > Thanks for your help,
> > > Gabriele
> > >
> > >
> >
> > >
> > >
> > >
> > >
> >
> ----------------------------------------------------------------------------------
> > >
> > > From: Ken Gaillot <kgaillot at redhat.com>
> > > To: users at clusterlabs.org
> > > Date: 31 August 2016 17.25.05 CEST
> > > Subject: Re: [ClusterLabs] ip clustering strange behaviour
> > >
> > > On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
> > > > Sorry for reiterating, but my main question was:
> > > >
> > > > why does node 1 remove its own IP if I shut down node 2 abruptly?
> > > > I understand that it does not take over node 2's IP (because the
> > > > ssh-fencing has no clue about what happened on the 2nd node), but
> > > > I wouldn't expect it to shut down its own IP... this would kill
> > > > any service on both nodes... where am I wrong?
> > >
> > > Assuming you're using corosync 2, be sure you have "two_node: 1" in
> > > corosync.conf. That will tell corosync to pretend there is always
> > > quorum, so pacemaker doesn't need any special quorum settings. See
> > > the votequorum(5) man page for details. Of course, you need fencing
> > > in this setup, to handle when communication between the nodes is
> > > broken but both are still up.
> > >
> > > >
> > >
> >
> > > >
> > > >
> > >
> >
> ------------------------------------------------------------------------
> > > >
> > > >
> > > > *From:* Gabriele Bulfon <gbulfon at sonicle.com>
> > > > *To:* kwenning at redhat.com; Cluster Labs - All topics related to
> > > > open-source clustering welcomed <users at clusterlabs.org>
> > > > *Date:* 29 August 2016 17.37.36 CEST
> > > > *Subject:* Re: [ClusterLabs] ip clustering strange behaviour
> > > >
> > > >
> > > > Ok, got it, I hadn't gracefully shut down pacemaker on node2.
> > > > Now I restarted, everything was up; I stopped the pacemaker
> > > > service on host2 and got host1 with both IPs configured. ;)
> > > >
> > > > But, though I understand that if I halt host2 without a graceful
> > > > shutdown of pacemaker it will not move IP2 to host1, I don't
> > > > expect host1 to lose its own IP! Why?
> > > >
> > > > Gabriele
> > > >
> > > >
> > >
> >
> > > >
> > > >
> > > >
> > > >
> > >
> >
> ----------------------------------------------------------------------------------
> > > >
> > > > From: Klaus Wenninger <kwenning at redhat.com>
> > > > To: users at clusterlabs.org
> > > > Date: 29 August 2016 17.26.49 CEST
> > > > Subject: Re: [ClusterLabs] ip clustering strange behaviour
> > > >
> > > > On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
> > > > > Hi,
> > > > >
> > > > > now that I have IPaddr working, I see a strange behaviour on
> > > > > my 2-node test setup; here is my configuration:
> > > > >
> > > > > ===STONITH/FENCING===
> > > > >
> > > > > primitive xstorage1-stonith stonith:external/ssh-sonicle \
> > > > >   op monitor interval="25" timeout="25" start-delay="25" \
> > > > >   params hostlist="xstorage1"
> > > > >
> > > > > primitive xstorage2-stonith stonith:external/ssh-sonicle \
> > > > >   op monitor interval="25" timeout="25" start-delay="25" \
> > > > >   params hostlist="xstorage2"
> > > > >
> > > > > location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
> > > > > location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
> > > > >
> > > > > property stonith-action=poweroff
> > > > >
> > > > > ===IP RESOURCES===
> > > > >
> > > > > primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr \
> > > > >   params ip="1.2.3.4" cidr_netmask="255.255.255.0" nic="e1000g1"
> > > > > primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr \
> > > > >   params ip="1.2.3.5" cidr_netmask="255.255.255.0" nic="e1000g1"
> > > > >
> > > > > location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
> > > > > location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
> > > > >
> > > > > ===================
> > > > >
> > > > > So I plumbed e1000g1 with no IP configured on both machines and
> > > > > started corosync/pacemaker, and after some time I had both nodes
> > > > > online and started, with the IPs configured as virtual
> > > > > interfaces (e1000g1:1 and e1000g1:2), one on host1 and one on
> > > > > host2.
> > > > >
> > > > > Then I halted host2, and I expected to end up with host1 running
> > > > > both IPs.
> > > > > Instead, I got host1 started with its own IP stopped and removed
> > > > > (only the unconfigured e1000g1 left), and host2 stopped but
> > > > > reported as having its IP started (!?).
> > > > > Not exactly what I expected...
> > > > > What's wrong?
> > > >
> > > > How did you stop host2? A graceful shutdown of pacemaker? If
> > > > not ...
> > > > Anyway, ssh-fencing only works if the machine is still
> > > > running ...
> > > > So the node will stay unclean, and pacemaker therefore assumes
> > > > the IP might still be running on it. So this is actually the
> > > > expected behavior.
> > > > You might add a watchdog via sbd if you don't have other fencing
> > > > hardware at hand ...
> > > > >
> > > > > Here is the crm status after I stopped host2:
> > > > >
> > > > > 2 nodes and 4 resources configured
> > > > >
> > > > > Node xstorage2: UNCLEAN (offline)
> > > > > Online: [ xstorage1 ]
> > > > >
> > > > > Full list of resources:
> > > > >
> > > > > xstorage1-stonith (stonith:external/ssh-sonicle): Started xstorage2 (UNCLEAN)
> > > > > xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
> > > > > xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
> > > > > xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Started xstorage2 (UNCLEAN)
> > > > >
> > > > >
> > > > > Gabriele
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> >
> >
> >
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>