[ClusterLabs] ip clustering strange behaviour

Gabriele Bulfon gbulfon at sonicle.com
Mon Sep 5 09:02:00 EDT 2016


I read the docs; it looks like sbd fencing is mostly about iSCSI/FC-exposed storage resources.
Here I have real shared disks (seen from Solaris with the format utility as normal SAS disks, but on both nodes).
They are all JBOD disks that ZFS organizes into raidz/mirror pools, so I have 5 disks in one pool on one node, and the other 5 disks in another pool on the other node.
How can sbd work in this situation? Has it already been used/tested in a Solaris environment with ZFS?
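From what I've read, sbd only needs a small shared block device, not iSCSI/FC specifically, so perhaps a dedicated slice on one of these shared SAS disks could serve. A rough sketch of what I imagine (the device path is hypothetical, and this assumes sbd can be built on illumos at all, which I haven't tried):

# initialize the sbd message slots on the shared slice (run once, on one node)
sbd -d /dev/rdsk/c0t5000C500A1B2C3D4d0s0 create
# verify that both nodes can read the slots
sbd -d /dev/rdsk/c0t5000C500A1B2C3D4d0s0 list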
BTW, is there any possibility other than sbd?
Last but not least, is there any way to make ssh-fencing acceptable?
At the moment, with ssh-fencing, if I shut down the second node, all of its resources end up in the UNCLEAN state and are not taken over by the first one.
If I reboot the second node, it just comes back online, but its resources remain stopped.
I remember my tests with heartbeat behaved differently (a halt would move everything to node1, and everything would move back on restart).
Gabriele
----------------------------------------------------------------------------------------
Sonicle S.r.l. : http://www.sonicle.com
Music : http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
----------------------------------------------------------------------------------------
From: Klaus Wenninger
To: users at clusterlabs.org
Date: 5 September 2016 12:21:25 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
The dual machine is equipped with an LSI Syncro 3008 MPT SAS3 controller.
Both nodes can see the same JBOD disks (10 at the moment, up to 24).
The systems are XStreamOS / illumos, with ZFS.
Each system has one ZFS pool of 5 disks, with different pool names (data1, data2).
When running active/active, the two machines run different zones and services on their own pools and networks.
I have custom resource agents (tested on pacemaker/heartbeat, now being ported to pacemaker/corosync) for ZFS pool and zone migration.
When I was testing pacemaker/heartbeat, once ssh-fencing discovered the other node to be down (cleanly or after an abrupt halt), it automatically used IPaddr and our ZFS agents to take control of everything, mounting the other pool and running any zones configured in it.
I would like to do the same with pacemaker/corosync.
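For example, something along these lines, where the ocf:sonicle:* agent names and parameters are only placeholders for our custom agents (a group starts its members in order, so the pool is imported before the zone boots and the IP follows them):

primitive data2_pool ocf:sonicle:zfspool params pool="data2"
primitive data2_zone ocf:sonicle:zone params zone="zone2"
group data2_services data2_pool data2_zone xstorage2_wan2_IP
location data2_services_pref data2_services 100: xstorage2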
The two nodes of the dual machine are connected by an internal LAN, a 100Mb ethernet: is this reliable enough to trust ssh-fencing?
Or is there anything I can do at the controller level to ensure that the pool is not in use on the other node?
It is not just the reliability of the network connection that makes ssh-fencing suboptimal. Something in the IP stack configuration (which is dynamic because resources move) might have gone wrong, and resources might be hanging in a way that prevents the node from being brought down gracefully. Hence my suggestion to add a watchdog (where available) via sbd.
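Roughly, watchdog-only sbd looks like the sketch below on a Linux node; whether illumos exposes a comparable watchdog device is an open question, so take this purely as an illustration:

# /etc/sysconfig/sbd -- no SBD_DEVICE, watchdog only
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

# then tell pacemaker to rely on the watchdog for self-fencing
crm configure property stonith-watchdog-timeout=10s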
Gabriele
From: Ken Gaillot
To: gbulfon at sonicle.com; Cluster Labs - All topics related to open-source clustering welcomed
Date: 1 September 2016 15:49:04 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested elsewhere, "no-quorum-policy=stop"?
I'd prefer "two_node: 1" and letting pacemaker's options default. But see the votequorum(5) man page for what two_node implies -- most importantly, both nodes have to be available when the cluster starts before it will start any resources. Node failure is handled fine once the cluster has started, but at start time, both nodes must be up.
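The corresponding quorum section in corosync.conf would look like:

quorum {
    provider: corosync_votequorum
    two_node: 1
}

(two_node: 1 implicitly enables wait_for_all, which is why both nodes must be up at first start.)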
About fencing: the machine on which I'm going to implement the 2-node cluster is a dual machine with a shared-disk backend.
Each node has two 10Gb ethernets dedicated to the public IP and the admin console.
Then there is a third 100Mb ethernet connecting the two machines internally.
I was going to use this last one for fencing via ssh, but it looks like this way I won't get IP/pool/zone movements if one of the nodes freezes or halts without shutting down pacemaker cleanly.
What should I use instead?
I'm guessing as a dual machine, they share a power supply, so that rules out a power switch. If the box has IPMI that can individually power cycle each host, you can use fence_ipmilan. If the disks are shared via iSCSI, you could use fence_scsi. If the box has a hardware watchdog device that can individually target the hosts, you could use sbd. If none of those is an option, probably the best you could do is run the cluster nodes as VMs on each host, and use fence_xvm.
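For example, an IPMI-based fencing resource in crm syntax might look like this sketch (the BMC address and credentials are placeholders):

primitive xstorage1-ipmi stonith:fence_ipmilan params pcmk_host_list="xstorage1" ipaddr="10.0.0.11" login="admin" passwd="secret" lanplus="true"
location xstorage1-ipmi-pref xstorage1-ipmi -inf: xstorage1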
Thanks for your help,
Gabriele
From: Ken Gaillot
To: users at clusterlabs.org
Date: 31 August 2016 17:25:05 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
Sorry for reiterating, but my main question was:
why does node 1 remove its own IP if I shut down node 2 abruptly?
I understand that it does not take over the node 2 IP (because ssh-fencing has no clue about what happened on the 2nd node), but I wouldn't expect it to shut down its own IP... this would kill any service on both nodes... what am I getting wrong?
Assuming you're using corosync 2, be sure you have "two_node: 1" in corosync.conf. That will tell corosync to pretend there is always quorum, so pacemaker doesn't need any special quorum settings. See the votequorum(5) man page for details. Of course, you need fencing in this setup, to handle when communication between the nodes is broken but both are still up.
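Once two_node is enabled, the effective settings can be checked with corosync-quorumtool; the Flags line of its output should typically include 2Node and WaitForAll:

corosync-quorumtool -s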
From: Gabriele Bulfon
To: kwenning at redhat.com; Cluster Labs - All topics related to open-source clustering welcomed
Date: 29 August 2016 17:37:36 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
Ok, got it, I hadn't gracefully shut down pacemaker on node2.
Now I restarted, everything was up, I stopped the pacemaker service on host2, and I got host1 with both IPs configured. ;)
But, though I understand that if I halt host2 with no graceful shutdown of pacemaker it will not move IP2 to host1, I don't expect host1 to lose its own IP! Why?
Gabriele
----------------------------------------------------------------------------------------
*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
----------------------------------------------------------------------------------
From: Klaus Wenninger
To: users at clusterlabs.org
Date: 29 August 2016 17:26:49 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/29/2016 05:18 PM, Gabriele Bulfon wrote:
Hi,
now that I have IPaddr working, I see a strange behaviour on my 2-node test setup; here is my configuration:
===STONITH/FENCING===
primitive xstorage1-stonith stonith:external/ssh-sonicle op monitor interval="25" timeout="25" start-delay="25" params hostlist="xstorage1"
primitive xstorage2-stonith stonith:external/ssh-sonicle op monitor interval="25" timeout="25" start-delay="25" params hostlist="xstorage2"
location xstorage1-stonith-pref xstorage1-stonith -inf: xstorage1
location xstorage2-stonith-pref xstorage2-stonith -inf: xstorage2
property stonith-action=poweroff
===IP RESOURCES===
primitive xstorage1_wan1_IP ocf:heartbeat:IPaddr params ip="1.2.3.4" cidr_netmask="255.255.255.0" nic="e1000g1"
primitive xstorage2_wan2_IP ocf:heartbeat:IPaddr params ip="1.2.3.5" cidr_netmask="255.255.255.0" nic="e1000g1"
location xstorage1_wan1_IP_pref xstorage1_wan1_IP 100: xstorage1
location xstorage2_wan2_IP_pref xstorage2_wan2_IP 100: xstorage2
===================
So I plumbed e1000g1 with no IP configured on both machines and started corosync/pacemaker, and after some time I had both nodes online and started, with the IPs configured as virtual interfaces (e1000g1:1 and e1000g1:2), one on host1 and one on host2.
Then I halted host2, and I expected host1 to end up with both IPs configured on it.
Instead, host1 came up with its own IP stopped and removed (only the unconfigured e1000g1 left), while host2 was reported as stopped with its IP still started (!?).
Not exactly what I expected...
What's wrong?
How did you stop host2? Graceful shutdown of pacemaker? If not ...
Anyway, ssh-fencing only works if the machine is still running ...
So the node will stay unclean, and pacemaker thinks the IP might still be running on it. This is actually the expected behavior.
You might add a watchdog via sbd if you don't have other fencing hardware at hand ...
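As an aside, if you have verified by hand that the peer really is powered off, you can acknowledge the fencing manually so pacemaker recovers its resources (use with care; this is exactly what real fencing would otherwise guarantee):

stonith_admin --confirm=xstorage2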
Here is the crm status after I stopped host 2:
2 nodes and 4 resources configured
Node xstorage2: UNCLEAN (offline)
Online: [ xstorage1 ]
Full list of resources:
xstorage1-stonith (stonith:external/ssh-sonicle): Started xstorage2 (UNCLEAN)
xstorage2-stonith (stonith:external/ssh-sonicle): Stopped
xstorage1_wan1_IP (ocf::heartbeat:IPaddr): Stopped
xstorage2_wan2_IP (ocf::heartbeat:IPaddr): Started xstorage2 (UNCLEAN)
Gabriele
_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org