[ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

Gao,Yan ygao at suse.com
Tue Feb 12 03:42:44 EST 2019


On 2/12/19 3:38 AM, Fulong Wang wrote:
> Klaus,
> 
> Thanks for the info!
> Did you mean I should compile sbd from the GitHub source myself to
> include the fixes you mentioned?
> 
> The corosync, pacemaker and sbd versions in my setup are as below:
> corosync:  2.3.6-9.13.1
> pacemaker: 1.1.16-6.5.1
> sbd:       1.3.1+20180507
I'm pretty sure this version already includes Klaus's fix regarding
2-node clusters.
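
For reference, one quick way to double-check is to look at the installed
package and scan its changelog for the relevant fix (a minimal sketch,
assuming an RPM-based SLES installation):

    # show the installed sbd package version
    rpm -q sbd
    # scan the package changelog for the 2-node / pacemaker-integration fix
    rpm -q --changelog sbd | less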

Regards,
   Yan

> 
> 
> 
> Regards
> Fulong
> ------------------------------------------------------------------------
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Monday, February 11, 2019 18:51
> *To:* Cluster Labs - All topics related to open-source clustering 
> welcomed; Fulong Wang; Gao,Yan
> *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
> On 02/11/2019 09:49 AM, Fulong Wang wrote:
>> Thanks Yan,
>>
>> You gave me more valuable hints on the SBD operation!
>> Now, I can see the verbose output after the service restart.
>>
>>
>> >Be aware that pacemaker integration (-P) is enabled by default, which
>> >means that despite the sbd failure, if the node itself is clean and
>> >"healthy" from pacemaker's point of view and it's in the cluster
>> >partition with quorum, it won't self-fence -- meaning a node that is
>> >merely unable to fence doesn't necessarily need to be fenced.
>>
>> >As described in sbd man page, "this allows sbd to survive temporary
>> >outages of the majority of devices. However, while the cluster is in
>> >such a degraded state, it can neither successfully fence nor be shutdown
>> >cleanly (as taking the cluster below the quorum threshold will
>> >immediately cause all remaining nodes to self-fence). In short, it will
>> >not tolerate any further faults.  Please repair the system before
>> >continuing."
>>
>> Yes, I can see the "pacemaker integration" was enabled in my sbd 
>> config file by default.
>> So, you mean that in some sbd failure cases, if the node is considered
>> "healthy" from pacemaker's point of view, it still wouldn't self-fence.
>>
>> Honestly speaking, I didn't quite follow you on this point. I have
>> "no-quorum-policy=ignore" set in my setup, and it's a two-node
>> cluster.
>> Can you show me a sample situation for this?
> 
> When using sbd with 2-node clusters and pacemaker integration, you might
> want to check that
> https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
> is included in your sbd version.
> This is relevant when two_node is configured in corosync.
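> 
> For reference, a two-node setup typically carries this in the quorum
> section of /etc/corosync/corosync.conf (a minimal sketch; your file may
> differ):
> 
>     quorum {
>         provider: corosync_votequorum
>         two_node: 1
>     }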
> 
> Regards,
> Klaus
> 
>>
>> Many Thanks!!!
>>
>>
>>
>>
>> Regards
>> Fulong
>>
>>
>>
>> ------------------------------------------------------------------------
>> *From:* Gao,Yan <ygao at suse.com>
>> *Sent:* Thursday, January 3, 2019 20:43
>> *To:* Fulong Wang; Cluster Labs - All topics related to open-source 
>> clustering welcomed
>> *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
>> On 12/24/18 7:10 AM, Fulong Wang wrote:
>> > Yan, klaus and Everyone,
>> > 
>> > 
>> >   Merry Christmas!!!
>> > 
>> > 
>> > 
>> > Many thanks for your advice!
>> > I added the "-v" param to "SBD_OPTS", but didn't see any apparent change
>> > in the system message log. Am I looking in the wrong place?
>> Did you restart all cluster services, for example by "crm cluster stop"
>> and then "crm cluster start"? Basically sbd.service needs to be
>> restarted. Be aware "systemctl restart pacemaker" only restarts pacemaker.
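>>
>> For example (a sketch; run this on each node in turn so that the whole
>> stack, including sbd.service, comes up with the new options):
>>
>>     # stop and start the complete cluster stack on this node
>>     crm cluster stop
>>     crm cluster start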
>>
>> SBD daemons log to syslog. When an sbd watcher receives a "test"
>> command, a syslog message like this should show up:
>>
>> "servant: Received command test from ..."
>>
>> sbd won't actually do anything about a "test" command other than log a
>> message.
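>>
>> For illustration, sending the test command looks roughly like this (a
>> sketch with a placeholder device path and node name; use your configured
>> SBD device):
>>
>>     # write a "test" message into the peer node's slot on the SBD device
>>     sbd -d /dev/disk/by-id/<your-sbd-device> message <peer-node> test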
>>
>> If you are not yet running a late version of sbd (maintenance update), a
>> single "-v" will already make sbd quite verbose. But of course you could
>> use grep.
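>>
>> For example (assuming syslog ends up in /var/log/messages on SLES):
>>
>>     # filter the verbose output down to the sbd lines
>>     grep 'sbd\[' /var/log/messages | tail -n 50
>>     # or follow the sbd unit through the journal instead
>>     journalctl -u sbd -f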
>>
>> > 
>> > By the way, we want to test that when the disk access paths (multipath
>> > devices) are lost, sbd can fence the node automatically.
>> Be aware that pacemaker integration (-P) is enabled by default, which
>> means that despite the sbd failure, if the node itself is clean and
>> "healthy" from pacemaker's point of view and it's in the cluster
>> partition with quorum, it won't self-fence -- meaning a node that is
>> merely unable to fence doesn't necessarily need to be fenced.
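>>
>> In /etc/sysconfig/sbd this corresponds to the following setting (a sketch
>> of the relevant line only):
>>
>>     # enables the pacemaker integration (equivalent to the -P option)
>>     SBD_PACEMAKER=yes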
>>
>> As described in sbd man page, "this allows sbd to survive temporary
>> outages of the majority of devices. However, while the cluster is in
>> such a degraded state, it can neither successfully fence nor be shutdown
>> cleanly (as taking the cluster below the quorum threshold will
>> immediately cause all remaining nodes to self-fence). In short, it will
>> not tolerate any further faults.  Please repair the system before
>> continuing."
>>
>> Regards,
>>    Yan
>>
>>
>> > What's your recommendation for this scenario?
>> > 
>> > The "crm node fence" did the work.
>> > 
>> > Regards
>> > Fulong
>> > 
>> > ------------------------------------------------------------------------
>> > *From:* Gao,Yan <ygao at suse.com>
>> > *Sent:* Friday, December 21, 2018 20:43
>> > *To:* kwenning at redhat.com; Cluster Labs - All topics related to
>> > open-source clustering welcomed; Fulong Wang
>> > *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
>> > First thanks for your reply, Klaus!
>> > 
>> > On 2018/12/21 10:09, Klaus Wenninger wrote:
>> >> On 12/21/2018 08:15 AM, Fulong Wang wrote:
>> >>> Hello Experts,
>> >>>
>> >>> I'm new to this mailing list.
>> >>> Please kindly forgive me if this mail has disturbed you!
>> >>>
>> >>> Our company is currently evaluating the usage of SuSE HAE on the x86
>> >>> platform.
>> >>> When simulating a storage disaster fail-over, I found that the SBD
>> >>> communication functioned normally on SuSE11 SP4 but abnormally on
>> >>> SuSE12 SP3.
>> >> 
>> >> I have no experience with SBD on SLES but I know that handling of the
>> >> logging verbosity-levels has changed recently in the upstream-repo.
>> >> Given that it was done by Yan Gao iirc I'd assume it went into SLES.
>> >> So changing the verbosity of the sbd-daemon might get you back
>> >> these logs.
>> > Yes, I think it's the issue. Could you please retrieve the latest
>> > maintenance update for SLE12SP3 and try? Otherwise of course you could
>> > temporarily enable verbose/debug logging by adding a couple of "-v" into
>> >    "SBD_OPTS" in /etc/sysconfig/sbd.
>> > 
>> > But frankly, it makes more sense to manually trigger fencing for example
>> > by "crm node fence" and see if it indeed works correctly.
>> > 
>> >> And of course you can use the list command on the other node
>> >> to verify as well.
>> > The "test" message in the slot might get overwritten soon by a "clear"
>> > if the sbd daemon is running.
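>> > 
>> > For reference, the list command looks like this (a sketch with a
>> > placeholder device path):
>> > 
>> >     # show each node's slot and the last message written to it
>> >     sbd -d /dev/disk/by-id/<your-sbd-device> list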
>> > 
>> > Regards,
>> >     Yan
>> > 
>> > 
>> >> 
>> >> Klaus
>> >> 
>> >>> The SBD device was added during the initialization of the first 
>> >>> cluster node.
>> >>>
>> >>> I have requested help from the SuSE guys, but they haven't given me any
>> >>> valuable feedback yet!
>> >>>
>> >>>
>> >>> Below are some screenshots to explain what I have encountered.
>> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >>>
>> >>> On a SuSE11 SP4 HAE cluster, I ran the sbd test command as below:
>> >>>
>> >>> Then some information showed up in the local system message log.
>> >>>
>> >>> On the second node, we can confirm that the communication is normal.
>> >>>
>> >>> But when I turned to a SuSE12 SP3 HAE cluster and ran the same command
>> >>> as above:
>> >>>
>> >>> I didn't get any response in the system message log.
>> >>>
>> >>>
>> >>> "systemctl status sbd" also doesn't give me any clue on this.
>> >>>
>> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> >>>
>> >>> What could be the reason for this abnormal behavior? Are there any
>> >>> problems with my setup?
>> >>> Any suggestions are appreciated!
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>> Regards
>> >>> FuLong
>> >>>
>> >>>
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


