[ClusterLabs] [EXTERNAL] What's the best practice to scale-out/increase the cluster size?

Tomas Jelinek tojeline at redhat.com
Mon Jul 15 08:19:39 EDT 2019


Dne 11. 07. 19 v 16:01 Michael Powell napsal(a):
> Thanks again for the feedback.  As a novice to Pacemaker, I am learning a great deal and have a great deal more to learn.
> 
> I'm afraid I was not precise in my choice of the term "stand-alone".  As you point out, our situation is really case b) "as the 2-node cluster but only one is up at the moment".  That said, there are cases where we would want to bring up a single node at a time.  These occur during maintenance periods, though, not during normal operation.  Hence we believe we can live with the STONITH/reboot.
> 
> I was not aware that corosync.conf could be reloaded while Corosync was running.  I'll review the Corosync documentation again.


Be aware that not all corosync settings can be changed while corosync is 
running. See the corosync wiki [1] for details.

[1]: https://github.com/corosync/corosync/wiki/Config-file-values
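
If the settings you need to change are among the runtime-changeable ones,
a rough sketch of the workflow (assuming corosync 2.x with pcs; verify
that these tools and options exist on your versions) is: edit
/etc/corosync/corosync.conf on every node, ask corosync to reload it,
then check what it actually picked up:

  # edit /etc/corosync/corosync.conf on all nodes, then either:
  corosync-cfgtool -R               # ask corosync to reload corosync.conf
  # or:
  pcs cluster reload corosync
  # verify the values corosync is actually using:
  corosync-cmapctl | grep votequorum
  corosync-quorumtool -s            # quorum status and flags (2Node, WaitForAll, ...)

Anything the wiki lists as not changeable at runtime still requires a
full corosync restart.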


Regards,
Tomas

> 
> Regards,
>    Michael
> 
> -----Original Message-----
> From: Roger Zhou <ZZhou at suse.com>
> Sent: Thursday, July 11, 2019 1:16 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Michael Powell <Michael.Powell at harmonicinc.com>; Ken Gaillot <kgaillot at redhat.com>
> Cc: Venkata Reddy Chappavarapu <Venkata.Chappavarapu at harmonicinc.com>
> Subject: [EXTERNAL] What's the best practice to scale-out/increase the cluster size? (was: [ClusterLabs] "node is unclean" leads to gratuitous reboot)
> 
> 
> On 7/11/19 2:15 AM, Michael Powell wrote:
>> Thanks to you and Andrei for your responses.  In our particular situation, we want to be able to operate with either node in stand-alone mode, or with both nodes protected by HA.  I did not mention this, but I am working on upgrading our product, from a version which used Pacemaker 1.0.13 and Heartbeat, to run under CentOS 7.6 (later 8.0).  The older version did not exhibit this behavior, hence my concern.
>>
>> I do understand the "wait_for_all" option better, and now that I know why the "gratuitous" reboot is happening, I'm more comfortable with that behavior.  I think the biggest operational risk would occur following a power-up of the chassis.  If one node were significantly delayed during bootup, e.g. because of networking issues, the other node would issue the STONITH and reboot the delayed node.  That would be an annoyance, but it would be relatively infrequent.  Our customers almost always keep at least one node (and usually both nodes) operational 24/7.
>>
> 
> 2 cents,
> 
> I think your requirement is very clear. Still, I view this as a tricky design challenge. There are two different situations that can easily fool people:
> 
> a) the situation of being stand-alone (one node, really not a cluster)
> b) the situation as the 2-node cluster but only one is up at the moment
> 
> Without defining these concepts clearly and clarifying their difference, people can mix them up and end up setting expectations against the wrong concept.
> 
> In your case, the configuration is a 2-node cluster. The log indicates the correct behavior for b), i.e. those STONITH actions are indeed by design.
> But people set the wrong expectation by treating it as a).
> 
> With that in mind, would it be a cleaner design to let it be a stand-alone system first, then smoothly grow it to two nodes?
> 
> Furthermore, this triggers me to raise a question, mostly for corosync:
> 
> What's the best practice to scale-out/increase the cluster size?
> 
> One approach I can think of is to modify corosync.conf and reload it at run-time. However, that doesn't look as elegant as the reverse direction, namely the allow_downscale/auto_tie_breaker/last_man_standing options of the advanced corosync feature set; see `man votequorum`.
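> 
> As a rough illustration, a quorum stanza using that votequorum feature
> set could look like the sketch below (values are only examples; see
> `man votequorum` for the exact semantics and caveats of each option):
> 
>     quorum {
>             provider: corosync_votequorum
>             expected_votes: 5
>             # recalculate expected_votes/quorum as nodes leave, per window
>             last_man_standing: 1
>             last_man_standing_window: 10000
>             # allow expected_votes to shrink when nodes leave cleanly
>             allow_downscale: 1
>             # on an even split, deterministically keep one partition quorate
>             auto_tie_breaker: 1
>     }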
> 
> 
> Cheers,
> Roger
> 
> 
> 
> 
>> Regards,
>>     Michael
>>
>> -----Original Message-----
>> From: Ken Gaillot <kgaillot at redhat.com>
>> Sent: Tuesday, July 09, 2019 12:42 PM
>> To: Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> Cc: Michael Powell <Michael.Powell at harmonicinc.com>; Venkata Reddy
>> Chappavarapu <Venkata.Chappavarapu at harmonicinc.com>
>> Subject: [EXTERNAL] Re: [ClusterLabs] "node is unclean" leads to
>> gratuitous reboot
>>
>> On Tue, 2019-07-09 at 12:54 +0000, Michael Powell wrote:
>>> I have a two-node cluster with a problem.  If I start
>>
>> Not so much a problem as a configuration choice :)
>>
>> There are trade-offs in any case.
>>
>> - wait_for_all in corosync.conf: If set, this will make each starting node wait until it sees the other before gaining quorum for the first time. The downside is that both nodes must be up for the cluster to start; the upside is a clean starting point and no fencing.
>>
>> - startup-fencing in pacemaker properties: If disabled, either node
>> can start without fencing the other. This is unsafe; if the other node
>> is actually active and running resources, but unreachable from the
>> newly up node, the newly up node may start the same resources, causing
>> split-brain. (Easier than you might think: consider taking a node
>> down for hardware maintenance, bringing it back up without a network,
>> then plugging it back into the network -- by that point it may have
>> brought up resources and starts causing havoc.)
>>
>> - Start corosync on both nodes, then start pacemaker. This avoids start-up fencing since when pacemaker starts on either node, it already sees the other node present, even if that node's pacemaker isn't up yet.
>>
>> Personally I'd go for wait_for_all in normal operation. You can always disable it if there are special circumstances where a node is expected to be out of the cluster for a long time.
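>>
>> As a hedged sketch of where the first two options above live (adjust to
>> your own configuration; the property name and command are shown only to
>> identify the knob, not as a recommendation):
>>
>>     # corosync.conf, quorum section
>>     quorum {
>>             provider: corosync_votequorum
>>             two_node: 1
>>             # note: two_node: 1 implies wait_for_all: 1 unless it is
>>             # explicitly set to 0 (see votequorum(5))
>>             wait_for_all: 1
>>     }
>>
>>     # Pacemaker cluster property (unsafe, per the caveat above):
>>     pcs property set startup-fencing=false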
>>
>>> Corosync/Pacemaker on one node, and then delay startup on the 2nd
>>> node (which is otherwise up and running), the 2nd node will be
>>> rebooted very soon after STONITH is enabled on the first node.  This
>>> reboot seems to be gratuitous and could under some circumstances be
>>> problematic.  While, at present,  I “manually” start
>>> Corosync/Pacemaker by invoking a script from an ssh session, in a
>>> production environment, this script would be started by a systemd
>>> service.  It’s not hard to imagine that if both nodes were started at
>>> approximately the same time (each node runs on a separate motherboard
>>> in the same chassis), this behavior could cause one of the nodes to
>>> be rebooted while it’s in the process of booting up.
>>>    
>>> The two nodes’ host names are mgraid-16201289RN00023-0 and mgraid-
>>> 16201289RN00023-1.  Both hosts are running, but Pacemaker has been
>>> started on neither.  If Pacemaker is started on mgraid-
>>> 16201289RN00023-0, within a few seconds after STONITH is enabled, the
>>> following messages will appear in the system log file, and soon
>>> thereafter STONITH will be invoked to reboot the other node, on which
>>> Pacemaker has not yet been started.  (NB: The fence agent is a
>>> process named mgpstonith which uses the ipmi interface to reboot the
>>> other node.  For debugging, it prints the data it receives from
>>> stdin. )
>>>    
>>> 2019-07-08T13:11:14.907668-07:00 mgraid-16201289RN00023-0
>>> HA_STARTSTOP: Configure mgraid-stonith    # This message indicates
>>> that STONITH is about to be configured and enabled …
>>> 2019-07-08T13:11:15.018131-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16299]: info:... action=metadata#012
>>> 2019-07-08T13:11:15.050817-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16301]: info:... action=metadata#012 …
>>> 2019-07-08T13:11:21.085092-07:00 mgraid-16201289RN00023-0
>>> pengine[16216]:  warning: Scheduling Node mgraid-16201289RN00023-1
>>> for STONITH
>>> 2019-07-08T13:11:21.085615-07:00 mgraid-16201289RN00023-0
>>> pengine[16216]:   notice:  * Fence (reboot) mgraid-16201289RN00023-1
>>> 'node is unclean'
>>> 2019-07-08T13:11:21.085663-07:00 mgraid-16201289RN00023-0
>>> pengine[16216]:   notice:  * Promote    SS16201289RN00023:0     (
>>> Stopped -> Master mgraid-16201289RN00023-0 )
>>> 2019-07-08T13:11:21.085704-07:00 mgraid-16201289RN00023-0
>>> pengine[16216]:   notice:  * Start      mgraid-stonith:0
>>> (                   mgraid-16201289RN00023-0 )
>>> 2019-07-08T13:11:21.091673-07:00 mgraid-16201289RN00023-0
>>> pengine[16216]:  warning: Calculated transition 0 (with warnings),
>>> saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
>>> 2019-07-08T13:11:21.093155-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: Initiating monitor operation
>>> SS16201289RN00023:0_monitor_0 locally on mgraid-16201289RN00023-0
>>> 2019-07-08T13:11:21.124403-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: Initiating monitor operation mgraid-
>>> stonith:0_monitor_0 locally on mgraid-16201289RN00023-0 …
>>> 2019-07-08T13:11:21.132994-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16361]: info:... action=metadata#012 …
>>> 2019-07-08T13:11:22.128139-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: Requesting fencing (reboot) of node mgraid-
>>> 16201289RN00023-1
>>> 2019-07-08T13:11:22.129150-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: Result of probe operation for
>>> SS16201289RN00023 on mgraid-16201289RN00023-0: 7 (not running)
>>> 2019-07-08T13:11:22.129191-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: mgraid-16201289RN00023-0-
>>> SS16201289RN00023_monitor_0:6 [ \n\n ]
>>> 2019-07-08T13:11:22.133846-07:00 mgraid-16201289RN00023-0 stonith-
>>> ng[16213]:   notice: Client crmd.16218.a7e3cbae wants to fence
>>> (reboot) 'mgraid-16201289RN00023-1' with device '(any)'
>>> 2019-07-08T13:11:22.133997-07:00 mgraid-16201289RN00023-0 stonith-
>>> ng[16213]:   notice: Requesting peer fencing (reboot) of mgraid-
>>> 16201289RN00023-1
>>> 2019-07-08T13:11:22.136287-07:00 mgraid-16201289RN00023-0
>>> crmd[16218]:   notice: Result of probe operation for mgraid-stonith
>>> on mgraid-16201289RN00023-0: 7 (not running)
>>> 2019-07-08T13:11:22.141393-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16444]: info:... action=status#012   # Status requests
>>> always return 0.
>>> 2019-07-08T13:11:22.141418-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16444]: info:... nodename=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.141432-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16444]: info:... port=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.141444-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16444]: info:Ignoring: port …
>>> 2019-07-08T13:11:22.148973-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16445]: info:... action=status#012
>>> 2019-07-08T13:11:22.148997-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16445]: info:... nodename=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.149009-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16445]: info:... port=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.149019-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16445]: info:Ignoring: port …
>>> 2019-07-08T13:11:22.155226-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16446]: info:... action=reboot#012
>>> 2019-07-08T13:11:22.155250-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16446]: info:... nodename=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.155263-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16446]: info:... port=mgraid-16201289RN00023-1#012
>>> 2019-07-08T13:11:22.155273-07:00 mgraid-16201289RN00023-0
>>> MGPSTONITH[16446]: info:Ignoring: port
>>>
>>> Following is a relevant excerpt of the corosync.log file –
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  STONITH timeout: 60000
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  STONITH of failed nodes is enabled
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  Concurrent fencing is disabled
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  Stop all active resources: false
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  Cluster is symmetric - resources can run
>>> anywhere by default
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  Default stickiness: 0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  On loss of CCM Quorum: Stop ALL resources
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: unpack_config:  Node scores: 'red' = -INFINITY, 'yellow' = 0,
>>> 'green' = 0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: determine_online_status_fencing: Node mgraid-16201289RN00023-0
>>> is active
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: determine_online_status:       Node mgraid-16201289RN00023-0 is
>>> online
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: unpack_node_loop:     Node 1 is already processed
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: unpack_node_loop:     Node 1 is already processed
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: clone_print:    Master/Slave Set: ms-SS16201289RN00023
>>> [SS16201289RN00023]
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: short_print:         Stopped: [ mgraid-16201289RN00023-0
>>> mgraid-16201289RN00023-1 ]
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: clone_print:    Clone Set: mgraid-stonith-clone [mgraid-
>>> stonith]
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: short_print:         Stopped: [ mgraid-16201289RN00023-0
>>> mgraid-16201289RN00023-1 ]
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
>>> master-w1-rule) is not active (role : Master vs. Unknown)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
>>> master-w1-rule) is not active (role : Master vs. Unknown)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_rsc_location:       Constraint (ms-SS16201289RN00023-
>>> master-w1-rule) is not active (role : Master vs. Unknown)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: distribute_children:       Allocating up to 2 ms-
>>> SS16201289RN00023 instances to a possible 1 nodes (at most 1 per
>>> host,
>>> 2 optimal)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:       Assigning mgraid-16201289RN00023-0
>>> to SS16201289RN00023:0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:   All nodes for resource
>>> SS16201289RN00023:1 are unavailable, unclean or shutting down
>>> (mgraid-16201289RN00023-1: 0, -1000000)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:   Could not allocate a node for
>>> SS16201289RN00023:1
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: native_color:   Resource SS16201289RN00023:1 cannot run
>>> anywhere
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: distribute_children:       Allocated 1 ms-SS16201289RN00023
>>> instances of a possible 2
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: master_color:       SS16201289RN00023:0 master score: 99
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: master_color:   Promoting SS16201289RN00023:0 (Stopped mgraid-
>>> 16201289RN00023-0)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: master_color:       SS16201289RN00023:1 master score: 0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: master_color:   ms-SS16201289RN00023: Promoted 1 instances of a
>>> possible 1 to master
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: distribute_children:       Allocating up to 2 mgraid-stonith-
>>> clone instances to a possible 1 nodes (at most 1 per host, 2 optimal)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:       Assigning mgraid-16201289RN00023-0
>>> to mgraid-stonith:0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:   All nodes for resource mgraid-stonith:1
>>> are unavailable, unclean or shutting down (mgraid-16201289RN00023-1:
>>> 0, -1000000)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_assign_node:   Could not allocate a node for mgraid-
>>> stonith:1
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: native_color:   Resource mgraid-stonith:1 cannot run anywhere
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: distribute_children:       Allocated 1 mgraid-stonith-clone
>>> instances of a possible 2
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_create_probe:       Probing SS16201289RN00023:0 on
>>> mgraid-16201289RN00023-0 (Stopped) 1 (nil)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_create_probe:       Probing mgraid-stonith:0 on mgraid-
>>> 16201289RN00023-0 (Stopped) 1 (nil)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: master_create_actions:       Creating actions for ms-
>>> SS16201289RN00023
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: RecurringOp:    Start recurring monitor (3s) for
>>> SS16201289RN00023:0 on mgraid-16201289RN00023-0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: RecurringOp:    Start recurring monitor (3s) for
>>> SS16201289RN00023:0 on mgraid-16201289RN00023-0
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> warning: stage6:  Scheduling Node mgraid-16201289RN00023-1 for STONITH
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: native_start_constraints:       Ordering mgraid-
>>> stonith:0_start_0 after mgraid-16201289RN00023-1 recovery
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> notice: LogNodeActions: * Fence (reboot) mgraid-16201289RN00023-1
>>> 'node is unclean'
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> notice: LogAction:      * Promote    SS16201289RN00023:0     (
>>> Stopped -> Master mgraid-16201289RN00023-0 )
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: LogActions:     Leave   SS16201289RN00023:1 (Stopped)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> notice: LogAction:      * Start      mgraid-stonith:0
>>> (                   mgraid-16201289RN00023-0 )
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> info: LogActions:     Leave   mgraid-stonith:1    (Stopped)
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name SS16201289RN00023
>>> for SS16201289RN00023:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name mgraid-stonith for
>>> mgraid-stonith:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> debug: action2xml:     Using anonymous clone name mgraid-stonith for
>>> mgraid-stonith:0 (aka. (null))
>>> Jul 08 13:11:21 [16216] mgraid-16201289RN00023-0    pengine:
>>> warning: process_pe_message:       Calculated transition 0 (with
>>> warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-3.bz2
>>>    
>>> Here is the status of the first node, once Pacemaker is started –
>>>    
>>> [root at mgraid-16201289RN00023-0 bin]# pcs status
>>> Cluster name:
>>> Stack: corosync
>>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-
>>> c3c624ea3d) - partition with quorum
>>> Last updated: Mon Jul  8 17:51:22 2019 Last change: Mon Jul  8
>>> 16:11:23 2019 by root via cibadmin on mgraid-
>>> 16201289RN00023-0
>>>    
>>> 2 nodes configured
>>> 4 resources configured
>>>    
>>> Online: [ mgraid-16201289RN00023-0 ]
>>> OFFLINE: [ mgraid-16201289RN00023-1 ]
>>>    
>>> Full list of resources:
>>>    
>>> Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>>>        SS16201289RN00023  (ocf::omneon:ss):       Starting mgraid-
>>> 16201289RN00023-0
>>>        Stopped: [ mgraid-16201289RN00023-1 ] Clone Set:
>>> mgraid-stonith-clone [mgraid-stonith]
>>>        Started: [ mgraid-16201289RN00023-0 ]
>>>        Stopped: [ mgraid-16201289RN00023-1 ]
>>>    
>>> Daemon Status:
>>>     corosync: active/disabled
>>>     pacemaker: active/disabled
>>>     pcsd: inactive/disabled
>>> Here’s the configuration, from the first node –
>>>    
>>> [root at mgraid-16201289RN00023-0 bin]# pcs config
>>> Cluster Name:
>>> Corosync Nodes:
>>>  mgraid-16201289RN00023-0 mgraid-16201289RN00023-1
>>> Pacemaker Nodes:
>>>  mgraid-16201289RN00023-0 mgraid-16201289RN00023-1
>>>    
>>> Resources:
>>> Master: ms-SS16201289RN00023
>>>     Meta Attrs: clone-max=2 notify=true globally-unique=false target-
>>> role=Started
>>>     Resource: SS16201289RN00023 (class=ocf provider=omneon type=ss)
>>>      Attributes: ss_resource=SS16201289RN00023
>>> ssconf=/var/omneon/config/config.16201289RN00023
>>>      Operations: monitor interval=3s role=Master timeout=7s
>>> (SS16201289RN00023-monitor-3s)
>>>                  monitor interval=10s role=Slave timeout=7
>>> (SS16201289RN00023-monitor-10s)
>>>                  stop interval=0 timeout=20 (SS16201289RN00023-stop-0)
>>>                  start interval=0 timeout=300
>>> (SS16201289RN00023-start-
>>> 0)
>>> Clone: mgraid-stonith-clone
>>>     Resource: mgraid-stonith (class=stonith type=mgpstonith)
>>>      Operations: monitor interval=0 timeout=20s (mgraid-stonith-
>>> monitor-interval-0)
>>>    
>>> Stonith Devices:
>>> Fencing Levels:
>>>    
>>> Location Constraints:
>>>     Resource: ms-SS16201289RN00023
>>>       Constraint: ms-SS16201289RN00023-master-w1
>>>         Rule: role=master score=100  (id:ms-SS16201289RN00023-master-
>>> w1-rule)
>>>           Expression: #uname eq mgraid-16201289rn00023-0  (id:ms-
>>> SS16201289RN00023-master-w1-rule-expression)
>>> Ordering Constraints:
>>> Colocation Constraints:
>>> Ticket Constraints:
>>>    
>>> Alerts:
>>> No alerts defined
>>>    
>>> Resources Defaults:
>>> failure-timeout: 1min
>>> Operations Defaults:
>>> No defaults set
>>>    
>>> Cluster Properties:
>>> cluster-infrastructure: corosync
>>> cluster-recheck-interval: 1min
>>> dc-deadtime: 5s
>>> dc-version: 1.1.19-8.el7-c3c624ea3d
>>> have-watchdog: false
>>> last-lrm-refresh: 1562513532
>>> stonith-enabled: true
>>>    
>>> Quorum:
>>>     Options:
>>>       wait_for_all: 0
>>> Interestingly, as you’ll note below, the “two_node” option is also
>>> set to 1, but is not reported as such above.
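>>> (As an aside, the quorum settings corosync is actually running with can
>>> be queried directly, independent of what pcs reports, e.g.
>>>
>>>     corosync-quorumtool -s        # the Flags: line should include 2Node
>>>     corosync-cmapctl | grep quorum
>>>
>>> assuming those tools are available on this build.)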
>>>    
>>> Finally, here’s /etc/corosync/corosync.conf –
>>>
>>> totem {
>>>           version: 2
>>>    
>>>           crypto_cipher: none
>>>           crypto_hash: none
>>>    
>>>           interface {
>>>                   ringnumber: 0
>>>           bindnetaddr: 169.254.1.1
>>>                   mcastaddr: 239.255.1.1
>>>                   mcastport: 5405
>>>                   ttl: 1
>>>           }
>>> }
>>>    
>>> logging {
>>>           fileline: off
>>>           to_stderr: no
>>>           to_logfile: yes
>>>           logfile: /var/log/cluster/corosync.log
>>>           to_syslog: yes
>>>           debug: on
>>>           timestamp: on
>>>           logger_subsys {
>>>                   subsys: QUORUM
>>>                   debug: on
>>>           }
>>> }
>>>    
>>> nodelist {
>>>           node {
>>>                   ring0_addr: mgraid-16201289RN00023-0
>>>                   nodeid: 1
>>>           }
>>>    
>>>           node {
>>>                   ring0_addr: mgraid-16201289RN00023-1
>>>                   nodeid: 2
>>>           }
>>> }
>>>    
>>> quorum {
>>>           provider: corosync_votequorum
>>>    
>>>           two_node: 1
>>>    
>>>           wait_for_all: 0
>>> }
>>>    
>>> I’d appreciate any insight you can offer into this behavior, and any
>>> suggestions you may have.
>>>
>>> Regards,
>>>     Michael
>>>
>>>    
>>>       Michael Powell
>>>       Sr. Staff Engineer
>>>    
>>>       15220 NW Greenbrier Pkwy
>>>           Suite 290
>>>       Beaverton, OR   97006
>>>       T 503-372-7327    M 503-789-3019   H 503-625-5332
>>>    
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>>
> 

