[ClusterLabs] Re: Re: Could not start only one node in pacemaker

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Wed May 2 12:37:11 UTC 2018


On Wed, 2 May 2018 05:24:23 +0000
范国腾 <fanguoteng at highgo.com> wrote:

> Andrei,
> 
> We use the following command to create the cluster:
> 
> pcs cluster auth node1 node2 node3 node4 -u hacluster;
> pcs cluster setup --name cluster_pgsql node1 node2 node3 node4;
> pcs cluster start --all;
> pcs property set no-quorum-policy=freeze;
> pcs property set stonith-enabled=true;
> pcs stonith create scsi-stonith-device fence_scsi devices=/dev/mapper/fence
> pcmk_monitor_action=metadata pcmk_reboot_action=off pcmk_host_list="node1
> node2 node3 node4" meta provides=unfencing;
> 
> 
> Could you please tell me how to configure the cluster if I want to use only
> fencing or only quorum? Maybe no-quorum-policy=freeze and stonith-enabled=true
> cannot be set at the same time?

You cannot run without fencing. Period. Quorum might be optional, but not
fencing. Moreover, yes, they play well together: you do not have to disable
quorum just because you are using fencing.
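
For example (a sketch only, to illustrate that both combinations are valid;
adapt to your own setup):

  # quorum + fencing (what you already have):
  pcs property set stonith-enabled=true
  pcs property set no-quorum-policy=freeze

  # fencing only, ignoring quorum (use with care):
  pcs property set stonith-enabled=true
  pcs property set no-quorum-policy=ignore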

A cluster shutdown or startup is a controlled (and rare) operation. You should
not start nodes one at a time; make sure to start all of them together. If
something goes wrong, fix it.
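
In practice, that usually looks like this (crm_mon is just one way to check):

  pcs cluster start --all   # start every node in one controlled operation
  crm_mon -1                # one-shot status: did all nodes join, did resources start?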

Your logs cover 1m09s of startup and almost nothing about the shutdown
operation. There is no way we can tell whether the shutdown went well on all
nodes.

At the very beginning of the startup, Corosync (which provides the quorum
information to Pacemaker) shows all nodes coming up:

  May  1 22:01:45 node3 corosync[17980]: [TOTEM ] adding new UDPU member
                                       {192.168.199.191}
  May  1 22:01:45 node3 corosync[17980]: [TOTEM ] adding new UDPU member
                                       {192.168.199.197}
  May  1 22:01:45 node3 corosync[17980]: [TOTEM ] adding new UDPU member
                                       {192.168.199.193}
  May  1 22:01:45 node3 corosync[17980]: [TOTEM ] A new membership
                                       (192.168.199.193:1292) was formed.
                                       Members joined: 3 
  May  1 22:01:45 node3 corosync[17980]: [QUORUM] Members[1]: 3


However, soon after, Pacemaker complains about losing quorum:

  May  1 22:01:45 node3 pacemakerd[17992]: warning: Quorum lost
  May  1 22:01:45 node3 pacemakerd[17992]:  notice: Node node3 state is now
                                           member
  May  1 22:01:46 node3 crmd[17998]: warning: Quorum lost
  May  1 22:01:47 node3 crmd[17998]:  notice: Node node3 state is now member

Sadly, we have no more logs from Corosync, maybe because it switched to its
own logfile management instead of syslog? Maybe you could dig into corosync.log
for some more information?
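
If Corosync does log to its own file, the logging section of
/etc/corosync/corosync.conf tells you where to look; a typical setup (paths
may differ on your system) looks like:

  logging {
      to_syslog: yes
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
  }

  # then, for instance, extract membership and quorum events:
  grep -E 'TOTEM|QUORUM' /var/log/cluster/corosync.log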

Anyway, in my opinion, you have to figure out why quorum is lost on cluster
startup, not whether stonith and quorum play well together.
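
As a starting point (assuming the standard Corosync tooling is installed),
you could check the quorum state on each node right after it comes up:

  corosync-quorumtool -s          # quorate? expected vs. actual votes?
  corosync-cmapctl | grep member  # which members does this node see?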

> -----Original Message-----
> From: Users [mailto:users-bounces at clusterlabs.org] On Behalf Of Andrei Borzenkov
> Sent: May 2, 2018 13:06
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Re: Could not start only one node in pacemaker
> 
> On 02.05.2018 07:28, 范国腾 wrote:
> > Andrei,
> > 
> > We set "pcs property set no-quorum-policy=freeze". If we want to keep
> > this "freeze" value, could you please tell us which quorum parameter we
> > should set?
> >   
> 
> There is no other parameter. Either you base your cluster on quorum or you
> base your cluster on fencing. Attempting to mix them gives you the result
> you have observed.
> 
> You cannot start resource management until you are aware of the state of the
> other nodes, so either the starting node puts the other nodes into a defined
> state (fencing), or you *MUST* stop and wait for a (sufficient number of)
> other nodes to appear, because doing anything else would clearly violate the
> quorum requirement. Quorum relies on the fact that out-of-quorum nodes will
> not do anything.
> 
> > Thanks
> > 
> > 
> > -----Original Message-----
> > From: Users [mailto:users-bounces at clusterlabs.org] On Behalf Of Andrei Borzenkov
> > Sent: May 2, 2018 12:20
> > To: users at clusterlabs.org
> > Subject: Re: [ClusterLabs] Could not start only one node in pacemaker
> > 
> > On 02.05.2018 05:52, 范国腾 wrote:
> >> Hi, the cluster has three nodes: one is master and two are slaves.
> >> Now we run “pcs cluster stop --all” to stop all of the nodes. Then we
> >> run “pcs cluster start” on the master node. We find it is not able to
> >> start. The cause is that the stonith resource could not be started,
> >> so none of the other resources could be started.
> >> 
> >> We tested this case on two cluster systems and the result is the same:
> >> 
> >> - If we start all three nodes, the stonith resource can be started.
> >>   If we stop one node after it starts, the stonith resource can be
> >>   migrated to another node and the cluster still works.
> >>
> >> - If we start only one or only two nodes, the stonith resource cannot
> >>   be started.
> >> 
> >> 
> >> (1) We create the stonith resources using this method in one system:
> >>
> >> pcs stonith create ipmi_node1 fence_ipmilan ipaddr="192.168.100.202" login="ADMIN" passwd="ADMIN" pcmk_host_list="node1"
> >> pcs stonith create ipmi_node2 fence_ipmilan ipaddr="192.168.100.203" login="ADMIN" passwd="ADMIN" pcmk_host_list="node2"
> >> pcs stonith create ipmi_node3 fence_ipmilan ipaddr="192.168.100.204" login="ADMIN" passwd="ADMIN" pcmk_host_list="node3"
> >> 
> >> 
> >> (2) We create the stonith resource using this method in another
> >> system:
> >> 
> >> pcs stonith create scsi-stonith-device fence_scsi 
> >> devices=/dev/mapper/fence pcmk_monitor_action=metadata 
> >> pcmk_reboot_action=off pcmk_host_list="node1 node2 node3 node4"
> >> meta provides=unfencing;
> >> 
> >> 
> >> The log is in the attachment. What prevents the stonith resource from
> >> being started if we only start some of the nodes?
> > 
> > It says quite clearly:
> > 
> > May  1 22:02:09 node3 pengine[17997]:  notice: Cannot fence unclean 
> > nodes until quorum is attained (or no-quorum-policy is set to ignore) 

