[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Klaus Wenninger kwenning at redhat.com
Wed Mar 14 05:54:48 EDT 2018


On 03/14/2018 08:35 AM, Muhammad Sharfuddin wrote:
> Hi Andrei,
> >Somehow I missed the corosync configuration in this thread. Do you know
> >wait-for-all is set (how?) or do you just assume it?
> >
> Solution found: I was not using the "wait_for_all" option; I was assuming
> that "two_node: 1" would be sufficient:
>
> nodelist {
>         node { ring0_addr:     10.8.9.151  }
>         node { ring0_addr:     10.8.9.152  }
> }
> ###previously:
> quorum {
>         two_node:       1
>         provider:       corosync_votequorum
> }
> ###now/fix:
> quorum {
>         two_node:       1
>         provider:       corosync_votequorum
>         wait_for_all:   0
> }
>
> My observation:
> When I was not using "wait_for_all: 0" in corosync.conf, only the ocfs2
> resources were not running; the rest of the resources were running fine
> because of:
>     a - "two_node: 1" in the corosync.conf file.
>     b - "no-quorum-policy=ignore" in the CIB.

If you now lose the network connection between the two nodes,
one node might be lucky enough to fence the other.
If fencing is set to just power off the other node, you are probably fine.
(With sbd you can achieve this behavior if you configure it
to come up only if the corresponding slot is clean.)
If fencing reboots the other node instead, that node would come up
and right away fence the first one via startup-fencing.
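
A sketch of the "just power off" behavior described above, assuming
sbd-based fencing, the crm shell and the SUSE sysconfig path (names and
paths may differ on other distributions):

    # have fencing power the victim off instead of rebooting it
    crm configure property stonith-action=off

    # /etc/sysconfig/sbd:
    # start sbd (and with it the cluster stack) only if this node's
    # slot on the shared sbd disk is clean, i.e. the node was not fenced
    SBD_STARTMODE=clean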

>
> @ Klaus
> > What I tried to point out is that "no-quorum-policy=ignore"
> > is dangerous for services that do require a resource-manager. If you
> > don't have any of those, go with a systemd startup.
> >
> Running a single node is obviously not acceptable, but say both nodes
> crash and only one comes back: if I start the resources via systemd,
> then the day the other node comes back I have to stop the services via
> systemd in order to start the resources via the cluster, whereas if a
> single-node cluster was running, the other node would simply join the
> cluster and no downtime would occur.
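
For what it's worth, the manual hand-over described above would look
roughly like this (the unit and resource names below are hypothetical):

    # once the second node is back, on the node that was started manually:
    systemctl stop srv-ocfs2.mount        # hypothetical systemd unit
    crm resource start base-clone         # hypothetical cluster resource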

What I meant (a little bit provocatively ;-) ) was: consider whether you
need the resources to be started via a resource-manager at all.

Klaus
>
> -- 
> Regards,
> Muhammad Sharfuddin
>
> On 3/13/2018 11:20 PM, Andrei Borzenkov wrote:
>> 13.03.2018 17:32, Klaus Wenninger wrote:
>>> On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
>>>> Yes, by saying pacemaker,  I meant to say corosync as well.
>>>>
>>>> Is there any fix? Or can a two-node cluster not run ocfs2 resources
>>>> when one node is offline?
>>> Actually there can't be a "fix", as 2 nodes are just not enough
>>> for a partial cluster to be quorate in the classical sense
>>> (more votes than half of the cluster nodes).
>>>
>>> So to still be able to use it, we have this 2-node config that
>>> permanently grants quorum. But in order not to run into issues on
>>> startup, it requires both nodes to see each other once.
>>>
>> I'm rather confused. I have run quite a lot of two-node clusters, and the
>> standard way to resolve this is to require fencing on startup. Then a
>> single node may assume it can safely proceed with starting resources. So
>> it is rather unexpected to suddenly read "cannot be fixed".
>>
>>> So this is definitely nothing that is specific to ocfs2.
>>> It just looks specific to ocfs2 because you've disabled
>>> quorum for pacemaker.
>>> To be honest, doing this you wouldn't need a resource-manager
>>> at all and could just start up your services using systemd.
>>>
>>> If you don't want a full 3rd node, and still want to handle cases
>>> where one node doesn't come up after a full shutdown of
>>> all nodes, you probably could go for a setup with qdevice.
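
A rough idea of what the qdevice variant looks like in corosync.conf,
assuming a third host running corosync-qnetd at e.g. 10.8.9.100 (the
address is a placeholder; two_node is dropped when a qdevice is used):

    quorum {
            provider:       corosync_votequorum
            device {
                    model:  net
                    net {
                            host:           10.8.9.100
                            algorithm:      ffsplit
                    }
            }
    }
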
>>>> Regards,
>>> Klaus
>>>
>>>> -- 
>>>> Regards,
>>>> Muhammad Sharfuddin
>>>>
>>>> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>>>>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> 1 - If I put a node (node2) offline, ocfs2 resources keep running
>>>>>> on the online node (node1).
>>>>>>
>>>>>> 2 - While node2 was offline, I stopped/started the ocfs2 resource
>>>>>> group via the cluster successfully many times in a row.
>>>>>>
>>>>>> 3 - While node2 was offline, I restarted the pacemaker service on
>>>>>> node1 and then tried to start the ocfs2 resource group; dlm started
>>>>>> but the ocfs2 file system resource did not start.
>>>>>>
>>>>>> Nutshell:
>>>>>>
>>>>>> a - Both nodes must be online to start the ocfs2 resource.
>>>>>>
>>>>>> b - If one node crashes or goes offline (gracefully), the ocfs2
>>>>>> resource keeps running on the other/surviving node.
>>>>>>
>>>>>> c - While one node was offline, we could stop/start the ocfs2
>>>>>> resource group on the surviving node, but if we restart the
>>>>>> pacemaker service, then the ocfs2 file system resource does not
>>>>>> start, with the following info in the logs:
>>>>> From the logs I would say startup of dlm_controld times out because
>>>>> it is waiting for quorum - which doesn't happen because of
>>>>> wait-for-all.
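
For context, the "ocfs2 resource group" discussed here is typically a
cloned group of dlm plus an ocfs2 Filesystem, roughly like this in crm
shell syntax (device path and resource names are placeholders):

    primitive dlm ocf:pacemaker:controld \
            op monitor interval=60 timeout=60
    primitive ocfs2-fs ocf:heartbeat:Filesystem \
            params device="/dev/disk/by-id/SHARED-LUN" \
                   directory="/srv/ocfs2" fstype=ocfs2 \
            op monitor interval=20 timeout=40
    group base-group dlm ocfs2-fs
    clone base-clone base-group meta interleave=true

The dlm_controld daemon behind ocf:pacemaker:controld is the piece that,
per the diagnosis above, waits for quorum.
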
>> Somehow I missed the corosync configuration in this thread. Do you know
>> wait-for-all is set (how?) or do you just assume it?
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




