[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Tue Mar 13 14:20:10 EDT 2018

13.03.2018 17:32, Klaus Wenninger wrote:
> On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
>> Yes, by pacemaker I meant corosync as well.
>>
>> Is there any fix? Or can a two-node cluster not run ocfs2 resources
>> when one node is offline?
> 
> Actually there can't be a "fix" as 2 nodes are just not enough
> for a partial-cluster to be quorate in the classical sense
> (more votes than half of the cluster nodes).
> 
> So to still be able to use it we have this 2-node config that
> permanently sets quorum. But not to run into issues on
> startup we need it to require both nodes seeing each
> other once.
> 

I'm rather confused. I have run quite a lot of 2-node clusters, and the
standard way to resolve this is to require fencing on startup. A single
node may then assume it can safely proceed with starting resources. So it
is rather unexpected to suddenly read "cannot be fixed".

> So this is definitely nothing that is specific to ocfs2.
> It just looks specific to ocfs2 because you've disabled
> quorum for pacemaker.
> To be honest, by doing this you wouldn't need a resource manager
> at all and could just start up your services using systemd.
> 
> If you don't want a full 3rd node, and still want to handle cases
> where one node doesn't come up after a full shutdown of
> all nodes, you probably could go for a setup with qdevice.
> 
> Regards,
> Klaus
> 
>>
>> -- 
>> Regards,
>> Muhammad Sharfuddin
>>
>> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>>> Hi,
>>>>
>>>> 1 - if I take a node (node2) offline, the ocfs2 resources keep running
>>>> on the online node (node1).
>>>>
>>>> 2 - while node2 was offline, I stopped/started the ocfs2 resource
>>>> group via the cluster successfully many times in a row.
>>>>
>>>> 3 - while node2 was offline, I restarted the pacemaker service on
>>>> node1 and then tried to start the ocfs2 resource group; dlm started
>>>> but the ocfs2 file system resource did not start.
>>>>
>>>> In a nutshell:
>>>>
>>>> a - both nodes must be online to start the ocfs2 resource.
>>>>
>>>> b - if one node crashes or goes offline (gracefully), the ocfs2
>>>> resource keeps running on the other/surviving node.
>>>>
>>>> c - while one node is offline, we can stop/start the ocfs2 resource
>>>> group on the surviving node, but if we stop the pacemaker service,
>>>> the ocfs2 file system resource does not start, with the following
>>>> info in the logs:
>>> From the logs I would say startup of dlm_controld times out because it
>>> is waiting for quorum - which doesn't happen because of wait-for-all.

I don't see the corosync configuration anywhere in this thread. Do you know
that wait-for-all is set (and how?), or do you just assume it?
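
For reference, in corosync's votequorum these settings live in the
quorum section of corosync.conf: two_node: 1 lets a single node hold
quorum, and it also turns on wait_for_all unless that is explicitly set
to 0. Below is a rough Python model of the behaviour discussed in this
thread. It is only a sketch, not corosync's actual code: the function
name is_quorate and the all_seen_once flag are made up for illustration,
and every node is assumed to have exactly one vote.

```python
def is_quorate(total_nodes, visible_nodes, two_node=False,
               wait_for_all=None, all_seen_once=False):
    """Toy model of the quorum decision (hypothetical helper).

    total_nodes   -- configured cluster size, one vote per node (assumed)
    visible_nodes -- nodes currently seen, including the local one
    all_seen_once -- True if all nodes were visible together at least
                     once since the cluster stack started on this node
    """
    # Assumed default: two_node implies wait_for_all unless overridden.
    if wait_for_all is None:
        wait_for_all = two_node

    # With wait_for_all, quorum is withheld until all nodes met once.
    if wait_for_all and not all_seen_once:
        return False

    # The 2-node special case: one vote is enough.
    if two_node and total_nodes == 2:
        return visible_nodes >= 1

    # Classical quorum: more votes than half of the cluster nodes.
    return visible_nodes > total_nodes / 2


# node2 goes offline while the cluster is running: node1 stays quorate.
print(is_quorate(2, 1, two_node=True, all_seen_once=True))
# The cluster stack is restarted on node1 while node2 is still offline.
print(is_quorate(2, 1, two_node=True, all_seen_once=False))
```

In this model the second call is the case from point 3 of the original
report: after a restart the lone node has not yet seen its peer, so it
never becomes quorate and anything waiting for quorum times out.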