[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Wed Mar 14 03:35:26 EDT 2018


Hi Andrei,
 >Somehow I miss the corosync configuration in this thread. Do you know that
 >wait-for-all is set (how?), or do you just assume it?
 >
Solution found: I was not setting the "wait_for_all" option explicitly; I was
assuming that "two_node: 1" alone would be sufficient:

nodelist {
         node { ring0_addr:     10.8.9.151  }
         node { ring0_addr:     10.8.9.152  }
}
###previously:
quorum {
         two_node:       1
         provider:       corosync_votequorum
}
###now/fix:
quorum {
         two_node:       1
         provider:       corosync_votequorum
         wait_for_all:   0
}
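
For completeness, a full two-node corosync.conf built around this quorum
section could look roughly like the following; the cluster name, transport
and token values here are placeholders, not taken from my real config:

totem {
        version: 2
        # placeholder name and transport
        cluster_name: hacluster
        transport: udpu
        token: 5000
}

nodelist {
        node {
                ring0_addr: 10.8.9.151
                nodeid: 1
        }
        node {
                ring0_addr: 10.8.9.152
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implicitly turns on wait_for_all,
        # so it has to be disabled explicitly
        wait_for_all: 0
}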

My observation:
without "wait_for_all: 0" in corosync.conf, only the ocfs2 resources
failed to start; the rest of the resources ran fine because of:
     a - "two_node: 1" in the corosync.conf file.
     b - "no-quorum-policy=ignore" in the cib.

@ Klaus
 > what I tried to point out is that "no-quorum-policy=ignore"
 >is dangerous for services that do require a resource-manager. If you don't
 >have any of those go with a systemd startup.
 >
Running on a single node is obviously not acceptable as a permanent state,
but say both nodes crash and only one comes back: if I start the resources
via systemd, then the day the other node returns I have to stop the services
via systemd in order to start the resources via the cluster, which means
downtime. Whereas if a single-node cluster were running, the other node would
simply join the cluster and no downtime would occur.
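
For example, the manual handover I am trying to avoid would look something
like this on the surviving node (the unit and resource-group names below are
just placeholders):

# stop the services that were started outside the cluster
systemctl stop myapp.service

# then start the same services as cluster resources again
crm resource start g-ocfs2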

--
Regards,
Muhammad Sharfuddin

On 3/13/2018 11:20 PM, Andrei Borzenkov wrote:
> 13.03.2018 17:32, Klaus Wenninger wrote:
>> On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
>>> Yes, by saying pacemaker,  I meant to say corosync as well.
>>>
>>> Is there any fix? Or can a two-node cluster not run ocfs2 resources
>>> when one node is offline?
>> Actually there can't be a "fix" as 2 nodes are just not enough
>> for a partial-cluster to be quorate in the classical sense
>> (more votes than half of the cluster nodes).
>>
>> So to still be able to use it we have this 2-node config that
>> permanently sets quorum. But to avoid running into issues on
>> startup we need it to require both nodes seeing each
>> other once.
>>
> I'm rather confused. I have run quite a lot of 2-node clusters and the
> standard way to resolve it is to require fencing on startup. Then a single
> node may assume it can safely proceed with starting resources. So it is
> rather unexpected to suddenly read "cannot be fixed".
>
>> So this is definitely nothing that is specific to ocfs2.
>> It just looks specific to ocfs2 because you've disabled
>> quorum for pacemaker.
>> To be honest, doing this you wouldn't need a resource-manager
>> at all and could just start up your services using systemd.
>>
>> If you don't want a full 3rd node, and still want to handle cases
>> where one node doesn't come up after a full shutdown of
>> all nodes, you probably could go for a setup with qdevice.
>> Regards,
>> Klaus
>>
>>> -- 
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>>>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>>>> Hi,
>>>>>
>>>>> 1 - if I put a node (node2) offline, the ocfs2 resources keep running
>>>>> on the online node (node1)
>>>>>
>>>>> 2 - while node2 was offline, I stopped/started the ocfs2 resource
>>>>> group via the cluster successfully many times in a row.
>>>>>
>>>>> 3 - while node2 was offline, I restarted the pacemaker service on
>>>>> node1 and then tried to start the ocfs2 resource group; dlm started
>>>>> but the ocfs2 file system resource did not start.
>>>>>
>>>>> Nutshell:
>>>>>
>>>>> a - both nodes must be online to start the ocfs2 resource.
>>>>>
>>>>> b - if one node crashes or goes offline (gracefully), the ocfs2
>>>>> resource keeps running on the other/surviving node.
>>>>>
>>>>> c - while one node is offline, we can stop/start the ocfs2 resource
>>>>> group on the surviving node, but if we restart the pacemaker service,
>>>>> then the ocfs2 file system resource does not start, with the following
>>>>> info in the logs:
>>>> From the logs I would say startup of dlm_controld times out because it
>>>> is waiting
>>>> for quorum - which doesn't happen because of wait-for-all.
> Somehow I miss the corosync configuration in this thread. Do you know that
> wait-for-all is set (how?), or do you just assume it?
>





