[ClusterLabs] single node fails to start the ocfs2 resource
Gang He
ghe at suse.com
Mon Mar 12 03:41:50 EDT 2018
> Hello Gang,
>
> to follow your instructions, I started the dlm resource via:
>
> crm resource start dlm
>
> then mounted/unmounted the ocfs2 file system manually (which seems to
> have fixed the situation).
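>
> (A sketch of that manual mount/unmount, with the device and mount point
> taken from the resource configuration quoted below; the second file
> system would be handled the same way:)
>
> mount -t ocfs2 /dev/mapper/sapmnt /sapmnt
> umount /sapmnt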
>
> Now the resources start properly on a single node. I am happy that the
> issue is fixed, but at the same time I am lost, because I have no idea
> how things got fixed here (merely by mounting/unmounting the ocfs2 file
> systems).
From your description, I just wonder whether the DLM resource was working normally in that situation.
Yan/Bin, do you have any comments about two-node clusters? Which configuration settings will affect corosync quorum/DLM?
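
For reference, Pacemaker's no-quorum-policy is only consulted by Pacemaker
itself; dlm_controld asks corosync for quorum directly. With corosync 2.x
votequorum (per the version list below), a two-node cluster usually needs
to be told explicitly that two nodes are expected, e.g. in
/etc/corosync/corosync.conf (a minimal sketch):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }

Note that two_node: 1 implicitly enables wait_for_all, so after a full
cluster restart both nodes must be seen once before quorum is granted.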
Thanks
Gang
>
>
> --
> Regards,
> Muhammad Sharfuddin
>
> On 3/12/2018 10:59 AM, Gang He wrote:
>> Hello Muhammad,
>>
>> Usually, an ocfs2 resource startup failure is caused by the mount command
>> timing out (or hanging).
>> A simple debugging method is:
>> remove the ocfs2 resource from crm first,
>> then mount the file system manually and see whether the mount command
>> times out or hangs.
>> If the command hangs, check where the mount.ocfs2 process is stuck
>> via the "cat /proc/<pid>/stack" command.
>> If the back trace stops in the DLM kernel module, the root cause is
>> usually a cluster configuration problem.
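>>
>> A minimal sketch of that check (device and mount point as in the
>> configuration below; the PID is illustrative):
>>
>> # mount manually; this may hang if DLM is waiting for quorum
>> mount -t ocfs2 /dev/mapper/sapmnt /sapmnt &
>> pidof mount.ocfs2       # prints the PID of the hung mount, e.g. 19052
>> cat /proc/19052/stack   # kernel back trace of that process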
>>
>>
>> Thanks
>> Gang
>>
>>
>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>> Hello Muhammad,
>>>>
>>>> I think this problem is not in ocfs2; the cause looks like missing
>>>> cluster quorum.
>>>> For a two-node cluster (unlike a three-node cluster), if one node is
>>>> offline, quorum is lost by default.
>>>> So you should configure the two-node quorum settings according to the
>>>> pacemaker manual.
>>>> Then DLM can work normally, and the ocfs2 resource can start up.
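>>>>
>>>> The quorum state on the surviving node can be checked with, e.g.:
>>>>
>>>> corosync-quorumtool -s
>>>>
>>>> When the two-node settings are active, the "Flags" line should show
>>>> 2Node (and usually WaitForAll) and the "Quorate" line should say Yes.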
>>> Yes, it is configured accordingly; no-quorum-policy is set to "ignore".
>>>
>>> property cib-bootstrap-options: \
>>>     have-watchdog=true \
>>>     stonith-enabled=true \
>>>     stonith-timeout=80 \
>>>     startup-fencing=true \
>>>     no-quorum-policy=ignore
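>>>
>>> For context: no-quorum-policy is a Pacemaker-level property, set via the
>>> crm shell, e.g.:
>>>
>>> crm configure property no-quorum-policy=ignore
>>>
>>> dlm_controld queries corosync's own quorum directly, so this setting by
>>> itself does not unblock DLM.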
>>>
>>>> Thanks
>>>> Gang
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> This two-node cluster starts its resources when both nodes are online,
>>>>> but does not start the ocfs2 resources when one node is offline. E.g.,
>>>>> if I gracefully stop the cluster resources, then stop the pacemaker
>>>>> service on either node, and try to start the ocfs2 resource on the
>>>>> online node, it fails.
>>>>>
>>>>> logs:
>>>>>
>>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving
>>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>> pipci001 crmd[17733]: notice: Processing graph 2
>>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on
>>>>> pipci001
>>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69
>>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000
>>>>> locally on pipci001
>>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0
>>>>> locally on pipci001
>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>>>> /dev/mapper/sapmnt on /sapmnt
>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>>>> the lockspace group.
>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>>>> event done -512 0
>>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
>>>>> failed -512 0
>>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71
>>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on
>>>>> pipci001: Timed Out
>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>>> (target: 0 vs. rc: 1): Error
>>>>> crmd[17733]: notice: Transition aborted by operation
>>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>>> (target: 0 vs. rc: 1): Error
>>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0,
>>>>> Skipped=0, Incomplete=6,
>>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
>>>>> required
>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>> pipci001: unknown error (1)
>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>> pipci001: unknown error (1)
>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>> 1000000 failures (max=2)
>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>> 1000000 failures (max=2)
>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in
>>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
>>>>> required
>>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>> pipci001: unknown error (1)
>>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>>> pipci001: unknown error (1)
>>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>>> 1000000 failures (max=2)
>>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001
>>>>> after 1000000 failures (max=2)
>>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in
>>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
>>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0
>>>>> locally on pipci001
>>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
>>>>> on /sapmnt
>>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop
>>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt
>>>>> on pipci001: 0 (ok)
>>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on
>>>>> pipci001
>>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
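>>>>>
>>>>> The "wait for quorum" and "shutdown ignored, active lockspaces" lines
>>>>> above suggest dlm_controld is blocked on corosync quorum. A sketch of
>>>>> inspecting that with the dlm userspace tools:
>>>>>
>>>>> dlm_tool status    # cluster/quorum and fencing state as dlm sees it
>>>>> dlm_tool ls        # active lockspaces and their join/change state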
>>>>>
>>>>>
>>>>> resource configuration:
>>>>>
>>>>> primitive p-fssapmnt Filesystem \
>>>>>     params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>>>>>     op monitor interval=20 timeout=40 \
>>>>>     op start timeout=60 interval=0 \
>>>>>     op stop timeout=60 interval=0
>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>     op monitor interval=60 timeout=60 \
>>>>>     op start interval=0 timeout=90 \
>>>>>     op stop interval=0 timeout=100
>>>>> clone base-clone base-group \
>>>>>     meta interleave=true target-role=Started
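>>>>>
>>>>> (base-group itself is not shown above; presumably it groups dlm with
>>>>> the file systems, something like the following guess:)
>>>>>
>>>>> group base-group dlm p-fssapmnt p-fsusrsap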
>>>>>
>>>>> cluster properties:
>>>>> property cib-bootstrap-options: \
>>>>>     have-watchdog=true \
>>>>>     stonith-enabled=true \
>>>>>     stonith-timeout=80 \
>>>>>     startup-fencing=true \
>>>>>     no-quorum-policy=ignore
>>>>>
>>>>> Software versions:
>>>>>
>>>>> kernel version: 4.4.114-94.11-default
>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Muhammad Sharfuddin
>>>>>
>>>>
>>> --
>>> Regards,
>>> Muhammad Sharfuddin
>>>
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org