[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Mar 12 04:04:24 EDT 2018
Hi!
I didn't read the logs carefully, but I remember one pitfall (SLES 11):
If I formatted the filesystem while the OCFS2 services were not running, I was unable to mount it; I had to reformat the filesystem while the OCFS2 services were running.
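On recent ocfs2-tools the cluster stack can also be recorded on the volume explicitly at format time, which avoids the mismatch described above; a sketch only (the device path and cluster name below are placeholders, not taken from this thread):

```shell
# Format either while the cluster stack (corosync/DLM) is running, or
# record the stack explicitly on the volume. "pcmk" selects the
# pacemaker/corosync stack; device and cluster name are placeholders.
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster /dev/mapper/sapmnt
```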
Maybe that helps.
Regards,
Ulrich
>>> "Gang He" <ghe at suse.com> wrote on 12.03.2018 at 06:59 in message
<5AA687C8020000F9000AE79B at prv-mh.provo.novell.com>:
> Hello Muhammad,
>
> Usually, an ocfs2 resource startup failure is caused by the mount command
> timing out (or hanging).
> A simple debugging method is:
> remove the ocfs2 resource from crm first,
> then mount the file system manually and see whether the mount command
> times out or hangs.
> If the command hangs, check where the mount.ocfs2 process is stuck
> via the "cat /proc/xxx/stack" command.
> If the back trace stops in the DLM kernel module, the root cause is
> usually a cluster configuration problem.
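The steps above can be sketched as a shell session (resource, device, and mount point names are taken from this thread; the PID of the mount helper is found at run time):

```shell
# Take the ocfs2 resource out of crm first, then try the mount by hand.
crm resource stop p-fssapmnt
mount -t ocfs2 /dev/mapper/sapmnt /sapmnt &

# If the mount hangs, look at the kernel back trace of mount.ocfs2;
# a trace stuck inside the dlm module points at cluster configuration.
sleep 30
pid=$(pgrep -x mount.ocfs2)
[ -n "$pid" ] && cat /proc/"$pid"/stack
```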
>
>
> Thanks
> Gang
>
>
>>>>
>> On 3/12/2018 7:32 AM, Gang He wrote:
>>> Hello Muhammad,
>>>
>>> I think this problem is not in ocfs2; the cause looks like missing
>>> cluster quorum.
>>> For a two-node cluster (unlike a three-node cluster), if one node is
>>> offline, quorum is lost by default.
>>> So you should configure the two-node quorum settings according to the
>>> pacemaker manual.
>>> Then DLM can work normally, and the ocfs2 resource can start up.
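Note that dlm_controld takes its quorum state from corosync directly, so pacemaker's no-quorum-policy alone does not unblock it; the usual two-node setting lives in corosync.conf. A sketch, assuming corosync 2.x votequorum:

```
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    # two_node implies wait_for_all and lets the surviving
    # node retain quorum when its peer goes offline.
    two_node: 1
}
```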
>> Yes, it's configured accordingly; no-quorum-policy is set to "ignore".
>>
>> property cib-bootstrap-options: \
>> have-watchdog=true \
>> stonith-enabled=true \
>> stonith-timeout=80 \
>> startup-fencing=true \
>> no-quorum-policy=ignore
>>
>>>
>>> Thanks
>>> Gang
>>>
>>>
>>>> Hi,
>>>>
>>>> This two-node cluster starts its resources when both nodes are online,
>>>> but does not start the ocfs2 resources when one node is offline. E.g.,
>>>> if I gracefully stop the cluster resources, then stop the pacemaker
>>>> service on either node and try to start the ocfs2 resource on the
>>>> online node, it fails.
>>>>
>>>> logs:
>>>>
>>>> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
>>>> pipci001 pengine[17732]: notice: Calculated transition 2, saving
>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>> pipci001 crmd[17733]: notice: Processing graph 2
>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on
>>>> pipci001
>>>> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69
>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
>>>> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000
>>>> locally on pipci001
>>>> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0
>>>> locally on pipci001
>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>>> /dev/mapper/sapmnt on /sapmnt
>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>>> the lockspace group.
>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>>>> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>>> event done -512 0
>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
>>>> failed -512 0
>>>> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>>>> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71
>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>>>> crmd[17733]: error: Result of start operation for p-fssapmnt on
>>>> pipci001: Timed Out
>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>> (target: 0 vs. rc: 1): Error
>>>> crmd[17733]: notice: Transition aborted by operation
>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>>>> (target: 0 vs. rc: 1): Error
>>>> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0,
>>>> Skipped=0, Incomplete=6,
>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
>>>> required
>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>> 1000000 failures (max=2)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>> 1000000 failures (max=2)
>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Calculated transition 3, saving inputs in
>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
>>>> required
>>>> pengine[17732]: notice: On loss of CCM Quorum: Ignore
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[17732]: warning: Forcing base-clone away from pipci001 after
>>>> 1000000 failures (max=2)
>>>> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001
>>>> after 1000000 failures (max=2)
>>>> pengine[17732]: notice: Stop dlm:0#011(pipci001)
>>>> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
>>>> pengine[17732]: notice: Calculated transition 4, saving inputs in
>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0
>>>> locally on pipci001
>>>> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
>>>> on /sapmnt
>>>> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop
>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>>>> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt
>>>> on pipci001: 0 (ok)
>>>> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on
>>>> pipci001
>>>> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>>>>
>>>>
>>>> resource configuration:
>>>>
>>>> primitive p-fssapmnt Filesystem \
>>>> params device="/dev/mapper/sapmnt" directory="/sapmnt"
>>>> fstype=ocfs2 \
>>>> op monitor interval=20 timeout=40 \
>>>> op start timeout=60 interval=0 \
>>>> op stop timeout=60 interval=0
>>>> primitive dlm ocf:pacemaker:controld \
>>>> op monitor interval=60 timeout=60 \
>>>> op start interval=0 timeout=90 \
>>>> op stop interval=0 timeout=100
>>>> clone base-clone base-group \
>>>> meta interleave=true target-role=Started
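The base-group referenced by the clone is not shown above; with an interleaved clone it would typically order dlm before the filesystems, along these (hypothetical) lines:

```
group base-group dlm p-fssapmnt p-fsusrsap
```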
>>>>
>>>> cluster properties:
>>>> property cib-bootstrap-options: \
>>>> have-watchdog=true \
>>>> stonith-enabled=true \
>>>> stonith-timeout=80 \
>>>> startup-fencing=true \
>>>> no-quorum-policy=ignore
>>>>
>>>>
>>>> Software versions:
>>>>
>>>> kernel version: 4.4.114-94.11-default
>>>> pacemaker-1.1.16-4.8.x86_64
>>>> corosync-2.3.6-9.5.1.x86_64
>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>> libdlm3-4.0.7-1.28.x86_64
>>>> libdlm-4.0.7-1.28.x86_64
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Muhammad Sharfuddin
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Users mailing list: Users at clusterlabs.org
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>> --
>> Regards,
>> Muhammad Sharfuddin
>>