[ClusterLabs] single node fails to start the ocfs2 resource
Gang He
ghe at suse.com
Sun Mar 11 22:32:56 EDT 2018
Hello Muhammad,
I don't think this is an ocfs2 problem; it looks like the cluster has lost quorum.
The "wait for quorum" messages from dlm_controld in your log point that way.
In a two-node cluster (unlike a three-node cluster), quorum is lost by default as soon as one node goes offline.
So you should configure the two-node quorum settings as described in the pacemaker manual.
Once the remaining node has quorum, DLM can work normally and the ocfs2 resource can start.
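
For reference, a minimal sketch of such a two-node quorum setting. It assumes corosync 2.x with the votequorum provider, where the option is set in the quorum section of corosync.conf (documented in votequorum(5)) rather than in Pacemaker itself. The rest of the file (totem, nodelist, logging) is not shown and is assumed to be already in place:

```
quorum {
    # use the votequorum service for quorum calculation
    provider: corosync_votequorum
    # a two-node cluster has two votes in total
    expected_votes: 2
    # let the surviving node keep quorum when its peer goes offline
    two_node: 1
}
```

Note that two_node: 1 also enables wait_for_all by default, so after a complete cluster shutdown both nodes must be seen once before quorum is granted. After changing the file on both nodes and restarting corosync, `corosync-quorumtool -s` can be used to check whether the node is quorate.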
Thanks
Gang
>>>
> Hi,
>
> This two-node cluster starts resources when both nodes are online, but
> does not start the ocfs2 resources when one node is offline. E.g., if I
> gracefully stop the cluster resources, then stop the pacemaker service on
> either node, and try to start the ocfs2 resource on the online node, it
> fails.
>
> logs:
>
> pipci001 pengine[17732]: notice: Start dlm:0#011(pipci001)
> pengine[17732]: notice: Start p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Start p-fsusrsap:0#011(pipci001)
> pipci001 pengine[17732]: notice: Calculated transition 2, saving
> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
> pipci001 crmd[17733]: notice: Processing graph 2
> (ref=pe_calc-dc-1520613202-31) derived from
> /var/lib/pacemaker/pengine/pe-input-339.bz2
> crmd[17733]: notice: Initiating start operation dlm_start_0 locally on
> pipci001
> lrmd[17730]: notice: executing - rsc:dlm action:start call_id:69
> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
> lrmd[17730]: notice: finished - rsc:dlm action:start call_id:69
> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
> crmd[17733]: notice: Result of start operation for dlm on pipci001: 0 (ok)
> crmd[17733]: notice: Initiating monitor operation dlm_monitor_60000
> locally on pipci001
> crmd[17733]: notice: Initiating start operation p-fssapmnt_start_0
> locally on pipci001
> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:start call_id:71
> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
> /dev/mapper/sapmnt on /sapmnt
> kernel: [ 4576.529938] dlm: Using TCP for communications
> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
> the lockspace group.
> dlm_controld[19019]: 4629 fence work wait for quorum
> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
> lrmd[17730]: warning: p-fssapmnt_start_0 process (PID 19052) timed out
> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
> event done -512 0
> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
> failed -512 0
> lrmd[17730]: warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
> lrmd[17730]: notice: finished - rsc:p-fssapmnt action:start call_id:71
> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
> crmd[17733]: error: Result of start operation for p-fssapmnt on
> pipci001: Timed Out
> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]: notice: Transition aborted by operation
> p-fssapmnt_start_0 'modify' on pipci001: Event failed
> crmd[17733]: warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
> (target: 0 vs. rc: 1): Error
> crmd[17733]: notice: Transition 2 (Complete=5, Pending=0, Fired=0,
> Skipped=0, Incomplete=6,
> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]: notice: On loss of CCM Quorum: Ignore
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after
> 1000000 failures (max=2)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after
> 1000000 failures (max=2)
> pengine[17732]: notice: Stop dlm:0#011(pipci001)
> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Calculated transition 3, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-340.bz2
> pengine[17732]: notice: Watchdog will be used via SBD if fencing is
> required
> pengine[17732]: notice: On loss of CCM Quorum: Ignore
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]: warning: Processing failed op start for p-fssapmnt:0 on
> pipci001: unknown error (1)
> pengine[17732]: warning: Forcing base-clone away from pipci001 after
> 1000000 failures (max=2)
> pipci001 pengine[17732]: warning: Forcing base-clone away from pipci001
> after 1000000 failures (max=2)
> pengine[17732]: notice: Stop dlm:0#011(pipci001)
> pengine[17732]: notice: Stop p-fssapmnt:0#011(pipci001)
> pengine[17732]: notice: Calculated transition 4, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]: notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
> crmd[17733]: notice: Initiating stop operation p-fssapmnt_stop_0
> locally on pipci001
> lrmd[17730]: notice: executing - rsc:p-fssapmnt action:stop call_id:72
> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
> on /sapmnt
> pipci001 lrmd[17730]: notice: finished - rsc:p-fssapmnt action:stop
> call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
> pipci001 crmd[17733]: notice: Result of stop operation for p-fssapmnt
> on pipci001: 0 (ok)
> crmd[17733]: notice: Initiating stop operation dlm_stop_0 locally on
> pipci001
> pipci001 lrmd[17730]: notice: executing - rsc:dlm action:stop call_id:74
> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>
>
> resource configuration:
>
> primitive p-fssapmnt Filesystem \
>     params device="/dev/mapper/sapmnt" directory="/sapmnt" fstype=ocfs2 \
>     op monitor interval=20 timeout=40 \
>     op start timeout=60 interval=0 \
>     op stop timeout=60 interval=0
> primitive dlm ocf:pacemaker:controld \
>     op monitor interval=60 timeout=60 \
>     op start interval=0 timeout=90 \
>     op stop interval=0 timeout=100
> clone base-clone base-group \
>     meta interleave=true target-role=Started
>
> cluster properties:
> property cib-bootstrap-options: \
>     have-watchdog=true \
>     stonith-enabled=true \
>     stonith-timeout=80 \
>     startup-fencing=true \
>
>
> Software versions:
>
> kernel version: 4.4.114-94.11-default
> pacemaker-1.1.16-4.8.x86_64
> corosync-2.3.6-9.5.1.x86_64
> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
> ocfs2-tools-1.8.5-1.35.x86_64
> dlm-kmp-default-4.4.114-94.11.3.x86_64
> libdlm3-4.0.7-1.28.x86_64
> libdlm-4.0.7-1.28.x86_64
>
>
> --
> Regards,
> Muhammad Sharfuddin