[ClusterLabs] single node fails to start the ocfs2 resource

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Sat Mar 10 05:48:51 UTC 2018


On 3/10/2018 10:00 AM, Andrei Borzenkov wrote:
> 09.03.2018 19:55, Muhammad Sharfuddin wrote:
>> Hi,
>>
>> This two-node cluster starts resources when both nodes are online but
>> does not start the ocfs2 resources when one node is offline. E.g., if I
>> gracefully stop the cluster resources, then stop the pacemaker service
>> on either node, and try to start the ocfs2 resource on the online node,
>> it fails.
>>
>> logs:
>>
>> pipci001 pengine[17732]:   notice: Start   dlm:0#011(pipci001)
>> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
>> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
>> pipci001 pengine[17732]:   notice: Calculated transition 2, saving
>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>> pipci001 crmd[17733]:   notice: Processing graph 2
>> (ref=pe_calc-dc-1520613202-31) derived from
>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>> crmd[17733]:   notice: Initiating start operation dlm_start_0 locally on
>> pipci001
>> lrmd[17730]:   notice: executing - rsc:dlm action:start call_id:69
>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>> lrmd[17730]:   notice: finished - rsc:dlm action:start call_id:69
>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>> crmd[17733]:   notice: Result of start operation for dlm on pipci001: 0
>> (ok)
>> crmd[17733]:   notice: Initiating monitor operation dlm_monitor_60000
>> locally on pipci001
>> crmd[17733]:   notice: Initiating start operation p-fssapmnt_start_0
>> locally on pipci001
>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:start call_id:71
>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>> /dev/mapper/sapmnt on /sapmnt
>> kernel: [ 4576.529938] dlm: Using TCP for communications
>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>> the lockspace group.
>> dlm_controld[19019]: 4629 fence work wait for quorum
>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out
> That sounds like the problem. It attempts to fence the other node, but
> you do not have any fencing resources configured so it cannot work. You
> need to ensure you have a working fencing agent in your configuration.
SBD is configured and working in this cluster, and after multiple failed
attempts to start the ocfs2 resource, this standalone online node gets
fenced too.
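
For anyone reading along, the dlm_controld lines in the quoted log above
that report waiting for quorum can be picked out with a small filter. This
is only a minimal sketch, assuming the log has been saved as plain text
lines; the function name quorum_waits is hypothetical and the leading number
is simply whatever dlm_controld prints before its message.

```python
import re

# Matches dlm_controld lines such as
# "dlm_controld[19019]: 4629 fence work wait for quorum"
QUORUM_WAIT = re.compile(r"dlm_controld\[\d+\]:\s+(\d+)\s+(.*wait for quorum)")


def quorum_waits(lines):
    """Return (stamp, message) for each dlm_controld 'wait for quorum' line.

    quorum_waits is a hypothetical helper, not part of any cluster tool.
    """
    hits = []
    for line in lines:
        match = QUORUM_WAIT.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Sample lines copied from the log quoted in this thread.
sample = [
    "dlm_controld[19019]: 4575 dlm_controld 4.0.7 started",
    "dlm_controld[19019]: 4629 fence work wait for quorum",
    "dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum",
    "lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052) timed out",
]

for stamp, message in quorum_waits(sample):
    print(stamp, message)
```

On the sample above this reports the two "wait for quorum" entries that
precede the p-fssapmnt start timeout.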

logs:
pengine[17732]:  warning: Scheduling Node pipci001 for STONITH
pengine[17732]:   notice: Stop of failed resource dlm:0 is implicit 
after pipci001 is fenced
pengine[17732]:   notice:  * Fence pipci001
pengine[17732]:   notice: Stop    sbd-stonith#011(pipci001)
pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
pengine[17732]:  warning: Calculated transition 6 (with warnings), 
saving inputs in /var/lib/pacemaker/pengine/pe-warn-15.bz2
2018-03-09T21:03:30.588865+05:00 pipci002 crmd[13030]:   notice: 
Processing graph 6 (ref=pe_calc-dc-1520611410-34) derived from 
/var/lib/pacemaker/pengine/pe-warn-15.bz2
crmd[17733]:   notice: Requesting fencing (reboot) of node pipci001
stonith-ng[13026]:   notice: Client crmd.13030.f5570444 wants to fence 
(reboot) 'pipci001' with device '(any)'
stonith-ng[13026]:   notice: Requesting peer fencing (reboot) of pipci001
stonith-ng[13026]:   notice: sbd-stonith can fence (rebo

Also, as mentioned, this cluster starts the resources when both nodes are
online, and stonith is enabled and works too.

cluster properties:
property cib-bootstrap-options: \
         have-watchdog=true \
         stonith-enabled=true \
         stonith-timeout=80 \
         startup-fencing=true \


>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>> event done -512 0
>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group join
>> failed -512 0
>> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out after 60000ms
>> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start call_id:71
>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on (node 0)
>> crmd[17733]:    error: Result of start operation for p-fssapmnt on
>> pipci001: Timed Out
>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>> (target: 0 vs. rc: 1): Error
>> crmd[17733]:   notice: Transition aborted by operation
>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on pipci001 failed
>> (target: 0 vs. rc: 1): Error
>> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0, Fired=0,
>> Skipped=0, Incomplete=6,
>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
>> required
>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>> pipci001: unknown error (1)
>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>> pipci001: unknown error (1)
>> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
>> 1000000 failures (max=2)
>> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
>> 1000000 failures (max=2)
>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
>> pengine[17732]:   notice: Calculated transition 3, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>> pengine[17732]:   notice: Watchdog will be used via SBD if fencing is
>> required
>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>> pipci001: unknown error (1)
>> pengine[17732]:  warning: Processing failed op start for p-fssapmnt:0 on
>> pipci001: unknown error (1)
>> pengine[17732]:  warning: Forcing base-clone away from pipci001 after
>> 1000000 failures (max=2)
>> pipci001 pengine[17732]:  warning: Forcing base-clone away from pipci001
>> after 1000000 failures (max=2)
>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
>> pengine[17732]:   notice: Calculated transition 4, saving inputs in
>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>> crmd[17733]:   notice: Processing graph 4 (ref=pe_calc-dc-1520613263-36)
>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>> crmd[17733]:   notice: Initiating stop operation p-fssapmnt_stop_0
>> locally on pipci001
>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:stop call_id:72
>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for /dev/mapper/sapmnt
>> on /sapmnt
>> pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:stop
>> call_id:72 pid:19189 exit-code:0 exec-time:83ms queue-time:0ms
>> pipci001 crmd[17733]:   notice: Result of stop operation for p-fssapmnt
>> on pipci001: 0 (ok)
>> crmd[17733]:   notice: Initiating stop operation dlm_stop_0 locally on
>> pipci001
>> pipci001 lrmd[17730]:   notice: executing - rsc:dlm action:stop call_id:74
>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active lockspaces
>>
>>
>> resource configuration:
>>
>> primitive p-fssapmnt Filesystem \
>>          params device="/dev/mapper/sapmnt" directory="/sapmnt"
>> fstype=ocfs2 \
>>          op monitor interval=20 timeout=40 \
>>          op start timeout=60 interval=0 \
>>          op stop timeout=60 interval=0
>> primitive dlm ocf:pacemaker:controld \
>>          op monitor interval=60 timeout=60 \
>>          op start interval=0 timeout=90 \
>>          op stop interval=0 timeout=100
>> clone base-clone base-group \
>>          meta interleave=true target-role=Started
>>
>> cluster properties:
>> property cib-bootstrap-options: \
>>          have-watchdog=true \
>>          stonith-enabled=true \
>>          stonith-timeout=80 \
>>          startup-fencing=true \
>>
>>
>> Software versions:
>>
>> kernel version: 4.4.114-94.11-default
>> pacemaker-1.1.16-4.8.x86_64
>> corosync-2.3.6-9.5.1.x86_64
>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>> ocfs2-tools-1.8.5-1.35.x86_64
>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>> libdlm3-4.0.7-1.28.x86_64
>> libdlm-4.0.7-1.28.x86_64
>>
>>
>

--
Regards,
Muhammad Sharfuddin

