[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Tue Mar 13 09:30:10 EDT 2018


Yes, by saying pacemaker I meant corosync as well.

Is there any fix? Or is it simply that a two-node cluster can't run ocfs2 
resources when one node is offline?

--
Regards,
Muhammad Sharfuddin

On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>> Hi,
>>
>> 1 - if I put a node (node2) offline, the ocfs2 resources keep running
>> on the online node (node1).
>>
>> 2 - while node2 was offline, I stopped/started the ocfs2 resource
>> group via the cluster successfully several times in a row.
>>
>> 3 - while node2 was offline, I restarted the pacemaker service on
>> node1 and then tried to start the ocfs2 resource group; dlm started
>> but the ocfs2 file system resource did not start.
>>
>> In a nutshell:
>>
>> a - both nodes must be online to start the ocfs2 resources.
>>
>> b - if one node crashes or goes offline (gracefully), the ocfs2
>> resources keep running on the other/surviving node.
>>
>> c - while one node is offline, we can stop/start the ocfs2 resource
>> group on the surviving node, but if we restart the pacemaker service,
>> the ocfs2 file system resource then fails to start, with the following
>> info in the logs:
> From the logs I would say the startup of dlm_controld times out
> because it is waiting for quorum - which doesn't happen because of
> wait-for-all.
> The question is whether you really just stopped pacemaker or whether
> you stopped corosync as well.
> In the latter case I would say it is the expected behavior.
>
> Regards,
> Klaus
>   
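For reference, the corosync votequorum settings that produce this behavior
on a two-node cluster look roughly like the following (a sketch only; the
actual corosync.conf of these nodes is not shown in this thread):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node: 1 implicitly enables wait_for_all, so after a full
        # cluster restart the surviving node does not become quorate
        # until it has seen the other node at least once.
        # wait_for_all: 0 drops that requirement, at the cost of a higher
        # split-brain risk if fencing ever fails.
        wait_for_all: 0
    }

Note that dlm_controld takes its quorum state from corosync, so pacemaker's
no-quorum-policy=ignore does not influence it.
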
>> lrmd[4317]:   notice: executing - rsc:p-fssapmnt action:start call_id:53
>> Filesystem(p-fssapmnt)[5139]: INFO: Running start for
>> /dev/mapper/sapmnt on /sapmnt
>> kernel: [  706.162676] dlm: Using TCP for communications
>> kernel: [  706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>> the lockspace group...
>> dlm_controld[5105]: 759 fence work wait for quorum
>> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for quorum
>> lrmd[4317]:  warning: p-fssapmnt_start_0 process (PID 5139) timed out
>> lrmd[4317]:  warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
>> lrmd[4317]:   notice: finished - rsc:p-fssapmnt action:start
>> call_id:53 pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
>> kernel: [  766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>> event done -512 0
>> kernel: [  766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>> join failed -512 0
>> crmd[4320]:   notice: Result of stop operation for p-fssapmnt on
>> pipci001: 0 (ok)
>> crmd[4320]:   notice: Initiating stop operation dlm_stop_0 locally on
>> pipci001
>> lrmd[4317]:   notice: executing - rsc:dlm action:stop call_id:56
>> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
>> lrmd[4317]:  warning: dlm_stop_0 process (PID 5326) timed out
>> lrmd[4317]:  warning: dlm_stop_0:5326 - timed out after 100000ms
>> lrmd[4317]:   notice: finished - rsc:dlm action:stop call_id:56
>> pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
>> crmd[4320]:    error: Result of stop operation for dlm on pipci001:
>> Timed Out
>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>> (target: 0 vs. rc: 1): Error
>> crmd[4320]:   notice: Transition aborted by operation dlm_stop_0
>> 'modify' on pipci001: Event failed
>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>> (target: 0 vs. rc: 1): Error
>> pengine[4319]:   notice: Watchdog will be used via SBD if fencing is
>> required
>> pengine[4319]:   notice: On loss of CCM Quorum: Ignore
>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>> pipci001: unknown error (1)
>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>> pipci001: unknown error (1)
>> pengine[4319]:  warning: Cluster node pipci001 will be fenced: dlm:0
>> failed there
>> pengine[4319]:  warning: Processing failed op start for p-fssapmnt:0
>> on pipci001: unknown error (1)
>> pengine[4319]:   notice: Stop of failed resource dlm:0 is implicit
>> after pipci001 is fenced
>> pengine[4319]:   notice:  * Fence pipci001
>> pengine[4319]:   notice: Stop    sbd-stonith#011(pipci001)
>> pengine[4319]:   notice: Stop    dlm:0#011(pipci001)
>> crmd[4320]:   notice: Requesting fencing (reboot) of node pipci001
>> stonith-ng[4316]:   notice: Client crmd.4320.4c2f757b wants to fence
>> (reboot) 'pipci001' with device '(any)'
>> stonith-ng[4316]:   notice: Requesting peer fencing (reboot) of pipci001
>> stonith-ng[4316]:   notice: sbd-stonith can fence (reboot) pipci001:
>> dynamic-list
>>
>>
>> -- 
>> Regards,
>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>
>> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>>> Hi!
>>>
>>> I'd recommend this:
>>> Cleanly boot your nodes, avoiding any manual operation with cluster
>>> resources. Keep the logs.
>>> Then start your tests, keeping the logs for each.
>>> Try to fix issues by reading the logs and adjusting the cluster
>>> configuration, and not by starting commands that the cluster should
>>> start.
>>>
>>> We had a 2-node OCFS2 cluster running for quite some time with
>>> SLES11, but now the cluster has three nodes. To me the output of
>>> "crm_mon -1Arfj", combined with having set record-pending=true, was
>>> very valuable for finding problems.
>>>
>>> Regards,
>>> Ulrich
>>>
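For anyone wanting to reproduce Ulrich's monitoring setup, a minimal sketch
(crmsh syntax assumed):

    # record pending operations in the CIB so crm_mon can display them
    crm configure op_defaults record-pending=true

    # one-shot cluster status: node attributes (-A), inactive resources (-r),
    # fail counts (-f) and pending operations (-j)
    crm_mon -1Arfj
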
>>>
>>>>>> Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk> wrote on
>>>>>> 13.03.2018 at 08:43 in
>>> message <7b773ae9-4209-d246-b5c0-2c8b67e623b3 at nds.com.pk>:
>>>> Dear Klaus,
>>>>
>>>> If I understand you properly, it's a fencing issue, and what I am
>>>> facing is "natural" or "by design" in a two-node cluster where
>>>> quorum is incomplete.
>>>>
>>>> I am quite convinced you have pointed out the right thing because,
>>>> when I start the dlm resource via the cluster and then try to mount
>>>> the ocfs2 file system manually from the command line, the mount
>>>> command hangs and the following events are reported in the logs:
>>>>
>>>>        kernel: [62622.864828] ocfs2: Registered cluster interface user
>>>>        kernel: [62622.884427] dlm: Using TCP for communications
>>>>        kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>> joining the lockspace group...
>>>>        dlm_controld[17655]: 62627 fence work wait for quorum
>>>>        dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait
>>>> for quorum
>>>>
>>>> and then the following messages keep being reported every 5-10
>>>> minutes, until I kill the mount.ocfs2 process:
>>>>
>>>>        dlm_controld[17655]: 62627 fence work wait for quorum
>>>>        dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9 wait
>>>> for quorum
>>>>
>>>> I am also very confused, because yesterday I did the same and was
>>>> able to mount the ocfs2 file system manually from the command line
>>>> (at least once), then unmount the file system manually, stop the dlm
>>>> resource via the cluster, and then start/stop the complete ocfs2
>>>> resource stack (dlm, file systems) successfully via the cluster even
>>>> when only one machine was online.
>>>>
>>>> So, in a two-node cluster with ocfs2 resources, we can't run the
>>>> ocfs2 resources when quorum is incomplete (one node is offline)?
>>>>
>>>> -- 
>>>> Regards,
>>>> Muhammad Sharfuddin
>>>>
>>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>>> Hi Klaus,
>>>>>>
>>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>>            op monitor interval=3000 timeout=20 \
>>>>>>            op start interval=0 timeout=240 \
>>>>>>            op stop interval=0 timeout=100 \
>>>>>>            params sbd_device="/dev/mapper/sbd" \
>>>>>>            meta target-role=Started
>>>>> Makes more sense now.
>>>>> Using pcmk_delay_max would probably be useful here
>>>>> to prevent a fence-race.
>>>>> That stonith-resource was not in your resource-list below ...
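A sketch of Klaus's pcmk_delay_max suggestion applied to the primitive
quoted above (the 30s value is only an example):

    primitive sbd-stonith stonith:external/sbd \
            op monitor interval=3000 timeout=20 \
            op start interval=0 timeout=240 \
            op stop interval=0 timeout=100 \
            params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30s \
            meta target-role=Started

With a random delay of up to pcmk_delay_max before fencing, the two nodes
are unlikely to shoot each other at exactly the same moment if they lose
contact.
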
>>>>>
>>>>>> property cib-bootstrap-options: \
>>>>>>            have-watchdog=true \
>>>>>>            stonith-enabled=true \
>>>>>>            no-quorum-policy=ignore \
>>>>>>            stonith-timeout=90 \
>>>>>>            startup-fencing=true
>>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>>> Whether or not this is a good idea in your setup is
>>>>> another question.
>>>>> But isn't dlm interfacing directly with corosync, so
>>>>> that it gets the quorum state from there?
>>>>> As you probably have two_node set on a 2-node cluster,
>>>>> this would - after both nodes go down - wait for all
>>>>> nodes to come up first.
>>>>>
>>>>> Regards,
>>>>> Klaus
>>>>>
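A quick way to see what corosync itself thinks while only one node is up
(and hence what dlm_controld is waiting for) is:

    # show quorum status and the votequorum flags; check whether the node
    # is reported as quorate and whether 2Node/WaitForAll are listed
    corosync-quorumtool -s
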
>>>>>> # ps -eaf |grep sbd
>>>>>> root      6129     1  0 17:35 ?        00:00:00 sbd: inquisitor
>>>>>> root      6133  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>> /dev/mapper/sbd - slot: 1 - uuid:
>>>>>> 6e80a337-95db-4608-bd62-d59517f39103
>>>>>> root      6134  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>> Pacemaker
>>>>>> root      6135  6129  0 17:35 ?        00:00:00 sbd: watcher: Cluster
>>>>>>
>>>>>> This cluster does not start the ocfs2 resources when I first
>>>>>> intentionally crash (reboot) both nodes and then try to start the
>>>>>> ocfs2 resources while one node is offline.
>>>>>>
>>>>>> To fix the issue I have one permanent solution: bring the other
>>>>>> (offline) node online and things get fixed automatically, i.e. the
>>>>>> ocfs2 resources mount.
>>>>>>
>>>>>> -- 
>>>>>> Regards,
>>>>>> Muhammad Sharfuddin
>>>>>>
>>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>>> Hi Muhammad!
>>>>>>>
>>>>>>> Could you elaborate a little more on your fencing setup?
>>>>>>> I read that you are using SBD but I don't see any SBD fencing resource.
>>>>>>> In case you wanted to use watchdog-fencing with SBD, this would
>>>>>>> require the stonith-watchdog-timeout property to be set.
>>>>>>> But watchdog-fencing relies on quorum (without two-node trickery)
>>>>>>> and thus wouldn't work on a 2-node cluster anyway.
>>>>>>>
>>>>>>> Didn't read through the whole thread - so I might be missing
>>>>>>> something ...
>>>>>>>
>>>>>>> Regards,
>>>>>>> Klaus
>>>>>>>
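For completeness, watchdog-only (diskless) SBD fencing would be enabled
roughly like this; as Klaus notes, it depends on real quorum and is
therefore a poor fit for a two-node cluster (values are illustrative):

    # /etc/sysconfig/sbd  (no SBD_DEVICE configured in diskless mode)
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # pacemaker property; should be roughly twice the watchdog timeout
    crm configure property stonith-watchdog-timeout=10s
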
>>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>>> Hello Gang,
>>>>>>>>
>>>>>>>> As mentioned, the cluster was previously fixed so that it would
>>>>>>>> start the ocfs2 resources by:
>>>>>>>>
>>>>>>>> a) crm resource start dlm
>>>>>>>>
>>>>>>>> b) mounting/umounting the ocfs2 file system manually (this step
>>>>>>>> was the fix)
>>>>>>>>
>>>>>>>> and then starting the clone group (which includes dlm and the
>>>>>>>> ocfs2 file systems) worked fine:
>>>>>>>>
>>>>>>>> c) crm resource start base-clone.
>>>>>>>>
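In command form, that workaround was roughly the following (device and
mount point taken from the resource configuration quoted further down;
only the sapmnt file system is shown):

    crm resource start dlm                       # step a
    mount -t ocfs2 /dev/mapper/sapmnt /sapmnt    # step b: manual mount ...
    umount /sapmnt                               #         ... and umount
    crm resource start base-clone                # step c
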
>>>>>>>> Now I crashed the nodes intentionally and then kept only one node
>>>>>>>> online; again the cluster stopped starting the ocfs2 resources. I
>>>>>>>> again tried to follow your instructions, i.e.
>>>>>>>>
>>>>>>>> i) crm resource start dlm
>>>>>>>>
>>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>>> this time (previously mounting manually helped me):
>>>>>>>>
>>>>>>>> # cat /proc/3966/stack
>>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350
>>>>>>>> [ocfs2_stack_user]
>>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240
>>>>>>>> [ocfs2_stackglue]
>>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>
>>>>>>>> I killed the mount.ocfs2 process, stopped the dlm resource
>>>>>>>> (crm resource stop dlm), and then tried to start the dlm again
>>>>>>>> (crm resource start dlm), which previously had always started
>>>>>>>> successfully; this time dlm didn't start, and I checked the
>>>>>>>> dlm_controld process:
>>>>>>>>
>>>>>>>> cat /proc/3754/stack
>>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>
>>>>>>>> In a nutshell:
>>>>>>>>
>>>>>>>> 1 - this cluster is configured to run when a single node is online.
>>>>>>>>
>>>>>>>> 2 - this cluster does not start the ocfs2 resources after a crash
>>>>>>>> when only one node is online.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Regards,
>>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>>
>>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>>> Hello Gang,
>>>>>>>>>>
>>>>>>>>>> To follow your instructions, I started the dlm resource via:
>>>>>>>>>>
>>>>>>>>>>           crm resource start dlm
>>>>>>>>>>
>>>>>>>>>> then mounted/unmounted the ocfs2 file system manually (which
>>>>>>>>>> seems to be the fix for the situation).
>>>>>>>>>>
>>>>>>>>>> Now the resources are getting started properly on a single node.
>>>>>>>>>> I am happy that the issue is fixed, but at the same time I am
>>>>>>>>>> lost, because I have no idea how things got fixed here (merely
>>>>>>>>>> by mounting/unmounting the ocfs2 file systems).
>>>>>>>>> From your description, I suspect the DLM resource does not work
>>>>>>>>> normally in that situation.
>>>>>>>>> Yan/Bin, do you have any comments about two-node clusters? Which
>>>>>>>>> configuration settings will affect corosync quorum/DLM?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Gang
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Regards,
>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>
>>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>
>>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by the
>>>>>>>>>>> mount command timing out (or hanging).
>>>>>>>>>>> A simple debugging method is: remove the ocfs2 resource from
>>>>>>>>>>> crm first, then mount the file system manually and see whether
>>>>>>>>>>> the mount command times out or hangs.
>>>>>>>>>>> If the command hangs, please check where the mount.ocfs2
>>>>>>>>>>> process is stuck via the "cat /proc/xxx/stack" command.
>>>>>>>>>>> If the back trace stops in the DLM kernel module, the root
>>>>>>>>>>> cause is usually a cluster configuration problem.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Gang
>>>>>>>>>>>
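Spelled out for this particular file system, Gang's procedure would look
something like this (PID handling is illustrative):

    # make sure the cluster is not managing the mount, e.g.:
    crm resource stop p-fssapmnt

    # try the mount by hand, in the background so the shell stays usable
    mount -t ocfs2 /dev/mapper/sapmnt /sapmnt &

    # if it hangs, see where mount.ocfs2 is stuck
    pid=$(pidof mount.ocfs2)
    cat /proc/$pid/stack
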
>>>>>>>>>>>
>>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks like
>>>>>>>>>>>>> cluster quorum being lost.
>>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if one
>>>>>>>>>>>>> node is offline, quorum will be lost by default.
>>>>>>>>>>>>> So you should configure the two-node related quorum settings
>>>>>>>>>>>>> according to the pacemaker manual.
>>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can start
>>>>>>>>>>>>> up.
>>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set to "ignore".
>>>>>>>>>>>>
>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>                 have-watchdog=true \
>>>>>>>>>>>>                 stonith-enabled=true \
>>>>>>>>>>>>                 stonith-timeout=80 \
>>>>>>>>>>>>                 startup-fencing=true \
>>>>>>>>>>>>                 no-quorum-policy=ignore
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This two-node cluster starts resources when both nodes are
>>>>>>>>>>>>>> online but does not start the ocfs2 resources when one node
>>>>>>>>>>>>>> is offline. E.g. if I gracefully stop the cluster resources,
>>>>>>>>>>>>>> then stop the pacemaker service on either node, and then try
>>>>>>>>>>>>>> to start the ocfs2 resource on the online node, it fails.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Start
>>>>>>>>>>>>>> dlm:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Start   p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Start   p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Calculated transition 2,
>>>>>>>>>>>>>> saving
>>>>>>>>>>>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Processing graph 2
>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation dlm_start_0
>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:dlm action:start
>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:dlm action:start
>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>>> crmd[17733]:   notice: Result of start operation for dlm on
>>>>>>>>>>>>>> pipci001: 0 (ok)
>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating monitor operation
>>>>>>>>>>>>>> dlm_monitor_60000
>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation
>>>>>>>>>>>>>> p-fssapmnt_start_0
>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt
>>>>>>>>>>>>>> action:start
>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>>>>>>>>>>>>> /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>>> kernel: [ 4576.530233] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>> joining
>>>>>>>>>>>>>> the lockspace group.
>>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9
>>>>>>>>>>>>>> wait
>>>>>>>>>>>>>> for quorum
>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID 19052)
>>>>>>>>>>>>>> timed out
>>>>>>>>>>>>>> kernel: [ 4636.418223] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>> group
>>>>>>>>>>>>>> event done -512 0
>>>>>>>>>>>>>> kernel: [ 4636.418227] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>> group join
>>>>>>>>>>>>>> failed -512 0
>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out
>>>>>>>>>>>>>> after 60000ms
>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:p-fssapmnt action:start
>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on
>>>>>>>>>>>>>> (node 0)
>>>>>>>>>>>>>> crmd[17733]:    error: Result of start operation for
>>>>>>>>>>>>>> p-fssapmnt on
>>>>>>>>>>>>>> pipci001: Timed Out
>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>> crmd[17733]:   notice: Transition aborted by operation
>>>>>>>>>>>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0,
>>>>>>>>>>>>>> Fired=0,
>>>>>>>>>>>>>> Skipped=0, Incomplete=6,
>>>>>>>>>>>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2): Complete
>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>> required
>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> after
>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> after
>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 3, saving
>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>> required
>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> after
>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>> pipci001 pengine[17732]:  warning: Forcing base-clone away
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> after 1000000 failures (max=2)
>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 4, saving
>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>> crmd[17733]:   notice: Processing graph 4
>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613263-36)
>>>>>>>>>>>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation
>>>>>>>>>>>>>> p-fssapmnt_stop_0
>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt action:stop
>>>>>>>>>>>>>> call_id:72
>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for
>>>>>>>>>>>>>> /dev/mapper/sapmnt
>>>>>>>>>>>>>> on /sapmnt
>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt
>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms
>>>>>>>>>>>>>> queue-time:0ms
>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Result of stop operation for
>>>>>>>>>>>>>> p-fssapmnt
>>>>>>>>>>>>>> on pipci001: 0 (ok)
>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation dlm_stop_0
>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: executing - rsc:dlm
>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>> call_id:74
>>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active
>>>>>>>>>>>>>> lockspaces
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>>                 params device="/dev/mapper/sapmnt"
>>>>>>>>>>>>>> directory="/sapmnt"
>>>>>>>>>>>>>> fstype=ocfs2 \
>>>>>>>>>>>>>>                 op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>>                 op start timeout=60 interval=0 \
>>>>>>>>>>>>>>                 op stop timeout=60 interval=0
>>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>>                 op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>>                 op start interval=0 timeout=90 \
>>>>>>>>>>>>>>                 op stop interval=0 timeout=100
>>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>>                 meta interleave=true target-role=Started
>>>>>>>>>>>>>>
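The base-group referenced by the clone is not included in this message;
judging from the resources that start and stop together in the logs it is
presumably something like the following (hypothetical):

    # hypothetical - the actual group definition is not shown in the thread
    group base-group dlm p-fssapmnt p-fsusrsap
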
>>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>                 have-watchdog=true \
>>>>>>>>>>>>>>                 stonith-enabled=true \
>>>>>>>>>>>>>>                 stonith-timeout=80 \
>>>>>>>>>>>>>>                 startup-fencing=true \
>>>>>>>>>>>>>>                 no-quorum-policy=ignore
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Muhammad Sharfuddin