[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Tue Mar 13 10:43:50 EDT 2018


Thanks a lot for the explanation. But other than the ocfs2 resource
group, this cluster starts all other resources on a single node without
any issue, because of the "no-quorum-policy=ignore" option.

--
Regards,
Muhammad Sharfuddin

On 3/13/2018 7:32 PM, Klaus Wenninger wrote:
> On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
>> Yes, by saying pacemaker,  I meant to say corosync as well.
>>
>> Is there any fix? Or is it that a two-node cluster can't run ocfs2
>> resources when one node is offline?
> Actually there can't be a "fix" as 2 nodes are just not enough
> for a partial-cluster to be quorate in the classical sense
> (more votes than half of the cluster nodes).
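Klaus's classical majority rule can be written down in a few lines of shell (purely an illustration of the arithmetic, not code from corosync):

```shell
# classical majority quorum: strictly more votes present than half the
# total; integer division makes "present > total/2" the majority test
quorate() {
    present=$1; total=$2
    [ "$present" -gt $(( total / 2 )) ] && echo quorate || echo inquorate
}

quorate 1 2   # a lone node in a 2-node cluster is never quorate
quorate 2 2   # both nodes present: quorate
quorate 2 3   # three nodes tolerate one failure
```

For a 2-node cluster the lone-survivor case can never reach a majority, which is exactly why the special 2-node handling below exists.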
>
> So to still be able to use it, we have the 2-node config that
> permanently sets quorum. But to avoid running into issues on
> startup, we need it to require that both nodes have seen each
> other once.
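The 2-node behaviour Klaus describes maps to corosync's votequorum options; a minimal sketch of the relevant corosync.conf fragment (illustrative, not taken from the poster's actual configuration):

```
quorum {
    provider: corosync_votequorum
    # allow the cluster to be quorate with a single vote out of two;
    # setting two_node implicitly enables wait_for_all
    two_node: 1
    # after a full cluster shutdown, only grant quorum once all nodes
    # have been seen at least once
    wait_for_all: 1
}
```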
>
> So this is definitely nothing that is specific to ocfs2.
> It just looks specific to ocfs2 because you've disabled
> quorum for pacemaker.
> To be honest, doing this you wouldn't need a resource-manager
> at all and could just start up your services using systemd.
>
> If you don't want a full 3rd node, and still want to handle cases
> where one node doesn't come up after a full shutdown of
> all nodes, you probably could go for a setup with qdevice.
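A qdevice setup adds a third vote from an external arbitrator daemon (qnetd); a sketch of the corosync.conf quorum section for it, where qnetd-server is a placeholder host name:

```
quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            # placeholder for the host running corosync-qnetd
            host: qnetd-server
            # ffsplit resolves fifty-fifty splits deterministically,
            # which suits a two-node cluster
            algorithm: ffsplit
        }
    }
}
```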
>
> Regards,
> Klaus
>
>> -- 
>> Regards,
>> Muhammad Sharfuddin
>>
>> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>>> Hi,
>>>>
>>>> 1 - if I put a node (node2) offline, ocfs2 resources keep running
>>>> on the online node (node1).
>>>>
>>>> 2 - while node2 was offline, via the cluster I stopped/started the
>>>> ocfs2 resource group successfully many times in a row.
>>>>
>>>> 3 - while node2 was offline, I restarted the pacemaker service on
>>>> node1 and then tried to start the ocfs2 resource group; dlm started
>>>> but the ocfs2 file system resource did not start.
>>>>
>>>> Nutshell:
>>>>
>>>> a - both nodes must be online to start the ocfs2 resource.
>>>>
>>>> b - if one node crashes or goes offline (gracefully), the ocfs2
>>>> resource keeps running on the other/surviving node.
>>>>
>>>> c - while one node was offline, we could stop/start the ocfs2
>>>> resource group on the surviving node, but if we stop the pacemaker
>>>> service, then the ocfs2 file system resource does not start, with
>>>> the following info in the logs:
>>> From the logs I would say startup of dlm_controld times out because
>>> it is waiting for quorum - which doesn't happen because of
>>> wait-for-all.
>>> The question is whether you really just stopped pacemaker or stopped
>>> corosync as well.
>>> In the latter case I would say it is the expected behavior.
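Whether dlm_controld is really blocked on quorum can be checked directly on the surviving node; a sketch, assuming the stock corosync and dlm command-line tools are installed:

```
# show vote counts and quorum flags - with the 2-node config you should
# see "2Node" and "WaitForAll" in the Flags line
corosync-quorumtool -s

# show dlm_controld's own view, including whether it is waiting for
# quorum or for fencing
dlm_tool status
```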
>>>
>>> Regards,
>>> Klaus
>>>   
>>>> lrmd[4317]:   notice: executing - rsc:p-fssapmnt action:start
>>>> call_id:53
>>>> Filesystem(p-fssapmnt)[5139]: INFO: Running start for
>>>> /dev/mapper/sapmnt on /sapmnt
>>>> kernel: [  706.162676] dlm: Using TCP for communications
>>>> kernel: [  706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>>> the lockspace group...
>>>> dlm_controld[5105]: 759 fence work wait for quorum
>>>> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for
>>>> quorum
>>>> lrmd[4317]:  warning: p-fssapmnt_start_0 process (PID 5139) timed out
>>>> lrmd[4317]:  warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
>>>> lrmd[4317]:   notice: finished - rsc:p-fssapmnt action:start
>>>> call_id:53 pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
>>>> kernel: [  766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>>> event done -512 0
>>>> kernel: [  766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>>> join failed -512 0
>>>> crmd[4320]:   notice: Result of stop operation for p-fssapmnt on
>>>> pipci001: 0 (ok)
>>>> crmd[4320]:   notice: Initiating stop operation dlm_stop_0 locally on
>>>> pipci001
>>>> lrmd[4317]:   notice: executing - rsc:dlm action:stop call_id:56
>>>> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
>>>> lrmd[4317]:  warning: dlm_stop_0 process (PID 5326) timed out
>>>> lrmd[4317]:  warning: dlm_stop_0:5326 - timed out after 100000ms
>>>> lrmd[4317]:   notice: finished - rsc:dlm action:stop call_id:56
>>>> pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
>>>> crmd[4320]:    error: Result of stop operation for dlm on pipci001:
>>>> Timed Out
>>>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>>>> (target: 0 vs. rc: 1): Error
>>>> crmd[4320]:   notice: Transition aborted by operation dlm_stop_0
>>>> 'modify' on pipci001: Event failed
>>>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>>>> (target: 0 vs. rc: 1): Error
>>>> pengine[4319]:   notice: Watchdog will be used via SBD if fencing is
>>>> required
>>>> pengine[4319]:   notice: On loss of CCM Quorum: Ignore
>>>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>>>> pipci001: unknown error (1)
>>>> pengine[4319]:  warning: Cluster node pipci001 will be fenced: dlm:0
>>>> failed there
>>>> pengine[4319]:  warning: Processing failed op start for p-fssapmnt:0
>>>> on pipci001: unknown error (1)
>>>> pengine[4319]:   notice: Stop of failed resource dlm:0 is implicit
>>>> after pipci001 is fenced
>>>> pengine[4319]:   notice:  * Fence pipci001
>>>> pengine[4319]:   notice: Stop    sbd-stonith#011(pipci001)
>>>> pengine[4319]:   notice: Stop    dlm:0#011(pipci001)
>>>> crmd[4320]:   notice: Requesting fencing (reboot) of node pipci001
>>>> stonith-ng[4316]:   notice: Client crmd.4320.4c2f757b wants to fence
>>>> (reboot) 'pipci001' with device '(any)'
>>>> stonith-ng[4316]:   notice: Requesting peer fencing (reboot) of
>>>> pipci001
>>>> stonith-ng[4316]:   notice: sbd-stonith can fence (reboot) pipci001:
>>>> dynamic-list
>>>>
>>>>
>>>> -- 
>>>> Regards,
>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>
>>>> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>>>>> Hi!
>>>>>
>>>>> I'd recommend this:
>>>>> Cleanly boot your nodes, avoiding any manual operation with cluster
>>>>> resources. Keep the logs.
>>>>> Then start your tests, keeping the logs for each.
>>>>> Try to fix issues by reading the logs and adjusting the cluster
>>>>> configuration, and not by starting commands that the cluster should
>>>>> start.
>>>>>
>>>>> We had a 2-node OCFS2 cluster running for quite some time with
>>>>> SLES11, but now the cluster is three nodes. To me, the output of
>>>>> "crm_mon -1Arfj" combined with having set record-pending=true was
>>>>> very valuable for finding problems.
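For reference, Ulrich's monitoring setup can be reproduced roughly like this (a sketch in the crm shell syntax used elsewhere in this thread; check your crm_mon version for the exact meaning of each flag):

```
# one-shot status with node attributes (-A), inactive resources (-r),
# fail counts (-f) and pending operations (-j)
crm_mon -1Arfj

# record pending operations in the CIB so crm_mon can show them
crm configure op_defaults record-pending=true
```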
>>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>
>>>>>>>> Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk> schrieb am
>>>>>>>> 13.03.2018 um 08:43 in
>>>>> Nachricht <7b773ae9-4209-d246-b5c0-2c8b67e623b3 at nds.com.pk>:
>>>>>> Dear Klaus,
>>>>>>
>>>>>> If I understand you properly then, it's a fencing issue, and
>>>>>> whatever I am facing is "natural" or "by-design" in a two-node
>>>>>> cluster where quorum is incomplete.
>>>>>>
>>>>>> I am quite convinced that you have pointed this out correctly
>>>>>> because, when I start the dlm resource via the cluster and then
>>>>>> try to start the ocfs2 file system manually from the command line,
>>>>>> the mount command remains hung and the following events are
>>>>>> reported in the logs:
>>>>>>
>>>>>>         kernel: [62622.864828] ocfs2: Registered cluster interface
>>>>>> user
>>>>>>         kernel: [62622.884427] dlm: Using TCP for communications
>>>>>>         kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>> joining the lockspace group...
>>>>>>         dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>>         dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9
>>>>>> wait
>>>>>> for quorum
>>>>>>
>>>>>> and then the following messages keep getting reported every 5-10
>>>>>> minutes, until I kill the mount.ocfs2 process:
>>>>>>
>>>>>>         dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>>         dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9
>>>>>> wait
>>>>>> for quorum
>>>>>>
>>>>>> I am also very much confused, because yesterday I did the same
>>>>>> and was able to mount the ocfs2 file system manually from the
>>>>>> command line (at least once), then unmount the file system
>>>>>> manually and stop the dlm resource from the cluster, and then the
>>>>>> complete ocfs2 resource stack (dlm, file systems) started/stopped
>>>>>> successfully via the cluster even when only one machine was
>>>>>> online.
>>>>>>
>>>>>> So in a two-node cluster with ocfs2 resources, we can't run the
>>>>>> ocfs2 resources when quorum is incomplete (one node is offline)?
>>>>>>
>>>>>> -- 
>>>>>> Regards,
>>>>>> Muhammad Sharfuddin
>>>>>>
>>>>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>>>>> Hi Klaus,
>>>>>>>>
>>>>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>>>>             op monitor interval=3000 timeout=20 \
>>>>>>>>             op start interval=0 timeout=240 \
>>>>>>>>             op stop interval=0 timeout=100 \
>>>>>>>>             params sbd_device="/dev/mapper/sbd" \
>>>>>>>>             meta target-role=Started
>>>>>>> Makes more sense now.
>>>>>>> Using pcmk_delay_max would probably be useful here
>>>>>>> to prevent a fence-race.
>>>>>>> That stonith-resource was not in your resource-list below ...
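In crm shell syntax, Klaus's pcmk_delay_max suggestion would look roughly like this (the 30-second value is an illustrative choice, not something from this thread):

```
primitive sbd-stonith stonith:external/sbd \
        op monitor interval=3000 timeout=20 \
        op start interval=0 timeout=240 \
        op stop interval=0 timeout=100 \
        params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30 \
        meta target-role=Started
```

The random delay of up to pcmk_delay_max seconds makes it unlikely that both nodes fence each other simultaneously after a split.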
>>>>>>>
>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>             have-watchdog=true \
>>>>>>>>             stonith-enabled=true \
>>>>>>>>             no-quorum-policy=ignore \
>>>>>>>>             stonith-timeout=90 \
>>>>>>>>             startup-fencing=true
>>>>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>>>>> Whether this is a good idea or not in your setup is
>>>>>>> written on another page.
>>>>>>> But isn't dlm directly interfering with corosync so
>>>>>>> that it would get the quorum state from there?
>>>>>>> As you probably have two_node set on a 2-node-cluster,
>>>>>>> this would - after both nodes were down - wait for all
>>>>>>> nodes to come up first.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Klaus
>>>>>>>
>>>>>>>> # ps -eaf |grep sbd
>>>>>>>> root      6129     1  0 17:35 ?        00:00:00 sbd: inquisitor
>>>>>>>> root      6133  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>>> /dev/mapper/sbd - slot: 1 - uuid:
>>>>>>>> 6e80a337-95db-4608-bd62-d59517f39103
>>>>>>>> root      6134  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>>> Pacemaker
>>>>>>>> root      6135  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>>> Cluster
>>>>>>>>
>>>>>>>> This cluster does not start ocfs2 resources when I first
>>>>>>>> intentionally crashed (rebooted) both nodes and then try to
>>>>>>>> start the ocfs2 resource while one node is offline.
>>>>>>>>
>>>>>>>> To fix the issue, I have one permanent solution: bring the
>>>>>>>> other (offline) node online and things get fixed automatically,
>>>>>>>> i.e. the ocfs2 resources mount.
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Regards,
>>>>>>>> Muhammad Sharfuddin
>>>>>>>>
>>>>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>>>>> Hi Muhammad!
>>>>>>>>>
>>>>>>>>> Could you be a little bit more elaborate on your fencing-setup!
>>>>>>>>> I read about you using SBD but I don't see any
>>>>>>>>> sbd-fencing-resource.
>>>>>>>>> For the case you wanted to use watchdog-fencing with SBD this
>>>>>>>>> would require stonith-watchdog-timeout property to be set.
>>>>>>>>> But watchdog-fencing relies on quorum (without 2-node trickery)
>>>>>>>>> and thus wouldn't work on a 2-node-cluster anyway.
>>>>>>>>>
>>>>>>>>> Didn't read through the whole thread - so I might be missing
>>>>>>>>> something ...
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Klaus
>>>>>>>>>
>>>>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>>>>> Hello Gang,
>>>>>>>>>>
>>>>>>>>>> as informed previously, the cluster was fixed to start the
>>>>>>>>>> ocfs2 resources by:
>>>>>>>>>>
>>>>>>>>>> a) crm resource start dlm
>>>>>>>>>>
>>>>>>>>>> b) mounting/umounting the ocfs2 file system manually (this
>>>>>>>>>> step was the fix)
>>>>>>>>>>
>>>>>>>>>> and then starting the clone group(which include dlm, ocfs2 file
>>>>>>>>>> systems) worked fine:
>>>>>>>>>>
>>>>>>>>>> c) crm resource start base-clone.
>>>>>>>>>>
>>>>>>>>>> Now I crashed the nodes intentionally and then kept only one
>>>>>>>>>> node online; again the cluster stopped starting the ocfs2
>>>>>>>>>> resources. I again tried to follow your instructions, i.e.
>>>>>>>>>>
>>>>>>>>>> i) crm resource start dlm
>>>>>>>>>>
>>>>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>>>>> this time (previously mounting manually helped me):
>>>>>>>>>>
>>>>>>>>>> # cat /proc/3966/stack
>>>>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350
>>>>>>>>>> [ocfs2_stack_user]
>>>>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240
>>>>>>>>>> [ocfs2_stackglue]
>>>>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>>
>>>>>>>>>> I killed the mount.ocfs2 process, stopped the dlm resource
>>>>>>>>>> (crm resource stop dlm), and then tried to start the dlm again
>>>>>>>>>> (crm resource start dlm), which previously always started
>>>>>>>>>> successfully; this time dlm didn't start, and I checked the
>>>>>>>>>> dlm_controld process:
>>>>>>>>>>
>>>>>>>>>> cat /proc/3754/stack
>>>>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>>
>>>>>>>>>> Nutshell:
>>>>>>>>>>
>>>>>>>>>> 1 - this cluster is configured to run when a single node is
>>>>>>>>>> online
>>>>>>>>>>
>>>>>>>>>> 2 - this cluster does not start the ocfs2 resources after a
>>>>>>>>>> crash when only one node is online.
>>>>>>>>>>
>>>>>>>>>> -- 
>>>>>>>>>> Regards,
>>>>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>>>>
>>>>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>>>>> Hello Gang,
>>>>>>>>>>>>
>>>>>>>>>>>> to follow your instructions, I started the dlm resource via:
>>>>>>>>>>>>
>>>>>>>>>>>>            crm resource start dlm
>>>>>>>>>>>>
>>>>>>>>>>>> then mounted/unmounted the ocfs2 file system manually
>>>>>>>>>>>> (which seems to be the fix for the situation).
>>>>>>>>>>>>
>>>>>>>>>>>> Now resources are getting started properly on a single
>>>>>>>>>>>> node. I am happy as the issue is fixed, but at the same
>>>>>>>>>>>> time I am lost because I have no idea
>>>>>>>>>>>>
>>>>>>>>>>>> how things got fixed here (merely by mounting/unmounting
>>>>>>>>>>>> the ocfs2 file systems)
>>>>>>>>>>> From your description,
>>>>>>>>>>> I just wonder whether the DLM resource works normally under
>>>>>>>>>>> that situation.
>>>>>>>>>>> Yan/Bin, do you have any comments about two-node clusters?
>>>>>>>>>>> Which configuration settings will affect corosync quorum/DLM?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Gang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -- 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by
>>>>>>>>>>>>> the mount command timing out (or hanging).
>>>>>>>>>>>>> A simple debugging method is:
>>>>>>>>>>>>> remove the ocfs2 resource from crm first,
>>>>>>>>>>>>> then mount this file system manually, and see if the mount
>>>>>>>>>>>>> command times out or hangs.
>>>>>>>>>>>>> If the command hangs, please check where the mount.ocfs2
>>>>>>>>>>>>> process is stuck via the "cat /proc/xxx/stack" command.
>>>>>>>>>>>>> If the backtrace stops in the DLM kernel module, the root
>>>>>>>>>>>>> cause is usually a cluster configuration problem.
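Gang's procedure can be sketched as a short shell sequence (the device and mountpoint paths follow the ones used in this thread, the 10-second wait is an arbitrary illustrative choice, and this obviously needs a real cluster node to run):

```
# start dlm first, then try the mount in the background so we can
# inspect it if it hangs
crm resource start dlm
mount -t ocfs2 /dev/mapper/sapmnt /sapmnt &
MOUNT_PID=$!

sleep 10
if kill -0 "$MOUNT_PID" 2>/dev/null; then
    # mount is still running after 10s - dump the kernel stack of the
    # hung mount.ocfs2 process
    HUNG_PID=$(pgrep -n mount.ocfs2)
    cat /proc/"$HUNG_PID"/stack
fi
```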
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks
>>>>>>>>>>>>>>> like missing cluster quorum.
>>>>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if
>>>>>>>>>>>>>>> one node is offline, quorum is lost by default.
>>>>>>>>>>>>>>> So, you should configure the two-node related quorum
>>>>>>>>>>>>>>> settings according to the pacemaker manual.
>>>>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can
>>>>>>>>>>>>>>> start up.
>>>>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set
>>>>>>>>>>>>>> to "ignore".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>                  have-watchdog=true \
>>>>>>>>>>>>>>                  stonith-enabled=true \
>>>>>>>>>>>>>>                  stonith-timeout=80 \
>>>>>>>>>>>>>>                  startup-fencing=true \
>>>>>>>>>>>>>>                  no-quorum-policy=ignore
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This two-node cluster starts resources when both nodes
>>>>>>>>>>>>>>>> are online, but does not start the ocfs2 resources when
>>>>>>>>>>>>>>>> one node is offline. E.g. if I gracefully stop the
>>>>>>>>>>>>>>>> cluster resources, then stop the pacemaker service on
>>>>>>>>>>>>>>>> either node, and try to start the ocfs2 resource on the
>>>>>>>>>>>>>>>> online node, it fails.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Start
>>>>>>>>>>>>>>>> dlm:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Start
>>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Start
>>>>>>>>>>>>>>>> p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Calculated transition 2,
>>>>>>>>>>>>>>>> saving
>>>>>>>>>>>>>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Processing graph 2
>>>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation
>>>>>>>>>>>>>>>> dlm_start_0
>>>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:dlm action:start
>>>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:dlm action:start
>>>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Result of start operation for dlm on
>>>>>>>>>>>>>>>> pipci001: 0 (ok)
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating monitor operation
>>>>>>>>>>>>>>>> dlm_monitor_60000
>>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation
>>>>>>>>>>>>>>>> p-fssapmnt_start_0
>>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt
>>>>>>>>>>>>>>>> action:start
>>>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>>>>>>>>>>>>>>> /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>>>>> kernel: [ 4576.530233] dlm:
>>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>>> joining
>>>>>>>>>>>>>>>> the lockspace group.
>>>>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9
>>>>>>>>>>>>>>>> wait
>>>>>>>>>>>>>>>> for quorum
>>>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID
>>>>>>>>>>>>>>>> 19052)
>>>>>>>>>>>>>>>> timed out
>>>>>>>>>>>>>>>> kernel: [ 4636.418223] dlm:
>>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>> event done -512 0
>>>>>>>>>>>>>>>> kernel: [ 4636.418227] dlm:
>>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>>> group join
>>>>>>>>>>>>>>>> failed -512 0
>>>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out
>>>>>>>>>>>>>>>> after 60000ms
>>>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:p-fssapmnt
>>>>>>>>>>>>>>>> action:start
>>>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on
>>>>>>>>>>>>>>>> (node 0)
>>>>>>>>>>>>>>>> crmd[17733]:    error: Result of start operation for
>>>>>>>>>>>>>>>> p-fssapmnt on
>>>>>>>>>>>>>>>> pipci001: Timed Out
>>>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Transition aborted by operation
>>>>>>>>>>>>>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0,
>>>>>>>>>>>>>>>> Fired=0,
>>>>>>>>>>>>>>>> Skipped=0, Incomplete=6,
>>>>>>>>>>>>>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2):
>>>>>>>>>>>>>>>> Complete
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>>>> required
>>>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop
>>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 3, saving
>>>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>>>> required
>>>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>>> pipci001 pengine[17732]:  warning: Forcing base-clone away
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> after 1000000 failures (max=2)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop
>>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 4, saving
>>>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Processing graph 4
>>>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613263-36)
>>>>>>>>>>>>>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation
>>>>>>>>>>>>>>>> p-fssapmnt_stop_0
>>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt
>>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>>> call_id:72
>>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for
>>>>>>>>>>>>>>>> /dev/mapper/sapmnt
>>>>>>>>>>>>>>>> on /sapmnt
>>>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt
>>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms
>>>>>>>>>>>>>>>> queue-time:0ms
>>>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Result of stop operation
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>> p-fssapmnt
>>>>>>>>>>>>>>>> on pipci001: 0 (ok)
>>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation dlm_stop_0
>>>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: executing - rsc:dlm
>>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>>> call_id:74
>>>>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active
>>>>>>>>>>>>>>>> lockspaces
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>>>>                  params device="/dev/mapper/sapmnt"
>>>>>>>>>>>>>>>> directory="/sapmnt"
>>>>>>>>>>>>>>>> fstype=ocfs2 \
>>>>>>>>>>>>>>>>                  op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>>>>                  op start timeout=60 interval=0 \
>>>>>>>>>>>>>>>>                  op stop timeout=60 interval=0
>>>>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>>>>                  op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>>>>                  op start interval=0 timeout=90 \
>>>>>>>>>>>>>>>>                  op stop interval=0 timeout=100
>>>>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>>>>                  meta interleave=true target-role=Started
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>>>                  have-watchdog=true \
>>>>>>>>>>>>>>>>                  stonith-enabled=true \
>>>>>>>>>>>>>>>>                  stonith-timeout=80 \
>>>>>>>>>>>>>>>>                  startup-fencing=true \
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> Users mailing list: Users at clusterlabs.org
>>>>>>>>>>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>>>>>> Getting started:
>>>>>>>>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>>
>   
>
>

