[ClusterLabs] Antw: Re: single node fails to start the ocfs2 resource

Klaus Wenninger kwenning at redhat.com
Tue Mar 13 10:32:41 EDT 2018


On 03/13/2018 02:30 PM, Muhammad Sharfuddin wrote:
> Yes, by saying pacemaker, I meant to say corosync as well.
>
> Is there any fix? Or can a two-node cluster not run ocfs2 resources
> when one node is offline?

Actually there can't be a "fix", as 2 nodes are just not enough
for a partial cluster to be quorate in the classical sense
(more votes than half of the cluster's nodes).

So to still be able to use such clusters, we have the two_node
config, which permanently grants quorum. But in order not to
run into issues on startup, it requires both nodes to have seen
each other once (wait_for_all).
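
For reference, a minimal quorum stanza for such a setup in
/etc/corosync/corosync.conf would look roughly like this (a
sketch; two_node: 1 already implies wait_for_all: 1, it is
spelled out here only for clarity):

    quorum {
        provider: corosync_votequorum
        # keep quorum even when only one of the two nodes is up
        two_node: 1
        # after a full stop, both nodes must see each other once
        # before the cluster becomes quorate again
        wait_for_all: 1
    }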

So this is definitely nothing that is specific to ocfs2.
It just looks specific to ocfs2 because you've disabled
quorum for pacemaker.
To be honest, doing that you wouldn't need a resource manager
at all and could just start up your services using systemd.

If you don't want a full 3rd node but still want to handle cases
where one node doesn't come up after a full shutdown of
all nodes, you could probably go for a setup with qdevice.
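
Assuming a third machine running corosync-qnetd is reachable from
both nodes, the quorum section could be extended roughly like this
(a sketch; the host name is a placeholder):

    quorum {
        provider: corosync_votequorum
        device {
            # the quorum device provides the tie-breaking vote
            votes: 1
            model: net
            net {
                # placeholder for the host running corosync-qnetd
                host: qnetd-host
                # ffsplit: exactly one half of a 50:50 split
                # gets the vote
                algorithm: ffsplit
            }
        }
    }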

Regards,
Klaus

>
> -- 
> Regards,
> Muhammad Sharfuddin
>
> On 3/13/2018 6:16 PM, Klaus Wenninger wrote:
>> On 03/13/2018 02:03 PM, Muhammad Sharfuddin wrote:
>>> Hi,
>>>
>>> 1 - if I put a node (node2) offline, the ocfs2 resources keep running
>>> on the online node (node1).
>>>
>>> 2 - while node2 was offline, I stopped/started the ocfs2 resource
>>> group via the cluster successfully many times in a row.
>>>
>>> 3 - while node2 was offline, I restarted the pacemaker service on
>>> node1 and then tried to start the ocfs2 resource group; dlm started
>>> but the ocfs2 file system resource did not start.
>>>
>>> Nutshell:
>>>
>>> a - both nodes must be online to start the ocfs2 resource.
>>>
>>> b - if one node crashes or goes offline (gracefully), the ocfs2
>>> resource keeps running on the other/surviving node.
>>>
>>> c - while one node was offline, we could stop/start the ocfs2 resource
>>> group on the surviving node; but if we restart the pacemaker service,
>>> the ocfs2 file system resource does not start, with the following info
>>> in the logs:
>> From the logs I would say startup of dlm_controld times out
>> because it is waiting for quorum - which doesn't happen because
>> of wait-for-all.
>> The question is whether you really just stopped pacemaker or
>> stopped corosync as well.
>> In the latter case I would say it is the expected behavior.
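>>
>> (You can verify this on the surviving node with
>> "corosync-quorumtool -s": with two_node/wait_for_all in effect,
>> the Flags line should show "2Node WaitForAll", and "Quorate"
>> should appear only once both nodes have seen each other.)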
>>
>> Regards,
>> Klaus
>>  
>>> lrmd[4317]:   notice: executing - rsc:p-fssapmnt action:start
>>> call_id:53
>>> Filesystem(p-fssapmnt)[5139]: INFO: Running start for
>>> /dev/mapper/sapmnt on /sapmnt
>>> kernel: [  706.162676] dlm: Using TCP for communications
>>> kernel: [  706.162916] dlm: BFA9FF042AA045F4822C2A6A06020EE9: joining
>>> the lockspace group...
>>> dlm_controld[5105]: 759 fence work wait for quorum
>>> dlm_controld[5105]: 764 BFA9FF042AA045F4822C2A6A06020EE9 wait for
>>> quorum
>>> lrmd[4317]:  warning: p-fssapmnt_start_0 process (PID 5139) timed out
>>> lrmd[4317]:  warning: p-fssapmnt_start_0:5139 - timed out after 60000ms
>>> lrmd[4317]:   notice: finished - rsc:p-fssapmnt action:start
>>> call_id:53 pid:5139 exit-code:1 exec-time:60002ms queue-time:0ms
>>> kernel: [  766.056514] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>> event done -512 0
>>> kernel: [  766.056528] dlm: BFA9FF042AA045F4822C2A6A06020EE9: group
>>> join failed -512 0
>>> crmd[4320]:   notice: Result of stop operation for p-fssapmnt on
>>> pipci001: 0 (ok)
>>> crmd[4320]:   notice: Initiating stop operation dlm_stop_0 locally on
>>> pipci001
>>> lrmd[4317]:   notice: executing - rsc:dlm action:stop call_id:56
>>> dlm_controld[5105]: 766 shutdown ignored, active lockspaces
>>> lrmd[4317]:  warning: dlm_stop_0 process (PID 5326) timed out
>>> lrmd[4317]:  warning: dlm_stop_0:5326 - timed out after 100000ms
>>> lrmd[4317]:   notice: finished - rsc:dlm action:stop call_id:56
>>> pid:5326 exit-code:1 exec-time:100003ms queue-time:0ms
>>> crmd[4320]:    error: Result of stop operation for dlm on pipci001:
>>> Timed Out
>>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>>> (target: 0 vs. rc: 1): Error
>>> crmd[4320]:   notice: Transition aborted by operation dlm_stop_0
>>> 'modify' on pipci001: Event failed
>>> crmd[4320]:  warning: Action 15 (dlm_stop_0) on pipci001 failed
>>> (target: 0 vs. rc: 1): Error
>>> pengine[4319]:   notice: Watchdog will be used via SBD if fencing is
>>> required
>>> pengine[4319]:   notice: On loss of CCM Quorum: Ignore
>>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>>> pipci001: unknown error (1)
>>> pengine[4319]:  warning: Processing failed op stop for dlm:0 on
>>> pipci001: unknown error (1)
>>> pengine[4319]:  warning: Cluster node pipci001 will be fenced: dlm:0
>>> failed there
>>> pengine[4319]:  warning: Processing failed op start for p-fssapmnt:0
>>> on pipci001: unknown error (1)
>>> pengine[4319]:   notice: Stop of failed resource dlm:0 is implicit
>>> after pipci001 is fenced
>>> pengine[4319]:   notice:  * Fence pipci001
>>> pengine[4319]:   notice: Stop    sbd-stonith#011(pipci001)
>>> pengine[4319]:   notice: Stop    dlm:0#011(pipci001)
>>> crmd[4320]:   notice: Requesting fencing (reboot) of node pipci001
>>> stonith-ng[4316]:   notice: Client crmd.4320.4c2f757b wants to fence
>>> (reboot) 'pipci001' with device '(any)'
>>> stonith-ng[4316]:   notice: Requesting peer fencing (reboot) of
>>> pipci001
>>> stonith-ng[4316]:   notice: sbd-stonith can fence (reboot) pipci001:
>>> dynamic-list
>>>
>>>
>>> -- 
>>> Regards,
>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>
>>> On 3/13/2018 1:04 PM, Ulrich Windl wrote:
>>>> Hi!
>>>>
>>>> I'd recommend this:
>>>> Cleanly boot your nodes, avoiding any manual operation with cluster
>>>> resources. Keep the logs.
>>>> Then start your tests, keeping the logs for each.
>>>> Try to fix issues by reading the logs and adjusting the cluster
>>>> configuration, and not by starting commands that the cluster should
>>>> start.
>>>>
>>>> We had a 2-node OCFS2 cluster running for quite some time with
>>>> SLES11, but now the cluster has three nodes. To me the output of
>>>> "crm_mon -1Arfj", combined with having set record-pending=true, was
>>>> very valuable for finding problems.
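>>>> (record-pending can be set cluster-wide via e.g.
>>>> "crm configure op_defaults record-pending=true".)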
>>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>
>>>>>>> Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk> wrote on
>>>>>>> 13.03.2018 at 08:43 in
>>>> message <7b773ae9-4209-d246-b5c0-2c8b67e623b3 at nds.com.pk>:
>>>>> Dear Klaus,
>>>>>
>>>>> If I understand you properly, then it's a fencing issue, and
>>>>> whatever I am facing is "natural" or "by design" in a two-node
>>>>> cluster where quorum is incomplete.
>>>>>
>>>>> I am quite convinced that you have pointed this out rightly
>>>>> because, when I start the dlm resource via the cluster and then
>>>>> try to mount the ocfs2 file system manually from the command line,
>>>>> the mount command hangs and the following events are reported in
>>>>> the logs:
>>>>>
>>>>>        kernel: [62622.864828] ocfs2: Registered cluster interface
>>>>> user
>>>>>        kernel: [62622.884427] dlm: Using TCP for communications
>>>>>        kernel: [62622.884750] dlm: BFA9FF042AA045F4822C2A6A06020EE9:
>>>>> joining the lockspace group...
>>>>>        dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>        dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9
>>>>> wait
>>>>> for quorum
>>>>>
>>>>> and then the following messages keep being reported every 5-10
>>>>> minutes, until I kill the mount.ocfs2 process:
>>>>>
>>>>>        dlm_controld[17655]: 62627 fence work wait for quorum
>>>>>        dlm_controld[17655]: 62680 BFA9FF042AA045F4822C2A6A06020EE9
>>>>> wait
>>>>> for quorum
>>>>>
>>>>> I am also very much confused, because yesterday I did the same
>>>>> and was able to mount the ocfs2 file system manually from the
>>>>> command line (at least once), then unmount the file system
>>>>> manually, stop the dlm resource from the cluster, and then
>>>>> start/stop the complete ocfs2 resource stack (dlm, file systems)
>>>>> successfully via the cluster, even when only one machine was
>>>>> online.
>>>>>
>>>>> So in a two-node cluster that has ocfs2 resources, we can't run
>>>>> the ocfs2 resources when quorum is incomplete (one node is
>>>>> offline)?
>>>>>
>>>>> -- 
>>>>> Regards,
>>>>> Muhammad Sharfuddin
>>>>>
>>>>> On 3/12/2018 5:58 PM, Klaus Wenninger wrote:
>>>>>> On 03/12/2018 01:44 PM, Muhammad Sharfuddin wrote:
>>>>>>> Hi Klaus,
>>>>>>>
>>>>>>> primitive sbd-stonith stonith:external/sbd \
>>>>>>>            op monitor interval=3000 timeout=20 \
>>>>>>>            op start interval=0 timeout=240 \
>>>>>>>            op stop interval=0 timeout=100 \
>>>>>>>            params sbd_device="/dev/mapper/sbd" \
>>>>>>>            meta target-role=Started
>>>>>> Makes more sense now.
>>>>>> Using pcmk_delay_max would probably be useful here
>>>>>> to prevent a fence-race.
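>>>>>> E.g. in the params of that resource (a sketch; the 30s
>>>>>> maximum random delay is illustrative):
>>>>>>
>>>>>>            params sbd_device="/dev/mapper/sbd" pcmk_delay_max=30
>>>>>>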
>>>>>> That stonith-resource was not in your resource-list below ...
>>>>>>
>>>>>>> property cib-bootstrap-options: \
>>>>>>>            have-watchdog=true \
>>>>>>>            stonith-enabled=true \
>>>>>>>            no-quorum-policy=ignore \
>>>>>>>            stonith-timeout=90 \
>>>>>>>            startup-fencing=true
>>>>>> You've set no-quorum-policy=ignore for pacemaker.
>>>>>> Whether that is a good idea in your setup is
>>>>>> written on another page.
>>>>>> But isn't dlm interfacing directly with corosync, so
>>>>>> that it would get the quorum state from there?
>>>>>> As you probably have two_node set on a 2-node cluster,
>>>>>> this would - after both nodes were down - wait for all
>>>>>> nodes to come up first.
>>>>>>
>>>>>> Regards,
>>>>>> Klaus
>>>>>>
>>>>>>> # ps -eaf |grep sbd
>>>>>>> root      6129     1  0 17:35 ?        00:00:00 sbd: inquisitor
>>>>>>> root      6133  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>> /dev/mapper/sbd - slot: 1 - uuid:
>>>>>>> 6e80a337-95db-4608-bd62-d59517f39103
>>>>>>> root      6134  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>> Pacemaker
>>>>>>> root      6135  6129  0 17:35 ?        00:00:00 sbd: watcher:
>>>>>>> Cluster
>>>>>>>
>>>>>>> This cluster does not start the ocfs2 resources when I first
>>>>>>> intentionally crash (reboot) both nodes and then try to start
>>>>>>> the ocfs2 resource while one node is offline.
>>>>>>>
>>>>>>> To fix the issue I have one permanent solution: bring the other
>>>>>>> (offline) node online and things get fixed automatically, i.e.
>>>>>>> the ocfs2 resources mount.
>>>>>>>
>>>>>>> -- 
>>>>>>> Regards,
>>>>>>> Muhammad Sharfuddin
>>>>>>>
>>>>>>> On 3/12/2018 5:25 PM, Klaus Wenninger wrote:
>>>>>>>> Hi Muhammad!
>>>>>>>>
>>>>>>>> Could you be a little bit more elaborate on your fencing setup?
>>>>>>>> I read about you using SBD but I don't see any
>>>>>>>> sbd fencing resource.
>>>>>>>> In case you wanted to use watchdog-fencing with SBD, this
>>>>>>>> would require the stonith-watchdog-timeout property to be set.
>>>>>>>> But watchdog-fencing relies on quorum (without 2-node trickery)
>>>>>>>> and thus wouldn't work on a 2-node cluster anyway.
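>>>>>>>> (If you did want that, it would be something like
>>>>>>>> "crm configure property stonith-watchdog-timeout=10s" - the
>>>>>>>> value is illustrative; it is usually set to about twice the
>>>>>>>> SBD watchdog timeout.)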
>>>>>>>>
>>>>>>>> Didn't read through the whole thread - so I might be missing
>>>>>>>> something ...
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Klaus
>>>>>>>>
>>>>>>>> On 03/12/2018 12:51 PM, Muhammad Sharfuddin wrote:
>>>>>>>>> Hello Gang,
>>>>>>>>>
>>>>>>>>> as informed previously, the cluster was fixed to start the
>>>>>>>>> ocfs2 resources by
>>>>>>>>>
>>>>>>>>> a) crm resource start dlm
>>>>>>>>>
>>>>>>>>> b) mounting/umounting the ocfs2 file system manually (this
>>>>>>>>> step was the fix)
>>>>>>>>>
>>>>>>>>> and then starting the clone group(which include dlm, ocfs2 file
>>>>>>>>> systems) worked fine:
>>>>>>>>>
>>>>>>>>> c) crm resource start base-clone.
>>>>>>>>>
>>>>>>>>> Now I crashed the nodes intentionally and then kept only one
>>>>>>>>> node online; again the cluster stopped starting the ocfs2
>>>>>>>>> resources. I again tried to follow your instructions, i.e.
>>>>>>>>>
>>>>>>>>> i) crm resource start dlm
>>>>>>>>>
>>>>>>>>> then tried to mount the ocfs2 file system manually, which hung
>>>>>>>>> this time (previously mounting manually had helped me):
>>>>>>>>>
>>>>>>>>> # cat /proc/3966/stack
>>>>>>>>> [<ffffffffa039f18e>] do_uevent+0x7e/0x200 [dlm]
>>>>>>>>> [<ffffffffa039fe0a>] new_lockspace+0x80a/0xa70 [dlm]
>>>>>>>>> [<ffffffffa03a02d9>] dlm_new_lockspace+0x69/0x160 [dlm]
>>>>>>>>> [<ffffffffa038e758>] user_cluster_connect+0xc8/0x350
>>>>>>>>> [ocfs2_stack_user]
>>>>>>>>> [<ffffffffa03c2872>] ocfs2_cluster_connect+0x192/0x240
>>>>>>>>> [ocfs2_stackglue]
>>>>>>>>> [<ffffffffa045eefc>] ocfs2_dlm_init+0x31c/0x570 [ocfs2]
>>>>>>>>> [<ffffffffa04a9983>] ocfs2_fill_super+0xb33/0x1200 [ocfs2]
>>>>>>>>> [<ffffffff8120e130>] mount_bdev+0x1a0/0x1e0
>>>>>>>>> [<ffffffff8120ea1a>] mount_fs+0x3a/0x170
>>>>>>>>> [<ffffffff81228bf2>] vfs_kern_mount+0x62/0x110
>>>>>>>>> [<ffffffff8122b123>] do_mount+0x213/0xcd0
>>>>>>>>> [<ffffffff8122bed5>] SyS_mount+0x85/0xd0
>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>
>>>>>>>>> I killed the mount.ocfs2 process, stopped the dlm resource
>>>>>>>>> (crm resource stop dlm), and then tried to start it again
>>>>>>>>> (crm resource start dlm) - which previously always started
>>>>>>>>> successfully. This time dlm didn't start, and I checked the
>>>>>>>>> dlm_controld process:
>>>>>>>>>
>>>>>>>>> cat /proc/3754/stack
>>>>>>>>> [<ffffffff8121dc55>] poll_schedule_timeout+0x45/0x60
>>>>>>>>> [<ffffffff8121f0bc>] do_sys_poll+0x38c/0x4f0
>>>>>>>>> [<ffffffff8121f2dd>] SyS_poll+0x5d/0xe0
>>>>>>>>> [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6
>>>>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>
>>>>>>>>> Nutshell:
>>>>>>>>>
>>>>>>>>> 1 - this cluster is configured to run when a single node is
>>>>>>>>> online.
>>>>>>>>>
>>>>>>>>> 2 - this cluster does not start the ocfs2 resources after a
>>>>>>>>> crash when only one node is online.
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Regards,
>>>>>>>>> Muhammad Sharfuddin | +923332144823 | nds.com.pk
>>>>>>>>>
>>>>>>>>> On 3/12/2018 12:41 PM, Gang He wrote:
>>>>>>>>>>> Hello Gang,
>>>>>>>>>>>
>>>>>>>>>>> to follow your instructions, I started the dlm resource via:
>>>>>>>>>>>
>>>>>>>>>>>           crm resource start dlm
>>>>>>>>>>>
>>>>>>>>>>> then mounted/unmounted the ocfs2 file system manually (which
>>>>>>>>>>> seems to be the fix for the situation).
>>>>>>>>>>>
>>>>>>>>>>> Now the resources are getting started properly on a single
>>>>>>>>>>> node. I am happy as the issue is fixed, but at the same time
>>>>>>>>>>> I am lost, because I have no idea
>>>>>>>>>>>
>>>>>>>>>>> how things got fixed here (merely by mounting/unmounting the
>>>>>>>>>>> ocfs2 file systems)
>>>>>>>>>> From your description,
>>>>>>>>>> I just wonder whether the DLM resource works normally under
>>>>>>>>>> that situation.
>>>>>>>>>> Yan/Bin, do you have any comments about two-node clusters?
>>>>>>>>>> Which configuration settings will affect corosync quorum/DLM?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Gang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>
>>>>>>>>>>> On 3/12/2018 10:59 AM, Gang He wrote:
>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>
>>>>>>>>>>>> Usually, an ocfs2 resource startup failure is caused by the
>>>>>>>>>>>> mount command timing out (or hanging).
>>>>>>>>>>>> The simple debugging method is:
>>>>>>>>>>>> remove the ocfs2 resource from crm first,
>>>>>>>>>>>> then mount this file system manually and see if the mount
>>>>>>>>>>>> command times out or hangs.
>>>>>>>>>>>> If the command hangs, please watch where the mount.ocfs2
>>>>>>>>>>>> process is hanging via the "cat /proc/xxx/stack" command.
>>>>>>>>>>>> If the back trace stops in the DLM kernel module, the root
>>>>>>>>>>>> cause is usually a cluster configuration problem.
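>>>>>>>>>>>>
>>>>>>>>>>>> For example (device and mount point as used in this
>>>>>>>>>>>> thread; the PID is whatever ps reports for mount.ocfs2):
>>>>>>>>>>>>
>>>>>>>>>>>>    # mount -t ocfs2 /dev/mapper/sapmnt /sapmnt &
>>>>>>>>>>>>    # cat /proc/$(pidof mount.ocfs2)/stack
>>>>>>>>>>>>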
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Gang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/12/2018 7:32 AM, Gang He wrote:
>>>>>>>>>>>>>> Hello Muhammad,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this problem is not in ocfs2; the cause looks like
>>>>>>>>>>>>>> the cluster quorum is missing.
>>>>>>>>>>>>>> For a two-node cluster (unlike a three-node cluster), if
>>>>>>>>>>>>>> one node is offline, quorum will be missing by default.
>>>>>>>>>>>>>> So you should configure the two-node related quorum
>>>>>>>>>>>>>> settings according to the pacemaker manual.
>>>>>>>>>>>>>> Then DLM can work normally, and the ocfs2 resource can
>>>>>>>>>>>>>> start up.
>>>>>>>>>>>>> Yes, it's configured accordingly; no-quorum-policy is set to "ignore".
>>>>>>>>>>>>>
>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>                 have-watchdog=true \
>>>>>>>>>>>>>                 stonith-enabled=true \
>>>>>>>>>>>>>                 stonith-timeout=80 \
>>>>>>>>>>>>>                 startup-fencing=true \
>>>>>>>>>>>>>                 no-quorum-policy=ignore
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This two-node cluster starts resources when both nodes
>>>>>>>>>>>>>>> are online, but does not start the ocfs2 resources when
>>>>>>>>>>>>>>> one node is offline. E.g. if I gracefully stop the
>>>>>>>>>>>>>>> cluster resources, then stop the pacemaker service on
>>>>>>>>>>>>>>> either node, and try to start the ocfs2 resource on the
>>>>>>>>>>>>>>> online node, it fails.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> logs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Start
>>>>>>>>>>>>>>> dlm:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Start  
>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Start  
>>>>>>>>>>>>>>> p-fsusrsap:0#011(pipci001)
>>>>>>>>>>>>>>> pipci001 pengine[17732]:   notice: Calculated transition 2,
>>>>>>>>>>>>>>> saving
>>>>>>>>>>>>>>> inputs in /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Processing graph 2
>>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613202-31) derived from
>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-339.bz2
>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation
>>>>>>>>>>>>>>> dlm_start_0
>>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:dlm action:start
>>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>>> dlm_controld[19019]: 4575 dlm_controld 4.0.7 started
>>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:dlm action:start
>>>>>>>>>>>>>>> call_id:69
>>>>>>>>>>>>>>> pid:18999 exit-code:0 exec-time:1082ms queue-time:1ms
>>>>>>>>>>>>>>> crmd[17733]:   notice: Result of start operation for dlm on
>>>>>>>>>>>>>>> pipci001: 0 (ok)
>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating monitor operation
>>>>>>>>>>>>>>> dlm_monitor_60000
>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating start operation
>>>>>>>>>>>>>>> p-fssapmnt_start_0
>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt
>>>>>>>>>>>>>>> action:start
>>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19052]: INFO: Running start for
>>>>>>>>>>>>>>> /dev/mapper/sapmnt on /sapmnt
>>>>>>>>>>>>>>> kernel: [ 4576.529938] dlm: Using TCP for communications
>>>>>>>>>>>>>>> kernel: [ 4576.530233] dlm:
>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>> joining
>>>>>>>>>>>>>>> the lockspace group.
>>>>>>>>>>>>>>> dlm_controld[19019]: 4629 fence work wait for quorum
>>>>>>>>>>>>>>> dlm_controld[19019]: 4634 BFA9FF042AA045F4822C2A6A06020EE9
>>>>>>>>>>>>>>> wait
>>>>>>>>>>>>>>> for quorum
>>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0 process (PID
>>>>>>>>>>>>>>> 19052)
>>>>>>>>>>>>>>> timed out
>>>>>>>>>>>>>>> kernel: [ 4636.418223] dlm:
>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>> event done -512 0
>>>>>>>>>>>>>>> kernel: [ 4636.418227] dlm:
>>>>>>>>>>>>>>> BFA9FF042AA045F4822C2A6A06020EE9:
>>>>>>>>>>>>>>> group join
>>>>>>>>>>>>>>> failed -512 0
>>>>>>>>>>>>>>> lrmd[17730]:  warning: p-fssapmnt_start_0:19052 - timed out
>>>>>>>>>>>>>>> after 60000ms
>>>>>>>>>>>>>>> lrmd[17730]:   notice: finished - rsc:p-fssapmnt
>>>>>>>>>>>>>>> action:start
>>>>>>>>>>>>>>> call_id:71
>>>>>>>>>>>>>>> pid:19052 exit-code:1 exec-time:60002ms queue-time:0ms
>>>>>>>>>>>>>>> kernel: [ 4636.420628] ocfs2: Unmounting device (254,1) on
>>>>>>>>>>>>>>> (node 0)
>>>>>>>>>>>>>>> crmd[17733]:    error: Result of start operation for
>>>>>>>>>>>>>>> p-fssapmnt on
>>>>>>>>>>>>>>> pipci001: Timed Out
>>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>>> crmd[17733]:   notice: Transition aborted by operation
>>>>>>>>>>>>>>> p-fssapmnt_start_0 'modify' on pipci001: Event failed
>>>>>>>>>>>>>>> crmd[17733]:  warning: Action 11 (p-fssapmnt_start_0) on
>>>>>>>>>>>>>>> pipci001 failed
>>>>>>>>>>>>>>> (target: 0 vs. rc: 1): Error
>>>>>>>>>>>>>>> crmd[17733]:   notice: Transition 2 (Complete=5, Pending=0,
>>>>>>>>>>>>>>> Fired=0,
>>>>>>>>>>>>>>> Skipped=0, Incomplete=6,
>>>>>>>>>>>>>>> Source=/var/lib/pacemaker/pengine/pe-input-339.bz2):
>>>>>>>>>>>>>>> Complete
>>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>>> required
>>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop   
>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 3, saving
>>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-340.bz2
>>>>>>>>>>>>>>> pengine[17732]:   notice: Watchdog will be used via SBD if
>>>>>>>>>>>>>>> fencing is
>>>>>>>>>>>>>>> required
>>>>>>>>>>>>>>> pengine[17732]:   notice: On loss of CCM Quorum: Ignore
>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>> pengine[17732]:  warning: Processing failed op start for
>>>>>>>>>>>>>>> p-fssapmnt:0 on
>>>>>>>>>>>>>>> pipci001: unknown error (1)
>>>>>>>>>>>>>>> pengine[17732]:  warning: Forcing base-clone away from
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> after
>>>>>>>>>>>>>>> 1000000 failures (max=2)
>>>>>>>>>>>>>>> pipci001 pengine[17732]:  warning: Forcing base-clone away
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> after 1000000 failures (max=2)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop    dlm:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Stop   
>>>>>>>>>>>>>>> p-fssapmnt:0#011(pipci001)
>>>>>>>>>>>>>>> pengine[17732]:   notice: Calculated transition 4, saving
>>>>>>>>>>>>>>> inputs in
>>>>>>>>>>>>>>> /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>> crmd[17733]:   notice: Processing graph 4
>>>>>>>>>>>>>>> (ref=pe_calc-dc-1520613263-36)
>>>>>>>>>>>>>>> derived from /var/lib/pacemaker/pengine/pe-input-341.bz2
>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation
>>>>>>>>>>>>>>> p-fssapmnt_stop_0
>>>>>>>>>>>>>>> locally on pipci001
>>>>>>>>>>>>>>> lrmd[17730]:   notice: executing - rsc:p-fssapmnt
>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>> call_id:72
>>>>>>>>>>>>>>> Filesystem(p-fssapmnt)[19189]: INFO: Running stop for
>>>>>>>>>>>>>>> /dev/mapper/sapmnt
>>>>>>>>>>>>>>> on /sapmnt
>>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: finished - rsc:p-fssapmnt
>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>> call_id:72 pid:19189 exit-code:0 exec-time:83ms
>>>>>>>>>>>>>>> queue-time:0ms
>>>>>>>>>>>>>>> pipci001 crmd[17733]:   notice: Result of stop operation
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> p-fssapmnt
>>>>>>>>>>>>>>> on pipci001: 0 (ok)
>>>>>>>>>>>>>>> crmd[17733]:   notice: Initiating stop operation dlm_stop_0
>>>>>>>>>>>>>>> locally on
>>>>>>>>>>>>>>> pipci001
>>>>>>>>>>>>>>> pipci001 lrmd[17730]:   notice: executing - rsc:dlm
>>>>>>>>>>>>>>> action:stop
>>>>>>>>>>>>>>> call_id:74
>>>>>>>>>>>>>>> pipci001 dlm_controld[19019]: 4636 shutdown ignored, active
>>>>>>>>>>>>>>> lockspaces
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> resource configuration:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> primitive p-fssapmnt Filesystem \
>>>>>>>>>>>>>>>                 params device="/dev/mapper/sapmnt"
>>>>>>>>>>>>>>> directory="/sapmnt"
>>>>>>>>>>>>>>> fstype=ocfs2 \
>>>>>>>>>>>>>>>                 op monitor interval=20 timeout=40 \
>>>>>>>>>>>>>>>                 op start timeout=60 interval=0 \
>>>>>>>>>>>>>>>                 op stop timeout=60 interval=0
>>>>>>>>>>>>>>> primitive dlm ocf:pacemaker:controld \
>>>>>>>>>>>>>>>                 op monitor interval=60 timeout=60 \
>>>>>>>>>>>>>>>                 op start interval=0 timeout=90 \
>>>>>>>>>>>>>>>                 op stop interval=0 timeout=100
>>>>>>>>>>>>>>> clone base-clone base-group \
>>>>>>>>>>>>>>>                 meta interleave=true target-role=Started
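>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (base-group itself is not shown; judging from the pengine
>>>>>>>>>>>>>>> log above it is presumably something like
>>>>>>>>>>>>>>> "group base-group dlm p-fssapmnt p-fsusrsap")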
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cluster properties:
>>>>>>>>>>>>>>> property cib-bootstrap-options: \
>>>>>>>>>>>>>>>                 have-watchdog=true \
>>>>>>>>>>>>>>>                 stonith-enabled=true \
>>>>>>>>>>>>>>>                 stonith-timeout=80 \
>>>>>>>>>>>>>>>                 startup-fencing=true \
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Software versions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> kernel version: 4.4.114-94.11-default
>>>>>>>>>>>>>>> pacemaker-1.1.16-4.8.x86_64
>>>>>>>>>>>>>>> corosync-2.3.6-9.5.1.x86_64
>>>>>>>>>>>>>>> ocfs2-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>> ocfs2-tools-1.8.5-1.35.x86_64
>>>>>>>>>>>>>>> dlm-kmp-default-4.4.114-94.11.3.x86_64
>>>>>>>>>>>>>>> libdlm3-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>> libdlm-4.0.7-1.28.x86_64
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Muhammad Sharfuddin
>>>>>>>
>>>>
>>
>