[ClusterLabs] Antw: Resource Demote Time Out Question

Marc Smith marc.smith at parodyne.com
Wed Jan 10 15:26:53 EST 2018


Thank you Ulrich and Ken, that was exactly the solution! Much appreciated.

--Marc


On Wed, Jan 10, 2018 at 12:02 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> On Wed, 2018-01-10 at 16:48 +0100, Ulrich Windl wrote:
>> Hi!
>>
>> Common pitfall: the default timeouts in the RA's metadata are not
>> applied automatically when you don't specify a value; they are merely
>> suggestions for you to use when configuring (don't ask me why!).
>> Instead, a global default timeout is used when you don't specify one.
>> I hope I put that correctly. You could verify by manually adding the
>> default values from the metadata to the "demote" operation.
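>>
>> (For reference, Pacemaker's built-in fallback operation timeout is 20
>> seconds, which matches what you are seeing. If you would rather raise
>> the fallback for all operations instead of per operation, that lives
>> in op_defaults. A rough crm shell sketch, with an arbitrary example
>> value:
>>
>>   crm configure op_defaults timeout=60s
>>
>> A per-operation timeout on the resource still overrides this.)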
>>
>> Regards,
>> Ulrich
>
> Yep. That would be in the section of the configuration with "op start
> interval=0 timeout=120" ... you want "op demote interval=0 timeout="
> with the desired value.
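>
> For example, the op section of that primitive could grow a matching
> demote line (just a sketch; the 60-second values below simply mirror
> the suggestions in the RA's metadata, and the promote line is optional
> but avoids hitting the same 20s fallback in the other direction):
>
>   primitive p_scst_zfs_vols ocf:esos:scst \
>     params ... (existing params unchanged) \
>     op monitor interval=10 role=Master \
>     op monitor interval=20 role=Slave \
>     op start interval=0 timeout=120 \
>     op stop interval=0 timeout=90 \
>     op promote interval=0 timeout=60 \
>     op demote interval=0 timeout=60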
>
>>
>> > > > Marc Smith <marc.smith at parodyne.com> wrote on 10.01.2018 at
>> > > > 16:26 in message
>> <CAKdCJ==ABfaKgsL4awK=VY_90PmamhReK6ExxnAzDPLGx2Av0A at mail.gmail.com>:
>> > Hi,
>> >
>> > I'm experiencing a time out on a demote operation and I'm not sure
>> > which parameter / attribute needs to be updated to extend the
>> > time out window.
>> >
>> > I'm using Pacemaker 1.1.16 and Corosync 2.4.2.
>> >
>> > Here are the log lines that show the issue (shutdown initiated, then
>> > the demote times out after 20 seconds):
>> > --snip--
>> > Jan 10 09:08:13 tgtnode2 pacemakerd[1096]:   notice: Caught 'Terminated' signal
>> > Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: Caught 'Terminated' signal
>> > Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Scheduling Node tgtnode2.parodyne.com for shutdown
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Promote p_scst_zfs_vols:0^I(Slave -> Master tgtnode1.parodyne.com)
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Demote p_scst_zfs_vols:1^I(Master -> Stopped tgtnode2.parodyne.com)
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Stop p_dlm:1^I(tgtnode2.parodyne.com)
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Migrate p_dummy_g_zfs^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Move p_zfs_pool_one^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
>> > Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-1441.bz2
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: scst_notify() -> Received a 'pre' / 'demote' notification.
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: p_scst_zfs_vols notify returned: 0
>> > Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: Result of notify operation for p_scst_zfs_vols on tgtnode2.parodyne.com: 0 (ok)
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST version: 3.3.0-rc
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> Resource is running.
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST local target group state: active
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Resource is currently running as Master.
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Blocking all 'zfs_vols' devices...
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: Waiting for devices to finish blocking...
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'transitioning'...
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'transitioning' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'unavailable'...
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'unavailable' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
>> > Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Changing the group's devices to inactive...
>> > Jan 10 09:08:33 tgtnode2 lrmd[1101]:  warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out
>> > Jan 10 09:08:33 tgtnode2 crmd[1104]:   notice: Transition aborted by operation p_scst_zfs_vols_demote_0 'modify' on tgtnode2.parodyne.com: Event failed
>> > Jan 10 09:08:33 tgtnode2 crmd[1104]:   notice: Transition aborted by status-2-fail-count-p_scst_zfs_vols doing create fail-count-p_scst_zfs_vols=1: Transient attribute change
>> > --snip--
>> >
>> > So I'm getting a "time out" after 20 seconds of waiting in the demote
>> > operation with this line:
>> > Jan 10 09:08:33 tgtnode2 lrmd[1101]:  warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out
>> >
>> > The 20 second time out is consistent when testing this, so I'm sure
>> > it's just a configuration thing, but it's not obvious to me which
>> > parameter/attribute/setting needs to be modified.
>> >
>> > The relevant metadata section from the RA referenced above:
>> > --snip--
>> >           <actions>
>> >             <action name="meta-data" timeout="5" />
>> >             <action name="start" timeout="120" />
>> >             <action name="stop" timeout="90" />
>> >             <action name="monitor" timeout="20" depth="0"
>> > interval="10" role="Master" />
>> >             <action name="monitor" timeout="20" depth="0"
>> > interval="20" role="Slave" />
>> >             <action name="notify" timeout="20" />
>> >             <action name="promote" timeout="60" />
>> >             <action name="demote" timeout="60" />
>> >             <action name="reload" timeout="20" />
>> >             <action name="validate-all" timeout="20" />
>> >           </actions>
>> > --snip--
>> >
>> > And the actual cluster configuration (primitive and multi-state
>> > clone) for the referenced resource:
>> > --snip--
>> > primitive p_scst_zfs_vols ocf:esos:scst \
>> > params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local
>> > remote_tgt_grp=zfs_vols_remote m_alua_state=active
>> > s_alua_state=unavailable use_trans_state=true set_dev_active=true \
>> > op monitor interval=10 role=Master \
>> > op monitor interval=20 role=Slave \
>> > op start interval=0 timeout=120 \
>> > op stop interval=0 timeout=90
>> > ms ms_scst_zfs_vols p_scst_zfs_vols \
>> > meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
>> > notify=true interleave=true
>> > --snip--
>> >
>> > I see a few values in the RA's metadata action section with "20
>> > seconds" and the interval parameter for the primitive, but I'm not
>> > sure which might be affecting this demote time out setting. Any help
>> > would be greatly appreciated.
>> >
>> > Thanks so much for your time! And thank you for a great software
>> > product!
>> >
>> >
>> > --Marc
>> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



