[ClusterLabs] Antw: Resource Demote Time Out Question

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Jan 10 10:48:05 EST 2018


Hi!

Common pitfall: the default timeouts in the RA's metadata are not what gets applied when you don't specify a value; they are only suggestions for you when configuring (don't ask me why!). When you don't specify a timeout for an operation, a global default timeout is used instead.
I hope I put that correctly. You could verify by manually adding the default values from the metadata to the "demote" operation.
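
As a rough sketch of that check, in the same crm shell syntax as the quoted configuration below: the primitive is copied unchanged from the original post, and the only additions are the explicit "promote" and "demote" operations, with the 60-second timeouts taken from the RA metadata. Whether 60 seconds is actually enough for this demote is an assumption that needs testing. Pacemaker's built-in operation timeout is 20 seconds unless overridden, which would match the timeout seen in the log.

```
# Sketch only: same primitive as in the quoted configuration, with explicit
# promote/demote operations added. The timeout=60 values come from the RA
# metadata suggestions; they are not verified against this workload.
primitive p_scst_zfs_vols ocf:esos:scst \
        params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local remote_tgt_grp=zfs_vols_remote m_alua_state=active s_alua_state=unavailable use_trans_state=true set_dev_active=true \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=90 \
        op promote interval=0 timeout=60 \
        op demote interval=0 timeout=60
```

The two monitor operations have no explicit timeout either, so they would also run with the global default. A cluster-wide default can alternatively be set through op_defaults, but per-operation values make it easier to see which timeout applies where.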

Regards,
Ulrich


>>> Marc Smith <marc.smith at parodyne.com> wrote on 10.01.2018 at 16:26 in
message
<CAKdCJ==ABfaKgsL4awK=VY_90PmamhReK6ExxnAzDPLGx2Av0A at mail.gmail.com>:
> Hi,
> 
> I'm experiencing a time out on a demote operation and I'm not sure
> which parameter / attribute needs to be updated to extend the time out
> window.
> 
> I'm using Pacemaker 1.1.16 and Corosync 2.4.2.
> 
> Here is the set of log lines that show the issue (shutdown initiated,
> then demote time out after 20 seconds):
> --snip--
> Jan 10 09:08:13 tgtnode2 pacemakerd[1096]:   notice: Caught 'Terminated' 
> signal
> Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: Caught 'Terminated' signal
> Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: State transition S_IDLE
> -> S_POLICY_ENGINE
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Scheduling Node
> tgtnode2.parodyne.com for shutdown
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Promote
> p_scst_zfs_vols:0^I(Slave -> Master tgtnode1.parodyne.com)
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Demote
> p_scst_zfs_vols:1^I(Master -> Stopped tgtnode2.parodyne.com)
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Stop
> p_dlm:1^I(tgtnode2.parodyne.com)
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Migrate
> p_dummy_g_zfs^I(Started tgtnode2.parodyne.com ->
> tgtnode1.parodyne.com)
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Move
> p_zfs_pool_one^I(Started tgtnode2.parodyne.com ->
> tgtnode1.parodyne.com)
> Jan 10 09:08:13 tgtnode2 pengine[1103]:   notice: Calculated
> transition 3, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-1441.bz2
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG:
> scst_notify() -> Received a 'pre' / 'demote' notification.
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG:
> p_scst_zfs_vols notify returned: 0
> Jan 10 09:08:13 tgtnode2 crmd[1104]:   notice: Result of notify
> operation for p_scst_zfs_vols on tgtnode2.parodyne.com: 0 (ok)
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_monitor() -> SCST version: 3.3.0-rc
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_monitor() -> Resource is running.
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_monitor() -> SCST local target group state: active
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_demote() -> Resource is currently running as Master.
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Blocking
> all 'zfs_vols' devices...
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: Waiting
> for devices to finish blocking...
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to
> 'transitioning'...
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO:
> Collecting current configuration: done. -> Making requested changes.
> -> Setting Target Group attribute 'state' to value 'transitioning' for
> target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s)
> made. All done.
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to
> 'unavailable'...
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO:
> Collecting current configuration: done. -> Making requested changes.
> -> Setting Target Group attribute 'state' to value 'unavailable' for
> target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s)
> made. All done.
> Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG:
> scst_demote() -> Changing the group's devices to inactive...
> Jan 10 09:08:33 tgtnode2 lrmd[1101]:  warning:
> p_scst_zfs_vols_demote_0 process (PID 17473) timed out
> Jan 10 09:08:33 tgtnode2 crmd[1104]:   notice: Transition aborted by
> operation p_scst_zfs_vols_demote_0 'modify' on tgtnode2.parodyne.com:
> Event failed
> Jan 10 09:08:33 tgtnode2 crmd[1104]:   notice: Transition aborted by
> status-2-fail-count-p_scst_zfs_vols doing create
> fail-count-p_scst_zfs_vols=1: Transient attribute change
> --snip--
> 
> So I'm getting a "time out" after 20 seconds of waiting in the demote
> operation with this line: Jan 10 09:08:33 tgtnode2 lrmd[1101]:
> warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out
> 
> The 20 second time out is consistent when testing this, so I'm sure
> it's just a configuration thing, but it's not obvious to me which
> parameter/attribute/setting needs to be modified.
> 
> The relevant metadata section from the RA referenced above:
> --snip--
>           <actions>
>             <action name="meta-data" timeout="5" />
>             <action name="start" timeout="120" />
>             <action name="stop" timeout="90" />
>             <action name="monitor" timeout="20" depth="0"
> interval="10" role="Master" />
>             <action name="monitor" timeout="20" depth="0"
> interval="20" role="Slave" />
>             <action name="notify" timeout="20" />
>             <action name="promote" timeout="60" />
>             <action name="demote" timeout="60" />
>             <action name="reload" timeout="20" />
>             <action name="validate-all" timeout="20" />
>           </actions>
> --snip--
> 
> And the primitive and clone (multi-state) actual cluster configuration
> for the referenced resource:
> --snip--
> primitive p_scst_zfs_vols ocf:esos:scst \
> params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local
> remote_tgt_grp=zfs_vols_remote m_alua_state=active
> s_alua_state=unavailable use_trans_state=true set_dev_active=true \
> op monitor interval=10 role=Master \
> op monitor interval=20 role=Slave \
> op start interval=0 timeout=120 \
> op stop interval=0 timeout=90
> ms ms_scst_zfs_vols p_scst_zfs_vols \
> meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> notify=true interleave=true
> --snip--
> 
> I see a few values in the RA's metadata action section with "20
> seconds" and the interval parameter for the primitive, but I'm not
> sure which might be affecting this demote time out setting. Any help
> would be greatly appreciated.
> 
> Thanks so much for your time! And thank you for a great software product!
> 
> 
> --Marc
> 



