[ClusterLabs] Approach to validate on stop op (Was Re: crmsh configure delete for constraints)

Tue Mar 29 08:28:39 EDT 2016

10.02.2016 12:31, Vladislav Bogdanov wrote:
> 10.02.2016 11:38, Ulrich Windl wrote:
>>>>> Vladislav Bogdanov <bubble at hoster-ok.com> wrote on 10.02.2016 at
>>>>> 05:39 in
>> message <6E479808-6362-4932-B2C6-348C7EFC4020 at hoster-ok.com>:
>>
>> [...]
>>> Well, I'd reword. Generally, RA should not exit with error if validation
>>> fails on stop.
>>> Is that better?
>> [...]
>>
>> As we have different error codes, what type of error?
>
> Any that makes pacemaker think the resource stop op failed.
> OCF_ERR_* in particular.
>
> If pacemaker gets an error on start, it will run stop with the same
> set of parameters anyway. It will then get the error again if that
> error came from validation and the RA does not differentiate validation
> for start and stop. Then circular fencing over the whole cluster is
> triggered for no reason.
>
> Of course, for safety, the RA could save its state if start was
> successful and skip validation on stop only if that state is not found.
> Otherwise a removed binary or config file would result in the resource
> running on several nodes.
>
> Well, this all seems too complicated to turn into a general
> algorithm ;)

Well, after some thinking, I've come up with an approach which sounds both
elegant and safe enough to me and my colleagues. Please look at the
following excerpt (part of a hypothetical RA, placed before the main 'case'):

-----
VALIDATION_FAILURE_FLAG="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid"

case "${__OCF_ACTION}" in
     meta-data)
         meta_data
         exit $OCF_SUCCESS
         ;;
     usage|help)
         usage
         exit $OCF_SUCCESS
         ;;
     start)
         validate
         ret=$?
         if [ ${ret} -ne $OCF_SUCCESS ] ; then
             touch "${VALIDATION_FAILURE_FLAG}"
             exit ${ret}
         fi
         ;;
     stop)
         validate
         ret=$?
         if [ ${ret} -ne $OCF_SUCCESS ] ; then
             if [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
                 rm -f "${VALIDATION_FAILURE_FLAG}"
                 exit $OCF_SUCCESS
             else
                 exit ${ret}
             fi
         fi
         ;;
     *) # monitor | notify | reload | etc
         validate
         ret=$?
         if [ ${ret} -ne $OCF_SUCCESS ] ; then
             if ocf_is_probe ; then
                 exit $OCF_NOT_RUNNING
             fi
             exit ${ret}
         fi
         ;;
esac
-----

The above assumes that the validation function does not call exit (and thus
uses have_binary instead of check_binary, etc.) but returns an error code.
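
As a rough sketch of what such a non-exiting validation function could look
like: the parameter names OCF_RESKEY_binary and OCF_RESKEY_config are made up
for illustration, and the fallback definitions are only there so that the
snippet runs outside of a real agent (a real RA gets them from
ocf-shellfuncs).

```shell
# Fallbacks for standalone use only; ocf-shellfuncs normally provides these.
: "${OCF_SUCCESS:=0}" "${OCF_ERR_INSTALLED:=5}"
command -v have_binary >/dev/null 2>&1 || \
    have_binary() { command -v "$1" >/dev/null 2>&1; }

validate() {
    # have_binary only reports the result; check_binary would exit the
    # whole agent here, which the surrounding 'case' cannot intercept.
    # OCF_RESKEY_binary is a hypothetical parameter name.
    if ! have_binary "${OCF_RESKEY_binary}" ; then
        return $OCF_ERR_INSTALLED
    fi
    # OCF_RESKEY_config is a hypothetical parameter name as well.
    if [ ! -r "${OCF_RESKEY_config}" ] ; then
        return $OCF_ERR_INSTALLED
    fi
    return $OCF_SUCCESS
}
```

Because validate only returns, the caller decides what a failure means for
the particular action.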

The main difference from the current ocf_rarun implementation is that
changes to the machine environment (deleted binaries, configs, etc.) still
result in a stop failure (and thus fencing) if those changes were made
after the successful validation on resource start.
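
The two cases can be walked through with a small standalone sketch. The
helpers start_precheck and stop_precheck are hypothetical stand-ins for the
start and stop branches of the excerpt: they take the validation result as
an argument and return a code instead of exiting, and a temporary directory
is assumed in place of ${HA_RSCTMP}.

```shell
: "${OCF_SUCCESS:=0}" "${OCF_ERR_INSTALLED:=5}"

# Temporary stand-in for ${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid
FLAG_DIR=$(mktemp -d)
VALIDATION_FAILURE_FLAG="${FLAG_DIR}/demo.invalid"

# $1 is the simulated validation result.
start_precheck() {
    if [ "$1" -ne "$OCF_SUCCESS" ] ; then
        touch "${VALIDATION_FAILURE_FLAG}"
    fi
    return "$1"
}

stop_precheck() {
    if [ "$1" -ne "$OCF_SUCCESS" ] && [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
        # Start never passed validation, so nothing can be running.
        rm -f "${VALIDATION_FAILURE_FLAG}"
        return $OCF_SUCCESS
    fi
    return "$1"
}

# Case 1: validation already failed on start, so stop reports success.
start_precheck $OCF_ERR_INSTALLED; case1_start=$?
stop_precheck $OCF_ERR_INSTALLED;  case1_stop=$?
echo "case 1: start=${case1_start} stop=${case1_stop}"

# Case 2: start validated fine, the environment broke later, stop fails.
start_precheck $OCF_SUCCESS;       case2_start=$?
stop_precheck $OCF_ERR_INSTALLED;  case2_stop=$?
echo "case 2: start=${case2_start} stop=${case2_stop}"

rm -rf "${FLAG_DIR}"
```

Only the second case ends in a failed stop, which is the one where fencing
is actually warranted.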

I plan to test this approach extensively in my RAs shortly.

Comments are welcome.

Best,
Vladislav

>
>
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>