[ClusterLabs] proftpd resource agent - fix for a start/monitor race condition

Wed Mar 25 13:26:04 UTC 2015

Hi,

On Wed, Mar 25, 2015 at 11:40:32AM +0100, Matthias Ferdinand wrote:
> Hello,
> 
> the proftpd resource agent sometimes shows a race condition:
> 
> if startup of the proftpd binary is slow, the pacemaker monitor
> operation immediately following the start operation may not yet find
> the pid-file from proftpd, and then it will signal failure. Subsequent
> retries of the start operation then keep failing because the tcp sockets
> are already used by the initial proftpd (which was never stopped).

Yes, that's a common issue with all servers that run as daemons.

> Fix (copied from the apache resource agent): after invoking the proftpd
> binary, do not return to caller until the monitor operation (called
> from within the RA itself) shows "success". Handling startup timeouts is
> left to the cluster manager.

Very good. More below.

> Regards
> Matthias Ferdinand
> -- 
> one4vision GmbH                    Fon +49 681 96727 - 60
> Residenz am Schlossgarten          Fax +49 681 96727 - 69
> Talstraße 34-42                    info at one4vision.de
> D-66119 Saarbrücken                http://www.one4vision.de
> HRB 11751                          verantwortl. Geschäftsführer:
> Amtsgericht Saarbrücken            Christof Allmann, Christoph Harth

> --- 20150226_usr_lib_ocf_resource.d_heartbeat_proftpd	2015-02-26 17:39:19.956590821 +0100
> +++ patched_proftpd	2015-02-26 17:51:06.027695989 +0100
> @@ -163,7 +163,25 @@
>  		exit $OCF_ERR_GENERIC
>  	fi
>  
> -	exit $OCF_SUCCESS
> +        tries=0
> +        while : # wait until the user set timeout
> +        do
> +                proftpd_monitor
> +                ec=$?

Limit the scope of ec: declare it with "local ec" near the top of
the function.

> +                if [ $ec -eq $OCF_NOT_RUNNING ]
> +                then
> +                        tries=`expr $tries + 1`

You can drop the tries variable; it is incremented but never read.

> +                        ocf_log info "waiting for proftpd ${OCF_RESKEY_conffile} to come up"
> +                        sleep 1
> +                else
> +                        break
> +                fi
> +        done
> +
> +        if [ $ec -ne 0 ]; then
> +                proftpd_stop

I'd remove this. If a start operation fails, the cluster manager
will try to stop the resource itself.
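
With those three comments applied, the wait loop could look roughly
like the sketch below. This is untested against a real cluster. The
function name wait_for_proftpd, the PIDFILE variable and the
proftpd_monitor stub are made up for illustration, and ocf_log and
the OCF_* return codes are stand-ins for what ocf-shellfuncs normally
provides, so that the snippet runs on its own:

```shell
# Stand-ins for the OCF environment (normally sourced from ocf-shellfuncs).
: "${OCF_SUCCESS:=0}"
: "${OCF_NOT_RUNNING:=7}"
ocf_log() { echo "$1: $2" >&2; }

# Hypothetical monitor stub: "not running" until the pid file shows up.
# The real proftpd_monitor in the RA does more than this.
PIDFILE=${PIDFILE:-/tmp/proftpd_sketch.pid}
proftpd_monitor() {
	[ -f "$PIDFILE" ] && return $OCF_SUCCESS
	return $OCF_NOT_RUNNING
}

# Wait until the monitor stops reporting "not running".
# No retry counter and no stop on failure: the start timeout and any
# recovery are left to the cluster manager.
wait_for_proftpd() {
	local ec
	while :
	do
		proftpd_monitor
		ec=$?
		[ "$ec" -eq "$OCF_NOT_RUNNING" ] || break
		ocf_log info "waiting for proftpd ${OCF_RESKEY_conffile:-} to come up"
		sleep 1
	done
	return $ec
}
```

In proftpd_start this would replace the final "exit $OCF_SUCCESS",
with the function's return code handed back to the caller.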

Cheers,

Dejan

> +        fi
> +        return $ec
>  }
>  
>  
> @@ -264,6 +282,7 @@
>  case $1 in
>      start)	proftpd_validate_all
>  			proftpd_start
> +                        exit $?
>  			;;
>  	
>      stop)	proftpd_stop
> @@ -298,4 +317,3 @@
>   		exit $OCF_ERR_UNIMPLEMENTED
>  		;;
>  esac
> -
