[ClusterLabs] CRM managing ADSL connection; failure not handled

Mon Aug 24 10:07:04 EDT 2015

On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
> 24.08.2015 12:35, Tom Yates wrote:
>> I've got a failover firewall pair where the external interface is ADSL;
>> that is, PPPoE.  I've defined the service thus:
>>
>> primitive ExternalIP lsb:hb-adsl-helper \
>>          op monitor interval="60s"
>>
>> and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:
>>
>> #!/bin/bash
>> RETVAL=0
>> start() {
>>          /sbin/pppoe-start
>> }
>> stop() {
>>          /sbin/pppoe-stop
>> }
>> case "$1" in
>>    start)
>>          start
>>          ;;
>>    stop)
>>          stop
>>          ;;
>>    status)
>>          /sbin/ifconfig ppp0 >& /dev/null && exit 0
>>          exit 1
>>          ;;
>>    *)
>>          echo $"Usage: $0 {start|stop|status}"
>>          exit 3
>> esac
>> exit $?

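Pacemaker expects LSB agents to follow the LSB spec for return codes,
and can't manage the resource properly if they don't. In particular,
"status" must exit 3 (not 1) when the service is not running, and "stop"
must exit 0 when the service is already stopped. The full checklist: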

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
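
As a rough illustration of those return codes, here is a minimal sketch of
how the helper's actions might look. It is not a drop-in replacement. The
/sbin/pppoe-start and /sbin/pppoe-stop paths come from the original script;
IFACE, SYSFS_NET, STOP_WAIT and the *_CMD overrides are hypothetical knobs
added for illustration, and the sysfs check stands in for the original
"ifconfig ppp0" test. "stop" deliberately ignores pppoe-stop's exit status
and checks that the interface is really gone instead.

```shell
#!/bin/sh
# Sketch of LSB-style actions for /etc/init.d/hb-adsl-helper.
# All variables below are illustrative assumptions, overridable from the environment.
IFACE="${IFACE:-ppp0}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
START_CMD="${START_CMD:-/sbin/pppoe-start}"
STOP_CMD="${STOP_CMD:-/sbin/pppoe-stop}"
STOP_WAIT="${STOP_WAIT:-10}"    # seconds to wait for the interface to vanish

link_up() {
    # The interface has a sysfs entry for as long as it exists.
    [ -e "$SYSFS_NET/$IFACE" ]
}

lsb_status() {
    # LSB: 0 = running, 3 = not running
    if link_up; then return 0; fi
    return 3
}

lsb_start() {
    # Starting something that is already running counts as success.
    link_up && return 0
    "$START_CMD"
}

lsb_stop() {
    # Always run the stop command so a reconnect loop is killed too,
    # then judge success by the interface state, not by its exit status.
    "$STOP_CMD" >/dev/null 2>&1 || true
    tries=0
    while link_up; do
        tries=$((tries + 1))
        [ "$tries" -ge "$STOP_WAIT" ] && return 1
        sleep 1
    done
    return 0
}

main() {
    case "$1" in
        start)  lsb_start ;;
        stop)   lsb_stop ;;
        status) lsb_status ;;
        *)      echo "Usage: $0 {start|stop|status}" >&2
                return 2 ;;     # LSB: invalid or excess argument
    esac
}

# In the installed init script, finish with:  main "$1"; exit $?
```

With something along these lines, a monitor failure after the link drops
should lead to a "stop" that reports success, after which Pacemaker is free
to restart the resource or move the group.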

However, it's just as easy to write an OCF agent, which gives you more
flexibility (accepting parameters, etc.):

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
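
To give a feel for the OCF route, here is a minimal sketch of such an
agent, not a finished one. The agent name "adsl", the "iface" parameter and
the SYSFS_NET, STOP_WAIT and *_CMD overrides are all hypothetical. A
production agent would normally source ocf-shellfuncs rather than define the
return codes by hand, and would also implement validate-all. The sketch
trusts pppoe-start's exit status for "start".

```shell
#!/bin/sh
# Sketch of an OCF resource agent for the PPPoE link (hypothetical name: adsl).
# Return codes are spelled out here only to keep the sketch standalone.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Illustrative overrides (assumptions, not part of any real agent).
START_CMD="${START_CMD:-/sbin/pppoe-start}"
STOP_CMD="${STOP_CMD:-/sbin/pppoe-stop}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
STOP_WAIT="${STOP_WAIT:-10}"

# Pacemaker passes resource parameters as OCF_RESKEY_<name> variables.
adsl_iface() { echo "${OCF_RESKEY_iface:-ppp0}"; }

adsl_monitor() {
    if [ -e "$SYSFS_NET/$(adsl_iface)" ]; then return $OCF_SUCCESS; fi
    return $OCF_NOT_RUNNING     # cleanly stopped, as opposed to failed
}

adsl_start() {
    adsl_monitor && return $OCF_SUCCESS
    "$START_CMD" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

adsl_stop() {
    # Run the stop command regardless, then verify the interface is gone.
    "$STOP_CMD" >/dev/null 2>&1 || true
    tries=0
    while adsl_monitor; do
        tries=$((tries + 1))
        [ "$tries" -ge "$STOP_WAIT" ] && return $OCF_ERR_GENERIC
        sleep 1
    done
    return $OCF_SUCCESS
}

adsl_meta_data() {
    cat <<EOF
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="adsl" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Manages a PPPoE session via pppoe-start and pppoe-stop.</longdesc>
  <shortdesc lang="en">PPPoE link</shortdesc>
  <parameters>
    <parameter name="iface" unique="0" required="0">
      <longdesc lang="en">Interface that exists while the session is up.</longdesc>
      <shortdesc lang="en">PPP interface</shortdesc>
      <content type="string" default="ppp0"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="60s"/>
    <action name="stop" timeout="60s"/>
    <action name="monitor" timeout="20s" interval="60s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

adsl_main() {
    case "$1" in
        start)     adsl_start ;;
        stop)      adsl_stop ;;
        monitor)   adsl_monitor ;;
        meta-data) adsl_meta_data ;;
        *)         return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# In the installed agent, finish with:  adsl_main "$1"; exit $?
```

Assuming it were installed as /usr/lib/ocf/resource.d/local/adsl (the
"local" provider name is made up), the primitive could then be defined as
ocf:local:adsl with params iface="ppp0" in place of lsb:hb-adsl-helper.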

>> The problem is that sometimes the ADSL connection falls over, as they
>> do, eg:
>>
>> Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
>> Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
>> Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
>> 164420300 bytes.
>> Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
>> Aug 20 11:42:13 positron pppd[2469]: Modem hangup
>> Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
>> 1735: Input/output error
>> Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
>> Aug 20 11:42:13 positron pppd[2469]: Exit.
>> Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
>> attempting re-connection.
>>
>> CRMd then logs a bunch of stuff, followed by
>>
>> Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
>> Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
>> additional parameters are needed.
>> [...]
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppd
>> Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
>> Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
>> process 28357 exited with return code 1.
>>
>>
>> At this point, the PPPoE connection is down, and stays down.  CRMd
>> doesn't fail the group which contains both internal and external
>> interfaces over to the other node, but nor does it try to restart the
>> service.  I'm fairly sure this is because I've done something
>> boneheaded, but I can't get my bone head around what it might be.
>>
>> Any light anyone can shed is much appreciated.
>>
>>
> 
> If the stop operation fails, the resource state is undefined, and
> Pacemaker won't do anything further with this resource. Either make sure
> the script returns success when appropriate, or the only option is to
> fence the node where the resource was active.