[ClusterLabs] CRM managing ADSL connection; failure not handled

Thu Aug 27 04:04:21 EDT 2015

On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

> 24.08.2015 13:32, Tom Yates wrote:
>>  if i understand you aright, my problem is that the stop script didn't
>>  return a 0 (OK) exit status, so CRM didn't know where to go.  is the
>>  exit status of the stop script how CRM determines the status of the stop
>>  operation?
>
> correct
>
>>  does CRM also use the output of "/etc/init.d/script status" to determine
>>  continuing successful operation?
>
> It definitely does not use *output* of script - only return code. If the 
> question is whether it probes resource additionally to checking stop exit 
> code - I do not think so (I know it does it in some cases for systemd 
> resources).

i just thought i'd come back and follow up.  in testing this morning, i 
can confirm that the "pppoe-stop" command returns status 1 if pppd isn't 
running.  that makes a standard init.d script, which passes on the return 
code of the stop command, unhelpful to CRM.

i changed the script so that on stop, having run pppoe-stop, it checks for 
the existence of a working ppp0 interface, and returns 0 iff there is 
none.
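
for anyone hitting the same thing, the stop logic described above might 
look roughly like the sketch below.  this is not the actual script: the 
function names stop_adsl and ppp0_is_up, and the overridable IFACE, 
SYSFS_NET and PPPOE_STOP variables, are hypothetical names made up for 
illustration, and it assumes that the presence of the interface under 
/sys/class/net (linux only) is a good enough test for "working".

```shell
#!/bin/sh
# sketch only: all names below are hypothetical, not from the real script.

# interface to check; ppp0 is the one mentioned in this thread
IFACE="${IFACE:-ppp0}"

# succeed if the interface is present.
# assumption: existence under /sys/class/net stands in for "working".
ppp0_is_up() {
    [ -e "${SYSFS_NET:-/sys/class/net}/$IFACE" ]
}

stop_adsl() {
    # pppoe-stop returns 1 when pppd is not running, so its exit
    # status is deliberately ignored here
    "${PPPOE_STOP:-pppoe-stop}" >/dev/null 2>&1 || true

    # report success only if the interface is really gone
    if ppp0_is_up; then
        return 1
    fi
    return 0
}
```

the stop) branch of the init script would then call stop_adsl and exit 
with its return value, so that stopping an already-stopped service 
counts as success, which is what the LSB convention expects.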

> If resource was previously active and stop was attempted as cleanup after 
> resource failure - yes, it should attempt to start it again.

that is now what happens.  it seems to try three times to bring up pppd, 
then kicks the service over to the other node.

in the case of extended outages (ie, the ISP goes away for more than about 
10 minutes), where both nodes have time to fail, we end up back in the bad 
old state (service failed on both nodes):

[root@positron ~]# crm status
[...]
Online: [ electron positron ]

  Resource Group: BothIPs
      InternalIP (ocf::heartbeat:IPaddr):        Started electron
      ExternalIP (lsb:hb-adsl-helper):   Stopped

Failed actions:
     ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
     ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
     ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error

is there any way to configure CRM to keep kicking the service between the 
two nodes forever (ie, try three times on positron, kick service group to 
electron, try three times on electron, kick back to positron, lather rinse 
repeat...)?

for a service like DSL, which can go away for extended periods through no 
local fault then suddenly and with no announcement come back, this would 
be most useful behaviour.

thanks to all for help with this.  thanks also to those who have suggested 
i rewrite this as an OCF agent (especially to ken gaillot who was kind 
enough to point me to documentation); i will look at that if time permits.

-- 

   Tom Yates  -  http://www.teaparty.net