[ClusterLabs] CRM managing ADSL connection; failure not handled

Thu Aug 27 10:14:30 EDT 2015

On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
> 
>> On 24.08.2015 13:32, Tom Yates wrote:
>>> if i understand you aright, my problem is that the stop script didn't
>>> return a 0 (OK) exit status, so CRM didn't know where to go.  is the
>>> exit status of the stop script how CRM determines the status of the
>>> stop operation?
>>
>> correct
>>
>>> does CRM also use the output of "/etc/init.d/script status" to
>>> determine continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
> 
> i just thought i'd come back and follow-up.  in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running.  that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
> 
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 if and
> only if there is none.

Nice
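
For anyone following along, a stop handler of that kind might look
roughly like the sketch below. This is not Tom's actual script. The
function name adsl_stop, the variables PPPOE_STOP, IFACE, SYSFS_NET and
STOP_WAIT, and the use of /sys/class/net to detect the interface are all
assumptions made for illustration.

```shell
# Hypothetical settings; adjust to the local setup.
PPPOE_STOP="${PPPOE_STOP:-pppoe-stop}"    # command that tears down the link
IFACE="${IFACE:-ppp0}"                    # interface that must be gone
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"  # Linux sysfs network directory
STOP_WAIT="${STOP_WAIT:-10}"              # seconds to wait for teardown

adsl_stop() {
    # Ignore the exit status: pppoe-stop returns 1 when pppd is not
    # running, which is already the state we want after a stop.
    "$PPPOE_STOP" >/dev/null 2>&1 || true

    # Give the interface a few seconds to disappear.
    i=0
    while [ -e "$SYSFS_NET/$IFACE" ] && [ "$i" -lt "$STOP_WAIT" ]; do
        sleep 1
        i=$((i + 1))
    done

    # Report success only if the interface is really gone.
    if [ -e "$SYSFS_NET/$IFACE" ]; then
        return 1
    fi
    return 0
}
```

In an init script this would be called from the stop branch of the usual
case statement, with the script exiting with the function's return value,
so that stopping an already-stopped link is reported to the cluster as
success.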

>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
> 
> that is now what happens.  it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
> 
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
> 
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
> 
>  Resource Group: BothIPs
>      InternalIP (ocf::heartbeat:IPaddr):        Started electron
>      ExternalIP (lsb:hb-adsl-helper):   Stopped
> 
> Failed actions:
>     ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
>     ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
>     ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
> 
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
> 
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
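
As a rough sketch, the two meta-attributes could be set on the existing
ExternalIP resource as shown below. The values (3 failures, 120s) are
only illustrative assumptions, not tested recommendations for this setup.
migration-threshold is the number of failures on a node after which the
resource is moved away from it, and failure-timeout is how long before
those failures are forgotten, so that the node becomes eligible again.

```shell
# Move ExternalIP away from a node after 3 failures there (example value).
crm_resource --resource ExternalIP --meta \
    --set-parameter migration-threshold --parameter-value 3

# Forget recorded failures after 120 seconds (example value), so the
# resource is allowed back on a node that failed earlier.
crm_resource --resource ExternalIP --meta \
    --set-parameter failure-timeout --parameter-value 120s

# Expired failures are only noticed when the cluster re-evaluates its
# state, so a shorter recheck interval may be wanted (example value).
crm_attribute --type crm_config --name cluster-recheck-interval --update 2min
```

With both set, each node's failures should eventually expire while the
other node is being tried, which is what lets the service keep moving
back and forth until the line returns.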

> thanks to all for help with this.  thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.