[ClusterLabs] Multiple OpenSIPS services on one cluster

Tue Nov 3 21:52:07 UTC 2015

On 11/03/2015 01:40 PM, Nuno Pereira wrote:
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>> Sent: Tuesday, 3 November 2015 18:02
>> To: Nuno Pereira; 'Cluster Labs - All topics related to open-source
>> clustering welcomed'
>> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>>
>> On 11/03/2015 05:38 AM, Nuno Pereira wrote:
>>>> -----Original Message-----
>>>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>>>> Sent: Monday, 2 November 2015 19:53
>>>> To: users at clusterlabs.org
>>>> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>>>>
>>>> On 11/02/2015 01:24 PM, Nuno Pereira wrote:
>>>>> Hi all.
>>>>>
>>>>>
>>>>>
>>>>> We have one cluster that has 9 nodes and 20 resources.
>>>>>
>>>>>
>>>>>
>>>>> Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
>>>>> PSIP-SRV02-active and PSIP-SRV02-passive.
>>>>>
>>>>> They should provide an lsb:opensips service, 2 by 2:
>>>>>
>>>>> .         The SRV01-opensips and SRV01-IP resources should be active on
>>>>> one of PSIP-SRV01-active or PSIP-SRV01-passive;
>>>>>
>>>>> .         The SRV02-opensips and SRV02-IP resources should be active on
>>>>> one of PSIP-SRV02-active or PSIP-SRV02-passive.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Everything works fine until one of those nodes is rebooted.
>>>>> In the latest case the problem occurred after a reboot of
>>>>> PSIP-SRV01-passive, which wasn't providing the service at that moment.
>>>>>
>>>>>
>>>>>
>>>>> Note that all opensips nodes used to have the opensips service started
>>>>> on boot by init, which has since been removed.
>>>>>
>>>>> The problem is that the service SRV01-opensips is detected as started on
>>>>> both PSIP-SRV01-active and PSIP-SRV01-passive, and SRV02-opensips is
>>>>> detected as started on both PSIP-SRV01-active and PSIP-SRV02-active.
>>>>>
>>>>> After that, the cluster performs several operations, including actions
>>>>> to stop SRV01-opensips on both PSIP-SRV01-active and PSIP-SRV01-passive,
>>>>> and to stop SRV02-opensips on PSIP-SRV01-active and PSIP-SRV02-active.
>>>>> These fail on PSIP-SRV01-passive, and the resource SRV01-opensips becomes
>>>>> unmanaged.
>>>>>
>>>>>
>>>>>
>>>>> Any ideas on how to fix this?
>>>>>
>>>>> Nuno Pereira
>>>>>
>>>>> G9Telecom
>>>>
>>>> Your configuration looks appropriate, so it sounds like something is
>>>> still starting the opensips services outside cluster control. Pacemaker
>>>> recovers from multiple running instances by stopping them all, then
>>>> starting on the expected node.
>>> Yesterday I removed the pacemaker from starting on boot, and
>>> tested it: the problem persists.
>>> Also, I checked the logs and the opensips wasn't started on the
>>> PSIP-SRV01-passive machine, the one that was rebooted.
>>> Is it possible to change that behaviour, as it is undesirable for our
>>> environment?
>>> For example, only to stop it on one of the hosts.
>>>
>>>> You can verify that Pacemaker did not start the extra instances by
>>>> looking for start messages in the logs (they will look like "Operation
>>>> SRV01-opensips_start_0" etc.).
>>> On the rebooted node I don't see 2 starts, only 2 stops: a failed one for
>>> the service that wasn't supposed to run there, and a normal one for the
>>> service that was supposed to run there:
>>>
>>> Nov 02 23:01:24 [1692] PSIP-SRV01-passive       crmd:    error: process_lrm_event:      Operation SRV02-opensips_stop_0 (node=PSIP-SRV01-passive, call=52, status=4, cib-update=23, confirmed=true) Error
>>> Nov 02 23:01:24 [1692] PSIP-SRV01-passive       crmd:   notice: process_lrm_event:      Operation SRV01-opensips_stop_0: ok (node=PSIP-SRV01-passive, call=51, rc=0, cib-update=24, confirmed=true)
>>>
>>>
>>>> The other question is why did the stop command fail. The logs should
>>>> shed some light on that too; look for the equivalent "_stop_0" operation
>>>> and the messages around it. The resource agent might have reported an
>>>> error, or it might have timed out.
>>> I see this:
>>>
>>> Nov 02 23:01:24 [1689] PSIP-SRV01-passive       lrmd:  warning: operation_finished:     SRV02-opensips_stop_0:1983 - terminated with signal 15
>>> Nov 02 23:01:24 [1689] PSIP-BBT01-passive       lrmd:     info: log_finished: finished - rsc: SRV02-opensips action:stop call_id:52 pid:1983 exit-code:1 exec-time:79ms queue-time:0ms
>>>
>>> As can be seen above, the call_id for the failed stop is greater than the
>>> one that succeeded, but it finishes first.
>>> Also, as both operations are stopping the exact same service, the last one
>>> fails. And the one that fails wasn't supposed to be stopped or started on
>>> that host, as configured.
>>
>> I think I see what's happening. I overlooked that SRV01-opensips and
>> SRV02-opensips are using the same LSB init script. That means Pacemaker
>> can't distinguish one instance from the other. If it runs "status" for
>> one instance, it will return "running" if *either* instance is running.
>> If it tries to stop one instance, that will stop whichever one is running.
>>
>> I don't know what version of Pacemaker you're running, but 1.1.13 has a
>> feature "resource-discovery" that could be used to make Pacemaker ignore
>> SRV01-opensips on the nodes that run SRV02-opensips, and vice versa:
>>
>> http://blog.clusterlabs.org/blog/2014/feature-spotlight-controllable-resource-discovery/
> That sounds consistent with what we have seen.
> Unfortunately I'm using version 1.1.12-4 from yum on CentOS 6.2, so I don't
> have that option.
> I may test whether it's available (is there a pcs command to check it?), but
> I would need to clone some hosts first.

6.2 doesn't have it; I'm not sure about 6.7; I think 7.1 does.
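
As a rough way to check a given node, one could compare the installed
Pacemaker version against 1.1.13. This is only a sketch: the helper name
version_ge is made up for illustration, "sort -V" assumes GNU coreutils, and
distribution packages sometimes backport features, so the result is a hint
rather than a guarantee.

```shell
#!/bin/sh
# version_ge A B: succeed if version A is at least version B.
# Hypothetical helper; relies on "sort -V" from GNU coreutils.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example use. The version string is typed in by hand here; on a real node
# it could be read from the installed package, e.g. with "rpm -q pacemaker".
if version_ge "1.1.12" "1.1.13"; then
    echo "resource-discovery should be available"
else
    echo "resource-discovery is probably not available"
fi
```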

>> Alternatively, you could clone the LSB resource instead of having two,
>> but that would be tricky with your other requirements. What are your
>> reasons for wanting to restrict each instance to two specific nodes,
>> rather than let Pacemaker select any two of the four nodes to run the
>> resources?
> SRV01-opensips and SRV02-opensips are actually the same service with 
> different configurations, created for different purposes, used by 
> different clients, and shouldn't run on the same host.
> If I run SRV01-opensips on a SRV02 host, clients wouldn't have service, and
> vice versa.
> 
> I think I'll go with this option.

The unusual part would be referencing a particular clone instance in
constraints (opensips:0 and opensips:1 instead of just opensips). I've
never done that, but it's worth trying.

>> Another option would be to write an OCF script to use instead of the LSB
>> one. You'd need to add a parameter to distinguish the two instances
>> (maybe the IP it's bound to?), and make start/stop/status operate only
>> on the specified instance. That way, Pacemaker could run "status" for
>> both instances and get the right result for each. It looks like someone
>> did write one a while back, but it needs work (I notice stop always
>> returns success, which is bad): http://anders.com/cms/259
> I already had it here, and it suffers from branding problems (references to
> OpenSer on OpenSIPS, etc).
> It doesn't seem to work:
> 
> # ocf-tester -o ip=127.0.0.1 -n OpenSIPS /usr/lib/ocf/resource.d/anders.com/OpenSIPS
> Beginning tests for /usr/lib/ocf/resource.d/anders.com/OpenSIPS...
> * rc=7: Monitoring an active resource should return 0
> * rc=7: Probing an active resource should return 0
> * rc=7: Monitoring an active resource should return 0
> * rc=7: Monitoring an active resource should return 0
> Tests failed: /usr/lib/ocf/resource.d/anders.com/OpenSIPS failed 4 tests

Yes, it would definitely need some development work.
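
As a rough illustration of what the monitor and stop actions of such an agent
would need to do, here is a minimal sketch that keys everything on a
per-instance PID file. The parameter name "pidfile" and the function names are
assumptions for illustration (they are not taken from the anders.com agent),
and it presumes each OpenSIPS instance is started so that it writes its own
PID file. A real agent would also need start, meta-data and validation.

```shell
#!/bin/sh
# Sketch of per-instance monitor/stop logic for an OCF-style agent.
# Assumption: the PID file path arrives as the hypothetical resource
# parameter "pidfile", which Pacemaker exports as OCF_RESKEY_pidfile.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Succeed only if the process recorded in THIS instance's PID file is alive.
instance_monitor() {
    pidfile="${OCF_RESKEY_pidfile:-}"
    [ -n "$pidfile" ] && [ -r "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

# Stop only this instance; an already-stopped instance counts as success.
instance_stop() {
    instance_monitor || return $OCF_SUCCESS
    kill "$pid" 2>/dev/null
    # Wait up to about 10 seconds for the process to exit.
    i=0
    while [ "$i" -lt 10 ]; do
        instance_monitor || return $OCF_SUCCESS
        sleep 1
        i=$((i + 1))
    done
    # Still alive: report the failure instead of pretending success.
    return $OCF_ERR_GENERIC
}
```

With one PID file per instance, a probe for SRV01-opensips no longer reports
"running" just because the SRV02 instance happens to be up on the same host.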

>>> Might it be related to a problem with the opensips init.d script, like an
>>> invalid result code or something? I checked
>>> http://refspecs.linuxbase.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
>>> and didn't find any problem, but I might have missed some use case.
>>
>> You can follow this guide to verify the script's LSB compliance as far
>> as it matters to Pacemaker:
>>
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
> I can't fully test it right now, but at least in one case the script doesn't
> work well.
> OpenSIPS requires the IP or it doesn't start. On a host without the HA IP,
> the script returns 0 but the process dies and isn't running one second later.
> That doesn't help.
> 
> 
> Nuno Pereira
> G9Telecom

That's not ideal, but it wouldn't be a problem, because you've got
colocation/ordering constraints that ensure Pacemaker won't try to start
opensips unless the IP is up.
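
For manual testing, the "start returns 0 but the process is gone a second
later" case can be caught by checking status after a short delay rather than
trusting the exit code of start. A minimal sketch follows; the helper name
start_and_verify and the default delay are assumptions, and it is meant to be
run by hand on a test host, not by Pacemaker.

```shell
#!/bin/sh
# start_and_verify SCRIPT [DELAY]: run "SCRIPT start", wait DELAY seconds
# (default 2), then rely on "SCRIPT status" instead of the start exit code.
# Hypothetical helper for manual testing; not part of Pacemaker.
start_and_verify() {
    script="$1"
    delay="${2:-2}"
    "$script" start || { echo "start failed"; return 1; }
    sleep "$delay"
    if "$script" status >/dev/null 2>&1; then
        echo "running after ${delay}s"
        return 0
    fi
    echo "start returned 0 but the service is not running after ${delay}s"
    return 1
}
```

It would be called as, for example, start_and_verify /etc/init.d/opensips 2
on a host with and without the HA IP to compare the two outcomes.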