[ClusterLabs] Multiple OpenSIPS services on one cluster
Nuno Pereira
nuno.pereira at g9telecom.pt
Tue Nov 3 19:40:27 UTC 2015
> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Tuesday, 3 November 2015 18:02
> To: Nuno Pereira; 'Cluster Labs - All topics related to open-source
> clustering welcomed'
> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
>
> On 11/03/2015 05:38 AM, Nuno Pereira wrote:
> >> -----Original Message-----
> >> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> >> Sent: Monday, 2 November 2015 19:53
> >> To: users at clusterlabs.org
> >> Subject: Re: [ClusterLabs] Multiple OpenSIPS services on one cluster
> >>
> >> On 11/02/2015 01:24 PM, Nuno Pereira wrote:
> >>> Hi all.
> >>>
> >>> We have one cluster that has 9 nodes and 20 resources.
> >>>
> >>> Four of those hosts are PSIP-SRV01-active, PSIP-SRV01-passive,
> >>> PSIP-SRV02-active and PSIP-SRV02-passive.
> >>>
> >>> They should provide an lsb:opensips service, 2 by 2:
> >>>
> >>> . The SRV01-opensips and SRV01-IP resources should be active on one
> >>> of PSIP-SRV01-active or PSIP-SRV01-passive;
> >>>
> >>> . The SRV02-opensips and SRV02-IP resources should be active on one
> >>> of PSIP-SRV02-active or PSIP-SRV02-passive.
> >>>
> >>> Everything works fine, until the moment that one of those nodes is
> >>> rebooted. In the last case the problem occurred with a reboot of
> >>> PSIP-SRV01-passive, which wasn't providing the service at that moment.
> >>>
> >>>
> >>>
> >>> Note that all opensips nodes originally had the opensips service set
> >>> to start on boot by initd, which has since been removed.
> >>>
> >>> The problem is that the service SRV01-opensips is detected to be
> >>> started on both PSIP-SRV01-active and PSIP-SRV01-passive, and
> >>> SRV02-opensips is detected to be started on both PSIP-SRV01-active
> >>> and PSIP-SRV02-active.
> >>>
> >>> After that, and after several operations done by the cluster, which
> >>> include actions to stop SRV01-opensips on both PSIP-SRV01-active and
> >>> PSIP-SRV01-passive, and to stop SRV02-opensips on PSIP-SRV01-active
> >>> and PSIP-SRV02-active, which fails on PSIP-SRV01-passive, the resource
> >>> SRV01-opensips becomes unmanaged.
> >>>
> >>>
> >>>
> >>> Any ideas on how to fix this?
> >>>
> >>> Nuno Pereira
> >>>
> >>> G9Telecom
> >>
> >> Your configuration looks appropriate, so it sounds like something is
> >> still starting the opensips services outside cluster control. Pacemaker
> >> recovers from multiple running instances by stopping them all, then
> >> starting on the expected node.
> > Yesterday I removed pacemaker from starting on boot, and
> > tested it: the problem persists.
> > Also, I checked the logs, and opensips wasn't started on the
> > PSIP-SRV01-passive machine, the one that was rebooted.
> > Is it possible to change that behaviour, as it is undesirable for our
> > environment? For example, to stop it on only one of the hosts.
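From what I can tell from the Pacemaker Explained docs, the closest knob is
the multiple-active resource meta-attribute; it doesn't seem to offer "stop
only the extra copy", but "block" or "stop_only" might be closer to what we
want than the default "stop_start". An untested sketch with pcs, using our
resource names:

# leave a multiply-active resource alone (unmanaged) instead of restarting it
pcs resource meta SRV01-opensips multiple-active=block
# or: stop all copies and leave the resource stopped
pcs resource meta SRV02-opensips multiple-active=stop_only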
> >
> >> You can verify that Pacemaker did not start the extra instances by
> >> looking for start messages in the logs (they will look like "Operation
> >> SRV01-opensips_start_0" etc.).
> > On the rebooted node I don't see 2 starts, only 2 stops: the first one
> > failed, for the service that wasn't supposed to run there, and there
> > was a normal one for the service that was supposed to run there:
> >
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: error: process_lrm_event: Operation SRV02-opensips_stop_0 (node=PSIP-SRV01-passive, call=52, status=4, cib-update=23, confirmed=true) Error
> > Nov 02 23:01:24 [1692] PSIP-SRV01-passive crmd: notice: process_lrm_event: Operation SRV01-opensips_stop_0: ok (node=PSIP-SRV01-passive, call=51, rc=0, cib-update=24, confirmed=true)
> >
> >
> >> The other question is why did the stop command fail. The logs should
> >> shed some light on that too; look for the equivalent "_stop_0" operation
> >> and the messages around it. The resource agent might have reported an
> >> error, or it might have timed out.
> > I see this:
> >
> > Nov 02 23:01:24 [1689] PSIP-SRV01-passive lrmd: warning: operation_finished: SRV02-opensips_stop_0:1983 - terminated with signal 15
> > Nov 02 23:01:24 [1689] PSIP-BBT01-passive lrmd: info: log_finished: finished - rsc: SRV02-opensips action:stop call_id:52 pid:1983 exit-code:1 exec-time:79ms queue-time:0ms
> >
> > As can be seen above, the call_id of the failed stop is greater than
> > that of the successful one, but it finishes first.
> > Also, as both operations are stopping the exact same service, the last
> > one fails. And in the case of the one that fails, it wasn't supposed
> > to be stopped or started on that host, as configured.
>
> I think I see what's happening. I overlooked that SRV01-opensips and
> SRV02-opensips are using the same LSB init script. That means Pacemaker
> can't distinguish one instance from the other. If it runs "status" for
> one instance, it will return "running" if *either* instance is running.
> If it tries to stop one instance, that will stop whichever one is running.
>
> I don't know what version of Pacemaker you're running, but 1.1.13 has a
> feature "resource-discovery" that could be used to make Pacemaker ignore
> SRV01-opensips on the nodes that run SRV02-opensips, and vice versa:
>
> http://blog.clusterlabs.org/blog/2014/feature-spotlight-controllable-resource-discovery/
That sounds consistent with what we have seen.
Unfortunately I'm using version 1.1.12-4 from yum on CentOS 6.2, so I don't
have that option.
I may test whether it's available (is there any pcs command to check?), but I
would need to clone some hosts and so on.
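If I do get to 1.1.13, my understanding of that blog post is that it would be
one location constraint per resource/node pair with resource-discovery=never,
so the cluster never even probes SRV01-opensips on the SRV02 nodes and vice
versa. An untested sketch (the constraint IDs are made up by me, and it needs
a pcs new enough to know the resource-discovery option; otherwise the
attribute can be added to the constraint XML directly):

# never run, and never probe, SRV01-opensips on the SRV02 nodes
pcs constraint location add srv01-not-on-srv02a SRV01-opensips PSIP-SRV02-active -INFINITY resource-discovery=never
pcs constraint location add srv01-not-on-srv02p SRV01-opensips PSIP-SRV02-passive -INFINITY resource-discovery=never
# and the mirror image for SRV02-opensips on the SRV01 nodes
pcs constraint location add srv02-not-on-srv01a SRV02-opensips PSIP-SRV01-active -INFINITY resource-discovery=never
pcs constraint location add srv02-not-on-srv01p SRV02-opensips PSIP-SRV01-passive -INFINITY resource-discovery=never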
> Alternatively, you could clone the LSB resource instead of having two,
> but that would be tricky with your other requirements. What are your
> reasons for wanting to restrict each instance to two specific nodes,
> rather than let Pacemaker select any two of the four nodes to run the
> resources?
SRV01-opensips and SRV02-opensips are actually the same service with
different configurations, created for different purposes, used by
different clients, and shouldn't run on the same host.
If I run SRV01-opensips on a SRV02 host, clients wouldn't have service, and
vice versa.
I think I'll go for this.
> Another option would be to write an OCF script to use instead of the LSB
> one. You'd need to add a parameter to distinguish the two instances
> (maybe the IP it's bound to?), and make start/stop/status operate only
> on the specified instance. That way, Pacemaker could run "status" for
> both instances and get the right result for each. It looks like someone
> did write one a while back, but it needs work (I notice stop always
> returns success, which is bad): http://anders.com/cms/259
I already had it here, and it suffers from branding problems (references to
OpenSER instead of OpenSIPS, etc.).
It doesn't seem to work:
# ocf-tester -o ip=127.0.0.1 -n OpenSIPS /usr/lib/ocf/resource.d/anders.com/OpenSIPS
Beginning tests for /usr/lib/ocf/resource.d/anders.com/OpenSIPS...
* rc=7: Monitoring an active resource should return 0
* rc=7: Probing an active resource should return 0
* rc=7: Monitoring an active resource should return 0
* rc=7: Monitoring an active resource should return 0
Tests failed: /usr/lib/ocf/resource.d/anders.com/OpenSIPS failed 4 tests
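If I end up fixing that agent, I suppose the skeleton would be roughly the
following (untested sketch; the opensips options, config path and per-instance
pidfile layout are assumptions on my side). The point is that each instance
gets its own pidfile derived from the ip parameter, so monitor and stop only
ever see their own process:

#!/bin/sh
# Sketch of an OCF agent that tells two opensips instances apart by the
# "ip" parameter. Standard OCF exit codes, defined here to keep the
# sketch self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

CONF="${OCF_RESKEY_config:-/etc/opensips/opensips.cfg}"
PIDFILE="/var/run/opensips-${OCF_RESKEY_ip}.pid"

opensips_monitor() {
    # Report "running" only if the pid recorded for *this* instance is alive.
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

opensips_start() {
    opensips_monitor && return $OCF_SUCCESS
    opensips -P "$PIDFILE" -f "$CONF" || return $OCF_ERR_GENERIC
    # Pacemaker's start timeout bounds this wait.
    while ! opensips_monitor; do sleep 1; done
    return $OCF_SUCCESS
}

opensips_stop() {
    # Unlike the anders.com agent, only claim success once the process is gone.
    opensips_monitor || return $OCF_SUCCESS
    kill "$(cat "$PIDFILE")"
    while opensips_monitor; do sleep 1; done
    rm -f "$PIDFILE"
    return $OCF_SUCCESS
}

case "$1" in
    start)          opensips_start ;;
    stop)           opensips_stop ;;
    monitor|status) opensips_monitor ;;
    meta-data)      echo "(metadata omitted in this sketch)"; exit $OCF_SUCCESS ;;
    *)              exit $OCF_ERR_UNIMPLEMENTED ;;
esac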
> > Might it be related to some problem with the init.d script of opensips,
> > like an invalid result code, or something? I checked
> > http://refspecs.linuxbase.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
> > and didn't find any problem, but I might have missed some use case.
>
> You can follow this guide to verify the script's LSB compliance as far
> as it matters to Pacemaker:
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
I can't fully test it right now, but at least in one case the script doesn't
work well.
OpenSIPS requires the IP or it doesn't start. On a host without the HA IP,
the script returns 0 but the process dies and isn't running one second later.
That doesn't help.
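When I get a window to test it properly, I plan to run roughly the checks
from that appendix by hand (assuming the script lives at /etc/init.d/opensips):

# the subset of LSB behaviour Pacemaker actually relies on
/etc/init.d/opensips start;  echo "start: $?"        # expect 0
/etc/init.d/opensips status; echo "status: $?"       # expect 0 while running
/etc/init.d/opensips start;  echo "start again: $?"  # expect 0 (must be safe to repeat)
/etc/init.d/opensips stop;   echo "stop: $?"         # expect 0
/etc/init.d/opensips status; echo "status: $?"       # expect 3 once stopped
/etc/init.d/opensips stop;   echo "stop again: $?"   # expect 0 (must be safe to repeat)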
Nuno Pereira
G9Telecom