[ClusterLabs] Wait until resource is really ready before moving clusterip

Tue Jan 26 11:06:04 UTC 2016

Thanks for the help, guys.
I ended up patching together my own RA from the Delay and Dummy RAs,
using curl to request the header of Solr's ping request handler on
localhost, so the resource start now returns dynamically, once Solr
responds, rather than after a fixed delay.
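
Roughly, the start-time wait looks like the sketch below. This is a minimal
illustration and not the actual RA: the ping URL (port and core name), the
function names probe and wait_until_ready, and the retry numbers are all
assumptions made up for the example.

```shell
#!/bin/sh
# Sketch only: poll Solr's ping handler until it answers or we give up.
# PING_URL is a hypothetical default; the port and CORE name are assumptions.
PING_URL="${PING_URL:-http://localhost:8983/solr/CORE/admin/ping}"

# probe URL
# Sends a HEAD request; succeeds only if the server answers with an
# HTTP status below 400 within 5 seconds.
probe() {
    curl --silent --head --fail --max-time 5 --output /dev/null "$1"
}

# wait_until_ready URL MAX_TRIES SLEEP_SECS
# Returns 0 as soon as probe succeeds, 1 if every attempt fails.
wait_until_ready() {
    url=$1
    tries=$2
    pause=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        if probe "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$pause"
    done
    return 1
}
```

In a start action this would be called as something like
wait_until_ready "$PING_URL" 60 2, with the start timeout of the resource
set comfortably above tries times pause.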
However, now I have another problem, which I don't think is related to my RA.
For some reason, when failing over the nodes, the ClusterIP (vIP below)
seems to avoid the node running the fencing agent:

pcs status

Online: [ node01 node02 ]
OFFLINE: [ node03 ]

Full list of resources:

 VMWare-fence   (stonith:fence_vmware_soap):    Started node02
 Clone Set: dlm-clone [dlm]
     Started: [ node01 node02 ]
     Stopped: [ node03 ]
 Clone Set: GFS2-clone [GFS2] (unique)
     GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
     GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
     GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
 Clone Set: Tomcat-clone [Tomcat]
     Started: [ node02 ]
     Stopped: [ node01 node03 ]
 vIP    (ocf::heartbeat:IPaddr2): Stopped

Notice how Tomcat-clone is started on node02 but the vIP remains
stopped.
If I start the fence agent on either of the other nodes, the same thing
happens (i.e., the vIP avoids the fencing node).
Any idea why this happens?

Output of 'pcs config show':
https://github.com/apepojken/pacemaker/blob/master/Config

Thanks again!

2016-01-20 1:14 GMT+01:00 Jan Pokorný <jpokorny at redhat.com>:

> On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
> > Joakim Hansson <joakim.hansson87 at gmail.com> writes:
> >> When adding the Delay RA it starts throwing a bunch of errors and the
> >> cluster starts fencing the nodes one by one.
> >>
> >> The errors I get with "pcs status":
> >>
> >> Failed Actions:
> >> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> >> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
> >> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
> >>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
> >>
> >> and in the /var/log/pacemaker.log:
> >>
> >> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
> >>
> >> I added the Delay RA with:
> >>
> >> pcs resource create Delay ocf:heartbeat:Delay \
> >> startdelay="120" meta target-role=Started \
> >> op start timeout="180"
> >>
> >> and my config looks like this:
> >>
> >> https://github.com/apepojken/pacemaker/blob/master/Config
> >>
> >> Am I missing something obvious here?
> >
> > It looks like you have a monitor operation configured for the Delay
> > resource, but you haven't set the mondelay parameter. But either way,
> > there is no reason to monitor the Delay resource, so remove that. Same
> > thing for the stop operation, just remove it.
> >
> > I'm guessing pcs adds these by default.
>
> It's true that pcs adds the equivalent of "op monitor interval=60s"
> as an unconditional fallback when defining a new resource.
> Other operations are driven solely by explicit values or by
> defaults for the particular resource, and this can be turned off
> via the "--no-default-ops" option to pcs.
>
> FWIW, this could be a way to have monitor explicitly deactivated:
>
>     pcs resource create <name> <res> ... op monitor interval=0s
>
> --
> Jan (Poki)
>