[ClusterLabs] Wait until resource is really ready before moving clusterip

Tue Jan 26 10:32:21 EST 2016

On 01/26/2016 05:06 AM, Joakim Hansson wrote:
> Thanks for the help guys.
> I ended up patching together my own RA from the Delay and Dummy RA's and
> using curl to request the header of solr's ping request handler on
> localhost, which made the resource start return a bit more dynamic.
> However, now I have another problem which I don't think is related to my RA.
> For some reason when failing over the nodes, the ClusterIP (vIP below)
> seems to avoid the node running the fencing agent:
> 
> pcs status
> 
> Online: [ node01 node02 ]
> OFFLINE: [ node03 ]
> 
> Full list of resources:
> 
>  VMWare-fence   (stonith:fence_vmware_soap):    Started node02
>  Clone Set: dlm-clone [dlm]
>      Started: [ node01 node02 ]
>      Stopped: [ node03 ]
>  Clone Set: GFS2-clone [GFS2] (unique)
>      GFS2:0     (ocf::heartbeat:Filesystem):    Started node01
>      GFS2:1     (ocf::heartbeat:Filesystem):    Stopped
>      GFS2:2     (ocf::heartbeat:Filesystem):    Started node02
>  Clone Set: Tomcat-clone [Tomcat]
>      Started: [ node02 ]
>      Stopped: [ node01 node03 ]
>  vIP    (ocf::heartbeat:IPaddr2): Stopped
> 
> Notice how the tomcat-clone is started on node02 but the vIP remains
> stopped.
> If I start the fence agent on any of the other nodes the same thing happens
> (ie, vIP avoiding the fencing node)
> Any idea why this happens?
> 
> Output of 'pcs config show':
> https://github.com/apepojken/pacemaker/blob/master/Config

I notice you have multiple ordering constraints but only one colocation
constraint. That means, for example, that tomcat-clone must be started
after GFS2, but it does not have to be on the same node. I'm pretty sure
you want colocation constraints as well, to make them start on the same
node.
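
As a rough sketch of what that could look like: the resource names below
are taken from your pcs status output, but the dependency chain
(dlm, then GFS2, then Tomcat, then vIP) is my assumption about what you
intend, and the run() helper is just a hypothetical dry-run wrapper that
prints the commands unless APPLY=1 is set. Skip whichever colocation you
already have configured.

```shell
#!/bin/sh
# Sketch only. Resource names come from the "pcs status" output above;
# the dependency chain (dlm -> GFS2 -> Tomcat -> vIP) is an assumption.

# Dry run by default: print each command instead of executing it.
# Set APPLY=1 to really run the commands on a cluster node.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

add_colocations() {
    # Each resource must run on a node where the resource it depends on runs.
    run pcs constraint colocation add GFS2-clone with dlm-clone INFINITY
    run pcs constraint colocation add Tomcat-clone with GFS2-clone INFINITY
    run pcs constraint colocation add vIP with Tomcat-clone INFINITY
}

add_colocations
```

Review the printed commands against "pcs constraint show" before running
them for real with APPLY=1.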

FYI, a group is like a shorthand for ordering and colocation constraints
for multiple resources that need to be kept together and started/stopped
in order.

I also see you have globally-unique=true on GFS2-clone. You probably do
not want this. globally-unique=false (the default) is more common; it
means that all clone instances are interchangeable, and it is usually
combined with clone-node-max=1, because only one instance is ever
needed on any one node. globally-unique=true means that each clone
instance handles a different subset of requests; it is usually
combined with clone-node-max > 1 so that multiple clone instances can
run on a single node if needed.
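
Something along these lines should switch the clone back to the usual
anonymous setup. This is a sketch, not tested against your cluster:
GFS2-clone is the name from your status output, and run() is a
hypothetical dry-run helper that only prints the command unless APPLY=1
is set.

```shell
#!/bin/sh
# Sketch only: GFS2-clone is the clone name from the "pcs status" output above.

# Dry run by default: print the command instead of executing it.
# Set APPLY=1 to really run it on a cluster node.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

make_clone_anonymous() {
    # globally-unique=false makes all instances interchangeable;
    # clone-node-max=1 allows at most one instance per node.
    run pcs resource meta GFS2-clone globally-unique=false clone-node-max=1
}

make_clone_anonymous
```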

I don't see from that alone why vIP wouldn't start, but take care of the
above issues first, and see what the behavior is then.

> Thanks again!
> 
> 2016-01-20 1:14 GMT+01:00 Jan Pokorný <jpokorny at redhat.com>:
> 
>> On 14/01/16 14:46 +0100, Kristoffer Grönlund wrote:
>>> Joakim Hansson <joakim.hansson87 at gmail.com> writes:
>>>> When adding the Delay RA it starts throwing a bunch of errors and the
>>>> cluster starts fencing the nodes one by one.
>>>>
>>>> The error's I get with "pcs status":
>>>>
>>>> Failed Actions:
>>>> * Delay_monitor_0 on node03 'unknown error' (1): call=51, status=Timed Out,
>>>> exit
>>>> reason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>>>> * Delay_monitor_0 on node01 'unknown error' (1): call=53, status=Timed Out,
>>>> exit
>>>> reason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30002ms
>>>> * Delay_monitor_0 on node02 'unknown error' (1): call=51, status=Timed Out,
>>>> exit
>>>> reason='none',
>>>>     last-rc-change='Thu Jan 14 13:30:14 2016', queued=0ms, exec=30006ms
>>>>
>>>> and in the /var/log/pacemaker.log:
>>>>
>>>>
>>>> https://github.com/apepojken/pacemaker-errors/blob/master/ocf:heartbeat:Delay
>>>>
>>>> I added the Delay RA with:
>>>>
>>>> pcs resource create Delay ocf:heartbeat:Delay \
>>>> startdelay="120" meta target-role=Started \
>>>> op start timeout="180"
>>>>
>>>> and my config looks like this:
>>>>
>>>> https://github.com/apepojken/pacemaker/blob/master/Config
>>>>
>>>> Am I missing something obvious here?
>>>
>>> It looks like you have a monitor operation configured for the Delay
>>> resource, but you haven't set the mondelay parameter. But either way,
>>> there is no reason to monitor the Delay resource, so remove that. Same
>>> thing for the stop operation, just remove it.
>>>
>>> I'm guessing pcs adds these by default.
>>
>> It's true that pcs adds equivalent of "op monitor interval=60s"
>> as an unconditional fallback when defining a new resource.
>> Other operations are driven solely by explicit values or by
>> defaults for particular resource, and this can be turned off
>> via "--no-default-ops" option to pcs.
>>
>> FWIW, this could be a way to have monitor explicitly deactivated:
>>
>>     pcs resource create <name> <res> ... op monitor interval=0s
>>
>> --
>> Jan (Poki)