[ClusterLabs] Clone Issue

Vladislav Bogdanov bubble at hoster-ok.com
Mon Feb 15 12:50:40 UTC 2016


15.02.2016 15:18, Frank D. Engel, Jr. wrote:
> Good tip on the status url - I went ahead and made that update.

This mostly depends on how the resolver is configured on a given node.
Look at your /etc/hosts - I'd bet you have two localhost records there - 
one for IPv6 and one for IPv4. On the other hand, your apache is 
probably configured to listen only on IPv4 addresses, so the resource 
agent cannot connect to the IPv6 loopback.
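
For instance, something like this on each node would show both sides of 
that (assuming the stock CentOS 7 httpd.conf path; adjust if yours differs):

  grep localhost /etc/hosts
  # typically both 127.0.0.1 and ::1 map to localhost
  grep -i '^Listen' /etc/httpd/conf/httpd.conf
  ss -tlnp | grep httpd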

>
>
> I'm not sure that I agree with the IP relying on Apache, though.
>
> There could be multiple services hanging off the same IP addresses; if

My experience shows that in most cases this should be avoided as much as 
possible. Do you really want the IP to reside on a node where the service 
fails to start, thus losing client connections?
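
If you do decide to flip the dependency, a rough pcs sketch would be 
something like the following (the constraint id is whatever 
"pcs constraint --full" reports on your cluster; the id below is only a 
guess at the auto-generated name):

  pcs constraint --full
  pcs constraint remove colocation-WebSite-clone-ClusterIP-clone-INFINITY
  pcs constraint colocation add ClusterIP-clone with WebSite-clone INFINITY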

> the IPs depend on one of those services, then would not stopping that
> one service for maintenance also impact all of the others by stopping
> the IP resource, when they otherwise could have continued to function?
>
>
>
> On 2/15/2016 01:47, Vladislav Bogdanov wrote:
>> "Frank D. Engel, Jr." <fde101 at fjrhome.net> wrote:
>>> I tried working with a few of these suggestions but the issue doesn't
>>> seem to be there.  All of them were configured the same way for the
>>> status page.
>> Try to replace localhost with 127.0.0.1 in the statusurl param.
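
For reference, with the resource configuration shown further down that 
would be roughly:

  pcs resource update WebSite statusurl=http://127.0.0.1/server-status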
>>
>>> After rebooting all of the nodes, two of the ClusterIP resources wound
>>> up on the same node, and "relocate run ClusterIP-clone" would not
>> Unfortunately, with the default placement strategy, the cluster spreads
>> resources equally over all the nodes. You can play with utilization
>> placement, assigning some attribute on all nodes to the number of
>> globally-unique clone instances, and adding a utilization param
>> that_attribute=1 to ClusterIP.
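
A rough sketch of that idea with pcs (the attribute name "ipcap" is made 
up, a capacity of 1 per node simply caps each node at one ClusterIP 
instance, and it needs a pcs new enough to have the utilization 
subcommands):

  pcs property set placement-strategy=utilization
  pcs node utilization lincl0-hb ipcap=1
  pcs node utilization lincl1-hb ipcap=1
  pcs node utilization lincl2-hb ipcap=1
  pcs resource utilization ClusterIP ipcap=1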
>>
>> I raised this issue quite a long time ago, but it is not solved yet.
>>
>> Lastly, you probably want to change your ClusterIP-related constraints,
>> so its instances are allocated together with the running apache
>> instance, not vice versa.
>>
>>
>> Best,
>> Vladislav
>>
>>> resolve this.  I ended up taking the node with the duplicate out of the
>>> cluster (pcs cluster stop) and then adding it back in - this allowed
>>> that to run, and for some reason, the web site is on all three nodes
>>> now.
>>>
>>> So far the cluster behavior seems a bit flaky; maybe it is something
>>> odd
>>> in the configuration, but while I can understand how two of the IP
>>> resources would wind up on the same node initially, I'm not sure why I
>>> would need to take a node out of the cluster like that to fix it?
>>>
>>> In some cases I've needed to reboot the nodes multiple times to get the
>>> cluster to start behaving again after reboots of nodes for other
>>> reasons; rebooting one of the three nodes sometimes causes the
>>> cluster-data-clone (file system) to restart or even just be completely
>>> lost on all of the nodes, and I've had to reboot a few times to get it
>>> back.  I could understand that with two nodes down (and it should
>>> effectively take the filesystem down in that case), but with just one
>>> going down that seems to be a problem.
>>>
>>> Still experimenting and exploring.
>>>
>>>
>>> Thank you!
>>>
>>>
>>>
>>> On 2/14/2016 10:23, Ken Gaillot wrote:
>>>> On 02/13/2016 08:09 PM, Frank D. Engel, Jr. wrote:
>>>>> Hi,
>>>>>
>>>>> I'm new to the software and to the list - I just started
>>>>> experimenting with trying to get a cluster working using CentOS 7 and
>>>>> the pcs utility, and I've made some progress, but I can't quite figure
>>>>> out why I'm seeing this one behavior - hoping someone can help; it
>>>>> might be something simple I haven't picked up on yet.
>>>>>
>>>>> I have three nodes configured (running under VirtualBox) with shared
>>>>> storage using GFS2 - that much seems to be working ok.
>>>>>
>>>>> I have a service called "WebSite" representing the Apache
>>>>> configuration, and I cloned that to create "WebSite-clone", which I
>>>>> would expect to run instances of on all three nodes.
>>>>>
>>>>> However, if I leave "globally-unique" off, it will only run on one
>>>>> node, whereas if I turn it on, it will run on two, but never on all
>>>>> three.  I've tried a number of things to get this working.  I did
>>>>> verify that I can manually start and stop Apache on all three nodes
>>>>> and it works on any of them that way.
>>>> You don't want globally-unique=true; that's for cases where you want
>>>> to be able to run multiple instances of the service on the same
>>>> machine if necessary, because each clone handles different requests.
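
Switching it back is just the reverse of the update shown further down:

  pcs resource update WebSite-clone globally-unique=false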
>>>>
>>>>> Currently my status looks like this (with globally-unique set to
>>>>> false; "cluster-data" is my GFS2 filesystem):
>>>>>
>>>>> Cluster name: lincl
>>>>> Last updated: Sat Feb 13 20:58:26 2016        Last change: Sat Feb 13 20:45:08 2016 by root via crm_resource on lincl2-hb
>>>>> Stack: corosync
>>>>> Current DC: lincl2-hb (version 1.1.13-10.el7-44eb2dd) - partition with quorum
>>>>> 3 nodes and 13 resources configured
>>>>>
>>>>> Online: [ lincl0-hb lincl1-hb lincl2-hb ]
>>>>>
>>>>> Full list of resources:
>>>>>
>>>>>    kdump    (stonith:fence_kdump):    Started lincl0-hb
>>>>>    Clone Set: dlm-clone [dlm]
>>>>>        Started: [ lincl0-hb lincl1-hb lincl2-hb ]
>>>>>    Master/Slave Set: cluster-data-clone [cluster-data]
>>>>>        Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
>>>>>    Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>>>        ClusterIP:0    (ocf::heartbeat:IPaddr2):    Started lincl2-hb
>>>>>        ClusterIP:1    (ocf::heartbeat:IPaddr2):    Started lincl0-hb
>>>>>        ClusterIP:2    (ocf::heartbeat:IPaddr2):    Started lincl1-hb
>>>>>    Clone Set: WebSite-clone [WebSite]
>>>>>        Started: [ lincl0-hb ]
>>>>>        Stopped: [ lincl1-hb lincl2-hb ]
>>>> The above says that the cluster successfully started a WebSite
>>>> instance on lincl0-hb, but it is for some reason prevented from doing
>>>> so on the other two nodes.
>>>>
>>>>> Failed Actions:
>>>>> * WebSite:0_start_0 on lincl2-hb 'unknown error' (1): call=142,
>>>>> status=Timed Out, exitreason='Failed to access httpd status page.',
>>>>>       last-rc-change='Sat Feb 13 19:55:45 2016', queued=0ms, exec=120004ms
>>>> This gives a good bit of info:
>>>>
>>>> * The "start" action on the "WebSite" resource failed on node
>>>>   lincl2-hb.
>>>> * The failure was a timeout. The start action did not return in the
>>>>   configured (or default) time.
>>>> * The reason given by the apache resource agent was "Failed to access
>>>>   httpd status page".
>>>>
>>>>> * WebSite:2_start_0 on lincl2-hb 'unknown error' (1): call=130,
>>>>> status=Timed Out, exitreason='none',
>>>>>       last-rc-change='Sat Feb 13 19:33:49 2016', queued=0ms, exec=40003ms
>>>>> * WebSite:1_monitor_60000 on lincl0-hb 'unknown error' (1): call=101,
>>>>> status=complete, exitreason='Failed to access httpd status page.',
>>>>>       last-rc-change='Sat Feb 13 19:53:53 2016', queued=0ms, exec=0ms
>>>>> * WebSite:0_monitor_60000 on lincl0-hb 'not running' (7): call=77,
>>>>> status=complete, exitreason='none',
>>>>>       last-rc-change='Sat Feb 13 19:34:48 2016', queued=0ms, exec=0ms
>>>>> * WebSite:2_start_0 on lincl1-hb 'unknown error' (1): call=41,
>>>>> status=Timed Out, exitreason='none',
>>>>>       last-rc-change='Sat Feb 13 19:53:41 2016', queued=1ms, exec=120004ms
>>>>>
>>>>> PCSD Status:
>>>>>     lincl0-hb: Online
>>>>>     lincl1-hb: Online
>>>>>     lincl2-hb: Online
>>>>>
>>>>> Daemon Status:
>>>>>     corosync: active/enabled
>>>>>     pacemaker: active/enabled
>>>>>     pcsd: active/enabled
>>>>>
>>>>>
>>>>>
>>>>> I'm not sure how to further troubleshoot those "Failed Actions" or
>>>>> how to clear them from the display?
>>>> Pacemaker relies on what the resource agent tells it, so when the
>>>> resource agent fails, you'll have to look at that rather than
>>>> pacemaker itself. Often, agents will print more detailed messages to
>>>> the system log. Otherwise, just verifying the resource configuration
>>>> and so forth is a good idea.
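
On CentOS 7 the agent's messages usually end up in the journal and 
/var/log/messages, so something along these lines (resource name taken 
from this thread) can dig them out:

  journalctl --since "2016-02-13 19:30" | grep -iE 'apache|WebSite'
  grep -iE 'apache|WebSite' /var/log/messages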
>>>>
>>>> In this case, the big hint is the status page. The apache resource
>>>> agent relies on the /server-status URL to verify that apache is running.
>>>> Double-check that apache's configuration is identical on all nodes,
>>>> particularly the /server-status configuration.
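
For comparison, a typical status snippet for apache 2.4 on CentOS 7 (the 
file name /etc/httpd/conf.d/status.conf is just a convention) looks 
something like:

  <Location "/server-status">
      SetHandler server-status
      Require local
  </Location>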
>>>>
>>>> Once you've addressed the root cause of a failed action, you can
>>>> clear it from the display with "pcs resource cleanup" -- see "man pcs"
>>>> for the options it takes.
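
For example, to clear the failures shown above once the status page 
issue is fixed:

  pcs resource cleanup WebSite-clone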
>>>>
>>>> Another good idea is (with the cluster stopped) to ensure you can
>>>> start apache manually on each node and see the server-status URL from that
>>>> node (using curl or wget or whatever).
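
Roughly, on each node in turn:

  systemctl start httpd
  curl http://localhost/server-status
  curl http://127.0.0.1/server-status
  systemctl stop httpd

Comparing the localhost and 127.0.0.1 results is also a quick way to spot 
the IPv6-only-localhost issue mentioned at the top of this mail.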
>>>>
>>>>> Configuration of the WebSite-clone looks like:
>>>>>
>>>>> [root@lincl2 /]# pcs resource show WebSite-clone
>>>>>    Clone: WebSite-clone
>>>>>     Meta Attrs: globally-unique=false clone-node-max=1 clone-max=3 interleave=true
>>>>>     Resource: WebSite (class=ocf provider=heartbeat type=apache)
>>>>>      Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
>>>>>      Operations: start interval=0s timeout=120s (WebSite-start-interval-0s)
>>>>>                  stop interval=0s timeout=60s (WebSite-stop-interval-0s)
>>>>>                  monitor interval=1min (WebSite-monitor-interval-1min)
>>>>>
>>>>>
>>>>> Now I change globally-unique to true, and this happens:
>>>>>
>>>>> [root@lincl2 /]# pcs resource update WebSite-clone globally-unique=true
>>>>> [root@lincl2 /]# pcs resource
>>>>>    Clone Set: dlm-clone [dlm]
>>>>>        Started: [ lincl0-hb lincl1-hb lincl2-hb ]
>>>>>    Master/Slave Set: cluster-data-clone [cluster-data]
>>>>>        Slaves: [ lincl0-hb lincl1-hb lincl2-hb ]
>>>>>    Clone Set: ClusterIP-clone [ClusterIP] (unique)
>>>>>        ClusterIP:0    (ocf::heartbeat:IPaddr2):    Started lincl2-hb
>>>>>        ClusterIP:1    (ocf::heartbeat:IPaddr2):    Started lincl0-hb
>>>>>        ClusterIP:2    (ocf::heartbeat:IPaddr2):    Started lincl1-hb
>>>>>    Clone Set: WebSite-clone [WebSite] (unique)
>>>>>        WebSite:0    (ocf::heartbeat:apache):    Started lincl0-hb
>>>>>        WebSite:1    (ocf::heartbeat:apache):    Started lincl1-hb
>>>>>        WebSite:2    (ocf::heartbeat:apache):    Stopped
>>>>>
>>>>>
>>>>> Constraints are set up as follows:
>>>>>
>>>>> [root@lincl2 /]# pcs constraint
>>>>> Location Constraints:
>>>>> Ordering Constraints:
>>>>>     start dlm-clone then start cluster-data-clone (kind:Mandatory)
>>>>>     start ClusterIP-clone then start WebSite-clone (kind:Mandatory)
>>>>>     start cluster-data-clone then start WebSite-clone (kind:Mandatory)
>>>>> Colocation Constraints:
>>>>>     cluster-data-clone with dlm-clone (score:INFINITY)
>>>>>     WebSite-clone with ClusterIP-clone (score:INFINITY)
>>>>>     WebSite-clone with cluster-data-clone (score:INFINITY)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> As far as I can tell, there is no activity in the Apache log files
>>>>> from pcs trying to start it and it failing or taking too long - it
>>>>> seems that it never gets far enough for Apache itself to be trying
>>>>> to start.
>>>>>
>>>>>
>>>>> Can someone give me ideas on how to further troubleshoot this?
>>>>> Ideally I'd like it running one instance on each available node.




