[ClusterLabs] service flap as nodes join and leave

Adam Spiers aspiers at suse.com
Thu Apr 14 11:35:24 EDT 2016


Ken Gaillot <kgaillot at redhat.com> wrote:
> On 04/14/2016 09:33 AM, Christopher Harvey wrote:
> > MsgBB-Active is a dummy resource that simply returns OCF_SUCCESS on
> > every operation and logs to a file.
> 
> That's a common mistake, and will confuse the cluster. The cluster
> checks the status of resources both where they're supposed to be running
> and where they're not. If status always returns success, the cluster
> won't try to start it where it should, and will continuously try to
> stop it elsewhere, because it thinks it's already running everywhere.
> 
> It's essential that an RA distinguish between running
> (OCF_SUCCESS/OCF_RUNNING_MASTER), cleanly not running (OCF_NOT_RUNNING),
> and unknown/failed (OCF_ERR_*/OCF_FAILED_MASTER).
> 
> See pacemaker's Dummy agent as an example/template:
> 
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/resources/Dummy
> 
> It touches a temporary file to know whether it is "running" or not.

Yes, I very recently discovered we had made a similar mistake which
was confusing Pacemaker into thinking a pseudo-resource was running
everywhere, whereas we actually only wanted it running active/passive.
This was the fix:

  https://review.openstack.org/#/c/291286/
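
For anyone hitting the same problem, here is a minimal sketch of the
state-file approach Ken describes. It is not the code from the review
above: the pseudo_* function names and the STATE_FILE path are made up
for illustration, and a real agent would normally source ocf-shellfuncs
rather than hard-coding the return codes, and keep its state file under
the HA_RSCTMP directory.

```shell
#!/bin/sh
# Sketch only: pseudo_* names and STATE_FILE are hypothetical.

# Standard OCF exit code values, hard-coded here so the sketch is
# self-contained.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

# Hypothetical state file marking the pseudo-resource as "running"
# on this node.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/pseudo-demo.state}"

pseudo_monitor() {
    # Report "running" only if start was called here and stop was not.
    if [ -f "$STATE_FILE" ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

pseudo_start() {
    # Record that the resource is now "running" on this node.
    touch "$STATE_FILE" && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

pseudo_stop() {
    # Stopping an already-stopped resource must still succeed.
    rm -f "$STATE_FILE"
    return $OCF_SUCCESS
}

pseudo_dispatch() {
    case "$1" in
        start)          pseudo_start ;;
        stop)           pseudo_stop ;;
        monitor|status) pseudo_monitor ;;
        *)              return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

The point is that monitor returns OCF_NOT_RUNNING (7) until start has
run on that node, so the cluster no longer believes the resource is
active everywhere.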

> ocf-shellfuncs has a ha_pseudo_resource() function that does the same
> thing. See the ocf:heartbeat:Delay agent for example usage.

Interesting, thanks. I didn't know that.



