[Pacemaker] asymmetric clusters, remote nodes, and monitor operations

Lindsay Todd rltodd.ml1 at gmail.com
Wed Sep 11 13:44:53 EDT 2013


What I am seeing in the syslog are messages like:

Sep 11 13:19:52 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_monitor_20000:19398:stderr [ 2013/09/11_13:19:52 INFO: MySQL monitor succeeded ]
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh03: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh03: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh01: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh01: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh03: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh03: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh01: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh01: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: LogActions: Start p-mysql#011(db02)
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 48: monitor p-mysql_monitor_0 on cvmh03 (local)
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 46: monitor p-mysql_monitor_0 on cvmh02
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 44: monitor p-mysql_monitor_0 on cvmh01
Sep 11 13:20:08 cvmh03 mysql(p-mysql)[12476]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=907, rc=5, cib-update=701, confirmed=true) not installed
Sep 11 13:20:08 cvmh02 mysql(p-mysql)[17158]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
Sep 11 13:20:08 cvmh01 mysql(p-mysql)[5968]: ERROR: Setup problem: couldn't find command: /usr/bin/mysqld_safe
Sep 11 13:20:08 cvmh02 crmd[5081]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=332, rc=5, cib-update=164, confirmed=true) not installed
Sep 11 13:20:08 cvmh01 crmd[5169]:   notice: process_lrm_event: LRM operation p-mysql_monitor_0 (call=319, rc=5, cib-update=188, confirmed=true) not installed
Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 48 (p-mysql_monitor_0) on cvmh03 failed (target: 7 vs. rc: 5): Error
Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 46 (p-mysql_monitor_0) on cvmh02 failed (target: 7 vs. rc: 5): Error
Sep 11 13:20:08 cvmh03 crmd[4833]:  warning: status_from_rc: Action 44 (p-mysql_monitor_0) on cvmh01 failed (target: 7 vs. rc: 5): Error
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql-slurm on cvmh02: not installed (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:   notice: unpack_rsc_op: Preventing p-mysql-slurm from re-starting on cvmh02: operation monitor failed 'not installed' (5)
Sep 11 13:20:08 cvmh03 pengine[4832]:  warning: unpack_rsc_op_failure: Processing failed op monitor for p-mysql on cvmh02: not installed (5)
...
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 150: start p-mysql_start_0 on db02
Sep 11 13:20:08 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_start_0:19427:stderr [ 2013/09/11_13:20:08 INFO: MySQL already running ]
Sep 11 13:20:08 cvmh02 crmd[5081]:   notice: process_lrm_event: LRM operation p-mysql_start_0 (call=2600, rc=0, cib-update=165, confirmed=true) ok
Sep 11 13:20:08 cvmh03 crmd[4833]:   notice: te_rsc_command: Initiating action 151: monitor p-mysql_monitor_20000 on db02
Sep 11 13:20:09 db02 pacemaker_remoted[1736]:   notice: operation_finished: p-mysql_monitor_20000:19454:stderr [ 2013/09/11_13:20:09 INFO: MySQL monitor succeeded ]

So I guess these aren't "errors" but rather warnings, which is what we see in unpack_rsc_op_failure, and I do see that it makes OCF_NOT_INSTALLED a special case when the cluster is asymmetric -- after logging the warning.  Should that test move earlier in the function, and maybe return in that case?  Also, crm_mon reports errors:

Failed actions:
    p-mysql-slurm_monitor_0 on cvmh02 'not installed' (5): call=69, status=complete, last-rc-change='Tue Sep 10 15:52:57 2013', queued=31ms, exec=0ms
    s-ldap_monitor_0 on cvmh02 'not installed' (5): call=289, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
    p-mysql_monitor_0 on cvmh02 'not installed' (5): call=332, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=40ms, exec=0ms
    p-mysql-slurm_monitor_0 on cvmh03 'not installed' (5): call=325, status=complete, last-rc-change='Wed Sep  4 13:44:15 2013', queued=35ms, exec=0ms
    s-ldap_monitor_0 on cvmh03 'not installed' (5): call=869, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
    p-mysql_monitor_0 on cvmh03 'not installed' (5): call=907, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=36ms, exec=0ms
    p-mysql-slurm_monitor_0 on cvmh01 'not installed' (5): call=95, status=complete, last-rc-change='Tue Sep 10 15:48:15 2013', queued=95ms, exec=0ms
    fence-cvmh02_start_0 on (null) 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Tue Sep 10 15:49:38 2013', queued=0ms, exec=0ms
    fence-cvmh02_start_0 on cvmh01 'unknown error' (1): call=-1, status=Timed Out, last-rc-change='Tue Sep 10 15:49:38 2013', queued=0ms, exec=0ms
    s-ldap_monitor_0 on cvmh01 'not installed' (5): call=279, status=Not installed, last-rc-change='Tue Sep 10 16:15:19 2013', queued=0ms, exec=0ms
    p-mysql_monitor_0 on cvmh01 'not installed' (5): call=319, status=complete, last-rc-change='Wed Sep 11 13:20:08 2013', queued=42ms, exec=0ms

Almost all of these are instances of resources being probed on nodes that they shouldn't be running on and aren't installed on, so they aren't really errors.  (I assume the crm_report has captured the location rules, as well as confirmed that the symmetric-cluster property is false.)  The resources do also start up on the nodes they should run on.

Previously I'd noticed that LSB resources probed on nodes without the associated init script would fail; it looks like that is also reported as OCF_NOT_INSTALLED, so it is perhaps the same problem.
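The OCF wrapper RA I mentioned in my earlier message (quoted below) basically special-cases this in its monitor action.  A trimmed-down sketch of the idea, with a made-up "script" parameter name rather than the wrapper's real one:

    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    lsbwrap_monitor() {
        initscript="/etc/init.d/${OCF_RESKEY_script}"
        # If the wrapped init script isn't installed on this node, report
        # "not running" instead of "not installed", so a probe on a node
        # that was never meant to run the service isn't recorded as a
        # failed op.
        if [ ! -x "$initscript" ]; then
            ocf_log info "$initscript not present; treating as not running"
            return $OCF_NOT_RUNNING
        fi
        "$initscript" status >/dev/null 2>&1 && return $OCF_SUCCESS
        return $OCF_NOT_RUNNING
    }

That works for my own LSB wrappers, but I don't see a comparable way to do it for stock agents like ocf:heartbeat:mysql short of carrying a patched copy.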


On Wed, Sep 4, 2013 at 12:49 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

>
> On 04/09/2013, at 6:18 AM, Lindsay Todd <rltodd.ml1 at gmail.com> wrote:
>
> > We've been attempting to set up an asymmetric pacemaker cluster using
> remote cluster nodes, with pacemaker 1.1.10 (actually, building from git
> lately, currently at a4eb44f).  We use location constraints to enable
> resources to start on nodes they should start on, and rely on asymmetry to
> otherwise keep resources from starting.
>
> You set symmetric-cluster=false, or assumed that was the default?
>
> >
> > But we get many monitor operation failures.
> >
> > Resource monitor operations run on the physical real hosts, and
> frequently fail because not all the components are present on those hosts.
>  For instance, the mysql resource agent's monitor operation fails as "not
> installed", since, well, mysql isn't installed on those systems, so the
> validate operation, which most or every path through that agent runs,
> always fails.  I don't see failures on the remote nodes, even ones without
> mysql installed.
> >
> > Previously I'd noticed LSB resources had failed monitor operations on
> systems that didn't have the LSB init script installed.
> >
> > Presumably these monitor operations are happening to ensure the resource
> is NOT running where it should not be???
>
> Correct. Although with symmetric-cluster=false it shouldn't show up as an
> error.
> Logs? crm_mon output?
>
> >  There doesn't seem to be a way to set up location constraints to
> prevent this from happening, at least that I've found.  I wrote an OCF
> wrapper RA to help me with LSB init scripts, but not sure what to do about
> other RA's like mysql short of maintaining my own version, unless there is
> a way to tune where "monitor" runs.  Or more likely I'm missing something
> ...
> >
> > It would seem to me that a "not installed" failure, OCF_ERR_INSTALLED,
> would not really be an error on a node that shouldn't be running that
> resource agent anyway, and is probably a pretty good indication that it
> isn't running.
> >
> > /Lindsay