[Pacemaker] migration-threshold causing unnecessary restart of underlying resources

Cnut Jansen work at cnutjansen.eu
Sat Aug 14 00:26:58 EDT 2010


  Hi,

and first of all thanks for answering so far.


On 12.08.2010 18:46, Dejan Muhamedagic wrote:
>
> The migration-threshold shouldn't in any way influence resources
> which don't depend on the resource which fails over. Couldn't
> reproduce it here with our example RAs.
Well, to establish beyond doubt that something is wrong there - whatever 
it is, a simple misconfiguration or a possible bug - I now did a crm 
configure erase, completely restarted both nodes, and then set up this 
new, very simple, Dummy-based configuration:
v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v
node alpha \
         attributes standby="off"
node beta \
         attributes standby="off"
primitive dlm ocf:heartbeat:Dummy
primitive drbd ocf:heartbeat:Dummy
primitive mount ocf:heartbeat:Dummy
primitive mysql ocf:heartbeat:Dummy \
         meta migration-threshold="3" failure-timeout="40"
primitive o2cb ocf:heartbeat:Dummy
location cli-prefer-mount mount \
         rule $id="cli-prefer-rule-mount" inf: #uname eq alpha
colocation colocMysql inf: mysql mount
order orderMysql inf: mount mysql
property $id="cib-bootstrap-options" \
         dc-version="1.0.9-unknown" \
         cluster-infrastructure="openais" \
         expected-quorum-votes="2" \
         stonith-enabled="false" \
         cluster-recheck-interval="150" \
         last-lrm-refresh="1281751924"
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
...and then, picking on the resource "mysql", got this:

1) alpha: FC(mysql)=0, crm_resource -F -r mysql -H alpha
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_asyncmon_0 (call=48, rc=1, cib-update=563, 
confirmed=false) unknown error
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_stop_0 (call=49, rc=0, cib-update=565, confirmed=true) ok
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_start_0 (call=50, rc=0, cib-update=567, confirmed=true) ok

2) alpha: FC(mysql)=1, crm_resource -F -r mysql -H alpha
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_asyncmon_0 (call=51, rc=1, cib-update=568, 
confirmed=false) unknown error
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_stop_0 (call=52, rc=0, cib-update=572, confirmed=true) ok
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_start_0 (call=53, rc=0, cib-update=573, confirmed=true) ok

3) alpha: FC(mysql)=2, crm_resource -F -r mysql -H alpha
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_asyncmon_0 (call=54, rc=1, cib-update=574, 
confirmed=false) unknown error
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_stop_0 (call=55, rc=0, cib-update=576, confirmed=true) ok
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mount_stop_0 (call=56, rc=0, cib-update=578, confirmed=true) ok
beta: FC(mysql)=3
Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation 
mount_start_0 (call=36, rc=0, cib-update=92, confirmed=true) ok
Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation 
mysql_start_0 (call=37, rc=0, cib-update=93, confirmed=true) ok
Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation 
mysql_stop_0 (call=38, rc=0, cib-update=94, confirmed=true) ok
Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation 
mount_stop_0 (call=39, rc=0, cib-update=95, confirmed=true) ok
alpha: FC(mysql)=3
Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mount_start_0 (call=57, rc=0, cib-update=580, confirmed=true) ok
Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_start_0 (call=58, rc=0, cib-update=581, confirmed=true) ok
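
(Side note on the notation: "FC(mysql)=N" is mysql's fail-count at that 
point. The failures were injected with crm_resource -F exactly as written 
in the step headers, and I queried the counter roughly like this - the crm 
shell subcommand is from memory, so take the exact syntax with a grain of 
salt:

         crm_resource -F -r mysql -H alpha          # inject a fake monitor failure for mysql on alpha
         crm resource failcount mysql show alpha    # query the resulting fail-count
)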


So it seems that - for whatever reason - those constrained resources are 
considered and treated just as if they were in a resource group, because 
they move together to wherever they can all run, instead of the "eat or 
die" behaviour of the dependent resource (mysql) towards the underlying 
resource (mount) that I had expected from the constraints as I set 
them... or shouldn't I have?! o_O
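
(To make the comparison explicit: the observed behaviour is roughly what I 
would have expected from something like

         group grpMysql mount mysql

- grpMysql being just a made-up name here - whereas from the inf: 
colocation plus order pair I had only expected mysql to be bound to mount, 
not mount to be dragged along once mysql runs out of migration-threshold.)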


And - concerning the failure-timeout - quite a while later, without 
having reset mysql's failure counter or having done anything else in 
the meantime:

4) alpha: FC(mysql)=3, crm_resource -F -r mysql -H alpha
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_asyncmon_0 (call=59, rc=1, cib-update=592, 
confirmed=false) unknown error
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_stop_0 (call=60, rc=0, cib-update=596, confirmed=true) ok
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mount_stop_0 (call=61, rc=0, cib-update=597, confirmed=true) ok
beta: FC(mysql)=0
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation 
mount_start_0 (call=40, rc=0, cib-update=96, confirmed=true) ok
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation 
mysql_start_0 (call=41, rc=0, cib-update=97, confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation 
mysql_stop_0 (call=42, rc=0, cib-update=98, confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation 
mount_stop_0 (call=43, rc=0, cib-update=99, confirmed=true) ok
alpha: FC(mysql)=4
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mount_start_0 (call=62, rc=0, cib-update=599, confirmed=true) ok
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM 
operation mysql_start_0 (call=63, rc=0, cib-update=600, confirmed=true) ok
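
(Just to be explicit: there was no manual cleanup between 3) and 4). A 
manual reset would have been something along the lines of

         crm_resource -C -r mysql -H alpha    # clean up mysql's status and fail-count on alpha

- assuming I remember the cleanup syntax correctly - but the whole point 
is that I did not run it, and the fail-count obviously had not been reset 
by the failure-timeout either, since alpha still started at FC(mysql)=3.)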

> BTW, what's the point of cloneMountMysql? If it can run only
> where drbd is master, then it can run on one node only:
>
> colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
> order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
It's a dual-primary DRBD configuration, so when everything is OK (-; 
there are actually 2 masters of each DRBD multi-state resource... even 
though I admit that the dual primary (respectively master) for 
msDrbdMysql is currently quite redundant, since in the current cluster 
configuration there's only one primitive MySQL resource and thus no 
strict need for MySQL's data dir to be mounted on both nodes all the time.
But since it's not harmful to have it mounted on the other node too, 
since msDrbdOpencms and msDrbdShared do need to be mounted on both nodes, 
and since I put the complete installation and configuration of the 
cluster into flexibly configurable shell scripts, it's easier - i.e. less 
typing - to just put the configuration of all DRBD and mount resources 
into one common loop. (-;
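
(For illustration, the dual-primary setup boils down to something like 
this - the primitive names and the drbd_resource value are placeholders, 
only the ms meta attributes and the constraints quoted above matter:

         primitive drbdMysql ocf:linbit:drbd params drbd_resource="mysql"
         ms msDrbdMysql drbdMysql \
                  meta master-max="2" clone-max="2" notify="true"
         clone cloneMountMysql mountMysql
         colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
         order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start

With master-max="2" both nodes can be DRBD primary, so the mount clone can 
indeed run on both nodes.)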

>> d) I also have the impression that fail-counters don't get reset
>> after their failure-timeout, because when migration-threshold=3 is
>> set, upon every(!) following picking-on those issues occur, even
>> when I've waited for nearly 5 minutes (with failure-timeout=90)
>> without any touching the cluster
> That seems to be a bug though I couldn't reproduce it with a
> simple configuration.
I also just tested this once again: it seems that failure-timeout only 
sets the scores back from -inf to around 0 (wherever they should normally 
be), allowing the resources to return to the node. I tested this by 
setting a location constraint for the underlying resource (see 
configuration): after the failure-timeout has expired, on the next 
cluster-recheck (and only then!) the underlying resource and its 
dependents return to the underlying resource's preferred location, as 
you can see in the logs above.
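
(I was watching the allocation scores for that; if I remember the 
switches correctly, something like

         ptest -Ls    # show allocation scores computed from the live CIB

shows them: mysql's score on alpha goes to -INFINITY once 
migration-threshold is reached, and after the failure-timeout it is back 
around 0 - but the resources only actually move back at the next 
cluster-recheck.)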
