[Pacemaker] postponed execution of operations

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Sep 16 11:25:07 EDT 2011


Hi,

On Thu, Sep 08, 2011 at 08:04:03PM +0200, Thilo Uttendorfer wrote:
> Hello,
> 
> I have sometimes failed resoucres in a heartbeat/pacemaker/drbd cluster. It seems that it happens randomly. The log 
> file for such a resource looks like this (I did a grep on "34401" in this example):
> 
> Sep  7 13:28:13 server-v2 lrmd: [25059]: debug: on_msg_perform_op: add an operation operation monitor[34401] on 
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] 
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  to the operation list.
> Sep  7 13:28:13 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000 
> (call=34401, rc=0, cib-update=73376, confirmed=false) complete ok
> Sep  7 13:41:56 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> [this message repeats every second...]
> Sep  7 13:42:03 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:19:30 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> [this message repeats every second...]
> Sep  7 14:19:44 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:19:45 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on 
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] 
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  stayed in operation list for 15010 ms (longer than 10000 ms)
> Sep  7 14:20:05 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> [this message repeats every second...]
> Sep  7 14:20:34 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:20:35 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on 
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] 
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  stayed in operation list for 30010 ms (longer than 10000 ms)
> Sep  7 14:21:15 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:21:16 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:21:17 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation 
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] 
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  by 1000 ms
> Sep  7 14:32:01 server-v2 crmd: [25062]: debug: cancel_op: Cancelling op 34401 for FS-win2k3FS 
> (FS-win2k3FS:34401)
> Sep  7 14:32:01 server-v2 lrmd: [25059]: debug: cancel_op: operation monitor[34401] on 
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] 
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000] 
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3]  cancelled
> Sep  7 14:32:01 server-v2 crmd: [25062]: debug: process_lrm_event: Op FS-win2k3FS_monitor_20000 (call=34401): 
> no delete event required
> Sep  7 14:32:01 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000 
> (call=34401, rc=-2, cib-update=0, confirmed=true) Cancelled unknown exec error
> 
> My questions are:
>  - crm says "complete ok", so why is this operation postponed at all?

complete ok is somewhat misleading, I think. I think that
confirmed=false means that there was no callback yet for the
operation.

>  - Why does more than 13 minutes nothing happen between first "add an operation operation monitor" and 
> first "max_child_count"?

The repeating monitor operation is silent, unless there's
something important to report. Within those 13 minutes there
should've been n monitor operations. If not, then it's really
strange.

>  - Why are there some huge leaps between the "max_child_count"-messages (for example 13:42:03 and 14:19:30)
>  - Is this a bug or am I misreading the log files?

It's sort of dynamic :) If there are more than max_child_count
resource operations running at the time another operation is
supposed to be scheduled, you'll see this message.

> The system had not unusual high load in that time. We use pacemaker version 1.0.5+hg20090915-1, I know that's 
> quite ancient. If I should update to a more recent version, which version would you recommend? Can somebody 
> confirm that there were fixes for these problems?

Apart from upgrading, you can increase the LRM max children.
That's a fairly new addition, not sure if your installation
supports that. Try:

# lrmadmin -p max-children 8

I'm also not sure if your init script supports setting it in
/etc/default files. That's also a relatively recent addition.

Thanks,

Dejan

> Thanks for any help!
> 
> Thilo
> 
> 
> -- 
> Thilo Uttendorfer
> Linux Information Systems AG
> Putzbrunner Str. 71, 81739 München
> 
> Fon: +49 89 993412-11, Fax: +49 89 993412-99
> t.uttendorfer at linux-ag.com, http://www.linux-ag.com
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




More information about the Pacemaker mailing list