[Pacemaker] ClusterMon

Ryan Steele ryans@aweber.com
Sun Dec 5 20:26:40 EST 2010


Hi folks,

I'd like to use crm_mon for monitoring and email notifications, but I've 
hit a snag when it comes to incorporating it into the crm 
configuration.  When I run crm_mon manually from the command line (with 
no ClusterMon resource in the crm configuration), it all works great, 
but obviously running crm_mon manually on every cluster member would 
result in a litany of duplicated messages for each resource migration, 
which is why I'm looking to incorporate it into the cluster.  
Unfortunately, the exact same crm_mon configuration fails to work when 
entered into the CIB, and doesn't print out any errors.  To get the 
crm_mon configuration into the CIB, I first tried using the scriptable 
crm utility, but it didn't seem to like that very much:

# crm configure primitive ResourceMonitor ocf:pacemaker:ClusterMon 
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" 
extra_options="-T ops at example.com -F 'Cluster Monitor 
<ClusterMonitor at example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: 
Resource Changes Detected'" op monitor interval="10s" timeout="20s"
element nvpair: Relax-NG validity error : Type ID doesn't allow value 
'ResourceMonitor-instance_attributes-ops@example.com'
element nvpair: Relax-NG validity error : Element nvpair failed to 
validate attributes
Relax-NG validity error : Extra element nvpair in interleave
element nvpair: Relax-NG validity error : Element instance_attributes 
failed to validate content
Relax-NG validity error : Extra element instance_attributes in interleave
element cib: Relax-NG validity error : Element cib failed to validate 
content
crm_verify[1762]: 2010/12/05_19:23:03 ERROR: main: CIB did not pass 
DTD/schema validation
Errors found during check: config not valid
ERROR: ResourceMonitor: parameter -F does not exist
ERROR: ResourceMonitor: parameter [LDAP Cluster]: Resource Changes 
Detected does not exist
ERROR: ResourceMonitor: parameter Cluster Monitor 
<ClusterMonitor@example.com> does not exist
ERROR: ResourceMonitor: parameter -H does not exist
ERROR: ResourceMonitor: parameter smtp.example.com:25 does not exist
ERROR: ResourceMonitor: parameter ops@example.com does not exist
ERROR: ResourceMonitor: parameter -P does not exist
WARNING: ResourceMonitor: default timeout 20s for start is smaller than 
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than 
the advised 100
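
For reference, here's roughly the manual invocation that does work from 
the shell.  The -d/-p/-h (daemonize/pidfile/htmlfile) flags approximate 
what the ClusterMon agent itself passes to crm_mon, if I'm reading the 
agent script correctly, so treat this as a sketch rather than a verbatim 
transcript:

# crm_mon -d -p /var/run/crm_mon.pid -h /var/tmp/crm_mon.html \
    -T ops@example.com -F 'Cluster Monitor <ClusterMonitor@example.com>' \
    -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'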

I know those are valid options, since they work from the CLI, so I tried 
going through the interactive crm shell instead, hoping it was just a 
quoting or interpolation issue.  That approach appeared to work (albeit 
with the same timeout warnings):

crm(live)configure# primitive ResourceMonitor ocf:pacemaker:ClusterMon 
params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" 
extra_options="-T ops at example.com -F 'Cluster Monitor 
<ClusterMonitor at example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: 
Resource Changes Detected'" op monitor interval="10s" timeout="20s"
WARNING: ResourceMonitor: default timeout 20s for start is smaller than 
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than 
the advised 100
crm(live)configure# commit
WARNING: ResourceMonitor: default timeout 20s for start is smaller than 
the advised 90
WARNING: ResourceMonitor: default timeout 20s for stop is smaller than 
the advised 100
crm(live)configure# exit
bye


After adding it via the crm shell, the crm_mon daemon is definitely 
running (and migrates to another node if I shut down or restart corosync 
on the node currently running crm_mon), but I'm not getting any email 
messages.  My mail server logs confirm the messages never arrive when 
the crm_mon configuration is in the cluster.  The same command works 
when run manually from the command line, and there are no errors or 
warnings in the logs, so I'm not sure what to attribute the problem to.  
Here are the cluster log messages resulting from a simple resource 
migration on the host running the crm_mon daemon that was spawned by the 
cluster:

Dec  5 20:05:00 ldap3 external/ipmi[7032]: [7041]: debug: ipmitool 
output: Chassis Power is on
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation 
complete: op cib_delete for section constraints 
(origin=ldap4/crm_resource/3, version=0.78.4): ok (rc=0)
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<cib admin_epoch="0" epoch="78" num_updates="4" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<constraints >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
<expression value="ldap4" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
</rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
</rsc_location>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: abort_transition_graph: 
need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
</constraints>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: need_abort: Aborting on change 
to admin_epoch
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
</configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: - 
</cib>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<cib admin_epoch="0" epoch="79" num_updates="1" >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: All 2 
cluster nodes are eligible to run resources.
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<configuration >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<constraints >
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke: Query 63: 
Requesting the current CIB: S_POLICY_ENGINE
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<rsc_location id="cli-prefer-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<rule id="cli-prefer-rule-ClusterIP" >
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
<expression value="ldap3" id="cli-prefer-expr-ClusterIP" />
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
</rule>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
</rsc_location>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
</constraints>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
</configuration>
Dec  5 20:05:03 ldap3 cib: [6496]: info: log_data_element: cib:diff: + 
</cib>
Dec  5 20:05:03 ldap3 cib: [6496]: info: cib_process_request: Operation 
complete: op cib_modify for section constraints 
(origin=ldap4/crm_resource/4, version=0.79.1): ok (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_pe_invoke_callback: 
Invoking the PE: query=63, ref=pe_calc-dc-1291597503-34, seq=88, quorate=1
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: unpack_config: On loss of 
CCM Quorum: Ignore
Dec  5 20:05:03 ldap3 pengine: [6499]: info: unpack_config: Node scores: 
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: 
Node ldap3 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: info: determine_online_status: 
Node ldap4 is online
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: 
ClusterIP    (ocf::heartbeat:IPaddr2):    Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: 
ldap3-stonith    (stonith:external/ipmi):    Started ldap4
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: 
ldap4-stonith    (stonith:external/ipmi):    Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: native_print: 
ResourceMonitor    (ocf::pacemaker:ClusterMon):    Started ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: RecurringOp:  Start 
recurring monitor (10s) for ClusterIP on ldap3
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Move resource 
ClusterIP    (Started ldap4 -> ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave 
resource ldap3-stonith    (Started ldap4)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave 
resource ldap4-stonith    (Started ldap3)
Dec  5 20:05:03 ldap3 pengine: [6499]: notice: LogActions: Leave 
resource ResourceMonitor    (Started ldap3)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: unpack_graph: Unpacked 
transition 3: 4 actions in 4 synapses
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_te_invoke: Processing graph 
3 (ref=pe_calc-dc-1291597503-34) derived from 
/var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating 
action 9: stop ClusterIP_stop_0 on ldap4
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Archived 
previous version as /var/lib/heartbeat/crm/cib-64.raw
Dec  5 20:05:03 ldap3 pengine: [6499]: info: process_pe_message: 
Transition 3: PEngine Input stored in: /var/lib/pengine/pe-input-100.bz2
Dec  5 20:05:03 ldap3 cib: [7044]: info: write_cib_contents: Wrote 
version 0.79.0 of the CIB to disk (digest: 8689b11ceba2dad1a9d93d704ff47580)
Dec  5 20:05:03 ldap3 cib: [7044]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.DN94W1 (digest: 
/var/lib/heartbeat/crm/cib.vicJ0i)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action 
ClusterIP_stop_0 (9) confirmed on ldap4 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating 
action 10: start ClusterIP_start_0 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing 
key=10:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_start_0 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:13: start
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_pseudo_action: Pseudo 
action 5 fired and confirmed
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip -f inet addr add 
10.1.1.163/32 brd 10.1.1.163 dev eth1
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: ip link set eth1 up
Dec  5 20:05:03 ldap3 IPaddr2[7045]: INFO: /usr/lib/heartbeat/send_arp 
-i 200 -r 5 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-10.1.1.163 
eth1 10.1.1.163 auto not_used not_used
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM 
operation ClusterIP_start_0 (call=13, rc=0, cib-update=64, 
confirmed=true) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action 
ClusterIP_start_0 (10) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_rsc_command: Initiating 
action 11: monitor ClusterIP_monitor_10000 on ldap3 (local)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_lrm_rsc_op: Performing 
key=11:3:0:32629690-f4fb-43b9-a251-8f6b25e60220 op=ClusterIP_monitor_10000 )
Dec  5 20:05:03 ldap3 lrmd: [6497]: info: rsc:ClusterIP:14: monitor
Dec  5 20:05:03 ldap3 crmd: [6500]: info: process_lrm_event: LRM 
operation ClusterIP_monitor_10000 (call=14, rc=0, cib-update=65, 
confirmed=false) ok
Dec  5 20:05:03 ldap3 crmd: [6500]: info: match_graph_event: Action 
ClusterIP_monitor_10000 (11) confirmed on ldap3 (rc=0)
Dec  5 20:05:03 ldap3 crmd: [6500]: info: run_graph: 
====================================================
Dec  5 20:05:03 ldap3 crmd: [6500]: notice: run_graph: Transition 3 
(Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pengine/pe-input-100.bz2): Complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: te_graph_trigger: Transition 3 
is now complete
Dec  5 20:05:03 ldap3 crmd: [6500]: info: notify_crmd: Transition 3 
status: done - <null>
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  5 20:05:03 ldap3 crmd: [6500]: info: do_state_transition: Starting 
PEngine Recheck Timer
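
One more data point: inspecting the argument vector of the 
cluster-spawned daemon on the active node shows how the quoting in 
extra_options survived the trip through the CIB.  Something like this 
(just a quick debugging step) reveals exactly which arguments crm_mon 
was started with:

# ps axwww | grep [c]rm_mon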


Here is the output of 'crm configure show':
node ldap3
node ldap4
primitive ClusterIP ocf:heartbeat:IPaddr2 \
     params ip="10.1.1.163" cidr_netmask="32" \
     op monitor interval="10s"
primitive ResourceMonitor ocf:pacemaker:ClusterMon \
     params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
          extra_options="-T ops@example.com -F 'Cluster Monitor <ClusterMonitor@example.com>' -H smtp.example.com:25 -P '[LDAP Cluster]: Resource Changes Detected'" \
     op monitor interval="10s" timeout="20s"
primitive ldap3-stonith stonith:external/ipmi \
     params hostname="ldap3" ipaddr="10.1.0.5" userid="****" passwd="****" interface="lan" \
     op monitor interval="60s" timeout="30s"
primitive ldap4-stonith stonith:external/ipmi \
     params hostname="ldap4" ipaddr="10.1.0.6" userid="****" passwd="****" interface="lan" \
     op monitor interval="60s" timeout="30s"
location cli-prefer-ClusterIP ClusterIP \
     rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq ldap3
location ldap3-stonith-cmdsrc ldap3-stonith -inf: ldap3
location ldap4-stonith-cmdsrc ldap4-stonith -inf: ldap4
property $id="cib-bootstrap-options" \
     dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
     cluster-infrastructure="openais" \
     expected-quorum-votes="2" \
     stonith-enabled="true" \
     no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
     resource-stickiness="100"
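
One thing I plan to test next, in case the nested quotes inside 
extra_options are being mangled somewhere between the shell, the CIB, 
and the spawned daemon: a variant where none of the mail option values 
contain spaces, so nothing needs inner quoting at all.  This is just a 
guessed-at workaround (after deleting the old definition first):

# crm configure primitive ResourceMonitor ocf:pacemaker:ClusterMon \
    params pidfile="/var/run/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html" \
    extra_options="-T ops@example.com -F ClusterMonitor@example.com -H smtp.example.com:25 -P LDAP-Cluster" \
    op monitor interval="10s" timeout="20s"

If mail flows with that version, it points at quoting rather than at the 
mail options themselves.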

Other than the monitoring, everything seems to work pretty well, but I 
don't want to deploy this in production without good real-time 
monitoring of resource changes, so I'd appreciate any suggestions as to 
why crm_mon works when run manually but not when configured in the 
cluster.  For reference, I'm running Ubuntu Server 10.04 LTS (Lucid), 
and these are the packages I'm using:

cluster-agents   1:1.0.3-2ubuntu1
cluster-glue   1.0.5-1
corosync   1.2.0-0ubuntu1
libcluster-glue   1.0.5-1
libcorosync-dev   1.2.0-0ubuntu1
libcorosync4   1.2.0-0ubuntu1
libopenais3   1.1.2-0ubuntu1
openais   1.1.2-0ubuntu1
pacemaker   1.0.8+hg15494-2ubuntu2
pacemaker-dev   1.0.8+hg15494-2ubuntu2
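
One more thing I can verify is basic SMTP reachability from the cluster 
nodes themselves, to rule out a network problem (a quick check with 
netcat, nothing specific to Pacemaker):

# nc -vz smtp.example.com 25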

Thanks!



