[ClusterLabs] Regular pengine warnings after a transient failure

Mon Mar 7 08:31:53 EST 2016

Hi,

A couple of days ago the nodes of our Pacemaker 1.1.14 cluster
(vhbl0[3-7]) experienced temporary storage outage, leading to processes
stucking randomly for a couple of minutes and big load spikes.  There
were 30 monitor operation timeouts altogether on vhbl05, and an internal
error on the DC.  What follows is my longish analysis of the logs, which
may be wrong, which I'd be glad to learn about.  Knowledgeable people
may skip to the end for the main question and a short mention of the
side questions.  So, Pacemaker logs start as:

12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_60000 process (PID 1867) timed out
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_60000 process (PID 1868) timed out
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_60000:1867 - timed out after 20000ms
12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_60000:1868 - timed out after 20000ms
12:53:51 vhbl05 crmd[9445]:    error: Operation vm-niifdc_monitor_60000: Timed Out (node=vhbl05, call=720, timeout=20000ms)
12:53:52 vhbl05 crmd[9445]:    error: Operation vm-niiffs_monitor_60000: Timed Out (node=vhbl05, call=717, timeout=20000ms)

(precise interleaving is impossible, as the vhbl05 logs arrived at the
log server with a delay of 78 s -- probably the syslog daemon was stuck)

12:53:51 vhbl03 crmd[8530]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
12:53:52 vhbl03 pengine[8529]:  warning: Processing failed op monitor for vm-niifdc on vhbl05: unknown error (1)
12:53:52 vhbl03 pengine[8529]:   notice: Recover vm-niifdc#011(Started vhbl05)
12:53:52 vhbl03 pengine[8529]:   notice: Calculated Transition 909: /var/lib/pacemaker/pengine/pe-input-262.bz2

The other nodes report in:

12:53:57 vhbl04 crmd[9031]:   notice: High CPU load detected: 74.949997
12:54:16 vhbl06 crmd[8676]:   notice: High CPU load detected: 93.540001

while monitor operations keep timing out on vhbl05:

12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_60000 process (PID 1976) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_60000:1976 - timed out after 20000ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_60000 process (PID 1977) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_60000:1977 - timed out after 20000ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_60000 process (PID 1978) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_60000:1978 - timed out after 20000ms
12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_60000 process (PID 2015) timed out
12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_60000:2015 - timed out after 20000ms
12:54:13 vhbl05 crmd[9445]:    error: Operation vm-FCcontroller_monitor_60000: Timed Out (node=vhbl05, call=640, timeout=20000ms)
12:54:13 vhbl05 crmd[9445]:    error: Operation vm-dwdm_monitor_60000: Timed Out (node=vhbl05, call=636, timeout=20000ms)
12:54:13 vhbl05 crmd[9445]:    error: Operation vm-eiffel_monitor_60000: Timed Out (node=vhbl05, call=633, timeout=20000ms)
12:54:13 vhbl05 crmd[9445]:    error: Operation vm-web7_monitor_60000: Timed Out (node=vhbl05, call=638, timeout=20000ms)
12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_60000 process (PID 2101) timed out
12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_60000:2101 - timed out after 20000ms
12:54:17 vhbl05 crmd[9445]:    error: Operation vm-ftp.pws_monitor_60000: Timed Out (node=vhbl05, call=637, timeout=20000ms)
12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_60000 process (PID 2104) timed out
12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_60000:2104 - timed out after 20000ms
12:54:17 vhbl05 crmd[9445]:    error: Operation vm-cirkusz_monitor_60000: Timed Out (node=vhbl05, call=650, timeout=20000ms)

Back on the DC:

12:54:22 vhbl03 crmd[8530]:  warning: Request 3308 to pengine (0x7f88810214a0) failed: Resource temporarily unavailable (-11)
12:54:22 vhbl03 crmd[8530]:    error: Could not contact the pengine: -11
12:54:22 vhbl03 crmd[8530]:    error: FSA: Input I_ERROR from do_pe_invoke_callback() received in state S_POLICY_ENGINE
12:54:22 vhbl03 crmd[8530]:  warning: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback ]
12:54:22 vhbl03 crmd[8530]:  warning: Fast-tracking shutdown in response to errors
12:54:22 vhbl03 crmd[8530]:  warning: Not voting in election, we're in state S_RECOVERY
12:54:22 vhbl03 crmd[8530]:    error: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
12:54:22 vhbl03 crmd[8530]:   notice: Stopped 0 recurring operations at shutdown (32 ops remaining)
12:54:22 vhbl03 crmd[8530]:   notice: Recurring action vm-phm6:654 (vm-phm6_monitor_60000) incomplete at shutdown
[31 more similar lines]
12:54:22 vhbl03 crmd[8530]:    error: 32 resources were active at shutdown.
12:54:22 vhbl03 crmd[8530]:   notice: Disconnected from the LRM
12:54:22 vhbl03 crmd[8530]:   notice: Disconnecting from Corosync
12:54:22 vhbl03 cib[8525]:  warning: new_event_notification (8525-8530-11): Broken pipe (32)
12:54:22 vhbl03 cib[8525]:  warning: A-Sync reply to crmd failed: No message of desired type
12:54:22 vhbl03 crmd[8530]:    error: Could not recover from internal error

The other nodes executed a new election at once:

12:54:22 vhbl04 crmd[9031]:   notice: Our peer on the DC (vhbl03) is dead
12:54:22 vhbl07 crmd[8484]:   notice: Our peer on the DC (vhbl03) is dead
12:54:22 vhbl06 crmd[8676]:   notice: Our peer on the DC (vhbl03) is dead
12:54:22 vhbl07 crmd[8484]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
12:54:22 vhbl06 crmd[8676]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
12:54:22 vhbl04 crmd[9031]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
12:54:22 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
12:54:22 vhbl04 crmd[9031]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]

while vhbl03 restared its crmd process and also joined the party:

12:54:22 vhbl03 pacemakerd[8521]:    error: The crmd process (8530) exited: Generic Pacemaker error (201)
12:54:22 vhbl03 pacemakerd[8521]:   notice: Respawning failed child process: crmd
12:54:22 vhbl03 crmd[2206]:   notice: Additional logging available in /var/log/pacemaker.log
12:54:22 vhbl03 crmd[2206]:   notice: CRM Git Version: 1.1.14 (70404b0)
12:54:22 vhbl03 crmd[2206]:   notice: Connecting to cluster infrastructure: corosync
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773705
12:54:22 vhbl03 crmd[2206]:   notice: Defaulting to uname -n for the local corosync node name
12:54:22 vhbl03 crmd[2206]:   notice: Quorum acquired
12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node vhbl03[167773705] - state is now member (was (null))
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773706] - state is now member (was (null))
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773707] - state is now member (was (null))
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773708] - state is now member (was (null))
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773709] - state is now member (was (null))
12:54:22 vhbl03 crmd[2206]:   notice: Defaulting to uname -n for the local corosync node name
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
12:54:22 vhbl03 crmd[2206]:   notice: The local CRM is operational
12:54:22 vhbl03 crmd[2206]:   notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]

except for vhbl05, which was still struggling with stuck I/O:

12:54:28 vhbl05 lrmd[9442]:  warning: vm-niifidp_monitor_60000 process (PID 2187) timed out
12:54:28 vhbl05 lrmd[9442]:  warning: vm-db2.iir_monitor_60000 process (PID 2188) timed out
12:54:28 vhbl05 lrmd[9442]:  warning: vm-niifidp_monitor_60000:2187 - timed out after 20000ms
12:54:28 vhbl05 lrmd[9442]:  warning: vm-db2.iir_monitor_60000:2188 - timed out after 20000ms
12:54:28 vhbl05 lrmd[9442]:  warning: vm-jt_monitor_60000 process (PID 2210) timed out
12:54:28 vhbl05 lrmd[9442]:  warning: vm-eduidp_monitor_60000 process (PID 2212) timed out
12:54:28 vhbl05 lrmd[9442]:  warning: vm-jt_monitor_60000:2210 - timed out after 20000ms
12:54:28 vhbl05 lrmd[9442]:  warning: vm-eduidp_monitor_60000:2212 - timed out after 20000ms
12:54:32 vhbl05 lrmd[9442]:  warning: vm-web9_monitor_60000 process (PID 2301) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-web5_monitor_60000 process (PID 2302) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-front.vh.videotorium_monitor_60000 process (PID 2303) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-web9_monitor_60000:2301 - timed out after 20000ms
12:54:32 vhbl05 lrmd[9442]:  warning: vm-web5_monitor_60000:2302 - timed out after 20000ms
12:54:32 vhbl05 lrmd[9442]:  warning: vm-front.vh.videotorium_monitor_60000:2303 - timed out after 20000ms
12:54:32 vhbl05 lrmd[9442]:  warning: vm-stream1.vh.videotorium_monitor_60000 process (PID 2307) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-stream1.vh.videotorium_monitor_60000:2307 - timed out after 20000ms
12:54:32 vhbl05 lrmd[9442]:  warning: vm-pub1_monitor_60000 process (PID 2327) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-cerberus_monitor_60000 process (PID 2334) timed out
12:54:32 vhbl05 lrmd[9442]:  warning: vm-pub1_monitor_60000:2327 - timed out after 20000ms
12:54:38 vhbl05 lrmd[9442]:  warning: vm-ms-tr_monitor_60000:2479 - timed out after 20000ms
12:54:39 vhbl05 lrmd[9442]:  warning: vm-fs.vh.videotorium_monitor_60000 process (PID 2512) timed out
12:54:39 vhbl05 lrmd[9442]:  warning: vm-mazsola_monitor_60000 process (PID 2514) timed out
12:54:39 vhbl05 lrmd[9442]:  warning: vm-elm_monitor_60000 process (PID 2519) timed out
12:54:41 vhbl05 lrmd[9442]:  warning: vm-test1.pc_monitor_60000 process (PID 2713) timed out
12:54:41 vhbl05 lrmd[9442]:  warning: vm-test1.pc_monitor_60000:2713 - timed out after 20000ms

but reacted later, after a timer pop:

12:54:42 vhbl06 crmd[8676]:  warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
12:54:42 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
12:54:47 vhbl05 crmd[9445]:   notice: Our peer on the DC (vhbl03) is dead
12:54:47 vhbl05 crmd[9445]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
12:54:47 vhbl05 crmd[9445]:    error: Operation vm-niifidp_monitor_60000: Timed Out (node=vhbl05, call=641, timeout=20000ms)
12:54:47 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-db2.iir_monitor_60000: Timed Out (node=vhbl05, call=635, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-jt_monitor_60000: Timed Out (node=vhbl05, call=662, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-eduidp_monitor_60000: Timed Out (node=vhbl05, call=642, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-web9_monitor_60000: Timed Out (node=vhbl05, call=639, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-web5_monitor_60000: Timed Out (node=vhbl05, call=659, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-front.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=653, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-stream1.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=652, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-pub1_monitor_60000: Timed Out (node=vhbl05, call=657, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-cerberus_monitor_60000: Timed Out (node=vhbl05, call=654, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-nfs_monitor_60000: Timed Out (node=vhbl05, call=634, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-ms-tr_monitor_60000: Timed Out (node=vhbl05, call=661, timeout=20000ms)
12:54:48 vhbl05 crmd[9445]:    error: Operation vm-up4_monitor_60000: Timed Out (node=vhbl05, call=651, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-n2.bajnok_monitor_60000: Timed Out (node=vhbl05, call=658, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-ansible_monitor_60000: Timed Out (node=vhbl05, call=660, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-fs.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=643, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-mazsola_monitor_60000: Timed Out (node=vhbl05, call=648, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-elm_monitor_60000: Timed Out (node=vhbl05, call=645, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-mma_monitor_60000: Timed Out (node=vhbl05, call=655, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-rs31_monitor_60000: Timed Out (node=vhbl05, call=646, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-eternusmgm_monitor_60000: Timed Out (node=vhbl05, call=649, timeout=20000ms)
12:54:49 vhbl05 crmd[9445]:    error: Operation vm-test1.pc_monitor_60000: Timed Out (node=vhbl05, call=684, timeout=20000ms)

Meanwhile, all storage paths got reinstated, and the election concluded:

12:54:49 vhbl05 crmd[9445]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
12:54:49 vhbl07 crmd[8484]:   notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
12:54:49 vhbl07 crmd[8484]:  warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
12:54:49 vhbl03 crmd[2206]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
12:54:49 vhbl06 crmd[8676]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
12:54:50 vhbl04 crmd[9031]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
12:54:52 vhbl05 crmd[9445]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]

and these messages followed (30 of them mentioning the resources which
experienced monitor timeouts on vhbl05 during the storage outage):

12:54:54 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
12:54:54 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web7 on vhbl05: unknown error (1)
[...]

The cluster thus planned to recover these resources:

12:54:54 vhbl07 pengine[8483]:   notice: Recover vm-nfs#011(Started vhbl05)
[...]

and decided to start those 32 which were running on vhbl03 at the time
of the crmd shutdown (and which were still running undisturbed):

12:54:54 vhbl07 pengine[8483]:   notice: Start   vm-dogwood#011(vhbl03)
[...]
12:54:54 vhbl07 pengine[8483]:   notice: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-430.bz2

The cluster commenced stopping the resources it planned to recover.
Suddenly (where did this come from?):

12:55:13 vhbl07 crmd[8484]:   notice: Transition aborted by vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)

Apparently, the cluster waited for the already running operations to
complete before stopping the transition:

[...]
12:59:05 vhbl05 crmd[9445]:   notice: Operation vm-web9_stop_0: ok (node=vhbl05, call=783, rc=0, cib-update=364, confirmed=true)
12:59:05 vhbl07 crmd[8484]:   notice: Transition 0 (Complete=17, Pending=0, Fired=0, Skipped=140, Incomplete=292, Source=/var/lib/pacemaker/pengine/pe-input-430.bz2): Stopped
12:59:05 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
[...]
12:59:05 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)

Calculating the new transition emitted the same 30 warning lines as
above, even though vm-nfs (and vm-web9 etc.) was stopped during the
previous (aborted) transition.  Anyway, the new plan correctly contained

12:59:05 vhbl07 pengine[8483]:   notice: Start   vm-nfs#011(vhbl06)

and similarly, recovery actions turned into start actions for the
resources stopped during the previous transition.  However, almost all
other recovery actions just disappeared without any comment.  This was
actually correct, but I really wonder why the cluster decided to paper
over the previous monitor operation timeouts.  Maybe the operations
finished meanwhile and got accounted somehow, just not logged?

We were left with 36 start and 2 recover operations in the end, most of
them concerning running resources on vhbl03.  Accordingly, the probes
reported unexpected (to the cluster) status:

12:59:05 vhbl07 crmd[8484]:   notice: Initiating action 112: monitor vm-phm6_monitor_0 on vhbl03
12:59:05 vhbl03 crmd[2206]:   notice: Operation vm-phm6_monitor_0: ok (node=vhbl03, call=830, rc=0, cib-update=19, confirmed=true)
12:59:05 vhbl07 crmd[8484]:  warning: Action 112 (vm-phm6_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
12:59:05 vhbl07 crmd[8484]:   notice: Transition aborted by vm-phm6_monitor_0 'create' on vhbl03: Event failed (magic=0:0;112:1:7:7f6ad33d-a843-4cde-b152-ad548e3a5a0f, cib=0.613.128, source=match_graph_event:381, 0)
12:59:05 vhbl07 crmd[8484]:  warning: Action 112 (vm-phm6_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error

The transition indeed aborted soon, and the usual warnings got logged
computing the new transition:

12:59:06 vhbl07 crmd[8484]:   notice: Transition 1 (Complete=71, Pending=0, Fired=0, Skipped=82, Incomplete=153, Source=/var/lib/pacemaker/pengine/pe-input-431.bz2): Stopped
12:59:06 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
[...]
12:59:06 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)

Now it contained 14 start and 2 recovery operations only.  Unexpected
probe results stoke again:

12:59:08 vhbl07 crmd[8484]:  warning: Action 109 (vm-vidyoreplay_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
12:59:08 vhbl07 crmd[8484]:   notice: Transition aborted by vm-vidyoreplay_monitor_0 'create' on vhbl03: Event failed (magic=0:0;109:2:7:7f6ad33d-a843-4cde-b152-ad548e3a5a0f, cib=0.613.215, source=match_graph_event:381, 0)
12:59:08 vhbl07 crmd[8484]:  warning: Action 109 (vm-vidyoreplay_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error

After the already running operations finished:

12:59:55 vhbl07 crmd[8484]:   notice: Transition 2 (Complete=92, Pending=0, Fired=0, Skipped=18, Incomplete=46, Source=/var/lib/pacemaker/pengine/pe-input-432.bz2): Stopped
12:59:55 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
[...]
12:59:55 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)

Then unexpected probe results caused another transition abort, which
finally concluded:

12:59:57 vhbl07 crmd[8484]:   notice: Transition 3 (Complete=27, Pending=0, Fired=0, Skipped=0, Incomplete=22, Source=/var/lib/pacemaker/pengine/pe-input-433.bz2): Complete
12:59:57 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
[...]
12:59:57 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)
12:59:57 vhbl07 crmd[8484]:   notice: Initiating action 274: monitor vm-repo_monitor_60000 on vhbl03
[some similar periodic monitor op initiations]
12:59:57 vhbl07 crmd[8484]:   notice: Transition 4 (Complete=16, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-434.bz2): Complete
12:59:57 vhbl07 crmd[8484]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

Basically, the cluster responded beyond my expectations, sparing lots of
unnecessary recoveries or fencing.  I'm happy, thanks for this wonderful
software!  But I'm left with these "Processing failed op monitor"
warnings emitted every 15 minutes (timer pops).  Is it safe and clever
to cleanup the affected resources?  Would that get rid of them without
invoking other effects, like recoveries for example?

And I'm also curious.  What was that "foreign event" at 12:55:13?  How
did the cluster avoid recovering lots of resources (as detailed above)?
-- 
Thanks,
Feri