[ClusterLabs] Regular pengine warnings after a transient failure

Mon Mar 7 17:01:13 UTC 2016

On 03/07/2016 07:31 AM, Ferenc Wágner wrote:
> Hi,
> 
> A couple of days ago the nodes of our Pacemaker 1.1.14 cluster
> (vhbl0[3-7]) experienced temporary storage outage, leading to processes
> stucking randomly for a couple of minutes and big load spikes.  There
> were 30 monitor operation timeouts altogether on vhbl05, and an internal
> error on the DC.  What follows is my longish analysis of the logs, which
> may be wrong, which I'd be glad to learn about.  Knowledgeable people
> may skip to the end for the main question and a short mention of the
> side questions.  So, Pacemaker logs start as:
> 
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_60000 process (PID 1867) timed out
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_60000 process (PID 1868) timed out
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niifdc_monitor_60000:1867 - timed out after 20000ms
> 12:53:51 vhbl05 lrmd[9442]:  warning: vm-niiffs_monitor_60000:1868 - timed out after 20000ms
> 12:53:51 vhbl05 crmd[9445]:    error: Operation vm-niifdc_monitor_60000: Timed Out (node=vhbl05, call=720, timeout=20000ms)
> 12:53:52 vhbl05 crmd[9445]:    error: Operation vm-niiffs_monitor_60000: Timed Out (node=vhbl05, call=717, timeout=20000ms)
> 
> (precise interleaving is impossible, as the vhbl05 logs arrived at the
> log server with a delay of 78 s -- probably the syslog daemon was stuck)
> 
> 12:53:51 vhbl03 crmd[8530]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> 12:53:52 vhbl03 pengine[8529]:  warning: Processing failed op monitor for vm-niifdc on vhbl05: unknown error (1)
> 12:53:52 vhbl03 pengine[8529]:   notice: Recover vm-niifdc#011(Started vhbl05)
> 12:53:52 vhbl03 pengine[8529]:   notice: Calculated Transition 909: /var/lib/pacemaker/pengine/pe-input-262.bz2
> 
> The other nodes report in:
> 
> 12:53:57 vhbl04 crmd[9031]:   notice: High CPU load detected: 74.949997
> 12:54:16 vhbl06 crmd[8676]:   notice: High CPU load detected: 93.540001
> 
> while monitor operations keep timing out on vhbl05:
> 
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_60000 process (PID 1976) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-FCcontroller_monitor_60000:1976 - timed out after 20000ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_60000 process (PID 1977) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-dwdm_monitor_60000:1977 - timed out after 20000ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_60000 process (PID 1978) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-eiffel_monitor_60000:1978 - timed out after 20000ms
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_60000 process (PID 2015) timed out
> 12:54:13 vhbl05 lrmd[9442]:  warning: vm-web7_monitor_60000:2015 - timed out after 20000ms
> 12:54:13 vhbl05 crmd[9445]:    error: Operation vm-FCcontroller_monitor_60000: Timed Out (node=vhbl05, call=640, timeout=20000ms)
> 12:54:13 vhbl05 crmd[9445]:    error: Operation vm-dwdm_monitor_60000: Timed Out (node=vhbl05, call=636, timeout=20000ms)
> 12:54:13 vhbl05 crmd[9445]:    error: Operation vm-eiffel_monitor_60000: Timed Out (node=vhbl05, call=633, timeout=20000ms)
> 12:54:13 vhbl05 crmd[9445]:    error: Operation vm-web7_monitor_60000: Timed Out (node=vhbl05, call=638, timeout=20000ms)
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_60000 process (PID 2101) timed out
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-ftp.pws_monitor_60000:2101 - timed out after 20000ms
> 12:54:17 vhbl05 crmd[9445]:    error: Operation vm-ftp.pws_monitor_60000: Timed Out (node=vhbl05, call=637, timeout=20000ms)
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_60000 process (PID 2104) timed out
> 12:54:17 vhbl05 lrmd[9442]:  warning: vm-cirkusz_monitor_60000:2104 - timed out after 20000ms
> 12:54:17 vhbl05 crmd[9445]:    error: Operation vm-cirkusz_monitor_60000: Timed Out (node=vhbl05, call=650, timeout=20000ms)
> 
> Back on the DC:
> 
> 12:54:22 vhbl03 crmd[8530]:  warning: Request 3308 to pengine (0x7f88810214a0) failed: Resource temporarily unavailable (-11)
> 12:54:22 vhbl03 crmd[8530]:    error: Could not contact the pengine: -11
> 12:54:22 vhbl03 crmd[8530]:    error: FSA: Input I_ERROR from do_pe_invoke_callback() received in state S_POLICY_ENGINE
> 12:54:22 vhbl03 crmd[8530]:  warning: State transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=do_pe_invoke_callback ]
> 12:54:22 vhbl03 crmd[8530]:  warning: Fast-tracking shutdown in response to errors
> 12:54:22 vhbl03 crmd[8530]:  warning: Not voting in election, we're in state S_RECOVERY
> 12:54:22 vhbl03 crmd[8530]:    error: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> 12:54:22 vhbl03 crmd[8530]:   notice: Stopped 0 recurring operations at shutdown (32 ops remaining)
> 12:54:22 vhbl03 crmd[8530]:   notice: Recurring action vm-phm6:654 (vm-phm6_monitor_60000) incomplete at shutdown
> [31 more similar lines]
> 12:54:22 vhbl03 crmd[8530]:    error: 32 resources were active at shutdown.
> 12:54:22 vhbl03 crmd[8530]:   notice: Disconnected from the LRM
> 12:54:22 vhbl03 crmd[8530]:   notice: Disconnecting from Corosync
> 12:54:22 vhbl03 cib[8525]:  warning: new_event_notification (8525-8530-11): Broken pipe (32)
> 12:54:22 vhbl03 cib[8525]:  warning: A-Sync reply to crmd failed: No message of desired type
> 12:54:22 vhbl03 crmd[8530]:    error: Could not recover from internal error
> 
> The other nodes executed a new election at once:
> 
> 12:54:22 vhbl04 crmd[9031]:   notice: Our peer on the DC (vhbl03) is dead
> 12:54:22 vhbl07 crmd[8484]:   notice: Our peer on the DC (vhbl03) is dead
> 12:54:22 vhbl06 crmd[8676]:   notice: Our peer on the DC (vhbl03) is dead
> 12:54:22 vhbl07 crmd[8484]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> 12:54:22 vhbl06 crmd[8676]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> 12:54:22 vhbl04 crmd[9031]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> 12:54:22 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> 12:54:22 vhbl04 crmd[9031]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> 
> while vhbl03 restared its crmd process and also joined the party:
> 
> 12:54:22 vhbl03 pacemakerd[8521]:    error: The crmd process (8530) exited: Generic Pacemaker error (201)
> 12:54:22 vhbl03 pacemakerd[8521]:   notice: Respawning failed child process: crmd
> 12:54:22 vhbl03 crmd[2206]:   notice: Additional logging available in /var/log/pacemaker.log
> 12:54:22 vhbl03 crmd[2206]:   notice: CRM Git Version: 1.1.14 (70404b0)
> 12:54:22 vhbl03 crmd[2206]:   notice: Connecting to cluster infrastructure: corosync
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773705
> 12:54:22 vhbl03 crmd[2206]:   notice: Defaulting to uname -n for the local corosync node name
> 12:54:22 vhbl03 crmd[2206]:   notice: Quorum acquired
> 12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node vhbl03[167773705] - state is now member (was (null))
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
> 12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773706] - state is now member (was (null))
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
> 12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773707] - state is now member (was (null))
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
> 12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773708] - state is now member (was (null))
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
> 12:54:22 vhbl03 crmd[2206]:   notice: pcmk_quorum_notification: Node (null)[167773709] - state is now member (was (null))
> 12:54:22 vhbl03 crmd[2206]:   notice: Defaulting to uname -n for the local corosync node name
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773706
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773707
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773708
> 12:54:22 vhbl03 crmd[2206]:   notice: Could not obtain a node name for corosync nodeid 167773709
> 12:54:22 vhbl03 crmd[2206]:   notice: The local CRM is operational
> 12:54:22 vhbl03 crmd[2206]:   notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> 
> except for vhbl05, which was still struggling with stuck I/O:
> 
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-niifidp_monitor_60000 process (PID 2187) timed out
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-db2.iir_monitor_60000 process (PID 2188) timed out
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-niifidp_monitor_60000:2187 - timed out after 20000ms
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-db2.iir_monitor_60000:2188 - timed out after 20000ms
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-jt_monitor_60000 process (PID 2210) timed out
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-eduidp_monitor_60000 process (PID 2212) timed out
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-jt_monitor_60000:2210 - timed out after 20000ms
> 12:54:28 vhbl05 lrmd[9442]:  warning: vm-eduidp_monitor_60000:2212 - timed out after 20000ms
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-web9_monitor_60000 process (PID 2301) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-web5_monitor_60000 process (PID 2302) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-front.vh.videotorium_monitor_60000 process (PID 2303) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-web9_monitor_60000:2301 - timed out after 20000ms
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-web5_monitor_60000:2302 - timed out after 20000ms
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-front.vh.videotorium_monitor_60000:2303 - timed out after 20000ms
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-stream1.vh.videotorium_monitor_60000 process (PID 2307) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-stream1.vh.videotorium_monitor_60000:2307 - timed out after 20000ms
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-pub1_monitor_60000 process (PID 2327) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-cerberus_monitor_60000 process (PID 2334) timed out
> 12:54:32 vhbl05 lrmd[9442]:  warning: vm-pub1_monitor_60000:2327 - timed out after 20000ms
> 12:54:38 vhbl05 lrmd[9442]:  warning: vm-ms-tr_monitor_60000:2479 - timed out after 20000ms
> 12:54:39 vhbl05 lrmd[9442]:  warning: vm-fs.vh.videotorium_monitor_60000 process (PID 2512) timed out
> 12:54:39 vhbl05 lrmd[9442]:  warning: vm-mazsola_monitor_60000 process (PID 2514) timed out
> 12:54:39 vhbl05 lrmd[9442]:  warning: vm-elm_monitor_60000 process (PID 2519) timed out
> 12:54:41 vhbl05 lrmd[9442]:  warning: vm-test1.pc_monitor_60000 process (PID 2713) timed out
> 12:54:41 vhbl05 lrmd[9442]:  warning: vm-test1.pc_monitor_60000:2713 - timed out after 20000ms
> 
> but reacted later, after a timer pop:
> 
> 12:54:42 vhbl06 crmd[8676]:  warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> 12:54:42 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> 12:54:47 vhbl05 crmd[9445]:   notice: Our peer on the DC (vhbl03) is dead
> 12:54:47 vhbl05 crmd[9445]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> 12:54:47 vhbl05 crmd[9445]:    error: Operation vm-niifidp_monitor_60000: Timed Out (node=vhbl05, call=641, timeout=20000ms)
> 12:54:47 vhbl06 crmd[8676]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-db2.iir_monitor_60000: Timed Out (node=vhbl05, call=635, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-jt_monitor_60000: Timed Out (node=vhbl05, call=662, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-eduidp_monitor_60000: Timed Out (node=vhbl05, call=642, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-web9_monitor_60000: Timed Out (node=vhbl05, call=639, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-web5_monitor_60000: Timed Out (node=vhbl05, call=659, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-front.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=653, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-stream1.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=652, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-pub1_monitor_60000: Timed Out (node=vhbl05, call=657, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-cerberus_monitor_60000: Timed Out (node=vhbl05, call=654, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-nfs_monitor_60000: Timed Out (node=vhbl05, call=634, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-ms-tr_monitor_60000: Timed Out (node=vhbl05, call=661, timeout=20000ms)
> 12:54:48 vhbl05 crmd[9445]:    error: Operation vm-up4_monitor_60000: Timed Out (node=vhbl05, call=651, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-n2.bajnok_monitor_60000: Timed Out (node=vhbl05, call=658, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-ansible_monitor_60000: Timed Out (node=vhbl05, call=660, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-fs.vh.videotorium_monitor_60000: Timed Out (node=vhbl05, call=643, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-mazsola_monitor_60000: Timed Out (node=vhbl05, call=648, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-elm_monitor_60000: Timed Out (node=vhbl05, call=645, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-mma_monitor_60000: Timed Out (node=vhbl05, call=655, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-rs31_monitor_60000: Timed Out (node=vhbl05, call=646, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-eternusmgm_monitor_60000: Timed Out (node=vhbl05, call=649, timeout=20000ms)
> 12:54:49 vhbl05 crmd[9445]:    error: Operation vm-test1.pc_monitor_60000: Timed Out (node=vhbl05, call=684, timeout=20000ms)
> 
> Meanwhile, all storage paths got reinstated, and the election concluded:
> 
> 12:54:49 vhbl05 crmd[9445]:   notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> 12:54:49 vhbl07 crmd[8484]:   notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
> 12:54:49 vhbl07 crmd[8484]:  warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> 12:54:49 vhbl03 crmd[2206]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 12:54:49 vhbl06 crmd[8676]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 12:54:50 vhbl04 crmd[9031]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 12:54:52 vhbl05 crmd[9445]:   notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> 
> and these messages followed (30 of them mentioning the resources which
> experienced monitor timeouts on vhbl05 during the storage outage):
> 
> 12:54:54 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
> 12:54:54 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web7 on vhbl05: unknown error (1)
> [...]
> 
> The cluster thus planned to recover these resources:
> 
> 12:54:54 vhbl07 pengine[8483]:   notice: Recover vm-nfs#011(Started vhbl05)
> [...]
> 
> and decided to start those 32 which were running on vhbl03 at the time
> of the crmd shutdown (and which were still running undisturbed):
> 
> 12:54:54 vhbl07 pengine[8483]:   notice: Start   vm-dogwood#011(vhbl03)
> [...]
> 12:54:54 vhbl07 pengine[8483]:   notice: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-430.bz2
> 
> The cluster commenced stopping the resources it planned to recover.
> Suddenly (where did this come from?):
> 
> 12:55:13 vhbl07 crmd[8484]:   notice: Transition aborted by vm-eiffel_monitor_60000 'create' on vhbl05: Foreign event (magic=0:0;521:0:0:634eef05-39c1-4093-94d4-8d624b423bb7, cib=0.613.98, source=process_graph_event:600, 0)

That means the action was initiated by a different node (the previous DC
presumably), so the new DC wants to recalculate everything.

> Apparently, the cluster waited for the already running operations to
> complete before stopping the transition:
> 
> [...]
> 12:59:05 vhbl05 crmd[9445]:   notice: Operation vm-web9_stop_0: ok (node=vhbl05, call=783, rc=0, cib-update=364, confirmed=true)
> 12:59:05 vhbl07 crmd[8484]:   notice: Transition 0 (Complete=17, Pending=0, Fired=0, Skipped=140, Incomplete=292, Source=/var/lib/pacemaker/pengine/pe-input-430.bz2): Stopped
> 12:59:05 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
> [...]
> 12:59:05 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)
> 
> Calculating the new transition emitted the same 30 warning lines as
> above, even though vm-nfs (and vm-web9 etc.) was stopped during the
> previous (aborted) transition.  Anyway, the new plan correctly contained
> 
> 12:59:05 vhbl07 pengine[8483]:   notice: Start   vm-nfs#011(vhbl06)
> 
> and similarly, recovery actions turned into start actions for the
> resources stopped during the previous transition.  However, almost all
> other recovery actions just disappeared without any comment.  This was
> actually correct, but I really wonder why the cluster decided to paper
> over the previous monitor operation timeouts.  Maybe the operations
> finished meanwhile and got accounted somehow, just not logged?

I'm not sure why the PE decided recovery was not necessary. Operation
results wouldn't be accepted without being logged. The pe-input files
from the transitions around here should help.

> We were left with 36 start and 2 recover operations in the end, most of
> them concerning running resources on vhbl03.  Accordingly, the probes
> reported unexpected (to the cluster) status:
> 
> 12:59:05 vhbl07 crmd[8484]:   notice: Initiating action 112: monitor vm-phm6_monitor_0 on vhbl03
> 12:59:05 vhbl03 crmd[2206]:   notice: Operation vm-phm6_monitor_0: ok (node=vhbl03, call=830, rc=0, cib-update=19, confirmed=true)
> 12:59:05 vhbl07 crmd[8484]:  warning: Action 112 (vm-phm6_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
> 12:59:05 vhbl07 crmd[8484]:   notice: Transition aborted by vm-phm6_monitor_0 'create' on vhbl03: Event failed (magic=0:0;112:1:7:7f6ad33d-a843-4cde-b152-ad548e3a5a0f, cib=0.613.128, source=match_graph_event:381, 0)
> 12:59:05 vhbl07 crmd[8484]:  warning: Action 112 (vm-phm6_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
> 
> The transition indeed aborted soon, and the usual warnings got logged
> computing the new transition:
> 
> 12:59:06 vhbl07 crmd[8484]:   notice: Transition 1 (Complete=71, Pending=0, Fired=0, Skipped=82, Incomplete=153, Source=/var/lib/pacemaker/pengine/pe-input-431.bz2): Stopped
> 12:59:06 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
> [...]
> 12:59:06 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)
> 
> Now it contained 14 start and 2 recovery operations only.  Unexpected
> probe results stoke again:
> 
> 12:59:08 vhbl07 crmd[8484]:  warning: Action 109 (vm-vidyoreplay_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
> 12:59:08 vhbl07 crmd[8484]:   notice: Transition aborted by vm-vidyoreplay_monitor_0 'create' on vhbl03: Event failed (magic=0:0;109:2:7:7f6ad33d-a843-4cde-b152-ad548e3a5a0f, cib=0.613.215, source=match_graph_event:381, 0)
> 12:59:08 vhbl07 crmd[8484]:  warning: Action 109 (vm-vidyoreplay_monitor_0) on vhbl03 failed (target: 7 vs. rc: 0): Error
> 
> After the already running operations finished:
> 
> 12:59:55 vhbl07 crmd[8484]:   notice: Transition 2 (Complete=92, Pending=0, Fired=0, Skipped=18, Incomplete=46, Source=/var/lib/pacemaker/pengine/pe-input-432.bz2): Stopped
> 12:59:55 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
> [...]
> 12:59:55 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)
> 
> Then unexpected probe results caused another transition abort, which
> finally concluded:
> 
> 12:59:57 vhbl07 crmd[8484]:   notice: Transition 3 (Complete=27, Pending=0, Fired=0, Skipped=0, Incomplete=22, Source=/var/lib/pacemaker/pengine/pe-input-433.bz2): Complete
> 12:59:57 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-web5 on vhbl05: unknown error (1)
> [...]
> 12:59:57 vhbl07 pengine[8483]:  warning: Processing failed op monitor for vm-nfs on vhbl05: unknown error (1)
> 12:59:57 vhbl07 crmd[8484]:   notice: Initiating action 274: monitor vm-repo_monitor_60000 on vhbl03
> [some similar periodic monitor op initiations]
> 12:59:57 vhbl07 crmd[8484]:   notice: Transition 4 (Complete=16, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-434.bz2): Complete
> 12:59:57 vhbl07 crmd[8484]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> Basically, the cluster responded beyond my expectations, sparing lots of
> unnecessary recoveries or fencing.  I'm happy, thanks for this wonderful
> software!  But I'm left with these "Processing failed op monitor"
> warnings emitted every 15 minutes (timer pops).  Is it safe and clever
> to cleanup the affected resources?  Would that get rid of them without
> invoking other effects, like recoveries for example?

That's normal; it's how the cluster maintains the effect of a failure
that has not been cleared. The logs can be confusing, because it's not
apparent from that message alone whether the failure is new or old.

Cleaning up the resource will end the failure condition, so what happens
next depends on the configuration and state of the cluster. If the
failure was preventing a preferred node from running the resource, the
resource could move, depending on other factors such as stickiness.

> And I'm also curious.  What was that "foreign event" at 12:55:13?  How
> did the cluster avoid recovering lots of resources (as detailed above)?
>