[ClusterLabs] Failover When Host is Up, Out of Order Logs

Ken Gaillot kgaillot at redhat.com
Wed Feb 1 16:26:24 EST 2017


On 01/31/2017 11:44 AM, Corey Moullas wrote:
> I have been getting extremely strange behavior from a Corosync/Pacemaker
> install on OVH Public Cloud servers.
> 
>  
> 
> After hours of Googling, I thought I would try posting here to see if
> somebody knows what to do.
> 
>  
> 
> I see this in my logs very frequently:
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync warning [MAIN  ] Corosync main
> process was not scheduled for 24334.5645 ms (threshold is 2400.0000 ms).
> Consider token timeout increase.
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] A processor
> failed, forming new configuration.
> 
>  
> 
> I have increased token time to 10s and this still occurs regularly even
> though both hosts are always up.

The "was not scheduled" message means the kernel is not giving corosync
enough CPU time to keep the token alive. The message indicates that it
didn't get scheduled for 24 seconds, which is why 10 seconds wouldn't
help -- but simply raising the timeout isn't a good idea at that point.
You need to figure out why you're starved for CPU. Public clouds don't
generally provide any guarantees, so it may be on their end.
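
One quick way to check whether the hypervisor is the culprit is to look at
"steal" time, i.e. CPU cycles the hypervisor took away from this guest. For
example:

    # watch the "st" column; sustained large values mean the hypervisor
    # is not scheduling this guest promptly
    vmstat 1 10

    # or check the "st" field in the CPU summary line
    top -b -n 1 | grep 'Cpu(s)'

If steal time is consistently high, the starvation is happening on the
provider's side, and no amount of corosync.conf tuning will really fix it.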

> There are also times when the floating IP script is fired, but corosync
> does not seem to be aware of it. When you run crm status, it will show the
> IP being bound to fusion01-1 when in fact the script fired and moved it to
> fusion01-2.
> 
>  
> 
> Finally, the timing of the logs seems very odd. This is what the end of
> my corosync log file looks like. Notice the times appear out of order in
> the logs. I’m ripping my hair out with these issues. Anybody have a clue
> what may be going on here?

Both pacemaker and corosync write to the same log file (which is probably a
bad idea and will be changed one day, but that's not the concern here), and
each daemon timestamps its messages from its own time source. If pacemaker
is getting scheduled more often than corosync, it's entirely possible for
messages to appear out of order, since corosync can't write out its log
buffers as often.
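
If the interleaving makes the combined log too hard to follow, one
workaround is to give corosync its own log file via the logging section of
corosync.conf (the path below is only an example):

    logging {
        to_logfile: yes
        # example path, adjust as needed
        logfile: /var/log/cluster/corosync-only.log
        to_syslog: yes
        timestamp: on
    }

Pacemaker's log destination can be set separately (e.g. PCMK_logfile in
/etc/sysconfig/pacemaker on RHEL-based systems), so the two daemons no
longer have to share one file.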

> 
>  
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/crmd/245, version=0.81.123)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: --- 0.81.123 2
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: +++ 0.81.124 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        -- /cib/status/node_state[@id='1']/lrm[@id='1']
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib:  @num_updates=124
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_delete operation for
> section //node_state[@uname='fusion01-1']/lrm: OK (rc=0,
> origin=fusion01-1/crmd/246, version=0.81.124)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: --- 0.81.124 2
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: +++ 0.81.125 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib:  @num_updates=125
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib/status/node_state[@id='1']: 
> @crm-debug-origin=do_lrm_query_internal
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++ /cib/status/node_state[@id='1']:  <lrm id="1"/>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                    
> <lrm_resources>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                      
> <lrm_resource id="FloatIP" type="floatip-ocf" class="ocf" provider="ovh">
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++
>                                         <lrm_rsc_op id="FloatIP_last_0"
> operation_key="FloatIP_start_0" operation="start"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="5:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> transition-magic="0:0;5:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> on_node="fusion01-1" call-id="17" rc-code="0" op-status="0" interval="0"
> last-run="1485859189" last-rc-change="1485859189" e
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++
>                                       </lrm_resource>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                      
> <lrm_resource id="FS" type="FSSofia" class="lsb">
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                        
> <lrm_rsc_op id="FS_last_failure_0" operation_key="FS_monitor_0"
> operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.10"
> transition-key="4:1:7:1fe20aa3-b305-4282-99a3-b1f8190d3c2c"
> transition-magic="0:0;4:1:7:1fe20aa3-b305-4282-99a3-b1f8190d3c2c"
> on_node="fusion01-1" call-id="9" rc-code="0" op-status="0" interval="0"
> last-run="1485760597" last-rc-change="1485760597" ex
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                        
> <lrm_rsc_op id="FS_last_0" operation_key="FS_start_0" operation="start"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="7:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> transition-magic="0:0;7:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> on_node="fusion01-1" call-id="18" rc-code="0" op-status="0" interval="0"
> last-run="1485859191" last-rc-change="1485859191" exec-time="
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                        
> <lrm_rsc_op id="FS_monitor_1000" operation_key="FS_monitor_1000"
> operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.10"
> transition-key="1:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> transition-magic="0:0;1:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> on_node="fusion01-1" call-id="19" rc-code="0" op-status="0"
> interval="1000" last-rc-change="1485859191" exec-time="20" qu
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                      
> </lrm_resource>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                    
> </lrm_resources>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        ++                                   </lrm>
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/crmd/247, version=0.81.125)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section nodes: OK (rc=0, origin=fusion01-1/crmd/250, version=0.81.125)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: --- 0.81.125 2
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: +++ 0.81.126 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib:  @num_updates=126
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib/status/node_state[@id='2']: 
> @crm-debug-origin=do_state_transition
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib/status/node_state[@id='1']: 
> @crm-debug-origin=do_state_transition
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/crmd/251, version=0.81.126)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section cib: OK (rc=0, origin=fusion01-1/crmd/252, version=0.81.126)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/attrd/16, version=0.81.126)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: --- 0.81.126 2
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        Diff: +++ 0.81.127 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_perform_op:        +  /cib:  @num_updates=127
> 
> Jan 31 10:31:47 [21062] fusion01-2        cib:     info:
> cib_process_request:               Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/attrd/17, version=0.81.127)
> 
> Jan 31 10:31:52 [21062] fusion01-2        cib:     info:
> cib_process_ping:     Reporting our current digest to fusion01-1:
> 8f48b8c10ce54828f87b27bea1b50d20 for 0.81.127 (0x7ff1485dadc0 0)
> 
> Jan 31 10:32:16 [21067] fusion01-2       crmd:     info:
> throttle_send_command:   New throttle mode: 0000 (was 0010)
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync warning [MAIN  ] Corosync main
> process was not scheduled for 24334.5645 ms (threshold is 2400.0000 ms).
> Consider token timeout increase.
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] A processor
> failed, forming new configuration.
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync warning [MAIN  ] Corosync main
> process was not scheduled for 9222.6709 ms (threshold is 2400.0000 ms).
> Consider token timeout increase.
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] A new
> membership (192.168.128.21:6120) was formed. Members left: 1
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] Failed to
> receive the leave message. failed: 1
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [QUORUM] Members[1]: 2
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [MAIN  ] Completed
> service synchronization, ready to provide service.
> 
> Jan 31 10:31:46 [581] fusion01-2 corosync notice  [TOTEM ] A new
> membership (192.168.128.20:6128) was formed. Members joined: 1
> 
> Jan 31 10:31:46 [581] fusion01-2 corosync notice  [QUORUM] Members[2]: 1 2
> 
> Jan 31 10:31:46 [581] fusion01-2 corosync notice  [MAIN  ] Completed
> service synchronization, ready to provide service.
> 
>  
> 
> Cheers,
> Corey Moullas
> 
> Network Administrator
> EMAK|TECH
> 
> T: (514)-4000-226 x101
> E: cmoullas at emak.tech <mailto:cmoullas at emak.tech>
> 
> IMPORTANT: For all support-related requests, please email
> support at emak.tech <mailto:support at emak.tech>.



