[ClusterLabs] Antw: Antw: corosync taking almost 30 secs to detect node failure in case of kernel panic

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Jan 10 06:11:29 EST 2018


Hi!

That's odd indeed. We have a server that indicates network traffic via an LED on the front, and with a dedicated cluster NIC we see constant flickering on that LED; the nodes also detect node failures via corosync almost immediately.
In your case it seems a cluster recheck is needed to detect the node failure. I don't know why, sorry!
Do you have any special configuration parameters in corosync? Maybe show the config.
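For comparison, failure detection itself is governed by the totem timeouts, not by Pacemaker's recheck timer. Since your logs mention cman_event_callback, this looks like the cman/corosync 1.x stack, where the token is usually set in /etc/cluster/cluster.conf; with plain corosync it lives in corosync.conf. Just as a rough sketch with example values, not your actual config:

    # corosync.conf (plain corosync 1.x), totem section -- example values only
    totem {
        version: 2
        token: 10000                              # ms without the token before a peer is suspected
        token_retransmits_before_loss_const: 10
        consensus: 12000                          # defaults to 1.2 * token when unset
    }

    <!-- cman stack: the same knob in /etc/cluster/cluster.conf -->
    <cluster name="example" config_version="1">
      <totem token="10000"/>
      <!-- clusternodes, fencedevices etc. omitted -->
    </cluster>

With token at 10000 ms a new membership should normally form roughly 10-12 seconds after the peer dies, so ~30 seconds suggests the token value you think is configured is not the one actually in effect.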

Regards,
Ulrich
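
P.S. If you want to rule out the Pacemaker recheck timer (it only triggers periodic re-evaluation, it does not detect failures), you can query it; something along these lines should work on your stack, shown only as an example:

    # current value of the recheck interval, if explicitly set
    crm_attribute --query --name cluster-recheck-interval

    # or list all cluster properties including defaults
    pcs property list --all | grep -i recheck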


>>> ashutosh tiwari <ashutosh.kvas at gmail.com> wrote on 10.01.2018 at 11:47 in
message
<CA+vEgjhQ6yErz1Fo4SN312e6hXc-AgMeABRicuEwTBWVqVMavg at mail.gmail.com>:
> Hi Ulrich,
> 
> Thanks for the response.
> 
>  30 sec is the detection time alone, as confirmed by the logs.
> 
> ++++++++++++++++++++++++++++++++++++
> Jan 10 11:06:18 [19261] orana       crmd:     info: crm_timer_popped:
>  PEngine Recheck Timer (I_PE_CALC) just popped (30000ms)
> Jan 10 11:06:18 [19261] orana       crmd:   notice: do_state_transition:
>     State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 10 11:06:18 [19261] orana       crmd:     info: do_state_transition:
>     Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jan 10 11:06:18 [19260] orana    pengine:     info: process_pe_message:
> Input has not changed since last time, not saving to disk
> Jan 10 11:06:18 [19260] orana    pengine:   notice: unpack_config:      On
> loss of CCM Quorum: Ignore
> Jan 10 11:06:18 [19260] orana    pengine:     info:
> determine_online_status_fencing:    Node tigana is active
> Jan 10 11:06:18 [19260] orana    pengine:     info:
> determine_online_status:    Node tigana is online
> Jan 10 11:06:18 [19260] orana    pengine:     info:
> determine_online_status_fencing:    Node orana is active
> Jan 10 11:06:18 [19260] orana    pengine:     info:
> determine_online_status:    Node orana is online
> Jan 10 11:06:18 [19260] orana    pengine:     info: clone_print:
>  Master/Slave Set: unicloud-master [unicloud]
> Jan 10 11:06:18 [19260] orana    pengine:     info: short_print:
>  Masters: [ tigana ]
> Jan 10 11:06:18 [19260] orana    pengine:     info: short_print:
>  Slaves: [ orana ]
> Jan 10 11:06:18 [19260] orana    pengine:     info: native_print:
>  fence-uc-orana  (stonith:fence_ilo4):   Started tigana
> Jan 10 11:06:18 [19260] orana    pengine:     info: native_print:
>  fence-uc-tigana (stonith:fence_ilo4):   Started tigana
> Jan 10 11:06:18 [19260] orana    pengine:     info: master_color:
>  Promoting unicloud:0 (Master tigana)
> Jan 10 11:06:18 [19260] orana    pengine:     info: master_color:
>  unicloud-master: Promoted 1 instances of a possible 1 to master
> Jan 10 11:06:18 [19260] orana    pengine:     info: LogActions: Leave
>  unicloud:0      (Master tigana)
> Jan 10 11:06:18 [19260] orana    pengine:     info: LogActions: Leave
>  unicloud:1      (Slave orana)
> Jan 10 11:06:18 [19260] orana    pengine:     info: LogActions: Leave
>  fence-uc-orana  (Started tigana)
> Jan 10 11:06:18 [19260] orana    pengine:     info: LogActions: Leave
>  fence-uc-tigana (Started tigana)
> Jan 10 11:06:18 [19260] orana    pengine:   notice: process_pe_message:
> Calculated Transition 2390: /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana       crmd:     info: do_state_transition:
>     State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 10 11:06:18 [19261] orana       crmd:     info: do_te_invoke:
>  Processing graph 2390 (ref=pe_calc-dc-1515562578-2650) derived from
> /var/lib/pacemaker/pengine/pe-input-1655.bz2
> Jan 10 11:06:18 [19261] orana       crmd:   notice: run_graph:  Transition
> 2390 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-1655.bz2): Complete
> Jan 10 11:06:18 [19261] orana       crmd:     info: do_log:     FSA: Input
> I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
> Jan 10 11:06:18 [19261] orana       crmd:   notice: do_state_transition:
>     State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 10 11:06:31 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Jan 10 11:06:33 corosync [QUORUM] Members[1]: 1
> Jan 10 11:06:33 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Jan 10 11:06:33 [19250] orana pacemakerd:     info: cman_event_callback:
>     Membership 2064: quorum retained
> Jan 10 11:06:33 [19261] orana       crmd:     info: cman_event_callback:
>     Membership 2064: quorum retained
> Jan 10 11:06:33 [19250] orana pacemakerd:   notice:
> crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is
> now lost (was member)
> Jan 10 11:06:33 [19261] orana       crmd:   notice:
> crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is
> now lost (was member)
> Jan 10 11:06:33 [19261] orana       crmd:     info: peer_update_callback:
>      tigana is now lost (was member)
> Jan 10 11:06:33 [19261] orana       crmd:  warning: match_down_event:   No
> match for shutdown action on tigana
> Jan 10 11:06:33 [19261] orana       crmd:   notice: peer_update_callback:
>      Stonith/shutdown of tigana not matched
> Jan 10 11:06:33 [19261] orana       crmd:     info: crm_update_peer_join:
>      peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0
> Jan 10 11:06:33 [19261] orana       crmd:     info:
> abort_transition_graph:     Transition aborted: Node failure
> (source=peer_update_callback:240, 1)
> Jan 10 11:06:33 corosync [CPG   ] chosen downlist: sender r(0) ip(7.7.7.1)
> ; members(old:2 left:1)
> ++++++++++++++++++++++++++
> 
> These are the logs from the standby node (the new active).
> The kernel panic was triggered at 11:06:00 on the other node, and here the
> totem membership change is reported only at 11:06:31.
> 
> 30 secs is the cluster-recheck timer.
> 
> Regards,
> Ashutosh
> 
> 
> On Wed, Jan 10, 2018 at 3:12 PM, <users-request at clusterlabs.org> wrote:
> 
>>
>> Today's Topics:
>>
>>    1. corosync taking almost 30 secs to detect node failure in case
>>       of kernel panic (ashutosh tiwari)
>>    2. Antw: corosync taking almost 30 secs to detect node failure
>>       in case of kernel panic (Ulrich Windl)
>>    3. pacemaker reports monitor timeout while CPU is high (???)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 10 Jan 2018 12:43:46 +0530
>> From: ashutosh tiwari <ashutosh.kvas at gmail.com>
>> To: users at clusterlabs.org 
>> Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
>>         failure in case of kernel panic
>> Message-ID:
>>         <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb
>> 3Q at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hi,
>>
>> We have a two-node cluster running in active/standby mode with IPMI
>> fencing configured.
>>
>> In case of a kernel panic on the active node, the standby node detects the
>> node failure only after around 30 secs, which delays the standby node
>> taking over the active role.
>>
>> We have the totem token timeout set to 10000 msecs.
>> Please let us know if there is any other configuration controlling
>> membership detection.
>>
>> s/w versions.
>>
>> centos 6.7
>> corosync-1.4.7-5.el6.x86_64
>> pacemaker-1.1.14-8.el6.x86_64
>>
>> Thanks and Regards,
>> Ashutosh Tiwari
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 10 Jan 2018 08:32:16 +0100
>> From: "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de>
>> To: <users at clusterlabs.org>
>> Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
>>         node failure in case of kernel panic
>> Message-ID: <5A55C180020000A100029BD1 at gwsmtp1.uni-regensburg.de>
>> Content-Type: text/plain; charset=US-ASCII
>>
>> Hi!
>>
>> Maybe define "detecting node failure". Could it be that your 30 seconds are
>> between detection and reaction? Logs would help here, too.
>>
>> Regards,
>> Ulrich
>>
>>
>> >>> ashutosh tiwari <ashutosh.kvas at gmail.com> wrote on 10.01.2018 at
>> 08:13 in
>> message
>> <CA+vEgjiKG_VGegT7Q+wCqn6merFNrvegiQs+RHRuxzE=muVb3Q at mail.gmail.com>:
>> > Hi,
>> >
>> > We have a two-node cluster running in active/standby mode with IPMI
>> > fencing configured.
>> >
>> > In case of a kernel panic on the active node, the standby node detects
>> > the node failure only after around 30 secs, which delays the standby node
>> > taking over the active role.
>> >
>> > We have the totem token timeout set to 10000 msecs.
>> > Please let us know if there is any other configuration controlling
>> > membership detection.
>> >
>> > s/w versions.
>> >
>> > centos 6.7
>> > corosync-1.4.7-5.el6.x86_64
>> > pacemaker-1.1.14-8.el6.x86_64
>> >
>> > Thanks and Regards,
>> > Ashutosh Tiwari
>>
>>
>>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Wed, 10 Jan 2018 09:40:51 +0000
>> From: ??? <fanguoteng at highgo.com>
>> To: Cluster Labs - All topics related to open-source clustering
>>         welcomed        <users at clusterlabs.org>
>> Subject: [ClusterLabs] pacemaker reports monitor timeout while CPU is
>>         high
>> Message-ID: <4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello,
>>
>> This issue only appears when we run a performance test and the CPU load is
>> high. The cluster configuration and log are below. Pacemaker restarts the
>> slave-side pgsql-ha resource about every two minutes.
>>
>> Take the following scenario for example: when the pgsqlms RA is called, we
>> print the log "execute the command start (command)". When the command
>> returns, we print the log "execute the command stop (Command) (result)".
>>
>> 1.     We can see that pacemaker calls "pgsqlms monitor" about every 15
>> seconds, and it returns $OCF_SUCCESS.
>>
>> 2.     It calls the monitor command again at 13:56:16 and then reports a
>> timeout error at 13:56:18. That is only 2 seconds, but it reports
>> "timeout=10000ms".
>>
>> 3.     In other logs, sometimes after 15 minutes there is no "execute the
>> command start monitor" printed, and it reports the timeout error directly.
>>
>> Could you please tell us how to debug or resolve this issue?
>>
>> The log:
>>
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command
>> start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop
>> monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command
>> start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop
>> monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected:
>> 426.779999
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command
>> start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process
>> (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 -
>> timed out after 10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for
>> pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [
>> /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE ->
>> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
>> origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor
>> for pgsqld:0 on db2: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start
>> for pgsqld:1 on db1: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from
>> db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from
>> db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave
>> db2)
>> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37,
>> saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>>
>>
>> The Cluster Configuration:
>> 2 nodes and 13 resources configured
>>
>> Online: [ db1 db2 ]
>>
>> Full list of resources:
>>
>> Clone Set: dlm-clone [dlm]
>>      Started: [ db1 db2 ]
>> Clone Set: clvmd-clone [clvmd]
>>      Started: [ db1 db2 ]
>> ipmi_node1     (stonith:fence_ipmilan):        Started db2
>> ipmi_node2     (stonith:fence_ipmilan):        Started db1
>> Clone Set: clusterfs-clone [clusterfs]
>>      Started: [ db1 db2 ]
>> Master/Slave Set: pgsql-ha [pgsqld]
>>      Masters: [ db1 ]
>>      Slaves: [ db2 ]
>> Resource Group: mastergroup
>>      db1-vip    (ocf::heartbeat:IPaddr2):       Started
>>      rep-vip    (ocf::heartbeat:IPaddr2):       Started
>> Resource Group: slavegroup
>>      db2-vip    (ocf::heartbeat:IPaddr2):       Started
>>
>>
>> pcs resource show pgsql-ha
>> Master: pgsql-ha
>>   Meta Attrs: interleave=true notify=true
>>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>>                promote interval=0s timeout=130s
>> (pgsqld-promote-interval-0s)
>>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>>                monitor interval=15s role=Master timeout=10s
>> (pgsqld-monitor-interval-15s)
>>                monitor interval=16s role=Slave timeout=10s
>> (pgsqld-monitor-interval-16s)
>>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
>>
>> ------------------------------
>>
>>
>>
>> End of Users Digest, Vol 36, Issue 8
>> ************************************
>>



