[ClusterLabs] Antw: Re: Why is node fenced ?

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 14 02:16:19 EDT 2019


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 10.10.2019 at 21:19 in
message
<53c0a1ef1ac0d83d7e3d67dbd0251602bbdd82d1.camel at redhat.com>:
> On Thu, 2019‑10‑10 at 17:22 +0200, Lentes, Bernd wrote:
>> Hi,
>> 
>> I have a two-node cluster running on SLES 12 SP4.
>> I did some testing on it.
>> I put one node (ha-idg-2) into standby; the other (ha-idg-1) got fenced
>> a few minutes later because I made a mistake.
>> ha-idg-2 was the DC. ha-idg-1 did a fresh boot and I started
>> corosync/pacemaker on it.
>> It seems ha-idg-1 didn't find the DC after the cluster stack started,
>> elected itself DC a few seconds later,
>> and afterwards fenced ha-idg-2.
> 
> For some reason, the corosync instances on the two nodes were unable to
> communicate with each other.
> 
> This type of situation is why corosync's two_node option normally
> includes wait_for_all.
> 
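For reference, a quorum section for such a two-node setup would look roughly
like the sketch below (values are illustrative only, not taken from Bernd's
actual corosync.conf):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node: 1 implicitly turns on wait_for_all, so a freshly booted
        # node stays inquorate until it has seen its peer at least once;
        # an explicit wait_for_all: 0 would disable that safety net.
    }
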
>> 
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice  [MAIN  ] Corosync
>> Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync info    [MAIN  ] Corosync
>> built‑in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice  [TOTEM ]
>> Initializing transport (UDP/IP Multicast).
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice  [TOTEM ]
>> Initializing transmit/receive security (NSS) crypto: aes256 hash:
>> sha1
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice  [TOTEM ] The network
>> interface [192.168.100.10] is now up.
>> 
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped
>> (20000ms)
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:  warning: do_log:   Input
>> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> do_state_transition:      State transition S_PENDING ‑> S_ELECTION |
>> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> election_check:   election‑DC won by local node
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info: do_log:   Input
>> I_ELECTION_DC received in state S_ELECTION from election_win_cb
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:   notice:
>> do_state_transition:      State transition S_ELECTION ‑>
>> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL
>> origin=election_win_cb
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> do_te_control:    Registering TE UUID: f302e1d4‑a1aa‑4a3e‑b9dd‑
>> 71bd17047f82
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> set_graph_functions:      Setting custom graph functions
>> Oct 09 18:05:06 [9565] ha‑idg‑1       crmd:     info:
>> do_dc_takeover:   Taking over DC status for this partition
>> 
>> Oct 09 18:05:07 [9564] ha‑idg‑1    pengine:  warning:
>> stage6:   Scheduling Node ha‑idg‑2 for STONITH
>> Oct 09 18:05:07 [9564] ha‑idg‑1    pengine:   notice:
>> LogNodeActions:    * Fence (Off) ha‑idg‑2 'node is unclean'
>> 
>> Is my understanding correct?
> 
> Yes
> 
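The "Election Trigger ... just popped (20000ms)" above should be the
dc-deadtime cluster property, i.e. how long a freshly started crmd waits for
an existing DC to show up before holding its own election; 20s is the default.
One way to check whether it has been changed (command is just an example):

    # an error here simply means the property is unset and the 20s default applies
    crm_attribute --type crm_config --name dc-deadtime --query
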
>> In the log of ha-idg-2 I don't find anything for this period:
>> 
>> Oct 09 17:58:46 [12504] ha‑idg‑2 stonith‑ng:     info:
>> cib_device_update:       Device fence_ilo_ha‑idg‑2 has been disabled
>> on ha‑idg‑2: score=‑10000
>> Oct 09 17:58:51 [12503] ha‑idg‑2        cib:     info:
>> cib_process_ping:        Reporting our current digest to ha‑idg‑2:
>> 59c4cfb14defeafbeb3417e222242cd9 for 2.9506.36 (0x242b110 0)
>> 
>> Oct 09 18:00:42 [12508] ha‑idg‑2       crmd:     info:
>> throttle_send_command:   New throttle mode: 0001 (was 0000)
>> Oct 09 18:01:12 [12508] ha‑idg‑2       crmd:     info:
>> throttle_check_thresholds:       Moderate CPU load detected:
>> 32.220001
>> Oct 09 18:01:12 [12508] ha‑idg‑2       crmd:     info:
>> throttle_send_command:   New throttle mode: 0010 (was 0001)
>> Oct 09 18:01:42 [12508] ha‑idg‑2       crmd:     info:
>> throttle_send_command:   New throttle mode: 0001 (was 0010)
>> Oct 09 18:02:42 [12508] ha‑idg‑2       crmd:     info:
>> throttle_send_command:   New throttle mode: 0000 (was 0001)
>> 
>> ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on
>> it again:
>> 
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice  [MAIN  ] Corosync
>> Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync info    [MAIN  ] Corosync
>> built‑in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice  [TOTEM ]
>> Initializing transport (UDP/IP Multicast).
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice  [TOTEM ]
>> Initializing transmit/receive security (NSS) crypto: aes256 hash:
>> sha1
>> 
>> What is the meaning of the throttle lines?
> 
> Those messages could definitely be improved. The particular mode values
> indicate no significant CPU load (0000), low load (0001), medium
> (0010), high (0100), or extreme (1000).

Funny: save a few bytes here, but waste many elsewhere ;-)
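If the throttling itself ever became a problem, the load level at which the
crmd switches to the higher modes is derived from the load-threshold cluster
property (80% by default, as far as I know). Purely as an illustration:

    # lower the CPU load at which the crmd starts throttling its own actions
    crm configure property load-threshold="60%"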

> 
> I wouldn't expect a CPU spike to lock up corosync for very long, but it
> could be related somehow.
> 
>> 
>> Thanks.
>> 
>> 
>> Bernd
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 




