[ClusterLabs] Re: Why is node fenced?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Oct 14 02:16:19 EDT 2019
>>> Ken Gaillot <kgaillot at redhat.com> wrote on 10.10.2019 at 21:19 in
message
<53c0a1ef1ac0d83d7e3d67dbd0251602bbdd82d1.camel at redhat.com>:
> On Thu, 2019‑10‑10 at 17:22 +0200, Lentes, Bernd wrote:
>> Hi,
>>
>> I have a two‑node cluster running on SLES 12 SP4.
>> I did some testing on it.
>> I put one node (ha‑idg‑2) into standby; the other (ha‑idg‑1) got fenced
>> a few minutes later because I made a mistake.
>> ha‑idg‑2 was DC. ha‑idg‑1 made a fresh boot and I started
>> corosync/pacemaker on it.
>> It seems ha‑idg‑1 didn't find the DC after starting the cluster, some
>> seconds later elected itself DC,
>> and afterwards fenced ha‑idg‑2.
>
> For some reason, the corosync instances on the two nodes were not able
> to communicate with each other.
>
> This type of situation is why corosync's two_node option normally
> includes wait_for_all.
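For reference, with corosync 2.x votequorum a two‑node setup typically
carries a quorum section like the sketch below (values illustrative, not
taken from Bernd's configuration). two_node: 1 implicitly enables
wait_for_all, so a freshly booted node waits until it has seen its peer
at least once before it may assume quorum on its own:

    quorum {
        provider: corosync_votequorum
        # two-node mode: stay quorate with only one of the two nodes up
        two_node: 1
        # wait_for_all is implied by two_node: a freshly rebooted node
        # must see its peer once before it may claim quorum on its own
        # wait_for_all: 1
    }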
>
>>
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice [MAIN ] Corosync
>> Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync info [MAIN ] Corosync
>> built‑in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice [TOTEM ]
>> Initializing transport (UDP/IP Multicast).
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice [TOTEM ]
>> Initializing transmit/receive security (NSS) crypto: aes256 hash:
>> sha1
>> Oct 09 18:04:43 [9550] ha‑idg‑1 corosync notice [TOTEM ] The network
>> interface [192.168.100.10] is now up.
>>
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped
>> (20000ms)
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: warning: do_log: Input
>> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> do_state_transition: State transition S_PENDING ‑> S_ELECTION |
>> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> election_check: election‑DC won by local node
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info: do_log: Input
>> I_ELECTION_DC received in state S_ELECTION from election_win_cb
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: notice:
>> do_state_transition: State transition S_ELECTION ‑>
>> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL
>> origin=election_win_cb
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> do_te_control: Registering TE UUID: f302e1d4‑a1aa‑4a3e‑b9dd‑
>> 71bd17047f82
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> set_graph_functions: Setting custom graph functions
>> Oct 09 18:05:06 [9565] ha‑idg‑1 crmd: info:
>> do_dc_takeover: Taking over DC status for this partition
>>
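Side note: the 20000ms election trigger above corresponds to the
dc-deadtime cluster property (default 20s), i.e. how long a freshly
started crmd waits to hear from an existing DC before holding its own
election. On a cluster whose peers are known to come up slowly it can be
raised, e.g. with crmsh (the 60s value is purely illustrative):

    crm configure property dc-deadtime=60s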
>> Oct 09 18:05:07 [9564] ha‑idg‑1 pengine: warning:
>> stage6: Scheduling Node ha‑idg‑2 for STONITH
>> Oct 09 18:05:07 [9564] ha‑idg‑1 pengine: notice:
>> LogNodeActions: * Fence (Off) ha‑idg‑2 'node is unclean'
>>
>> Is my understanding correct?
>
> Yes
>
>> In the log of ha‑idg‑2 I don't find anything for this period:
>>
>> Oct 09 17:58:46 [12504] ha‑idg‑2 stonith‑ng: info:
>> cib_device_update: Device fence_ilo_ha‑idg‑2 has been disabled
>> on ha‑idg‑2: score=‑10000
>> Oct 09 17:58:51 [12503] ha‑idg‑2 cib: info:
>> cib_process_ping: Reporting our current digest to ha‑idg‑2:
>> 59c4cfb14defeafbeb3417e222242cd9 for 2.9506.36 (0x242b110 0)
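The "has been disabled ... score=-10000" line above presumably reflects
a location constraint that keeps the node's own fence device off itself,
something like this crmsh sketch (the constraint id is invented):

    location l-fence-ha-idg-2 fence_ilo_ha-idg-2 -10000: ha-idg-2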
>>
>> Oct 09 18:00:42 [12508] ha‑idg‑2 crmd: info:
>> throttle_send_command: New throttle mode: 0001 (was 0000)
>> Oct 09 18:01:12 [12508] ha‑idg‑2 crmd: info:
>> throttle_check_thresholds: Moderate CPU load detected:
>> 32.220001
>> Oct 09 18:01:12 [12508] ha‑idg‑2 crmd: info:
>> throttle_send_command: New throttle mode: 0010 (was 0001)
>> Oct 09 18:01:42 [12508] ha‑idg‑2 crmd: info:
>> throttle_send_command: New throttle mode: 0001 (was 0010)
>> Oct 09 18:02:42 [12508] ha‑idg‑2 crmd: info:
>> throttle_send_command: New throttle mode: 0000 (was 0001)
>>
>> ha‑idg‑2 was fenced, and after a reboot I started corosync/pacemaker
>> on it again:
>>
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice [MAIN ] Corosync
>> Cluster Engine ('2.3.6'): started and ready to provide service.
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync info [MAIN ] Corosync
>> built‑in features: debug testagents augeas systemd pie relro bindnow
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice [TOTEM ]
>> Initializing transport (UDP/IP Multicast).
>> Oct 09 18:29:05 [11795] ha‑idg‑2 corosync notice [TOTEM ]
>> Initializing transmit/receive security (NSS) crypto: aes256 hash:
>> sha1
>>
>> What is the meaning of the throttle lines?
>
> Those messages could definitely be improved. The particular mode values
> indicate no significant CPU load (0000), low load (0001), medium
> (0010), high (0100), or extreme (1000).
Funny: save a few bytes here, but waste many elsewhere ;-)
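The log strings line up with the bit flags crmd uses internally for the
throttle mode; roughly (a C sketch modeled on pacemaker's throttle code,
not a verbatim copy):

    enum throttle_state_e {
        throttle_none    = 0x0000,  /* no significant CPU load */
        throttle_low     = 0x0001,
        throttle_med     = 0x0010,
        throttle_high    = 0x0100,
        throttle_extreme = 0x1000
    };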
>
> I wouldn't expect a CPU spike to lock up corosync for very long, but it
> could be related somehow.
>
>>
>> Thanks.
>>
>>
>> Bernd
> ‑‑
> Ken Gaillot <kgaillot at redhat.com>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/