[ClusterLabs] Why is node fenced ?
Ken Gaillot
kgaillot at redhat.com
Thu Oct 10 15:19:20 EDT 2019
On Thu, 2019-10-10 at 17:22 +0200, Lentes, Bernd wrote:
> Hi,
>
> I have a two-node cluster running on SLES 12 SP4.
> I did some testing on it.
> I put one node into standby (ha-idg-2); the other (ha-idg-1) got fenced a
> few minutes later because I made a mistake.
> ha-idg-2 was DC. ha-idg-1 booted fresh and I started
> corosync/pacemaker on it.
> It seems ha-idg-1 didn't find the DC after starting the cluster, and some
> seconds later elected itself DC,
> then fenced ha-idg-2.
For some reason, corosync on the two nodes could not communicate with
each other.
This type of situation is why corosync's two_node option normally
enables wait_for_all, which keeps a freshly started node from becoming
quorate until it has seen its peer.
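As a rough sketch of the relevant corosync.conf section (assuming the
votequorum provider; the values are illustrative and not taken from this
cluster's actual configuration):

```
quorum {
    provider: corosync_votequorum
    # two_node: 1 turns on wait_for_all by default
    two_node: 1
    # Setting this to 0 would override that default; shown explicitly
    # here only to make the intended behaviour obvious
    wait_for_all: 1
}
```

On a running node, the output of `corosync-quorumtool -s` should list the
active votequorum flags, which is one way to check whether WaitForAll is
actually in effect.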
>
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [MAIN ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:04:43 [9550] ha-idg-1 corosync info [MAIN ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
> Oct 09 18:04:43 [9550] ha-idg-1 corosync notice [TOTEM ] The network
> interface [192.168.100.10] is now up.
>
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped
> (20000ms)
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: warning: do_log: Input
> I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> do_state_transition: State transition S_PENDING -> S_ELECTION |
> input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> election_check: election-DC won by local node
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info: do_log: Input
> I_ELECTION_DC received in state S_ELECTION from election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: notice:
> do_state_transition: State transition S_ELECTION ->
> S_INTEGRATION | input=I_ELECTION_DC cause=C_FSA_INTERNAL
> origin=election_win_cb
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> do_te_control: Registering TE UUID: f302e1d4-a1aa-4a3e-b9dd-
> 71bd17047f82
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> set_graph_functions: Setting custom graph functions
> Oct 09 18:05:06 [9565] ha-idg-1 crmd: info:
> do_dc_takeover: Taking over DC status for this partition
>
> Oct 09 18:05:07 [9564] ha-idg-1 pengine: warning:
> stage6: Scheduling Node ha-idg-2 for STONITH
> Oct 09 18:05:07 [9564] ha-idg-1 pengine: notice:
> LogNodeActions: * Fence (Off) ha-idg-2 'node is unclean'
>
> Is my understanding correct ?
Yes
> In the log of ha-idg-2 I don't find anything for this period:
>
> Oct 09 17:58:46 [12504] ha-idg-2 stonith-ng: info:
> cib_device_update: Device fence_ilo_ha-idg-2 has been disabled
> on ha-idg-2: score=-10000
> Oct 09 17:58:51 [12503] ha-idg-2 cib: info:
> cib_process_ping: Reporting our current digest to ha-idg-2:
> 59c4cfb14defeafbeb3417e222242cd9 for 2.9506.36 (0x242b110 0)
>
> Oct 09 18:00:42 [12508] ha-idg-2 crmd: info:
> throttle_send_command: New throttle mode: 0001 (was 0000)
> Oct 09 18:01:12 [12508] ha-idg-2 crmd: info:
> throttle_check_thresholds: Moderate CPU load detected:
> 32.220001
> Oct 09 18:01:12 [12508] ha-idg-2 crmd: info:
> throttle_send_command: New throttle mode: 0010 (was 0001)
> Oct 09 18:01:42 [12508] ha-idg-2 crmd: info:
> throttle_send_command: New throttle mode: 0001 (was 0010)
> Oct 09 18:02:42 [12508] ha-idg-2 crmd: info:
> throttle_send_command: New throttle mode: 0000 (was 0001)
>
> ha-idg-2 was fenced, and after a reboot I started corosync/pacemaker on
> it again:
>
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [MAIN ] Corosync
> Cluster Engine ('2.3.6'): started and ready to provide service.
> Oct 09 18:29:05 [11795] ha-idg-2 corosync info [MAIN ] Corosync
> built-in features: debug testagents augeas systemd pie relro bindnow
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ]
> Initializing transport (UDP/IP Multicast).
> Oct 09 18:29:05 [11795] ha-idg-2 corosync notice [TOTEM ]
> Initializing transmit/receive security (NSS) crypto: aes256 hash:
> sha1
>
> What is the meaning of the lines with the throttle ?
Those messages could definitely be improved. The particular mode values
indicate no significant CPU load (0000), low load (0001), medium
(0010), high (0100), or extreme (1000).
I wouldn't expect a CPU spike to lock up corosync for very long, but it
could be related somehow.
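To make those log lines easier to read, here is a minimal Python sketch
that translates the mode codes listed above into names. The helper name
describe_throttle_change and the level labels are made up for illustration;
only the codes and the message format come from the logs in this thread.

```python
import re

# Throttle mode codes as listed above (logged as four digits).
THROTTLE_MODES = {
    "0000": "none",
    "0001": "low",
    "0010": "medium",
    "0100": "high",
    "1000": "extreme",
}

# Matches the message text of a crmd "New throttle mode" log entry.
MODE_RE = re.compile(r"New throttle mode: (\w{4}) \(was (\w{4})\)")


def describe_throttle_change(message):
    """Return (new_level, old_level) for a throttle message, or None."""
    match = MODE_RE.search(message)
    if match is None:
        return None
    new_code, old_code = match.groups()
    # Codes not in the table are reported as "unknown" rather than guessed.
    return (
        THROTTLE_MODES.get(new_code, "unknown"),
        THROTTLE_MODES.get(old_code, "unknown"),
    )


print(describe_throttle_change(
    "throttle_send_command: New throttle mode: 0010 (was 0001)"
))
# prints ('medium', 'low')
```

Note that the function expects the message as a single string, so log
entries that were wrapped by the mail client need to be rejoined first.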
>
> Thanks.
>
>
> Bernd
--
Ken Gaillot <kgaillot at redhat.com>