[ClusterLabs] Node lost early in HA startup --> no STONITH
lists at alteeve.ca
Sun Aug 2 23:11:34 EDT 2015
On 02/08/15 11:10 PM, Chris Walker wrote:
> We recently had an unfortunate sequence on our two-node cluster (nodes
> n02 and n03) that can be summarized as:
> 1. n03 became pathologically busy and was STONITHed by n02
> 2. The heavy load migrated to n02, which also became pathologically busy
> 3. n03 was rebooted
> 4. During the startup of HA on n03, n02 was initially seen by n03:
> Jul 26 15:23:43 n03 crmd: : info: crm_update_peer_proc: n02.ais
> is now online
> 5. But later during the startup sequence (after DC election and CIB
> sync) we see n02 die (n02 is really wrapped around the axle, many stuck
> threads, etc)
> Jul 26 15:27:44 n03 heartbeat: : WARN: node n02: is dead
> Jul 26 15:27:45 n03 crmd: : info: ais_status_callback: status:
> n02 is now lost (was member)
> Our deadtime is 240 seconds, so n02 became unresponsive almost
> immediately after n03 reported it up at 15:23:43.
> 6. The troubling aspect of this incident is that even though there are
> multiple STONITH resources configured for n02, none of them was engaged
> and n03 then mounted filesystems that were also active on n02.
> I'm wondering whether the fact that no STONITH resources were started by
> this time explains why n02 was not STONITHed. Shortly after n02 is
> declared dead we see STONITH resources begin starting, e.g.,
> Jul 26 15:27:47 n03 pengine: : notice: LogActions: Start
> n03-3-ipmi-stonith (n03)
> Is it the case that, because there were no active STONITH resources when
> n02 was declared dead, no STONITH action was taken against this node? Is
> there a fix/workaround for this scenario (we're using heartbeat 3.0.5
> and pacemaker 1.1.6 (RHEL6.2))?
> Thanks very much!
Please share your full config and the logs from both nodes through the
duration of the events.
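
As a quick sanity check on the timing in step 5, here is a minimal sketch
that works backwards from the "is dead" message using the stated deadtime.
It assumes heartbeat declares a node dead roughly one deadtime after the
last heartbeat it received, and it assumes the year 2015 (the log lines do
not carry one). The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Deadtime as stated in the quoted message.
DEADTIME = timedelta(seconds=240)

# Timestamps copied from the quoted n03 log lines.
# The year is an assumption; syslog lines do not include it.
seen_online = datetime(2015, 7, 26, 15, 23, 43)    # "n02.ais is now online"
declared_dead = datetime(2015, 7, 26, 15, 27, 44)  # "node n02: is dead"

# Assumption: the node is declared dead one deadtime after the last
# heartbeat was received, so subtracting gives the approximate time
# n02 went silent.
last_heartbeat_estimate = declared_dead - DEADTIME
gap = last_heartbeat_estimate - seen_online

print("n02 last heard from around:", last_heartbeat_estimate.time())
print("seconds after it was reported online:", gap.total_seconds())
```

Under those assumptions n02 went quiet about one second after n03 first
saw it, which matches the "almost immediately" reading in the original
message.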
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?