[Pacemaker] Crash this afternoon, want to verify I'm understanding this configuration correctly

Andrew Beekhof andrew at beekhof.net
Thu Feb 23 18:49:15 EST 2012


On Fri, Feb 24, 2012 at 10:37 AM, mark - pacemaker list
<m+pacemaker at nerdish.us> wrote:
> Hello,
>
> I have a pretty simple cluster running with three nodes, xen1, xen2, and
> qnode (which runs in standby at all times and only exists for quorum).  This
> afternoon xen1 reset out of the blue.  There is nothing in its logs; in fact
> there's a gap from 15:37 to 15:47:
>
> Feb 23 15:36:18 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
> for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
> 0x1d96080)
> Feb 23 15:36:18 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
> 2571780644 should have started at 2571780544
> Feb 23 15:37:39 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
> for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
> 0x1d96080)
> Feb 23 15:37:39 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
> 2571788778 should have started at 2571788678
> Feb 23 15:47:53 xen1 xenstored: Checking store ...
> Feb 23 15:47:54 xen1 acpid: starting up with netlink and the input layer
> Feb 23 15:47:54 xen1 xenstored: Checking store complete.
> ....
>
>
> I'm kicking myself for lack of a serial console, so I really don't know what
> happened prior to the reboot.  The two surviving nodes logged these very
> similar events, so it appears they were both seeing each other fine and
> neither could see anything from xen1:
>
> Feb 23 15:32:26 qnode pengine: [2096]: info: process_pe_message: Transition
> 10271: PEngine Input stored in: /var/lib/pengine/pe-input-4479.bz2
> Feb 23 15:36:51 qnode cib: [2093]: info: cib_stats: Processed 1 operations
> (10000.00us average, 0% utilization) in the last 10min
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] notice: pcmk_peer_update:
> Transitional membership event on ring 1772: memb=2, new=0, lost=1
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
> memb: xen2 1241622720
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
> memb: qnode 1308731584
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
> lost: xen1 1224845504
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] notice: pcmk_peer_update:
> Stable membership event on ring 1772: memb=2, new=0, lost=0
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
> MEMB: xen2 1241622720
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
> MEMB: qnode 1308731584
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info:
> ais_mark_unseen_peer_dead: Node xen1 was not seen in the previous transition
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: update_member: Node
> 1224845504/xen1 is now: lost
> Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info:
> send_member_notification: Sending membership update 1772 to 2 children
> Feb 23 15:42:02 qnode corosync[2081]:   [TOTEM ] A processor joined or left
> the membership and a new membership was formed.
> Feb 23 15:42:02 qnode crmd: [2097]: info: ais_dispatch: Membership 1772:
> quorum retained
> Feb 23 15:42:02 qnode cib: [2093]: info: ais_dispatch: Membership 1772:
> quorum retained
> Feb 23 15:42:02 qnode crmd: [2097]: info: ais_status_callback: status: xen1
> is now lost (was member)
>
>
> Those are followed by very verbose logs covering fencing attempts (via SBD)
> by both qnode and xen2, none of which succeeded, leading me to believe that at
> this point xen1 had fenced itself, either via SBD or softdog, and was in
> the process of POST/bootup.  When xen1 finished booting it rejoined the
> cluster just fine.  There are no failure- or alert-type messages in the logs
> on any of the three hosts prior to the 15:42:02 corosync entries above,
> where xen1 has disappeared.
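
If the self-fence theory holds, it is worth double-checking what timeouts are
actually written to the SBD device header; a quick way to see them (the device
path below is only a placeholder for your real one):

        sbd -d /dev/disk/by-id/<your-sbd-device> dump

That prints the on-disk header, including Timeout (watchdog) and
Timeout (msgwait); the watchdog timeout is roughly how long a node keeps
running once sbd can no longer service the device before the watchdog
resets it.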
>
> Based on these entries from corosync.conf:
>
>         # How long before declaring a token lost (ms)
>         token: 3000
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 10
>
> ... am I correct in thinking that the two healthy nodes went at least 30
> seconds without seeing anything from xen1, then started their fencing
> attempts?

Certainly sounds reasonable.  Not 100% sure whether the timeout is a simple
"token * retransmits" calculation, but it wouldn't be far off.

>  In the complete absence of any logs with clues as to what
> happened, I'm just trying to build a likely scenario.  My
> guess is that xen1 had some kind of a panic or loss of network on all four
> interfaces (two iSCSI interfaces to the SBD and storage device, two bonded
> interfaces facing the service-side of things, where corosync talks).  The
> xen servers are PowerEdge R815s and OpenManage is running, but there are no
> events/errors logged with it either, no hardware issues logged at all.
>
> In any case, this isn't a complaint or issue report, simply an "am I
> thinking clearly about what may have happened?" message.  If this ever
> happens in the future, hopefully a serial console will provide answers.
>  (Side note:  For that, I'm thinking minicom connected to both and writing
> logs; is there something nicer that I just don't know about?)
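
conserver is the usual "nicer minicom" for logging several serial consoles to
files.  netconsole can be a lightweight complement: it forwards kernel printk
output (including oops/panic traces) over UDP to another host.  A minimal
sketch, where the addresses, interface and MAC are placeholders, not your
real values:

        # on xen1 (the node being watched)
        modprobe netconsole netconsole=6665@10.0.0.11/eth0,6666@10.0.0.50/00:16:3e:aa:bb:cc
        # on the receiving host (some netcat variants need -p 6666)
        nc -u -l 6666 | tee xen1-console.log

It only captures kernel messages, so a hard hang with no panic output still
leaves you blind, which is why the serial console remains worth wiring up.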
>
> Regards,
> Mark
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
