[Pacemaker] Crash this afternoon, want to verify I'm understanding this configuration correctly

mark - pacemaker list m+pacemaker at nerdish.us
Thu Feb 23 18:37:03 EST 2012


Hello,

I have a pretty simple cluster running with three nodes, xen1, xen2, and
qnode (which runs in standby at all times and only exists for quorum).
This afternoon xen1 reset out of the blue.  There is nothing in its logs;
in fact, there's a gap from 15:37 to 15:47:

Feb 23 15:36:18 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
0x1d96080)
Feb 23 15:36:18 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
2571780644 should have started at 2571780544
Feb 23 15:37:39 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function
for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource:
0x1d96080)
Feb 23 15:37:39 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at
2571788778 should have started at 2571788678
Feb 23 15:47:53 xen1 xenstored: Checking store ...
Feb 23 15:47:54 xen1 acpid: starting up with netlink and the input layer
Feb 23 15:47:54 xen1 xenstored: Checking store complete.
....


I'm kicking myself for the lack of a serial console, so I really don't know
what happened prior to the reboot.  The two surviving nodes logged these very
similar events, so it appears they could both see each other fine while
neither could see anything from xen1:

Feb 23 15:32:26 qnode pengine: [2096]: info: process_pe_message: Transition
10271: PEngine Input stored in: /var/lib/pengine/pe-input-4479.bz2
Feb 23 15:36:51 qnode cib: [2093]: info: cib_stats: Processed 1 operations
(10000.00us average, 0% utilization) in the last 10min
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] notice: pcmk_peer_update:
Transitional membership event on ring 1772: memb=2, new=0, lost=1
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
memb: xen2 1241622720
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
memb: qnode 1308731584
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
lost: xen1 1224845504
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] notice: pcmk_peer_update:
Stable membership event on ring 1772: memb=2, new=0, lost=0
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
MEMB: xen2 1241622720
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: pcmk_peer_update:
MEMB: qnode 1308731584
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info:
ais_mark_unseen_peer_dead: Node xen1 was not seen in the previous transition
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info: update_member: Node
1224845504/xen1 is now: lost
Feb 23 15:42:02 qnode corosync[2081]:   [pcmk  ] info:
send_member_notification: Sending membership update 1772 to 2 children
Feb 23 15:42:02 qnode corosync[2081]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Feb 23 15:42:02 qnode crmd: [2097]: info: ais_dispatch: Membership 1772:
quorum retained
Feb 23 15:42:02 qnode cib: [2093]: info: ais_dispatch: Membership 1772:
quorum retained
Feb 23 15:42:02 qnode crmd: [2097]: info: ais_status_callback: status: xen1
is now lost (was member)


Those are followed by very verbose logs covering fencing attempts (via SBD)
by both qnode and xen2, which never succeeded, leading me to believe that at
this point xen1 had already fenced itself (either via SBD or softdog) and was
in the middle of POST/bootup.  When xen1 finished booting it rejoined the
cluster just fine.  There are no failure- or alert-type messages in the
logs on any of the three hosts prior to the 15:42:02 corosync entries
above, where xen1 has disappeared.

Based on these entries from corosync.conf:

        # How long before declaring a token lost (ms)
        token: 3000
        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

... am I correct in thinking that the two healthy nodes went at least 30
seconds without seeing anything from xen1, and then started their fencing
attempts?  In the complete absence of any logs with clues as to what
happened, I'm just trying to build a likely scenario for the reset.  My
guess is that xen1 had some kind of panic, or lost networking on all four
interfaces (two iSCSI interfaces to the SBD and storage device, two bonded
interfaces facing the service side of things, where corosync talks).  The
xen servers are PowerEdge R815s running OpenManage, but there are no
events/errors logged there either; no hardware issues logged at all.
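
To spell out the arithmetic behind my "30 seconds" guess, here it is as a
tiny Python sketch.  The assumption that the token timeout and the
retransmit count simply multiply is exactly the part I'd like confirmed or
corrected; the variable names are mine, not corosync's:

    # Back-of-the-envelope arithmetic behind my "30 seconds" guess, under
    # my (possibly wrong) assumption that the token timeout and the
    # retransmit count simply multiply.
    token_ms = 3000       # token: 3000
    retransmits = 10      # token_retransmits_before_loss_const: 10

    assumed_loss_window_s = token_ms * retransmits / 1000.0
    print("assumed time before xen1 is declared lost: %.0fs"
          % assumed_loss_window_s)   # -> 30s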

In any case, this isn't a complaint or issue report, simply an "am I
thinking clearly about what may have happened?" message.  If this ever
happens again, hopefully a serial console will provide answers.
(Side note: for that, I'm thinking of minicom connected to both and
writing logs; is there something nicer that I just don't know about?)
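
For concreteness, this is roughly what I mean by "writing logs", sketched
with pyserial rather than minicom; the device path, baud rate, and log
path below are placeholders, not my actual setup:

    #!/usr/bin/env python
    # Rough sketch: read a node's serial console and append timestamped
    # lines to a file.  Device path, baud rate, and log path are
    # placeholders, not my actual setup.
    import time
    import serial  # pyserial

    PORT = "/dev/ttyS0"                    # placeholder console device
    LOGFILE = "/var/log/xen1-console.log"  # placeholder log path

    with serial.Serial(PORT, 115200, timeout=1) as console, \
            open(LOGFILE, "a") as log:
        while True:
            line = console.readline()      # returns b"" on timeout
            if line:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                log.write("%s %s\n"
                          % (stamp, line.decode(errors="replace").rstrip()))
                log.flush()                # flush so it survives a crash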

Regards,
Mark