[Pacemaker] Trouble getting node to re-join two node cluster (OCFS2/DRBD Primary/Primary)

Sun Sep 25 23:10:45 EDT 2011

On Fri, Sep 16, 2011 at 6:24 AM, Mike Reid <mbreid at thepei.com> wrote:
> Hello all,
>
> We have a two-node cluster still in development that has been running fine
> for weeks (little to no traffic). I made some updates to our CIB recently,
> and everything seemed just fine.
>
> Yesterday I attempted to untar ~1.5GB to the OCFS2/DRBD volume, and once it
> was complete, one of the nodes had become completely disconnected and I
> haven't been able to reconnect it since.
>
> DRBD is working fine,

OCFS2 doesn't seem to think so:

write(2, "mount.ocfs2", 11)             = 11
write(2, ": ", 2)                       = 2
write(2, "I/O error on channel", 20)    = 20
write(2, " ", 1)                        = 1
write(2, "while opening device /dev/drbd0", 31) = 31

My guess: if one side got corrupted, DRBD would have sync'd that corruption
to the other side too, and this is the error you'd see.
Maybe someone on the ocfs2 list can help you look at the on-disk
metadata to see what shape the FS is in.
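
For readability, the write() calls in the strace excerpt above can be joined
back into the single message that mount.ocfs2 printed to stderr. This is only
a minimal sketch: the helper name reassemble() is made up for illustration,
the regex assumes strace's default write() formatting as shown above, and
escape sequences inside the quoted strings are left undecoded.

```python
import re

# Matches strace lines of the form: write(2, "text", 11) = 11
WRITE_RE = re.compile(r'^write\((\d+), "((?:[^"\\]|\\.)*)", \d+\)\s*=\s*(-?\d+)')


def reassemble(trace_lines, fd=2):
    """Join the payloads of successful write() calls on one file descriptor."""
    parts = []
    for line in trace_lines:
        m = WRITE_RE.match(line.strip())
        # Keep only writes to the requested fd that did not return an error.
        if m and int(m.group(1)) == fd and int(m.group(3)) >= 0:
            parts.append(m.group(2))
    return "".join(parts)


# The five lines quoted from the mount trace.
TRACE = [
    'write(2, "mount.ocfs2", 11)             = 11',
    'write(2, ": ", 2)                       = 2',
    'write(2, "I/O error on channel", 20)    = 20',
    'write(2, " ", 1)                        = 1',
    'write(2, "while opening device /dev/drbd0", 31) = 31',
]

print(reassemble(TRACE))
# mount.ocfs2: I/O error on channel while opening device /dev/drbd0
```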

> everything is UpToDate and I can get both nodes in
> Primary/Primary, but when it comes down to starting OCFS2 and mounting the
> volume, I'm left with:
>
>> resFS:0_start_0 (node=node1, call=21, rc=1, status=complete): unknown error
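
The failed-action line above packs several fields into one string. A small
sketch of splitting it apart follows; parse_failed_op() is a hypothetical
helper, and the regex assumes the line has exactly the layout quoted above.
The rc table uses the standard OCF exit codes: rc=1 is OCF_ERR_GENERIC, so
the line by itself only says that the start failed, not why.

```python
import re

# Standard OCF resource agent exit codes.
OCF_RC = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Assumed layout: <resource>_<action>_<interval> (node=..., call=..., rc=..., status=...): <text>
FAILED_OP_RE = re.compile(
    r"^(?P<resource>.+)_(?P<action>[a-z]+)_(?P<interval>\d+) "
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), "
    r"rc=(?P<rc>\d+), status=(?P<status>[^)]+)\): (?P<text>.*)$"
)


def parse_failed_op(line):
    """Return the fields of a failed-action line as a dict, or None if it does not match."""
    m = FAILED_OP_RE.match(line.strip())
    if m is None:
        return None
    info = m.groupdict()
    for key in ("interval", "call", "rc"):
        info[key] = int(info[key])
    info["rc_name"] = OCF_RC.get(info["rc"], "unknown")
    return info


LINE = "resFS:0_start_0 (node=node1, call=21, rc=1, status=complete): unknown error"
print(parse_failed_op(LINE))
```
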
>
> I am using "pcmk" as the cluster_stack, and letting Pacemaker control
> everything...
>
> The last time this happened the only way I was able to resolve it was to
> reformat the device (via mkfs.ocfs2 -F). I don't think I should have to do
> this: the underlying blocks seem fine, and one of the nodes is running just
> fine. The (currently) unmounted node is staying in sync as far as DRBD is
> concerned.
>
> Here's some detail that will hopefully help. Please let me know if there's
> anything else I can provide to help find the best way to get this node back
> "online":
>
>
> Ubuntu 10.10 / Kernel 2.6.35
>
> Pacemaker 1.0.9.1
> Corosync 1.2.1
> Cluster Agents 1.0.3 (Heartbeat)
> Cluster Glue 1.0.6
> OpenAIS 1.1.2
>
> DRBD 8.3.10
> OCFS2 1.5.0
>
> cat /sys/fs/ocfs2/cluster_stack = pcmk
>
> node1: mounted.ocfs2 -d
>
> Device                FS     UUID                                  Label
> /dev/sda3             ocfs2  fe4273e1-f866-4541-bbcf-66c5dfd496d6
>
> node2: mounted.ocfs2 -d
>
> Device                FS     UUID                                  Label
> /dev/sda3             ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
> /dev/drbd0            ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
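
The two mounted.ocfs2 -d listings above are easier to compare side by side.
A rough sketch follows, with the two outputs pasted in as strings; the helper
names parse_mounted() and differing_devices() are made up for illustration,
and the parser assumes the whitespace-separated column layout shown above.
It only reports where the listings disagree and does not say which side is
correct.

```python
def parse_mounted(output):
    """Map device path -> UUID from 'mounted.ocfs2 -d' style output."""
    devices = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows look like: <device> ocfs2 <uuid> [label]; the header row is skipped.
        if len(fields) >= 3 and fields[1] == "ocfs2":
            devices[fields[0]] = fields[2]
    return devices


def differing_devices(a, b):
    """Return the sorted devices whose UUID differs or that appear on one side only."""
    return sorted(dev for dev in set(a) | set(b) if a.get(dev) != b.get(dev))


NODE1 = """\
Device                FS     UUID                                  Label
/dev/sda3             ocfs2  fe4273e1-f866-4541-bbcf-66c5dfd496d6
"""

NODE2 = """\
Device                FS     UUID                                  Label
/dev/sda3             ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
/dev/drbd0            ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
"""

node1 = parse_mounted(NODE1)
node2 = parse_mounted(NODE2)
for dev in differing_devices(node1, node2):
    print(dev, "node1:", node1.get(dev), "node2:", node2.get(dev))
```

With the listings as posted, both /dev/sda3 (different UUIDs) and /dev/drbd0
(listed on node2 only) are reported.
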
>
> * NOTES:
> - Both nodes are identical, in fact one node is a direct mirror (hdd clone)
> - I have attached the CIB (crm configure edit contents) and mount trace
>