[ClusterLabs] 2 node cluster dlm/clvm trouble

Thu Sep 6 17:48:44 UTC 2018

06.09.2018 17:36, Patrick Whitney wrote:
> Good Morning Everyone,
> 
> I'm hoping someone with more experience with corosync and pacemaker can see
> what I am doing wrong.
> 
> I've got a test setup of 2 nodes, with dlm and clvm set up as clones, and
> using fence_scsi as my fencing agent.
> 
> I've got it to the point where the cluster is up, and reports it is happy.
> I then began testing fencing. When issuing 'pcs stonith fence', it appears
> to work; that is, the scsi reservation is pulled and the output of 'pcs
> status' looks sane, and I'm able to access resources on the un-fenced node.
> 
> Things go awry when I shut down (init 0) the fenced node... my unfenced node
> decides to fence itself (which looks like it was initiated by dlm due to an
> abandoned lockspace).
> 
> I suspect this is due to misconfiguration, since I'm new to the toolset,
> but I'm not quite sure what I need to change.
> 
> Any and all input is appreciated!
> 
> Below is a chronology of events; my corosync config and cib.xml; command
> output; and annotated logs.
> 
> Again, any hints, suggestions, wild guesses, or premonitions are welcomed
> -- I'm stuck! Please let me know if there is additional information which
> would be helpful.
> 
> Many thanks,
> -Patrick W.
> 
> Sep  6 08:54:14  -- Cluster is up and running; UI reports everything healthy.
> 
> Sep  6 08:55:44  -- 'pcs stonith fence' called against node 1 (coro-test-1).
>     UI reports everything as expected -- that is, resources show as running
>     only on the unfenced node, and they're available. Oddly, although the UI
>     says dlm is stopped on the fenced node, dlm_controld is still running.
> 
> Sep  6 09:03:38  -- Node 1 is shut down, and node 2 falls to pieces.
>     - First, corosync sees the lost member -- this seems appropriate to me.
>     - Next, dlm_controld calls to fence everything.
>     - stonith-ng tries to fence node 1 (but it's already fenced!).
>     - dlm closes its connection to "node 2" (do dlm "nodes" map to cluster
>       nodes? I'm not sure they do).
>     - The clvmd dlm lockspace is now abandoned; the cluster attempts to fence
>       the remaining node (but can't, because fence_scsi doesn't work like
>       that).
> 
> ***
> ******   -- Configuration --
> ***
> root@coro-test-2:~# pcs --version
> 0.9.149
> root@coro-test-2:~# pacemakerd --version
> Pacemaker 1.1.14

I wonder if https://github.com/ClusterLabs/pacemaker/pull/839 is
relevant here.
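
To narrow this down, it may help to capture what the surviving node sees in the window between the 08:55:44 fence and the 09:03:38 shutdown. The following is only a rough sketch, not a tested procedure: the helper names `run_if_present` and `collect_state` are made up for illustration, `/dev/sdX` is a placeholder for whichever shared device fence_scsi is configured with (the post does not name it), and the exact output of these tools on pcs 0.9.149 / Pacemaker 1.1.14 may differ from newer releases.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if its binary is installed,
# otherwise print a note and carry on with the next check.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" 2>&1
    else
        echo "skip: $1 not found"
    fi
}

# Hypothetical helper: snapshot cluster, DLM and SCSI reservation state.
#   $1  name of the fenced node (defaults to coro-test-1 from the post)
#   $2  shared device used by fence_scsi (placeholder; ASSUMPTION, not in the post)
collect_state() {
    node=${1:-coro-test-1}
    dev=${2:-/dev/sdX}
    run_if_present pcs status                      # resource and node view
    run_if_present dlm_tool ls                     # lockspaces and their members
    run_if_present dlm_tool status                 # dlm_controld's view of nodes/fencing
    run_if_present stonith_admin --history "$node" # fencing actions recorded for the node
    run_if_present sg_persist --in --read-keys --device="$dev"  # registered keys
}
```

Running `collect_state coro-test-1 /dev/sdX` on coro-test-2 right after the manual fence, and again after node 1 is powered off, should show whether dlm still counts node 1 as a lockspace member that needs fencing even though stonith already reported success.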