[Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

Darren.Mansell at opengi.co.uk
Thu Sep 29 11:06:10 EDT 2011


Sorry for top-posting, I'm Outlook-afflicted.

This is also my problem. The full production environment will have low-level hardware fencing via IBM RSA/ASM, but this is a VMware test environment. The VMware STONITH plugin is dated and doesn't seem to work correctly (I gave up quickly after the plugin's author said on this list that it probably wouldn't work), and SSH STONITH seems to have been removed, not that it would do much good in this situation anyway.

That leaves no way to set up STONITH in a VMware test environment, which is where I believe a lot of people architect their solutions these days, and therefore no way to prove that a solution works.

I'll attempt to modify and improve the VMware STONITH agent, but I'm not sure how STONITH helps in this scenario: one node has gone away, leaving a single survivor, and it's the survivor that then fails. Is this where the suicide agent comes in?
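
For what it's worth, this is the sort of test-only configuration I had in mind, assuming the external/ssh plugin from cluster-glue were still available (a sketch, untested on SLES 11 SP1):

primitive st-ssh stonith:external/ssh \
        params hostlist="test-odp-01 test-odp-02" \
        op monitor interval="60m" timeout="20s"
clone cl-st-ssh st-ssh
property stonith-enabled="true"

Of course an SSH-based agent can't fence a node whose OS has locked up, so it would only validate the cluster logic, not real fencing.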

Regards,
Darren

-----Original Message-----
From: Nick Khamis [mailto:symack at gmail.com] 
Sent: 29 September 2011 15:48
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1

Hello Dejan,

Sorry to hijack; I'm also working on the same type of setup as a prototype.
What is the best way to get STONITH included in VM setups? Maybe an SSH STONITH?
Again, this is just for the prototype.

Cheers,

Nick.

On Thu, Sep 29, 2011 at 9:28 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi Darren,
>
> On Thu, Sep 29, 2011 at 02:15:34PM +0100, Darren.Mansell at opengi.co.uk wrote:
>> (Originally sent to DRBD-user, reposted here as it may be more
>> relevant.)
>>
>> Hello all.
>>
>> I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2
>> for a dual-primary shared FS.
>>
>> I've followed the instructions on the DRBD applications site and it
>> works really well.
>>
>> However, if I 'pull the plug' on a node, the other node continues to
>> operate the clones, but the filesystem is locked and inaccessible
>> (the monitor op works for the Filesystem resource, but fails for the
>> OCFS2 resource).
>>
>> If I reboot one node instead, there are no problems and I can continue
>> to access the OCFS2 FS.
>>
>> After I pull the plug:
>>
>> Online: [ test-odp-02 ]
>> OFFLINE: [ test-odp-01 ]
>>
>> Resource Group: Load-Balancing
>>      Virtual-IP-ODP     (ocf::heartbeat:IPaddr2):       Started test-odp-02
>>      Virtual-IP-ODPWS   (ocf::heartbeat:IPaddr2):       Started test-odp-02
>>      ldirectord (ocf::heartbeat:ldirectord):    Started test-odp-02
>> Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]
>>      Masters: [ test-odp-02 ]
>>      Stopped: [ p_drbd_ocfs2:1 ]
>> Clone Set: cl-odp [odp]
>>      Started: [ test-odp-02 ]
>>      Stopped: [ odp:1 ]
>> Clone Set: cl-odpws [odpws]
>>      Started: [ test-odp-02 ]
>>      Stopped: [ odpws:1 ]
>> Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
>>      Started: [ test-odp-02 ]
>>      Stopped: [ p_fs_ocfs2:1 ]
>> Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
>>      Started: [ test-odp-02 ]
>>      Stopped: [ g_ocfs2mgmt:1 ]
>>
>> Failed actions:
>>     p_o2cb:0_monitor_10000 (node=test-odp-02, call=19, rc=-2,
>>         status=Timed Out): unknown exec error
>>
>> test-odp-02:~ # mount
>> /dev/drbd0 on /opt/odp type ocfs2 (rw,_netdev,noatime,cluster_stack=pcmk)
>>
>> test-odp-02:~ # ls /opt/odp
>> ...just hangs forever...
>>
>> If I then power test-odp-01 back on, everything fails back fine and
>> the ls command suddenly completes.
>>
>> It seems to me that OCFS2 is trying to talk to the node that has
>> disappeared and never times out. Does anyone have any ideas?
>> (CRM and DRBD configs attached.)
>
> With stonith disabled, I doubt that your cluster can behave as it 
> should.
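>
> The hang itself is expected, by the way: OCFS2 sits on the DLM, and
> dlm_controld freezes its lockspaces when a member disappears until
> fencing confirms that the node is down. With stonith-enabled=false that
> confirmation never comes, so the filesystem stays frozen, and it thaws
> the moment the peer rejoins, which matches what you see. Something
> along these lines on the surviving node should confirm it (a sketch;
> the exact dlm_tool output varies between versions):
>
>   test-odp-02:~ # dlm_tool ls    # lockspace stuck waiting for recovery
>   test-odp-02:~ # crm configure show | grep stonith-enabled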
>
> Thanks,
>
> Dejan
>
>>
>> Many thanks.
>>
>> Darren Mansell
>
> Content-Description: crm.txt
>> node test-odp-01
>> node test-odp-02 \
>>         attributes standby="off"
>> primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
>>         params lvs_support="true" ip="2.21.15.100" cidr_netmask="8" broadcast="2.255.255.255" \
>>         op monitor interval="1m" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
>>         params lvs_support="true" ip="2.21.15.103" cidr_netmask="8" broadcast="2.255.255.255" \
>>         op monitor interval="1m" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive ldirectord ocf:heartbeat:ldirectord \
>>         params configfile="/etc/ha.d/ldirectord.cf" \
>>         op monitor interval="2m" timeout="20s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive odp lsb:odp \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive odpwebservice lsb:odpws \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive p_controld ocf:pacemaker:controld \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive p_drbd_ocfs2 ocf:linbit:drbd \
>>         params drbd_resource="r0" \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
>>         params device="/dev/drbd/by-res/r0" directory="/opt/odp" fstype="ocfs2" options="rw,noatime" \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> primitive p_o2cb ocf:ocfs2:o2cb \
>>         op monitor interval="10s" enabled="true" timeout="10s" \
>>         meta migration-threshold="10" failure-timeout="600"
>> group Load-Balancing Virtual-IP-ODP Virtual-IP-ODPWS ldirectord
>> group g_ocfs2mgmt p_controld p_o2cb
>> ms ms_drbd_ocfs2 p_drbd_ocfs2 \
>>         meta master-max="2" clone-max="2" notify="true"
>> clone cl-odp odp
>> clone cl-odpws odpws
>> clone cl_fs_ocfs2 p_fs_ocfs2 \
>>         meta target-role="Started"
>> clone cl_ocfs2mgmt g_ocfs2mgmt \
>>         meta interleave="true"
>> location Prefer-Node1 ldirectord \
>>         rule $id="prefer-node1-rule" 100: #uname eq test-odp-01
>> order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
>> order tomcatlast1 inf: cl_fs_ocfs2 cl-odp
>> order tomcatlast2 inf: cl_fs_ocfs2 cl-odpws
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="2" \
>>         no-quorum-policy="ignore" \
>>         start-failure-is-fatal="false" \
>>         stonith-action="reboot" \
>>         stonith-enabled="false" \
>>         last-lrm-refresh="1317207361"
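>
> One more note: for dual-primary, DRBD itself should also carry a
> fence-peer hook in addition to cluster stonith, so that it suspends I/O
> until the failed peer has been dealt with. A sketch of the relevant
> drbd.conf section, assuming DRBD 8.3 and the handler scripts shipped
> with the stock packages:
>
>   resource r0 {
>     disk {
>       fencing resource-and-stonith;
>     }
>     handlers {
>       fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>       after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>     }
>   }
>
> crm-fence-peer.sh works by inserting a constraint into the CIB, so it
> complements stonith rather than replacing it.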

_______________________________________________
Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



