[Pacemaker] Configuring LVM and Filesystem resources on top of DRBD

D. J. Draper draperd7772 at hotmail.com
Fri Feb 5 19:33:25 EST 2010


I haven't been able to find any documentation outside of the man pages to help troubleshoot this, so I've come to the experts...

I'm attempting to set up the following:

Services:              NFS and Samba
                 ------------------------
Filesystems:     /mnt/media | /mnt/datusr
                 ------------------------
Replicated LVMs: vgrData0   | vgrData1
                 ------------------------
Block Devices:   drbd0      | drbd1
                 ------------------------
Underlying LVMs: vgData0    | vgData1
                 ------------------------
Disks:           sdb1       | sdb2
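
For reference, one column of that stack comes up by hand roughly like this (a sketch using the names from the table above; the exact LVM scan/activation steps may vary by distribution):

vgchange -ay vgData0                             # activate the underlying VG backing drbd0
drbdadm up data0 && drbdadm primary data0        # attach/connect the drbd resource and promote it
vgscan && vgchange -ay vgrData0                  # detect and activate the replicated VG on top of drbd0
mount -t ext4 /dev/vgrData0/lvrData0 /mnt/media  # mount the replicated LV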

I'm able to get all of this to work manually, without the heartbeat service running. My eventual intended configuration is:

crm configure
primitive drbd_data0 ocf:linbit:drbd params drbd_resource="data0" op monitor interval="15s"
primitive drbd_data1 ocf:linbit:drbd params drbd_resource="data1" op monitor interval="15s"
ms ms_drbd_data0 drbd_data0 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_data1 drbd_data1 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive lvm_data0 ocf:heartbeat:LVM params volgrpname="vgrData0" exclusive="yes" op monitor depth="0" timeout="30" interval="10"
primitive lvm_data1 ocf:heartbeat:LVM params volgrpname="vgrData1" exclusive="yes" op monitor depth="0" timeout="30" interval="10"
primitive fs_data0 ocf:heartbeat:Filesystem params device="/dev/vgrData0/lvrData0" directory="/mnt/media" fstype="ext4"
primitive fs_data1 ocf:heartbeat:Filesystem params device="/dev/vgrData1/lvrData1" directory="/mnt/datusr" fstype="ext4"
primitive ip_data ocf:heartbeat:IPaddr2 params ip="192.168.67.101" nic="eth0"
primitive svc_nfs lsb:nfs
primitive svc_samba lsb:smb
colocation col_data00 inf: ms_drbd_data0:Master ms_drbd_data1:Master
colocation col_data01 inf: ms_drbd_data0:Master lvm_data0
colocation col_data02 inf: ms_drbd_data0:Master fs_data0
colocation col_data03 inf: ms_drbd_data0:Master lvm_data1
colocation col_data04 inf: ms_drbd_data0:Master fs_data1
colocation col_data05 inf: ms_drbd_data0:Master ip_data
colocation col_data06 inf: ms_drbd_data0:Master svc_nfs
colocation col_data07 inf: ms_drbd_data0:Master svc_samba
order ord_data00 inf: ms_drbd_data0:promote ms_drbd_data1:promote
order ord_data01 inf: ms_drbd_data0:promote lvm_data0:start
order ord_data02 inf: lvm_data0:start fs_data0:start
order ord_data03 inf: ms_drbd_data1:promote lvm_data1:start
order ord_data04 inf: lvm_data1:start fs_data1:start
order ord_data05 inf: fs_data0:start fs_data1:start
order ord_data06 inf: fs_data1:start ip_data:start
order ord_data07 inf: ip_data:start svc_nfs:start
order ord_data08 inf: ip_data:start svc_samba:start
commit
bye
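
As an aside, I suppose the non-drbd resources could be collapsed into a group instead of the long list of colocation/order constraints, something like the sketch below (untested; grp_export is just a placeholder name):

group grp_export lvm_data0 fs_data0 lvm_data1 fs_data1 ip_data svc_nfs svc_samba
colocation col_grp inf: grp_export ms_drbd_data0:Master
order ord_grp0 inf: ms_drbd_data0:promote grp_export:start
order ord_grp1 inf: ms_drbd_data1:promote grp_export:start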

However, the following is all I've been able to get working so far:

crm configure
primitive drbd_data0 ocf:linbit:drbd params drbd_resource="data0" op monitor interval="15s"
primitive drbd_data1 ocf:linbit:drbd params drbd_resource="data1" op monitor interval="15s"
ms ms_drbd_data0 drbd_data0 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
ms ms_drbd_data1 drbd_data1 meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
primitive lvm_data0 ocf:heartbeat:LVM params volgrpname="vgrData0" exclusive="yes" op monitor depth="0" timeout="30" interval="10"
primitive fs_data0 ocf:heartbeat:Filesystem params device="/dev/vgrData0/lvrData0" directory="/mnt/media" fstype="ext4"
colocation col_data00 inf: ms_drbd_data0:Master ms_drbd_data1:Master
colocation col_data01 inf: ms_drbd_data0:Master lvm_data0
colocation col_data02 inf: ms_drbd_data0:Master fs_data0
order ord_data00 inf: ms_drbd_data0:promote ms_drbd_data1:promote
order ord_data01 inf: ms_drbd_data0:promote lvm_data0:start
order ord_data02 inf: lvm_data0:start fs_data0:start
commit
bye

And to get the above to work, I have to issue:

crm resource failcount lvm_data0 delete node01
crm resource failcount lvm_data0 delete node02
crm resource failcount fs_data0 delete node01
crm resource failcount fs_data0 delete node02
crm resource cleanup lvm_data0
crm resource cleanup fs_data0

Then everything starts just fine on its own. After a reboot, however, it pukes again. The only resources that start reliably are the drbd resources.
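
(The failcounts themselves are easy to see after the reboot with something like the following; crm_mon's -f flag prints fail counts and -1 makes it run once and exit.)

crm_mon -1 -f
crm resource failcount lvm_data0 show node01
crm resource failcount fs_data0 show node01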

Looking through the logs (attached), it appears Pacemaker may be checking that the replicated LV (/dev/vgrData0/lvrData0) is down before the drbd resources are started. Since the replicated LVs use the drbd devices as their backing block device, this first scan fails, and I think this is where the failcount for the replicated LV hits INFINITY before we've ever even tried to start it.
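
I suppose I can check this by running the LVM agent's monitor action by hand on a node while the drbd devices are down and seeing what it returns (assuming the agent lives in the usual /usr/lib/ocf location):

OCF_ROOT=/usr/lib/ocf OCF_RESKEY_volgrpname=vgrData0 \
  /usr/lib/ocf/resource.d/heartbeat/LVM monitor; echo $?
# 7 (OCF_NOT_RUNNING) is the clean "not running" answer a probe should get;
# anything else gets recorded as a failure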

So, assuming my analysis is correct, what ideas might you have for the best way to address this? I believe I either need a way to prevent Pacemaker from looking for the replicated LV before the drbd resources are online, or a way to have Pacemaker automatically clear the failcounts/status for the LV once the drbd resources come online.
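
For the second option, maybe the failure-timeout meta attribute or the start-failure-is-fatal cluster property would do it; something along these lines (just a sketch, and I don't know yet whether either actually covers a failed initial probe):

crm configure
property start-failure-is-fatal="false"
rsc_defaults failure-timeout="60s"
commit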

Thoughts? Suggestions?

Thanks in advance.

DJ
 		 	   		  
[Attachment: pacemakerlog01-node01.log, 11679 bytes: http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100205/1272c144/attachment.obj]

