[Pacemaker] OCFS2 fails to mount file system on node reboot sometimes

Jake Smith jsmith at argotec.com
Tue Feb 22 13:56:12 EST 2011


Sometimes after a node reboots I get the following error when mounting the
OCFS2 file system.  If I manually stop and restart corosync it mounts fine,
but if I just run a cleanup or "crm resource start" it fails.  I don't
understand how I can be getting "no local IP address has been set" when both
the bonded links for the DRBD sync and the bonded links for the network are
up.
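
Concretely, what I run is roughly this (the clone name is taken from the
configuration further down; the corosync init script path is just what this
distro installed):

    # clear the failure and ask Pacemaker to start the clone again - still fails
    crm resource cleanup cloneFS
    crm resource start cloneFS

    # but a full restart of the stack mounts fine
    /etc/init.d/corosync stop
    /etc/init.d/corosync start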

 

corosync.log:

Feb 22 13:12:12 Condor crmd: [1246]: info: do_lrm_rsc_op: Performing key=66:4:0:927e853c-e0ee-4f67-a9e7-7cbda27cd316 op=resFS:1_start_0 )
Feb 22 13:12:12 Condor lrmd: [1242]: info: rsc:resFS:1:26: start
Feb 22 13:12:12 Condor lrmd: [1242]: info: RA output: (resFS:1:start:stderr) FATAL: Module scsi_hostadapter not found.
Feb 22 13:12:12 Condor lrmd: [1242]: info: RA output: (resFS:1:start:stderr) mount.ocfs2: Transport endpoint is not connected
Feb 22 13:12:12 Condor lrmd: [1242]: info: RA output: (resFS:1:start:stderr) while mounting /dev/drbd0 on /srv. Check 'dmesg' for more information on this error.
Feb 22 13:12:12 Condor crmd: [1246]: info: process_lrm_event: LRM operation resFS:1_start_0 (call=26, rc=1, cib-update=33, confirmed=true) unknown error
Feb 22 13:12:12 Condor attrd: [1243]: info: find_hash_entry: Creating hash entry for fail-count-resFS:1
Feb 22 13:12:12 Condor attrd: [1243]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-resFS:1 (INFINITY)
Feb 22 13:12:12 Condor attrd: [1243]: info: attrd_perform_update: Sent update 21: fail-count-resFS:1=INFINITY
Feb 22 13:12:12 Condor attrd: [1243]: info: find_hash_entry: Creating hash entry for last-failure-resFS:1
Feb 22 13:12:12 Condor attrd: [1243]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resFS:1 (1298398314)
Feb 22 13:12:12 Condor attrd: [1243]: info: attrd_perform_update: Sent update 24: last-failure-resFS:1=1298398314
Feb 22 13:12:12 Condor crmd: [1246]: info: do_lrm_rsc_op: Performing key=5:5:0:927e853c-e0ee-4f67-a9e7-7cbda27cd316 op=resFS:1_stop_0 )
Feb 22 13:12:12 Condor lrmd: [1242]: info: rsc:resFS:1:27: stop
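
As far as I understand, the fail-count-resFS:1=INFINITY above keeps Pacemaker
from retrying the start on this node until it is cleared; I check and clear it
with something like the following (exact syntax may vary with the crm shell
version):

    crm resource failcount resFS show Condor
    crm resource cleanup cloneFS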

 

dmesg:

[   23.896124] DLM (built Jan 11 2011 00:00:14) installed
[   23.917418] block drbd0: role( Secondary -> Primary )
[   24.118912] bond1: no IPv6 routers present
[   25.117097] ocfs2: Registered cluster interface user
[   25.144884] OCFS2 Node Manager 1.5.0
[   25.166762] OCFS2 1.5.0
[   27.085394] bond0: no IPv6 routers present
[   27.305886] dlm: no local IP address has been set
[   27.306168] dlm: cannot start dlm lowcomms -107
[   27.306589] (2370,0):ocfs2_dlm_init:2963 ERROR: status = -107
[   27.306959] (2370,0):ocfs2_mount_volume:1792 ERROR: status = -107
[   27.307289] ocfs2: Unmounting device (147,0) on (node 0)
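
If I read this right, status -107 is ENOTCONN (the same "Transport endpoint
is not connected" that mount.ocfs2 prints), and "no local IP address has been
set" comes from the kernel DLM when dlm_controld (started here by the
ocf:pacemaker:controld resource) has not yet written the node addresses into
configfs. Next time it fails I plan to compare roughly the following before
and after restarting corosync (assuming configfs is mounted at
/sys/kernel/config):

    # should show one entry per node once dlm_controld has configured comms
    ls /sys/kernel/config/dlm/cluster/comms/

    # corosync ring status and what Pacemaker thinks is running
    corosync-cfgtool -s
    crm_mon -1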

 

crm_config:

node Condor \
        attributes standby="off"
node Vulture \
        attributes standby="off"
primitive resDLM ocf:pacemaker:controld \
        op monitor interval="120s"
primitive resDRBD ocf:linbit:drbd \
        params drbd_resource="srv" \
        operations $id="resDRBD-operations" \
        op monitor interval="20" role="Master" timeout="20" \
        op monitor interval="30" role="Slave" timeout="20"
primitive resFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/srv" directory="/srv" fstype="ocfs2" \
        op monitor interval="120s"
primitive resIDRAC-CONDOR stonith:ipmilan \
        params hostname="Condor" ipaddr="192.168.2.61" port="623" auth="md5" priv="admin" login="xxxx" password="xxxx" \
        meta target-role="Started"
primitive resIDRAC-VULTURE stonith:ipmilan \
        params hostname="Vulture" ipaddr="192.168.2.62" port="623" auth="md5" priv="admin" login="xxxx" password="xxxx" \
        meta target-role="Started"
primitive resO2CB ocf:pacemaker:o2cb \
        op monitor interval="120s"
primitive resSAMBAVIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.2.200" cidr_netmask="32" nic="bond0" clusterip_hash="sourceip" \
        op monitor interval="30s" \
        meta resource-stickiness="0"
ms msDRBD resDRBD \
        meta resource-stickiness="100" notify="true" master-max="2" clone-max="2" clone-node-max="1" interleave="true" target-role="Started"
clone cloneDLM resDLM \
        meta globally-unique="false" interleave="true" target-role="Started"
clone cloneFS resFS \
        meta interleave="true" ordered="true" target-role="Started"
clone cloneO2CB resO2CB \
        meta globally-unique="false" interleave="true" target-role="Started"
clone cloneSAMBAVIP resSAMBAVIP \
        meta globally-unique="true" clone-max="2" clone-node-max="2" target-role="Started"
location locIDRAC-CONDOR resIDRAC-CONDOR -inf: Condor
location locIDRAC-VULTURE resIDRAC-VULTURE -inf: Vulture
colocation colDLMDRBD inf: cloneDLM msDRBD:Master
colocation colFSO2CB inf: cloneFS cloneO2CB
colocation colFSSAMBAVIP inf: cloneFS cloneSAMBAVIP
colocation colO2CBDLM inf: cloneO2CB cloneDLM
order ordDLMO2CB 0: cloneDLM cloneO2CB
order ordDRBDDLM 0: msDRBD:promote cloneDLM
order ordFSSAMBAVIP 0: cloneFS cloneSAMBAVIP
order ordO2CBFS 0: cloneO2CB cloneFS
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1298398491"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

 

Thanks!

 

Jake Smith
