[Pacemaker] Seems to be working but fails to transition to other node.

Jake Smith jsmith at argotec.com
Wed May 30 22:29:06 EDT 2012



  
----- Original Message -----
> From: "Steven Silk" <steven.silk at noaa.gov>
> To: pacemaker at oss.clusterlabs.org
> Sent: Wednesday, May 30, 2012 8:16:56 PM
> Subject: [Pacemaker] Seems to be working but fails to transition to other node.
> 
> 
> All Concerned;
> 
> I have been getting slapped around all day with this problem - I
> can't solve it.
> 
> The system is only half done - I have not yet implemented the nfs
> portion - but drbd part is not yet cooperating with corosync.
> 
> It appears to be working OK - but when I stop corosync on the DC -
> the other node does not start drbd?
> 
> Here is how I am setting things up....
> 
> 
> 
> 
> Configure quorum and stonith
> property no-quorum-policy="ignore"
> property stonith-enabled="false"
> 
> On wms1 configure DRBD resource
> primitive drbd_drbd0 ocf:linbit:drbd \
>         params drbd_resource="drbd0" \
>         op monitor interval="30s"

You should have a monitor op for both the "Master" and "Slave" roles, i.e.:

        op monitor interval="30" role="Slave" \
        op monitor interval="20" role="Master"

> 
> Configure DRBD Master/Slave
> ms ms_drbd_drbd0 drbd_drbd0 \
>         meta master-max="1" master-node-max="1" \
>              clone-max="2" clone-node-max="1" \
>              notify="true"
> 
> Configure filesystem mountpoint
> primitive fs_ftpdata ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" \
>         directory="/mnt/drbd0" fstype="ext3"
> 
> When I check the status on the DC....
> 
> [root at wms2 ~]# crm
> crm(live)# status
> ============
> Last updated: Wed May 30 23:58:43 2012
> Last change: Wed May 30 23:52:42 2012 via cibadmin on wms1
> Stack: openais
> Current DC: wms2 - partition with quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
> 
> Online: [ wms1 wms2 ]
> 
>  Master/Slave Set: ms_drbd_drbd0 [drbd_drbd0]
>      Masters: [ wms2 ]
>      Slaves: [ wms1 ]
>  fs_ftpdata    (ocf::heartbeat:Filesystem):    Started wms2
> 
> [root at wms2 ~]# mount -l | grep drbd
> 
> /dev/drbd0 on /mnt/drbd0 type ext3 (rw)
> 
> So I stop corosync - but the other node...
> 
> [root at wms1 ~]# crm
> crm(live)# status
> ============
> Last updated: Thu May 31 00:12:17 2012
> Last change: Wed May 30 23:52:42 2012 via cibadmin on wms1
> Stack: openais
> Current DC: wms1 - partition WITHOUT quorum
> Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
> 2 Nodes configured, 2 expected votes
> 3 Resources configured.
> ============
> 
> Online: [ wms1 ]
> OFFLINE: [ wms2 ]
> 
>  Master/Slave Set: ms_drbd_drbd0 [drbd_drbd0]
>      Masters: [ wms1 ]
>      Stopped: [ drbd_drbd0:1 ]
> 
> Fails to mount /dev/drbd0?
> 
> Any ideas?
> 
> I tailed /var/log/cluster/corosync.log and get this....
> 
> May 31 00:02:36 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 22 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:03:06 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 25 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:03:10 wms1 crmd: [1268]: WARN: cib_rsc_callback: Resource
> update 15 failed: (rc=-41) Remote node did not respond
> May 31 00:03:36 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 28 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:04:06 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 31 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:04:10 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 34 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:04:10 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 37 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:04:10 wms1 attrd: [1266]: WARN: attrd_cib_callback: Update
> 40 for master-drbd_drbd0:0=5 failed: Remote node did not respond
> May 31 00:08:02 wms1 cib: [1257]: info: cib_stats: Processed 58
> operations (0.00us average, 0% utilization) in the last 10min
> May 31 00:08:02 wms1 cib: [1264]: info: cib_stats: Processed 117
> operations (256.00us average, 0% utilization) in the last 10min
> 
> [root at wms2 ~]# tail /var/log/cluster/corosync.log
> May 31 00:02:16 corosync [pcmk  ] info: update_member: Node wms2 now
> has process list: 00000000000000000000000000000002 (2)
> May 31 00:02:16 corosync [pcmk  ] notice: pcmk_shutdown: Shutdown
> complete
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: Pacemaker
> Cluster Manager 1.1.6
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> extended virtual synchrony service
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> configuration service
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> cluster closed process group service v1.01
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> cluster config database access v1.01
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> profile loading service
> May 31 00:02:16 corosync [SERV  ] Service engine unloaded: corosync
> cluster quorum service v0.1
> May 31 00:02:16 corosync [MAIN  ] Corosync Cluster Engine exiting
> with status 0 at main.c:1858.
> 
> 
> 
> the example that I am working from talks about doing the
> following....
> 
> 
> group services fs_drbd0
> 
> But this fails miserably... "services" being undefined?
> 

Close, but you can't put ms resources inside a group, and even if you could, your syntax for the group is a bit off.
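
Once the nfs pieces are in place you can still use a group for the non-ms resources; the ms resource itself just has to stay outside it. A rough sketch (p_nfsserver is only a placeholder for however you end up defining the nfs daemon):

        # p_nfsserver is hypothetical - substitute your real nfs primitive(s)
        group g_services fs_ftpdata p_nfsserver

Group members start in the order listed and are kept on the same node, so the Filesystem goes first. Either way you still need the two constraints below, pointed at the group once it exists.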

What you need is an ordering constraint to make sure that DRBD is promoted to Master before the filesystem is mounted, and then a colocation constraint to make sure that the filesystem is started on the same node as the DRBD Master, i.e.:

order o_drbd_master_then_fs inf: ms_drbd_drbd0:promote fs_ftpdata:start

colocation c_fs_with_drbd_master inf: fs_ftpdata ms_drbd_drbd0:Master
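
You can load those straight from the shell and then sanity-check the result, e.g.:

        crm configure order o_drbd_master_then_fs inf: ms_drbd_drbd0:promote fs_ftpdata:start
        crm configure colocation c_fs_with_drbd_master inf: fs_ftpdata ms_drbd_drbd0:Master
        crm configure verify
        crm_mon -1

Then repeat your test (stop corosync on the DC) and watch crm_mon on the surviving node - you should see the promote happen first and fs_ftpdata start right after it.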


HTH

Jake



