[ClusterLabs] drbd clone not becoming master

Fri Nov 3 15:49:59 CET 2017

On Thu, 2017-11-02 at 23:18 +0100, Dennis Jacobfeuerborn wrote:
> On 02.11.2017 23:08, Dennis Jacobfeuerborn wrote:
> > Hi,
> > I'm setting up a redundant NFS server for some experiments but almost
> > immediately ran into a strange issue. The drbd clone resource never
> > promotes either of the two clones to the Master state.
> > 
> > The state says this:
> > 
> >  Master/Slave Set: drbd-clone [drbd]
> >      Slaves: [ nfsserver1 nfsserver2 ]
> >  metadata-fs	(ocf::heartbeat:Filesystem):	Stopped
> > 
> > The resource configuration looks like this:
> > 
> > Resources:
> >  Master: drbd-clone
> >   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
> >   Resource: drbd (class=ocf provider=linbit type=drbd)
> >    Attributes: drbd_resource=r0
> >    Operations: demote interval=0s timeout=90 (drbd-demote-interval-0s)
> >                monitor interval=60s (drbd-monitor-interval-60s)
> >                promote interval=0s timeout=90 (drbd-promote-interval-0s)
> >                start interval=0s timeout=240 (drbd-start-interval-0s)
> >                stop interval=0s timeout=100 (drbd-stop-interval-0s)
> >  Resource: metadata-fs (class=ocf provider=heartbeat type=Filesystem)
> >   Attributes: device=/dev/drbd/by-res/r0/0 directory=/var/lib/nfs_shared fstype=ext4 options=noatime
> >   Operations: monitor interval=20 timeout=40 (metadata-fs-monitor-interval-20)
> >               start interval=0s timeout=60 (metadata-fs-start-interval-0s)
> >               stop interval=0s timeout=60 (metadata-fs-stop-interval-0s)
> > 
> > Location Constraints:
> > Ordering Constraints:
> >   promote drbd-clone then start metadata-fs (kind:Mandatory)
> > Colocation Constraints:
> >   metadata-fs with drbd-clone (score:INFINITY) (with-rsc-role:Master)
> > 
> > Shouldn't one of the clones be promoted to the Master state automatically?
> 
> I think the source of the issue is this:
> 
> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Called /usr/sbin/crm_master -Q -l reboot -v 10000
> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Exit code 107
> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Command output:
> Nov  2 23:12:03 nfsserver1 lrmd[2163]:  notice: drbd_monitor_60000:4673:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]
> 
> It seems the drbd resource agent calls crm_master to set the promotion
> score for the clone, but the call fails because it cannot "sign on to
> the CIB service". Does anybody know what that means?
> 
> Regards,
>   Dennis
> 

That's odd. That error should only happen if the cluster is not running,
but then the agent wouldn't have been called.

The CIB is one of the core daemons of pacemaker; it manages the cluster
configuration and status. If it's not running, the cluster can't do
anything.

Perhaps the CIB is crashing, or something is blocking the communication
between the agent and the CIB.
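
One way to tell these two cases apart is sketched below. It is only a rough
check, not a definitive procedure: it assumes a Pacemaker 1.1.x installation
where the CIB daemon runs as a process named "cib" (later releases name it
differently), and check_cib is a hypothetical helper name used only for
illustration. It first looks for the daemon process, then tries a read-only
query of the live CIB with cibadmin, which has to sign on to the CIB much as
crm_master does.

```shell
#!/bin/sh
# Diagnostic sketch: is the CIB daemon running, and can a client reach it?
# Assumption: Pacemaker 1.1.x, where the daemon process is named "cib".

check_cib() {
    # Step 1: look for a process whose name is exactly "cib".
    if pgrep -x cib >/dev/null 2>&1; then
        echo "cib process: running"
    else
        echo "cib process: not found"
    fi

    # Step 2: try a read-only query of the live CIB.
    # A failure here while the process is running suggests that something
    # is blocking communication rather than the daemon being down.
    if command -v cibadmin >/dev/null 2>&1; then
        if cibadmin -Q >/dev/null 2>&1; then
            echo "cib connection: ok"
        else
            echo "cib connection: failed (exit $?)"
        fi
    else
        echo "cibadmin: not installed"
    fi
}

check_cib
```

Run it as root on the node that logged the error. Running it as another
user may give a different answer, since the connection attempt depends on
the caller's permissions.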
-- 
Ken Gaillot <kgaillot at redhat.com>