[ClusterLabs] drbd clone not becoming master

Sat Nov 4 03:28:59 UTC 2017

On 03.11.2017 15:49, Ken Gaillot wrote:
> On Thu, 2017-11-02 at 23:18 +0100, Dennis Jacobfeuerborn wrote:
>> On 02.11.2017 23:08, Dennis Jacobfeuerborn wrote:
>>> Hi,
>>> I'm setting up a redundant NFS server for some experiments but almost
>>> immediately ran into a strange issue. The drbd clone resource never
>>> promotes either of the two clones to the Master state.
>>>
>>> The state says this:
>>>
>>>  Master/Slave Set: drbd-clone [drbd]
>>>      Slaves: [ nfsserver1 nfsserver2 ]
>>>  metadata-fs	(ocf::heartbeat:Filesystem):	Stopped
>>>
>>> The resource configuration looks like this:
>>>
>>> Resources:
>>>  Master: drbd-clone
>>>   Meta Attrs: master-node-max=1 clone-max=2 notify=true master-max=1 clone-node-max=1
>>>   Resource: drbd (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=r0
>>>    Operations: demote interval=0s timeout=90 (drbd-demote-interval-0s)
>>>                monitor interval=60s (drbd-monitor-interval-60s)
>>>                promote interval=0s timeout=90 (drbd-promote-interval-0s)
>>>                start interval=0s timeout=240 (drbd-start-interval-0s)
>>>                stop interval=0s timeout=100 (drbd-stop-interval-0s)
>>>  Resource: metadata-fs (class=ocf provider=heartbeat type=Filesystem)
>>>   Attributes: device=/dev/drbd/by-res/r0/0 directory=/var/lib/nfs_shared fstype=ext4 options=noatime
>>>   Operations: monitor interval=20 timeout=40 (metadata-fs-monitor-interval-20)
>>>               start interval=0s timeout=60 (metadata-fs-start-interval-0s)
>>>               stop interval=0s timeout=60 (metadata-fs-stop-interval-0s)
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   promote drbd-clone then start metadata-fs (kind:Mandatory)
>>> Colocation Constraints:
>>>   metadata-fs with drbd-clone (score:INFINITY) (with-rsc-role:Master)
>>>
>>> Shouldn't one of the clones be promoted to the Master state
>>> automatically?
>>
>> I think the source of the issue is this:
>>
>> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Called /usr/sbin/crm_master -Q -l reboot -v 10000
>> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Exit code 107
>> Nov  2 23:12:03 nfsserver1 drbd(drbd)[4673]: ERROR: r0: Command output:
>> Nov  2 23:12:03 nfsserver1 lrmd[2163]:  notice: drbd_monitor_60000:4673:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]
>>
>> It seems the drbd resource agent tries to use crm_master to promote the
>> clone but fails because it cannot "sign on to the CIB service". Does
>> anybody know what that means?
>>
>> Regards,
>>   Dennis
>>
> 
> That's odd; it should only happen if the cluster is not running, but
> then the agent wouldn't have been called.
> 
> The CIB is one of the core daemons of pacemaker; it manages the cluster
> configuration and status. If it's not running, the cluster can't do
> anything.
> 
> Perhaps the CIB is crashing, or something is blocking the communication
> between the agent and the CIB.

SELinux was the culprit. After disabling it, the problem went away.
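
For anyone who finds this thread later, here is a minimal sketch of how one might confirm that SELinux is involved before turning it off completely. It assumes the getenforce and ausearch tools are installed (usual on RHEL/CentOS-style systems, which is an assumption and not something stated in this thread), and the function names selinux_mode and recent_denials are made up for illustration.

```shell
#!/bin/sh
# Sketch only: check the SELinux mode and look for recent denials.
# Assumes getenforce/ausearch exist; both helpers degrade gracefully if not.

# Print the current SELinux mode, or "unknown" if getenforce is missing.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unknown"
    fi
}

# Print recent AVC denials from the audit log, if ausearch is available.
# Usually needs root; errors (or "no matches") are ignored.
recent_denials() {
    if command -v ausearch >/dev/null 2>&1; then
        ausearch -m avc -ts recent 2>/dev/null || true
    fi
}

echo "SELinux mode: $(selinux_mode)"
recent_denials
```

If the mode is Enforcing and the denials mention the cluster daemons or the resource agent, switching to permissive mode temporarily (setenforce 0) and retesting would be a gentler check than disabling SELinux outright.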

Regards,
  Dennis