[Pacemaker] Trouble with drbd/pacemaker: switch to secondary/secondary

Vlad vovan at vovan.nl
Fri Oct 21 12:20:28 EDT 2016


In your post I didn't see any cluster configuration related to the bnx2x
interface (enp5s0f0), only the IP address resource, which uses eno1.
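
To check whether the DRBD disconnects line up with the link flaps, a small
script along these lines could be run over the dmesg output. This is only a
sketch: the sample lines are copied from the dmesg quoted below (with wrapped
lines joined), and the function names link_events and outages are made up for
illustration, not part of any tool.

```python
import re

# Sample lines copied from the dmesg output quoted in this thread
# (wrapped lines joined back together).
DMESG = """\
[34103.380335] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
[34114.046443] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
[34123.802580] drbd server: PingAck did not arrive in time.
[34185.719207] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[34232.241599] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
[34268.637861] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
"""

# "[seconds.micros] message" as printed by dmesg.
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def link_events(text):
    """Return (timestamp, kind) tuples for link and DRBD ping events."""
    events = []
    for raw in text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue
        ts, msg = float(match.group(1)), match.group(2)
        if "NIC Link is Down" in msg:
            events.append((ts, "link-down"))
        elif "NIC Link is Up" in msg:
            events.append((ts, "link-up"))
        elif "PingAck did not arrive" in msg:
            events.append((ts, "drbd-pingack-timeout"))
    return events


def outages(events):
    """Pair each link-down with the next link-up: (down, up, duration)."""
    result = []
    down = None
    for ts, kind in events:
        if kind == "link-down":
            down = ts
        elif kind == "link-up" and down is not None:
            result.append((down, ts, ts - down))
            down = None
    return result


if __name__ == "__main__":
    events = link_events(DMESG)
    for down, up, duration in outages(events):
        # List DRBD ping timeouts that happened while the link was down.
        hits = [ts for ts, kind in events
                if kind == "drbd-pingack-timeout" and down <= ts <= up]
        print(f"link down {down:.2f} -> up {up:.2f} "
              f"({duration:.1f}s), DRBD ping timeouts: {len(hits)}")
```

On the sample above, the PingAck timeout at 34123.80 falls inside the first
link outage (34114.05 to 34185.72), which is why I suspect the interface.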

On 18/10/16 10:05, Anne Nicolas wrote:
> 2016-10-18 9:56 GMT+02:00 Vlad <vovan at vovan.nl>:
>> Is something wrong with the network interface?
>>
>> [34114.046443] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
>> [34185.719207] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
>> full duplex, Flow control: ON - receive & transmit
>> [34232.241599] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
>> [34268.637861] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
>> full duplex, Flow control: ON - receive & transmit
> I don't think so. This interface is part of the cluster resource and
> is up on the master only. So it seems this is rather due to a resource restart.
>
>>
>> On 14/10/16 17:54, Anne Nicolas wrote:
>>> Hi!
>>>
>>> I'm having trouble with a 2-node cluster used for DRBD / Apache / Samba
>>> and some other services.
>>>
>>> Whatever I do, it always goes to the following state:
>>>
>>> Last updated: Fri Oct 14 17:41:38 2016
>>> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
>>> Stack: corosync
>>> Current DC: bzvairsvr (168430081) - partition with quorum
>>> Version: 1.1.8-9.mga5-394e906
>>> 2 Nodes configured, unknown expected votes
>>> 13 Resources configured.
>>>
>>>
>>> Online: [ bzvairsvr bzvairsvr2 ]
>>>
>>>  Master/Slave Set: drbdservClone [drbdserv]
>>>      Slaves: [ bzvairsvr bzvairsvr2 ]
>>>  Clone Set: fencing [st-ssh]
>>>      Started: [ bzvairsvr bzvairsvr2 ]
>>>
>>> When I reboot bzvairsvr2, it becomes primary again, but after a while
>>> it goes back to secondary as well.
>>> I use a very basic fencing system based on ssh. It's not optimal but
>>> enough for the current tests.
>>>
>>> Here is the configuration:
>>>
>>> node 168430081: bzvairsvr
>>> node 168430082: bzvairsvr2
>>> primitive apache apache \
>>>         params configfile="/etc/httpd/conf/httpd.conf" \
>>>         op start interval=0 timeout=120s \
>>>         op stop interval=0 timeout=120s
>>> primitive clusterip IPaddr2 \
>>>         params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
>>>         meta target-role=Started
>>> primitive clusterroute Route \
>>>         params destination="0.0.0.0/0" gateway=192.168.100.254
>>> primitive drbdserv ocf:linbit:drbd \
>>>         params drbd_resource=server \
>>>         op monitor interval=30s role=Slave \
>>>         op monitor interval=29s role=Master start-delay=30s
>>> primitive fsserv Filesystem \
>>>         params device="/dev/drbd/by-res/server" directory="/Server" fstype=ext4 \
>>>         op start interval=0 timeout=60s \
>>>         op stop interval=0 timeout=60s \
>>>         meta target-role=Started
>>> primitive libvirt-guests systemd:libvirt-guests
>>> primitive libvirtd systemd:libvirtd
>>> primitive mysql systemd:mysqld
>>> primitive named systemd:named
>>> primitive samba systemd:smb
>>> primitive st-ssh stonith:external/ssh \
>>>         params hostlist="bzvairsvr bzvairsvr2"
>>> group iphd clusterip clusterroute \
>>>         meta target-role=Started
>>> group services libvirtd libvirt-guests apache named mysql samba \
>>>         meta target-role=Started
>>> ms drbdservClone drbdserv \
>>>         meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
>>> clone fencing st-ssh
>>> colocation fs_on_drbd inf: fsserv drbdservClone:Master
>>> colocation iphd_on_services inf: iphd services
>>> colocation services_on_fsserv inf: services fsserv
>>> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
>>> order services_after_fsserv inf: fsserv services
>>> property cib-bootstrap-options: \
>>>         dc-version=1.1.8-9.mga5-394e906 \
>>>         cluster-infrastructure=corosync \
>>>         no-quorum-policy=ignore \
>>>         stonith-enabled=true \
>>>
>>> The cluster logs are flooded with:
>>> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
>>> attrd_trigger_update:    Sending flush op to all hosts for:
>>> master-drbdserv (10000)
>>> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
>>> attrd_perform_update:    Sent update master-drbdserv=10000 failed:
>>> Transport endpoint is not connected
>>> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
>>> attrd_perform_update:    Sent update -107: master-drbdserv=10000
>>> Oct 14 17:42:28 [3445] bzvairsvr      attrd:  warning:
>>> attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport
>>> endpoint is not connected
>>> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
>>> attrd_trigger_update:    Sending flush op to all hosts for:
>>> master-drbdserv (10000)
>>> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
>>> attrd_perform_update:    Sent update master-drbdserv=10000 failed:
>>> Transport endpoint is not connected
>>> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
>>> attrd_perform_update:    Sent update -107: master-drbdserv=10000
>>> Oct 14 17:42:59 [3445] bzvairsvr      attrd:  warning:
>>> attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport
>>> endpoint is not connected
>>>
>>>
>>> And here is the dmesg output:
>>>
>>> [34067.547147] block drbd0: peer( Secondary -> Primary )
>>> [34091.023206] block drbd0: peer( Primary -> Secondary )
>>> [34096.616319] drbd server: peer( Secondary -> Unknown ) conn( Connected
>>> -> TearDown ) pdsk( UpToDate -> DUnknown )
>>> [34096.616353] drbd server: asender terminated
>>> [34096.616358] drbd server: Terminating drbd_a_server
>>> [34096.682874] drbd server: Connection closed
>>> [34096.682894] drbd server: conn( TearDown -> Unconnected )
>>> [34096.682897] drbd server: receiver terminated
>>> [34096.682900] drbd server: Restarting receiver thread
>>> [34096.682902] drbd server: receiver (re)started
>>> [34096.682915] drbd server: conn( Unconnected -> WFConnection )
>>> [34103.311898] drbd server: Handshake successful: Agreed network
>>> protocol version 101
>>> [34103.311903] drbd server: Agreed to support TRIM on protocol level
>>> [34103.311997] drbd server: Peer authenticated using 20 bytes HMAC
>>> [34103.312046] drbd server: conn( WFConnection -> WFReportParams )
>>> [34103.312062] drbd server: Starting asender thread (from drbd_r_server
>>> [4344])
>>> [34103.380311] block drbd0: drbd_sync_handshake:
>>> [34103.380318] block drbd0: self
>>> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
>>> bits:0 flags:0
>>> [34103.380323] block drbd0: peer
>>> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
>>> bits:0 flags:0
>>> [34103.380327] block drbd0: uuid_compare()=0 by rule 40
>>> [34103.380335] block drbd0: peer( Unknown -> Secondary ) conn(
>>> WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
>>> [34114.046443] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
>>> [34123.802580] drbd server: PingAck did not arrive in time.
>>> [34123.802617] drbd server: peer( Secondary -> Unknown ) conn( Connected
>>> -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>>> [34123.802773] drbd server: asender terminated
>>> [34123.802777] drbd server: Terminating drbd_a_server
>>> [34123.932565] drbd server: Connection closed
>>> [34123.932585] drbd server: conn( NetworkFailure -> Unconnected )
>>> [34123.932588] drbd server: receiver terminated
>>> [34123.932590] drbd server: Restarting receiver thread
>>> [34123.932592] drbd server: receiver (re)started
>>> [34123.932605] drbd server: conn( Unconnected -> WFConnection )
>>> [34185.719207] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
>>> full duplex, Flow control: ON - receive & transmit
>>> [34232.241599] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
>>> [34268.637861] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
>>> full duplex, Flow control: ON - receive & transmit
>>> [34318.675122] drbd server: Handshake successful: Agreed network
>>> protocol version 101
>>> [34318.675128] drbd server: Agreed to support TRIM on protocol level
>>> [34318.675218] drbd server: Peer authenticated using 20 bytes HMAC
>>> [34318.675258] drbd server: conn( WFConnection -> WFReportParams )
>>> [34318.675276] drbd server: Starting asender thread (from drbd_r_server
>>> [4344])
>>> [34318.738909] block drbd0: drbd_sync_handshake:
>>> [34318.738916] block drbd0: self
>>> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
>>> bits:0 flags:0
>>> [34318.738921] block drbd0: peer
>>> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
>>> bits:0 flags:0
>>> [34318.738924] block drbd0: uuid_compare()=0 by rule 40
>>> [34318.738933] block drbd0: peer( Unknown -> Secondary ) conn(
>>> WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
>>> [34328.812317] block drbd0: peer( Secondary -> Primary )
>>> [37316.065793] usb 3-11: USB disconnect, device number 3
>>> [52246.642265] block drbd0: peer( Primary -> Secondary )
>>>
>>> Any help would be appreciated
>>>
>>> Cheers
>>>




