[ClusterLabs] Trouble with drbd/pacemaker: switch to secondary/secondary

Ken Gaillot kgaillot at redhat.com
Tue Oct 18 11:07:26 EDT 2016


On 10/14/2016 03:22 PM, Anne Nicolas wrote:
> Hi!
> 
> I'm having trouble with a two-node cluster used for DRBD / Apache / Samba
> and some other services.
> 
> Whatever I do, it always goes to the following state:
> 
> Last updated: Fri Oct 14 17:41:38 2016
> Last change: Thu Oct 13 10:42:29 2016 via cibadmin on bzvairsvr
> Stack: corosync
> Current DC: bzvairsvr (168430081) - partition with quorum
> Version: 1.1.8-9.mga5-394e906
> 2 Nodes configured, unknown expected votes
> 13 Resources configured.
> 
> 
> Online: [ bzvairsvr bzvairsvr2 ]
> 
>  Master/Slave Set: drbdservClone [drbdserv]
>      Slaves: [ bzvairsvr bzvairsvr2 ]
>  Clone Set: fencing [st-ssh]
>      Started: [ bzvairsvr bzvairsvr2 ]
> 
> When I reboot bzvairsvr2, it goes primary again, but after a while it
> becomes secondary as well.
> I use a very basic fencing system based on ssh. It's not optimal but
> enough for the current tests.
> 
> Here is information about the configuration:
> 
> node 168430081: bzvairsvr
> node 168430082: bzvairsvr2
> primitive apache apache \
>         params configfile="/etc/httpd/conf/httpd.conf" \
>         op start interval=0 timeout=120s \
>         op stop interval=0 timeout=120s
> primitive clusterip IPaddr2 \
>         params ip=192.168.100.1 cidr_netmask=24 nic=eno1 \
>         meta target-role=Started
> primitive clusterroute Route \
>         params destination="0.0.0.0/0" gateway=192.168.100.254
> primitive drbdserv ocf:linbit:drbd \
>         params drbd_resource=server \
>         op monitor interval=30s role=Slave \
>         op monitor interval=29s role=Master start-delay=30s
> primitive fsserv Filesystem \
>         params device="/dev/drbd/by-res/server" directory="/Server" fstype=ext4 \
>         op start interval=0 timeout=60s \
>         op stop interval=0 timeout=60s \
>         meta target-role=Started
> primitive libvirt-guests systemd:libvirt-guests
> primitive libvirtd systemd:libvirtd
> primitive mysql systemd:mysqld
> primitive named systemd:named
> primitive samba systemd:smb
> primitive st-ssh stonith:external/ssh \
>         params hostlist="bzvairsvr bzvairsvr2"
> group iphd clusterip clusterroute \
>         meta target-role=Started
> group services libvirtd libvirt-guests apache named mysql samba \
>         meta target-role=Started
> ms drbdservClone drbdserv \
>         meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
> clone fencing st-ssh
> colocation fs_on_drbd inf: fsserv drbdservClone:Master
> colocation iphd_on_services inf: iphd services
> colocation services_on_fsserv inf: services fsserv
> order fsserv-after-drbdserv inf: drbdservClone:promote fsserv:start
> order services_after_fsserv inf: fsserv services
> property cib-bootstrap-options: \
>         dc-version=1.1.8-9.mga5-394e906 \
>         cluster-infrastructure=corosync \
>         no-quorum-policy=ignore \
>         stonith-enabled=true \
> 
> Cluster logs are flooded with:
> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
> attrd_trigger_update:    Sending flush op to all hosts for:
> master-drbdserv (10000)
> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
> attrd_perform_update:    Sent update master-drbdserv=10000 failed:
> Transport endpoint is not connected

This is strange, and it is the cause of the problem. A master/slave resource
agent will try to set node attributes indicating which node should
become the master. Here, we see that this is failing -- it appears attrd
(Pacemaker's node attribute daemon) is unable to talk to any other daemons.
Without a stored master-drbdserv value, the cluster has nothing telling it
which node to promote, so both instances stay slaves.
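
To make the role of that attribute concrete, here is a toy model in Python.
It is only a sketch and not Pacemaker's actual placement logic: the function
name pick_master and the idea of representing a never-stored attribute as
None are assumptions made for illustration. It shows why a node can only be
promoted once its master-drbdserv score has actually been recorded.

```python
from typing import Dict, Optional


def pick_master(master_scores: Dict[str, Optional[int]]) -> Optional[str]:
    """Toy model: choose the node with the highest positive master score.

    A value of None stands for "attribute never stored", which is what a
    failed attribute update would leave behind (assumption for this sketch).
    """
    candidates = {node: score for node, score in master_scores.items()
                  if score is not None and score > 0}
    if not candidates:
        return None  # nobody is eligible, so every instance stays a slave
    # Sort the names first so that ties are broken deterministically.
    return max(sorted(candidates), key=lambda node: candidates[node])


if __name__ == "__main__":
    # One node managed to store its score: it is the promotion candidate.
    print(pick_master({"bzvairsvr": 10000, "bzvairsvr2": None}))
    # No score was stored anywhere: no candidate at all.
    print(pick_master({"bzvairsvr": None, "bzvairsvr2": None}))
```

In this model the second call returns None, which corresponds to the
"Slaves: [ bzvairsvr bzvairsvr2 ]" state in the status output.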

I'm not sure why this would happen, especially if the rest of the
daemons do not have a problem talking to each other. But that's where
you need to investigate.
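
As a starting point for that investigation, a small script along these lines
could summarize how often each attribute update fails and with what reason.
It is a minimal sketch, assuming each log entry sits on a single line (the
entries quoted in this mail are wrapped); the names SAMPLE_LOG,
count_failed_updates and describe_rc are made up for illustration. The
describe_rc helper maps a negative return code such as -107 to its errno
name; on Linux, 107 is ENOTCONN, whose message is "Transport endpoint is not
connected".

```python
import errno
import re
from collections import Counter

# Two log entries from this thread, unwrapped onto single lines (assumption).
SAMPLE_LOG = """\
Oct 14 17:42:28 [3445] bzvairsvr      attrd:  warning: attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport endpoint is not connected
Oct 14 17:42:59 [3445] bzvairsvr      attrd:  warning: attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport endpoint is not connected
"""

FAILED_UPDATE = re.compile(
    r"attrd_cib_callback:\s+Update (?P<name>[\w.-]+)=(?P<value>\S+) "
    r"failed: (?P<reason>.+)$"
)


def count_failed_updates(log_text):
    """Count failed attribute updates, keyed by (attribute name, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_UPDATE.search(line)
        if match:
            counts[(match.group("name"), match.group("reason").strip())] += 1
    return counts


def describe_rc(rc):
    """Map a negative return code (e.g. -107) to its errno name, if known."""
    return errno.errorcode.get(-rc, "unknown")


if __name__ == "__main__":
    for (name, reason), count in count_failed_updates(SAMPLE_LOG).items():
        print(f"{name}: {count} failed update(s): {reason}")
    print(describe_rc(-107))
```

If the same attribute fails with the same reason on every flush, that points
at a persistent connection problem rather than a one-off glitch.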

One thing I would say is that 1.1.8 is really old at this point, which
means you're using the "legacy" attrd, which I'm not very familiar with.

> Oct 14 17:42:28 [3445] bzvairsvr      attrd:   notice:
> attrd_perform_update:    Sent update -107: master-drbdserv=10000
> Oct 14 17:42:28 [3445] bzvairsvr      attrd:  warning:
> attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport
> endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
> attrd_trigger_update:    Sending flush op to all hosts for:
> master-drbdserv (10000)
> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
> attrd_perform_update:    Sent update master-drbdserv=10000 failed:
> Transport endpoint is not connected
> Oct 14 17:42:59 [3445] bzvairsvr      attrd:   notice:
> attrd_perform_update:    Sent update -107: master-drbdserv=10000
> Oct 14 17:42:59 [3445] bzvairsvr      attrd:  warning:
> attrd_cib_callback:      Update master-drbdserv=10000 failed: Transport
> endpoint is not connected
> 
> 
> And here is dmesg
> 
> [34067.547147] block drbd0: peer( Secondary -> Primary )
> [34091.023206] block drbd0: peer( Primary -> Secondary )
> [34096.616319] drbd server: peer( Secondary -> Unknown ) conn( Connected
> -> TearDown ) pdsk( UpToDate -> DUnknown )
> [34096.616353] drbd server: asender terminated
> [34096.616358] drbd server: Terminating drbd_a_server
> [34096.682874] drbd server: Connection closed
> [34096.682894] drbd server: conn( TearDown -> Unconnected )
> [34096.682897] drbd server: receiver terminated
> [34096.682900] drbd server: Restarting receiver thread
> [34096.682902] drbd server: receiver (re)started
> [34096.682915] drbd server: conn( Unconnected -> WFConnection )
> [34103.311898] drbd server: Handshake successful: Agreed network
> protocol version 101
> [34103.311903] drbd server: Agreed to support TRIM on protocol level
> [34103.311997] drbd server: Peer authenticated using 20 bytes HMAC
> [34103.312046] drbd server: conn( WFConnection -> WFReportParams )
> [34103.312062] drbd server: Starting asender thread (from drbd_r_server
> [4344])
> [34103.380311] block drbd0: drbd_sync_handshake:
> [34103.380318] block drbd0: self
> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
> bits:0 flags:0
> [34103.380323] block drbd0: peer
> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
> bits:0 flags:0
> [34103.380327] block drbd0: uuid_compare()=0 by rule 40
> [34103.380335] block drbd0: peer( Unknown -> Secondary ) conn(
> WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
> [34114.046443] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
> [34123.802580] drbd server: PingAck did not arrive in time.
> [34123.802617] drbd server: peer( Secondary -> Unknown ) conn( Connected
> -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> [34123.802773] drbd server: asender terminated
> [34123.802777] drbd server: Terminating drbd_a_server
> [34123.932565] drbd server: Connection closed
> [34123.932585] drbd server: conn( NetworkFailure -> Unconnected )
> [34123.932588] drbd server: receiver terminated
> [34123.932590] drbd server: Restarting receiver thread
> [34123.932592] drbd server: receiver (re)started
> [34123.932605] drbd server: conn( Unconnected -> WFConnection )
> [34185.719207] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
> full duplex, Flow control: ON - receive & transmit
> [34232.241599] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Down
> [34268.637861] bnx2x 0000:05:00.0 enp5s0f0: NIC Link is Up, 10000 Mbps
> full duplex, Flow control: ON - receive & transmit
> [34318.675122] drbd server: Handshake successful: Agreed network
> protocol version 101
> [34318.675128] drbd server: Agreed to support TRIM on protocol level
> [34318.675218] drbd server: Peer authenticated using 20 bytes HMAC
> [34318.675258] drbd server: conn( WFConnection -> WFReportParams )
> [34318.675276] drbd server: Starting asender thread (from drbd_r_server
> [4344])
> [34318.738909] block drbd0: drbd_sync_handshake:
> [34318.738916] block drbd0: self
> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
> bits:0 flags:0
> [34318.738921] block drbd0: peer
> 8B500BD87A5D76D4:0000000000000000:A1860E99AC8107A0:A1850E99AC8107A0
> bits:0 flags:0
> [34318.738924] block drbd0: uuid_compare()=0 by rule 40
> [34318.738933] block drbd0: peer( Unknown -> Secondary ) conn(
> WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )
> [34328.812317] block drbd0: peer( Secondary -> Primary )
> [37316.065793] usb 3-11: USB disconnect, device number 3
> [52246.642265] block drbd0: peer( Primary -> Secondary )
> 
> Any help would be appreciated
> 
> Cheers
> 




