[Pacemaker] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

Andreas Mock andreas.mock at web.de
Mon Aug 26 13:31:07 EDT 2013


Hi all,

While the linbit drbd resource agent seems to work perfectly on
pacemaker 1.1.8 (standard software repository), we have problems
with the latest release 1.1.10 and also with the newest head
1.1.11.xxx.

Since drbd is in fairly common use, I really hope to find interested
people to help me out. I can provide as much debug information as
you want.


Environment:
RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
DRBD 8.4.3 compiled from sources.
64bit

- A drbd resource configured following the linbit documentation.
- Manual start and stop (up/down) and setting primary of drbd resource
working smoothly.
- 2 nodes dis03-test/dis04-test



- Following simple config on pacemaker 1.1.8
configure
    property no-quorum-policy=stop
    property stonith-enabled=true
    rsc_defaults resource-stickiness=2
    primitive r_stonith-dis03-test stonith:fence_mock \
        meta resource-stickiness="INFINITY" target-role="Started" \
        op monitor interval="180" timeout="300" requires="nothing" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params vmname=dis03-test pcmk_host_list="dis03-test"
    primitive r_stonith-dis04-test stonith:fence_mock \
        meta resource-stickiness="INFINITY" target-role="Started" \
        op monitor interval="180" timeout="300" requires="nothing" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params vmname=dis04-test pcmk_host_list="dis04-test"
    location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
        rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
    location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
        rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
    primitive r_drbd_postfix ocf:linbit:drbd \
        params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
        op monitor interval="15s"  timeout="60s" role="Master" \
        op monitor interval="45s"  timeout="60s" role="Slave" \
        op start timeout="240" \
        op stop timeout="240" \
        meta target-role="Stopped" migration-threshold="2"
    ms ms_drbd_postfix r_drbd_postfix \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" \
        notify="true" \
        meta target-role="Stopped"
commit

- Pacemaker is started from scratch
- The config above is applied with crm -f <file>, where
<file> contains the config snippet above.

- After that crm_mon shows the following status
----------------------8<-------------------------
Last updated: Mon Aug 26 18:42:47 2013
Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
Stack: cman
Current DC: dis03-test - partition with quorum
Version: 1.1.10-1.el6-9abe687
2 Nodes configured
4 Resources configured


Online: [ dis03-test dis04-test ]

Full list of resources:

 r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
 r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
 Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
     Stopped: [ dis03-test dis04-test ]

Migration summary:
* Node dis04-test:
* Node dis03-test:
----------------------8<-------------------------

cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@dis03-test, 2013-07-24 17:19:24

on both nodes. The drbd resource was previously shut down in a clean state,
so that either node can become the primary.

- Now the weird behaviour when trying to start the drbd resource with
   crm resource start ms_drbd_postfix


Output of crm_mon -1rf
----------------------8<-------------------------
Last updated: Mon Aug 26 18:46:33 2013
Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
Stack: cman
Current DC: dis03-test - partition with quorum
Version: 1.1.10-1.el6-9abe687
2 Nodes configured
4 Resources configured


Online: [ dis03-test dis04-test ]

Full list of resources:

 r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
 r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
 Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
     Slaves: [ dis03-test ]
     Stopped: [ dis04-test ]

Migration summary:
* Node dis04-test:
   r_drbd_postfix: migration-threshold=2 fail-count=2 last-failure='Mon Aug 26 18:46:30 2013'
* Node dis03-test:

Failed actions:
    r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1, status=complete, last-rc-change=Mon Aug 26 18:46:29 2013, queued=1212ms, exec=0ms): unknown error
----------------------8<-------------------------
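In the migration summary above, fail-count has reached migration-threshold (both 2), which would explain why the resource is no longer run on that node. As a rough sketch, the same check can be done mechanically on the crm_mon output. The function name banned_resources is hypothetical, and the parsing assumes the "Migration summary" layout shown above with each entry on a single line.

```shell
#!/bin/sh
# banned_resources (hypothetical helper): reads `crm_mon -1rf` output on stdin
# and prints "<node> <resource>" for every migration summary entry whose
# fail-count has reached its migration-threshold.
banned_resources() {
    awk '
        # Remember which node the following summary entries belong to.
        /^\* Node / { node = $3; sub(/:$/, "", node); next }

        # Summary entry, e.g. "r_drbd_postfix: migration-threshold=2 fail-count=2 ..."
        /migration-threshold=/ && /fail-count=/ {
            rsc = $1; sub(/:$/, "", rsc)
            thr = ""; fc = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^migration-threshold=/) { thr = $i; sub(/^[^=]*=/, "", thr) }
                if ($i ~ /^fail-count=/)          { fc = $i;  sub(/^[^=]*=/, "", fc) }
            }
            # A threshold of 0 means "no limit", so only positive thresholds count.
            if (thr + 0 > 0 && (fc == "INFINITY" || fc + 0 >= thr + 0))
                print node, rsc
        }
    '
}
```

Usage would be along the lines of `crm_mon -1rf | banned_resources`.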

In the log of the drbd agent I find the following
when the promotion request is handled on dis03-test:

----------------------8<-------------------------
++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
0: State change failed: (-2) Need access to UpToDate data
Command 'drbdsetup primary 0' terminated with exit code 17
+ cmd_out=
+ ret=17
+ '[' 17 '!=' 0 ']'
+ ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
+ '[' 2 -lt 2 ']'
+ __OCF_PRIO=err
+ shift
----------------------8<-------------------------

While this works without problems on pacemaker 1.1.8, it doesn't work here.
The error message leads me to assume that there is some kind of
race condition where pacemaker fires the promotion too early.
It probably has something to do with the attributes applied by the
drbd resource agent.
But this is just a guess and I really don't know.
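
One way to probe the race-condition guess would be to look at the DRBD disk state around the time of the promote. The sketch below is only a diagnostic aid, not part of the resource agent. It assumes the "local/peer" format printed by `drbdadm dstate` (for example UpToDate/UpToDate), and both function names are hypothetical. It only looks at the local side, although DRBD may also accept a promotion when an UpToDate peer is connected.

```shell
#!/bin/sh
# local_disk_uptodate (hypothetical helper): succeeds if the local half of a
# "local/peer" disk-state string is UpToDate.
local_disk_uptodate() {
    case "${1%%/*}" in
        UpToDate) return 0 ;;
        *)        return 1 ;;
    esac
}

# wait_for_uptodate (hypothetical helper): runs the given command up to
# <tries> times, one second apart, until its output reports a local
# UpToDate disk. Returns 1 if that never happens.
wait_for_uptodate() {
    tries=$1; shift
    while [ "$tries" -gt 0 ]; do
        # The command is expected to print a "local/peer" disk state.
        if state=$("$@" 2>/dev/null) && local_disk_uptodate "$state"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep 1
        fi
    done
    return 1
}
```

With the setup from this mail, a call could look like `wait_for_uptodate 10 drbdadm -c /usr/local/etc/drbd.conf dstate postfix`. If this only succeeds after a few tries right after the resource is started, that would support the idea that the promote arrives too early.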

ONE ADDITIONAL piece of information: As soon as I do a
resource cleanup on the "defective" node, the master
is promoted as expected. That is, running
   crm resource cleanup r_drbd_postfix dis03-test
results in the following:

----------------------8<-------------------------
Last updated: Mon Aug 26 19:29:38 2013
Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
Stack: cman
Current DC: dis03-test - partition with quorum
Version: 1.1.10-1.el6-9abe687
2 Nodes configured
4 Resources configured


Online: [ dis03-test dis04-test ]

Full list of resources:

 r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
 r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
 Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
     Masters: [ dis03-test ]
     Slaves: [ dis04-test ]

Migration summary:
* Node dis03-test:
* Node dis04-test:
----------------------8<-------------------------



I really hope I can get some attention, as pacemaker 1.1.10
is a milestone for Andrew and drbd from linbit is almost certainly
a building block of many pacemaker based clusters.

Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
DRBD agent log at http://pastebin.com/ceYNEAhH


So, any help is welcome.

Best regards
Andreas Mock





