[Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb

Elmar Marschke elmar.marschke at schenker.at
Wed Aug 21 10:20:47 EDT 2013


Am 19.08.2013 16:25, schrieb Vladislav Bogdanov:
> 16.08.2013 16:04, Elmar Marschke wrote:
>> Hi all,
>>
>> I'm working on a two-node pacemaker cluster with dual-primary drbd and
>> ocfs2.
>>
>> Dual-primary drbd and ocfs2 WITHOUT pacemaker work fine (mounting,
>> reading, writing, everything...).
>
> ocfs2 uses its own clustering stack by default.
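>
> For example, you can check which cluster stack a volume was formatted
> for, and (untested sketch, the device path is just an illustration)
> switch it to the pacemaker stack while it is unmounted everywhere:
>
>    # show known ocfs2 volumes; newer ocfs2-tools also print the stack
>    mounted.ocfs2 -d
>
>    # rewrite the on-disk cluster stack to match the running one
>    tunefs.ocfs2 --update-cluster-stack /dev/drbd0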
>
>>
>> When I try to make this work in pacemaker, there seems to be a problem
>> starting the o2cb resource.
>>
>> My (already simplified) configuration is:
>> -----------------------------------------
>> node poc1 \
>>      attributes standby="off"
>> node poc2 \
>>      attributes standby="off"
>> primitive res_dlm ocf:pacemaker:controld \
>>      op monitor interval="120"
>> primitive res_drbd ocf:linbit:drbd \
>>      params drbd_resource="r0" \
>>      op stop interval="0" timeout="100" \
>>      op start interval="0" timeout="240" \
>>      op promote interval="0" timeout="90" \
>>      op demote interval="0" timeout="90" \
>>      op notify interval="0" timeout="90" \
>>      op monitor interval="40" role="Slave" timeout="20" \
>>      op monitor interval="20" role="Master" timeout="20"
>> primitive res_o2cb ocf:pacemaker:o2cb \
>>      op monitor interval="60"
>> ms ms_drbd res_drbd \
>>      meta notify="true" master-max="2" master-node-max="1" \
>>      target-role="Started"
>> property $id="cib-bootstrap-options" \
>>      no-quorum-policy="ignore" \
>>      stonith-enabled="false" \
>>      dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>      cluster-infrastructure="openais" \
>>      expected-quorum-votes="2" \
>>      last-lrm-refresh="1376574860"
>
> Side note: you need to run both dlm and o2cb as clones, and group them
> (either with "group" or with a pair of colocation/order statements), so
> that ocfs2_controld is only started once dlm_controld is already
> running. You probably already tried that, but do not forget the last
> part; see the sketch below.
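>
> Untested sketch in crm shell syntax (resource names taken from your
> config; the group/clone/constraint ids are just examples):
>
> group g_locking res_dlm res_o2cb
> clone cl_locking g_locking \
>      meta interleave="true"
> colocation col_locking_on_drbd inf: cl_locking ms_drbd:Master
> order ord_drbd_before_locking inf: ms_drbd:promote cl_locking:start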
>
>>
>>
>> First error message in corosync.log, as far as I can identify it:
>> ----------------------------------------------------------------
>> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk:
>> no process found
>> [ other stuff ]
>> lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk:
>> no process found
>> [ other stuff ]
>>   lrmd: [5547]: info: RA output: (res_o2cb:start:stderr)
>> 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
>>
>> (
>> You can find the whole corosync logfile (from starting corosync on
>> node 1 until after the resources have been started) at:
>> http://www.marschke.info/corosync_drei.log
>> )
>>
>> syslog shows:
>> -------------
>> ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not exist
>
> How exactly did you start the corosync process? As "corosync" or as
> "openais"? The background is that the CKPT service is not loaded by
> corosync by default; it is only loaded when the stack is started via
> the openais init script. You may want to look at that script for the
> details.
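>
> If you want CKPT with a plain "corosync" start instead, the openais
> services can, as far as I remember, be loaded explicitly; something
> like this snippet (assuming openais is installed, untested) in
> /etc/corosync/service.d/ or corosync.conf should do:
>
> service {
>         name: openais_ckpt
>         ver: 0
> }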

Hello Vladislav,

thanks for this information. I started it as "corosync2". Just out of
interest, do you know what "CKPT" stands for? Anyway, I now think this
log message isn't really relevant anymore, because my cluster is running
fine (apart from another "little" issue, but that may be more related to
the virtual machine I'm currently running as a resource on the cluster -
I have to research that further...).

regards
e.

>>
>>
>> Output of crm_mon:
>> ------------------
>> ============
>> Stack: openais
>> Current DC: poc1 - partition WITHOUT quorum
>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>> 2 Nodes configured, 2 expected votes
>> 4 Resources configured.
>> ============
>>
>> Online: [ poc1 ]
>> OFFLINE: [ poc2 ]
>>
>>   Master/Slave Set: ms_drbd [res_drbd]
>>       Masters: [ poc1 ]
>>       Stopped: [ res_drbd:1 ]
>>   res_dlm    (ocf::pacemaker:controld):    Started poc1
>>
>> Migration summary:
>> * Node poc1:
>>     res_o2cb: migration-threshold=1000000 fail-count=1000000
>>
>> Failed actions:
>>      res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete): unknown
>> error
>>
>> ---------------------------------------------------------------------
>> This is the situation after a reboot of node poc1. For simplification I
>> left pacemaker / corosync stopped on the second node, and I have
>> already removed the group and clone resources that dlm and o2cb had
>> previously been in (the errors showed up there as well).
>>
>> Is my configuration of the resource agents correct?
>> I checked using "ra meta ...", but as far as I can tell everything is OK.
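>> For example, for the o2cb agent:
>>
>> root at poc1:~# crm ra meta ocf:pacemaker:o2cb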
>>
>> Is some piece of software missing?
>> dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are
>> available, and I even created additional links in /usr/sbin:
>> root at poc1:~# which ocfs2_controld.pcmk
>> /usr/sbin/ocfs2_controld.pcmk
>> root at poc1:~# which dlm_controld.pcmk
>> /usr/sbin/dlm_controld.pcmk
>> root at poc1:~#
>>
>> I already googled but couldn't find anything useful. Thanks for any
>> hints... :)
>>
>> kind regards
>> elmar
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org