[ClusterLabs] pcs create master/slave resource doesn't work (Ken Gaillot)

Thu Nov 30 20:36:31 EST 2017

Hi all,

  I am using the ovndb-servers OCF agent [1], which is a multi-state
resource. When I create it (please see my previous email), the monitor
is called only once and the start operation is never called. According to
the table below, that single monitor call returned OCF_NOT_RUNNING.
Should Pacemaker decide to run the start action based on this return
code? Is there any way to find out what the next action will be?
Currently nothing happens in my environment. I have tried almost every
way I know to debug this, with no luck. Could anyone help?
Thank you very much.

Monitor Return Code   Description
OCF_NOT_RUNNING       Stopped
OCF_SUCCESS           Running (Slave)
OCF_RUNNING_MASTER    Running (Master)
OCF_FAILED_MASTER     Failed (Master)
Other                 Failed (Slave)
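
For reference, here is a minimal shell sketch of how the table above maps
onto numeric exit codes. The values 0, 7, 8 and 9 are the standard OCF
exit codes; the function name describe_monitor_rc is made up for
illustration and is not part of the ovndb-servers agent.

```shell
#!/bin/sh
# Standard OCF exit codes relevant to a multi-state monitor.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical helper (not in the agent): translate a monitor exit code
# into the resource state listed in the table above.
describe_monitor_rc() {
    case "$1" in
        "$OCF_NOT_RUNNING")    echo "Stopped" ;;
        "$OCF_SUCCESS")        echo "Running (Slave)" ;;
        "$OCF_RUNNING_MASTER") echo "Running (Master)" ;;
        "$OCF_FAILED_MASTER")  echo "Failed (Master)" ;;
        *)                     echo "Failed (Slave)" ;;
    esac
}

# The monitor in my logs returns 7.
describe_monitor_rc 7
```

So a return value of 7 from the monitor corresponds to the "Stopped" row.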

[1] https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf

Hui.

On Thu, Nov 30, 2017 at 6:39 PM, Hui Xiang <xianghuir at gmail.com> wrote:

> The really weird thing is that the monitor is called only once rather than
> repeatedly as expected. Where should I look to find out why?
>
> On Thu, Nov 30, 2017 at 4:14 PM, Hui Xiang <xianghuir at gmail.com> wrote:
>
>> Thanks very much, Ken, for your helpful information.
>>
>> I am now stuck: I can't see the Pacemaker DC take any further
>> start/promote etc. action on my resource agent, and I found no helpful logs.
>>
>> So my first question is: in what situation will the DC decide to
>> call the start action? Does the monitor operation need to return
>> OCF_SUCCESS? In my case it returns OCF_NOT_RUNNING, and the monitor
>> operation is not called again, which seems wrong to me, as I expected
>> it to be called at regular intervals.
>>
>> The resource agent's monitor logic:
>> The xx_monitor function calls xx_update, and it always hits
>>  "$CRM_MASTER -D;;". What does that usually mean? Will it stop the
>> start operation from being called?
>>
>> ovsdb_server_master_update() {
>>     ocf_log info "ovsdb_server_master_update: $1}"
>>
>>     case $1 in
>>         $OCF_SUCCESS)
>>             $CRM_MASTER -v ${slave_score};;
>>         $OCF_RUNNING_MASTER)
>>             $CRM_MASTER -v ${master_score};;
>>         #*) $CRM_MASTER -D;;
>>     esac
>>     ocf_log info "ovsdb_server_master_update end}"
>> }
>>
>> ovsdb_server_monitor() {
>>     ocf_log info "ovsdb_server_monitor"
>>     ovsdb_server_check_status
>>     rc=$?
>>
>>     ovsdb_server_master_update $rc
>>     ocf_log info "monitor is going to return $rc"
>>     return $rc
>> }
>>
>>
>> Below is my cluster configuration:
>>
>> 1. First, I have a VIP set.
>> [root at node-1 ~]# pcs resource show
>>  vip__management_old (ocf::es:ns_IPaddr2): Started node-1.domain.tld
>>
>> 2. Use pcs to create the ovndb-servers resource and the constraint
>> [root at node-1 ~]# pcs resource create tst-ovndb ocf:ovn:ovndb-servers
>> manage_northd=yes master_ip=192.168.0.2 nb_master_port=6641
>> sb_master_port=6642 master
>>      ([root at node-1 ~]# pcs resource meta tst-ovndb-master notify=true
>>       Error: unable to find a resource/clone/master/group: tst-ovndb-master)
>>      ## This returned an error, so I used the command below instead.
>> [root at node-1 ~]# pcs resource master tst-ovndb-master tst-ovndb
>> notify=true
>> [root at node-1 ~]# pcs constraint colocation add master tst-ovndb-master
>> with vip__management_old
>>
>> 3. pcs status
>> [root at node-1 ~]# pcs status
>>  vip__management_old (ocf::es:ns_IPaddr2): Started node-1.domain.tld
>>  Master/Slave Set: tst-ovndb-master [tst-ovndb]
>>      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>>
>> 4. pcs resource show XXX
>> [root at node-1 ~]# pcs resource show  vip__management_old
>>  Resource: vip__management_old (class=ocf provider=es type=ns_IPaddr2)
>>   Attributes: nic=br-mgmt base_veth=br-mgmt-hapr ns_veth=hapr-m
>> ip=192.168.0.2 iflabel=ka cidr_netmask=24 ns=haproxy gateway=none
>> gateway_metric=0 iptables_start_rules=false iptables_stop_rules=false
>> iptables_comment=default-comment
>>   Meta Attrs: migration-threshold=3 failure-timeout=60
>> resource-stickiness=1
>>   Operations: monitor interval=3 timeout=30 (vip__management_old-monitor-3
>> )
>>               start interval=0 timeout=30 (vip__management_old-start-0)
>>               stop interval=0 timeout=30 (vip__management_old-stop-0)
>> [root at node-1 ~]# pcs resource show tst-ovndb-master
>>  Master: tst-ovndb-master
>>   Meta Attrs: notify=true
>>   Resource: tst-ovndb (class=ocf provider=ovn type=ovndb-servers)
>>    Attributes: manage_northd=yes master_ip=192.168.0.2
>> nb_master_port=6641 sb_master_port=6642
>>    Operations: start interval=0s timeout=30s (tst-ovndb-start-timeout-30s)
>>                stop interval=0s timeout=20s (tst-ovndb-stop-timeout-20s)
>>                promote interval=0s timeout=50s
>> (tst-ovndb-promote-timeout-50s)
>>                demote interval=0s timeout=50s
>> (tst-ovndb-demote-timeout-50s)
>>                monitor interval=30s timeout=20s
>> (tst-ovndb-monitor-interval-30s)
>>                monitor interval=10s role=Master timeout=20s
>> (tst-ovndb-monitor-interval-10s-role-Master)
>>                monitor interval=30s role=Slave timeout=20s
>> (tst-ovndb-monitor-interval-30s-role-Slave)
>>
>>
>> colocation colocation-tst-ovndb-master-vip__management_old-INFINITY inf:
>> tst-ovndb-master:Master vip__management_old:Started
>>
>> 5. I have added logging to every ovndb-servers op. It seems only the
>> monitor op is being called, and nothing is promoted by the Pacemaker DC:
>> <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> ovsdb_server_monitor
>> <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> ovsdb_server_check_status
>> <30>Nov 30 15:22:19 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> return OCFOCF_NOT_RUNNINGG
>> <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> ovsdb_server_master_update: 7}
>> <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> ovsdb_server_master_update end}
>> <30>Nov 30 15:22:20 node-1 ovndb-servers(tst-ovndb)[2980860]: INFO:
>> monitor is going to return 7
>> <30>Nov 30 15:22:20 node-1 ovndb-servers(undef)[2980970]: INFO: metadata
>> exit OCF_SUCCESS}
>>
>> 6. The cluster property:
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.12-a14efad \
>>         cluster-infrastructure=corosync \
>>         no-quorum-policy=ignore \
>>         stonith-enabled=false \
>>         symmetric-cluster=false \
>>         last-lrm-refresh=1511802933
>>
>>
>>
>> Thank you very much for any help.
>> Hui.
>>
>>
>> Date: Mon, 27 Nov 2017 12:07:57 -0600
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: Cluster Labs - All topics related to open-source clustering
>>         welcomed        <users at clusterlabs.org>, jpokorny at redhat.com
>> Subject: Re: [ClusterLabs] pcs create master/slave resource doesn't
>>         work
>> Message-ID: <1511806077.5194.6.camel at redhat.com>
>> Content-Type: text/plain; charset="UTF-8"
>>
>> On Fri, 2017-11-24 at 18:00 +0800, Hui Xiang wrote:
>> > Jan,
>> >
>> >   I very much appreciate your help. I am getting further, but it
>> > still looks very strange.
>> >
>> > 1. To use "debug-promote", I upgraded pacemaker from 1.12 to 1.16, and
>> > pcs to 0.9.160.
>> >
>> > 2. Recreate the resource with the commands below
>> > pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
>> >   master_ip=192.168.0.99 \
>> >   op monitor interval="10s" \
>> >   op monitor interval="11s" role=Master
>> > pcs resource master ovndb_servers-master ovndb_servers \
>> >   meta notify="true" master-max="1" master-node-max="1" clone-max="3" \
>> >   clone-node-max="1"
>> > pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.0.99 \
>> >     op monitor interval=10s
>> > pcs constraint colocation add VirtualIP with master ovndb_servers-master \
>> >   score=INFINITY
>> >
>> > 3. pcs status
>> >  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>> >      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>> >  VirtualIP    (ocf::heartbeat:IPaddr2):       Stopped
>> >
>> > 4. Manually run 'debug-start' on all 3 nodes and 'debug-promote' on one
>> > of the nodes
>> > Run the following on [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>> > # pcs resource debug-start ovndb_servers --full
>> > Run the following on [ node-1.domain.tld ]
>> > # pcs resource debug-promote ovndb_servers --full
>>
>> Before running debug-* commands, I'd unmanage the resource or put the
>> cluster in maintenance mode, so Pacemaker doesn't try to "correct" your
>> actions.
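
A minimal dry-run sketch of that sequence, assuming the resource names
from step 2 above. The run wrapper and the function name
debug_promote_unmanaged are made up for illustration; the commands are
only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: unmanage the master resource, run the debug
# action by hand, then hand control back to Pacemaker.
debug_promote_unmanaged() {
    rsc=$1
    master=$2
    run pcs resource unmanage "$master"
    run pcs resource debug-promote "$rsc" --full
    run pcs resource manage "$master"
}

debug_promote_unmanaged ovndb_servers ovndb_servers-master
```

Putting the whole cluster in maintenance mode would be the alternative
mentioned above; this sketch only covers the unmanage route.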
>>
>> >
>> > 5. pcs status
>> >  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>> >      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>> >  VirtualIP    (ocf::heartbeat:IPaddr2):       Stopped
>> >
>> > 6. However, I have seen that one of the ovndb_servers instances has
>> > indeed been promoted to master, but pcs status still shows all of them
>> > as 'Stopped'. What am I missing?
>>
>> It's hard to tell from these logs. It's possible the resource agent's
>> monitor command is not exiting with the expected status values:
>>
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents
>>
>> One of the nodes will be elected the DC, meaning it coordinates the
>> cluster's actions. The DC's logs will have more "pengine:" messages,
>> with each action that needs to be taken (e.g. "* Start <rsc> <node>").
>>
>> You can look through those actions to see what the cluster decided to
>> do -- whether the resources were ever started, whether any was
>> promoted, and whether any were explicitly stopped.
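
A rough sketch of that log check, assuming the policy-engine lines
contain the word "pengine" and name the action with a capitalised verb
as in the example above. The function name pengine_actions and the log
path in the comment are assumptions; the actual log location depends on
the distribution and configuration.

```shell
#!/bin/sh
# Hypothetical helper: read log text on stdin and keep only the
# policy-engine lines that mention a resource action.
pengine_actions() {
    grep 'pengine' | grep -E 'Start|Stop|Promote|Demote'
}

# Example use on the DC (the path is an assumption):
#   pengine_actions < /var/log/pacemaker.log
```

Run on the DC node, this should narrow the log down to the scheduling
decisions for each transition.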
>>
>>
>> >  >  stderr: + 17:45:59: ocf_log:327: __OCF_MSG='ovndb_servers:
>> > Promoting node-1.domain.tld as the master'
>> >  >  stderr: + 17:45:59: ocf_log:329: case "${__OCF_PRIO}" in
>> >  >  stderr: + 17:45:59: ocf_log:333: __OCF_PRIO=INFO
>> >  >  stderr: + 17:45:59: ocf_log:338: '[' INFO = DEBUG ']'
>> >  >  stderr: + 17:45:59: ocf_log:341: ha_log 'INFO: ovndb_servers:
>> > Promoting node-1.domain.tld as the master'
>> >  >  stderr: + 17:45:59: ha_log:253: __ha_log 'INFO: ovndb_servers:
>> > Promoting node-1.domain.tld as the master'
>> >  >  stderr: + 17:45:59: __ha_log:185: local ignore_stderr=false
>> >  >  stderr: + 17:45:59: __ha_log:186: local loglevel
>> >  >  stderr: + 17:45:59: __ha_log:188: '[' 'xINFO: ovndb_servers:
>> > Promoting node-1.domain.tld as the master' = x--ignore-stderr ']'
>> >  >  stderr: + 17:45:59: __ha_log:190: '[' none = '' ']'
>> >  >  stderr: + 17:45:59: __ha_log:192: tty
>> >  >  stderr: + 17:45:59: __ha_log:193: '[' x = x0 -a x = xdebug ']'
>> >  >  stderr: + 17:45:59: __ha_log:195: '[' false = true ']'
>> >  >  stderr: + 17:45:59: __ha_log:199: '[' '' ']'
>> >  >  stderr: + 17:45:59: __ha_log:202: echo 'INFO: ovndb_servers:
>> > Promoting node-1.domain.tld as the master'
>> >  >  stderr: INFO: ovndb_servers: Promoting node-1.domain.tld as the
>> > master
>> >  >  stderr: + 17:45:59: __ha_log:204: return 0
>> >  >  stderr: + 17:45:59: ovsdb_server_promote:378:
>> > /usr/sbin/crm_attribute --type crm_config --name OVN_REPL_INFO -s
>> > ovn_ovsdb_master_server -v node-1.domain.tld
>> >  >  stderr: + 17:45:59: ovsdb_server_promote:379:
>> > ovsdb_server_master_update 8
>> >  >  stderr: + 17:45:59: ovsdb_server_master_update:214: case $1 in
>> >  >  stderr: + 17:45:59: ovsdb_server_master_update:218:
>> > /usr/sbin/crm_master -l reboot -v 10
>> >  >  stderr: + 17:45:59: ovsdb_server_promote:380: return 0
>> >  >  stderr: + 17:45:59: 458: rc=0
>> >  >  stderr: + 17:45:59: 459: exit 0
>> >
>> >
>> > On 23/11/17 23:52 +0800, Hui Xiang wrote:
>> > > I am working on HA with 3 nodes, with the configuration below:
>> > >
>> > > """
>> > > pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
>> > >   master_ip=168.254.101.2 \
>> > >   op monitor interval="10s" \
>> > >   op monitor interval="11s" role=Master
>> > > pcs resource master ovndb_servers-master ovndb_servers \
>> > >   meta notify="true" master-max="1" master-node-max="1" clone-max="3" \
>> > >   clone-node-max="1"
>> > > pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=168.254.101.2 \
>> > >     op monitor interval=10s
>> > > pcs constraint order promote ovndb_servers-master then VirtualIP
>> > > pcs constraint colocation add VirtualIP with master ovndb_servers-master \
>> > >   score=INFINITY
>> > > """
>> >
>> > (Out of curiosity, this looks like a mix of output from
>> > pcs config export pcs-commands [or clufter cib2pcscmd -s]
>> > and manual editing.  Is this a good guess?)
>> > It's the output of "pcs status".
>> >
>> > >   However, after setting it up as above, no master is selected and
>> > > all instances are stopped. According to the pacemaker log, node-1 has
>> > > been chosen as the master. I am confused about what is wrong. Can
>> > > anybody help? It would be very much appreciated.
>> > >
>> > >
>> > >  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>> > >      Stopped: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
>> > >  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>> > >
>> > >
>> > > # pacemaker log
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++ /cib/configuration/resources:  <primitive class="ocf"
>> > > id="ovndb_servers" provider="ovn" type="ovndb-servers"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                  <instance_attributes
>> > > id="ovndb_servers-instance_attributes">
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <nvpair
>> > > id="ovndb_servers-instance_attributes-master_ip" name="master_ip"
>> > > value="168.254.101.2"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                  </instance_attributes>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                  <operations>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-start-timeout-30s" interval="0s" name="start"
>> > > timeout="30s"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-stop-timeout-20s" interval="0s" name="stop"
>> > > timeout="20s"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-promote-timeout-50s" interval="0s" name="promote"
>> > > timeout="50s"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-demote-timeout-50s" interval="0s" name="demote"
>> > > timeout="50s"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-monitor-interval-10s" interval="10s" name="monitor"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                    <op
>> > > id="ovndb_servers-monitor-interval-11s-role-Master" interval="11s"
>> > > name="monitor" role="Master"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                  </operations>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++                                </primitive>
>> > >?
>> > > Nov 23 23:06:03 [665249] node-1.domain.tld      attrd:     info:
>> > > attrd_peer_update: Setting master-ovndb_servers[node-1.domain.tld]:
>> > > (null) -> 5 from node-1.domain.tld
>> >
>> > If it's probable your ocf:ovn:ovndb-servers agent in master mode can
>> > run something like "attrd_updater -n master-ovndb_servers -U 5", then
>> > it was indeed launched OK, and if it does not continue to run as
>> > expected, there may be a problem with the agent itself.
>> >
>> > no change.
>> > You can try running "pcs resource debug-promote ovndb_servers --full"
>> > to examine the execution details (assuming the agent responds to the
>> > OCF_TRACE_RA=1 environment variable, which is what shell-based
>> > agents built on top of the ocf-shellfuncs sourceable shell library from
>> > the resource-agents project, hence incl. also the agents it ships,
>> > customarily do).
>> > Yes, thanks, it's helpful.
>> >
>> > > Nov 23 23:06:03 [665251] node-1.domain.tld       crmd:   notice:
>> > > process_lrm_event: Operation ovndb_servers_monitor_0: ok
>> > > (node=node-1.domain.tld, call=185, rc=0, cib-update=88, confirmed=true)
>> > > <29>Nov 23 23:06:03 node-1 crmd[665251]:   notice: process_lrm_event:
>> > > Operation ovndb_servers_monitor_0: ok (node=node-1.domain.tld, call=185,
>> > > rc=0, cib-update=88, confirmed=true)
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: Diff: --- 0.630.2 2
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: Diff: +++ 0.630.3 (null)
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: +  /cib:  @num_updates=3
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_perform_op: ++
>> > > /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']:
>> > > <nvpair id="status-1-master-ovndb_servers" name="master-ovndb_servers"
>> > > value="5"/>
>> > > Nov 23 23:06:03 [665246] node-1.domain.tld        cib:     info:
>> > > cib_process_request: Completed cib_modify operation for section status: OK
>> > > (rc=0, origin=node-3.domain.tld/attrd/80, version=0.630.3)
>> >
>> > Also depends if there's anything interesting after this point...
>> >
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
>
>