[ClusterLabs] Antw: [EXT] How to stop removed resources when replacing cib.xml via cibadmin or crm_shadow

Igor Tverdovskiy igor.tverdovskiy.pe at gmail.com
Fri Oct 2 14:35:23 EDT 2020


On Thu, Oct 1, 2020 at 5:55 PM Ken Gaillot <kgaillot at redhat.com> wrote:

> There's no harm on the Pacemaker side in doing so.
>
> A resource that's running but removed from the configuration is what
> Pacemaker calls an "orphan". By default (the stop-orphan-resources
> cluster property) it will try to stop these. Pacemaker keeps the set of
> parameters that a resource was started with in memory, so it doesn't
> need the now-removed configuration to perform the stop. So, the
> "ORPHANED" part of this is normal and appropriate.
>
> The problem in this particular case is the "FAILED ... (blocked)".
> Removing the configuration shouldn't cause the resource to fail, and
> something is blocking the stop. You should be able to see in the failed
> action section of the status, or in the logs, what failed and why it's
> blocked. My guess is the stop itself failed, in which case you'd need
> to investigate why that happened.
>

Hi Ken,

As always, thanks a lot for pointing me in the right direction!
I have dug through the logs, but something illogical is happening. Maybe you can
shed some light on it?

Just in case: I have a pretty old Pacemaker (1.1.15-11.el7). The version was frozen
because of changes in the stickiness=-1 attribute handling logic. I'm considering an
update to a newer stable version later on, but at the moment I have to deal with this version.

First of all "stop-orphan-resources" was not set and thus by default should
be true, but still I see a strange message
> Cluster configured not to stop active orphans. vip-10.0.0.115 must be
stopped manually on tt738741-ip2

I even explicitly set "stop-orphan-resources" to true and verified it with a query:
> sudo crm_attribute --type crm_config --name stop-orphan-resources --query
> scope=crm_config  name=stop-orphan-resources value=true
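
In case it is useful, setting it explicitly looked along these lines (a minimal sketch based on how I understand crm_attribute; the syntax may differ slightly on 1.1.15):
```
# Set the cluster property explicitly in the crm_config section
sudo crm_attribute --type crm_config --name stop-orphan-resources --update true

# Read it back to confirm the value the policy engine will see
sudo crm_attribute --type crm_config --name stop-orphan-resources --query
```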

To be fair, Pacemaker does still try to stop the ORPHANED resources, i.e.:
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: native_color:    Stopping orphan resource vip-10.0.0.115

To my mind the issue is the following:
> Clearing failcount for monitor on vip-10.0.0.115, tt738741-ip2 failed and now resource parameters have changed.

I suspect this operation clears the resource parameters that were used at start.

The error is pretty straightforward:
> IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP address (the ip parameter) is mandatory

As I understand it, this means the ip parameter had already vanished by the time the
"stop" action ran.

It looks like a bug, but who knows.
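
For context, the resource itself is an ordinary IPaddr2 primitive. A rough sketch of its definition in crmsh syntax (the parameter values here are illustrative, not my exact config):
```
# Hypothetical definition of the VIP that later becomes an orphan.
# "ip" is the mandatory instance attribute the agent complains about at stop time;
# once the primitive is removed from the configuration, only Pacemaker's cached
# copy of these parameters is left for the stop action to use.
crm configure primitive vip-10.0.0.115 ocf:heartbeat:IPaddr2 \
    params ip=10.0.0.115 cidr_netmask=24 \
    op monitor interval=10s
```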

Trimmed logs below; if required, I can provide the full log, cib.xml, etc.
```
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:  warning: process_rsc_state:       Cluster configured not to stop active orphans. vip-10.0.0.115 must be stopped manually on tt738741-ip2
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: native_add_running:      resource vip-10.0.0.115 isn't managed
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: native_add_running:      resource haproxy-10.0.0.115 isn't managed
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: determine_op_status:     Operation monitor found resource vip-10.0.0.115 active on tt738741-ip2
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: check_operation_expiry:  Clearing failcount for monitor on vip-10.0.0.115, tt738741-ip2 failed and now resource parameters have changed.
...
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:  warning: process_rsc_state:       Detected active orphan vip-10.0.0.115 running on tt738741-ip2
...
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: native_print:    vip-10.0.0.115  (ocf::heartbeat:IPaddr2):         ORPHANED Started tt738741-ip2
...
Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info: native_color:    Stopping orphan resource vip-10.0.0.115
...
Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:     info: log_execute:     executing - rsc:vip-10.0.0.115 action:stop call_id:4358
Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info: te_crm_command:  Executing crm-event (1): clear_failcount on tt738741-ip2
Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info: process_lrm_event:       Result of monitor operation for vip-10.0.0.115 on tt738741-ip2: Cancelled | call=4323 key=vip-10.0.0.115_monitor_10000 confirmed=true
Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info: handle_failcount_op:     Removing failcount for vip-10.0.0.115
...
IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP address (the ip parameter) is mandatory
Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:   notice: operation_finished:      vip-10.0.0.115_stop_0:21351:stderr [ ocf-exit-reason:IP address (the ip parameter) is mandatory ]
Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:     info: log_finished:    finished - rsc:vip-10.0.0.115 action:stop call_id:4358 pid:21351 exit-code:6 exec-time:124ms queue-time:0ms
Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:   notice: process_lrm_event:       Result of stop operation for vip-10.0.0.115 on tt738741-ip2: 6 (not configured) | call=4358 key=vip-10.0.0.115_stop_0 confirmed=true cib-update=20532
Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:   notice: process_lrm_event:       tt738741-ip2-vip-10.0.0.115_stop_0:4358 [ ocf-exit-reason:IP address (the ip parameter) is mandatory\n ]
```

To address the point about replacing:
>
> Never replace the status section; it's continually updated by the live
> cluster, so you could be re-uploading old information that leads to
> incorrect actions. Replacing just the configuration section is the way
> to go.
>
I'm just trying all possible options. In general I use the old status section only
for crm_simulate -LS, in order to predict actions.
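
To be concrete, a minimal sketch of how I use it (the file name is just an example):
```
# Save the current CIB (configuration + status) to a file
cibadmin --query > saved_cib.xml

# Predict what the cluster would do with that saved input,
# instead of connecting to the live CIB
crm_simulate --simulate --xml-file saved_cib.xml

# Or run directly against the live cluster:
crm_simulate -LS
```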


>
> cibadmin --replace or crm_shadow should work fine for replacing the
> configuration, but both crm and pcs allow batching commands in a file
> and then applying them all at once, so it may not be necessary.
>
Yes, thanks, I have tried this approach and it is way better than executing commands
one by one, but it still requires additional logic to build those batch commands,
i.e. to figure out which resources should be added, removed, updated, etc.
Also it does not rule out getting stuck on --wait (which happens when I try to
execute several commands in series). I haven't found the root cause, but it
reproduces from time to time, and then further transactions are not possible.
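
For reference, the kind of batch I mean looks roughly like this with crmsh (the file contents here are hypothetical):
```
# commands.crm -- a batch of crmsh commands generated by my tooling (hypothetical)
cat > commands.crm <<'EOF'
configure
delete vip-10.0.0.115
primitive vip-10.0.0.116 ocf:heartbeat:IPaddr2 params ip=10.0.0.116 op monitor interval=10s
commit
EOF

# Apply the whole batch in a single crm invocation
crm -f commands.crm
```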

With XML I don't have to think about this; all I have to do is prepare the correct
XML and check it for syntax errors.
In my case it is just the ideal way to configure a cluster. I lived for ages with
crm configure and am now trying to optimize cluster configuration. Moreover, it is
just much faster than doing this command by command and waiting for each to take
effect.
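
Concretely, the workflow I am aiming for is a sketch like this (file names are just examples, and the option spellings are from the man pages as I read them; they may differ slightly by version):
```
# Dump only the configuration section of the live CIB
cibadmin --query --scope configuration > configuration.xml

# ... regenerate/edit configuration.xml with my own tooling ...

# Check the candidate configuration for errors before applying it
crm_verify --xml-file configuration.xml

# Replace only the configuration section; the status section stays untouched
cibadmin --replace --scope configuration --xml-file configuration.xml
```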