[ClusterLabs] Antw: [EXT] How to stop removed resources when replacing cib.xml via cibadmin or crm_shadow

Fri Oct 2 16:35:44 EDT 2020

On Fri, 2020-10-02 at 21:35 +0300, Igor Tverdovskiy wrote:
> 
> 
> On Thu, Oct 1, 2020 at 5:55 PM Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > There's no harm on the Pacemaker side in doing so.
> > 
> > A resource that's running but removed from the configuration is
> > what
> > Pacemaker calls an "orphan". By default (the stop-orphan-resources
> > cluster property) it will try to stop these. Pacemaker keeps the
> > set of
> > parameters that a resource was started with in memory, so it
> > doesn't
> > need the now-removed configuration to perform the stop. So, the
> > "ORPHANED" part of this is normal and appropriate.
> > 
> > The problem in this particular case is the "FAILED ... (blocked)".
> > Removing the configuration shouldn't cause the resource to fail,
> > and
> > something is blocking the stop. You should be able to see in the
> > failed
> > action section of the status, or in the logs, what failed and why
> > it's
> > blocked. My guess is the stop itself failed, in which case you'd
> > need
> > to investigate why that happened.
> 
> Hi Ken,
> 
> As always, thanks a lot for pointing me in the right direction!
> I have dug through the logs, but something illogical is happening.
> Maybe you can shed a bit of light?
> 
> Just in case: I have a pretty old Pacemaker (Pacemaker 1.1.15-11.el7).
> I froze the version because of changes in the handling logic of the
> stickiness=-1 attribute. I am considering an update to a newer stable
> version later on, but at the moment I have to deal with this version.
> 
> First of all "stop-orphan-resources" was not set and thus by default
> should be true, but still I see a strange message
> > Cluster configured not to stop active orphans. vip-10.0.0.115 must
> be stopped manually on tt738741-ip2
> 
> I even explicitly set "stop-orphan-resources" to true
> > sudo crm_attribute --type crm_config --name stop-orphan-resources
> --query
> scope=crm_config  name=stop-orphan-resources value=true
> 
> To be fair, Pacemaker still tries to remove ORPHANED resources,
> i.e.
> > Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> native_color:    Stopping orphan resource vip-10.0.0.115
> 
> To my mind the issue is the following:
> > Clearing failcount for monitor on vip-10.0.0.115, tt738741-ip2
> failed and now resource parameters have changed.
> 
> I suppose that this operation clears parameters of the resource which
> were used at start
> 
> The error is pretty straightforward:
> > IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP
> address (the ip parameter) is mandatory
> 
> As I understand it, this means the ip parameter had already vanished
> by the time of the "stop" action.
> 
> It looks like a bug, but who knows.
> 
> Trimmed logs follow; if required, I can provide the full log, cib.xml, etc.
> ```
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:  warning:
> process_rsc_state:       Cluster configured not to stop active
> orphans. vip-10.0.0.115 must be stopped manually on tt738741-ip2

It'll log this message anytime the orphan is unmanaged, not just when
stop-orphan-resources is false, so I'll make a note to change the
message.

Did the resource happen to be unmanaged via the configuration at the
time it was removed? Obviously it couldn't be unmanaged via its own
(now gone) configuration, but maybe by resource defaults or maintenance
mode?
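
One quick way to check, as a rough sketch only: the helper below greps CIB XML for the two settings mentioned above (is-managed and maintenance-mode). The function name find_unmanaged_settings is made up for illustration, and the line-based grep assumes the one-element-per-line layout that cibadmin prints; it is not a real XML parser.

```shell
# List settings in a CIB that could leave resources unmanaged.
# Reads CIB XML from the file(s) given as arguments, or from stdin if none.
# NOTE: find_unmanaged_settings is a hypothetical helper, not a Pacemaker tool.
find_unmanaged_settings() {
    # Print every line that sets is-managed or maintenance-mode, whatever
    # section it is in (cluster options, resource defaults, meta attributes).
    cat "$@" | grep -E 'name="(is-managed|maintenance-mode)"' \
        || echo "no is-managed or maintenance-mode settings found"
}

# Example on a live cluster node (needs Pacemaker's cibadmin):
#   cibadmin --query --scope configuration | find_unmanaged_settings
# Example against a saved copy of the CIB:
#   find_unmanaged_settings cib.xml
```

Any matching line with value="false" for is-managed, or value="true" for maintenance-mode, would be worth a closer look.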

> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> native_add_running:      resource vip-10.0.0.115 isn't managed
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> native_add_running:      resource haproxy-10.0.0.115 isn't managed
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> determine_op_status:     Operation monitor found resource vip-
> 10.0.0.115 active on tt738741-ip2
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> check_operation_expiry:  Clearing failcount for monitor on vip-
> 10.0.0.115, tt738741-ip2 failed and now resource parameters have
> changed.
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:  warning:
> process_rsc_state:       Detected active orphan vip-10.0.0.115
> running on tt738741-ip2
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> native_print:    vip-10.0.0.115  (ocf::heartbeat:IPaddr2):        
> ORPHANED Started tt738741-ip2
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.ops    pengine:     info:
> native_color:    Stopping orphan resource vip-10.0.0.115

There should be a "saving inputs" log message with a filename shortly
after this. If you could email me that, I could check whether there are
any issues in the scheduler side of things.
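
If it helps to locate that file, here is a small sketch. The scheduler inputs normally have names like pe-input-NNN.bz2 (or pe-warn-/pe-error-), and since the exact wording of the log message can differ between versions, this matches on the file name rather than on the message text. The function name list_pe_inputs and the log path in the usage comment are assumptions; use whatever log file your cluster actually writes to.

```shell
# Print the scheduler input files named in a Pacemaker log, one per line.
# NOTE: list_pe_inputs is a hypothetical helper, not a Pacemaker tool.
list_pe_inputs() {
    # -o prints only the matching path; sort -u drops repeated mentions.
    grep -Eo '/[^ ]*pe-(input|warn|error)-[0-9]+\.bz2' "$1" | sort -u
}

# Example (the log location is an assumption and varies by distribution):
#   list_pe_inputs /var/log/cluster/corosync.log
# A listed file can then be replayed offline, without touching the cluster:
#   crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-NNN.bz2
```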

> ...
> Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:     info:
> log_execute:     executing - rsc:vip-10.0.0.115 action:stop
> call_id:4358
> Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info:
> te_crm_command:  Executing crm-event (1): clear_failcount on
> tt738741-ip2
> Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info:
> process_lrm_event:       Result of monitor operation for vip-
> 10.0.0.115 on tt738741-ip2: Cancelled | call=4323 key=vip-
> 10.0.0.115_monitor_10000 confirmed=true
> Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:     info:
> handle_failcount_op:     Removing failcount for vip-10.0.0.115
> ...
> IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP address
> (the ip parameter) is mandatory
> Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:   notice:
> operation_finished:      vip-10.0.0.115_stop_0:21351:stderr [ ocf-
> exit-reason:IP address (the ip parameter) is mandatory ]
> Oct 02 09:20:56 [21763] tt738741-ip2.ops       lrmd:     info:
> log_finished:    finished - rsc:vip-10.0.0.115 action:stop
> call_id:4358 pid:21351 exit-code:6 exec-time:124ms queue-time:0ms
> Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:   notice:
> process_lrm_event:       Result of stop operation for vip-10.0.0.115
> on tt738741-ip2: 6 (not configured) | call=4358 key=vip-
> 10.0.0.115_stop_0 confirmed=true cib-update=20532
> Oct 02 09:20:56 [21766] tt738741-ip2.ops       crmd:   notice:
> process_lrm_event:       tt738741-ip2-vip-10.0.0.115_stop_0:4358 [
> ocf-exit-reason:IP address (the ip parameter) is mandatory\n ]
> 
> ```
> 
> > To address the replacing:
> > 
> > Never replace the status section; it's continually updated by the
> > live
> > cluster, so you could be re-uploading old information that leads to
> > incorrect actions. Replacing just the configuration section is the
> > way
> > to go.
> 
> I'm just trying all possible options. But in general I use the old
> status section only for crm_simulate -LS
> in order to predict actions
>  
> > cibadmin --replace or crm_shadow should work fine for replacing the
> > configuration, but both crm and pcs allow batching commands in a
> > file
> > and then applying them all at once, so it may not be necessary.
> 
> Yes, thanks, I have tried this approach and it is way better than
> executing commands one by one, but it still
> requires additional logic to form those batch commands, i.e. to find
> out which resources should be
> added, removed, updated, etc.
> Also, it does not exclude the possibility of getting stuck on --wait
> (it happens when I try to execute several commands in series).
> I haven't found the root cause, but it reproduces from time to time,
> and further transactions are then not possible.
> 
> With XML I don't have to think about this; all I have to do is
> prepare the correct XML and check it for syntax errors.
> In my case it is just the ideal way to configure a cluster. I
> lived for ages with crm configure and am now trying to
> optimize cluster configuration. Moreover, it is much faster
> than doing this with commands and waiting for each one to
> take effect.

Makes sense, that should be fine.
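
For the configuration-only replace discussed above, a minimal sketch might look like the following. The function name replace_cib_configuration and the ".backup" file naming are made up for illustration, and the sketch assumes the new XML file has <configuration> as its root element, since only that section is replaced and the status section is left to the live cluster.

```shell
# Replace only the configuration section of the live CIB from an XML file.
# NOTE: replace_cib_configuration and the .backup suffix are hypothetical.
replace_cib_configuration() {
    new_config=$1
    # Refuse to continue if the new configuration cannot be read.
    [ -r "$new_config" ] || { echo "cannot read $new_config" >&2; return 2; }
    # Save the current configuration section first, for a manual rollback.
    cibadmin --query --scope configuration > "${new_config}.backup" || return 1
    # Upload the new configuration; the status section is not touched.
    cibadmin --replace --scope configuration --xml-file "$new_config"
}

# Example, on a cluster node:
#   replace_cib_configuration new-configuration.xml
```

Rolling back would then be the same cibadmin --replace call pointed at the saved .backup file.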
-- 
Ken Gaillot <kgaillot at redhat.com>