[ClusterLabs] Antw: [EXT] Q: placement-strategy=balanced
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Jan 19 03:32:50 EST 2021
>>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> schrieb am 15.01.2021
um
09:36 in Nachricht <60015410020000A10003E392 at gwsmtp.uni-regensburg.de>:
> Hi!
>
> The cluster I'm configuring (SLES15 SP2) fenced a node last night. I'm still
> unsure what exactly caused the fencing, but looking at the logs I found this
> "action plan" that led to fencing:
I think I found the reason for fencing: I had renamed a VM, but kept the
UUID:
Jan 14 20:05:26 h19 libvirtd[4361]: operation failed: domain 'test-jeos' is already defined with uuid 9a0f9ea5-a587-4a99-be44-bce079199c12
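
(For reference, such a clash can be spotted by comparing the existing domain's
UUID with the <uuid> in the definition being imported, e.g. something like

    # virsh domuuid test-jeos
    9a0f9ea5-a587-4a99-be44-bce079199c12

A renamed guest needs a UUID of its own before libvirt will define it.)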
>
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos1 ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos2 ( h19 -> h16 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos3 ( h16 -> h18 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos4 ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos1 ( h18 -> h19 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2 ( h19 -> h16 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3 ( h16 -> h18 )
> Jan 14 20:05:12 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4 ( h18 -> h19 )
>
> Those "cron_snap" resources depend on the corresponding xen resources
> (colocation).
> Having 4 resources to be distributed equally to three nodes seems to trigger
> that problem.
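>
> (Per VM the dependency is a simple colocation, roughly like this; the
> constraint id is made up for the sketch:
>
>   colocation col_cron_snap_test-jeos1 inf: prm_cron_snap_test-jeos1 prm_xen_test-jeos1
>
> i.e. each cron_snap resource must run on the node hosting "its" VM, so whenever
> the VM is placed elsewhere, the cron_snap resource has to move with it.)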
>
> After the fencing, the action plan was:
>
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos2 ( h16 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Move prm_cron_snap_test-jeos4 ( h19 -> h16 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start prm_cron_snap_test-jeos1 ( h18 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Start prm_cron_snap_test-jeos3 ( h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Recover prm_xen_test-jeos1 ( h19 -> h18 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos2 ( h16 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos3 ( h18 -> h19 )
> Jan 14 20:05:26 h19 pacemaker-schedulerd[4803]: notice: * Migrate prm_xen_test-jeos4 ( h19 -> h16 )
>
> ...some more recovery actions like that...
>
> Currently h18 has two VMs, while the other two nodes have one VM each.
>
> Before adding those "cron_snap" resources, I had not seen such
> "rebalancing".
>
> The rebalancing was triggered by this ruleset, present in every Xen resource:
>
> meta 1: resource-stickiness=0 \
> meta 2: rule 0: date spec hours=7-19 weekdays=1-5 resource-stickiness=1000
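>
> (In context, each Xen primitive looks roughly like this; params are shortened
> and the xmfile path is just a placeholder, so take it as a sketch:
>
>   primitive prm_xen_test-jeos1 ocf:heartbeat:Xen \
>     params xmfile="/etc/xen/vm/test-jeos1" \
>     utilization utl_ram=2048 utl_cpu=20 \
>     meta 1: resource-stickiness=0 \
>     meta 2: rule 0: date spec hours=7-19 weekdays=1-5 resource-stickiness=1000
>
> i.e. stickiness is 1000 during working hours (Mon-Fri 7-19) and 0 otherwise,
> which is exactly when the cluster is free to rebalance.)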
>
> At the moment the related scores (crm_simulate -LUs) look like this (filtered
> and re-ordered):
>
> Original: h16 capacity: utl_ram=231712 utl_cpu=440
> Original: h18 capacity: utl_ram=231712 utl_cpu=440
> Original: h19 capacity: utl_ram=231712 utl_cpu=440
>
> Remaining: h16 capacity: utl_ram=229664 utl_cpu=420
> Remaining: h18 capacity: utl_ram=227616 utl_cpu=400
> Remaining: h19 capacity: utl_ram=229664 utl_cpu=420
>
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h18: 1000
> pcmk__native_allocate: prm_xen_test-jeos1 allocation score on h19: -INFINITY
> native_assign_node: prm_xen_test-jeos1 utilization on h18: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h18: 1000
> pcmk__native_allocate: prm_xen_test-jeos2 allocation score on h19: 0
> native_assign_node: prm_xen_test-jeos2 utilization on h18: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h16: 0
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h18: 0
> pcmk__native_allocate: prm_xen_test-jeos3 allocation score on h19: 1000
> native_assign_node: prm_xen_test-jeos3 utilization on h19: utl_ram=2048 utl_cpu=20
>
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h16: 1000
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h18: 0
> pcmk__native_allocate: prm_xen_test-jeos4 allocation score on h19: 0
> native_assign_node: prm_xen_test-jeos4 utilization on h16: utl_ram=2048 utl_cpu=20
>
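> (The capacities shown come from static utilization attributes on the nodes
> together with placement-strategy=balanced; roughly, in crm shell terms:
>
>   node h16 utilization utl_ram=231712 utl_cpu=440
>   node h18 utilization utl_ram=231712 utl_cpu=440
>   node h19 utilization utl_ram=231712 utl_cpu=440
>   property cib-bootstrap-options: placement-strategy=balanced
>
> "Remaining" is simply the node capacity minus the utilization of the resources
> placed on it: each VM uses utl_ram=2048 utl_cpu=20, so h18 with two VMs ends up
> at 231712 - 2*2048 = 227616, the other nodes at 229664.)
>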
> Does that ring-shifting of resources look like a bug in pacemaker?
>
> Regards,
> Ulrich
>
>