[ClusterLabs] ocf:pacemaker:ping works strange

Fri Dec 8 09:44:16 EST 2023

Hello experts.

I use Pacemaker for a Lustre cluster, but for simplicity and exploration I
am testing with Dummy resources. I don't like how the resources fail over
and fail back. When I shut down a VM with the remote agent, Pacemaker tries
to restart it. According to pcs status, it marks the resource (not RA) as
Online for some time while the VM stays down.

To improve this behavior, I set up a ping monitor and tuned the scores like
this:
pcs resource create FAKE3 ocf:pacemaker:Dummy
pcs resource create FAKE4 ocf:pacemaker:Dummy
pcs constraint location FAKE3 prefers lustre3=100
pcs constraint location FAKE3 prefers lustre4=90
pcs constraint location FAKE4 prefers lustre3=90
pcs constraint location FAKE4 prefers lustre4=100
pcs resource defaults update resource-stickiness=110
pcs resource create ping ocf:pacemaker:ping dampen=5s host_list=local \
    op monitor interval=3s timeout=7s clone meta target-role="started"
for i in lustre{1..4}; do pcs constraint location ping-clone prefers $i; done
pcs constraint location FAKE3 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE4 rule score=0 pingd lt 1 or not_defined pingd
pcs constraint location FAKE3 rule score=125 pingd gt 0 or defined pingd
pcs constraint location FAKE4 rule score=125 pingd gt 0 or defined pingd
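
As a rough sketch of the totals these settings are meant to produce,
assuming Pacemaker simply adds the location, stickiness, and rule scores
that apply to a resource on a node. expected_score is a hypothetical helper
for the arithmetic only, not a Pacemaker or pcs command, and the totals that
include 125 are expectations, not observed values:

```shell
#!/bin/sh
# Hypothetical helper: add up the scores that apply to one resource on one node.
# Assumption: location, stickiness, and rule scores are simply summed.
expected_score() {
    total=0
    for s in "$@"; do
        total=$((total + s))
    done
    echo "$total"
}

# FAKE3 on lustre3 (where it runs): location 100 + stickiness 110
expected_score 100 110        # prints 210

# Same, if the pingd rule (score=125) also matched on lustre3
expected_score 100 110 125    # prints 335

# FAKE3 on lustre4 (not running there, so no stickiness): location 90 + rule 125
expected_score 90 125         # prints 215
```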

Question #1) Why can't I see the accumulated score from pingd in the
crm_simulate output? It shows only the location score and stickiness.
pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
pcmk__primitive_assign: FAKE3 allocation score on lustre4: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre3: 90
pcmk__primitive_assign: FAKE4 allocation score on lustre4: 210
Whether everything is OK or the VM is down, the score from pingd is not
added to the total score of RA.
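
To compare nodes more easily, the score lines above can be filtered per
resource. This is a minimal sketch that assumes the line layout shown in
the output above; scores_for is a hypothetical helper name:

```shell
#!/bin/sh
# scores_for RESOURCE
# Reads crm_simulate-style score lines on stdin and prints "node score"
# pairs for the given resource. Assumes lines of the form:
#   pcmk__primitive_assign: FAKE3 allocation score on lustre3: 210
scores_for() {
    awk -v res="$1" '
        $2 == res && $3 == "allocation" && $4 == "score" {
            node = $6
            sub(/:$/, "", node)   # strip the colon that follows the node name
            print node, $7
        }'
}
```

On a live cluster this would be used as "crm_simulate -sL | scores_for FAKE3".
Running "crm_mon -1A" alongside it shows the node attributes, which is one
way to check whether pingd is set on each node at all.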

Question #2) I shut the lustre3 VM down and left it that way. pcs status:
  * FAKE3       (ocf::pacemaker:Dummy):  Stopped
  * FAKE4       (ocf::pacemaker:Dummy):  Started lustre4
  * Clone Set: ping-clone [ping]:
    * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ] << lustre3 missing
That is OK so far.
Then the VM boots up. pcs status:
  * FAKE3       (ocf::pacemaker:Dummy):  FAILED (blocked) [ lustre3 lustre4 ]  << what is it?
  * Clone Set: ping-clone [ping]:
    * ping      (ocf::pacemaker:ping):   FAILED lustre3 (blocked)    << why not started?
    * Started: [ lustre-mds1 lustre-mds2 lustre-mgs lustre1 lustre2 lustre4 ]
I checked the server processes manually and found that lustre4 runs
"/usr/lib/ocf/resource.d/pacemaker/ping monitor" while lustre3 doesn't.
Everything is set up according to the documentation, but the results are
strange.
Then I tried adding meta target-role="started" to the pcs resource create
ping command, and this time ping started after the node rebooted. Can I
assume that this option was just missing from the official setup
documentation, and that everything will work fine now?
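
When reading pcs status after a reboot, it can help to list only the
resources reported as FAILED. This is a minimal sketch that assumes the
status line layout shown above; failed_resources is a hypothetical helper
name:

```shell
#!/bin/sh
# failed_resources
# Reads pcs-status-style text on stdin and prints the name of every
# resource whose line starts with "*" and contains FAILED.
failed_resources() {
    awk '$1 == "*" && /FAILED/ { print $2 }'
}
```

On a live cluster this would be used as "pcs status | failed_resources".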