[ClusterLabs] Speed up the resource moves in the case of a node hard shutdown

Digimer lists at alteeve.ca
Mon Feb 12 16:31:48 UTC 2018


On 2018-02-12 08:15 AM, Klaus Wenninger wrote:
> On 02/12/2018 01:02 PM, Maxim wrote:
>> Hello,
>>
>> [Sorry for a message duplication. Web mail client ruined the
>> formatting of the previous e-mail =( ]
>>
>> There is a simple configuration of two cluster nodes (built via RHEL 6
>> pcs interface) with multiple master/slave resources, disabled fencing
>> and the single sync interface.
> 
> fencing-disabled is probably due to it being a test setup ...
> Since RHEL 6 pcs is made for configuring a cman/pacemaker setup,
> I'm not sure it is advisable to use it for a corosync-2 pacemaker
> setup. You've obviously edited corosync.conf to
> reflect that ...

Without fencing, all bets are off. Please enable it and see if the issue
remains.
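
If fencing isn't configured yet, a rough sketch with pcs and an IPMI fence
agent might look like the commands below. The device names, addresses and
credentials are placeholders; substitute the agent and parameters that match
your hardware.

    # one fence device per node (parameters are illustrative)
    pcs stonith create fence_main fence_ipmilan \
        pcmk_host_list="main-node" ipaddr="10.0.0.1" \
        login="admin" passwd="secret" lanplus="1"
    pcs stonith create fence_reserve fence_ipmilan \
        pcmk_host_list="reserve-node" ipaddr="10.0.0.2" \
        login="admin" passwd="secret" lanplus="1"

    # turn fencing back on once the devices test clean
    pcs property set stonith-enabled=true

You can verify each device with 'pcs stonith fence <node>' against a node
you can afford to reboot before trusting it in production.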

Changing EL6 to corosync 2 pushes further into uncharted waters. EL6
should be using the cman plugin with corosync 1. May I ask why you
don't use EL7 if you want such a recent stack?

>> Everything is mostly fine. But there is a problem with how quickly the
>> cluster reacts when the master node is powered off (hard): the slave node
>> detects that the master is down only after about 100-3500 ms. The main
>> question is how to avoid the ~3 s delay that sometimes occurs.
> 
> Kind of interesting that you ever get detection below 2000 ms with the
> token timeout set to that value (given that a hard shutdown doesn't give
> corosync time to sign off).
> You derived these times from the corosync logs?
> 
> Regards,
> Klaus
> 
>>
>> On the slave node I have a little script that checks the connection to
>> the master node. It detects a broken sync link within about 100 ms. But
>> corosync sometimes needs much more time to notice the situation and mark
>> the master node as offline; meanwhile it still reports an 'ok' ring
>> status.
>>
>> If I understand correctly, then:
>> 1. pacemaker actions (crm_resource --move) will not take effect until
>>    corosync has refreshed its ring state;
>> 2. detection of the problem (on the corosync side) can be sped up by
>>    tuning the timeouts in corosync.conf;
>> 3. there is no way to ask corosync to recheck its ring status or to
>>    mark a ring as failed manually.
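
For reference, a manual move of a promoted (master/slave) resource from a
script would look roughly like the commands below (the resource name is a
placeholder), but as point 1 says it only takes effect once the cluster has
actually declared the old master lost:

    # prefer the reserve node for the master role
    crm_resource --move --master --resource ms_my_resource --node reserve-node

    # later, drop the temporary location constraint the move created
    crm_resource --clear --resource ms_my_resource
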
>>
>> But maybe I'm missing something.
>>
>> All I want is to move the resources faster.
>> In my little script I tried to force the cluster software to move the
>> resources to the slave node, but I've had no success so far.
>>
>> Could you please share your thoughts about the situation.
>> Thank you in advance.
>>
>>
>> Cluster software:
>> corosync - 2.4.3
>> pacemaker - 1.1.18
>> libqb - 1.0.2
>>
>>
>> corosync.conf:
>> totem {
>>     version: 2
>>     secauth: off
>>     cluster_name: cluster
>>     transport: udpu
>>     token: 2000
>> }
>>
>> nodelist {
>>     node {
>>         ring0_addr: main-node
>>         nodeid: 1
>>     }
>>
>>     node {
>>         ring0_addr: reserve-node
>>         nodeid: 2
>>     }
>> }
>>
>> quorum {
>>     provider: corosync_votequorum
>>     two_node: 1
>> }
>>
>>
>> Regards,
>> Maxim.
>>
> 
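
On the timeout question: the failure-detection delay is governed mainly by
the totem token timeout (plus the consensus timeout derived from it) in
corosync.conf. Lowering it makes a hard-killed peer be declared lost sooner,
at the cost of more false failovers on a congested or flaky network. A rough
sketch, with values that are only illustrative and need tuning for your
environment:

    totem {
        version: 2
        secauth: off
        cluster_name: cluster
        transport: udpu
        # declare the token lost after 1 s instead of 2 s; worst-case
        # detection of a dead peer is roughly this plus the consensus
        # timeout
        token: 1000
        token_retransmits_before_loss_const: 4
    }

After restarting corosync you can check the values actually in effect with
'corosync-cmapctl | grep totem' and watch in crm_mon how quickly the node is
marked offline after a hard power-off. Note also that the ring status your
script sees (corosync-cfgtool -s) only reflects the health of the local ring
interface, not peer membership, so it can still look fine while the token
timer is running against a dead node.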


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

