[ClusterLabs Developers] OCF_RESKEY_CRM_meta_notify_active_* always empty

Ken Gaillot kgaillot at redhat.com
Fri May 6 20:41:11 UTC 2016


On 05/03/2016 05:30 PM, Jehan-Guillaume de Rorthais wrote:
> On Tue, 3 May 2016 21:10:12 +0200,
> Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> 
>> On Mon, 2 May 2016 17:59:55 -0500,
>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>>> On 04/28/2016 04:47 AM, Jehan-Guillaume de Rorthais wrote:
>>>> Hello all,
>>>>
>>>> While testing and experimenting with our RA for PostgreSQL, I found that
>>>> the meta_notify_active_* variables always seem to be empty. Here is an
>>>> example of these variables as seen from our RA during a
>>>> migration/switchover:
>>>>
>>>>
>>>>   {
>>>>     'type' => 'pre',
>>>>     'operation' => 'demote',
>>>>     'active' => [],
>>>>     'inactive' => [],
>>>>     'start' => [],
>>>>     'stop' => [],
>>>>     'demote' => [
>>>>                   {
>>>>                     'rsc' => 'pgsqld:1',
>>>>                     'uname' => 'hanode1'
>>>>                   }
>>>>                 ],
>>>>     
>>>>     'master' => [
>>>>                   {
>>>>                     'rsc' => 'pgsqld:1',
>>>>                     'uname' => 'hanode1'
>>>>                   }
>>>>                 ],
>>>>     
>>>>     'promote' => [
>>>>                    {
>>>>                      'rsc' => 'pgsqld:0',
>>>>                      'uname' => 'hanode3'
>>>>                    }
>>>>                  ],
>>>>     'slave' => [
>>>>                  {
>>>>                    'rsc' => 'pgsqld:0',
>>>>                    'uname' => 'hanode3'
>>>>                  },
>>>>                  {
>>>>                    'rsc' => 'pgsqld:2',
>>>>                    'uname' => 'hanode2'
>>>>                  }
>>>>                ],
>>>>     
>>>>   }
>>>>
>>>> In case this comes from our side, here is the code that builds this
>>>> structure:
>>>>
>>>>   https://github.com/dalibo/PAF/blob/6e86284bc647ef1e81f01f047f1862e40ba62906/lib/OCF_Functions.pm#L444
>>>>
>>>> But looking at the variable itself in debug logs, I always find it empty,
>>>> in various situations (switchover, recover, failover).
>>>>
>>>> If I understand the documentation correctly, I would expect 'active' to
>>>> list all three resources, shouldn't it? Currently, to work around this,
>>>> we consider: active == master + slave
>>>
>>> You're right, it should. The pacemaker code that generates the "active"
>>> variables is the same used for "demote" etc., so it seems unlikely the
>>> issue is on pacemaker's side. Especially since your code treats active
>>> etc. differently from demote etc., it seems like it must be in there
>>> somewhere, but I don't see where.
>>
>> The code treats active, inactive, start and stop together, for any cloned
>> resource. If the resource is multistate, it adds promote, demote, slave and
>> master.
>>
>> Note that from this piece of code, the 7 other notify vars are set
>> correctly: start, stop, inactive, promote, demote, slave, master. Only active
>> is always missing.
>>
>> I'll investigate and try to find where the bug is hiding.
> 
> So I added a piece of code to dump **all** the environment variables to a
> temp file as early as possible **to avoid any interaction with our perl
> module** in the code of the RA, i.e.:
> 
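>   BEGIN {
>     use Time::HiRes qw(time);
>     # Dump the whole environment before any RA module code can touch it
>     my $now = time;
>     open my $fh, ">", "/tmp/test-$now.env.txt"
>       or die "cannot open dump file: $!";
>     printf($fh "%-20s = ''%s''\n", $_, $ENV{$_}) foreach sort keys %ENV;
>     close $fh;
>   }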
> 
> Then I started my cluster and set maintenance-mode=false while no resources
> were running. So the debug files contain the probe action, the start on all
> nodes, one promote on the master and the first monitors. The "*active"
> variables are always empty everywhere in the cluster. Attached is the result
> of the following command on the master node:
> 
>   for i in test-*; do echo "===== $i ====="; grep OCF_ "$i"; done > debug-env.txt
> 
> I'm using Pacemaker 1.1.13-10.el7_2.2-44eb2dd under CentOS 7.2.1511.
> 
> For completeness, I added the Pacemaker configuration I use for my 3-node
> dev/test cluster.
> 
> Let me know if you think of further investigations or tests I could run on
> this issue. I'm out of ideas for tonight (and I really would prefer having
> this bug on my side).

From your environment dumps, what I think is happening is that you are
getting multiple notifications (start, pre-promote, post-promote) in a
single cluster transition. So the variables reflect the initial state of
that transition -- none of the instances are active, all three are being
started (so the nodes are in the "*_start_*" variables), and one is
being promoted.

The starts will be done before the promote. If one of the starts fails,
the transition will be aborted, and a new one will be calculated. So, if
you get to the promote, you can assume anything in "*_start_*" is now
active.
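
A minimal sketch of that rule, written for a shell-based RA (yours is
Perl, but the logic is the same). The function name
effective_active_unames is hypothetical, and the sketch assumes the
"*_uname" variables hold space-separated node names; it is an
illustration, not something Pacemaker provides:

```shell
# Hypothetical helper, not part of Pacemaker or ocf-shellfuncs.
# Prints the nodes that can be treated as active when a promote
# notification arrives: those already active at the start of the
# transition plus those started earlier in the same transition.
effective_active_unames() {
    if [ "${OCF_RESKEY_CRM_meta_notify_operation:-}" != "promote" ]; then
        # The starts may not have completed yet: report only the initial state.
        printf '%s\n' "${OCF_RESKEY_CRM_meta_notify_active_uname:-}"
        return 0
    fi
    # Unquoted on purpose: split the space-separated lists into words.
    for node in ${OCF_RESKEY_CRM_meta_notify_active_uname:-} \
                ${OCF_RESKEY_CRM_meta_notify_start_uname:-}; do
        printf '%s\n' "$node"
    done | sort -u | tr '\n' ' ' | sed 's/ $//'
}
```

With an empty "active" list and all three nodes in the "*_start_uname"
variable, as in the transition described above, this would list all
three nodes once the promote notifications arrive.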

> On a side note, I noticed with these debug files that the notify
> variables were also available outside of notify actions (start and notify
> here). Are they always available during "transition actions" (start, stop,
> promote, demote)? Looking at the mysql RA, it uses
> OCF_RESKEY_CRM_meta_notify_master_uname during the start action. So I suppose
> it's safe?

Good question, I've never tried that before. I'm reluctant to say it's
guaranteed; it's possible seeing them in the start action is a side
effect of the current implementation and could theoretically change in
the future. But if mysql is relying on it, I suppose it's
well-established already, making changing it unlikely ...
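
If you'd rather not depend on that, a defensive sketch (shell again; the
function name master_uname_from_notify_vars is hypothetical) is to use
the variable only when it is actually set and non-empty, and let the
caller fall back to something else otherwise:

```shell
# Hypothetical helper, not part of Pacemaker or ocf-shellfuncs.
# Prints the master node name(s) taken from the notify variables and
# returns 0, or returns 1 when the variable is unset or empty for the
# current action.
master_uname_from_notify_vars() {
    [ -n "${OCF_RESKEY_CRM_meta_notify_master_uname:-}" ] || return 1
    printf '%s\n' "$OCF_RESKEY_CRM_meta_notify_master_uname"
}
```

What the fallback should be depends on the RA; the point is only that
the start action keeps working if the variable is ever missing there.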




