[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

Thu Apr 15 17:45:08 EDT 2021

Hi All,

Sorry...
Due to my operation mistake, the same email was sent multiple times.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2021/4/15, Thu 11:45
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
> 
> Hi Klaus,
> Hi Ken,
> 
> We have confirmed that the operation is improved by the test.
> Thank you for your prompt response.
> 
> We look forward to including this fix in the release version of RHEL 8.4.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> ----- Original Message -----
>>  From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
>>  To: "kwenning at redhat.com" <kwenning at redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>  Cc: 
>>  Date: 2021/4/13, Tue 07:08
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>> 
>>  Hi Klaus,
>>  Hi Ken,
>> 
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>> 
>> 
>>  Thank you for the fix.
>> 
>> 
>>  I have confirmed that the fixes have been merged.
>> 
>>  I'll test this fix today just in case.
>> 
>>  Many thanks,
>>  Hideo Yamauchi.
>> 
>> 
>>  ----- Original Message -----
>>>   From: Klaus Wenninger <kwenning at redhat.com>
>>>   To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>>   Cc: 
>>>   Date: 2021/4/12, Mon 22:22
>>>   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>>> 
>>>   On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>>    On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>>>    On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>>>    On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>>>    On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>    Hi Klaus,
>>>>>>>> 
>>>>>>>>    Thanks for your comment.
>>>>>>>> 
>>>>>>>>>    Hmm ... is that with selinux enabled?
>>>>>>>>>    Respectively do you see any related avc messages?
>>>>>>>> 
>>>>>>>>    Selinux is not enabled.
>>>>>>>>    Isn't crm_mon caused by not returning a response when pacemakerd 
>>>>>>>>    prepares to stop?
>>>>>>    yep ... that doesn't look good.
>>>>>>    While in pcmk_shutdown_worker ipc isn't handled.
>>>>>    Stop ... that should actually work as pcmk_shutdown_worker
>>>>>    should exit quite quickly and proceed after mainloop
>>>>>    dispatching when called again.
>>>>>    Don't see anything atm that might be blocking for longer ...
>>>>>    but let me dig into it further ...
>>>>    What happens is clear (thanks Ken for the hint ;-) ).
>>>>    When pacemakerd is shutting down - already when it
>>>>    shuts down the resources and not just when it starts to
>>>>    reap the subdaemons - crm_mon reads that state and
>>>>    doesn't try to connect to the cib anymore.
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>>>>>>    Question is why that didn't create issue earlier.
>>>>>>    Probably I didn't test with resources that had crm_mon in
>>>>>>    their stop/monitor-actions but sbd should have run into
>>>>>>    issues.
>>>>>> 
>>>>>>    Klaus
>>>>>>>    But when shutting down a node the resources should be
>>>>>>>    shut down before pacemakerd goes down.
>>>>>>>    But let me have a look if it can happen that pacemakerd
>>>>>>>    doesn't react to the ipc-pings before. That btw. might be
>>>>>>>    lethal for sbd-scenarios (if the phase is too long and it
>>>>>>>    might actually not be defined).
>>>>>>> 
>>>>>>>    My idea with selinux would have been that it might block
>>>>>>>    the ipc if crm_mon is issued by execd. But well forget
>>>>>>>    about it as it is not enabled ;-)
>>>>>>> 
>>>>>>> 
>>>>>>>    Klaus
>>>>>>>> 
>>>>>>>>    pgsql needs the result of crm_mon in demote processing and stop 
>>>>>>>>    processing.
>>>>>>>>    crm_mon should return a response even after pacemakerd goes into a 
>>>>>>>>    stop operation.
>>>>>>>> 
>>>>>>>>    Best Regards,
>>>>>>>>    Hideo Yamauchi.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>    ----- Original Message -----
>>>>>>>>>    From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>>>>    To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related 
>>>>>>>>>    to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>>>    Cc:
>>>>>>>>>    Date: 2021/4/9, Fri 21:12
>>>>>>>>>    Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql 
>>>>>>>>>    resource control fails.
>>>>>>>>> 
>>>>>>>>>    On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>>>      Hi Ken,
>>>>>>>>>>      Hi All,
>>>>>>>>>> 
>>>>>>>>>>      In the pgsql resource, crm_mon is executed in the process of 
>>>>>>>>>>      demote and stop, and the result is processed.
>>>>>>>>>>      However, pacemaker included in RHEL8.4beta fails to execute 
>>>>>>>>>>      this crm_mon.
>>>>>>>>>>        - The problem also occurs on github 
>>>>>>>>>>          master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>>>>      The problem can be easily reproduced in the following ways.
>>>>>>>>>> 
>>>>>>>>>>      Step1. Modify to execute crm_mon in the stop process of the 
>>>>>>>>>>      Dummy resource.
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      dummy_stop() {
>>>>>>>>>>           mon=$(crm_mon -1)
>>>>>>>>>>           ret=$?
>>>>>>>>>>           ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>>>>>           dummy_monitor
>>>>>>>>>>           if [ $? =  $OCF_SUCCESS ]; then
>>>>>>>>>>               rm ${OCF_RESKEY_state}
>>>>>>>>>>           fi
>>>>>>>>>>           return $OCF_SUCCESS
>>>>>>>>>>      }
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      Step2. Configure a cluster with two nodes.
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>>>>>>>      Cluster Summary:
>>>>>>>>>>         * Stack: corosync
>>>>>>>>>>         * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>>>         * Last updated: Thu Apr  8 18:00:52 2021
>>>>>>>>>>         * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>>>>>         * 2 nodes configured
>>>>>>>>>>         * 1 resource instance configured
>>>>>>>>>> 
>>>>>>>>>>      Node List:
>>>>>>>>>>         * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>>>>> 
>>>>>>>>>>      Full List of Resources:
>>>>>>>>>>         * dummy-1    (ocf::heartbeat:Dummy):  Started rh84-beta01
>>>>>>>>>> 
>>>>>>>>>>      Migration Summary:
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      Step3. Stop the node where the Dummy resource is running. The 
>>>>>>>>>>      resource will fail over.
>>>>>>>>>>      ----
>>>>>>>>>>      [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>>>>>>>      Cluster Summary:
>>>>>>>>>>         * Stack: corosync
>>>>>>>>>>         * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>>>         * Last updated: Thu Apr  8 18:08:56 2021
>>>>>>>>>>         * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>>>>>         * 2 nodes configured
>>>>>>>>>>         * 1 resource instance configured
>>>>>>>>>> 
>>>>>>>>>>      Node List:
>>>>>>>>>>         * Online: [ rh84-beta02 ]
>>>>>>>>>>         * OFFLINE: [ rh84-beta01 ]
>>>>>>>>>> 
>>>>>>>>>>      Full List of Resources:
>>>>>>>>>>         * dummy-1    (ocf::heartbeat:Dummy):  Started rh84-beta02
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      However, if you look at the log, you can see that the 
>>>>>>>>>>      execution of crm_mon in the stop processing of the Dummy 
>>>>>>>>>>      resource has failed.
>>>>>>>>>>      ----
>>>>>>>>>>      Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>>>>>      Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>>>>>>    Hmm ... is that with selinux enabled?
>>>>>>>>>    Respectively do you see any related avc 
> messages?
>>>>>>>>> 
>>>>>>>>>    Klaus
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      Similarly, pgsql also executes crm_mon with demote or stop, 
>>>>>>>>>>      so control fails.
>>>>>>>>>>      The problem seems to be related to the following fix.
>>>>>>>>>>        * Report pacemakerd in state waiting for sbd
>>>>>>>>>>         - https://github.com/ClusterLabs/pacemaker/pull/2278 
>>>>>>>>>> 
>>>>>>>>>>      The problem does not occur with the release version of 
>>>>>>>>>>      Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>>>>>>      This issue has a huge impact on the user.
>>>>>>>>>> 
>>>>>>>>>>      Perhaps it also affects the control of other resources that 
>>>>>>>>>>      utilize crm_mon.
>>>>>>>>>>      Please improve the release version of RHEL8.4 so that it 
>>>>>>>>>>      includes Pacemaker which does not cause this problem.
>>>>>>>>>>        * Distributions other than RHEL may also be affected in 
>>>>>>>>>>          future releases.
>>>>>>>>>> 
>>>>>>>>>>      ----
>>>>>>>>>>      This content is the same as the following Bugzilla.
>>>>>>>>>>        - https://bugs.clusterlabs.org/show_bug.cgi?id=5471 
>>>>>>>>>>      ----
>>>>>>>>>> 
>>>>>>>>>>      Best Regards,
>>>>>>>>>>      Hideo Yamauchi.
>>>>>>>>>> 
>>  _______________________________________________
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
>> 
>
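The check added to dummy_stop in Step1 of the quoted report amounts to running crm_mon once and recording its exit status together with its output. A minimal sketch of the same check as a standalone shell helper follows, so it can be tried by hand outside a resource agent. The function name probe_cluster_status and the CRM_MON_CMD override are hypothetical names introduced here for illustration and are not part of Pacemaker or resource-agents; only the "crm_mon -1" invocation and the "crm_mon[rc] : output" log shape come from the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker or resource-agents).
# Runs a one-shot status command and prints its exit status and output
# in the same "crm_mon[rc] : output" shape as the quoted log line.
probe_cluster_status() {
    # CRM_MON_CMD is an assumed override so the helper can be tried
    # without a running cluster; by default it runs "crm_mon -1".
    status_cmd=${CRM_MON_CMD:-"crm_mon -1"}

    # Capture stdout and stderr together, then save the exit status
    # before any other command overwrites $?.
    out=$(sh -c "$status_cmd" 2>&1)
    rc=$?

    printf 'crm_mon[%s] : %s\n' "$rc" "$out"
    return "$rc"
}
```

In the log quoted in the report, the equivalent call inside the stop action returned status 102 with the message "Pacemaker daemons shutting down ...". Calling probe_cluster_status from a shell on a node during shutdown should show whether a given Pacemaker build behaves the same way.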