[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

Wed Apr 14 22:43:59 EDT 2021

Hi Klaus,
Hi Ken,

We have tested the fix and confirmed that the behavior is corrected.
Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.
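
For anyone who wants to repeat the check from a resource agent's stop or demote action, here is a minimal sketch of a probe. It is only an illustration, not the exact script used for the test: the helper name probe_crm_mon is hypothetical, and the only facts taken from this thread are that `crm_mon -1` returned 102 with the output "Pacemaker daemons shutting down ..." in the failing case.

```shell
#!/bin/sh
# Sketch only: probe_crm_mon is a hypothetical helper, not part of
# Pacemaker or resource-agents.

probe_crm_mon() {
    # Run a one-shot crm_mon and keep both its output and exit status.
    if mon=$(crm_mon -1 2>&1); then
        ret=0
    else
        ret=$?
    fi

    if [ "$ret" -eq 0 ]; then
        printf '%s\n' "crm_mon OK"
    else
        # In the failure reported in this thread, ret was 102 and the
        # output was "Pacemaker daemons shutting down ...".
        printf '%s\n' "crm_mon FAILED rc=${ret}: ${mon}"
    fi
    return "$ret"
}
```

Calling probe_crm_mon from the stop action of a test resource and logging its output should show "crm_mon OK" once the fix is in place, under the assumption that the cluster is otherwise healthy.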

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: "kwenning at redhat.com" <kwenning at redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc: 
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
> 
> Hi Klaus,
> Hi Ken,
> 
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
> 
> 
> Thank you for the fix.
> 
> 
> I have confirmed that the fixes have been merged.
> 
> I'll test this fix today just in case.
> 
> Many thanks,
> Hideo Yamauchi.
> 
> 
> ----- Original Message -----
>>  From: Klaus Wenninger <kwenning at redhat.com>
>>  To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>>  Cc: 
>>  Date: 2021/4/12, Mon 22:22
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
>> 
>>  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>>   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>>>>   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>>>>   On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>   Hi Klaus,
>>>>>>> 
>>>>>>>   Thanks for your comment.
>>>>>>> 
>>>>>>>>   Hmm ... is that with selinux enabled?
>>>>>>>>   Respectively do you see any related avc messages?
>>>>>>> 
>>>>>>>   Selinux is not enabled.
>>>>>>>   Isn't this caused by crm_mon not returning a response when
>>>>>>>   pacemakerd prepares to stop?
>>>>>   yep ... that doesn't look good.
>>>>>   While in pcmk_shutdown_worker ipc isn't handled.
>>>>   Stop ... that should actually work as pcmk_shutdown_worker
>>>>   should exit quite quickly and proceed after mainloop
>>>>   dispatching when called again.
>>>>   Don't see anything atm that might be blocking for longer ...
>>>>   but let me dig into it further ...
>>>   What happens is clear (thanks Ken for the hint ;-) ).
>>>   When pacemakerd is shutting down - already when it
>>>   shuts down the resources and not just when it starts to
>>>   reap the subdaemons - crm_mon reads that state and
>>>   doesn't try to connect to the cib anymore.
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
>>>>>   Question is why that didn't create issue earlier.
>>>>>   Probably I didn't test with resources that had crm_mon in
>>>>>   their stop/monitor-actions but sbd should have run into
>>>>>   issues.
>>>>> 
>>>>>   Klaus
>>>>>>   But when shutting down a node the resources should be
>>>>>>   shutdown before pacemakerd goes down.
>>>>>>   But let me have a look if it can happen that pacemakerd
>>>>>>   doesn't react to the ipc-pings before. That btw. might be
>>>>>>   lethal for sbd-scenarios (if the phase is too long and it
>>>>>>   might actually not be defined).
>>>>>> 
>>>>>>   My idea with selinux would have been that it might block
>>>>>>   the ipc if crm_mon is issued by execd. But well forget
>>>>>>   about it as it is not enabled ;-)
>>>>>> 
>>>>>> 
>>>>>>   Klaus
>>>>>>> 
>>>>>>>   pgsql needs the result of crm_mon in demote processing and stop
>>>>>>>   processing.
>>>>>>>   crm_mon should return a response even after pacemakerd goes into a
>>>>>>>   stop operation.
>>>>>>> 
>>>>>>>   Best Regards,
>>>>>>>   Hideo Yamauchi.
>>>>>>> 
>>>>>>> 
>>>>>>>   ----- Original Message -----
>>>>>>>>   From: Klaus Wenninger <kwenning at redhat.com>
>>>>>>>>   To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related
>>>>>>>>   to open-source clustering welcomed <users at clusterlabs.org>
>>>>>>>>   Cc:
>>>>>>>>   Date: 2021/4/9, Fri 21:12
>>>>>>>>   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>>>>>>>>   resource control fails.
>>>>>>>> 
>>>>>>>>   On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>>>>     Hi Ken,
>>>>>>>>>     Hi All,
>>>>>>>>> 
>>>>>>>>>     In the pgsql resource, crm_mon is executed in the process of
>>>>>>>>>     demote and stop, and the result is processed.
>>>>>>>>>     However, the Pacemaker included in RHEL8.4beta fails to execute
>>>>>>>>>     this crm_mon.
>>>>>>>>>       - The problem also occurs on github
>>>>>>>>>         master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>>>>     The problem can be easily reproduced with the following steps.
>>>>>>>>>
>>>>>>>>>     Step1. Modify the Dummy resource to execute crm_mon in its
>>>>>>>>>     stop process.
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     dummy_stop() {
>>>>>>>>>          mon=$(crm_mon -1)
>>>>>>>>>          ret=$?
>>>>>>>>>          ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>>>>          dummy_monitor
>>>>>>>>>          if [ $? =  $OCF_SUCCESS ]; then
>>>>>>>>>              rm ${OCF_RESKEY_state}
>>>>>>>>>          fi
>>>>>>>>>          return $OCF_SUCCESS
>>>>>>>>>     }
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     Step2. Configure a cluster with two nodes.
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>>>>>>     Cluster Summary:
>>>>>>>>>        * Stack: corosync
>>>>>>>>>        * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>>        * Last updated: Thu Apr  8 18:00:52 2021
>>>>>>>>>        * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>>>>        * 2 nodes configured
>>>>>>>>>        * 1 resource instance configured
>>>>>>>>> 
>>>>>>>>>     Node List:
>>>>>>>>>        * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>>>> 
>>>>>>>>>     Full List of Resources:
>>>>>>>>>        * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>>>>>>>>> 
>>>>>>>>>     Migration Summary:
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     Step3. Stop the node where the Dummy resource is running. The
>>>>>>>>>     resource will fail over.
>>>>>>>>>     ----
>>>>>>>>>     [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>>>>>>     Cluster Summary:
>>>>>>>>>        * Stack: corosync
>>>>>>>>>        * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>>>>        * Last updated: Thu Apr  8 18:08:56 2021
>>>>>>>>>        * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>>>>        * 2 nodes configured
>>>>>>>>>        * 1 resource instance configured
>>>>>>>>> 
>>>>>>>>>     Node List:
>>>>>>>>>        * Online: [ rh84-beta02 ]
>>>>>>>>>        * OFFLINE: [ rh84-beta01 ]
>>>>>>>>> 
>>>>>>>>>     Full List of Resources:
>>>>>>>>>        * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     However, if you look at the log, you can see that the execution
>>>>>>>>>     of crm_mon in the stop processing of the Dummy resource has failed.
>>>>>>>>>     ----
>>>>>>>>>     Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>>>>     Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>>>>>   Hmm ... is that with selinux enabled?
>>>>>>>>   Respectively do you see any related avc messages?
>>>>>>>> 
>>>>>>>>   Klaus
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     Similarly, pgsql also executes crm_mon during demote or stop, so
>>>>>>>>>     its control fails.
>>>>>>>>>     The problem seems to be related to the following fix.
>>>>>>>>>       * Report pacemakerd in state waiting for sbd
>>>>>>>>>        - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>>>> 
>>>>>>>>>     The problem does not occur with the release version of
>>>>>>>>>     Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>>>>>     This issue has a huge impact on the user.
>>>>>>>>> 
>>>>>>>>>     Perhaps it also affects the control of other resources that
>>>>>>>>>     utilize crm_mon.
>>>>>>>>>     Please make sure the release version of RHEL8.4 includes a
>>>>>>>>>     Pacemaker which does not cause this problem.
>>>>>>>>>       * Distributions other than RHEL may also be affected in
>>>>>>>>>         future releases.
>>>>>>>>> 
>>>>>>>>>     ----
>>>>>>>>>     This content is the same as the following Bugzilla.
>>>>>>>>>       - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>>>>>     ----
>>>>>>>>> 
>>>>>>>>>     Best Regards,
>>>>>>>>>     Hideo Yamauchi.
>>>>>>>>> 