[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

Fri Apr 9 11:13:49 EDT 2021

On 4/9/21 4:04 PM, Klaus Wenninger wrote:
> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>> On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>>>> Hi Klaus,
>>>>
>>>> Thanks for your comment.
>>>>
>>>>> Hmm ... is that with selinux enabled?
>>>>> Respectively do you see any related avc messages?
>>>>
>>>> Selinux is not enabled.
>>>> Isn't the crm_mon failure caused by pacemakerd not returning a
>>>> response while it prepares to stop?
>> yep ... that doesn't look good.
>> While in pcmk_shutdown_worker ipc isn't handled.
> Stop ... that should actually work as pcmk_shutdown_worker
> should exit quite quickly and proceed after mainloop
> dispatching when called again.
> Don't see anything atm that might be blocking for longer ...
> but let me dig into it further ...
What happens is clear (thanks Ken for the hint ;-) ).
When pacemakerd is shutting down - already when it
shuts down the resources and not just when it starts to
reap the subdaemons - crm_mon reads that state and
doesn't try to connect to the cib anymore.
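
As a rough illustration of that behaviour, here is a toy model in
shell. It is not Pacemaker code: the function name status_client and
the state names "running" and "shutting_down" are placeholders made up
for this sketch. Only the message text and the exit status 102 are
taken from the log quoted further down.

```shell
#!/bin/sh
# Toy model only, not Pacemaker source. The function name and the state
# names are hypothetical; they just mirror the behaviour described above.
status_client() {
    state=$1
    case "$state" in
        running)
            # Normal case: the client would go on to query the CIB.
            echo "connect to cib"
            return 0
            ;;
        shutting_down)
            # Behaviour from the report: print the message, never touch
            # the CIB, and exit with status 102.
            echo "Pacemaker daemons shutting down ..."
            return 102
            ;;
        *)
            # Placeholder for every state this sketch does not model.
            echo "state not modelled: $state"
            return 1
            ;;
    esac
}
```

Calling status_client shutting_down reproduces what the stop action of
the modified Dummy agent saw: the message on stdout and 102 as exit
status, even though the cib itself was presumably still reachable at
that point.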
>> The question is why that didn't cause issues earlier.
>> Probably I didn't test with resources that had crm_mon in
>> their stop/monitor-actions but sbd should have run into
>> issues.
>>
>> Klaus
>>> But when shutting down a node the resources should be
>>> shutdown before pacemakerd goes down.
>>> But let me have a look if it can happen that pacemakerd
>>> doesn't react to the ipc-pings before. That btw. might be
>>> lethal for sbd-scenarios (if the phase is too long and it
>>> might actually not be defined).
>>>
>>> My idea with selinux would have been that it might block
>>> the ipc if crm_mon is issued by execd. But well forget
>>> about it as it is not enabled ;-)
>>>
>>>
>>> Klaus
>>>>
>>>> pgsql needs the result of crm_mon during its demote and stop
>>>> operations.
>>>> crm_mon should return a response even after pacemakerd has begun
>>>> shutting down.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related 
>>>>> to open-source clustering welcomed <users at clusterlabs.org>
>>>>> Cc:
>>>>> Date: 2021/4/9, Fri 21:12
>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql 
>>>>> resource control fails.
>>>>>
>>>>> On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>>   Hi Ken,
>>>>>>   Hi All,
>>>>>>
>>>>>>   In the pgsql resource agent, crm_mon is executed during demote and stop, and its result is processed.
>>>>>>   However, the Pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>>>>>>     - The problem also occurs on GitHub master (c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>>   The problem can be easily reproduced in the following ways.
>>>>>>
>>>>>>   Step1. Modify the Dummy resource agent to execute crm_mon in its stop action.
>>>>>>   ----
>>>>>>
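>>>>>>   dummy_stop() {
>>>>>>        # Capture crm_mon output and exit status so both end up in the log.
>>>>>>        mon=$(crm_mon -1)
>>>>>>        ret=$?
>>>>>>        ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>        # Remove the state file only if the resource is currently running.
>>>>>>        dummy_monitor
>>>>>>        if [ $? -eq "$OCF_SUCCESS" ]; then
>>>>>>            rm "${OCF_RESKEY_state}"
>>>>>>        fi
>>>>>>        return $OCF_SUCCESS
>>>>>>   }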
>>>>>>   ----
>>>>>>
>>>>>>   Step2. Configure a cluster with two nodes.
>>>>>>   ----
>>>>>>
>>>>>>   [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>>>   Cluster Summary:
>>>>>>      * Stack: corosync
>>>>>>      * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>      * Last updated: Thu Apr  8 18:00:52 2021
>>>>>>      * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>>      * 2 nodes configured
>>>>>>      * 1 resource instance configured
>>>>>>
>>>>>>   Node List:
>>>>>>      * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>
>>>>>>   Full List of Resources:
>>>>>>      * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>>>>>>
>>>>>>   Migration Summary:
>>>>>>   ----
>>>>>>
>>>>>>   Step3. Stop the node where the Dummy resource is running. The resource will fail over.
>>>>>>   ----
>>>>>>   [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>>>   Cluster Summary:
>>>>>>      * Stack: corosync
>>>>>>      * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>>      * Last updated: Thu Apr  8 18:08:56 2021
>>>>>>      * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>>      * 2 nodes configured
>>>>>>      * 1 resource instance configured
>>>>>>
>>>>>>   Node List:
>>>>>>      * Online: [ rh84-beta02 ]
>>>>>>      * OFFLINE: [ rh84-beta01 ]
>>>>>>
>>>>>>   Full List of Resources:
>>>>>>      * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>>>>>   ----
>>>>>>
>>>>>>   However, if you look at the log, you can see that the execution of crm_mon in the stop processing of the Dummy resource has failed.
>>>>>>   ----
>>>>>>   Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>>   Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>> Hmm ... is that with selinux enabled?
>>>>> Respectively do you see any related avc messages?
>>>>>
>>>>> Klaus
>>>>>>   ----
>>>>>>
>>>>>>   Similarly, pgsql also executes crm_mon during demote or stop, so its resource control fails.
>>>>>>   The problem seems to be related to the following change.
>>>>>>     * Report pacemakerd in state waiting for sbd
>>>>>>      - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>
>>>>>>   The problem does not occur with the release version of Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>>   This issue has a huge impact on the user.
>>>>>>
>>>>>>   Perhaps it also affects the control of other resources that utilize crm_mon.
>>>>>>   Please make sure that the release version of RHEL8.4 includes a Pacemaker that does not have this problem.
>>>>>>     * Distributions other than RHEL may also be affected in future releases.
>>>>>>
>>>>>>   ----
>>>>>>   This content is the same as the following Bugzilla.
>>>>>>     - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>>   ----
>>>>>>
>>>>>>   Best Regards,
>>>>>>   Hideo Yamauchi.
>>>>>>