[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.
Klaus Wenninger
kwenning at redhat.com
Fri Apr 9 11:13:49 EDT 2021
On 4/9/21 4:04 PM, Klaus Wenninger wrote:
> On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>> On 4/9/21 2:37 PM, renayama19661014 at ybb.ne.jp wrote:
>>>> Hi Klaus,
>>>>
>>>> Thanks for your comment.
>>>>
>>>>> Hmm ... is that with SELinux enabled?
>>>>> If so, do you see any related AVC messages?
>>>>
>>>> SELinux is not enabled.
>>>> Isn't the problem caused by crm_mon not returning a response once
>>>> pacemakerd prepares to stop?
>> yep ... that doesn't look good.
>> While in pcmk_shutdown_worker ipc isn't handled.
> Stop ... that should actually work as pcmk_shutdown_worker
> should exit quite quickly and proceed after mainloop
> dispatching when called again.
> Don't see anything atm that might be blocking for longer ...
> but let me dig into it further ...
What happens is clear (thanks Ken for the hint ;-) ).
When pacemakerd is shutting down - already when it
shuts down the resources and not just when it starts to
reap the subdaemons - crm_mon reads that state and
doesn't try to connect to the cib anymore.
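
For agents that call crm_mon in stop or demote, the behaviour described
above could be told apart from other failures with something like the
sketch below. It is only an illustration under assumptions: it assumes that
the exit status 102 seen in the log quoted further down is what crm_mon
returns while pacemakerd reports that it is shutting down, and the helper
name classify_status_cmd is hypothetical, not part of pgsql, Dummy or
Pacemaker.

```shell
#!/bin/sh
# Hypothetical helper (not part of any shipped agent): run a status
# command such as "crm_mon -1" and classify its exit status.
# Assumption: 102 is the status observed in the quoted log while
# pacemakerd is shutting down; any other non-zero status is treated
# as a generic error.
classify_status_cmd() {
    status_rc=0
    # Capture output and exit status without tripping "set -e".
    status_out=$("$@" 2>&1) || status_rc=$?
    case "$status_rc" in
        0)   echo "ok" ;;
        102) echo "shutting-down" ;;
        *)   echo "error" ;;
    esac
    return "$status_rc"
}
```

A stop action might call state=$(classify_status_cmd crm_mon -1) and fall
back to a purely local check when the result is "shutting-down", instead
of treating it as a hard failure.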
>> The question is why that didn't cause issues earlier.
>> Probably I didn't test with resources that had crm_mon in
>> their stop/monitor-actions but sbd should have run into
>> issues.
>>
>> Klaus
>>> But when shutting down a node the resources should be
>>> shut down before pacemakerd goes down.
>>> But let me have a look if it can happen that pacemakerd
>>> doesn't react to the ipc-pings before. That btw. might be
>>> lethal for sbd-scenarios (if the phase is too long and it
>>> might actually not be defined).
>>>
>>> My idea with selinux would have been that it might block
>>> the ipc if crm_mon is issued by execd. But well forget
>>> about it as it is not enabled ;-)
>>>
>>>
>>> Klaus
>>>>
>>>> pgsql needs the result of crm_mon in demote processing and stop
>>>> processing.
>>>> crm_mon should return a response even after pacemakerd goes into a
>>>> stop operation.
>>>>
>>>> Best Regards,
>>>> Hideo Yamauchi.
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: Klaus Wenninger <kwenning at redhat.com>
>>>>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related
>>>>> to open-source clustering welcomed <users at clusterlabs.org>
>>>>> Cc:
>>>>> Date: 2021/4/9, Fri 21:12
>>>>> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>>>>> resource control fails.
>>>>>
>>>>> On 4/8/21 11:21 PM, renayama19661014 at ybb.ne.jp wrote:
>>>>>> Hi Ken,
>>>>>> Hi All,
>>>>>>
>>>>>> In the pgsql resource, crm_mon is executed in the process of
>>>>>> demote and stop, and the result is processed.
>>>>>> However, pacemaker included in RHEL8.4beta fails to execute
>>>>>> this crm_mon.
>>>>>> - The problem also occurs on github
>>>>>> master (c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>>>>>> The problem can be easily reproduced in the following ways.
>>>>>>
>>>>>> Step1. Modify to execute crm_mon in the stop process of the
>>>>>> Dummy resource.
>>>>>> ----
>>>>>>
>>>>>> dummy_stop() {
>>>>>>     mon=$(crm_mon -1)
>>>>>>     ret=$?
>>>>>>     ocf_log info "### YAMAUCHI #### crm_mon[${ret}] : ${mon}"
>>>>>>     dummy_monitor
>>>>>>     if [ $? = $OCF_SUCCESS ]; then
>>>>>>         rm ${OCF_RESKEY_state}
>>>>>>     fi
>>>>>>     return $OCF_SUCCESS
>>>>>> }
>>>>>> ----
>>>>>>
>>>>>> Step2. Configure a cluster with two nodes.
>>>>>> ----
>>>>>>
>>>>>> [root at rh84-beta01 ~]# crm_mon -rfA1
>>>>>> Cluster Summary:
>>>>>> * Stack: corosync
>>>>>> * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>> * Last updated: Thu Apr 8 18:00:52 2021
>>>>>> * Last change: Thu Apr 8 18:00:38 2021 by root via cibadmin on rh84-beta01
>>>>>> * 2 nodes configured
>>>>>> * 1 resource instance configured
>>>>>>
>>>>>> Node List:
>>>>>> * Online: [ rh84-beta01 rh84-beta02 ]
>>>>>>
>>>>>> Full List of Resources:
>>>>>> * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta01
>>>>>>
>>>>>> Migration Summary:
>>>>>> ----
>>>>>>
>>>>>> Step3. Stop the node where the Dummy resource is running. The
>>>>>> resource will fail over.
>>>>>> ----
>>>>>> [root at rh84-beta02 ~]# crm_mon -rfA1
>>>>>> Cluster Summary:
>>>>>> * Stack: corosync
>>>>>> * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with quorum
>>>>>> * Last updated: Thu Apr 8 18:08:56 2021
>>>>>> * Last change: Thu Apr 8 18:05:08 2021 by root via cibadmin on rh84-beta01
>>>>>> * 2 nodes configured
>>>>>> * 1 resource instance configured
>>>>>>
>>>>>> Node List:
>>>>>> * Online: [ rh84-beta02 ]
>>>>>> * OFFLINE: [ rh84-beta01 ]
>>>>>>
>>>>>> Full List of Resources:
>>>>>> * dummy-1 (ocf::heartbeat:Dummy): Started rh84-beta02
>>>>>> ----
>>>>>>
>>>>>> However, if you look at the log, you can see that the execution
>>>>>> of crm_mon in the stop processing of the Dummy resource has failed.
>>>>>> ----
>>>>>> Apr 08 18:05:17 Dummy(dummy-1)[2631]: INFO: ### YAMAUCHI #### crm_mon[102] : Pacemaker daemons shutting down ...
>>>>>> Apr 08 18:05:17 rh84-beta01 pacemaker-execd [2219] (log_op_output) notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on this node ]
>>>>> Hmm ... is that with SELinux enabled?
>>>>> If so, do you see any related AVC messages?
>>>>>
>>>>> Klaus
>>>>>> ----
>>>>>>
>>>>>> Similarly, pgsql also executes crm_mon with demote or stop, so
>>>>>> control fails.
>>>>>> The problem seems to be related to the following fix.
>>>>>> * Report pacemakerd in state waiting for sbd
>>>>>> - https://github.com/ClusterLabs/pacemaker/pull/2278
>>>>>>
>>>>>> The problem does not occur with the release version of
>>>>>> Pacemaker 2.0.5 or the Pacemaker included with RHEL8.3.
>>>>>> This issue has a huge impact on the user.
>>>>>>
>>>>>> Perhaps it also affects the control of other resources that
>>>>>> utilize crm_mon.
>>>>>> Please make sure that the release version of RHEL8.4 includes
>>>>>> a Pacemaker that does not cause this problem.
>>>>>> * Distributions other than RHEL may also be affected in
>>>>>> future releases.
>>>>>>
>>>>>> ----
>>>>>> This content is the same as the following Bugzilla.
>>>>>> - https://bugs.clusterlabs.org/show_bug.cgi?id=5471
>>>>>> ----
>>>>>>
>>>>>> Best Regards,
>>>>>> Hideo Yamauchi.
>>>>>>