[ClusterLabs] resource agent OCF_HEARTBEAT_GALERA issue/broken - ?

lejeczek peljasz at yahoo.co.uk
Wed Jul 27 05:08:18 EDT 2022



On 26/07/2022 20:56, Reid Wahl wrote:
> On Tue, Jul 26, 2022 at 4:21 AM lejeczek via Users
> <users at clusterlabs.org> wrote:
>> Hi guys
>>
>> I set up a clone of a new MariaDB Galera instance, which otherwise
>> works outside of pcs, but I see something weird.
>>
>> First, the cluster claims it's all good:
>>
>> -> $ pcs status --full
>> ...
>>
>>     * Clone Set: mariadb-apps-clone [mariadb-apps] (promotable):
>>       * mariadb-apps    (ocf::heartbeat:galera):     Master sucker.internal.ccn
>>       * mariadb-apps    (ocf::heartbeat:galera):     Master drunk.internal.ccn
> Clearly the problem is that your server is drunk.
>
>> but that MariaDB is _not_ actually started.
>>
>> In the clone's attributes I set:
>>
>> config=/apps/etc/mariadb-server.cnf
>>
>> For peace of mind I also set:
>>
>> datadir=/apps/mysql/data
>>
>> even though '/apps/etc/mariadb-server.cnf' declares that and the other
>> bits; again, it works outside of pcs.
>>
>> Then I see in pacemaker logs:
>>
>> notice: mariadb-apps_start_0 at drunk.internal.ccn output [ 220726 11:56:13
>> mysqld_safe Logging to '/tmp/tmp.On5VnzOyaF'.\n220726 11:56:13
>> mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql\n ]
>>
>> .. and I think what the F?
>>
>> resource-agents-4.9.0-22.el8.x86_64
>>
>> All thoughts shared are much appreciated.
>>
>> many thanks, L.
>>
> How do you start mariadb outside of pacemaker's control?
simply by:
-> $ sudo -u mysql /usr/libexec/mysqld --defaults-file=/apps/etc/mariadb-server.cnf --wsrep-new-cluster
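A minimal sketch for comparing the two start paths, assuming a POSIX shell with awk: the manual command above takes its settings from --defaults-file, and each MariaDB program reads only certain [group] sections of an option file, so it can help to list which section each datadir= line sits in. cnf_datadirs is a hypothetical helper, not part of pcs or resource-agents, and it does not follow !include directives.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs or resource-agents): print
# "<group>: <value>" for every datadir= line in a MariaDB option file.
# Commented-out lines (# or ;) are skipped because they do not start
# with "datadir". !include directives are NOT followed.
cnf_datadirs() {
    awk '
        # Section header such as "[mysqld]" or "[galera]": remember its name.
        /^[ \t]*\[/ {
            grp = $0
            sub(/^[ \t]*\[/, "", grp)
            grp = substr(grp, 1, index(grp, "]") - 1)
            next
        }
        # Option line "datadir = /path" (blanks around "=" are optional).
        /^[ \t]*datadir[ \t]*=/ {
            val = $0
            sub(/^[^=]*=[ \t]*/, "", val)
            sub(/[ \t]+$/, "", val)
            print grp ": " val
        }
    ' "$1"
}

# Usage: cnf_datadirs /apps/etc/mariadb-server.cnf
```

MariaDB's own my_print_defaults utility, which reads option files the same way the server does, is the more complete check once you know which groups matter.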
>
> It seems that *something* is running, based on the "Starting" message
> and the fact that the resources are still in Started state...
Yes, a "regular" Galera instance (which, outside of pacemaker,
works with "regular" systemd) is of course already running as
one resource.

>
> The logic to start and promote the galera resource is contained within
> /usr/lib/ocf/resource.d/heartbeat/galera and
> /usr/lib/ocf/lib/heartbeat/mysql-common.sh. I encourage you to inspect
> those for any relevant differences between your own startup method and
> that of the resource agent.
>
> As one example, note that the resource agent uses mysqld_safe by
> default. This is configurable via the `binary` option for the
> resource. Be sure that you've looked at all the available options
> (`pcs resource describe ocf:heartbeat:galera`) and configured any of
> them that you need. You've definitely at least started that process
> with config and datadir.
The log snippet I pasted was not meant to emphasize 'mysqld_safe'
but the fact that the resource does:
...
Starting mysqld daemon with databases from /var/lib/mysql
...

Yes, I did check the man pages for options, and if you suggest
checking the source code, then I'd say that's a case for a bug report.
I tried my best to make it clear; I doubt I can do better,
but I'll retry:
a) the bits needed to run a "new/second" instance are all in
'mariadb-server.cnf', and Galera works outside of pacemaker
(command above)
b) if I set the 'datadir' and 'config' attributes, yet the resource
tells me "...daemon with databases from /var/lib/mysql...", then
the people looking into the code should perhaps be you (as a
Red Hat employee) and/or the other authors/developers, if I
filed it as a BZ, no?
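A minimal sketch that makes point b) checkable, assuming the start output is captured as in the pacemaker log line quoted above, where the captured newlines appear as a literal \n: reported_datadir and check_datadir are hypothetical helpers that pull out the directory mysqld_safe announces and compare it with the datadir= attribute set on the resource.

```shell
#!/bin/sh
# Hypothetical helper: extract <dir> from the mysqld_safe message
# "Starting mysqld daemon with databases from <dir>" inside a pacemaker
# "..._start_0 ... output [ ... ]" log line. In that log line the captured
# newlines show up as a literal backslash-n, so the path ends at a
# backslash or a blank.
reported_datadir() {
    printf '%s\n' "$1" | sed -n 's/.*databases from \([^ \\]*\).*/\1/p'
}

# Compare the announced directory with the datadir= resource attribute.
# Prints "ok: ..." and returns 0 on a match, prints "mismatch: ..." and
# returns 1 otherwise.
check_datadir() {
    expected=$1
    actual=$(reported_datadir "$2")
    if [ "$actual" = "$expected" ]; then
        echo "ok: $actual"
    else
        echo "mismatch: configured $expected, mysqld_safe used $actual"
        return 1
    fi
}
```

For example, check_datadir /apps/mysql/data "$logline" on the start output quoted earlier reports a mismatch against /var/lib/mysql.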

It should be easily reproducible:
1) have a Galera instance (a 2nd/3rd/etc. instance, separate from
and with different settings than the OS's "standard" rpm
installation) working and tested
2) set up a resource:
-> $ pcs resource create mariadb-apps ocf:heartbeat:galera \
    cluster_host_map="drunk.internal.ccn:10.0.1.6;sucker.internal.ccn:10.0.1.7" \
    config="/apps/etc/mariadb-server.cnf" \
    wsrep_cluster_address="gcomm://10.0.1.6:5567,10.0.1.7:5567" \
    user=mysql group=mysql check_user="pacemaker" check_passwd="pacemaker#9897" \
    op monitor OCF_CHECK_LEVEL="0" timeout="30s" interval="20s" \
    op monitor role="Master" OCF_CHECK_LEVEL="0" timeout="30s" interval="10s" \
    op monitor role="Slave" OCF_CHECK_LEVEL="0" timeout="30s" interval="30s" \
    promotable promoted-max=2 meta failure-timeout=30s

and you should (given the same version and environment) see what
I see: weird "misbehavior".

many thanks, L.


