[ClusterLabs] pacemaker startup problem

Sun Jul 26 06:32:29 EDT 2020

Illumos might have getpeerucred, which can also set errno to ENOTSUP.
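
For reference, here is a minimal sketch of the kind of per-platform
credential lookup discussed in the quoted message below, to show where a
negative ENOTSUP can come from. It is not the libqb code. The names
`struct peer_id`, `peer_id_lookup` and `peer_id_demo` are hypothetical, the
`__sun` and `__linux__` guards stand in for libqb's configure-time HAVE_*
checks, and the Linux branch uses SO_PEERCRED via getsockopt() for brevity
rather than SO_PASSCRED.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes struct ucred on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#if defined(__sun)
#include <ucred.h>              /* getpeerucred(), ucred_*() on illumos/Solaris */
#endif

/* Hypothetical result container; not a libqb type. */
struct peer_id {
    pid_t pid;
    uid_t uid;
    gid_t gid;
};

/* Returns 0 on success or a negative errno value. */
int peer_id_lookup(int fd, struct peer_id *out)
{
    assert(out != NULL);
#if defined(__sun)
    ucred_t *uc = NULL;         /* NULL asks getpeerucred() to allocate */

    if (getpeerucred(fd, &uc) != 0) {
        return -errno;          /* can be -ENOTSUP for an unsupported descriptor */
    }
    out->pid = ucred_getpid(uc);
    out->uid = ucred_geteuid(uc);
    out->gid = ucred_getegid(uc);
    ucred_free(uc);
    return 0;
#elif defined(__linux__) && defined(SO_PEERCRED)
    struct ucred uc;
    socklen_t len = sizeof(uc);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &uc, &len) != 0) {
        return -errno;
    }
    out->pid = uc.pid;
    out->uid = uc.uid;
    out->gid = uc.gid;
    return 0;
#else                           /* no credential mechanism compiled in */
    (void)fd;
    out->pid = 0;
    out->uid = 0;
    out->gid = 0;
    return -ENOTSUP;
#endif
}

/* Queries the peer of one end of a connected AF_UNIX socket pair. */
int peer_id_demo(void)
{
    int sv[2];
    struct peer_id id;
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return -errno;
    }
    rc = peer_id_lookup(sv[0], &id);
    if (rc == 0) {
        printf("peer pid=%ld uid=%lu gid=%lu\n",
               (long)id.pid, (unsigned long)id.uid, (unsigned long)id.gid);
    } else {
        printf("lookup failed: %s (%d)\n", strerror(-rc), rc);
    }
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Calling peer_id_demo() from main() shows which branch was compiled in. A
build that falls through to the last branch always reports ENOTSUP, while
the illumos branch returns whatever errno getpeerucred() sets. As far as I
know ENOTSUP is errno 48 on illumos, which would match the "(-48)" in the
logs.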

On Sun, Jul 26, 2020 at 3:25 AM Reid Wahl <nwahl at redhat.com> wrote:

> Hmm. If it's reading PCMK_ipc_type and matching the server type to
> QB_IPC_SOCKET, then the only other place I see it could be coming from is
> qb_ipc_auth_creds.
>
> qb_ipcs_run -> qb_ipcs_us_publish -> qb_ipcs_us_connection_acceptor ->
> qb_ipcs_uc_recv_and_auth -> process_auth -> qb_ipc_auth_creds ->
>
> static int32_t
> qb_ipc_auth_creds(struct ipc_auth_data *data)
> {
> ...
> #ifdef HAVE_GETPEERUCRED
>         /*
>          * Solaris and some BSD systems
> ...
> #elif defined(HAVE_GETPEEREID)
>         /*
>         * Usually MacOSX systems
> ...
> #elif defined(SO_PASSCRED)
>         /*
>         * Usually Linux systems
> ...
> #else /* no credentials */
>         data->ugp.pid = 0;
>         data->ugp.uid = 0;
>         data->ugp.gid = 0;
>         res = -ENOTSUP;
> #endif /* no credentials */
>
>         return res;
>
> I'll leave it to Ken to say whether that's likely and what it implies if
> so.
>
> On Sun, Jul 26, 2020 at 2:53 AM Gabriele Bulfon <gbulfon at sonicle.com>
> wrote:
>
>> Sorry, actually the problem is not gone yet.
>> Now corosync and pacemaker are running happily, but those IPC errors are
>> coming out of heartbeat and crmd as soon as I start it.
>> The pacemakerd process has PCMK_ipc_type=socket, so what's wrong with
>> heartbeat or crmd?
>>
>> Here's the env of the process:
>>
>> sonicle at xstorage1:/sonicle/etc/cluster/ha.d# penv 4222
>> 4222: /usr/sbin/pacemakerd
>> envp[0]: PCMK_respawned=true
>> envp[1]: PCMK_watchdog=false
>> envp[2]: HA_LOGFACILITY=none
>> envp[3]: HA_logfacility=none
>> envp[4]: PCMK_logfacility=none
>> envp[5]: HA_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[6]: PCMK_logfile=/sonicle/var/log/cluster/corosync.log
>> envp[7]: HA_debug=0
>> envp[8]: PCMK_debug=0
>> envp[9]: HA_quorum_type=corosync
>> envp[10]: PCMK_quorum_type=corosync
>> envp[11]: HA_cluster_type=corosync
>> envp[12]: PCMK_cluster_type=corosync
>> envp[13]: HA_use_logd=off
>> envp[14]: PCMK_use_logd=off
>> envp[15]: HA_mcp=true
>> envp[16]: PCMK_mcp=true
>> envp[17]: HA_LOGD=no
>> envp[18]: LC_ALL=C
>> envp[19]: PCMK_service=pacemakerd
>> envp[20]: PCMK_ipc_type=socket
>> envp[21]: SMF_ZONENAME=global
>> envp[22]: PWD=/
>> envp[23]: SMF_FMRI=svc:/sonicle/xstream/cluster/pacemaker:default
>> envp[24]: _=/usr/sbin/pacemakerd
>> envp[25]: TZ=Europe/Rome
>> envp[26]: LANG=en_US.UTF-8
>> envp[27]: SMF_METHOD=start
>> envp[28]: SHLVL=2
>> envp[29]: PATH=/usr/sbin:/usr/bin
>> envp[30]: SMF_RESTARTER=svc:/system/svc/restarter:default
>> envp[31]: A__z="*SHLVL
>>
>>
>> Here are crmd complaints:
>>
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Node xstorage1 state is now member
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not start crmd IPC server: Operation not supported (-48)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Failed to create IPC server: shutting down and inhibiting respawn
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> The local CRM is operational
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_ERROR received in state S_STARTING from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> State transition S_STARTING -> S_RECOVERY
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Fast-tracking shutdown in response to errors
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.warning] warning:
>> Input I_PENDING received in state S_RECOVERY from do_started
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.notice] notice:
>> Disconnected from the LRM
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Child process pengine exited (pid=4316, rc=100)
>> Jul 26 11:39:07 xstorage1 crmd[4315]: [ID 702911 daemon.error] error:
>> Could not recover from internal error
>> Jul 26 11:39:07 xstorage1 heartbeat: [ID 996084 daemon.warning] [4275]:
>> WARN: Managed /usr/libexec/pacemaker/crmd process 4315 exited with return
>> code 201.
>>
>> Sonicle S.r.l.: http://www.sonicle.com
>> Music: http://www.gabrielebulfon.com
>> Quantum Mechanics: http://www.cdbaby.com/cd/gabrielebulfon
>>
>> ----------------------------------------------------------------------------------
>>
>> From: Ken Gaillot <kgaillot at redhat.com>
>> To: Cluster Labs - All topics related to open-source clustering welcomed <
>> users at clusterlabs.org>
>> Date: 25 July 2020 0:46:52 CEST
>> Subject: Re: [ClusterLabs] pacemaker startup problem
>>
>> On Fri, 2020-07-24 at 18:34 +0200, Gabriele Bulfon wrote:
>> > Hello,
>> >
>> > after a long time I'm back to running heartbeat/pacemaker/corosync on
>> > our XStreamOS/illumos distro.
>> > I rebuilt the original components I built in 2016 on our latest release
>> > (probably a bit outdated, but I want to start from where I left off).
>> > Looks like pacemaker is having trouble starting up, showing these logs:
>> >
>> > Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
>> > Set r/w permissions for uid=401, gid=401 on /var/log/pacemaker.log
>> > Jul 24 18:21:32 [971] crmd: info: crm_log_init: Changed active
>> > directory to /sonicle/var/cluster/lib/pacemaker/cores
>> > Jul 24 18:21:32 [971] crmd: info: main: CRM Git Version: 1.1.15
>> > (e174ec8)
>> > Jul 24 18:21:32 [971] crmd: info: do_log: Input I_STARTUP received in
>> > state S_STARTING from crmd_init
>> > Jul 24 18:21:32 [969] lrmd: info: crm_log_init: Changed active
>> > directory to /sonicle/var/cluster/lib/pacemaker/cores
>> > Jul 24 18:21:32 [968] stonith-ng: info: crm_log_init: Changed active
>> > directory to /sonicle/var/cluster/lib/pacemaker/cores
>> > Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Verifying
>> > cluster type: 'heartbeat'
>> > Jul 24 18:21:32 [968] stonith-ng: info: get_cluster_type: Assuming an
>> > active 'heartbeat' cluster
>> > Jul 24 18:21:32 [968] stonith-ng: notice: crm_cluster_connect:
>> > Connecting to cluster infrastructure: heartbeat
>>
>>
>> > Jul 24 18:21:32 [969] lrmd: error: mainloop_add_ipc_server: Could not
>> > start lrmd IPC server: Operation not supported (-48)
>>
>> This is repeated for all the subdaemons ... the error is coming from
>> qb_ipcs_run(), so it looks like the issue is an invalid PCMK_ipc_type
>> for illumos. If you set it to "socket" it should work.
>>
>>
>> > Jul 24 18:21:32 [969] lrmd: error: main: Failed to create IPC server:
>> > shutting down and inhibiting respawn
>> > Jul 24 18:21:32 [969] lrmd: info: crm_xml_cleanup: Cleaning up memory
>> > from libxml2
>> > Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Verifying cluster
>> > type: 'heartbeat'
>> > Jul 24 18:21:32 [971] crmd: info: get_cluster_type: Assuming an
>> > active 'heartbeat' cluster
>> > Jul 24 18:21:32 [971] crmd: info: start_subsystem: Starting sub-
>> > system "pengine"
>> > Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Created entry
>> > 25bc5492-a49e-40d7-ae60-fd8f975a294a/80886f0 for node xstorage1/0 (1
>> > total)
>> > Jul 24 18:21:32 [968] stonith-ng: info: crm_get_peer: Node 0 has uuid
>> > d426a730-5229-6758-853a-99d4d491514a
>> > Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn:
>> > Hostname: xstorage1
>> > Jul 24 18:21:32 [968] stonith-ng: info: register_heartbeat_conn:
>> > UUID: d426a730-5229-6758-853a-99d4d491514a
>> > Jul 24 18:21:32 [970] attrd: notice: crm_cluster_connect: Connecting
>> > to cluster infrastructure: heartbeat
>> > Jul 24 18:21:32 [970] attrd: error: mainloop_add_ipc_server: Could
>> > not start attrd IPC server: Operation not supported (-48)
>> > Jul 24 18:21:32 [970] attrd: error: attrd_ipc_server_init: Failed to
>> > create attrd servers: exiting and inhibiting respawn.
>> > Jul 24 18:21:32 [970] attrd: warning: attrd_ipc_server_init: Verify
>> > pacemaker and pacemaker_remote are not both enabled.
>> > Jul 24 18:21:32 [972] pengine: info: crm_log_init: Changed active
>> > directory to /sonicle/var/cluster/lib/pacemaker/cores
>> > Jul 24 18:21:32 [972] pengine: error: mainloop_add_ipc_server: Could
>> > not start pengine IPC server: Operation not supported (-48)
>> > Jul 24 18:21:32 [972] pengine: error: main: Failed to create IPC
>> > server: shutting down and inhibiting respawn
>> > Jul 24 18:21:32 [972] pengine: info: crm_xml_cleanup: Cleaning up
>> > memory from libxml2
>> > Jul 24 18:21:33 [971] crmd: info: do_cib_control: Could not connect
>> > to the CIB service: Transport endpoint is not connected
>> > Jul 24 18:21:33 [971] crmd: warning: do_cib_control: Couldn't
>> > complete CIB registration 1 times... pause and retry
>> > Jul 24 18:21:33 [971] crmd: error: crmd_child_exit: Child process
>> > pengine exited (pid=972, rc=100)
>> > Jul 24 18:21:35 [971] crmd: info: crm_timer_popped: Wait Timer
>> > (I_NULL) just popped (2000ms)
>> > Jul 24 18:21:36 [971] crmd: info: do_cib_control: Could not connect
>> > to the CIB service: Transport endpoint is not connected
>> > Jul 24 18:21:36 [971] crmd: warning: do_cib_control: Couldn't
>> > complete CIB registration 2 times... pause and retry
>> > Jul 24 18:21:38 [971] crmd: info: crm_timer_popped: Wait Timer
>> > (I_NULL) just popped (2000ms)
>> > Jul 24 18:21:39 [971] crmd: info: do_cib_control: Could not connect
>> > to the CIB service: Transport endpoint is not connected
>> > Jul 24 18:21:39 [971] crmd: warning: do_cib_control: Couldn't
>> > complete CIB registration 3 times... pause and retry
>> > Jul 24 18:21:41 [971] crmd: info: crm_timer_popped: Wait Timer
>> > (I_NULL) just popped (2000ms)
>> > Jul 24 18:21:42 [971] crmd: info: do_cib_control: Could not connect
>> > to the CIB service: Transport endpoint is not connected
>> > Jul 24 18:21:42 [971] crmd: warning: do_cib_control: Couldn't
>> > complete CIB registration 4 times... pause and retry
>> > Jul 24 18:21:42 [968] stonith-ng: error: setup_cib: Could not connect
>> > to the CIB service: Transport endpoint is not connected (-134)
>> > Jul 24 18:21:42 [968] stonith-ng: error: mainloop_add_ipc_server:
>> > Could not start stonith-ng IPC server: Operation not supported (-48)
>> > Jul 24 18:21:42 [968] stonith-ng: error: stonith_ipc_server_init:
>> > Failed to create stonith-ng servers: exiting and inhibiting respawn.
>> > Jul 24 18:21:42 [968] stonith-ng: warning: stonith_ipc_server_init:
>> > Verify pacemaker and pacemaker_remote are not both enabled.
>> >
>> > Any idea what's happening?
>> > Gabriele
>> >
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users
>> >
>> > ClusterLabs home: https://www.clusterlabs.org/
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA