[ClusterLabs] Pacemaker fails to start after few starts

Wed Apr 22 11:27:03 UTC 2015

I faced this issue one more time.
Now I can surely say that Corosync doesn't crash.

On a working machine I stopped Pacemaker and Corosync.
Then I started them with the following commands and got this:
------------------
# /etc/init.d/corosync start
Starting Corosync Cluster Engine (corosync): [  OK  ]
# /etc/init.d/corosync status
corosync (pid 100837) is running...
# /etc/init.d/pacemaker start
Starting Pacemaker Cluster Manager[  OK  ]
# /etc/init.d/pacemaker status
pacemakerd is stopped
------------------

/var/log/messages:
------------------
Apr 22 10:49:08 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
crm_add_logfile: Additional logging available in
/var/log/pacemaker.log
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
mcp_read_config: Configured corosync to accept connections from group
107: Library error (2)
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: main:
Starting Pacemaker 1.1.12 (Build: 561c4cf):  generated-manpages
agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing
upstart nagios  corosync-native snmp libesmtp acls
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
cluster_connect_quorum: Quorum lost
Apr 22 10:49:08 daemon.notice<29> stonithd[114136]:   notice:
crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice:
crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied
connection attempt from 105:107
Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error:
cluster_connect_cpg: Could not connect to the Cluster Process Group
API: 11
Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error: main: Cluster
connection failed
Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice: main:
Cleaning up before exit
Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]:
segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in
libqb.so.0.17.1[7f375480d000+22000]
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid
IPC credentials (100837-114138-2).
Apr 22 10:49:08 daemon.notice<29> cib[114135]:   notice:
crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 22 10:49:08 daemon.err<27> cib[114135]:    error:
cluster_connect_cpg: Could not connect to the Cluster Process Group
API: 11
Apr 22 10:49:08 daemon.crit<26> cib[114135]:     crit: cib_init:
Cannot sign in to the cluster... terminating
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied
connection attempt from 105:107
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid
IPC credentials (100837-114135-3).
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: main: CRM
Git Version: 561c4cf
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
crm_update_peer_state: pcmk_quorum_notification: Node node-0[1] -
state is now member (was (null))
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
pcmk_child_exit: Child process cib (114135) exited: Network is down
(100)
Apr 22 10:49:08 daemon.warning<28> pacemakerd[114133]:  warning:
pcmk_child_exit: Pacemaker child process cib no longer wishes to be
respawned. Shutting ourselves down.
Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
child_waitpid: Managed process 114138 (attrd) dumped core
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
pcmk_child_exit: Child process attrd terminated with signal 11
(pid=114138, core=1)
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
pcmk_shutdown_worker: Shuting down Pacemaker
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
stop_child: Stopping crmd: Sent -15 to process 114140
Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning:
do_cib_control: Couldn't complete CIB registration 1 times... pause
and retry
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning: do_log:
FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
do_state_transition: State transition S_STARTING -> S_STOPPING [
input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
terminate_cs_connection: Disconnecting from Corosync
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
stop_child: Stopping pengine: Sent -15 to process 114139
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
stop_child: Stopping lrmd: Sent -15 to process 114137
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
stop_child: Stopping stonith-ng: Sent -15 to process 114136
Apr 22 10:49:17 daemon.err<27> stonithd[114136]:    error: setup_cib:
Could not connect to the CIB service: Transport endpoint is not
connected (-107)
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice:
pcmk_shutdown_worker: Shutdown complete
Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice:
pcmk_shutdown_worker: Attempting to inhibit respawning after fatal
error
------------------
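
One way to check whether the failed group grant and the denied connections point at the same group is sketched below. It is only an illustration: the three sample lines are copied from the log above and re-joined where the mail wrapped them, the helper names are made up for this example, and reading "105:107" as uid:gid is an assumption based on the group number matching the one pacemakerd tried to register.

```python
import re

# Three lines from the log above, re-joined where the mail wrapped them.
LOG = """\
Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: mcp_read_config: Configured corosync to accept connections from group 107: Library error (2)
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied connection attempt from 105:107
Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied connection attempt from 105:107
"""

CS_OK = 1  # success value of corosync's cs_error_t; 2 is the "Library error" seen above

GRANT = re.compile(r"accept connections from group (\d+): .*\((\d+)\)")
DENIED = re.compile(r"Denied connection attempt from (\d+):(\d+)")


def failed_grants(text):
    """Map gid -> return code for every grant that did not report CS_OK."""
    return {int(g): int(rc) for g, rc in GRANT.findall(text) if int(rc) != CS_OK}


def denied_ids(text):
    """Set of (uid, gid) pairs corosync refused (the field order is assumed)."""
    return {(int(u), int(g)) for u, g in DENIED.findall(text)}


failed = failed_grants(LOG)
denied = denied_ids(LOG)
# Groups whose grant failed and which later show up in a denied connection.
suspects = sorted(g for g in failed if any(g == dg for _, dg in denied))
print(failed, denied, suspects)
```

If the same group shows up in both sets, that is consistent with the denied connections being a consequence of the failed grant, but it does not by itself explain why the grant failed.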

"/var/cores/" contains only "core.attrd-*".
What else can I do?

Could the problem be in 'libqb'?
I noticed this line in the log:
Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]:
segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in
libqb.so.0.17.1[7f375480d000+22000]
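
The kernel line gives enough to work out where inside the library the fault happened: the instruction pointer minus the mapping base printed in the brackets. A minimal sketch of that arithmetic follows; the function name is made up for this example, and the input is the line above joined back onto one line.

```python
import re

# Kernel segfault report from the log above, joined onto one line.
LINE = ("attrd[114138]: segfault at 1b8 ip 00007f375481c9e1 "
        "sp 00007fff7ddf0d50 error 4 in libqb.so.0.17.1[7f375480d000+22000]")

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library name, offset of the faulting instruction in its mapping)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    # The instruction pointer should fall inside the reported mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return m.group("lib"), offset


lib, offset = fault_offset(LINE)
print(lib, hex(offset))  # libqb.so.0.17.1 0xf9e1
```

With debug symbols for this exact libqb build installed, that offset could be handed to a tool such as addr2line to get a function name, or the core file in /var/cores/ could be opened in gdb for a full backtrace. The small faulting address (0x1b8) would be typical of reading a structure member through a NULL pointer, though that is only a guess until a backtrace confirms it.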

Thank you,
Kostya

On Mon, Apr 20, 2015 at 7:56 AM, Andrew Beekhof <andrew at beekhof.net> wrote:

>
> > On 14 Apr 2015, at 9:01 pm, Kostiantyn Ponomarenko <
> konstantin.ponomarenko at gmail.com> wrote:
> >
> > Disk wasn't full.
> > According to: "Mar 27 14:00:50 daemon.err<27> pacemakerd[111069]:
> error: child_waitpid: Managed process 111074 (attrd) dumped core", there is
> a core dump in "/var/cores/core.attrd-111074-1427464849".
> > It is the one which corresponds to the log snippet and it is attached to
> the email.
>
> attrd crashing will be unrelated to whether or not corosync is also
> crashing
>
> >
> >
> >
> >
> > Thank you,
> > Kostya
> >
> > On Fri, Apr 10, 2015 at 10:00 AM, Jan Pokorný <jpokorny at redhat.com>
> wrote:
> > Hello ,
> >
> > On 30/03/15 10:36 +1100, Andrew Beekhof wrote:
> > >> On 28 Mar 2015, at 1:10 am, Kostiantyn Ponomarenko <
> konstantin.ponomarenko at gmail.com> wrote:
> > >> If I start/stop Corosync and Pacemaker few times I get the state
> > >> where Corosync is running, but Pacemaker cannot start.
> > >> Here is a snippet from /var/log/messages:
> >
> > [...]
> >
> > >> Mar 27 14:00:49 daemon.notice<29> pacemakerd[111069]:   notice:
> mcp_read_config: Configured corosync to accept connections from group 107:
> Library error (2)
> > >
> > > Everything else flows from this.
> > > Perhaps one of the corosync people can comment on the conditions
> > > under which this call would fail.
> >
> > CC'd relevant ML.
> >
> > > Relevant code from pacemaker is:
> > >
> > >             char key[PATH_MAX];
> > >             snprintf(key, PATH_MAX, "uidgid.gid.%u", gid);
> > >             rc = cmap_set_uint8(local_handle, key, 1);
> > >             crm_notice("Configured corosync to accept connections from
> group %u: %s (%d)",
> > >                        gid, ais_error2text(rc), rc);
> >
> >
> > Appears to resemble https://bugzilla.redhat.com/show_bug.cgi?id=1114852
> >
> > --
> > Jan
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> >
> >
> <core.attrd-111074-1427464849>
>