[ClusterLabs] Pacemaker fails to start after a few starts

Ken Gaillot kgaillot at redhat.com
Wed Apr 22 14:00:49 UTC 2015


On 04/22/2015 07:27 AM, Kostiantyn Ponomarenko wrote:
> I faced this issue one more time.
> Now I can say for sure that Corosync doesn't crash.
> 
> On a working machine I stopped Pacemaker and Corosync.
> Then I started them with the following commands and got this:
> ------------------
> # /etc/init.d/corosync start
> Starting Corosync Cluster Engine (corosync): [  OK  ]
> # /etc/init.d/corosync status
> corosync (pid 100837) is running...
> # /etc/init.d/pacemaker start
> Starting Pacemaker Cluster Manager[  OK  ]
> # /etc/init.d/pacemaker status
> pacemakerd is stopped
> ------------------
> 
> 
> /var/log/messages:
> ------------------
> Apr 22 10:49:08 daemon.notice<29> pacemaker: Starting Pacemaker Cluster Manager
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> crm_add_logfile: Additional logging available in
> /var/log/pacemaker.log
> Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
> mcp_read_config: Couldn't create logfile: /var/log/pacemaker.log

The above two messages are really strange in combination. There's no
code path that could print both. It would also be really strange that
you could sometimes create /var/log/pacemaker.log and sometimes not.

Given this and libqb segfaulting when attrd tries to clean up later
below, I'm wondering whether you have some sort of library issue --
maybe incompatible versions or a corrupted binary. Did you install via
OS packages or compile yourself? If you compiled yourself, you may want
to retry with versions that are known to work with each other. If via
the OS, you may want to force a reinstall.
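
If you're on an RPM-based distro (a guess on my part; adjust package names
to whatever your build actually uses), you can check whether the installed
binaries and libraries still match what the packages shipped, and reinstall
anything that looks off:

    # Verify on-disk files (size, checksum, permissions) against the RPM database
    rpm -V pacemaker pacemaker-libs corosync corosynclib libqb

    # Force a clean reinstall of anything that gets flagged
    yum reinstall pacemaker pacemaker-libs corosync corosynclib libqb

A clean "rpm -V" run at least rules out a corrupted binary on disk.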

The only other thing I can think of is memory corruption, but this is in
the first second that pacemaker starts up, so it would probably be
hardware. If you're using ECC RAM, check the kernel logs for errors with
"EDAC" in them (error detection and correction). You might want to run
memtest on the system if you can afford to take it out of production for
a day or so. I would strongly suspect this if the problem always happens
on the same physical machine, but it's highly unlikely if it happens on all
the nodes.
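
For the ECC check, something like this should be enough, assuming your
kernel messages end up in dmesg and /var/log/messages:

    # Look for corrected/uncorrected memory errors reported by the EDAC driver
    dmesg | grep -i edac
    grep -i edac /var/log/messages

    # If edac-utils happens to be installed, this summarizes error counts
    edac-util --report=full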

> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> mcp_read_config: Configured corosync to accept connections from group
> 107: Library error (2)

Another unusual error, this time from corosync's cmap_set_uint8().
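
If you want to poke at this outside of pacemaker, corosync-cmapctl can set
and read the same key by hand (a rough equivalent of what mcp_read_config
does; 107 is the gid from your log):

    # Try setting the uidgid key pacemaker sets at startup
    corosync-cmapctl -s uidgid.gid.107 u8 1

    # Dump whatever uidgid keys corosync currently has
    corosync-cmapctl | grep uidgid

If the set fails here too, the problem is on the corosync/libqb side rather
than in pacemaker.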

> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice: main:
> Starting Pacemaker 1.1.12 (Build: 561c4cf):  generated-manpages
> agent-manpages ascii-docs ncurses libqb-logging libqb-ipc lha-fencing
> upstart nagios  corosync-native snmp libesmtp acls
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> cluster_connect_quorum: Quorum lost
> Apr 22 10:49:08 daemon.notice<29> stonithd[114136]:   notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied
> connection attempt from 105:107
> Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error:
> cluster_connect_cpg: Could not connect to the Cluster Process Group
> API: 11
> Apr 22 10:49:08 daemon.err<27> attrd[114138]:    error: main: Cluster
> connection failed

Everything above here is errors from pacemaker being unable to talk to
corosync.
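
The 105:107 in the denied-connection messages is the uid:gid the pacemaker
daemons run as (normally hacluster:haclient). It's worth a quick sanity
check that those accounts look right, and whether you already have any
static uidgid grants configured (a static uidgid {} stanza in corosync.conf
is an alternative to the runtime cmap call above):

    # Confirm the pacemaker account and group exist with the expected ids
    id hacluster
    getent group haclient

    # See whether any static uidgid access is already configured
    grep -r -A3 uidgid /etc/corosync/corosync.conf /etc/corosync/uidgid.d/ 2>/dev/null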

> Apr 22 10:49:08 daemon.notice<29> attrd[114138]:   notice: main:
> Cleaning up before exit
> Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]:
> segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in
> libqb.so.0.17.1[7f375480d000+22000]

libqb (used by attrd) isn't handling the failure well, and this is where
your core dump is coming from, but that's a symptom rather than a cause.
Given the earlier errors, I doubt it's even an issue in libqb.
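
That said, if you want to see exactly where attrd died, a full backtrace
from the core would still be useful (the attrd path below is a guess based
on a typical install; adjust it to wherever your build puts the daemon, and
have debug symbols installed if you can):

    # Pull a full backtrace out of the attrd core dump
    # (substitute the actual core file name from /var/cores)
    gdb /usr/libexec/pacemaker/attrd /var/cores/core.attrd-114138-* \
        -batch -ex 'thread apply all bt full'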

> Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid
> IPC credentials (100837-114138-2).
> Apr 22 10:49:08 daemon.notice<29> cib[114135]:   notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Apr 22 10:49:08 daemon.err<27> cib[114135]:    error:
> cluster_connect_cpg: Could not connect to the Cluster Process Group
> API: 11
> Apr 22 10:49:08 daemon.crit<26> cib[114135]:     crit: cib_init:
> Cannot sign in to the cluster... terminating
> Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [MAIN  ] Denied
> connection attempt from 105:107
> Apr 22 10:49:08 daemon.err<27> corosync[100837]:   [QB    ] Invalid
> IPC credentials (100837-114135-3).
> Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice: main: CRM
> Git Version: 561c4cf
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> crm_update_peer_state: pcmk_quorum_notification: Node node-0[1] -
> state is now member (was (null))
> Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
> pcmk_child_exit: Child process cib (114135) exited: Network is down
> (100)
> Apr 22 10:49:08 daemon.warning<28> pacemakerd[114133]:  warning:
> pcmk_child_exit: Pacemaker child process cib no longer wishes to be
> respawned. Shutting ourselves down.
> Apr 22 10:49:08 daemon.err<27> pacemakerd[114133]:    error:
> child_waitpid: Managed process 114138 (attrd) dumped core
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> pcmk_child_exit: Child process attrd terminated with signal 11
> (pid=114138, core=1)
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> pcmk_shutdown_worker: Shuting down Pacemaker
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> stop_child: Stopping crmd: Sent -15 to process 114140
> Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning:
> do_cib_control: Couldn't complete CIB registration 1 times... pause
> and retry
> Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
> crm_shutdown: Requesting shutdown, upper limit is 1200000ms
> Apr 22 10:49:08 daemon.warning<28> crmd[114140]:  warning: do_log:
> FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
> Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
> do_state_transition: State transition S_STARTING -> S_STOPPING [
> input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
> Apr 22 10:49:08 daemon.notice<29> crmd[114140]:   notice:
> terminate_cs_connection: Disconnecting from Corosync
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> stop_child: Stopping pengine: Sent -15 to process 114139
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> stop_child: Stopping lrmd: Sent -15 to process 114137
> Apr 22 10:49:08 daemon.notice<29> pacemakerd[114133]:   notice:
> stop_child: Stopping stonith-ng: Sent -15 to process 114136
> Apr 22 10:49:17 daemon.err<27> stonithd[114136]:    error: setup_cib:
> Could not connect to the CIB service: Transport endpoint is not
> connected (-107)
> Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice:
> pcmk_shutdown_worker: Shutdown complete
> Apr 22 10:49:17 daemon.notice<29> pacemakerd[114133]:   notice:
> pcmk_shutdown_worker: Attempting to inhibit respawning after fatal
> error
> ------------------
> 
> 
> "/var/cores/" contains only "core.attrd-*".
> What else can I do?
> 
> 
> Could the problem be in 'libqb'?
> I noticed this line in the log:
> Apr 22 10:49:08 kern.info<6> kernel: [162259.416242] attrd[114138]:
> segfault at 1b8 ip 00007f375481c9e1 sp 00007fff7ddf0d50 error 4 in
> libqb.so.0.17.1[7f375480d000+22000]
> 
> 
> 
> Thank you,
> Kostya
> 
> On Mon, Apr 20, 2015 at 7:56 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
>>
>>> On 14 Apr 2015, at 9:01 pm, Kostiantyn Ponomarenko <
>> konstantin.ponomarenko at gmail.com> wrote:
>>>
>>> Disk wasn't full.
>>> According to: "Mar 27 14:00:50 daemon.err<27> pacemakerd[111069]:
>> error: child_waitpid: Managed process 111074 (attrd) dumped core", there is
>> a core dump in "/var/cores/core.attrd-111074-1427464849".
>>> It is the one which corresponds to the log snippet and it is attached to
>> the email.
>>
>> attrd crashing will be unrelated to whether or not corosync is also
>> crashing
>>
>>>
>>>
>>>
>>>
>>> Thank you,
>>> Kostya
>>>
>>> On Fri, Apr 10, 2015 at 10:00 AM, Jan Pokorný <jpokorny at redhat.com>
>> wrote:
>>> Hello,
>>>
>>> On 30/03/15 10:36 +1100, Andrew Beekhof wrote:
>>>>> On 28 Mar 2015, at 1:10 am, Kostiantyn Ponomarenko <
>> konstantin.ponomarenko at gmail.com> wrote:
>>>>> If I start/stop Corosync and Pacemaker a few times, I get into a state
>>>>> where Corosync is running, but Pacemaker cannot start.
>>>>> Here is a snippet from /var/log/messages:
>>>
>>> [...]
>>>
>>>>> Mar 27 14:00:49 daemon.notice<29> pacemakerd[111069]:   notice:
>> mcp_read_config: Configured corosync to accept connections from group 107:
>> Library error (2)
>>>>
>>>> Everything else flows from this.
>>>> Perhaps one of the corosync people can comment on the conditions
>>>> under which this call would fail.
>>>
>>> CC'd relevant ML.
>>>
>>>> Relevant code from pacemaker is:
>>>>
>>>>             char key[PATH_MAX];
>>>>             snprintf(key, PATH_MAX, "uidgid.gid.%u", gid);
>>>>             rc = cmap_set_uint8(local_handle, key, 1);
>>>>             crm_notice("Configured corosync to accept connections from
>> group %u: %s (%d)",
>>>>                        gid, ais_error2text(rc), rc);
>>>
>>>
>>> Appears to resemble https://bugzilla.redhat.com/show_bug.cgi?id=1114852
>>>
>>> --
>>> Jan
>>>
