[ClusterLabs] Unable to run Pacemaker: pcmk_child_exit

Fri May 6 06:40:05 EDT 2016

Hi,

I used the blackbox feature, which showed the reason for the failure.
Because I am cross-compiling Pacemaker on a build machine and later moving
the binaries to the target, a few binaries were missing. After fixing that
and a bunch of other errors/warnings, I am able to get Pacemaker started,
though it is not yet running completely fine.

The node is not getting added:
airv_cu        cib:    error: xml_log: Element node failed to validate attributes

I suppose it is because of this error:
crmd:    error: node_list_update_callback: Node update 4 failed: Update does not conform to the configured schema (-203)

I suspect this is caused by validate-with="pacemaker-0.7" in the CIB. In
another installation, this is set to "pacemaker-2.0".

[root@airv_cu pacemaker]# pcs cluster cib
<cib crm_feature_set="3.0.10" validate-with="pacemaker-0.7" epoch="3" num_updates="0" admin_epoch="0" cib-last-written="Fri May  6 09:28:10 2016" have-quorum="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="true"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.14-5a6cdd1"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
      </cluster_property_set>
    </crm_config>
    <nodes/>
    <resources/>
    <constraints/>
  </configuration>
  <status/>
</cib>

Any idea why/where this is being set to 0.7? I am using the latest
Pacemaker from GitHub.

[root@airv_cu pacemaker]# pacemakerd --version
Pacemaker 1.1.14
Written by Andrew Beekhof

Attaching the corosync.log and corosync.conf files.

-Thanks
Nikhil

On Thu, May 5, 2016 at 10:21 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On 05/05/2016 11:25 AM, Nikhil Utane wrote:
> > Thanks Ken for your quick response as always.
> >
> > But what if I don't want to use quorum? I just want to bring up
> > pacemaker + corosync on 1 node to check that it all comes up fine.
> > I added corosync_votequorum as you suggested. I also added
> > these 2 lines:
> >
> > expected_votes: 2
> > two_node: 1
>
> There's actually nothing wrong with configuring a single-node cluster.
> You can list just one node in corosync.conf and leave off the above.
>
> > However, Pacemaker is still not able to run.
>
> There must be other issues involved. Even if pacemaker doesn't have
> quorum, it will still run, it just won't start resources.
>
> > [root@airv_cu root]# pcs cluster start
> > Starting Cluster...
> > Starting Pacemaker Cluster Manager[FAILED]
> >
> > Error: unable to start pacemaker
> >
> > Corosync.log:
> > May 05 16:15:20 [16294] airv_cu pacemakerd:     info: pcmk_quorum_notification: Membership 240: quorum still lost (1)
> > May 05 16:15:20 [16259] airv_cu corosync debug   [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-16259-16294-21-header
> > May 05 16:15:20 [16294] airv_cu pacemakerd:   notice: crm_update_peer_state_iter:       pcmk_quorum_notification: Node airv_cu[181344357] - state is now member (was (null))
> > May 05 16:15:20 [16294] airv_cu pacemakerd:     info: pcmk_cpg_membership:      Node 181344357 joined group pacemakerd (counter=0.0)
> > May 05 16:15:20 [16294] airv_cu pacemakerd:     info: pcmk_cpg_membership:      Node 181344357 still member of group pacemakerd (peer=airv_cu, counter=0.0)
> > May 05 16:15:20 [16294] airv_cu pacemakerd:  warning: pcmk_child_exit:  The cib process (16353) can no longer be respawned, shutting the cluster down.
> > May 05 16:15:20 [16294] airv_cu pacemakerd:   notice: pcmk_shutdown_worker:     Shutting down Pacemaker
> >
> > The log and conf files are attached.
> >
> > -Regards
> > Nikhil
> >
> > On Thu, May 5, 2016 at 8:04 PM, Ken Gaillot <kgaillot at redhat.com
> > <mailto:kgaillot at redhat.com>> wrote:
> >
> >     On 05/05/2016 08:36 AM, Nikhil Utane wrote:
> >     > Hi,
> >     >
> >     > Continuing with my adventure to run Pacemaker & Corosync on our
> >     > big-endian system, I managed to get past the corosync issue for now,
> >     > but I am facing an issue in running Pacemaker.
> >     >
> >     > I am seeing the following messages in corosync.log:
> >     >  pacemakerd:  warning: pcmk_child_exit:  The cib process (20000) can no longer be respawned, shutting the cluster down.
> >     >  pacemakerd:  warning: pcmk_child_exit:  The stonith-ng process (20001) can no longer be respawned, shutting the cluster down.
> >     >  pacemakerd:  warning: pcmk_child_exit:  The lrmd process (20002) can no longer be respawned, shutting the cluster down.
> >     >  pacemakerd:  warning: pcmk_child_exit:  The attrd process (20003) can no longer be respawned, shutting the cluster down.
> >     >  pacemakerd:  warning: pcmk_child_exit:  The pengine process (20004) can no longer be respawned, shutting the cluster down.
> >     >  pacemakerd:  warning: pcmk_child_exit:  The crmd process (20005) can no longer be respawned, shutting the cluster down.
> >     >
> >     > I see the following error before these messages. I am not sure if
> >     > this is the cause.
> >     > May 05 11:26:24 [19998] airv_cu pacemakerd:    error: cluster_connect_quorum:   Corosync quorum is not configured
> >     >
> >     > I tried removing the quorum block (which is blank anyway) from the
> >     > conf file but still had the same error.
> >
> >     Yes, that is the issue. Pacemaker can't do anything if it can't ask
> >     corosync about quorum. I don't know what the issue is at the corosync
> >     level, but your corosync.conf should have:
> >
> >     quorum {
> >         provider: corosync_votequorum
> >     }
> >
> >
> >     > Attaching the log and conf files. Please let me know if there is any
> >     > obvious mistake or how to investigate it further.
> >     >
> >     > I am using the pcs cluster start command to start the cluster.
> >     >
> >     > -Thanks
> >     > Nikhil
>
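For the single-node test Ken suggests (just one node listed in corosync.conf, without expected_votes or two_node), a minimal corosync.conf might look roughly like the sketch below. It assumes corosync 2.x with a nodelist and the udpu transport; the cluster name "testcluster", the address 192.0.2.10, and nodeid 1 are placeholders, not values taken from the attached corosync_4.conf. The quorum block is the one Ken recommends in his earlier reply.

```
totem {
    version: 2
    # Placeholder cluster name (assumption); use your own.
    cluster_name: testcluster
    # Unicast UDP (assumption); requires the nodelist below.
    transport: udpu
}

nodelist {
    # Only one node is listed for a single-node test cluster.
    node {
        # Placeholder address for airv_cu; replace with the real one.
        ring0_addr: 192.0.2.10
        nodeid: 1
    }
}

quorum {
    # Pacemaker needs a quorum provider it can query.
    provider: corosync_votequorum
    # No expected_votes or two_node settings for a single node.
}
```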
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync_4.log
Type: application/octet-stream
Size: 200696 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160506/4de331dd/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync_4.conf
Type: application/octet-stream
Size: 2901 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20160506/4de331dd/attachment-0007.obj>