[Pacemaker] corosync not able to exec services properly
Shravan Mishra
shravan.mishra at gmail.com
Fri Jan 1 15:35:04 EST 2010
It took us a good 6 hours to figure out the problem.
I'm sharing this in case anybody else runs into it.
The pacemaker daemons log through syslog-ng over unix domain sockets,
which means syslog-ng has to be up and running at all times.
On our system we run multiple syslog-ng instances on different network
interfaces. It turned out that the syslog-ng instance on the loopback
interface, which pacemaker uses for logging, had been stopped at one
point. At another point, a race condition in our scripts let corosync
start before syslog-ng.
Anyway, the moral of the story is that when corosync does not start
properly, with child processes being spawned but failing to exec, this
is one thing to check.
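
For what it's worth, the ordering fix can be sketched roughly as below:
wait for the syslog socket to appear before starting corosync. This is
only a sketch, not our actual script. The function name wait_for_socket,
the variables LOG_SOCKET and WAIT_SECS, the /dev/log path and the init
script path are all assumptions for illustration. Note that it only
checks that the socket file exists, not that syslog-ng is actually
listening on it.

```shell
#!/bin/sh
# Sketch: block until a syslog socket exists before starting corosync.
# All names and paths here are illustrative assumptions.

# wait_for_socket PATH TIMEOUT
# Returns 0 as soon as PATH exists and is a unix domain socket,
# or non-zero if it has not appeared after TIMEOUT seconds.
wait_for_socket() {
    sock=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # One last check after the final sleep (or when TIMEOUT is 0).
    [ -S "$sock" ]
}

# Example use in a start script, commented out so the sketch has no
# side effects (socket path and init script path are assumptions):
#
# wait_for_socket "${LOG_SOCKET:-/dev/log}" "${WAIT_SECS:-30}" || {
#     echo "syslog socket not available, not starting corosync" >&2
#     exit 1
# }
# /etc/init.d/corosync start
```

A stale socket file left behind by a stopped syslog-ng would still pass
this check, so it covers the startup race but not the case where the
daemon was stopped later.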
Thanks
Shravan
On Mon, Dec 28, 2009 at 6:58 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Thu, Dec 24, 2009 at 02:35:01PM -0500, Shravan Mishra wrote:
>> Hi Guys,
>>
>> I had a perfectly running system for about 3 weeks, but now on reboot I
>> see problems.
>>
>> Looks like the processes are being spawned and respawned but a proper exec
>> is not happening.
>
> According to the logs, attrd can't start (exit code 100) for some
> reason (perhaps there are more logs elsewhere where it says
> what's wrong) and pengine segfaults. For the latter please
> enable coredumps (ulimit -c unlimited) and file a bugzilla.
>
>> Am I missing some permissions on directories?
>>
>>
>> I have a script which does the following for directories:
>
> Why do you need this script? It should be done by the package
> installation scripts.
>
>> =============
>> getent group haclient > /dev/null || groupadd -r haclient
>> getent passwd hacluster > /dev/null || useradd -r -g haclient \
>>     -d /var/lib/heartbeat/cores/hacluster -s /sbin/nologin \
>>     -c "cluster user" hacluster
>>
>> if [ ! -d "/var/lib/pengine" ];then
>> mkdir /var/lib/pengine
>> fi
>> chown -R hacluster:haclient /var/lib/pengine
>>
>> if [ ! -d "/var/lib/heartbeat" ];then
>> mkdir /var/lib/heartbeat
>> fi
>>
>> if [ ! -d "/var/lib/heartbeat/crm" ];then
>> mkdir /var/lib/heartbeat/crm
>> fi
>> chown -R hacluster:haclient /var/lib/heartbeat/crm/
>> chmod 750 /var/lib/heartbeat/crm/
>>
>> if [ ! -d "/var/lib/heartbeat/ccm" ];then
>> mkdir /var/lib/heartbeat/ccm
>> fi
>> chown -R hacluster:haclient /var/lib/heartbeat/ccm/
>> chmod 750 /var/lib/heartbeat/ccm/
>>
>> if [ ! -d "/var/run/heartbeat/" ];then
>> mkdir /var/run/heartbeat/
>> fi
>>
>> if [ ! -d "/var/run/heartbeat/ccm" ];then
>> mkdir /var/run/heartbeat/ccm/
>> fi
>> chown -R hacluster:haclient /var/run/heartbeat/ccm/
>> chmod 750 /var/run/heartbeat/ccm/
>
> You don't need ccm for corosync/openais clusters.
>
>> if [ ! -d "/var/run/heartbeat/crm" ];then
>> mkdir /var/run/heartbeat/crm/
>> fi
>> chown -R hacluster:haclient /var/run/heartbeat/crm/
>> chmod 750 /var/run/heartbeat/crm/
>>
>> if [ ! -d "/var/run/crm" ];then
>> mkdir /var/run/crm
>> fi
>>
>> if [ ! -d "/var/lib/corosync" ];then
>> mkdir /var/lib/corosync
>> fi
>> =============
>>
>>
>> I have a very simple active-passive configuration with just 2 nodes.
>>
>> After starting Corosync, this is what I see:
>>
>>
>> [root at node2 ~]# ps -ef | grep coro
>> root 8242 1 0 11:33 ? 00:00:00 /usr/sbin/corosync
>> root 8248 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync
>> root 8249 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync
>> root 8250 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync
>> root 8252 8242 0 11:33 ? 00:00:00 /usr/sbin/corosync
>> root 8393 8242 0 11:35 ? 00:00:00 /usr/sbin/corosync
>> [root at node2 ~]# ps -ef | grep heart
>> 82 7924 1 0 11:28 ? 00:00:00 /usr/lib64/heartbeat/pengine
>>
>> I'm attaching the log file.
>>
>> My config is:
>>
>>
>> # Please read the corosync.conf.5 manual page
>> compatibility: whitetank
>>
>> totem {
>> version: 2
>> token: 3000
>> token_retransmits_before_loss_const: 10
>> join: 60
>> consensus: 1500
>> vsftype: none
>> max_messages: 20
>> clear_node_high_bit: yes
>> secauth: on
>> threads: 0
>> rrp_mode: passive
>> interface {
>> ringnumber: 0
>> bindnetaddr: 192.168.1.0
>> # mcastaddr: 226.94.1.1
>> broadcast: yes
>> mcastport: 5405
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 172.20.20.0
>> # mcastaddr: 226.94.1.1
>> broadcast: yes
>> mcastport: 5405
>> }
>> }
>>
>> logging {
>> fileline: off
>> to_stderr: yes
>> to_logfile: yes
>> to_syslog: yes
>> logfile: /tmp/corosync.log
>
> Don't log to file. Can't recall exactly but there were some
> permission problems with that, probably because Pacemaker daemons
> don't run as root.
>
> Thanks,
>
> Dejan
>
>> debug: on
>> timestamp: on
>> logger_subsys {
>> subsys: AMF
>> debug: off
>> }
>> }
>>
>> service {
>> name: pacemaker
>> ver: 0
>> }
>>
>> aisexec {
>> user: root
>> group: root
>> }
>>
>> amf {
>> mode: disabled
>> }
>>
>>
>> Please help.
>>
>> Sincerely
>> Shravan
>
>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker