[Pacemaker] corosync not able to exec services properly

Fri Jan 1 15:35:04 EST 2010

It took us a good 6 hours to figure out the problem.
I'm sharing this in case anybody else runs into it.

The pacemaker daemons log through syslog-ng over unix domain sockets,
so syslog-ng has to be up and running at all times.
On our system we run multiple syslog-ng instances on different network
interfaces. It turned out that the syslog-ng instance on the loopback
interface, the one pacemaker logs to, had been stopped at one point.
On another occasion a race condition in our scripts let corosync
start before syslog-ng.

Anyway, the moral of the story: if corosync starts but the processes
it spawns keep respawning and never exec properly, check that
syslog-ng is up and was started before corosync.
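
One way to guard against the start-order race is to have the startup
script wait for the syslog socket before launching corosync. The
following is only a sketch, not what we actually run: the function name
wait_for_socket is made up for illustration, and /dev/log in the usage
note is an assumption about where the local syslog-ng instance listens.

```shell
#!/bin/sh
# Sketch: block until a syslog socket exists before starting corosync.
# wait_for_socket is a hypothetical helper, not part of corosync or
# pacemaker.

# wait_for_socket PATH TIMEOUT_SECONDS
# Returns 0 as soon as PATH exists and is a unix domain socket,
# non-zero if it still does not exist after TIMEOUT_SECONDS.
wait_for_socket() {
    sock=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        if [ -S "$sock" ]; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    # Final check once the timeout has been used up.
    [ -S "$sock" ]
}
```

In a startup script this could be used as
"wait_for_socket /dev/log 30 && /usr/sbin/corosync". Note that the
check only proves the socket file exists; a stale socket left behind
by a stopped syslog-ng would still pass, so this covers the start-order
race but not the case where syslog-ng is stopped later.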

Thanks
Shravan

On Mon, Dec 28, 2009 at 6:58 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> Hi,
>
> On Thu, Dec 24, 2009 at 02:35:01PM -0500, Shravan Mishra wrote:
>> Hi Guys,
>>
>> I had a perfectly running system for about 3 weeks now but now on reboot I
>> see problems.
>>
>> Looks like the processes are being spawned and respawned but a proper exec
>> is not happening.
>
> According to the logs, attrd can't start (exit code 100) for some
> reason (perhaps there are more logs elsewhere where it says
> what's wrong) and pengine segfaults. For the latter please
> enable coredumps (ulimit -c unlimited) and file a bugzilla.
>
>> Am I missing some permissions on directories?
>>
>>
>> I have a script which does the following for directories:
>
> Why do you need this script? It should be done by the package
> installation scripts.
>
>> =============
>> getent group haclient > /dev/null || groupadd -r haclient
>> getent passwd hacluster > /dev/null || useradd -r -g haclient \
>>   -d /var/lib/heartbeat/cores/hacluster -s /sbin/nologin \
>>   -c "cluster user" hacluster
>>
>> if [ ! -d "/var/lib/pengine" ];then
>>  mkdir /var/lib/pengine
>> fi
>> chown -R hacluster:haclient /var/lib/pengine
>>
>> if [ ! -d "/var/lib/heartbeat" ];then
>> mkdir /var/lib/heartbeat
>> fi
>>
>> if [ ! -d "/var/lib/heartbeat/crm" ];then
>>  mkdir /var/lib/heartbeat/crm
>> fi
>> chown -R hacluster:haclient /var/lib/heartbeat/crm/
>> chmod 750 /var/lib/heartbeat/crm/
>>
>> if [ ! -d "/var/lib/heartbeat/ccm" ];then
>>  mkdir /var/lib/heartbeat/ccm
>> fi
>> chown -R hacluster:haclient /var/lib/heartbeat/ccm/
>> chmod 750 /var/lib/heartbeat/ccm/
>>
>> if [ ! -d "/var/run/heartbeat/" ];then
>>  mkdir /var/run/heartbeat/
>>  fi
>>
>> if [ ! -d "/var/run/heartbeat/ccm" ];then
>>  mkdir /var/run/heartbeat/ccm/
>>  fi
>> chown -R hacluster:haclient /var/run/heartbeat/ccm/
>> chmod 750 /var/run/heartbeat/ccm/
>
> You don't need ccm for corosync/openais clusters.
>
>> if [ ! -d "/var/run/heartbeat/crm" ];then
>>  mkdir /var/run/heartbeat/crm/
>>  fi
>> chown -R hacluster:haclient /var/run/heartbeat/crm/
>> chmod 750 /var/run/heartbeat/crm/
>>
>> if [ ! -d "/var/run/crm" ];then
>>  mkdir /var/run/crm
>> fi
>>
>> if [ ! -d "/var/lib/corosync" ];then
>>  mkdir /var/lib/corosync
>> fi
>> =============
>>
>>
>> I have a very simple active-passive configuration with just 2 nodes.
>>
>> After starting Corosync, ps shows:
>>
>>
>> [root@node2 ~]# ps -ef | grep coro
>> root      8242     1  0 11:33 ?        00:00:00 /usr/sbin/corosync
>> root      8248  8242  0 11:33 ?        00:00:00 /usr/sbin/corosync
>> root      8249  8242  0 11:33 ?        00:00:00 /usr/sbin/corosync
>> root      8250  8242  0 11:33 ?        00:00:00 /usr/sbin/corosync
>> root      8252  8242  0 11:33 ?        00:00:00 /usr/sbin/corosync
>> root      8393  8242  0 11:35 ?        00:00:00 /usr/sbin/corosync
>> [root@node2 ~]# ps -ef | grep heart
>> 82        7924     1  0 11:28 ?        00:00:00 /usr/lib64/heartbeat/pengine
>>
>> I'm attaching the log file.
>>
>> My config is:
>>
>>
>> # Please read the corosync.conf.5 manual page
>> compatibility: whitetank
>>
>> totem {
>>  version: 2
>>   token: 3000
>>   token_retransmits_before_loss_const: 10
>>   join: 60
>>   consensus: 1500
>>   vsftype: none
>>   max_messages: 20
>>   clear_node_high_bit: yes
>>   secauth: on
>>   threads: 0
>>   rrp_mode: passive
>> interface {
>> ringnumber: 0
>> bindnetaddr: 192.168.1.0
>> # mcastaddr: 226.94.1.1
>> broadcast: yes
>> mcastport: 5405
>> }
>> interface {
>> ringnumber: 1
>> bindnetaddr: 172.20.20.0
>> # mcastaddr: 226.94.1.1
>> broadcast: yes
>> mcastport: 5405
>> }
>> }
>>
>> logging {
>> fileline: off
>> to_stderr: yes
>> to_logfile: yes
>> to_syslog: yes
>> logfile: /tmp/corosync.log
>
> Don't log to file. Can't recall exactly but there were some
> permission problems with that, probably because Pacemaker daemons
> don't run as root.
>
> Thanks,
>
> Dejan
>
>> debug: on
>> timestamp: on
>> logger_subsys {
>> subsys: AMF
>> debug: off
>> }
>> }
>>
>> service {
>> name: pacemaker
>> ver: 0
>> }
>>
>> aisexec {
>> user:root
>> group: root
>> }
>>
>> amf {
>> mode: disabled
>> }
>>
>>
>> Please help.
>>
>> Sincerely
>> Shravan
>
>
>> _______________________________________________
>> Pacemaker mailing list
>> Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker