[Pacemaker] Pengine behavior
Виталий Давудов
vitaliy.davudov at vts24.ru
Thu Jul 19 09:08:12 EDT 2012
Hi!
I have moved my cluster from heartbeat to corosync.
Here is the content of corosync.conf:
compatibility: whitetank
totem {
    version: 2
    token: 500
    downcheck: 500
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.1.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    debug: on
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
quorum {
    provider: corosync_votequorum
    expected_votes: 1
}
The Pacemaker configuration has not changed.
After the first node crashed, its corosync.log shows that monitoring
stopped at 15:53:24 (i.e. the node crashed at 15:53:24):
Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
Jul 19 15:53:22 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:fs:16: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: RA output: (fs:monitor:stdout) OK
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP2:12: monitor
Jul 19 15:53:23 freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP1:10: monitor
Jul 19 *15:53:24* freeswitch1 lrmd: [24569]: debug: rsc:FailoverIP3:14: monitor
Jul 19 15:55:00 corosync [MAIN ] Corosync Cluster Engine ('1.2.7'): started and ready to provide service.
Jul 19 15:55:00 corosync [MAIN ] Corosync built-in features: nss rdma
On the second node, corosync.log shows:
Jul 19 *15:53:27* corosync [TOTEM ] The token was lost in the OPERATIONAL state.
Jul 19 15:53:27 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 19 15:53:27 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes).
Jul 19 15:53:27 corosync [TOTEM ] entering GATHER state from 2.
Jul 19 15:53:28 corosync [TOTEM ] entering GATHER state from 0.
I.e. the second node detected the crash after 3 seconds.
Is there any way to reduce this time?
Thanks in advance for any hints.
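
For reference, here is a minimal sketch of how the 3-second figure can be
derived from the two log excerpts above. The function names are hypothetical
and not part of any Pacemaker or corosync tool. The year is an assumption,
because syslog-style timestamps do not carry one. The result is only as
precise as the 1-second log resolution and assumes the clocks on both nodes
are in sync.

```python
from datetime import datetime

# Timestamps copied from the log excerpts above.
LAST_MONITOR_NODE1 = "Jul 19 15:53:24"  # last lrmd monitor line on freeswitch1
TOKEN_LOST_NODE2 = "Jul 19 15:53:27"    # "The token was lost" on the second node


def parse_syslog_time(stamp: str, year: int = 2012) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def detection_delay_seconds(last_seen: str, detected: str) -> float:
    """Seconds between the last activity on one node and detection on the other."""
    delta = parse_syslog_time(detected) - parse_syslog_time(last_seen)
    return delta.total_seconds()


if __name__ == "__main__":
    print(detection_delay_seconds(LAST_MONITOR_NODE1, TOKEN_LOST_NODE2))
```

The same helper can be applied to any pair of lines from the two logs, for
example to compare the delay before and after changing the totem timeouts.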
12.07.2012 10:47, Виталий Давудов wrote:
> David, thanks for your answer!
>
> I'll try to migrate to corosync.
>
> 11.07.2012 22:40, David Vossel wrote:
>>
>> ----- Original Message -----
>>> From: "Виталий Давудов" <vitaliy.davudov at vts24.ru>
>>> To: pacemaker at oss.clusterlabs.org
>>> Sent: Wednesday, July 11, 2012 7:34:08 AM
>>> Subject: [Pacemaker] Pengine behavior
>>>
>>>
>>> Hi, list!
>>>
>>> I have configured a cluster for a VoIP application.
>>> Here is my configuration:
>>>
>>> # crm configure show
>>> node $id="552f91eb-e70a-40a5-ac43-cb16e063fdba" freeswitch1 \
>>>     attributes standby="off"
>> Ah... right here is your problem. You are using freeswitch instead of
>> Asterisk :P
>>
>>> node $id="c86ab64d-26c4-4595-aa32-bf9d18f714e7" freeswitch2 \
>>>     attributes standby="off"
>>> primitive FailoverIP1 ocf:heartbeat:IPaddr2 \
>>>     params iflabel="FoIP1" ip="91.211.219.142" cidr_netmask="30" nic="eth1.50" \
>>>     op monitor interval="1s"
>>> primitive FailoverIP2 ocf:heartbeat:IPaddr2 \
>>>     params iflabel="FoIP2" ip="172.30.0.1" cidr_netmask="16" nic="eth1.554" \
>>>     op monitor interval="1s"
>>> primitive FailoverIP3 ocf:heartbeat:IPaddr2 \
>>>     params iflabel="FoIP3" ip="10.18.1.1" cidr_netmask="24" nic="eth1.552" \
>>>     op monitor interval="1s"
>>> primitive fs lsb:FSSofia \
>>>     op monitor interval="1s" enabled="false" timeout="2s" on-fail="standby" \
>>>     meta target-role="Started"
>>> group HAServices FailoverIP1 FailoverIP2 FailoverIP3 \
>>>     meta target-role="Started"
>>> order FS-after-IP inf: HAServices fs
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.0.12-unknown" \
>>>     cluster-infrastructure="Heartbeat" \
>>>     stonith-enabled="false" \
>>>     expected-quorum-votes="1" \
>>>     no-quorum-policy="ignore" \
>>>     last-lrm-refresh="1299964019"
>>> rsc_defaults $id="rsc-options" \
>>>     resource-stickiness="100"
>>>
>>> When the 1st node crashed, the 2nd node became active. During this
>>> process I found these lines in the ha-debug file:
>>>
>>> ...
>>> Jul 06 17:16:42 freeswitch1 crmd: [3385]: info: start_subsystem: Starting sub-system "pengine"
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: Invoked: /usr/lib64/heartbeat/pengine
>>> Jul 06 17:16:42 freeswitch1 pengine: [3675]: info: main: Starting pengine
>>> Jul 06 17:16:46 freeswitch1 crmd: [3385]: info: do_dc_takeover: Taking over DC status for this partition
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_readwrite: We are now in R/W mode
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/11, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/12, version=0.391.20): ok (rc=0)
>>> Jul 06 17:16:46 freeswitch1 cib: [3381]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.391.20): ok (rc=0)
>>> ...
>>>
>>> After "Starting pengine", the next action occurred only 4 seconds later.
>>> What happens during this time? Is it possible to reduce it?
>> I seem to remember seeing something related to this in the code at
>> one point. I believe it is limited only to the use of heartbeat as
>> the messaging layer. After starting the pengine, the crmd sleeps
>> waiting for the pengine to start before contacting it. The sleep is
>> just a guess at how long it will take before the pengine will be up
>> and ready to accept a connection though. That's why it is so long...
>> so the gap will hopefully be large enough that no one will ever run
>> into any problems with it (I am not a big fan of this type of logic
>> at all). I'd recommend moving to corosync and seeing if this delay
>> goes away.
>>
>> -- Vossel
>>
>>> Thanks in advance.
>>> --
>>> Best regards,
>>> Vitaly
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>
--
Best regards,
Vitaly