[Pacemaker] node status does not change even if pacemakerd dies
Kazunori INOUE
inouekazu at intellilink.co.jp
Wed Dec 19 04:15:17 EST 2012
(12.12.13 08:26), Andrew Beekhof wrote:
> On Wed, Dec 12, 2012 at 8:02 PM, Kazunori INOUE
> <inouekazu at intellilink.co.jp> wrote:
>>
>> Hi,
>>
>> I recognize that pacemakerd is much less likely to crash.
>> However, the possibility of it being killed by the OOM killer etc. is not zero.
>
> True. Although we just established in another thread that we don't
> have any leaks :)
>
>> So I think a user will be confused, since the behavior when a process
>> dies differs even though pacemakerd is running.
>>
>> case A)
>> When pacemakerd and the other processes (crmd etc.) are in a parent-child
>> relationship.
>>
>
> [snip]
>
>>
>> For example, crmd died.
>> However, since it is relaunched, the state of the cluster is not affected.
>
> Right.
>
> [snip]
>
>>
>> case B)
>> When pacemakerd and the other processes are NOT in a parent-child relationship.
>> This assumes the state where pacemakerd was killed and then respawned
>> by Upstart.
>>
>> $ service corosync start ; service pacemaker start
>> $ pkill -9 pacemakerd
>> $ ps -ef|egrep 'corosync|pacemaker|UID'
>> UID PID PPID C STIME TTY TIME CMD
>> root 21091 1 1 14:52 ? 00:00:00 corosync
>> 496 21099 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/cib
>> root 21100 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>> root 21101 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>> 496 21102 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/attrd
>> 496 21103 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/pengine
>> 496 21104 1 0 14:52 ? 00:00:00 /usr/libexec/pacemaker/crmd
>> root 21128 1 1 14:53 ? 00:00:00 /usr/sbin/pacemakerd
>
> Yep, looks right.
>
Hi Andrew,

We discussed this behavior and concluded that there is room for
improvement in case B, where pacemakerd and the other processes are
not in a parent-child relationship.

Since not all users are experts, they may kill pacemakerd accidentally.
Such a user will be confused if the behavior after crmd dies changes
depending on the following conditions:

case A: pacemakerd and the others (crmd etc.) are in a parent-child relationship.
case B: pacemakerd and the others are not in a parent-child relationship.

So we want to *always* get the same behavior as in case A.
That is, when crmd etc. die, we want pacemaker to always relaunch
the process immediately.
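
To tell the two cases apart on a running node, something like the
following could be used. It is only a minimal sketch: check_parentage is
a hypothetical helper (not part of Pacemaker), and it assumes the daemons
live under a libexec/pacemaker directory, as in the ps listings in this
thread. It reads "PID PPID COMMAND" lines on stdin and reports any daemon
whose parent is not a running pacemakerd.

```shell
#!/bin/sh
# Hypothetical helper, not part of Pacemaker.
# stdin: one process per line as "PID PPID COMMAND [ARGS...]".
# Prints "case A" when every daemon is a child of a running pacemakerd,
# "case B" (and returns non-zero) when at least one daemon is orphaned.
check_parentage() {
    awk '
        # Remember the PID of every running pacemakerd.
        $3 ~ /\/pacemakerd$/ { master[$1] = 1; next }
        # Record the parent PID of each daemon under libexec/pacemaker.
        $3 ~ /\/libexec\/pacemaker\// { ppid[$3] = $2; daemons++ }
        END {
            if (daemons == 0) { print "no daemons found"; exit 2 }
            orphans = 0
            for (cmd in ppid) {
                if (!(ppid[cmd] in master)) {
                    printf "orphan: %s (ppid %s)\n", cmd, ppid[cmd]
                    orphans++
                }
            }
            if (orphans > 0) { print "case B"; exit 1 }
            print "case A"
        }
    '
}
```

It could be fed with "ps -e -o pid= -o ppid= -o args= | check_parentage".
The order of the "orphan:" lines is not defined, because awk does not
guarantee the iteration order of an array.
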
Regards,
Kazunori INOUE
>> In this case, the node will be set to UNCLEAN if crmd dies.
>> That is, the node will be fenced if there is a stonith resource.
>
> Which is exactly what happens if only pacemakerd is killed with your proposal.
> Except now you have time to do a graceful pacemaker restart to
> re-establish the parent-child relationship.
>
> If you want to compare B with something, it needs to be with the old
> "children terminate if pacemakerd dies" strategy.
> Which is:
>
>> $ service corosync start ; service pacemaker start
>> $ pkill -9 pacemakerd
>> ... the node will be set to UNCLEAN
>
> Old way: always downtime because children terminate which triggers fencing
> Our way: no downtime unless there is an additional failure (to the cib or crmd)
>
> Given that we're trying for HA, the second seems preferable.
>
>>
>> $ pkill -9 crmd
>> $ crm_mon -1
>> Last updated: Wed Dec 12 14:53:48 2012
>> Last change: Wed Dec 12 14:53:10 2012 via crmd on dev2
>>
>> Stack: corosync
>> Current DC: dev2 (2472913088) - partition with quorum
>> Version: 1.1.8-3035414
>>
>> 2 Nodes configured, unknown expected votes
>> 0 Resources configured.
>>
>> Node dev1 (2506467520): UNCLEAN (online)
>> Online: [ dev2 ]
>>
>>
>> How about making behavior selectable with an option?
>
> MORE_DOWNTIME_PLEASE=(true|false) ?
>
>>
>> When pacemakerd dies,
>> mode A) behave in the existing way (default).
>> mode B) make the node UNCLEAN.
>>
>> Best Regards,
>> Kazunori INOUE
>>
>>
>>
>>> Making stop work when there is no pacemakerd process is a different
>>> matter. We can make that work.
>>>
>>>>
>>>> Though the best solution is to relaunch pacemakerd, if it is difficult,
>>>> I think that a shortcut method is to make a node unclean.
>>>>
>>>>
>>>> And now, I tried Upstart a little bit.
>>>>
>>>> 1) started the corosync and pacemaker.
>>>>
>>>> $ cat /etc/init/pacemaker.conf
>>>> respawn
>>>> script
>>>> [ -f /etc/sysconfig/pacemaker ] && {
>>>> . /etc/sysconfig/pacemaker
>>>> }
>>>> exec /usr/sbin/pacemakerd
>>>> end script
>>>>
>>>> $ service co start
>>>> Starting Corosync Cluster Engine (corosync): [ OK ]
>>>> $ initctl start pacemaker
>>>> pacemaker start/running, process 4702
>>>>
>>>>
>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>> root 4695 1 0 17:21 ? 00:00:00 corosync
>>>> root 4702 1 0 17:21 ? 00:00:00 /usr/sbin/pacemakerd
>>>> 496 4703 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>> root 4704 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>> root 4705 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>> 496 4706 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>> 496 4707 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>> 496 4708 4702 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>>>
>>>> 2) killed pacemakerd.
>>>>
>>>> $ pkill -9 pacemakerd
>>>>
>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>> root 4695 1 0 17:21 ? 00:00:01 corosync
>>>> 496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>> root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>> root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>> 496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>> 496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>> 496 4708 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/crmd
>>>> root 4760 1 1 17:24 ? 00:00:00 /usr/sbin/pacemakerd
>>>>
>>>> 3) then I stopped pacemakerd. However, some processes did not stop.
>>>>
>>>> $ initctl stop pacemaker
>>>> pacemaker stop/waiting
>>>>
>>>>
>>>> $ ps -ef|egrep 'corosync|pacemaker'
>>>> root 4695 1 0 17:21 ? 00:00:01 corosync
>>>> 496 4703 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/cib
>>>> root 4704 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/stonithd
>>>> root 4705 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/lrmd
>>>> 496 4706 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/attrd
>>>> 496 4707 1 0 17:21 ? 00:00:00 /usr/libexec/pacemaker/pengine
>>>>
>>>> Best Regards,
>>>> Kazunori INOUE
>>>>
>>>>
>>>>>>> This isn't the case when the plugin is in use though, but then I'd
>>>>>>> also have expected most of the processes to die.
>>>>>>>
>>>>>> Since the node status would also change in that case,
>>>>>> that is the behavior we would like.
>>>>>>
>>>>>>>>
>>>>>>>> ----
>>>>>>>> $ cat /etc/redhat-release
>>>>>>>> Red Hat Enterprise Linux Server release 6.3 (Santiago)
>>>>>>>>
>>>>>>>> $ ./configure --sysconfdir=/etc --localstatedir=/var --without-cman --without-heartbeat
>>>>>>>> -snip-
>>>>>>>> pacemaker configuration:
>>>>>>>> Version = 1.1.8 (Build: 9c13d14)
>>>>>>>> Features = generated-manpages agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc lha-fencing corosync-native snmp
>>>>>>>>
>>>>>>>>
>>>>>>>> $ cat config.log
>>>>>>>> -snip-
>>>>>>>> 6000 | #define BUILD_VERSION "9c13d14"
>>>>>>>> 6001 | /* end confdefs.h. */
>>>>>>>> 6002 | #include <gio/gio.h>
>>>>>>>> 6003 |
>>>>>>>> 6004 | int
>>>>>>>> 6005 | main ()
>>>>>>>> 6006 | {
>>>>>>>> 6007 | if (sizeof (GDBusProxy))
>>>>>>>> 6008 | return 0;
>>>>>>>> 6009 | ;
>>>>>>>> 6010 | return 0;
>>>>>>>> 6011 | }
>>>>>>>> 6012 configure:32411: result: no
>>>>>>>> 6013 configure:32417: WARNING: Unable to support systemd/upstart. You need to use glib >= 2.26
>>>>>>>> -snip-
>>>>>>>> 6286 | #define BUILD_VERSION "9c13d14"
>>>>>>>> 6287 | #define SUPPORT_UPSTART 0
>>>>>>>> 6288 | #define SUPPORT_SYSTEMD 0
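
The configure warning above says that Upstart/systemd support needs
glib >= 2.26. The installed version could be checked before building
with something like the sketch below. version_at_least is a hypothetical
helper, and it assumes GNU sort (for the -V option) is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version HAVE is >= version WANT.
# Assumes GNU sort, whose -V option orders version strings.
# usage: version_at_least HAVE WANT
version_at_least() {
    # The smaller of the two versions sorts first; HAVE is new enough
    # exactly when that smallest version is WANT.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, "version_at_least "$(pkg-config --modversion glib-2.0)" 2.26"
would succeed only on a system whose glib is new enough. pkg-config can
also do the comparison itself with
"pkg-config --atleast-version=2.26 glib-2.0".
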
>>>>>>>>
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Kazunori INOUE
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> related bugzilla:
>>>>>>>>>> http://bugs.clusterlabs.org/show_bug.cgi?id=5064
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Best Regards,
>>>>>>>>>> Kazunori INOUE