[Pacemaker] why pacemaker does not control the resources

Andrew Beekhof andrew at beekhof.net
Thu Nov 14 23:12:39 UTC 2013


On 14 Nov 2013, at 5:06 pm, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 14.11.2013, 02:22, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 14 Nov 2013, at 6:13 am, Andrey Groshev <greenx at yandex.ru> wrote:
>> 
>>>  13.11.2013, 03:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>  On 12 Nov 2013, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>   11.11.2013, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>   On 8 Nov 2013, at 7:49 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>    Hi, all!
>>>>>>>    I need help. I do not understand why this has stopped working.
>>>>>>>    The same configuration works on another cluster, but that one runs corosync 1.
>>>>>>> 
>>>>>>>    So: a PostgreSQL cluster with master/slave.
>>>>>>>    Classic config, as in the wiki.
>>>>>>>    I build the cluster, start it, and it works.
>>>>>>>    Next I kill postgres on the master with signal 6, as if disk space had run out:
>>>>>>> 
>>>>>>>    # pkill -6 postgres
>>>>>>>    # ps axuww|grep postgres
>>>>>>>    root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep postgres
>>>>>>> 
>>>>>>>    PostgreSQL dies, but crm_mon shows that the master is still running.
>>>>>>> 
>>>>>>>    Last updated: Fri Nov  8 00:42:08 2013
>>>>>>>    Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>>>>>>>    Stack: corosync
>>>>>>>    Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>>>>    Version: 1.1.10-1.el6-368c726
>>>>>>>    3 Nodes configured
>>>>>>>    7 Resources configured
>>>>>>> 
>>>>>>>    Node dev-cluster2-node2 (172793105): online
>>>>>>>           pingCheck       (ocf::pacemaker:ping):  Started
>>>>>>>           pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>>    Node dev-cluster2-node3 (172793106): online
>>>>>>>           pingCheck       (ocf::pacemaker:ping):  Started
>>>>>>>           pgsql   (ocf::heartbeat:pgsql): Started
>>>>>>>    Node dev-cluster2-node4 (172793107): online
>>>>>>>           pgsql   (ocf::heartbeat:pgsql): Master
>>>>>>>           pingCheck       (ocf::pacemaker:ping):  Started
>>>>>>>           VirtualIP       (ocf::heartbeat:IPaddr2):       Started
>>>>>>> 
>>>>>>>    Node Attributes:
>>>>>>>    * Node dev-cluster2-node2:
>>>>>>>       + default_ping_set                  : 100
>>>>>>>       + master-pgsql                      : -INFINITY
>>>>>>>       + pgsql-data-status                 : STREAMING|ASYNC
>>>>>>>       + pgsql-status                      : HS:async
>>>>>>>    * Node dev-cluster2-node3:
>>>>>>>       + default_ping_set                  : 100
>>>>>>>       + master-pgsql                      : -INFINITY
>>>>>>>       + pgsql-data-status                 : STREAMING|ASYNC
>>>>>>>       + pgsql-status                      : HS:async
>>>>>>>    * Node dev-cluster2-node4:
>>>>>>>       + default_ping_set                  : 100
>>>>>>>       + master-pgsql                      : 1000
>>>>>>>       + pgsql-data-status                 : LATEST
>>>>>>>       + pgsql-master-baseline             : 0000000002000078
>>>>>>>       + pgsql-status                      : PRI
>>>>>>> 
>>>>>>>    Migration summary:
>>>>>>>    * Node dev-cluster2-node4:
>>>>>>>    * Node dev-cluster2-node2:
>>>>>>>    * Node dev-cluster2-node3:
>>>>>>> 
>>>>>>>    Tickets:
>>>>>>> 
>>>>>>>    CONFIG:
>>>>>>>    node $id="172793105" dev-cluster2-node2. \
>>>>>>>           attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>>    node $id="172793106" dev-cluster2-node3. \
>>>>>>>           attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>>    node $id="172793107" dev-cluster2-node4. \
>>>>>>>           attributes pgsql-data-status="LATEST"
>>>>>>>    primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>>>>>           params ip="10.76.157.194" \
>>>>>>>           op start interval="0" timeout="60s" on-fail="stop" \
>>>>>>>           op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>           op stop interval="0" timeout="60s" on-fail="block"
>>>>>>>    primitive pgsql ocf:heartbeat:pgsql \
>>>>>>>           params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" \
>>>>>>>                  pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" \
>>>>>>>                  logfile="/var/lib/pgsql/9.1/pgstartup.log" rep_mode="async" \
>>>>>>>                  node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " \
>>>>>>>                  restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" \
>>>>>>>                  primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>>>>>>>                  master_ip="10.76.157.194" \
>>>>>>>           op start interval="0" timeout="60s" on-fail="restart" \
>>>>>>>           op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>>>>>           op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>>>>>>           op promote interval="0" timeout="63s" on-fail="restart" \
>>>>>>>           op demote interval="0" timeout="64s" on-fail="stop" \
>>>>>>>           op stop interval="0" timeout="65s" on-fail="block" \
>>>>>>>           op notify interval="0" timeout="66s"
>>>>>>>    primitive pingCheck ocf:pacemaker:ping \
>>>>>>>           params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>>>>>>           op start interval="0" timeout="60s" on-fail="restart" \
>>>>>>>           op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>           op stop interval="0" timeout="60s" on-fail="ignore"
>>>>>>>    ms msPostgresql pgsql \
>>>>>>>           meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>>>>>>>    clone clnPingCheck pingCheck \
>>>>>>>           meta clone-max="3"
>>>>>>>    location l0_DontRunPgIfNotPingGW msPostgresql \
>>>>>>>           rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>>>>>>>    colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>>>>>    colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>>>>>    order rsc_order-1 0: clnPingCheck msPostgresql
>>>>>>>    order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>>>>>>    order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>>>>>>    property $id="cib-bootstrap-options" \
>>>>>>>           dc-version="1.1.10-1.el6-368c726" \
>>>>>>>           cluster-infrastructure="corosync" \
>>>>>>>           stonith-enabled="false" \
>>>>>>>           no-quorum-policy="stop"
>>>>>>>    rsc_defaults $id="rsc-options" \
>>>>>>>           resource-stickiness="INFINITY" \
>>>>>>>           migration-threshold="1"
>>>>>>> 
>>>>>>>    Tell me where to look: why is pacemaker not reacting?
>>>>>>   You might want to follow some of the steps at:
>>>>>> 
>>>>>>      http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>>>>>> 
>>>>>>   under the heading "Resource-level failures".
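One of the steps under "Resource-level failures" is to run the resource agent by hand and check what its monitor action really returns. A minimal sketch of doing that, roughly the way Pacemaker's lrmd does; the agent path and the OCF_RESKEY_* values below are illustrative and must be adapted to the actual installation:

```shell
#!/bin/sh
# Run one action of an OCF resource agent and report its exit code.
ocf_action() {
    agent="$1"; action="$2"
    if [ -x "$agent" ]; then
        "$agent" "$action" >/dev/null 2>&1
        echo "$action rc=$?"
    else
        echo "agent not found: $agent"
    fi
}

# Environment the agent expects (values illustrative, taken from the
# config in this thread; adjust to your setup).
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_pgctl=/usr/pgsql-9.1/bin/pg_ctl
export OCF_RESKEY_pgdata=/var/lib/pgsql/9.1/data

# With postgres killed, a correct monitor should NOT report rc=0 or rc=8.
ocf_action "$OCF_ROOT/resource.d/heartbeat/pgsql" monitor
```

Comparing this hand-run result with what crm_mon reports quickly shows whether the problem is in the agent or in the cluster's scheduling of the monitor.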
>>>>>   Yes. Thank you.
>>>>>   I had seen this article and am now studying it in more detail.
>>>>>   There is a lot of information in the logs, so it is difficult to tell the error itself from its consequences.
>>>>>   I am trying to figure it out now.
>>>>> 
>>>>>   BUT...
>>>>>   So far I can say with certainty that the monitor action of the MS (pgsql) resource agent is called ONLY on the node where Pacemaker was started last.
>>>>  It looks like you're hitting https://github.com/beekhof/pacemaker/commit/58962338
>>>>  Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use the 1.1.10 packages that come with 6.4?
>>>>  They include the above patch.
>>>  I already do (built from source two weeks ago):
>> 
>> Upstream 1.1.10 does not include the above patch.
> 
> Strangely, these lines do exist in my source code.
> Maybe I did not build from what I thought.
> My source tree is "master", but the RPM I built reports 1.1.10.
> 
>> 
>>>  * pacemaker 1.1.10
>>>  * resource-agents 3.9.5
>>>  * corosync 2.3.2
>>>  * libqb 0.16
>>>  * CentOS 6.4
>>> 
>>>  The same config works on pacemaker 1.1.9/corosync 1.4.5.
>>>  Not ideal, but without this problem.
>>> 
>>>  My first idea was to move target-role=Master from the MS resource to the pgsql primitive.
>>>  That even works,
>>>  but after a crash killing the main PostgreSQL process, the same thing happens again.
>>>  Today's experiments showed that this behavior starts once I add "notify=true" to the MS resource.
>>>  But the pgsql primitive does not work properly without "notify" messages.
>>>  For now I am frustrated :(
>>>>  Also, just to be sure. Are you expecting monitor operations to detect when you started a resource manually?
>>>>  If so, you'll need a monitor operation with role=Stopped. We don't do that by default.
>>>  I expect resources to be monitored all the time; otherwise, how can they be controlled?
>> 
>> When a node joins the cluster, we check to see if it had any resources running.
>> If no-one has the resource running, we pick a node and start it there.
>> 
>> If a malicious admin then starts the resource somewhere else, manually, we would not normally detect this.
>> It is assumed that someone trusted with root privileges would not do this on purpose.
>> 
>> However, if you do not trust your admins, as explained above, you can configure pacemaker to periodically re-check the node to detect and recover from this situation.
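The periodic re-check mentioned here is the role="Stopped" monitor from earlier in the thread: an extra monitor operation that runs on nodes where the resource is supposedly *not* running, so a manually started (or ghost) instance gets detected. A sketch of what would be added to the pgsql primitive in crm syntax; the interval and timeout values are illustrative:

```
op monitor interval="120s" role="Stopped" timeout="60s" on-fail="restart"
```

The interval must differ from the other monitor intervals on the same resource, since Pacemaker distinguishes recurring operations by their interval.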
> 
> These are individual servers allocated for development.

Then you likely have a random version from Git.
That's probably not a great idea.

> They have only one malicious admin: me.
> But I still trust myself. :)
> 
> And even if I do run something "by hand", my cluster deployment utility each time:
> * removes all the cluster-related packages that were installed before (libqb, corosync 1/2, cman, pacemaker, resource-agents, pcs, crmsh, cluster-glue);
> * deletes all files that might remain from a previous installation (both configs and temp files);
> * installs my RPMs;
> * stops and disables all services that may interfere with the operation of the cluster;
> * clears the old configuration and calculates new corosync and pacemaker configs;
> * writes them out and starts the cluster node by node, synchronizing each one.
> 
> In this situation it is hard for anything forgotten to survive.
> I would reinstall the OS too if I had access one level up. :)
> 
>> 
>>>  Or I do not quite understand the question.
>>>>>>   'crm_mon -o' might be a good source of information too.
>>>>>   There I see my resources allegedly functioning normally:
>>>>> 
>>>>>   # crm_mon -o1
>>>>>   Last updated: Tue Nov 12 09:27:16 2013
>>>>>   Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
>>>>>   Stack: corosync
>>>>>   Current DC: dev-cluster2-node2 (172793105) - partition with quorum
>>>>>   Version: 1.1.10-1.el6-368c726
>>>>>   3 Nodes configured
>>>>>   337 Resources configured
>>>>> 
>>>>>   Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>>>> 
>>>>>   Clone Set: clonePing [pingCheck]
>>>>>       Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>>>>   Master/Slave Set: msPgsql [pgsql]
>>>>>       Masters: [ dev-cluster2-node2 ]
>>>>>       Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
>>>>>   VirtualIP      (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2
>>>>> 
>>>>>   Operations:
>>>>>   * Node dev-cluster2-node2:
>>>>>     pingCheck: migration-threshold=1
>>>>>      + (20) start: rc=0 (ok)
>>>>>      + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>     pgsql: migration-threshold=1
>>>>>      + (41) promote: rc=0 (ok)
>>>>>      + (87) monitor: interval=1000ms rc=8 (master)
>>>>>     VirtualIP: migration-threshold=1
>>>>>      + (49) start: rc=0 (ok)
>>>>>      + (52) monitor: interval=10000ms rc=0 (ok)
>>>>>   * Node dev-cluster2-node3:
>>>>>     pingCheck: migration-threshold=1
>>>>>      + (20) start: rc=0 (ok)
>>>>>      + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>     pgsql: migration-threshold=1
>>>>>      + (26) start: rc=0 (ok)
>>>>>      + (32) monitor: interval=10000ms rc=0 (ok)
>>>>>   * Node dev-cluster2-node4:
>>>>>     pingCheck: migration-threshold=1
>>>>>      + (20) start: rc=0 (ok)
>>>>>      + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>     pgsql: migration-threshold=1
>>>>>      + (26) start: rc=0 (ok)
>>>>>      + (32) monitor: interval=10000ms rc=0 (ok)
>>>>> 
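The rc values in the operations list above are standard OCF return codes (rc=8 on node2 is why crm_mon shows a healthy master). A small helper to decode them, using the codes defined by the OCF resource agent API as Pacemaker interprets them:

```shell
#!/bin/sh
# Map an OCF return code, as shown in 'crm_mon -o' output, to its name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

ocf_rc_name 8    # prints OCF_RUNNING_MASTER
```

If the agent's monitor keeps returning 0 or 8 after postgres is killed, the agent is at fault; if the monitor simply never runs, the cluster is.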
>>>>>   In reality, the PG master and the penultimate PG slave have now been killed (signal 4|6).
>>>>>   IMHO, even if I have something configured incorrectly, the inability to monitor a resource should cause a fatal error.
>>>>>   Or is there a reason not to do so?
>>>>> 
>>>>>   _______________________________________________
>>>>>   Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>>   Project Home: http://www.clusterlabs.org
>>>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>   Bugs: http://bugs.clusterlabs.org
>> 
> 
