[ClusterLabs] [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Mon Mar 16 00:48:14 UTC 2015
Hi David,

Thank you for your comments.
> Pacemaker is attempting to restore connection to the remote node here, are you
> sure the remote is accessible? The "Timed Out" error means that pacemaker was
> unable to establish the connection during the timeout period.
I extended the timeout to 60 seconds, but the result was the same.
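One way the longer timeout can be expressed in crm syntax is on the start operation of the connection resource; a minimal sketch (the "op start" line is the assumed change, the other lines match the snmp2 primitive shown later in this mail):

primitive snmp2 ocf:pacemaker:remote \
params \
server="snmp2" \
op start interval="0s" timeout="60s" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="stop"

With the 60-second timeout in place, the start still fails as below.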
Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete-sl7-01 on sl7-01 (local) - no waiting
Mar 16 09:36:13 sl7-01 crmd[5480]: info: te_rsc_command: Action 7 confirmed - no wait
Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 20: start snmp2_start_0 on sl7-01 (local)
Mar 16 09:36:13 sl7-01 crmd[5480]: info: do_lrm_rsc_op: Performing key=20:9:0:421ed4a6-85e6-4be8-8c22-d5ff051fc00a op=snmp2_start_0
Mar 16 09:36:13 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
Mar 16 09:36:13 sl7-01 cib[5475]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=sl7-01/crmd/77, version=0.6.1)
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_write_with_digest: Wrote version 0.6.0 of the CIB to disk (digest: afb3e14165d2f8318bac8f1027a49338)
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Reading cluster configuration file /var/lib/pacemaker/cib/cib.D4k0LO
Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Verifying cluster configuration signature from /var/lib/pacemaker/cib/cib.zZMn4s
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 63.
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tls_connection_destroy: TLS connection destroyed
Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
Mar 16 09:36:14 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
Mar 16 09:36:15 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 64.
(snip)
Mar 16 09:37:11 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
Mar 16 09:37:11 sl7-01 crmd[5480]: info: action_synced_wait: Managed remote_meta-data_0 process 5675 exited with rc=0
Mar 16 09:37:11 sl7-01 crmd[5480]: error: process_lrm_event: Operation snmp2_start_0: Timed Out (node=sl7-01, call=8, timeout=60000ms)
(snip)
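Since the TLS connection is established and then dropped right away, one additional thing that can be checked is whether the pacemaker_remote authkey matches between the cluster node and the remote node. A minimal check, assuming the default key location /etc/pacemaker/authkey on both hosts:

[root at sl7-01 ~]# md5sum /etc/pacemaker/authkey
[root at snmp2 ~]# md5sum /etc/pacemaker/authkey

(the two digests should be identical)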
On the node where pacemaker_remoted was restarted, pacemaker_remoted appears to be listening correctly.
[root at snmp2 ~]# netstat -an | more
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp 0 0 192.168.40.110:22 192.168.40.1:40510 ESTABLISHED
tcp 0 0 :::3121 :::* LISTEN
tcp 0 0 :::22 :::* LISTEN
udp 0 0 0.0.0.0:631 0.0.0.0:*
(snip)
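The port is also reachable from the cluster node (the log above shows the TLS connection being established). For example, a simple TCP probe, assuming nc is installed on sl7-01:

[root at sl7-01 ~]# nc -zv 192.168.40.110 3121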
> Maybe this is expected. It's impossible to tell without looking at the
> configuration in more detail.
We use the following crm configuration file.
Is there any other information you need?
-----------------------------
property no-quorum-policy="ignore" \
stonith-enabled="false" \
startup-fencing="false"
rsc_defaults resource-stickiness="INFINITY" \
migration-threshold="1"
primitive snmp1 ocf:pacemaker:remote \
params \
server="snmp1" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive snmp2 ocf:pacemaker:remote \
params \
server="snmp2" \
op monitor interval="3s" timeout="15s" \
op stop interval="0s" timeout="60s" on-fail="stop"
primitive Host-rsc1 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive Remote-rsc1 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
primitive Remote-rsc2 ocf:heartbeat:Dummy \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="10s" timeout="60s" on-fail="restart" \
op stop interval="0s" timeout="60s" on-fail="ignore"
location loc1 Remote-rsc1 \
rule 200: #uname eq snmp1 \
rule 100: #uname eq snmp2
location loc2 Remote-rsc2 \
rule 200: #uname eq snmp2 \
rule 100: #uname eq snmp1
location loc3 Host-rsc1 \
rule 200: #uname eq sl7-01
-----------------------------
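Name resolution for the server= values works from the cluster node (the log above shows the connection attempt going to 192.168.40.110 for snmp2). If useful, it can be double-checked with a standard lookup, for example:

[root at sl7-01 ~]# getent hosts snmp2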
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: David Vossel <dvossel at redhat.com>
> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/3/14, Sat 08:14
> Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
>
>
>
> ----- Original Message -----
>> Hi All,
>>
>> We are verifying the behavior of pacemaker_remote. (stonith is disabled.)
>>
>> There are two questions.
>>
>> * Question 1: pacemaker_remote does not recover from a failure. Is this the
>> expected behavior?
>>
>> - Step1 - Start a cluster.
>> -----------------------
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:25:05 2015
>> Last change: Thu Mar 12 14:24:31 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>
>> -----------------------
>>
>> - Step2 - Cause pacemaker_remoted to fail.
>>
>> -----------------------
>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>> [1] 24202
>> [root at snmp2 ~]# kill -TERM 24202
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:25:55 2015
>> Last change: Thu Mar 12 14:24:31 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:25:40 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> - Step3 - Restart pacemaker_remoted and clean up the remote resource, but the
>> node remains offline.
>>
>> -----------------------
>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>> [2] 24248
>>
>> [root at sl7-01 ~]# crm_resource -C -r snmp2
>> Cleaning up snmp2 on sl7-01
>> Cleaning up snmp2 on snmp1
>> Waiting for 1 replies from the CRMd. OK
>>
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:26:46 2015
>> Last change: Thu Mar 12 14:26:26 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12
>> 14:26:44 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>> queued=0ms, exec=0ms
>> snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>
> Pacemaker is attempting to restore connection to the remote node here, are you
> sure the remote is accessible? The "Timed Out" error means that pacemaker was
> unable to establish the connection during the timeout period.
>
>>
>> * Question 2: When pacemaker_remoted fails in a configuration where stonith is
>> disabled, is the following procedure the correct way to move the resource?
>> Is there a procedure to move it without deleting the node?
>>
>> - Step1 - Start a cluster.
>> -----------------------
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:30:27 2015
>> Last change: Thu Mar 12 14:29:14 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>
>> -----------------------
>>
>> - Step2 - Cause pacemaker_remoted to fail.
>> -----------------------
>> [root at snmp2 ~]# kill -TERM 24248
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:31:59 2015
>> Last change: Thu Mar 12 14:29:14 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:31:42 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> - Step3 - We delete the failed node. Then the resource moves.
>
> Maybe this is expected. It's impossible to tell without looking at the
> configuration in more detail.
>
> -- David
>
>>
>> -----------------------
>> [root at sl7-01 ~]# crm
>> crm(live)# node
>> crm(live)node# delete snmp2
>> INFO: node snmp2 deleted
>>
>> [root at sl7-01 ~]# crm_mon -1 -Af
>> Last updated: Thu Mar 12 14:35:00 2015
>> Last change: Thu Mar 12 14:34:20 2015
>> Stack: corosync
>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>> Version: 1.1.12-ce09802
>> 3 Nodes configured
>> 5 Resources configured
>>
>>
>> Online: [ sl7-01 ]
>> RemoteOnline: [ snmp1 ]
>> RemoteOFFLINE: [ snmp2 ]
>>
>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1
>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>
>> Migration summary:
>> * Node sl7-01:
>> snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12
>> 14:51:44 2015'
>> * Node snmp1:
>>
>> Failed actions:
>> snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error,
>> exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015',
>> queued=0ms, exec=0ms
>> -----------------------
>>
>> Best Regards,
>> Hideo Yamauchi.
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>