[ClusterLabs] [Question] About movement of pacemaker_remote.
renayama19661014 at ybb.ne.jp
Thu Mar 26 00:08:40 UTC 2015
Hi David,
Is there a mistake in my configuration?
Can this behavior of pacemaker_remote be improved through configuration?
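For reference, one parameter that might be relevant here is reconnect_interval on the ocf:pacemaker:remote agent, which tells the cluster to keep retrying the remote connection after a failure instead of marking the node lost permanently. This is a sketch only; whether the parameter is available in your agent version is an assumption on my part, so please check the agent metadata first:

```
# Hypothetical variant of the snmp2 primitive; reconnect_interval is an
# assumed-available parameter of ocf:pacemaker:remote in newer releases.
primitive snmp2 ocf:pacemaker:remote \
        params server="snmp2" reconnect_interval="30s" \
        op monitor interval="3s" timeout="15s" \
        op stop interval="0s" timeout="60s" on-fail="stop"
```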
Best Regards,
Hideo Yamauchi.
----- Original Message -----
> From: "renayama19661014 at ybb.ne.jp" <renayama19661014 at ybb.ne.jp>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Cc:
> Date: 2015/3/16, Mon 09:48
> Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
>
> Hi David,
>
> Thank you for comments.
>
>
>> Pacemaker is attempting to restore connection to the remote node here, are you
>> sure the remote is accessible? The "Timed Out" error means that pacemaker was
>> unable to establish the connection during the timeout period.
>
>
> I extended the timeout to 60 seconds, but the result was the same.
>
> Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete-sl7-01 on sl7-01 (local) - no waiting
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: te_rsc_command: Action 7 confirmed - no wait
> Mar 16 09:36:13 sl7-01 crmd[5480]: notice: te_rsc_command: Initiating action 20: start snmp2_start_0 on sl7-01 (local)
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: do_lrm_rsc_op: Performing key=20:9:0:421ed4a6-85e6-4be8-8c22-d5ff051fc00a op=snmp2_start_0
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> Mar 16 09:36:13 sl7-01 cib[5475]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=sl7-01/crmd/77, version=0.6.1)
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_write_with_digest: Wrote version 0.6.0 of the CIB to disk (digest: afb3e14165d2f8318bac8f1027a49338)
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Reading cluster configuration file /var/lib/pacemaker/cib/cib.D4k0LO
> Mar 16 09:36:13 sl7-01 cib[5648]: info: cib_file_read_and_verify: Verifying cluster configuration signature from /var/lib/pacemaker/cib/cib.zZMn4s
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> Mar 16 09:36:13 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 63.
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_tls_connection_destroy: TLS connection destroyed
> Mar 16 09:36:13 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> Mar 16 09:36:14 sl7-01 crmd[5480]: info: crm_remote_tcp_connect_async: Attempting to connect to remote server at 192.168.40.110:3121
> Mar 16 09:36:15 sl7-01 crmd[5480]: info: lrmd_tcp_connect_cb: Remote lrmd client TLS connection established with server snmp2:3121
> Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_recv_reply: Unable to receive expected reply, disconnecting.
> Mar 16 09:36:15 sl7-01 crmd[5480]: error: lrmd_tls_send_recv: Remote lrmd server disconnected while waiting for reply with id 64.
> (snip)
> Mar 16 09:37:11 sl7-01 crmd[5480]: info: lrmd_api_disconnect: Disconnecting from lrmd service
> Mar 16 09:37:11 sl7-01 crmd[5480]: info: action_synced_wait: Managed remote_meta-data_0 process 5675 exited with rc=0
> Mar 16 09:37:11 sl7-01 crmd[5480]: error: process_lrm_event: Operation snmp2_start_0: Timed Out (node=sl7-01, call=8, timeout=60000ms)
> (snip)
>
> On the node where pacemaker_remoted was restarted, pacemaker_remoted does
> appear to be listening correctly.
>
> [root at snmp2 ~]# netstat -an | more
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address           Foreign Address         State
> tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
> tcp        0      0 192.168.40.110:22       192.168.40.1:40510      ESTABLISHED
> tcp        0      0 :::3121                 :::*                    LISTEN
> tcp        0      0 :::22                   :::*                    LISTEN
> udp        0      0 0.0.0.0:631             0.0.0.0:*
>
> (snip)
>
>> Maybe this is expected. It's impossible to tell without looking at the
>> configuration in more detail.
>
>
> We use the following crm file.
> Is there any other information you need?
>
> -----------------------------
> property no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         startup-fencing="false"
>
> rsc_defaults resource-stickiness="INFINITY" \
>         migration-threshold="1"
>
> primitive snmp1 ocf:pacemaker:remote \
>         params server="snmp1" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive snmp2 ocf:pacemaker:remote \
>         params server="snmp2" \
>         op monitor interval="3s" timeout="15s" \
>         op stop interval="0s" timeout="60s" on-fail="stop"
>
> primitive Host-rsc1 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive Remote-rsc1 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
>
> primitive Remote-rsc2 ocf:heartbeat:Dummy \
>         op start interval="0s" timeout="60s" on-fail="restart" \
>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>         op stop interval="0s" timeout="60s" on-fail="ignore"
>
> location loc1 Remote-rsc1 \
>         rule 200: #uname eq snmp1 \
>         rule 100: #uname eq snmp2
> location loc2 Remote-rsc2 \
>         rule 200: #uname eq snmp2 \
>         rule 100: #uname eq snmp1
> location loc3 Host-rsc1 \
>         rule 200: #uname eq sl7-01
> -----------------------------
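Editor's note: with migration-threshold="1" and no failure-timeout, a failed start leaves the fail-count in place until it is cleared by hand. If manual cleanup after a pacemaker_remoted restart is undesirable, a failure-timeout in rsc_defaults would let the fail-count expire automatically. This is a sketch only; whether automatic expiry fits the intended recovery policy here is an assumption on my part:

```
# Sketch: let recorded failures expire after 60s so the cluster
# can retry the connection resource without a manual cleanup.
rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="1" \
        failure-timeout="60s"
```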
>
>
> Best Regards,
> Hideo Yamauchi.
>
>
>
>
> ----- Original Message -----
>> From: David Vossel <dvossel at redhat.com>
>> To: renayama19661014 at ybb.ne.jp; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
>> Cc:
>> Date: 2015/3/14, Sat 08:14
>> Subject: Re: [ClusterLabs] [Question] About movement of pacemaker_remote.
>>
>>
>>
>> ----- Original Message -----
>>> Hi All,
>>>
>>> We are testing the behavior of pacemaker_remote. (stonith is disabled.)
>>>
>>> There are two questions.
>>>
>>> * Question 1: pacemaker_remote does not recover from the failure. Is this
>>> the correct behavior?
>>>
>>> - Step1 - Start a cluster.
>>> -----------------------
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:25:05 2015
>>> Last change: Thu Mar 12 14:24:31 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>>
>>> -----------------------
>>>
>>> - Step2 - Cause pacemaker_remoted to fail.
>>>
>>> -----------------------
>>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>>> [1] 24202
>>> [root at snmp2 ~]# kill -TERM 24202
>>>
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:25:55 2015
>>> Last change: Thu Mar 12 14:24:31 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 ]
>>> RemoteOFFLINE: [ snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>>
>>> Migration summary:
>>> * Node sl7-01:
>>>     snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:25:40 2015'
>>> * Node snmp1:
>>>
>>> Failed actions:
>>>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:25:40 2015', queued=0ms, exec=0ms
>>> -----------------------
>>>
>>> - Step3 - Restart pacemaker_remoted and clean up the resource, but the node
>>> stays offline.
>>>
>>> -----------------------
>>> [root at snmp2 ~]# /usr/sbin/pacemaker_remoted &
>>> [2] 24248
>>>
>>> [root at sl7-01 ~]# crm_resource -C -r snmp2
>>> Cleaning up snmp2 on sl7-01
>>> Cleaning up snmp2 on snmp1
>>> Waiting for 1 replies from the CRMd. OK
>>>
>>>
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:26:46 2015
>>> Last change: Thu Mar 12 14:26:26 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 ]
>>> RemoteOFFLINE: [ snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>>
>>> Migration summary:
>>> * Node sl7-01:
>>>     snmp2: migration-threshold=1 fail-count=1000000 last-failure='Thu Mar 12 14:26:44 2015'
>>> * Node snmp1:
>>>
>>> Failed actions:
>>>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>>>     snmp2_start_0 on sl7-01 'unknown error' (1): call=8, status=Timed Out, exit-reason='none', last-rc-change='Thu Mar 12 14:26:26 2015', queued=0ms, exec=0ms
>>> -----------------------
>>
>> Pacemaker is attempting to restore connection to the remote node here, are you
>> sure the remote is accessible? The "Timed Out" error means that pacemaker was
>> unable to establish the connection during the timeout period.
>>
>>>
>>> * Question 2: When pacemaker_remote fails in a configuration where stonith
>>> is disabled, is the following procedure the correct way to move a resource?
>>> Is there a way to move the resource without deleting the node?
>>>
>>> - Step1 - Start a cluster.
>>> -----------------------
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:30:27 2015
>>> Last change: Thu Mar 12 14:29:14 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp2
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>> snmp2 (ocf::pacemaker:remote): Started sl7-01
>>>
>>> -----------------------
>>>
>>> - Step2 - Cause pacemaker_remoted to fail.
>>> -----------------------
>>> [root at snmp2 ~]# kill -TERM 24248
>>>
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:31:59 2015
>>> Last change: Thu Mar 12 14:29:14 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 ]
>>> RemoteOFFLINE: [ snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>> snmp2 (ocf::pacemaker:remote): FAILED sl7-01
>>>
>>> Migration summary:
>>> * Node sl7-01:
>>>     snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:31:42 2015'
>>> * Node snmp1:
>>>
>>> Failed actions:
>>>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:31:42 2015', queued=0ms, exec=0ms
>>> -----------------------
>>>
>>> - Step3 - We delete the failed node. The resource then moves.
>>
>> Maybe this is expected. It's impossible to tell without looking at the
>> configuration in more detail.
>>
>> -- David
>>
>>>
>>> -----------------------
>>> [root at sl7-01 ~]# crm
>>> crm(live)# node
>>> crm(live)node# delete snmp2
>>> INFO: node snmp2 deleted
>>>
>>> [root at sl7-01 ~]# crm_mon -1 -Af
>>> Last updated: Thu Mar 12 14:35:00 2015
>>> Last change: Thu Mar 12 14:34:20 2015
>>> Stack: corosync
>>> Current DC: sl7-01 (2130706433) - partition WITHOUT quorum
>>> Version: 1.1.12-ce09802
>>> 3 Nodes configured
>>> 5 Resources configured
>>>
>>>
>>> Online: [ sl7-01 ]
>>> RemoteOnline: [ snmp1 ]
>>> RemoteOFFLINE: [ snmp2 ]
>>>
>>> Host-rsc1 (ocf::heartbeat:Dummy): Started sl7-01
>>> Remote-rsc1 (ocf::heartbeat:Dummy): Started snmp1
>>> Remote-rsc2 (ocf::heartbeat:Dummy): Started snmp1
>>> snmp1 (ocf::pacemaker:remote): Started sl7-01
>>>
>>> Migration summary:
>>> * Node sl7-01:
>>>     snmp2: migration-threshold=1 fail-count=1 last-failure='Thu Mar 12 14:51:44 2015'
>>> * Node snmp1:
>>>
>>> Failed actions:
>>>     snmp2_monitor_3000 on sl7-01 'unknown error' (1): call=6, status=Error, exit-reason='none', last-rc-change='Thu Mar 12 14:51:44 2015', queued=0ms, exec=0ms
>>> -----------------------
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>