[Pacemaker] Corosync over DHCP IP

Andrew Beekhof andrew at beekhof.net
Mon Feb 11 05:30:10 EST 2013


On Mon, Feb 11, 2013 at 9:24 PM, Viacheslav Biriukov
<v.v.biriukov at gmail.com> wrote:
> It is a VM in OpenStack, so we can't use a static IP.
> Right now we are investigating why the interface goes down.

Even if you solve that, dynamic IP addresses are fundamentally
incompatible with cluster software.
You're effectively trying to create a cluster out of nodes which
change their name every time they boot.
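With the udpu transport, every node's address is written into corosync.conf
(the memberaddr and bindnetaddr values in the configuration quoted further
down), so the membership only holds while each node keeps the address it was
configured with. The relevant fragment of that config, with the assumption
spelled out in comments:

    interface {
            member {
                    memberaddr: 172.17.0.104   # must remain this node's address
            }
            member {
                    memberaddr: 172.17.0.105   # and this one the peer's
            }
            ringnumber: 0
            bindnetaddr: 172.17.0.0            # network corosync binds to
    }

If a DHCP lease hands a node a different address, it no longer matches its
own member entry and the ring cannot re-form without editing the config and
restarting corosync.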

>
> Thank you!
>
>
> 2013/2/11 Viacheslav Biriukov <v.v.biriukov at gmail.com>
>>
>>
>>
>>
>> 2013/2/11 Dan Frincu <df.cluster at gmail.com>
>>>
>>> Hi,
>>>
>>> On Sun, Feb 10, 2013 at 2:24 PM, Viacheslav Biriukov
>>> <v.v.biriukov at gmail.com> wrote:
>>> > Hi guys,
>>> >
>>> > Got a tricky issue with Corosync and Pacemaker over a DHCP IP address
>>> > using unicast. Corosync crashes periodically.
>>> >
>>> > Packages are from centos 6 repos:
>>> > corosync-1.4.1-7.el6_3.1.x86_64
>>> > corosynclib-1.4.1-7.el6_3.1.x86_64
>>> > pacemaker-cluster-libs-1.1.7-6.el6.x86_64
>>> > pacemaker-libs-1.1.7-6.el6.x86_64
>>> > pacemaker-cli-1.1.7-6.el6.x86_64
>>> > pacemaker-1.1.7-6.el6.x86_64
>>> >
>>> >
>>> > Logs
>>> >
>>> > Feb 09 23:24:33 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 00:24:39 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 01:24:44 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 02:24:48 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 03:24:51 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 04:24:52 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 05:24:54 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 06:25:00 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 07:25:06 host1 lrmd: [5248]: info: rsc:P_SESSION_IP:25: monitor
>>> > Feb 10 07:56:22 corosync [TOTEM ] A processor failed, forming new
>>> > configuration.
>>> > Feb 10 07:56:22 corosync [TOTEM ] The network interface is down.
>>>
>>> This ^^^ is your problem. Corosync doesn't like it, see
>>>
>>> https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface
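A quick way to confirm this on the node is to check the ring status with
corosync-cfgtool. A sketch of what to look for (the exact output format
varies between corosync versions):

    # corosync-cfgtool -s
    Printing ring status.
    Local node ID 104
    RING ID 0
            id      = 172.17.0.104     # healthy: bound to the node's address
            status  = ring 0 active with no faults

    # if the id shows 127.0.0.1, the interface was taken down and corosync
    # rebound to loopback (the situation the wiki page above describes);
    # it usually has to be restarted to recover.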
>>>
>>> Normally DHCP shouldn't take the interface down. Also, since changing
>>> the network configuration in corosync means restarting it, why not go
>>> with static IPs?
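If static addressing is an option, a minimal configuration on CentOS 6 would
be along these lines (a sketch only: the device name eth0, the /24 netmask
and the gateway are assumptions; the address is the one from the logs below):

    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    ONBOOT=yes
    BOOTPROTO=none            # no DHCP; address is fixed
    IPADDR=172.17.0.104
    NETMASK=255.255.255.0     # assumed /24
    GATEWAY=172.17.0.1        # assumed; use the real gateway

followed by "service network restart" and, as noted, a corosync restart.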
>>>
>>> HTH,
>>> Dan
>>>
>>> > Feb 10 07:56:24 corosync [TOTEM ] The network interface [172.17.0.104]
>>> > is
>>> > now up.
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error:
>>> > cfg_connection_destroy:
>>> > Connection destroyed
>>> > Feb 10 07:56:25 [5251] host1       crmd:    error: ais_dispatch:
>>> > Receiving message body failed: (2) Library error: Resource temporarily
>>> > unavailable (11)
>>> > Feb 10 07:56:25 [5246] host1        cib:    error: ais_dispatch:
>>> > Receiving message body failed: (2) Library error: Resource temporarily
>>> > unavailable (11)
>>> > Feb 10 07:56:25 [5249] host1      attrd:    error: ais_dispatch:
>>> > Receiving message body failed: (2) Library error: Resource temporarily
>>> > unavailable (11)
>>> > Feb 10 07:56:25 [5251] host1       crmd:    error: ais_dispatch:
>>> > AIS
>>> > connection failed
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error:
>>> > cpg_connection_destroy:
>>> > Connection destroyed
>>> > Feb 10 07:56:25 [5246] host1        cib:    error: ais_dispatch:
>>> > AIS
>>> > connection failed
>>> > Feb 10 07:56:25 [5251] host1       crmd:     info: crmd_ais_destroy:
>>> > connection closed
>>> > Feb 10 07:56:25 [5249] host1      attrd:    error: ais_dispatch:
>>> > AIS
>>> > connection failed
>>> > Feb 10 07:56:25 [5247] host1 stonith-ng:    error: ais_dispatch:
>>> > Receiving message body failed: (2) Library error: Resource temporarily
>>> > unavailable (11)
>>> > Feb 10 07:56:25 [5246] host1        cib:    error: cib_ais_destroy:
>>> > AIS
>>> > connection terminated
>>> > Feb 10 07:56:25 [5249] host1      attrd:     crit: attrd_ais_destroy:
>>> > Lost
>>> > connection to OpenAIS service!
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:   notice:
>>> > pcmk_shutdown_worker:
>>> > Shuting down Pacemaker
>>> > Feb 10 07:56:25 [5247] host1 stonith-ng:    error: ais_dispatch:
>>> > AIS
>>> > connection failed
>>> > Feb 10 07:56:25 [5249] host1      attrd:   notice: main:
>>> > Exiting...
>>> > Feb 10 07:56:25 [5247] host1 stonith-ng:    error:
>>> > stonith_peer_ais_destroy:
>>> > AIS connection terminated
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:   notice: stop_child:
>>> > Stopping crmd: Sent -15 to process 5251
>>> > Feb 10 07:56:25 [5249] host1      attrd:    error:
>>> > attrd_cib_connection_destroy:       Connection to the CIB terminated...
>>> > Feb 10 07:56:25 [5251] host1       crmd:     info: crm_signal_dispatch:
>>> > Invoking handler for signal 15: Terminated
>>> > Feb 10 07:56:25 [5251] host1       crmd:   notice: crm_shutdown:
>>> > Requesting shutdown, upper limit is 1200000ms
>>> > Feb 10 07:56:25 [5251] host1       crmd:     info: do_shutdown_req:
>>> > Sending shutdown request to host2
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: pcmk_child_exit:
>>> > Child
>>> > process stonith-ng exited (pid=5247, rc=1)
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:  warning: send_ipc_message:
>>> > IPC
>>> > Channel to 5249 is not connected
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:  warning: send_ipc_message:
>>> > IPC
>>> > Channel to 5246 is not connected
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:  warning: send_ipc_message:
>>> > IPC
>>> > Channel to 5247 is not connected
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: pcmk_child_exit:
>>> > Child
>>> > process cib exited (pid=5246, rc=1)
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: pcmk_child_exit:
>>> > Child
>>> > process attrd exited (pid=5249, rc=1)
>>> > Feb 10 07:56:25 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: send_ais_text:
>>> > Sending message 68 via pcmk: FAILED (rc=2): Library error: Connection
>>> > timed
>>> > out (110)
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: do_log:     FSA:
>>> > Input
>>> > I_ERROR from do_shutdown_req() received in state S_NOT_DC
>>> > Feb 10 07:56:27 [5251] host1       crmd:   notice: do_state_transition:
>>> > State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR
>>> > cause=C_FSA_INTERNAL
>>> > origin=do_shutdown_req ]
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: do_recover:
>>> > Action A_RECOVER (0000000001000000) not supported
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: do_log:     FSA:
>>> > Input
>>> > I_TERMINATE from do_recover() received in state S_RECOVERY
>>> > Feb 10 07:56:27 [5251] host1       crmd:   notice: do_state_transition:
>>> > State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE
>>> > cause=C_FSA_INTERNAL origin=do_recover ]
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_shutdown:
>>> > Disconnecting STONITH...
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info:
>>> > tengine_stonith_connection_destroy:         Fencing daemon disconnected
>>> > Feb 10 07:56:27 host1 lrmd: [5248]: info: cancel_op: operation
>>> > monitor[25]
>>> > on ocf::OpenStackFloatingIP::P_SESSION_IP for client 5251, its
>>> > parameters:
>>> > CRM_meta_name=[monitor] crm_feature_set=[3.0.6]
>>> > CRM_meta_timeout=[20000]
>>> > CRM_meta_interval=[5000] ip=[172.24.0.104]  cancelled
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: verify_stopped:
>>> > Resource P_SESSION_IP was active at shutdown.  You may ignore this
>>> > error if
>>> > it is unmanaged.
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_lrm_control:
>>> > Disconnected from the LRM
>>> > Feb 10 07:56:27 [5251] host1       crmd:   notice:
>>> > terminate_ais_connection:
>>> > Disconnecting from AIS
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_ha_control:
>>> > Disconnected from OpenAIS
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_cib_control:
>>> > Disconnecting CIB
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: send_ipc_message:
>>> > IPC
>>> > Channel to 5246 is not connected
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: send_ipc_message:
>>> > IPC
>>> > Channel to 5246 is not connected
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error:
>>> > cib_native_perform_op_delegate:     Sending message to CIB service
>>> > FAILED
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info:
>>> > crmd_cib_connection_destroy:        Connection to the CIB terminated...
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: verify_stopped:
>>> > Resource P_SESSION_IP was active at shutdown.  You may ignore this
>>> > error if
>>> > it is unmanaged.
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_exit:
>>> > Performing
>>> > A_EXIT_0 - gracefully exiting the CRMd
>>> > Feb 10 07:56:27 [5251] host1       crmd:    error: do_exit:    Could
>>> > not
>>> > recover from internal error
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: free_mem:   Dropping
>>> > I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: crm_xml_cleanup:
>>> > Cleaning up memory from libxml2
>>> > Feb 10 07:56:27 [5251] host1       crmd:     info: do_exit:    [crmd]
>>> > stopped (2)
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:    error: pcmk_child_exit:
>>> > Child
>>> > process crmd exited (pid=5251, rc=2)
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:  warning: send_ipc_message:
>>> > IPC
>>> > Channel to 5251 is not connected
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:   notice: stop_child:
>>> > Stopping pengine: Sent -15 to process 5250
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:     info: pcmk_child_exit:
>>> > Child
>>> > process pengine exited (pid=5250, rc=0)
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:   notice: stop_child:
>>> > Stopping lrmd: Sent -15 to process 5248
>>> > Feb 10 07:56:27 host1 lrmd: [5248]: info: lrmd is shutting down
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:     info: pcmk_child_exit:
>>> > Child
>>> > process lrmd exited (pid=5248, rc=0)
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:    error: send_cpg_message:
>>> > Sending message via cpg FAILED: (rc=9) Bad handle
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:   notice:
>>> > pcmk_shutdown_worker:
>>> > Shutdown complete
>>> > Feb 10 07:56:27 [5242] host1 pacemakerd:     info: main:       Exiting
>>> > pacemakerd
>>> >
>>> >
>>> > corosync.conf:
>>> >
>>> > compatibility: whitetank
>>> >
>>> > totem {
>>> >         version: 2
>>> >         secauth: off
>>> >         nodeid: 104
>>> >         interface {
>>> >                 member {
>>> >                         memberaddr: 172.17.0.104
>>> >                 }
>>> >                 member {
>>> >                         memberaddr: 172.17.0.105
>>> >                 }
>>> >                 ringnumber: 0
>>> >                 bindnetaddr: 172.17.0.0
>>> >                 mcastport: 5426
>>> >                 ttl: 1
>>> >         }
>>> >         transport: udpu
>>> > }
>>> >
>>> > logging {
>>> >         fileline: off
>>> >         to_logfile: yes
>>> >         to_syslog: yes
>>> >         debug: on
>>> >         logfile: /var/log/cluster/corosync.log
>>> >         debug: off
>>> >         timestamp: on
>>> >         logger_subsys {
>>> >                 subsys: AMF
>>> >                 debug: off
>>> >         }
>>> > }
>>> > service {
>>> >        # Load the Pacemaker Cluster Resource Manager
>>> >        ver:       1
>>> >        name:      pacemaker
>>> > }
>>> >
>>> > aisexec {
>>> >        user:   root
>>> >        group:  root
>>> > }
>>> >
>>> >
>>> >
>>> > Thank you!
>>> >
>>> > --
>>> > Viacheslav Biriukov
>>> > BR
>>> > http://biriukov.me
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Dan Frincu
>>> CCNA, RHCE
>>>
>>
>>
>>
>>
>> --
>> Viacheslav Biriukov
>> BR
>> http://biriukov.me
>
>
>
>
> --
> Viacheslav Biriukov
> BR
> http://biriukov.me
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



