[ClusterLabs] pacemaker after upgrade from wheezy to jessie
Klaus Wenninger
kwenning at redhat.com
Thu Nov 10 09:31:53 UTC 2016
On 11/10/2016 09:47 AM, Toni Tschampke wrote:
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>
> Thanks for this tip, corosync quorum configuration was the cause.
>
> Since we changed validate-with as well as the feature set manually in the
> CIB, do we still need to issue the cibadmin --upgrade --force
> command, or is this command only for changing the schemas?
>
I guess not, as this would just do automatically (up to the latest version,
then) what you've already done manually.
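
If you want to double-check that the manual change is what the cluster
actually reports, a small Python sketch along these lines could read the two
attributes from the output of cibadmin -Q. The function name cib_schema_info
and the SAMPLE string are only my illustration, not part of any Pacemaker
tooling; the attribute values are copied from the header you posted, with a
closing tag added so that the snippet parses on its own.

```python
import xml.etree.ElementTree as ET


def cib_schema_info(cib_xml: str) -> dict:
    """Return validate-with and crm_feature_set from the root <cib> element."""
    root = ET.fromstring(cib_xml)
    if root.tag != "cib":
        raise ValueError(f"expected <cib> root element, got <{root.tag}>")
    return {
        "validate-with": root.get("validate-with"),
        "crm_feature_set": root.get("crm_feature_set"),
    }


# Hypothetical sample: header attributes as quoted in this thread, closed
# with an empty body (assumption) so that it is well-formed XML.
SAMPLE = (
    '<cib admin_epoch="0" epoch="8464" num_updates="0" '
    'validate-with="pacemaker-2.4" crm_feature_set="3.0.10" have-quorum="1">'
    "</cib>"
)

if __name__ == "__main__":
    print(cib_schema_info(SAMPLE))
```

In practice you would feed it the text captured from cibadmin -Q instead of
SAMPLE and compare the result with what you set by hand.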
> --
> Kind regards
>
> Toni Tschampke | tt at halle.it
> bcs kommunikationslösungen
> Owner: Dipl.-Ing. Carsten Burkhardt
> Harz 51 | 06108 Halle (Saale) | Germany
> tel +49 345 29849-0 | fax +49 345 29849-22
> www.b-c-s.de | www.halle.it | www.wivewa.de
>
>
> MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS EASILY - WITH WIVEWA -
> YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!
>
> For more information, visit www.wivewa.de
>
> On 08.11.2016 at 22:51, Ken Gaillot wrote:
>> On 11/07/2016 09:08 AM, Toni Tschampke wrote:
>>> We managed to change the validate-with option via a workaround (cibadmin
>>> export & replace), as setting the value with cibadmin --modify doesn't
>>> write the changes to disk.
>>>
>>> After experimenting with various schemas (the XML is correctly interpreted
>>> by crmsh), we are still not able to communicate with the local crmd.
>>>
>>> Can someone please help determine why the local crmd is not responding
>>> (we disabled our other nodes to rule out possible corosync-related
>>> issues) and why crmsh and cibadmin commands run into errors/timeouts?
>>
>> It occurs to me that wheezy used corosync 1. There were major changes
>> from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
>> pacemaker, whereas 2 has quorum built-in.
>>
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>>
>>> Examples of local commands that do not work:
>>>
>>> Timeout when running cibadmin (strace attached):
>>>> cibadmin --upgrade --force
>>>> Call cib_upgrade failed (-62): Timer expired
>>>
>>> Error when running a crm resource cleanup:
>>>> crm resource cleanup $vm
>>>> Error signing on to the CRMd service
>>>> Error performing operation: Transport endpoint is not connected
>>>
>>> I attached the strace log from running cib_upgrade. Does this help to
>>> find the cause of the timeout issue?
>>>
>>> Here is the corosync dump when locally starting pacemaker:
>>>
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:1256
>>>> Corosync Cluster Engine ('2.3.6'): started and ready to provide
>>>> service.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [MAIN ] main.c:1257
>>>> Corosync built-in features: dbus rdma monitoring watchdog augeas
>>>> systemd upstart xmlconf qdevices snmp pie relro bindnow
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>>>> none hash: none
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>>>> none hash: none
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemudp.c:671 The network interface [10.112.0.1] is now up.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync configuration map access [0]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ]
>>>> ipc_setup.c:536 server name: cmap
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync configuration service [1]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ]
>>>> ipc_setup.c:536 server name: cfg
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync cluster closed process group service
>>>> v1.01 [2]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ]
>>>> ipc_setup.c:536 server name: cpg
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync profile loading service [4]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync resource monitoring service [6]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [WD ] wd.c:669
>>>> Watchdog /dev/watchdog is now been tickled by corosync.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:625
>>>> Could not change the Watchdog timeout from 10 to 6 seconds
>>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:464
>>>> resource load_15min missing a recovery key.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD ] wd.c:464
>>>> resource memory_used missing a recovery key.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [WD ] wd.c:581 no
>>>> resources configured.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync watchdog service [7]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [SERV ] service.c:174
>>>> Service engine loaded: corosync cluster quorum service v0.1 [3]
>>>> Nov 07 16:01:59 [24339] nebel1 corosync info [QB ]
>>>> ipc_setup.c:536 server name: quorum
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemudp.c:671 The network interface [10.110.1.1] is now up.
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [TOTEM ]
>>>> totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members
>>>> joined: 1
>>>> Nov 07 16:01:59 [24339] nebel1 corosync notice [MAIN ] main.c:310
>>>> Completed service synchronization, ready to provide service.
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice: main:
>>>> Starting Pacemaker 1.1.15 | build=e174ec8 features: generated-manpages
>>>> agent-manpages ascii-docs publican-docs ncurses libqb-logging
>>>> libqb-ipc lha-fencing upstart systemd nagios corosync-native
>>>> atomic-attrd snmp libesmtp acls
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: main:
>>>> Maximum core file size is: 18446744073709551615
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> qb_ipcs_us_publish: server name: pacemakerd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice:
>>>> get_node_name: Could not obtain a node name for corosync nodeid 1
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> crm_get_peer: Created entry
>>>> 283a5061-34c2-4b81-bff9-738533f22277/0x7f8a151931a0 for node (null)/1
>>>> (1 total)
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> crm_get_peer: Node 1 has uuid 1
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] -
>>>> corosync-cpg is now online
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: error:
>>>> cluster_connect_quorum: Corosync quorum is not configured
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: notice:
>>>> get_node_name: Defaulting to uname -n for the local corosync node
>>>> name
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> crm_get_peer: Node 1 is now known as nebel1
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Using uid=108 and group=114 for process cib
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24342 for process cib
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24343 for process stonith-ng
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24344 for process lrmd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Using uid=108 and group=114 for process attrd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24345 for process attrd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Using uid=108 and group=114 for process pengine
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24346 for process pengine
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Using uid=108 and group=114 for process crmd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> start_child: Forked child 24347 for process crmd
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info: main:
>>>> Starting mainloop
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> pcmk_cpg_membership: Node 1 joined group pacemakerd
>>>> (counter=0.0)
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> pcmk_cpg_membership: Node 1 still member of group pacemakerd
>>>> (peer=nebel1, counter=0.0)
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24341] nebel1 pacemakerd: info:
>>>> mcp_cpg_deliver: Ignoring process list sent by peer for local node
>>>> Nov 07 16:01:59 [24342] nebel1 cib: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24342] nebel1 cib: notice: main: Using
>>>> legacy config location: /var/lib/heartbeat/crm
>>>> Nov 07 16:01:59 [24342] nebel1 cib: info:
>>>> get_cluster_type: Verifying cluster type: 'corosync'
>>>> Nov 07 16:01:59 [24342] nebel1 cib: info:
>>>> get_cluster_type: Assuming an active 'corosync' cluster
>>>> Nov 07 16:01:59 [24342] nebel1 cib: info:
>>>> retrieveCib: Reading cluster configuration file
>>>> /var/lib/heartbeat/crm/cib.xml (digest:
>>>> /var/lib/heartbeat/crm/cib.xml.sig)
>>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info:
>>>> qb_ipcs_us_publish: server name: lrmd
>>>> Nov 07 16:01:59 [24344] nebel1 lrmd: info: main:
>>>> Starting
>>>> Nov 07 16:01:59 [24346] nebel1 pengine: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24346] nebel1 pengine: info:
>>>> qb_ipcs_us_publish: server name: pengine
>>>> Nov 07 16:01:59 [24346] nebel1 pengine: info: main:
>>>> Starting pengine
>>>> Nov 07 16:01:59 [24345] nebel1 attrd: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24345] nebel1 attrd: info: main:
>>>> Starting up
>>>> Nov 07 16:01:59 [24345] nebel1 attrd: info:
>>>> get_cluster_type: Verifying cluster type: 'corosync'
>>>> Nov 07 16:01:59 [24345] nebel1 attrd: info:
>>>> get_cluster_type: Assuming an active 'corosync' cluster
>>>> Nov 07 16:01:59 [24345] nebel1 attrd: notice:
>>>> crm_cluster_connect: Connecting to cluster infrastructure:
>>>> corosync
>>>> Nov 07 16:01:59 [24347] nebel1 crmd: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: main: CRM
>>>> Git Version: 1.1.15 (e174ec8)
>>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info:
>>>> crm_log_init: Changed active directory to
>>>> /var/lib/pacemaker/cores
>>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info:
>>>> get_cluster_type: Verifying cluster type: 'corosync'
>>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: info:
>>>> get_cluster_type: Assuming an active 'corosync' cluster
>>>> Nov 07 16:01:59 [24343] nebel1 stonith-ng: notice:
>>>> crm_cluster_connect: Connecting to cluster infrastructure:
>>>> corosync
>>>> Nov 07 16:01:59 [24347] nebel1 crmd: info: do_log: Input
>>>> I_STARTUP received in state S_STARTING from crmd_init
>>>> Nov 07 16:01:59 [24347] nebel1 crmd: info:
>>>> get_cluster_type: Verifying cluster type: 'corosync'
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:00 [24342] nebel1 cib: notice:
>>>> get_node_name: Could not obtain a node name for corosync nodeid 1
>>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng: notice:
>>>> get_node_name: Defaulting to uname -n for the local corosync node
>>>> name
>>>> Nov 07 16:02:00 [24343] nebel1 stonith-ng: info:
>>>> crm_get_peer: Node 1 is now known as nebel1
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> crm_get_peer: Created entry
>>>> f5df58e3-3848-440c-8f6b-d572f8fa9b9c/0x7f0ce1744570 for node (null)/1
>>>> (1 total)
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> crm_get_peer: Node 1 has uuid 1
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] -
>>>> corosync-cpg is now online
>>>> Nov 07 16:02:00 [24342] nebel1 cib: notice:
>>>> crm_update_peer_state_iter: Node (null) state is now member |
>>>> nodeid=1 previous=unknown source=crm_update_peer_proc
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> init_cs_connection_once: Connection to 'corosync': established
>>>> Nov 07 16:02:00 [24345] nebel1 attrd: info: main:
>>>> Cluster connection active
>>>> Nov 07 16:02:00 [24345] nebel1 attrd: info:
>>>> qb_ipcs_us_publish: server name: attrd
>>>> Nov 07 16:02:00 [24345] nebel1 attrd: info: main:
>>>> Accepting attribute updates
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:00 [24342] nebel1 cib: notice:
>>>> get_node_name: Defaulting to uname -n for the local corosync node
>>>> name
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> crm_get_peer: Node 1 is now known as nebel1
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> qb_ipcs_us_publish: server name: cib_ro
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> qb_ipcs_us_publish: server name: cib_rw
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> qb_ipcs_us_publish: server name: cib_shm
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info: cib_init:
>>>> Starting cib mainloop
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> pcmk_cpg_membership: Node 1 joined group cib (counter=0.0)
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> pcmk_cpg_membership: Node 1 still member of group cib
>>>> (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> cib_file_backup: Archived previous version as
>>>> /var/lib/heartbeat/crm/cib-72.raw
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> cib_file_write_with_digest: Wrote version 0.8464.0 of the CIB
>>>> to disk (digest: 5201c56641a95e5117df4184587c3e93)
>>>> Nov 07 16:02:00 [24342] nebel1 cib: info:
>>>> cib_file_write_with_digest: Reading cluster configuration file
>>>> /var/lib/heartbeat/crm/cib.naRhNz (digest:
>>>> /var/lib/heartbeat/crm/cib.hLaVCH)
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> do_cib_control: CIB connection established
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice:
>>>> crm_cluster_connect: Connecting to cluster infrastructure:
>>>> corosync
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice:
>>>> get_node_name: Could not obtain a node name for corosync nodeid 1
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Created entry
>>>> 43a3b98f-d81d-4cc7-b46e-4512f24db371/0x7f798ff40040 for node (null)/1
>>>> (1 total)
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Node 1 has uuid 1
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> crm_update_peer_proc: cluster_connect_cpg: Node (null)[1] -
>>>> corosync-cpg is now online
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> init_cs_connection_once: Connection to 'corosync': established
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: notice:
>>>> get_node_name: Defaulting to uname -n for the local corosync node
>>>> name
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Node 1 is now known as nebel1
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: info:
>>>> peer_update_callback: nebel1 is now in unknown state
>>>> Nov 07 16:02:00 [24347] nebel1 crmd: error:
>>>> cluster_connect_quorum: Corosync quorum is not configured
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 2
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 2
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: notice:
>>>> get_node_name: Could not obtain a node name for corosync nodeid 2
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Created entry
>>>> c790c642-6666-4022-bba9-f700e4773b03/0x7f79901428e0 for node (null)/2
>>>> (2 total)
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Node 2 has uuid 2
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 3
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> corosync_node_name: Unable to get node name for nodeid 3
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: notice:
>>>> get_node_name: Could not obtain a node name for corosync nodeid 3
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Created entry
>>>> 928f8124-4d29-4285-99de-50038d3c3b7e/0x7f7990142a20 for node (null)/3
>>>> (3 total)
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> crm_get_peer: Node 3 has uuid 3
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> do_ha_control: Connected to the cluster
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> lrmd_ipc_connect: Connecting to lrmd
>>>> Nov 07 16:02:01 [24342] nebel1 cib: info:
>>>> cib_process_request: Forwarding cib_modify operation for section
>>>> nodes to all (origin=local/crmd/3)
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> do_lrm_control: LRM connection established
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> do_started: Delaying start, no membership data
>>>> (0000000000100000)
>>>> Nov 07 16:02:01 [24342] nebel1 cib: info:
>>>> corosync_node_name: Unable to get node name for nodeid 1
>>>> Nov 07 16:02:01 [24342] nebel1 cib: notice:
>>>> get_node_name: Defaulting to uname -n for the local corosync node
>>>> name
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> parse_notifications: No optional alerts section in cib
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> do_started: Delaying start, no membership data
>>>> (0000000000100000)
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> pcmk_cpg_membership: Node 1 joined group crmd (counter=0.0)
>>>> Nov 07 16:02:01 [24347] nebel1 crmd: info:
>>>> pcmk_cpg_membership: Node 1 still member of group crmd
>>>> (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:01 [24342] nebel1 cib: info:
>>>> cib_process_request: Completed cib_modify operation for section
>>>> nodes: OK (rc=0, origin=nebel1/crmd/3, version=0.8464.0)
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info:
>>>> attrd_cib_connect: Connected to the CIB after 2 attempts
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info: main: CIB
>>>> connection active
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info:
>>>> pcmk_cpg_membership: Node 1 joined group attrd (counter=0.0)
>>>> Nov 07 16:02:01 [24345] nebel1 attrd: info:
>>>> pcmk_cpg_membership: Node 1 still member of group attrd
>>>> (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: setup_cib:
>>>> Watching for stonith topology changes
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info:
>>>> qb_ipcs_us_publish: server name: stonith-ng
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info: main:
>>>> Starting stonith-ng mainloop
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info:
>>>> pcmk_cpg_membership: Node 1 joined group stonith-ng
>>>> (counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info:
>>>> pcmk_cpg_membership: Node 1 still member of group stonith-ng
>>>> (peer=nebel1, counter=0.0)
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info:
>>>> init_cib_cache_cb: Updating device list from the cib: init
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: info:
>>>> cib_devices_update: Updating devices to version 0.8464.0
>>>> Nov 07 16:02:01 [24343] nebel1 stonith-ng: notice:
>>>> unpack_config: On loss of CCM Quorum: Ignore
>>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng: notice:
>>>> stonith_device_register: Added 'stonith1Nebel2' to the device list
>>>> (1 active devices)
>>>> Nov 07 16:02:02 [24343] nebel1 stonith-ng: info:
>>>> cib_device_update: Device stonith1Nebel1 has been disabled on nebel1:
>>>> score=-INFINITY
>>>
>>> Current cib settings:
>>>> cibadmin -Q | grep validate
>>>> <cib admin_epoch="0" epoch="8464" num_updates="0"
>>>> validate-with="pacemaker-2.4" crm_feature_set="3.0.10" have-quorum="1"
>>>> cib-last-written="Fri Nov 4 12:15:30 2016" update-origin="nebel3"
>>>> update-client="crm_attribute" update-user="root">
>>>
>>> Any help is appreciated, thanks in advance
>>>
>>> Regards, Toni
>>>
>>> On 03.11.2016 at 17:42, Toni Tschampke wrote:
>>>> > I'm guessing this change should be instantly written into the xml file?
>>>> > If this is the case, something is wrong: grepping for validate gives the
>>>> > old string back.
>>>>
>>>> We found some strange behavior when setting "validate-with" via
>>>> cibadmin: corosync.log shows the successful transaction, and issuing
>>>> cibadmin --query gives the correct value, but it is NOT written into
>>>> cib.xml.
>>>>
>>>> We restarted pacemaker and the value was reset to pacemaker-1.1.
>>>> If signatures for the cib.xml are generated by pacemaker/cib, which
>>>> algorithm is used? It looks like md5 to me.
>>>>
>>>> Would it be possible to manually edit the cib.xml and generate a valid
>>>> cib.xml.sig to get one step further in the debugging process?
>>>>
>>>> Regards, Toni
>>>>
>>>> On 03.11.2016 at 16:39, Toni Tschampke wrote:
>>>>> > I'm going to guess you were using the experimental 1.1 schema as the
>>>>> > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>> > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
>>>>> > you get better results. Don't edit the file directly though; use the
>>>>> > cibadmin command so it signs the end result properly.
>>>>> >
>>>>> > After changing the validate-with, run:
>>>>> >
>>>>> > crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>> >
>>>>> > and fix any errors that show up.
>>>>>
>>>>> Strange, the location of our cib.xml differs from your path; our cib is
>>>>> located in /var/lib/heartbeat/crm/
>>>>>
>>>>> Running
>>>>>
>>>>> cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'
>>>>>
>>>>> gave no output but was logged to corosync:
>>>>>
>>>>> cib: info: cib_perform_op: -- <cib num_updates="0"
>>>>> validate-with="pacemaker-1.1"/>
>>>>> cib: info: cib_perform_op: ++ <cib admin_epoch="0"
>>>>> epoch="8462"
>>>>> num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
>>>>> have-quorum="1" cib-last-written="Thu Nov 3 10:05:52 2016"
>>>>> update-origin="nebel1" update-client="cibadmin" update-user="root"/>
>>>>>
>>>>> I'm guessing this change should be instantly written into the xml file?
>>>>> If this is the case, something is wrong: grepping for validate gives the
>>>>> old string back.
>>>>>
>>>>> <cib admin_epoch="0" epoch="8462" num_updates="0"
>>>>> validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
>>>>> cib-last-written="Thu Nov 3 16:19:51 2016" update-origin="nebel1"
>>>>> update-client="cibadmin" update-user="root">
>>>>>
>>>>> pacemakerd --features
>>>>> Pacemaker 1.1.15 (Build: e174ec8)
>>>>> Supporting v3.0.10:
>>>>>
>>>>> Should the crm_feature_set be updated this way too? I'm guessing this is
>>>>> done when "cibadmin --upgrade" succeeds?
>>>>>
>>>>> We just get a timeout error when trying to upgrade it with cibadmin:
>>>>> Call cib_upgrade failed (-62): Timer expired
>>>>>
>>>>> Have permissions changed from 1.1.7 to 1.1.15? Looking at our
>>>>> quite big /var/lib/heartbeat/crm/ folder, some permissions changed:
>>>>>
>>>>> -rw------- 1 hacluster root 80K Nov 1 16:56 cib-31.raw
>>>>> -rw-r--r-- 1 hacluster root 32 Nov 1 16:56 cib-31.raw.sig
>>>>> -rw------- 1 hacluster haclient 80K Nov 1 18:53 cib-32.raw
>>>>> -rw------- 1 hacluster haclient 32 Nov 1 18:53 cib-32.raw.sig
>>>>>
>>>>> cib-31 was written before upgrading, cib-32 after starting the upgraded
>>>>> pacemaker.
>>>>>
>>>>>
>>>>> On 03.11.2016 at 15:39, Ken Gaillot wrote:
>>>>>> On 11/03/2016 05:51 AM, Toni Tschampke wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
>>>>>>> (pacemaker 1.1.15, corosync 2.3.6).
>>>>>>> During the upgrade pacemaker was removed (rc) and reinstalled afterwards
>>>>>>> from jessie-backports, same for crmsh.
>>>>>>>
>>>>>>> Now we are encountering multiple problems:
>>>>>>>
>>>>>>> First I checked the configuration on a single node running pacemaker &
>>>>>>> corosync, which produced a strange error followed by multiple lines
>>>>>>> stating that the syntax is wrong. crm configure show then showed a mixed
>>>>>>> view of xml and crmsh single-line syntax.
>>>>>>>
>>>>>>>> ERROR: Cannot read schema file
>>>>>>>> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
>>>>>>>> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'
>>>>>>
>>>>>> pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
>>>>>> as it was used to hold experimental new features rather than as the
>>>>>> actual next version of the schema. So, the schema skipped to 1.2.
>>>>>>
>>>>>> I'm going to guess you were using the experimental 1.1 schema as the
>>>>>> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
>>>>>> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
>>>>>> you get better results. Don't edit the file directly though; use the
>>>>>> cibadmin command so it signs the end result properly.
>>>>>>
>>>>>> After changing the validate-with, run:
>>>>>>
>>>>>> crm_verify -x /var/lib/pacemaker/cib/cib.xml
>>>>>>
>>>>>> and fix any errors that show up.
>>>>>>
>>>>>>> When we looked into that folder there was pacemaker-1.0.rng, 1.2 and so
>>>>>>> on. As a quick try we symlinked the 1.2 to 1.1 and the syntax errors
>>>>>>> were gone. When running crm resource show, all resources showed up, but
>>>>>>> when running crm_mon -1fA the output was unexpected, as it showed all
>>>>>>> nodes offline, with no DC elected:
>>>>>>>
>>>>>>>> Stack: corosync
>>>>>>>> Current DC: NONE
>>>>>>>> Last updated: Thu Nov 3 11:11:16 2016
>>>>>>>> Last change: Thu Nov 3 09:54:52 2016 by root via cibadmin on
>>>>>>>> nebel1
>>>>>>>>
>>>>>>>> *** Resource management is DISABLED ***
>>>>>>>> The cluster will not attempt to start, stop or recover services
>>>>>>>>
>>>>>>>> 3 nodes and 73 resources configured:
>>>>>>>> 5 resources DISABLED and 0 BLOCKED from being started due to
>>>>>>>> failures
>>>>>>>>
>>>>>>>> OFFLINE: [ nebel1 nebel2 nebel3 ]
>>>>>>>
>>>>>>> We tried to manually change dc-version.
>>>>>>>
>>>>>>> When issuing a simple cleanup command I got the following error:
>>>>>>>
>>>>>>>> crm resource cleanup DrbdBackuppcMs
>>>>>>>> Error signing on to the CRMd service
>>>>>>>> Error performing operation: Transport endpoint is not connected
>>>>>>>
>>>>>>> which looks like crmsh is not able to communicate with crmd, and nothing
>>>>>>> is logged in this case in corosync.log.
>>>>>>>
>>>>>>> We experimented with multiple config changes (corosync.conf: pacemaker
>>>>>>> ver 0 > 1; cib-bootstrap-options: cluster-infrastructure from openais to
>>>>>>> corosync).
>>>>>>>> Package versions:
>>>>>>>> cman 3.1.8-1.2+b1
>>>>>>>> corosync 2.3.6-3~bpo8+1
>>>>>>>> crmsh 2.2.0-1~bpo8+1
>>>>>>>> csync2 1.34-2.3+b1
>>>>>>>> dlm-pcmk 3.0.12-3.2+deb7u2
>>>>>>>> libcman3 3.1.8-1.2+b1
>>>>>>>> libcorosync-common4:amd64 2.3.6-3~bpo8+1
>>>>>>>> munin-libvirt-plugins 0.0.6-1
>>>>>>>> pacemaker 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-cli-utils 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-common 1.1.15-2~bpo8+1
>>>>>>>> pacemaker-resource-agents 1.1.15-2~bpo8+1
>>>>>>>
>>>>>>>> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64
>>>>>>>> GNU/Linux
>>>>>>>
>>>>>>> I attached our cib before upgrade and after, as well as the one with the
>>>>>>> mixed syntax and our corosync.conf.
>>>>>>>
>>>>>>> When we tried to connect a second node to the cluster, pacemaker starts
>>>>>>> its daemons, starts corosync and dies after 15 tries with the following
>>>>>>> in the corosync log:
>>>>>>>
>>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>>> (2000ms)
>>>>>>>> crmd: info: do_cib_control: Could not connect to the CIB service:
>>>>>>>> Transport endpoint is not connected
>>>>>>>> crmd: warning: do_cib_control:
>>>>>>>> Couldn't complete CIB registration 15 times... pause and retry
>>>>>>>> attrd: error: attrd_cib_connect: Signon to CIB failed:
>>>>>>>> Transport endpoint is not connected (-107)
>>>>>>>> attrd: info: main: Shutting down attribute manager
>>>>>>>> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
>>>>>>>> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>>> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
>>>>>>>> (2000ms)
>>>>>>>> pacemakerd: warning: pcmk_child_exit:
>>>>>>>> The attrd process (12761) can no longer be respawned,
>>>>>>>> shutting the cluster down.
>>>>>>>> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
>>>>>>>
>>>>>>> A third node joins without above error, but crm_mon still shows all
>>>>>>> nodes as offline.
>>>>>>>
>>>>>>> Thanks for any advice how to solve this, I'm out of ideas now.
>>>>>>>
>>>>>>> Regards, Toni
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>