[ClusterLabs] attrd does not clean per-node cache after node removal

Vladislav Bogdanov bubble at hoster-ok.com
Wed Mar 23 16:52:21 UTC 2016


23.03.2016 19:39, Ken Gaillot wrote:
> On 03/23/2016 07:35 AM, Vladislav Bogdanov wrote:
>> Hi!
>>
>> It seems like atomic attrd in post-1.1.14 (eb89393) does not
>> fully clean the node cache after a node is removed.
>
> Is this a regression? Or have you only tried it with this version?

Only with this one.

>
>> After our QA guys remove node wa-test-server-ha-03 from a two-node cluster:
>> * stop pacemaker and corosync on wa-test-server-ha-03
>> * remove node wa-test-server-ha-03 from corosync nodelist on wa-test-server-ha-04
>> * tune votequorum settings
>> * reload corosync on wa-test-server-ha-04
>> * remove node from pacemaker on wa-test-server-ha-04
>> * delete everything from /var/lib/pacemaker/cib on wa-test-server-ha-03
>> , and then rejoin it with a different corosync ID (but with the same node name),
>> we see the following in the logs:
>>
>> Node 1 (wa-test-server-ha-03) leaves:
>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: crm_update_peer_proc: Node wa-test-server-ha-03[1] - state is now lost (was member)
>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Removing all wa-test-server-ha-03 (1) attributes for attrd_peer_change_cb
>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Lost attribute writer wa-test-server-ha-03
>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Removing wa-test-server-ha-03/1 from the membership list
>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]:   notice: Purged 1 peers with id=1 and/or uname=wa-test-server-ha-03 from the membership cache
>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Processing peer-remove from wa-test-server-ha-04: wa-test-server-ha-03 0
>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Removing all wa-test-server-ha-03 (0) attributes for wa-test-server-ha-04
>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Removing wa-test-server-ha-03/1 from the membership list
>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]:   notice: Purged 1 peers with id=0 and/or uname=wa-test-server-ha-03 from the membership cache
>>
>> Node 3 joins (the same host, wa-test-server-ha-03, but with a different ID):
>> Mar 23 04:21:23 wa-test-server-ha-04 attrd[25962]: notice: crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now member (was (null))
>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]:     crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>
> It took me a while to understand the above combination of messages. This
> is not node 3 joining. This is node 1 joining after node 3 has already
> been seen.

Hmmm...
corosync.conf and corosync-cmapctl both say it is 3.
Also, the CIB lists it as 3, and lrmd puts its status records under 3.

Actually, the issue is that DRBD resources are not promoted, because their 
master attributes go to the status section for node-id 1. That is the only 
reason we noticed this at all; everything not related to volatile attributes 
works fine.
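
Roughly, the status section then looks like this (a trimmed, hand-written 
sketch; element ids and the attribute name are just examples, not copied 
from the real CIB, and node 2 is omitted):

  <status>
    <node_state id="1" uname="wa-test-server-ha-03">
      <transient_attributes id="1">
        <instance_attributes id="status-1">
          <!-- the drbd master score lands under the stale node id -->
          <nvpair id="status-1-master-drbd0" name="master-drbd0" value="10000"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
    <node_state id="3" uname="wa-test-server-ha-03">
      <!-- only resource operation history here, no transient attributes -->
      <lrm id="3"> ... </lrm>
    </node_state>
  </status>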

>
> The warnings are a complete dump of the peer cache. So you can see that
> wa-test-server-ha-03 is listed only once, with id 3.
>
> The critical message ("Node 1 and 3") lists the new ID first and the
> found ID second. So ID 1 is what it's trying to add to the cache.

But there is also 'Node 'wa-test-server-ha-03' has changed its ID from 1 
to 3' - that message comes first. Does that matter?
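
If it helps, my (possibly wrong) reading of how that collision comes about, 
as a simplified, self-contained sketch - this is not the real crm_find_peer 
code, and all names and the cache layout here are made up for illustration:

  #include <stdio.h>
  #include <string.h>

  /* Simplified stand-in for a peer-cache entry (not the real crm_node_t). */
  struct peer {
      unsigned int id;
      const char *uname;
  };

  /* Cache as it looks after the rejoin: only ids 2 and 3 are present. */
  static struct peer cache[] = {
      { 2, "wa-test-server-ha-04" },
      { 3, "wa-test-server-ha-03" },
  };
  static const size_t n_peers = sizeof(cache) / sizeof(cache[0]);

  static struct peer *find_by_id(unsigned int id)
  {
      for (size_t i = 0; i < n_peers; i++)
          if (cache[i].id == id)
              return &cache[i];
      return NULL;
  }

  static struct peer *find_by_name(const char *uname)
  {
      for (size_t i = 0; i < n_peers; i++)
          if (strcmp(cache[i].uname, uname) == 0)
              return &cache[i];
      return NULL;
  }

  int main(void)
  {
      /* An update arrives still carrying the old id together with the name. */
      unsigned int wanted_id = 1;
      const char *wanted_name = "wa-test-server-ha-03";

      struct peer *by_id = find_by_id(wanted_id);
      struct peer *by_name = find_by_name(wanted_name);

      if (by_id == NULL && by_name != NULL && by_name->id != wanted_id) {
          /* Same shape as the "Node 1 and 3 share the same name" message. */
          printf("crit: Node %u and %u share the same name '%s'\n",
                 wanted_id, by_name->id, wanted_name);
      }
      return 0;
  }

If that reading is right, then something is still feeding the old id 1 back 
in after the purge messages above.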

>
> Did you update the node ID in corosync.conf on *both* nodes?

Sure.
It is automatically copied to the node being joined.
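
After the change, the nodelist on both nodes looks roughly like this 
(trimmed; the ring0_addr values here are illustrative):

  nodelist {
      node {
          ring0_addr: wa-test-server-ha-04
          nodeid: 2
      }
      node {
          ring0_addr: wa-test-server-ha-03
          nodeid: 3
      }
  }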

>
>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]:   notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]:     crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]:   notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]:  warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]:     crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]:   notice: Node 'wa-test-server-ha-03' has changed its ID from 3 to 1
>> ...
>>
>> On the node being joined:
>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]:   notice: Connecting to cluster infrastructure: corosync
>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]:   notice: crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now member (was (null))
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:   notice: crm_update_peer_proc: Node wa-test-server-ha-04[2] - state is now member (was (null))
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:   notice: Recorded attribute writer: wa-test-server-ha-04
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:   notice: Processing sync-response from wa-test-server-ha-04
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:  warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0xdfe620 - ad08ca96-295a-4fa4-99f9-8c8a2d0b6ac0
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:  warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0xd7ae20 - f85bdc4b-a3ee-47ff-bdd5-7c1dcf9fe97c
>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]:     crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]:   notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]:   notice: Updating all attributes after cib_refresh_notify event
>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]:   notice: Updating all attributes after cib_refresh_notify event
>>
>>
>> The CIB status section after that contains entries for three nodes, with IDs 1, 2 and 3:
>> for node 2 (the one that remained) there are both transient attributes and lrm status entries;
>> for node 1 (the one that was removed) - only transient attributes;
>> for node 3 (the newly joined one) - only lrm status entries.
>>
>> That makes me think that not everything is removed from attrd (stale caches?) when a node leaves.
>>
>>
>> Is there any other information I can provide to help solve this issue?
>>
>> Best,
>> Vladislav
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




