[ClusterLabs] Issue with attrd_updater hang

Mon Jan 9 06:23:52 EST 2017

Hi!

our customers were hit by a quite strange issue with resources populating attributes
in attrd. The most obscure fact is that they see that issue only on a selected
subset of nodes (two nodes in a 8-node cluster). Symptoms are sporadic timeouts of
resources whose RAs call attrd_updater to manage node attributes. In order to debug
an issue we modified resource agents to run attrd_updater under strace and also to
collect attrd blackbox on a stop (after the timed-out monitor).

That way we managed to get a little bit further.

Below are the analysis for a failure registered Dec 17 08:08:19.
Resource has monitor timeout of 30 sec and interval 10 sec, so the monitor operation
was started around Dec 17 08:07:49:

Dec 17 08:08:19 [38018] pfs2n5       lrmd:  warning: child_timeout_callback:    ifspeed-o2ib0_monitor_10000 process (PID 40819) timed out
Dec 17 08:08:24 [38018] pfs2n5       lrmd:     crit: child_timeout_callback:    ifspeed-o2ib0_monitor_10000 process (PID 40819) will not die!
^^ This is because of strace
Dec 17 08:08:48 [38018] pfs2n5       lrmd:  warning: operation_finished:    ifspeed-o2ib0_monitor_10000:40819 - timed out after 30000ms

Strace log for attrd_updater shows the following:
=======
[...]
connect(3, {sa_family=AF_LOCAL, sun_path=@"attrd"}, 110) = 0
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
sendto(3, "\377\377\377\377\0\0\0\0\30\0\0\0\0\0\0\0\0\0\20\0\0\0\0\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
setsockopt(3, SOL_SOCKET, SO_PASSCRED, [0], 4) = 0
recvfrom(3, 0x7ffdb55ade40, 12328, 16640, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, -1)    = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\377\377\377\377\0\0\0\0(0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\20\0"..., 12328, MSG_WAITALL|MSG_NOSIGNAL, NULL, NULL) = 12328
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ee6000
open("/dev/shm/qb-attrd-request-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8252)                      = 0
mmap(NULL, 8252, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f6982100000
open("/dev/shm/qb-attrd-request-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672)                   = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ce4000
mmap(0x7f6981ce4000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981ce4000
mmap(0x7f6981de5000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981de5000
close(5)                                = 0
close(4)                                = 0
open("/dev/shm/qb-attrd-response-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8248)                      = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f69820fd000
open("/dev/shm/qb-attrd-response-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672)                   = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f6981ae2000
mmap(0x7f6981ae2000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981ae2000
mmap(0x7f6981be3000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f6981be3000
close(5)                                = 0
close(4)                                = 0
open("/dev/shm/qb-attrd-event-38019-40694-10-header", O_RDWR) = 4
ftruncate(4, 8248)                      = 0
mmap(NULL, 8248, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0x7f69820fa000
open("/dev/shm/qb-attrd-event-38019-40694-10-data", O_RDWR) = 5
ftruncate(5, 1052672)                   = 0
mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f69818e0000
mmap(0x7f69818e0000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f69818e0000
mmap(0x7f69819e1000, 1052672, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 5, 0) = 0x7f69819e1000
close(5)                                = 0
close(4)                                = 0
poll([{fd=3, events=POLLIN}], 1, 0)     = 0 (Timeout)
sendto(3, "\355", 1, MSG_NOSIGNAL, NULL, 0) = 1
futex(0x7f69820ff010, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1481958460, 283442042}, ffffffff) = 0
poll([{fd=3, events=POLLIN}], 1, 0)     = 0 (Timeout)
exit_group(0)                           = ?
+++ exited with 0 +++
======

At the same time attrd blackbox traces do not show a connect/update attempt at all:

[...]
debug   Dec 17 08:07:19 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000
debug   Dec 17 08:07:29 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000
debug   Dec 17 08:07:39 attrd_client_update(320):0: Broadcasting ifspeed-o2ib0[pfs2n5] = 56000

<<< Here should be lines about ifspeed-o2ib0[pfs2n5] at 08:07:49, but no lines with that timestamp there >>>

debug   Dec 17 08:08:48 attrd_client_update(320):0: Broadcasting fail-count-ifspeed-o2ib0[pfs2n5] = 2
debug   Dec 17 08:08:48 attrd_client_update(320):0: Broadcasting last-failure-ifspeed-o2ib0[pfs2n5] = 1481958528 r_name="last-failure-ifspeed-o2ib0" task="update" attr_value="1481958528" attr_section="status" attr_host="pfs2n5" attr_is_remote="0" attr_is_private="0" attr_is_utilization="0" acl_target="hacluster" attr_user="hacluster"/>

attr_is_utilization is a local patch (http://bugs.clusterlabs.org/show_bug.cgi?id=5165)

As I understand, issue may be either in mainloop implementation, or inside the libqb.

What could be the next to look at?

Of course, we made a workaround now to run attrd_updater several times in a loop under the 'timeout' utility,
but we'd like to get to the issue roots.

What else bothers me is that there is a trash in some messages in a blackbox log:

trace   Dec 17 07:54:36 pcmk_message_common_cs(281):0: Got new^MO^A message (size=85760, 0, 822083584)

corresponding code is:

    crm_trace("Got new%s message (size=%d, %d, %d)",

              msg->is_compressed ? " compressed" : "",

              ais_data_len(msg), msg->size, msg->compressed_size);

I definitely cannot get how that binary "^MO^A" data appeared there except a heap corruption or some unusual
issue with macros. That strings (" compressed" and "") are in the RO data section...
Also numeric data (size=85760, 0, 822083584) -  that is broken too.

Nodes are equipped with ECC RAM, so single-bit errors are corrected, multi-bit ones result in a NMI.
OS is EL6, issue is seen with both pacemaker 1.1.13 and 1.1.15 (e174ec8). libqb is quite recent - 1908e6c1.

Any pointers are appreciated,

Best regards,
Vladislav