[Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs

Phillip Frost phil at macprofessionals.com
Wed Mar 14 12:55:42 EDT 2012


On Mar 14, 2012, at 12:33 PM, Florian Haas wrote:

>> However, sometimes pacemakerd will not stop cleanly.
> 
> OK. Whether this is related to your original problem or not is a
> completely open question, jftr.
> 
>> I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit, but fails to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. If asked immediately after stopping pacemaker and corosync on one node, "crm status" on the other nodes will report that node as still online. After a while, the stopped node switches to offline; I assume some timeout expires and they conclude it has crashed.
> 
> You didn't give much other information, so I'm asking this on a hunch:
> does your pacemaker service configuration stanza for corosync (either
> in /etc/corosync/corosync.conf or in
> /etc/corosync/service.d/pacemaker) say "ver: 0" or "ver: 1"?

I'm not sure whether this is the same problem. I did experience a symptom that looked, to my inexperienced eyes, very similar before I installed 1.0.9+hg2665-1~bpo60+2: I'd try to stop pacemaker, it wouldn't stop, and I'd get that flood of retransmits in syslog.

To answer your question, I am using "ver: 1". It's worth mentioning that the corosync.conf shipped with the squeeze-backports packages has a service block with ver: 0 in it, which took me some time to discover; I removed it long ago, though. Syslog seems to confirm that ver: 1 is in effect:

Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found 'pacemaker' for option: name
Mar 14 12:02:34 xenhost02 pacemakerd: [7925]: info: get_config_opt: Found '1' for option: ver
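
For reference, here is the stanza I'm using now (a minimal sketch of my own file; I moved it out of corosync.conf into a separate file, but either location should work):

    # /etc/corosync/service.d/pacemaker
    service {
        # load the Pacemaker plugin
        name: pacemaker
        # ver: 1 = corosync does not spawn the Pacemaker daemons itself;
        # pacemakerd is started separately (e.g. by its init script)
        ver: 1
    }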

After playing with this system more, it seems this flood of "Retransmit List" messages to syslog happens not only on pacemakerd shutdown. For example, I was just trying to add a DRBD resource, and crm hung at "cib commit":

crm(drbd)# cib commit drbd
[long pause, several minutes]
Could not commit shadow instance 'drbd' to the CIB: Remote node did not respond
ERROR: failed to commit the drbd shadow CIB

"corosync[7915]:   [TOTEM ] Retransmit List: b7 b8 b9" is being flooded to syslog.

Every time I try to reproduce this, I can trigger it once or twice, but then no more. I'm beginning to think that for the problem to appear, a node has to have been running for some time; I can reproduce it a few times only because I try it on each node in turn. Then I have to restart corosync on each node to get things working again, and after that everything is fine, until I move on, spend some time reading documentation, and try again.
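
Next time it wedges, I plan to check the totem ring status on each node before restarting anything. If I'm reading the man page right, corosync-cfgtool should show whether corosync itself thinks the ring is healthy (output below is paraphrased from memory, not captured from a real run):

    # query the status of the totem ring(s) on the local node
    corosync-cfgtool -s

    # Printing ring status.
    # Local node ID 1234567890
    # RING ID 0
    #         id      = 10.0.0.2
    #         status  = ring 0 active with no faults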

I'm assuming these "Retransmit List" messages in syslog indicate that corosync sent a message to the other nodes, did not receive an acknowledgement, and is therefore trying to resend it. I know corosync uses IP multicast to communicate with the other nodes. Is it possible that my network is doing something that breaks multicast connectivity? Multicast IP isn't something I've ever had to deal with, so I'm not really sure. It's hard to find anything on configuring a network for multicast that doesn't immediately start talking about IP routers, which isn't relevant in my setup because all the cluster nodes are on the same VLAN, on the same switch. Could this be an issue? Is there a lower-level utility (like ping) that I can use to verify multicast connectivity?
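
For what it's worth, I did turn up omping, which looks like it's built for exactly this: run simultaneously on every node, it reports whether multicast (and unicast) packets make it between them. This is my guess at an invocation based on its man page; the multicast address and port should be copied from your corosync.conf (mine below are the Debian sample defaults), and the hostnames other than xenhost02 are placeholders:

    # run the same command on every node at the same time;
    # -m = multicast group, -p = port (copy from corosync.conf)
    omping -m 226.94.1.1 -p 5405 xenhost01 xenhost02 xenhost03

If multicast is broken (a switch filtering it via IGMP snooping, say), the unicast responses should still arrive while the multicast ones never do, which would point the finger at the network rather than at corosync.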



