<div dir="ltr">Hi guys.<div>I've just started working with pacemaker and have a problem with a monitored service.</div><div><br>I've already configured three Active/Stand-by clusters with pacemaker.<br>Running resources:</div><div>IPaddr2<br>asterisk daemon<br>bacula fd</div><div>snmp daemon<br><br>The first and second clusters are working fine - I haven't noticed any failures.<br>But the third cluster fails too frequently.</div><div><br></div><div>The layout is the same for all clusters:<br>Master (active) <-corosync-> Slave(Stand-by)</div><div><br></div><div>There is no difference between the three clusters apart from server load.<br>The asterisk running on the third cluster has 500+ customers and processes many more calls than the others.<br><br>So the cluster periodically thinks that asterisk is not running:<br><div>* ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85, status=complete, exitreason='none',</div><div> last-rc-change='Tue Nov 7 15:06:16 2017', queued=0ms, exec=0ms<br><br>And restarts it (because of the on-fail=restart parameter on the asterisk primitive).<br>But asterisk is in fact working fine and nothing has happened to it.<br>I went through the full asterisk log and found nothing that could explain pacemaker's behavior.<br><br>All machines are virtual (not containers, but proxmox VMs). 
They have enough resources: each has 8 cores at 3 GHz and 8 GB of RAM.<br>I tried increasing the resources on the machines - I doubled them, but it changed nothing.<br>Machine resources do not seem to be the root of the problem: resource monitoring showed that the cores are never loaded more than 10%.<br><br></div></div><div>Configurations.</div><div><br>Corosync config:</div><div><div>totem {</div><div> version: 2</div><div> cluster_name: asterisk</div><div><br></div><div> token: 1000</div><div> token_retransmit: 31</div><div> hold: 31</div><div><br></div><div> token_retransmits_before_loss_const: 0</div><div><br></div><div> clear_node_high_bit: yes</div><div><br></div><div> crypto_cipher: none</div><div><br></div><div> crypto_hash: none</div><div> rrp_mode: active</div><div> transport: udpu</div><div><br></div><div> interface {</div><div> member {</div><div> memberaddr: 10.100.1.1</div><div> }</div><div> member {</div><div> memberaddr: 10.100.1.2</div><div> }</div><div><br></div><div> ringnumber: 0</div><div> bindnetaddr: 10.100.1.1</div><div><br></div><div> mcastport: 5405</div><div> ttl: 1</div><div> }</div><div>}</div></div><div><br><div>quorum {</div><div> provider: corosync_votequorum</div><div> expected_votes: 2</div><div>}</div><div><br>The logging block is omitted.</div><div><br></div><div><br><br>Pacemaker config:<br><br><div>node 178676749: node1-Master</div><div>node 178676750: node2-Slave</div><div>primitive ASTERISK systemd:asterisk \</div><div><span style="white-space:pre"> </span>op monitor interval=2s timeout=30s on-fail=restart \</div><div><span style="white-space:pre"> </span>op start interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>op stop interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>meta migration-threshold=2 failure-timeout=1800s target-role=Started</div><div>primitive BACULA systemd:bacula-fd \</div><div><span style="white-space:pre"> </span>op monitor interval=30s timeout=60s on-fail=restart 
\</div><div><span style="white-space:pre"> </span>op start interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>op stop interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>meta migration-threshold=2 failure-timeout=1800s</div><div>primitive IPSHARED IPaddr2 \</div><div><span style="white-space:pre"> </span>params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \</div><div><span style="white-space:pre"> </span>meta migration-threshold=2 target-role=Started \</div><div><span style="white-space:pre"> </span>op monitor interval=20 timeout=60 on-fail=restart</div><div>primitive SNMP systemd:snmpd \</div><div><span style="white-space:pre"> </span>op monitor interval=30s timeout=60s on-fail=restart \</div><div><span style="white-space:pre"> </span>op start interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>op stop interval=0 timeout=30s \</div><div><span style="white-space:pre"> </span>meta migration-threshold=2 failure-timeout=1800s target-role=Started</div><div>order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP</div><div>colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED</div><div>location PREFER_BACULA BACULA 100: node1-Master</div><div>location PREFER_MASTER ASTERISK 100: node1-Master</div><div>location PREFER_SNMP SNMP 100: node1-Master</div><div>property cib-bootstrap-options: \</div><div><span style="white-space:pre"> </span>cluster-recheck-interval=5s \</div><div><span style="white-space:pre"> </span>start-failure-is-fatal=false \</div><div><span style="white-space:pre"> </span>stonith-enabled=false \</div><div><span style="white-space:pre"> </span>no-quorum-policy=ignore \</div><div><span style="white-space:pre"> </span>have-watchdog=false \</div><div><span style="white-space:pre"> </span>dc-version=1.1.16-94ff4df \</div><div><span style="white-space:pre"> </span>cluster-infrastructure=corosync \</div><div><span style="white-space:pre"> 
</span>cluster-name=virtual2</div></div><div><br><br>Asterisk systemd config:<br><div><br></div><div>[Unit]</div><div>Description=Asterisk</div><div><br></div><div>[Service]</div><div>ExecStart=/etc/init.d/asterisk start</div><div>ExecStop=/etc/init.d/asterisk stop</div><div>PIDFile=/var/run/asterisk/asterisk.pid</div><div><br></div><div><br></div><div><br class="gmail-Apple-interchange-newline">Corosync log:<br><br></div></div><div>Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event:<span style="white-space:pre"> </span>Result of monitor operation for ASTERISK on node1-Master: 7 (not running) | call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request:<span style="white-space:pre"> </span>Forwarding cib_modify operation for section status to all (origin=local/crmd/106)</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>Diff: --- 0.38.37 2</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>Diff: +++ 0.38.38 (null)</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>+ /cib: @num_updates=38</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>+ /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']: @transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=85, @last-rc-change=1510059976</div><div>Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request:<span style="white-space:pre"> </span>Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/106, version=0.38.38)</div><div>Nov 07 15:06:16 [3956] node1-Master attrd: info: 
attrd_peer_update:<span style="white-space:pre"> </span>Setting fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave</div><div>Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update:<span style="white-space:pre"> </span>Setting last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from node2-Slave</div><div>Nov 07 15:06:16 [3955] node1-Master lrmd: info: cancel_recurring_action:<span style="white-space:pre"> </span>Cancelling systemd operation SNMP_status_30000</div><div>Nov 07 15:06:16 [3958] node1-Master crmd: info: do_lrm_rsc_op:<span style="white-space:pre"> </span>Performing key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0</div><div>Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute:<span style="white-space:pre"> </span>executing - rsc:SNMP action:stop call_id:89</div><div>Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event:<span style="white-space:pre"> </span>Result of monitor operation for SNMP on node1-Master: Cancelled | call=87 key=SNMP_monitor_30000 confirmed=true</div><div>Nov 07 15:06:16 [3955] node1-Master lrmd: info: systemd_exec_result:<span style="white-space:pre"> </span>Call to stop passed: /org/freedesktop/systemd1/job/10916</div><div>Nov 07 15:06:18 [3958] node1-Master crmd: notice: process_lrm_event:<span style="white-space:pre"> </span>Result of stop operation for SNMP on node1-Master: 0 (ok) | call=89 key=SNMP_stop_0 confirmed=true cib-update=107</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request:<span style="white-space:pre"> </span>Forwarding cib_modify operation for section status to all (origin=local/crmd/107)</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>Diff: --- 0.38.38 2</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>Diff: +++ 0.38.39 (null)</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: 
cib_perform_op:<span style="white-space:pre"> </span>+ /cib: @num_updates=39</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op:<span style="white-space:pre"> </span>+ /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']: @operation_key=SNMP_stop_0, @operation=stop, @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=89, @last-run=1510059976, @last-rc-change=1510059976, @exec-time=2047</div><div>Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request:<span style="white-space:pre"> </span>Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/107, version=0.38.39)</div><div>Nov 07 15:06:18 [3955] node1-Master lrmd: info: cancel_recurring_action:<span style="white-space:pre"> </span>Cancelling systemd operation ASTERISK_status_2000</div><div>Nov 07 15:06:18 [3958] node1-Master crmd: info: do_lrm_rsc_op:<span style="white-space:pre"> </span>Performing key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0</div><div>Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute:<span style="white-space:pre"> </span>executing - rsc:ASTERISK action:stop call_id:91</div><div>Nov 07 15:06:18 [3958] node1-Master crmd: info: process_lrm_event:<span style="white-space:pre"> </span>Result of monitor operation for ASTERISK on node1-Master: Cancelled | call=85 key=ASTERISK_monitor_2000 confirmed=true</div><div>Nov 07 15:06:18 [3955] node1-Master lrmd: info: systemd_exec_result:<span style="white-space:pre"> </span>Call to stop passed: /org/freedesktop/systemd1/job/10917</div></div><div><br></div><div><br><br>Asterisk with the same configuration works fine on a regular virtual machine (not in a cluster) with the same resource parameters.</div><div>So I think the problem lies in the interaction between the asterisk monitor (pacemaker) 
function and the asterisk daemon. Maybe there are delays or something like that.</div><div><br>Thanks in advance for any answers/hints.</div><div><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><span>-- <br></span>BR, Donat Zenichev
<br>Wnet VoIP team
<br>Tel: +380(44) 5-900-808
<br><a href="http://wnet.ua" target="_blank">http://wnet.ua</a></div></div>
</div></div>