[ClusterLabs] monitor operation for ASTERISK on node_name: 7 (not running)

Tue Nov 7 11:07:04 EST 2017

Hi guys.
I've just started to work with pacemaker and have a problem with monitored
service.

I've already configured three Active/Stand-by clusters with pacemaker.
Running resources:
IPaddr2
asterisk daemon
bacula fd
snmp daemon

First and second cluster are working fine - I didn't notice any failures.
But the third cluster fails too frequently.

Schema is similar for all clusters:
Master (active) <-corosync-> Slave(Stand-by)

There is no difference between three cluster, apart server loading.
Asterisk that is running on third cluster has 500+ customers and processes
much more calls than others.

So that, cluster periodically thinks that asterisk is not running:
* ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85,
status=complete, exitreason='none',
    last-rc-change='Tue Nov  7 15:06:16 2017', queued=0ms, exec=0ms

And restarts it (because on-fail=restart parameter for asterisk primitive).
But indeed asterisk is working fine and nothing happens with him.
I parsed asterisk full log and found nothing, that can explain the behavior
of pacemaker.

All machines are virtual (not containers, but proxmox VMs). They have
enough resources, each has - 8 cores 3GHz, 8GB ram.
I tried to increase resources on machines - I doubled them up, but it
changed nothing.
And it seemed to be that machine resources are not the root of the problem,
resources monitoring showed that cores are not loaded more than 10%.

Configurations.

Corosync config:
totem {
        version: 2
        cluster_name: asterisk

        token: 1000
        token_retransmit: 31
        hold: 31

        token_retransmits_before_loss_const: 0

        clear_node_high_bit: yes

        crypto_cipher: none

        crypto_hash: none
        rrp_mode: active
        transport: udpu

        interface {
                member {
                        memberaddr: 10.100.1.1
                }
                member {
                        memberaddr: 10.100.1.2
                }

                ringnumber: 0
                bindnetaddr: 10.100.1.1

                mcastport: 5405
                ttl: 1
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}

logging block is skipped.

Pacemaker config:

node 178676749: node1-Master
node 178676750: node2-Slave
primitive ASTERISK systemd:asterisk \
op monitor interval=2s timeout=30s on-fail=restart \
op start interval=0 timeout=30s \
op stop interval=0 timeout=30s \
meta migration-threshold=2 failure-timeout=1800s target-role=Started
primitive BACULA systemd:bacula-fd \
op monitor interval=30s timeout=60s on-fail=restart \
op start interval=0 timeout=30s \
op stop interval=0 timeout=30s \
meta migration-threshold=2 failure-timeout=1800s
primitive IPSHARED IPaddr2 \
params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=60 on-fail=restart
primitive SNMP systemd:snmpd \
op monitor interval=30s timeout=60s on-fail=restart \
op start interval=0 timeout=30s \
op stop interval=0 timeout=30s \
meta migration-threshold=2 failure-timeout=1800s target-role=Started
order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP
colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED
location PREFER_BACULA BACULA 100: node1-Master
location PREFER_MASTER ASTERISK 100: node1-Master
location PREFER_SNMP SNMP 100: node1-Master
property cib-bootstrap-options: \
cluster-recheck-interval=5s \
start-failure-is-fatal=false \
stonith-enabled=false \
no-quorum-policy=ignore \
have-watchdog=false \
dc-version=1.1.16-94ff4df \
cluster-infrastructure=corosync \
cluster-name=virtual2

Asterisk systemd config:

[Unit]
Description=Asterisk

[Service]
ExecStart=/etc/init.d/asterisk start
ExecStop=/etc/init.d/asterisk stop
PIDFile=/var/run/asterisk/asterisk.pid

Corosync log:

Nov 07 15:06:16 [3958] node1-Master       crmd:     info:
process_lrm_event: Result
of monitor operation for ASTERISK on node1-Master: 7 (not running) |
call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106
Nov 07 15:06:16 [3953] node1-Master        cib:     info:
cib_process_request: Forwarding cib_modify operation for section status to
all (origin=local/crmd/106)
Nov 07 15:06:16 [3953] node1-Master        cib:     info: cib_perform_op: Diff:
--- 0.38.37 2
Nov 07 15:06:16 [3953] node1-Master        cib:     info: cib_perform_op: Diff:
+++ 0.38.38 (null)
Nov 07 15:06:16 [3953] node1-Master        cib:     info: cib_perform_op: +
/cib:  @num_updates=38
Nov 07 15:06:16 [3953] node1-Master        cib:     info: cib_perform_op: +
/cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']:
@transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
@transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
@call-id=85, @last-rc-change=1510059976
Nov 07 15:06:16 [3953] node1-Master        cib:     info:
cib_process_request: Completed cib_modify operation for section status: OK
(rc=0, origin=node1-Master/crmd/106, version=0.38.38)
Nov 07 15:06:16 [3956] node1-Master      attrd:     info:
attrd_peer_update: Setting
fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave
Nov 07 15:06:16 [3956] node1-Master      attrd:     info:
attrd_peer_update: Setting
last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from
node2-Slave
Nov 07 15:06:16 [3955] node1-Master       lrmd:     info:
cancel_recurring_action: Cancelling systemd operation SNMP_status_30000
Nov 07 15:06:16 [3958] node1-Master       crmd:     info:
do_lrm_rsc_op: Performing
key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0
Nov 07 15:06:16 [3955] node1-Master       lrmd:     info: log_execute:
executing
- rsc:SNMP action:stop call_id:89
Nov 07 15:06:16 [3958] node1-Master       crmd:     info:
process_lrm_event: Result
of monitor operation for SNMP on node1-Master: Cancelled | call=87
key=SNMP_monitor_30000 confirmed=true
Nov 07 15:06:16 [3955] node1-Master       lrmd:     info:
systemd_exec_result: Call to stop passed:
/org/freedesktop/systemd1/job/10916
Nov 07 15:06:18 [3958] node1-Master       crmd:   notice:
process_lrm_event: Result
of stop operation for SNMP on node1-Master: 0 (ok) | call=89
key=SNMP_stop_0 confirmed=true cib-update=107
Nov 07 15:06:18 [3953] node1-Master        cib:     info:
cib_process_request: Forwarding cib_modify operation for section status to
all (origin=local/crmd/107)
Nov 07 15:06:18 [3953] node1-Master        cib:     info: cib_perform_op: Diff:
--- 0.38.38 2
Nov 07 15:06:18 [3953] node1-Master        cib:     info: cib_perform_op: Diff:
+++ 0.38.39 (null)
Nov 07 15:06:18 [3953] node1-Master        cib:     info: cib_perform_op: +
/cib:  @num_updates=39
Nov 07 15:06:18 [3953] node1-Master        cib:     info: cib_perform_op: +
/cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']:
@operation_key=SNMP_stop_0, @operation=stop,
@transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
@transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934,
@call-id=89, @last-run=1510059976, @last-rc-change=1510059976,
@exec-time=2047
Nov 07 15:06:18 [3953] node1-Master        cib:     info:
cib_process_request: Completed cib_modify operation for section status: OK
(rc=0, origin=node1-Master/crmd/107, version=0.38.39)
Nov 07 15:06:18 [3955] node1-Master       lrmd:     info:
cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000
Nov 07 15:06:18 [3958] node1-Master       crmd:     info:
do_lrm_rsc_op: Performing
key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0
Nov 07 15:06:18 [3955] node1-Master       lrmd:     info: log_execute:
executing
- rsc:ASTERISK action:stop call_id:91
Nov 07 15:06:18 [3958] node1-Master       crmd:     info:
process_lrm_event: Result
of monitor operation for ASTERISK on node1-Master: Cancelled | call=85
key=ASTERISK_monitor_2000 confirmed=true
Nov 07 15:06:18 [3955] node1-Master       lrmd:     info:
systemd_exec_result: Call to stop passed:
/org/freedesktop/systemd1/job/10917

Asterisk with the same asterisk's configurations works fine on regular
virtual machine (not cluster), with the same resource parameters.
So I think, the problem consists of interaction between asterisk monitor
(pacemaker) function and asterisk daemon. May be delays or something like.

Thanks in advance for answers/hints.

-- 
-- 
BR, Donat Zenichev
Wnet VoIP team
Tel:  +380(44) 5-900-808
http://wnet.ua
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20171107/f27c34ca/attachment-0002.html>