[ClusterLabs] Antw: monitor operation for ASTERISK on node_name: 7 (not running)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Nov 8 16:52:38 CET 2017


Hi!

It may be worth the time to find out which command is used to monitor the
status of asterisk, then run that command repeatedly in a shell loop to see
what it returns.
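
Since the resource below is defined as systemd:asterisk, pacemaker checks the
unit's state over D-Bus; the closest command-line equivalent is `systemctl
is-active`. A minimal sketch of such a polling loop (the helper names and the
iteration count are mine, not a pacemaker API):

```shell
#!/bin/sh
# Poll the asterisk unit roughly the way the 2s pacemaker monitor does,
# printing a timestamped status on each round.

rc_to_status() {
    # Map an exit code to a readable status (helper name is mine,
    # echoing pacemaker's "7 (not running)" wording).
    [ "$1" -eq 0 ] && echo "running" || echo "not running (rc=$1)"
}

poll_unit() {
    unit=$1 rounds=$2 i=0
    while [ "$i" -lt "$rounds" ]; do
        systemctl is-active --quiet "$unit"; rc=$?
        printf '%s %s: %s\n' "$(date '+%H:%M:%S')" "$unit" "$(rc_to_status "$rc")"
        i=$((i + 1))
        sleep 2
    done
}

# poll_unit asterisk 30   # ~1 minute of polling; run it while the box is busy
```

If the status ever flips to "not running" while asterisk is actually up, you
have reproduced what pacemaker sees.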

I guess something is slow to respond; maybe just increase the timeout...
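
For the crm configuration quoted below, that would mean relaxing the ASTERISK
monitor op (a 2s interval on a heavily loaded box is quite aggressive), e.g.
via `crm configure edit ASTERISK`, changing the monitor line to something
like (the values are illustrative, not tested on your cluster):

```
op monitor interval=10s timeout=60s on-fail=restart
```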

Regards,
Ulrich

>>> Donat Zenichev <donat.zenichev at gmail.com> wrote on 07.11.2017 at 17:07 in
message
<CANLwQCnVpPpM7s7kzH0P42WXyQRQH62Wm3VNZF8gEa0YuH09ew at mail.gmail.com>:
> Hi guys.
> I've just started to work with pacemaker and have a problem with a monitored
> service.
> 
> I've already configured three Active/Stand-by clusters with pacemaker.
> Running resources:
> IPaddr2
> asterisk daemon
> bacula fd
> snmp daemon
> 
> First and second cluster are working fine - I didn't notice any failures.
> But the third cluster fails too frequently.
> 
> Schema is similar for all clusters:
> Master (active) <-corosync-> Slave(Stand-by)
> 
> There is no difference between the three clusters apart from server load.
> The Asterisk running on the third cluster has 500+ customers and processes
> far more calls than the others.
> 
> As a result, the cluster periodically decides that asterisk is not running:
> * ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85,
> status=complete, exitreason='none',
>     last-rc-change='Tue Nov  7 15:06:16 2017', queued=0ms, exec=0ms
> 
> And restarts it (because of the on-fail=restart parameter on the asterisk
> primitive). But asterisk is actually working fine and nothing is wrong with it.
> I went through the full asterisk log and found nothing that could explain
> pacemaker's behavior.
> 
> All machines are virtual (not containers, but Proxmox VMs). They have
> enough resources: each has 8 cores at 3 GHz and 8 GB of RAM.
> I tried doubling the machines' resources, but it changed nothing.
> So machine resources do not seem to be the root of the problem; resource
> monitoring showed that the cores are never loaded above 10%.
> 
> Configurations.
> 
> Corosync config:
> totem {
>         version: 2
>         cluster_name: asterisk
> 
>         token: 1000
>         token_retransmit: 31
>         hold: 31
> 
>         token_retransmits_before_loss_const: 0
> 
>         clear_node_high_bit: yes
> 
>         crypto_cipher: none
> 
>         crypto_hash: none
>         rrp_mode: active
>         transport: udpu
> 
>         interface {
>                 member {
>                         memberaddr: 10.100.1.1
>                 }
>                 member {
>                         memberaddr: 10.100.1.2
>                 }
> 
>                 ringnumber: 0
>                 bindnetaddr: 10.100.1.1
> 
>                 mcastport: 5405
>                 ttl: 1
>         }
> }
> 
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 2
> }
> 
> logging block is skipped.
> 
> 
> 
> Pacemaker config:
> 
> node 178676749: node1-Master
> node 178676750: node2-Slave
> primitive ASTERISK systemd:asterisk \
> op monitor interval=2s timeout=30s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s target-role=Started
> primitive BACULA systemd:bacula-fd \
> op monitor interval=30s timeout=60s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s
> primitive IPSHARED IPaddr2 \
> params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \
> meta migration-threshold=2 target-role=Started \
> op monitor interval=20 timeout=60 on-fail=restart
> primitive SNMP systemd:snmpd \
> op monitor interval=30s timeout=60s on-fail=restart \
> op start interval=0 timeout=30s \
> op stop interval=0 timeout=30s \
> meta migration-threshold=2 failure-timeout=1800s target-role=Started
> order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP
> colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED
> location PREFER_BACULA BACULA 100: node1-Master
> location PREFER_MASTER ASTERISK 100: node1-Master
> location PREFER_SNMP SNMP 100: node1-Master
> property cib-bootstrap-options: \
> cluster-recheck-interval=5s \
> start-failure-is-fatal=false \
> stonith-enabled=false \
> no-quorum-policy=ignore \
> have-watchdog=false \
> dc-version=1.1.16-94ff4df \
> cluster-infrastructure=corosync \
> cluster-name=virtual2
> 
> 
> Asterisk systemd config:
> 
> [Unit]
> Description=Asterisk
> 
> [Service]
> ExecStart=/etc/init.d/asterisk start
> ExecStop=/etc/init.d/asterisk stop
> PIDFile=/var/run/asterisk/asterisk.pid
> 
> 
> 
> Corosync log:
> 
> Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for ASTERISK on node1-Master: 7 (not running) | call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/106)
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: Diff: --- 0.38.37 2
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: Diff: +++ 0.38.38 (null)
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + /cib: @num_updates=38
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']: @transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=85, @last-rc-change=1510059976
> Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/106, version=0.38.38)
> Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave
> Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from node2-Slave
> Nov 07 15:06:16 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation SNMP_status_30000
> Nov 07 15:06:16 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0
> Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute: executing - rsc:SNMP action:stop call_id:89
> Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for SNMP on node1-Master: Cancelled | call=87 key=SNMP_monitor_30000 confirmed=true
> Nov 07 15:06:16 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10916
> Nov 07 15:06:18 [3958] node1-Master crmd: notice: process_lrm_event: Result of stop operation for SNMP on node1-Master: 0 (ok) | call=89 key=SNMP_stop_0 confirmed=true cib-update=107
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/107)
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: --- 0.38.38 2
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: +++ 0.38.39 (null)
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib: @num_updates=39
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']: @operation_key=SNMP_stop_0, @operation=stop, @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=89, @last-run=1510059976, @last-rc-change=1510059976, @exec-time=2047
> Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/107, version=0.38.39)
> Nov 07 15:06:18 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000
> Nov 07 15:06:18 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0
> Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute: executing - rsc:ASTERISK action:stop call_id:91
> Nov 07 15:06:18 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for ASTERISK on node1-Master: Cancelled | call=85 key=ASTERISK_monitor_2000 confirmed=true
> Nov 07 15:06:18 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10917
> 
> 
> 
> Asterisk with the same configuration works fine on a regular virtual
> machine (not clustered) with the same resource parameters.
> So I think the problem lies in the interaction between pacemaker's monitor
> operation and the asterisk daemon. Maybe delays or something similar.
> 
> Thanks in advance for answers/hints.
> 
> 
> -- 
> -- 
> BR, Donat Zenichev
> Wnet VoIP team
> Tel:  +380(44) 5-900-808
> http://wnet.ua 





