[ClusterLabs] unexpected fenced node and promotion of the new master PAF - postgres

damiano giuliani damianogiuliani87 at gmail.com
Tue Jul 13 07:42:48 EDT 2021


Hi guys,
I'm back with a PAF PostgreSQL cluster problem.
Last night the cluster fenced the master node and promoted the PAF resource on a new node.
The failover itself went fine, but I really don't know why it happened.
This morning I noticed the old master had been fenced by sbd and a new master promoted; this happened last night at 00:40:XX.
Filtering the logs, I can't find any reason for the fencing of the old master or for the start of the promotion of the new master (which itself seems to have gone perfectly). At this point I'm a bit lost, because none of us is able to find the real cause.
The cluster had worked flawlessly for days with no issues, until now.
It is crucial for me to understand why this switchover occurred.

I attached the current status, configuration and logs.
In the old master's log I can't find any reason for the fencing;
on the new master the only relevant entries are the fencing and the promotion.
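
If it helps, I can also replay the transition the scheduler saved when it decided to fence (the pe-warn file referenced in the logs below), something along these lines:

crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-1.bz2    # on ltaoperdbs03, the current DC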


PS:
could this be the reason for the fencing?

grep  -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)
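
In case the timings matter, this is how I was planning to cross-check them against each other (commands as far as I understand them, corrections welcome):

corosync-cmapctl -g runtime.config.totem.token    # effective totem token timeout (ms)
grep SBD_WATCHDOG_TIMEOUT /etc/sysconfig/sbd      # sbd watchdog timeout (5 s here)
pcs property show stonith-watchdog-timeout        # Pacemaker watchdog fencing timeout (10 s here)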

Any thoughts and help are really appreciated.

Damiano
-------------- next part --------------
pcs status
Cluster name: ltaoperdbscluster
Stack: corosync
Current DC: ltaoperdbs03 (version 1.1.23-1.el7-9acf116022) - partition with quorum
Last updated: Tue Jul 13 10:06:01 2021
Last change: Tue Jul 13 00:41:05 2021 by root via crm_attribute on ltaoperdbs03

3 nodes configured
4 resource instances configured

Online: [ ltaoperdbs03 ltaoperdbs04 ]
OFFLINE: [ ltaoperdbs02 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ ltaoperdbs03 ]
     Slaves: [ ltaoperdbs04 ]
     Stopped: [ ltaoperdbs02 ]
 pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started ltaoperdbs03

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled
[root@ltaoperdbs03 pengine]# pcs config show
Cluster Name: ltaoperdbscluster
Corosync Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04
Pacemaker Nodes:
 ltaoperdbs02 ltaoperdbs03 ltaoperdbs04

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-13/bin pgdata=/workspace/pdgs-db/13/data pgport=5432
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=25s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=25s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=172.18.2.10
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: ltaoperdbscluster
 dc-version: 1.1.23-1.el7-9acf116022
 have-watchdog: true
 last-lrm-refresh: 1625090339
 stonith-enabled: true
 stonith-watchdog-timeout: 10s

Quorum:
  Options:


stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

####SBD CONFIG
grep -v \# /etc/sysconfig/sbd | sort | uniq

SBD_DELAY_START=no
SBD_MOVE_TO_ROOT_CGROUP=auto
SBD_OPTS=
SBD_PACEMAKER=yes
SBD_STARTMODE=always
SBD_TIMEOUT_ACTION=flush,reboot
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5
-------------- next part --------------
ltaoperdbs03 cluster]# stonith_admin --verbose --history "*"
ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021

[root at ltaoperdbs03 cluster]# grep "Jul 13 00:40:" /var/log/messages
Jul 13 00:40:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85454 of user ltauser.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85455 of user nmon.
Jul 13 00:40:01 ltaoperdbs03 systemd: Started Session 85456 of user nmon.
Jul 13 00:40:02 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:40:02 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
Jul 13 00:40:35 ltaoperdbs03 corosync[228685]: [TOTEM ] A processor failed, forming new configuration.
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [TOTEM ] Failed to receive the leave message. failed: 1
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG   ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [CPG   ] downlist left_list: 1 received
Jul 13 00:40:37 ltaoperdbs03 cib[228695]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 cib[228695]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [QUORUM] Members[2]: 2 3
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]:  notice: Lost attribute writer ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Our peer on the DC (ltaoperdbs02) is dead
Jul 13 00:40:37 ltaoperdbs03 corosync[228685]: [MAIN  ] Completed service synchronization, ready to provide service.
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]:  notice: Removing all ltaoperdbs02 attributes for peer loss
Jul 13 00:40:37 ltaoperdbs03 attrd[228698]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: State transition S_NOT_DC -> S_ELECTION
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 pacemakerd[228694]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: State transition S_ELECTION -> S_INTEGRATION
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Node ltaoperdbs02 is unclean
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_demote_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsqld:2_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Action pgsql-master-ip_stop_0 on ltaoperdbs02 is unrunnable (offline)
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Scheduling Node ltaoperdbs02 for STONITH
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice:  * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice:  * Promote    pgsqld:0     ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice:  * Stop       pgsqld:2     (          Master ltaoperdbs02 )   due to node availability
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]:  notice:  * Move       pgsql-master-ip     ( ltaoperdbs02 -> ltaoperdbs03 )
Jul 13 00:40:37 ltaoperdbs03 pengine[228699]: warning: Calculated transition 0 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-1.bz2
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Initiating cancel operation pgsqld_monitor_16000 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Requesting fencing (reboot) of node ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:37 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Stonith operation 2/1:0:0:665567e9-db35-4f6f-a502-d3e9d33ee25b: OK (0)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Peer ltaoperdbs02 was terminated (reboot) by ltaoperdbs03 on behalf of crmd.228700: OK
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_demote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_demote_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_stop_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_stop_0 on ltaoperdbs04
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:47 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_pre_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Promoting instance on node "ltaoperdbs03"
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10469-1] 2021-07-13 00:40:48.399 UTC [228792] LOG:  restartpoint complete: wrote 84513 buffers (2.1%); 0 WAL file(s) added, 0 removed, 36 recycled; write=222.219 s, sync=0.011 s, total=222.261 s; sync files=505, longest=0.002 s, average=0.000 s; distance=683179 kB, estimate=792192 kB
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-1] 2021-07-13 00:40:48.400 UTC [228792] LOG:  recovery restart point at D5B/A815EE88
Jul 13 00:40:48 ltaoperdbs03 postgres[228792]: [10470-2] 2021-07-13 00:40:48.400 UTC [228792] DETAIL:  Last completed transaction was at log time 2021-07-13 00:40:34.15804+00.
Jul 13 00:40:48 ltaoperdbs03 pgsqlms(pgsqld)[34737]: INFO: Current node TL#LSN: 12#14688449270880
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:48 ltaoperdbs03 crmd[228700]:  notice: Initiating promote operation pgsqld_promote_0 locally on ltaoperdbs03
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [11-1] 2021-07-13 00:40:48.735 UTC [228791] LOG:  received promote request
Jul 13 00:40:48 ltaoperdbs03 postgres[228796]: [9-1] 2021-07-13 00:40:48.736 UTC [228796] FATAL:  terminating walreceiver process due to administrator command
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [12-1] 2021-07-13 00:40:48.737 UTC [228791] LOG:  invalid resource manager ID 32 at D5B/EBCD1460
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [13-1] 2021-07-13 00:40:48.737 UTC [228791] LOG:  redo done at D5B/EBCD1438
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [14-1] 2021-07-13 00:40:48.737 UTC [228791] LOG:  last completed transaction was at log time 2021-07-13 00:40:34.15804+00
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [15-1] 2021-07-13 00:40:48.754 UTC [228791] LOG:  selected new timeline ID: 13
Jul 13 00:40:48 ltaoperdbs03 postgres[228791]: [16-1] 2021-07-13 00:40:48.784 UTC [228791] LOG:  archive recovery complete
Jul 13 00:40:49 ltaoperdbs03 postgres[228792]: [10471-1] 2021-07-13 00:40:49.046 UTC [228792] LOG:  checkpoint starting: force
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34771]: INFO: Promote complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Result of promote operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_promote_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Initiating notify operation pgsqld_post_notify_promote_0 on ltaoperdbs04
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Result of notify operation for pgsqld on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Initiating start operation pgsql-master-ip_start_0 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Initiating monitor operation pgsqld_monitor_15000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Adding inet address 172.18.2.10/24 with broadcast address 172.18.2.255 to device bond0
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: Bringing device bond0 up
Jul 13 00:40:49 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: /usr/libexec/heartbeat/send_arp  -i 200 -r 5 -p /var/run/resource-agents/send_arp-172.18.2.10 bond0 172.18.2.10 auto not_used not_used
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Result of start operation for pgsql-master-ip on ltaoperdbs03: 0 (ok)
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Initiating monitor operation pgsql-master-ip_monitor_10000 locally on ltaoperdbs03
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: No secondary connected to the master
Jul 13 00:40:49 ltaoperdbs03 pgsqlms(pgsqld)[34822]: WARNING: "ltaoperdbs04" is not connected to the primary
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Transition aborted by nodes-3-master-pgsqld doing modify master-pgsqld=-1000: Configuration change
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: ltaoperdbs03-pgsqld_monitor_15000:27 [ /tmp:5432 - accepting connections\n ]
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Transition 0 (Complete=41, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]:  notice: Watchdog will be used via SBD if fencing is required
Jul 13 00:40:49 ltaoperdbs03 pengine[228699]:  notice: Calculated transition 1, saving inputs in /var/lib/pacemaker/pengine/pe-input-2185.bz2
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: Transition 1 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2185.bz2): Complete
Jul 13 00:40:49 ltaoperdbs03 crmd[228700]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jul 13 00:40:50 ltaoperdbs03 postgres[228789]: [8-1] 2021-07-13 00:40:50.107 UTC [228789] LOG:  database system is ready to accept connections
Jul 13 00:40:50 ltaoperdbs03 ntpd[1471]: Listen normally on 7 bond0 172.18.2.10 UDP 123
Jul 13 00:40:53 ltaoperdbs03 IPaddr2(pgsql-master-ip)[34821]: INFO: ARPING 172.18.2.10 from 172.18.2.10 bond0#012Sent 5 probes (5 broadcast(s))#012Received 0 response(s)
[root at ltaoperdbs03 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs03 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85451 of user ltauser.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85453 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 systemd: Started Session 85452 of user nmon.
Jul 13 00:39:01 ltaoperdbs03 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs03 systemd: Removed slice User Slice of ltauser.
[root@ltaoperdbs03 cluster]# grep stonith-ng /var/log/messages
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Node ltaoperdbs02 state is now lost
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Purged 1 peer with id=1 and/or uname=ltaoperdbs02 from the membership cache
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Client crmd.228700.154a9e50 wants to fence (reboot) 'ltaoperdbs02' with device '(any)'
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Requesting peer fencing (reboot) targeting ltaoperdbs02
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Couldn't find anyone to fence (reboot) ltaoperdbs02 with any device
Jul 13 00:40:37 ltaoperdbs03 stonith-ng[228696]:  notice: Waiting 10s for ltaoperdbs02 to self-fence (reboot) for client crmd.228700.f5d882d5
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Self-fencing (reboot) by ltaoperdbs02 for crmd.228700.f5d882d5-a804-4e20-bad4-7f16393d7748 assumed complete
Jul 13 00:40:47 ltaoperdbs03 stonith-ng[228696]:  notice: Operation 'reboot' targeting ltaoperdbs02 on ltaoperdbs03 for crmd.228700@ltaoperdbs03.f5d882d5: OK
[root@ltaoperdbs03 cluster]# grep  -e sbd /var/log/messages
[root@ltaoperdbs03 cluster]#
-------------- next part --------------
ltaoperdbs02 cluster]# stonith_admin --verbose --history "*"
Could not connect to fencer: Transport endpoint is not connected
[root@ltaoperdbs02 cluster]# ltaoperdbs03 was able to reboot node ltaoperdbs02 on behalf of crmd.228700 from ltaoperdbs03 at Tue Jul 13 00:40:47 2021
-bash: ltaoperdbs03: command not found
[root at ltaoperdbs02 cluster]# grep "Jul 13 00:40:" /var/log/messages
[root at ltaoperdbs02 cluster]# grep "Jul 13 00:39:" /var/log/messages
Jul 13 00:39:01 ltaoperdbs02 postgres[211289]: [13-1] 2021-07-13 00:39:01.520 UTC [211289] LOG:  duration: 5927.941 ms  execute <unnamed>: SELECT partition_id, id_medium, online, capacity, used_space, library_id FROM ism_v_available_media WHERE library_synchronizing = 'f' ORDER BY partition_id, online DESC, used_space DESC, medium ASC ;
Jul 13 00:39:01 ltaoperdbs02 systemd: Created slice User Slice of ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85420 of user ltauser.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85421 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 systemd: Started Session 85419 of user nmon.
Jul 13 00:39:01 ltaoperdbs02 postgresql-13: M&C|Monitoring|MON|LOG|service=postgresql-13|action=status|retcode=3|message="Id=postgresql-13 SubState=dead"
Jul 13 00:39:01 ltaoperdbs02 systemd: Removed slice User Slice of ltauser.
Jul 13 00:39:22 ltaoperdbs02 postgres[172262]: [18-1] 2021-07-13 00:39:22.372 UTC [172262] LOG:  duration: 664.280 ms  execute <unnamed>:  SELECT  xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011 ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND o.online_status_id = 3    GROUP BY xmf.file_id, f.size,  fp.full_path   LIMIT 7265 ;
Jul 13 00:39:24 ltaoperdbs02 postgres[219378]: [13-1] 2021-07-13 00:39:24.681 UTC [219378] LOG:  duration: 871.726 ms  statement:  UPDATE  t_srv_inventory  SET  validity = 't'  WHERE  (( (t_srv_inventory.id = 76806878) )) ;
Jul 13 00:39:27 ltaoperdbs02 postgres[219466]: [13-1] 2021-07-13 00:39:27.172 UTC [219466] LOG:  duration: 649.754 ms  statement:  INSERT INTO  t_srv_inventory  ("name", contenttype, contentlength, origindate, checksum, validitystart, validitystop, footprint, validity, filetype_id, satellite_id, mission) VALUES ('S2A_OPER_MSI_L0__GR_SGS__20160810T190057_S20160810T134256_D01_N02.04.tar', 'application/octet-stream', 18388992, '2020-12-20 09:28:19.625', '[{"Algorithm":"MD5","ChecksumDate":"2021-07-13T00:40:07.859Z","Value":"e59dd78dd277a3c18da471056c85b2bc"},{"Algorithm":"XXH","ChecksumDate":"2021-07-13T00:40:07.869Z","Value":"6fcb8cb8f8d0d353"}]', '2016-08-10 13:42:56.000', '2016-08-10 13:42:56.000', 'POLYGON((-28.801828467124398 69.405031066987505,-29.2051411416635 69.059478097980403,-28.570372603499099 68.977450843291805,-28.160811855338 69.320498378428198,-28.801828467124398 69.405031066987505))', 'f', 373, 38, -1)  ;
Jul 13 00:39:27 ltaoperdbs02 postgres[172262]: [19-1] 2021-07-13 00:39:27.270 UTC [172262] LOG:  duration: 516.499 ms  execute <unnamed>:  SELECT  xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011 ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND o.online_status_id = 3    GROUP BY xmf.file_id, f.size,  fp.full_path   LIMIT 7265 ;
Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG:  duration: 660.329 ms  execute <unnamed>:  SELECT  xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011 ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND o.online_status_id = 3    GROUP BY xmf.file_id, f.size,  fp.full_path   LIMIT 7265 ;
[root@ltaoperdbs02 cluster]# grep stonith-ng /var/log/messages
[root@ltaoperdbs02 cluster]# grep  -e sbd /var/log/messages
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]: warning: inquisitor_child: Servant pcmk is outdated (age: 4)
Jul 12 14:58:59 ltaoperdbs02 sbd[6107]:  notice: inquisitor_child: Servant pcmk is healthy (age: 0)

