I have a Cisco ISR4431 acting as an internet edge router that has been randomly rebooting every 5 days or so. When it reboots, it takes anywhere from 10 to 60 minutes before it is back up and network traffic is flowing normally. It is running BGP and routing for a /19 and a /20 network, so it should be a relatively small load for this class of box.
The only suspicious thing I see is that 94% of the memory is consumed, so I suspect it is holding more BGP routes than it should, though this same config has been working on an older router for years without becoming unstable. I'm not really sure how to diagnose the issue further, and I don't know whether this is a hardware or a config problem.
Unfortunately the router is on the other side of the country and I have no way of physically getting to it until the quarantine is over.
sh ver:
Cisco IOS XE Software, Version 03.16.04b.S - Extended Support Release
Cisco IOS Software, ISR Software (X86_64_LINUX_IOSD-UNIVERSALK9-M), Version 15.5(3)S4b, RELEASE SOFTWARE (fc1)
sh logging
*Apr 28 14:47:09.074: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/2, changed state to up
*Apr 28 14:47:10.074: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/0/2, changed state to up
*Apr 28 14:50:12.834: %PLATFORM-4-ELEMENT_WARNING:smand: RP/0: Committed Memory value 94% exceeds warning level 90%
*Apr 28 14:52:00.253: %IOSXE_INFRA-6-PROCPATH_CLIENT_HOG: IOS shim client 'fman stats bipc' took 685 msec (runtime: 256 msec) to process a 'tdl_qfpmib_throughput_data' message
*Apr 28 15:00:14.511: %PLATFORM-4-ELEMENT_WARNING:smand: RP/0: Committed Memory value 94% exceeds warning level 90%
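To see whether the committed memory figure climbs in the run-up to a reboot, the warnings could be pulled out of a saved log and tracked over time. The following is only a minimal Python sketch, assuming the log has been copied off the router as plain text; the `LOG_LINES` sample is taken from the output above, and the names `PATTERN` and `committed_memory_samples` are made up for illustration.

```python
import re

# Sample lines copied from the "sh logging" output above.
LOG_LINES = [
    "*Apr 28 14:47:10.074: %LINEPROTO-5-UPDOWN: Line protocol on Interface "
    "GigabitEthernet0/0/2, changed state to up",
    "*Apr 28 14:50:12.834: %PLATFORM-4-ELEMENT_WARNING:smand: RP/0: "
    "Committed Memory value 94% exceeds warning level 90%",
    "*Apr 28 15:00:14.511: %PLATFORM-4-ELEMENT_WARNING:smand: RP/0: "
    "Committed Memory value 94% exceeds warning level 90%",
]

# Matches only the committed-memory warnings and captures the timestamp
# and the reported percentage.
PATTERN = re.compile(
    r"^\*(?P<stamp>\w{3} +\d+ [\d:.]+): %PLATFORM-4-ELEMENT_WARNING:"
    r".*Committed Memory value (?P<pct>\d+)% exceeds warning level \d+%"
)


def committed_memory_samples(lines):
    """Return (timestamp, percent) pairs for each committed-memory warning."""
    samples = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            samples.append((match.group("stamp"), int(match.group("pct"))))
    return samples


if __name__ == "__main__":
    for stamp, pct in committed_memory_samples(LOG_LINES):
        print(f"{stamp}  committed memory {pct}%")
```

With only two samples, both at 94%, this says nothing about a trend yet; it would need a log covering the days before a reboot.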
sh processes cpu sorted
CPU utilization for five seconds: 13%/0%; one minute: 3%; five minutes: 3%
 PID Runtime(ms)  Invoked  uSecs    5Sec   1Min   5Min TTY Process
 193      230311     5004  46025  12.39%  1.63%  1.22%   0 BGP Scanner
 117       22772   228335     99   0.15%  0.10%  0.10%   0 IOSXE-RP Punt Se
 240       31843  1902016     16   0.07%  0.14%  0.15%   0 Inline Power
 414        2694    20294    132   0.07%  0.00%  0.00%   0 NTP
 284       18520   605984     30   0.07%  0.09%  0.08%   0 HTTP CORE
The BGP section of the config looks like this:
router bgp 7835
 no bgp log-neighbor-changes
 neighbor ZZ.ZZ.6.113 remote-as XXX
 neighbor ZZ.ZZ.6.113 password XXXXXX
 !
 address-family ipv4
  network XX.XX.160.0 mask 255.255.240.0
  network YY.YY.64.0 mask 255.255.224.0
  network YY.YY.79.0
  neighbor ZZ.ZZ.6.113 activate
  neighbor ZZ.ZZ.6.113 soft-reconfiguration inbound
  neighbor ZZ.ZZ.6.113 filter-list 1 out
 exit-address-family
!
Some further diagnostics:
sh platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               C
 Control Processor       32.12%                100%            90%             95%             H
 DRAM                    3849MB(99%)           3872MB          90%             95%             C
ESP0(ok, active)                                                                               H
 QFP                                                                                           H
  DRAM                   1663176KB(79%)        2097152KB       80%             90%             H
  IRAM                   0KB(0%)               0KB             80%             90%             H
Memory
show processes memory sorted
Processor Pool Total: 1688347248 Used: 1417980160 Free:  270367088
 lsmpi_io Pool Total:    6295128 Used:    6294296 Free:        832
 PID TTY  Allocated      Freed    Holding    Getbufs    Retbufs Process
 510   0  904032136   54730248  901424352          0          0 BGP Router
 271   0  257116280    1297600  256693920          0          0 IP RIB Update
   0   0  352326368  108678280  227122576          0          0 *Init*
  79   0    8209072      12176    7592984          0          0 IOSD ipc task
 389   0    3889024       5160    3925856     799092          0 EEM ED Syslog
 409   0    1439256      26792    1442328          0          0 EEM Server
 155   0    3223184      91024    1057808          0          0 CWAN OIR Handler
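As a rough sanity check on these figures, here is a minimal Python sketch that works out how much of the processor pool each of the largest processes is holding. The numbers are copied from the output above; the names `POOL_TOTAL`, `HOLDING`, `share`, and `summarize` are just illustrative. Note that this pool total (about 1.7 GB) is not the same measurement as the 3872MB RP0 DRAM figure from `sh platform resources`, so the percentages only describe the processor pool.

```python
# Figures copied from the "show processes memory sorted" output above.
POOL_TOTAL = 1_688_347_248  # Processor Pool Total (bytes)
POOL_USED = 1_417_980_160   # Processor Pool Used (bytes)

# "Holding" column (bytes) for the three largest processes.
HOLDING = {
    "BGP Router": 901_424_352,
    "IP RIB Update": 256_693_920,
    "*Init*": 227_122_576,
}


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def summarize() -> dict:
    """Map each label to its percentage of the processor pool."""
    result = {"pool used": share(POOL_USED, POOL_TOTAL)}
    for name, held in HOLDING.items():
        result[name] = share(held, POOL_TOTAL)
    return result


if __name__ == "__main__":
    for label, pct in summarize().items():
        print(f"{label:15s} {pct:5.1f}% of processor pool")
```

By this arithmetic, BGP Router alone holds a bit over half of the pool, and BGP Router plus IP RIB Update together hold roughly two thirds, which is at least consistent with my suspicion that the routing table is what is eating the memory.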