RING SQA
The purpose of RING SQA is to detect, as quickly as possible, outages that affect only a subset of all internet destinations.
RING SQA pings all other nodes (IPv4 + IPv6) every 30 seconds to derive a baseline, and compares this baseline to the last 3 minutes of measurements. If the number of unreachable nodes exceeds the threshold derived from the baseline median for three consecutive minutes, an alarm is raised.
When an alarm is raised, three MTRs are immediately launched towards destinations that were previously reachable but suddenly are not. The purpose of these traces is to provide an investigation starting point for your NOC.
All in all super fast outage detection. All participants are invited to use this system! Gratis! :-)
You can configure where alerts should be emailed by changing the /etc/ring-sqa/alarm.conf file on your own RING node(s) to something like this (do keep in mind the indenting!):
job@ringnode01.ring.nlnog.net:~$ sudo cat /etc/ring-sqa/alarm.conf
---
email:
  to: noc@yourcompany.com
  from: sqa-alert@your_ring_node.ring.nlnog.net
  prefix: 'RING ALERT '
irc:
  host: 1.2.3.4
  port: 5502
  password: derp
  channel: ! '#noc'
Afterwards restart the ring-sqa daemons to load the new config:
job@ringnode01:~$ sudo systemctl restart ring-sqa4
job@ringnode01:~$ sudo systemctl restart ring-sqa6
Et voilà! After 30 minutes the machine will stand guard over your network. RING participants with multiple hubs or datacenters will benefit from spinning up more nodes, as monitoring is done from each RING node's individual perspective.
We extend a HUGE thank you to Saku Ytti who wrote RING SQA. Please send him beer, chocolate, and flowers.
Below we’ve included an example outage alert.
------------------ Example RING SQA Message ---------------------------
From: sqa@xing02.ring.nlnog.net
To: noc@
Subject: RING ALERT raising ipv4 alarm - 16 new nodes down
Body:
Regarding: xing02.ring.nlnog.net ipv4
This is an automated alert from the distributed partial outage
monitoring system "RING SQA".
At 2014-07-27 10:18:05 UTC the following measurements were analysed
as indicating that there is a high probability your NLNOG RING node
cannot reach the entire internet. Possible causes could be an outage
in your upstream's or peer's network.
The following nodes previously were reachable, but became unreachable
over the course of the last 3 minutes:
- itps01.ring.nlnog.net 128.65.97.93 AS42010 GB
- fullsave01.ring.nlnog.net 141.0.202.201 AS39405 FR
- globalaxs01.ring.nlnog.net 176.10.80.10 AS 9009 GB
- kwaoo01.ring.nlnog.net 178.250.209.33 AS24904 CH
- suretec01.ring.nlnog.net 185.8.92.17 AS199659 GB
- swisscom01.ring.nlnog.net 193.247.170.254 AS 3303 CH
- claranet01.ring.nlnog.net 195.157.9.4 AS 8426 GB
- claranet04.ring.nlnog.net 195.22.19.34 AS 8426 PT
- dcsone01.ring.nlnog.net 203.123.48.14 AS37989 SG
- trueinternet01.ring.nlnog.net 203.144.167.57 AS 7470 TH
- jump01.ring.nlnog.net 212.13.217.117 AS 8943 GB
- lchost01.ring.nlnog.net 213.230.217.125 AS25098 GB
- suomi01.ring.nlnog.net 217.119.42.194 AS16302 FI
- melbourne01.ring.nlnog.net 37.128.187.253 AS39451 GB
- netability01.ring.nlnog.net 46.182.9.20 AS 1197 IE
- viatel02.ring.nlnog.net 46.183.108.2 AS31122 FR
- claranet06.ring.nlnog.net 92.54.7.29 AS 8426 ES
As a debug starting point 3 traceroutes were launched right after
detecting the event, they might assist in pinpointing what broke:
trueinternet01.ring.nlnog.net AS 7470 (TH)
mtr -i0.5 -c5 -r -w -n 203.144.167.57
1.|-- 109.233.156.241 0.0% 6 0.5 0.5 0.5 0.6 0.0
2.|-- 109.233.156.1 0.0% 5 0.8 0.9 0.8 1.1 0.1
3.|-- 109.233.156.2 0.0% 5 0.8 0.8 0.8 0.9 0.0
4.|-- 64.209.88.33 0.0% 5 0.9 1.0 0.9 1.5 0.3
5.|-- 159.63.23.198 60.0% 5 265.1 264.9 264.7 265.1 0.3
6.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
7.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
8.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
9.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
10.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
11.|-- 203.144.144.30 80.0% 5 297.4 297.4 297.4 297.4 0.0
12.|-- ??? 100.0 4 0.0 0.0 0.0 0.0 0.0
fullsave01.ring.nlnog.net AS39405 (FR)
mtr -i0.5 -c5 -r -w -n 141.0.202.201
1.|-- 109.233.156.241 0.0% 6 0.5 0.5 0.5 0.5 0.0
2.|-- 109.233.156.1 0.0% 5 0.8 3.2 0.8 12.2 5.0
3.|-- 109.233.156.2 0.0% 5 0.8 0.9 0.8 1.0 0.1
4.|-- 109.233.156.37 0.0% 5 1.0 1.0 0.9 1.5 0.3
5.|-- 149.11.106.1 0.0% 5 1.1 1.4 1.1 1.7 0.2
6.|-- 130.117.3.137 0.0% 5 1.5 1.7 1.5 1.8 0.2
7.|-- 154.54.62.77 0.0% 5 11.4 11.7 11.3 13.0 0.7
8.|-- 154.54.75.154 0.0% 5 201.0 166.9 66.9 323.0 101.5
9.|-- 154.54.56.214 0.0% 5 23.0 23.0 22.8 23.0 0.1
10.|-- 149.11.58.62 80.0% 5 26.4 26.4 26.4 26.4 0.0
11.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
12.|-- 141.0.202.201 80.0% 5 25.0 25.0 25.0 25.0 0.0
globalaxs01.ring.nlnog.net AS 9009 (GB)
mtr -i0.5 -c5 -r -w -n 176.10.80.10
1.|-- 109.233.156.241 0.0% 6 0.4 0.5 0.4 0.5 0.0
2.|-- 109.233.156.1 0.0% 5 0.9 1.8 0.7 5.3 1.9
3.|-- 81.201.115.41 0.0% 5 0.9 0.9 0.8 1.0 0.1
4.|-- 62.209.32.18 40.0% 5 1.3 1.2 1.2 1.3 0.1
5.|-- 80.81.192.165 0.0% 5 1.3 9.3 1.2 41.5 18.0
6.|-- 193.27.64.245 60.0% 5 191.9 108.1 24.3 191.9 118.5
7.|-- 193.27.64.66 80.0% 5 43.6 43.6 43.6 43.6 0.0
8.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
9.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
10.|-- 176.10.80.2 80.0% 5 26.1 26.1 26.1 26.1 0.0
11.|-- ??? 100.0 5 0.0 0.0 0.0 0.0 0.0
12.|-- ??? 100.0 4 0.0 0.0 0.0 0.0 0.0
13.|-- ??? 100.0 3 0.0 0.0 0.0 0.0 0.0
14.|-- ??? 100.0 2 0.0 0.0 0.0 0.0 0.0
15.|-- 176.10.80.10 0.0% 1 24.3 24.3 24.3 24.3 0.0
An alarm is raised under the following conditions: every 30 seconds
your node pings all other nodes. The amount of nodes that cannot be
reached is stored in a circular buffer, with each element representing
a minute of measurements. In the event that the last three minutes are
1.2 above the median of the previous 27 measurement slots, a partial
outage is assumed. The ring buffer's output is as following:
29 min ago 41 measurements failed (baseline)
28 min ago 41 measurements failed (baseline)
27 min ago 41 measurements failed (baseline)
26 min ago 42 measurements failed (baseline)
25 min ago 41 measurements failed (baseline)
24 min ago 41 measurements failed (baseline)
23 min ago 41 measurements failed (baseline)
22 min ago 41 measurements failed (baseline)
21 min ago 41 measurements failed (baseline)
20 min ago 41 measurements failed (baseline)
19 min ago 41 measurements failed (baseline)
18 min ago 41 measurements failed (baseline)
17 min ago 41 measurements failed (baseline)
16 min ago 41 measurements failed (baseline)
15 min ago 41 measurements failed (baseline)
14 min ago 41 measurements failed (baseline)
13 min ago 41 measurements failed (baseline)
12 min ago 41 measurements failed (baseline)
11 min ago 41 measurements failed (baseline)
10 min ago 41 measurements failed (baseline)
9 min ago 41 measurements failed (baseline)
8 min ago 41 measurements failed (baseline)
7 min ago 41 measurements failed (baseline)
6 min ago 41 measurements failed (baseline)
5 min ago 41 measurements failed (baseline)
4 min ago 41 measurements failed (baseline)
3 min ago 45 measurements failed (baseline)
2 min ago 66 measurements failed (raised alarm)
1 min ago 65 measurements failed (raised alarm)
0 min ago 65 measurements failed (raised alarm)
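The alarm condition described in the example message can be sketched in a few lines of Python. This is a minimal illustration, not RING SQA's actual code: the function name alarm_raised and its parameters are hypothetical, and it assumes that "1.2 above the median" means each of the last three per-minute counts exceeds 1.2 times the median of the 27 baseline slots.

```python
from statistics import median


def alarm_raised(failed_per_minute, recent=3, factor=1.2):
    """Return True if every one of the most recent slots exceeds the threshold.

    failed_per_minute: failure counts per minute, oldest first.
    recent and factor are assumed values taken from the example message.
    """
    baseline = failed_per_minute[:-recent]   # the 27 older slots
    latest = failed_per_minute[-recent:]     # the last three minutes
    threshold = factor * median(baseline)    # assumption: multiplicative tolerance
    return all(count > threshold for count in latest)


# The 30-slot buffer from the example message, oldest first.
buffer = [41, 41, 41, 42] + [41] * 22 + [45, 66, 65, 65]
print(alarm_raised(buffer))
```

With the buffer from the example message the baseline median is 41, so the assumed threshold is 49.2, and the last three counts (66, 65, 65) all exceed it.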
Exporting metrics to InfluxDB
If you’re running InfluxDB it is possible to export metrics from RING SQA. Both InfluxDB v1.8 and v2 are supported. We assume you have a running InfluxDB server available with credentials to publish metrics in a bucket or database.
You need to add the following section to your configuration (/etc/ring-sqa/main.conf) to export metrics to InfluxDB:
influxdb:
  url: "http://server:8086"   # your InfluxDB server
  bucket: "ring-sqa"          # for InfluxDB v1.8 this is the database
  token: ""                   # for InfluxDB v1.8 use format: <username>:<password>
  org: "nlnog"                # for InfluxDB v1.8 this can be left empty
This provides the ring-sqa_measurements metric, which has the following fields:
latency = round trip time in microseconds to a destination
state = availability of a destination (0 for unreachable, 1 for available)
All measurements are tagged with the following labels:
afi = ipv4/ipv6
src_node = Source node
dst_node = Destination node
dst_cc = Destination node country
dst_lat = Destination node latitude
dst_lon = Destination node longitude
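To make the shape of an exported data point concrete, the sketch below renders one ring-sqa_measurements point in InfluxDB line protocol. RING SQA writes these points itself, so this is purely illustrative: the helper names, the integer field types, and the sample coordinates, latency, and timestamp are all assumptions. Only the measurement, field, and tag names come from the lists above.

```python
def escape_tag(value):
    """Backslash-escape the characters that line protocol reserves in tags."""
    text = str(value)
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as: measurement,tag=value,... field=value,... timestamp"""
    tag_part = ",".join(
        f"{escape_tag(key)}={escape_tag(val)}" for key, val in sorted(tags.items())
    )
    # Assumption: both fields are stored as integers (the "i" suffix).
    field_part = ",".join(f"{key}={int(val)}i" for key, val in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


point = to_line_protocol(
    "ring-sqa_measurements",
    tags={
        "afi": "ipv4",
        "src_node": "xing02.ring.nlnog.net",
        "dst_node": "claranet01.ring.nlnog.net",
        "dst_cc": "GB",
        "dst_lat": "51.5",   # hypothetical coordinates
        "dst_lon": "-0.1",
    },
    fields={"latency": 24300, "state": 1},  # hypothetical: 24.3 ms, reachable
    timestamp_ns=1406456285000000000,       # hypothetical timestamp
)
print(point)
```

Seeing the point laid out this way can help when writing queries: the labels are tags you can filter and group on, while latency and state are the values you aggregate.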
Exporting metrics to Graphite
Metrics can be exported to Graphite as well. The following statement needs to be added to your configuration to export metrics:
graphite: hostname:port
The following metrics will be exported:
nlnog.ring_sqa.<address family>.<host>.<countrycode>.<node>.state
nlnog.ring_sqa.<address family>.<host>.<countrycode>.<node>.latency
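As a rough illustration of the naming scheme, the sketch below builds both metric paths for one measurement and formats them as lines of Graphite's plaintext protocol (path, value, timestamp). How RING SQA fills in the placeholders is not specified here, so the reading of <host> as the measuring node and <node> as the destination, the short node names, and the sample values are all assumptions. metric_paths and plaintext_lines are hypothetical helpers.

```python
PREFIX = "nlnog.ring_sqa"


def metric_paths(afi, host, country_code, node):
    """Build the state and latency paths for one source/destination pair."""
    base = ".".join([PREFIX, afi, host, country_code, node])
    return {"state": f"{base}.state", "latency": f"{base}.latency"}


def plaintext_lines(paths, state, latency, timestamp):
    """Format both metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return [
        f"{paths['state']} {state} {timestamp}",
        f"{paths['latency']} {latency} {timestamp}",
    ]


# Hypothetical sample: xing02 measuring claranet01 (GB) over IPv4.
paths = metric_paths("ipv4", "xing02", "GB", "claranet01")
for line in plaintext_lines(paths, state=1, latency=24300, timestamp=1406456285):
    print(line)
```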