RING SQA

The purpose of RING SQA is to detect outages as fast as possible that only affect a subset of all internet destinations.

RING SQA pings all other nodes (IPv4 + IPv6) every 30 seconds to derive a baseline, this baseline is compared to the last 3 minutes of measurements. If the median of the baseline is tripped for three consecutives minutes, an alarm is raised.

When an alarm is raised, three MTRs are immediately launched towards destinations that previously were reachable, but suddenly not anymore. The purpose of these traces is to provide an investigation starting point for your NOC.

All in all super fast outage detection. All participants are invited to use this system! Gratis! :-)

One can simply configure where alerts should be emailed by changing the /etc/ring-sqa/alarm.conf file on your own RING node(s) to something like this (do keep in mind the indenting!):

job@ringnode01.ring.nlnog.net:~$ sudo cat /etc/ring-sqa/alarm.conf
---
email:
  to: noc@yourcompany.com
  from: sqa-alert@ your_ring_node .ring.nlnog.net
  prefix: 'RING ALERT '
irc:
  host: 1.2.3.4
  port: 5502
  password: derp
  channel: ! '#noc'

Afterwards restart the ring-sqa daemons to load the new config:

job@ringnode01:~$ sudo systemctl restart ring-sqa4
job@ringnode01:~$ sudo systemctl restart ring-sqa6

Et voila! After 30 minutes the machine will stand guard over your network. RING participants with multiple hubs or datacenters will benefit from spinning up more nodes, as monitoring is from each RING nodes individual perspective.

We extend a HUGE thank you to Saku Ytti who wrote RING SQA. Please send him beer, chocolate, and flowers.

Below we’ve included an example outage alert.

------------------ Example RING SQA Message ---------------------------

From: sqa@xing02.ring.nlnog.net
To: noc@
Subject: RING ALERT raising ipv4 alarm - 16 new nodes down
Body:

Regarding: xing02.ring.nlnog.net ipv4

This is an automated alert from the distributed partial outage 
monitoring system "RING SQA".

At 2014-07-27 10:18:05 UTC the following measurements were analysed 
as indicating that there is a high probability your NLNOG RING node 
cannot reach the entire internet. Possible causes could be an outage 
in your upstream's or peer's network.

The following nodes previously were reachable, but became unreachable 
over the course of the last 3 minutes:

- itps01.ring.nlnog.net            128.65.97.93 AS42010 GB
- fullsave01.ring.nlnog.net       141.0.202.201 AS39405 FR
- globalaxs01.ring.nlnog.net       176.10.80.10 AS 9009 GB
- kwaoo01.ring.nlnog.net         178.250.209.33 AS24904 CH
- suretec01.ring.nlnog.net          185.8.92.17 AS199659 GB
- swisscom01.ring.nlnog.net      193.247.170.254 AS 3303 CH
- claranet01.ring.nlnog.net         195.157.9.4 AS 8426 GB
- claranet04.ring.nlnog.net        195.22.19.34 AS 8426 PT
- dcsone01.ring.nlnog.net         203.123.48.14 AS37989 SG
- trueinternet01.ring.nlnog.net  203.144.167.57 AS 7470 TH
- jump01.ring.nlnog.net          212.13.217.117 AS 8943 GB
- lchost01.ring.nlnog.net        213.230.217.125 AS25098 GB
- suomi01.ring.nlnog.net         217.119.42.194 AS16302 FI
- melbourne01.ring.nlnog.net     37.128.187.253 AS39451 GB
- netability01.ring.nlnog.net       46.182.9.20 AS 1197 IE
- viatel02.ring.nlnog.net          46.183.108.2 AS31122 FR
- claranet06.ring.nlnog.net          92.54.7.29 AS 8426 ES


As a debug starting point 3 traceroutes were launched right after 
detecting the event, they might assist in pinpointing what broke:

trueinternet01.ring.nlnog.net  AS 7470 (TH)
mtr -i0.5 -c5 -r -w -n 203.144.167.57
  1.|-- 109.233.156.241        0.0%     6    0.5   0.5   0.5   0.6   0.0
  2.|-- 109.233.156.1          0.0%     5    0.8   0.9   0.8   1.1   0.1
  3.|-- 109.233.156.2          0.0%     5    0.8   0.8   0.8   0.9   0.0
  4.|-- 64.209.88.33           0.0%     5    0.9   1.0   0.9   1.5   0.3
  5.|-- 159.63.23.198         60.0%     5  265.1 264.9 264.7 265.1   0.3
  6.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  8.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 11.|-- 203.144.144.30        80.0%     5  297.4 297.4 297.4 297.4   0.0
 12.|-- ???                   100.0     4    0.0   0.0   0.0   0.0   0.0

fullsave01.ring.nlnog.net      AS39405 (FR)
mtr -i0.5 -c5 -r -w -n 141.0.202.201
  1.|-- 109.233.156.241        0.0%     6    0.5   0.5   0.5   0.5   0.0
  2.|-- 109.233.156.1          0.0%     5    0.8   3.2   0.8  12.2   5.0
  3.|-- 109.233.156.2          0.0%     5    0.8   0.9   0.8   1.0   0.1
  4.|-- 109.233.156.37         0.0%     5    1.0   1.0   0.9   1.5   0.3
  5.|-- 149.11.106.1           0.0%     5    1.1   1.4   1.1   1.7   0.2
  6.|-- 130.117.3.137          0.0%     5    1.5   1.7   1.5   1.8   0.2
  7.|-- 154.54.62.77           0.0%     5   11.4  11.7  11.3  13.0   0.7
  8.|-- 154.54.75.154          0.0%     5  201.0 166.9  66.9 323.0 101.5
  9.|-- 154.54.56.214          0.0%     5   23.0  23.0  22.8  23.0   0.1
 10.|-- 149.11.58.62          80.0%     5   26.4  26.4  26.4  26.4   0.0
 11.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 12.|-- 141.0.202.201         80.0%     5   25.0  25.0  25.0  25.0   0.0

globalaxs01.ring.nlnog.net     AS 9009 (GB)
mtr -i0.5 -c5 -r -w -n 176.10.80.10
  1.|-- 109.233.156.241        0.0%     6    0.4   0.5   0.4   0.5   0.0
  2.|-- 109.233.156.1          0.0%     5    0.9   1.8   0.7   5.3   1.9
  3.|-- 81.201.115.41          0.0%     5    0.9   0.9   0.8   1.0   0.1
  4.|-- 62.209.32.18          40.0%     5    1.3   1.2   1.2   1.3   0.1
  5.|-- 80.81.192.165          0.0%     5    1.3   9.3   1.2  41.5  18.0
  6.|-- 193.27.64.245         60.0%     5  191.9 108.1  24.3 191.9 118.5
  7.|-- 193.27.64.66          80.0%     5   43.6  43.6  43.6  43.6   0.0
  8.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 10.|-- 176.10.80.2           80.0%     5   26.1  26.1  26.1  26.1   0.0
 11.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 12.|-- ???                   100.0     4    0.0   0.0   0.0   0.0   0.0
 13.|-- ???                   100.0     3    0.0   0.0   0.0   0.0   0.0
 14.|-- ???                   100.0     2    0.0   0.0   0.0   0.0   0.0
 15.|-- 176.10.80.10           0.0%     1   24.3  24.3  24.3  24.3   0.0



An alarm is raised under the following conditions: every 30 seconds 
your node pings all other nodes. The amount of nodes that cannot be 
reached is stored in a circular buffer, with each element representing 
a minute of measurements. In the event that the last three minutes are 
1.2 above the median of the previous 27 measurement slots, a partial 
outage is assumed. The ring buffer's output is as following:

29 min ago  41 measurements failed (baseline)
28 min ago  41 measurements failed (baseline)
27 min ago  41 measurements failed (baseline)
26 min ago  42 measurements failed (baseline)
25 min ago  41 measurements failed (baseline)
24 min ago  41 measurements failed (baseline)
23 min ago  41 measurements failed (baseline)
22 min ago  41 measurements failed (baseline)
21 min ago  41 measurements failed (baseline)
20 min ago  41 measurements failed (baseline)
19 min ago  41 measurements failed (baseline)
18 min ago  41 measurements failed (baseline)
17 min ago  41 measurements failed (baseline)
16 min ago  41 measurements failed (baseline)
15 min ago  41 measurements failed (baseline)
14 min ago  41 measurements failed (baseline)
13 min ago  41 measurements failed (baseline)
12 min ago  41 measurements failed (baseline)
11 min ago  41 measurements failed (baseline)
10 min ago  41 measurements failed (baseline)
 9 min ago  41 measurements failed (baseline)
 8 min ago  41 measurements failed (baseline)
 7 min ago  41 measurements failed (baseline)
 6 min ago  41 measurements failed (baseline)
 5 min ago  41 measurements failed (baseline)
 4 min ago  41 measurements failed (baseline)
 3 min ago  45 measurements failed (baseline)
 2 min ago  66 measurements failed (raised alarm)
 1 min ago  65 measurements failed (raised alarm)
 0 min ago  65 measurements failed (raised alarm)

Exporting metrics to InfluxDB

If you’re running InfluxDB it is possible to export metrics from RING SQA. Both InfluxDB v1.8 and v2 are supported. We assume you have a running InfluxDB server available with credentials to publish metrics in a bucket or database.

You need to add the following section to your configuration (/etc/ring-sqa/main.conf) to export metrics to InfluxDB:

influxdb:
   url: "http://server:8086"    # your InfluxDB server
   bucket: "ring-sqa"           # for InfluxDB v1.8 this is the database
   token: ""                    # for InfluxDB v1.8 use format: <username>:<password>
   org : "nlnog"                # for InfluxDB v1.8 this can be left empty

This provides the ring-sqa_measurements metric which has the following fields:

latency to indicate the round trip time in microseconds to a destination
state to indicate the availability of a destination (0 for unreachable, 1 for available)

All measurements are tagged with the following labels:

 afi       = ipv4/ipv6
 src_node  = Source node
 dst_node  = Destination node
 dst_cc    = Destination node country
 dst_lat   = Destination node latitude
 dst_lon   = Destination node longitude

Exporting metrics to Graphite

Metrics can be exported to Graphite as well. The following statement needs to be added to your configuration to export metrics:

graphite: hostname:port

The following metrics will be exported:

nlnog.ring_sqa.<address family>.<host>.<countrycode>.<node>.state
nlnog.ring_sqa.<address family>.<host>.<countrycode>.<node>.latency