RING SQA

The purpose of RING SQA is to detect outages as fast as possible that only affect a subset of all internet destinations.

RING SQA pings all other nodes (v4 + v6) every 30 seconds to derive a baseline, this baseline is compared to the last 3 minutes of measurements. If the median of the baseline is tripped for three consecutives minutes, an alarm is raised.

When an alarm is raised, three MTRs are immediately launched towards destinations that previously were reachable, but suddenly not anymore. The purpose of these traces is to provide an investigation starting point for your NOC.

All in all super fast outage detection. All participants are invited to use this system! Gratis! :-)

One can simply configure where alerts should be emailed by changing the /etc/ring-sqa/alarm.conf file on your own RING node(s) to something like
this (do keep in mind the indenting!):


    job@ringnode01.ring.nlnog.net:~$ sudo cat /etc/ring-sqa/alarm.conf
    ---
    email:
      to: noc@yourcompany.com
      from: sqa-alert@ your_ring_node .ring.nlnog.net
      prefix: 'RING ALERT '
    irc:
      host: 1.2.3.4
      port: 5502
      password: derp
      channel: ! '#noc'

Afterwards restart the ring-sqa daemons to load the new config:


    job@ringnode01:~$ sudo restart ring-sqa4
    ring-sqa4 start/running, process 18715
    job@ringnode01:~$ sudo restart ring-sqa6
    ring-sqa6 start/running, process 18727

Et voila! After 30 minutes the machine will stand guard over your network. RING participants with multiple hubs or datacenters will benefit from spinning up more nodes, as monitoring is from each RING nodes individual perspective.

We extend a HUGE thank you to Saku Ytti who wrote RING SQA. Please send him beer, chocolate and flowers.

Below we’ve included an example outage alert.


------------------ Example RING SQA Message ---------------------------

From: sqa@xing02.ring.nlnog.net
To: noc@
Subject: RING ALERT raising ipv4 alarm - 16 new nodes down
Body:

Regarding: xing02.ring.nlnog.net ipv4

This is an automated alert from the distributed partial outage 
monitoring system "RING SQA".

At 2014-07-27 10:18:05 UTC the following measurements were analysed 
as indicating that there is a high probability your NLNOG RING node 
cannot reach the entire internet. Possible causes could be an outage 
in your upstream's or peer's network.

The following nodes previously were reachable, but became unreachable 
over the course of the last 3 minutes:

- itps01.ring.nlnog.net            128.65.97.93 AS42010 GB
- fullsave01.ring.nlnog.net       141.0.202.201 AS39405 FR
- globalaxs01.ring.nlnog.net       176.10.80.10 AS 9009 GB
- kwaoo01.ring.nlnog.net         178.250.209.33 AS24904 CH
- suretec01.ring.nlnog.net          185.8.92.17 AS199659 GB
- swisscom01.ring.nlnog.net      193.247.170.254 AS 3303 CH
- claranet01.ring.nlnog.net         195.157.9.4 AS 8426 GB
- claranet04.ring.nlnog.net        195.22.19.34 AS 8426 PT
- dcsone01.ring.nlnog.net         203.123.48.14 AS37989 SG
- trueinternet01.ring.nlnog.net  203.144.167.57 AS 7470 TH
- jump01.ring.nlnog.net          212.13.217.117 AS 8943 GB
- lchost01.ring.nlnog.net        213.230.217.125 AS25098 GB
- suomi01.ring.nlnog.net         217.119.42.194 AS16302 FI
- melbourne01.ring.nlnog.net     37.128.187.253 AS39451 GB
- netability01.ring.nlnog.net       46.182.9.20 AS 1197 IE
- viatel02.ring.nlnog.net          46.183.108.2 AS31122 FR
- claranet06.ring.nlnog.net          92.54.7.29 AS 8426 ES


As a debug starting point 3 traceroutes were launched right after 
detecting the event, they might assist in pinpointing what broke:

trueinternet01.ring.nlnog.net  AS 7470 (TH)
mtr -i0.5 -c5 -r -w -n 203.144.167.57
  1.|-- 109.233.156.241        0.0%     6    0.5   0.5   0.5   0.6   0.0
  2.|-- 109.233.156.1          0.0%     5    0.8   0.9   0.8   1.1   0.1
  3.|-- 109.233.156.2          0.0%     5    0.8   0.8   0.8   0.9   0.0
  4.|-- 64.209.88.33           0.0%     5    0.9   1.0   0.9   1.5   0.3
  5.|-- 159.63.23.198         60.0%     5  265.1 264.9 264.7 265.1   0.3
  6.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  8.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 10.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 11.|-- 203.144.144.30        80.0%     5  297.4 297.4 297.4 297.4   0.0
 12.|-- ???                   100.0     4    0.0   0.0   0.0   0.0   0.0

fullsave01.ring.nlnog.net      AS39405 (FR)
mtr -i0.5 -c5 -r -w -n 141.0.202.201
  1.|-- 109.233.156.241        0.0%     6    0.5   0.5   0.5   0.5   0.0
  2.|-- 109.233.156.1          0.0%     5    0.8   3.2   0.8  12.2   5.0
  3.|-- 109.233.156.2          0.0%     5    0.8   0.9   0.8   1.0   0.1
  4.|-- 109.233.156.37         0.0%     5    1.0   1.0   0.9   1.5   0.3
  5.|-- 149.11.106.1           0.0%     5    1.1   1.4   1.1   1.7   0.2
  6.|-- 130.117.3.137          0.0%     5    1.5   1.7   1.5   1.8   0.2
  7.|-- 154.54.62.77           0.0%     5   11.4  11.7  11.3  13.0   0.7
  8.|-- 154.54.75.154          0.0%     5  201.0 166.9  66.9 323.0 101.5
  9.|-- 154.54.56.214          0.0%     5   23.0  23.0  22.8  23.0   0.1
 10.|-- 149.11.58.62          80.0%     5   26.4  26.4  26.4  26.4   0.0
 11.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 12.|-- 141.0.202.201         80.0%     5   25.0  25.0  25.0  25.0   0.0

globalaxs01.ring.nlnog.net     AS 9009 (GB)
mtr -i0.5 -c5 -r -w -n 176.10.80.10
  1.|-- 109.233.156.241        0.0%     6    0.4   0.5   0.4   0.5   0.0
  2.|-- 109.233.156.1          0.0%     5    0.9   1.8   0.7   5.3   1.9
  3.|-- 81.201.115.41          0.0%     5    0.9   0.9   0.8   1.0   0.1
  4.|-- 62.209.32.18          40.0%     5    1.3   1.2   1.2   1.3   0.1
  5.|-- 80.81.192.165          0.0%     5    1.3   9.3   1.2  41.5  18.0
  6.|-- 193.27.64.245         60.0%     5  191.9 108.1  24.3 191.9 118.5
  7.|-- 193.27.64.66          80.0%     5   43.6  43.6  43.6  43.6   0.0
  8.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
  9.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 10.|-- 176.10.80.2           80.0%     5   26.1  26.1  26.1  26.1   0.0
 11.|-- ???                   100.0     5    0.0   0.0   0.0   0.0   0.0
 12.|-- ???                   100.0     4    0.0   0.0   0.0   0.0   0.0
 13.|-- ???                   100.0     3    0.0   0.0   0.0   0.0   0.0
 14.|-- ???                   100.0     2    0.0   0.0   0.0   0.0   0.0
 15.|-- 176.10.80.10           0.0%     1   24.3  24.3  24.3  24.3   0.0



An alarm is raised under the following conditions: every 30 seconds 
your node pings all other nodes. The amount of nodes that cannot be 
reached is stored in a circular buffer, with each element representing 
a minute of measurements. In the event that the last three minutes are 
1.2 above the median of the previous 27 measurement slots, a partial 
outage is assumed. The ring buffer's output is as following:

29 min ago  41 measurements failed (baseline)
28 min ago  41 measurements failed (baseline)
27 min ago  41 measurements failed (baseline)
26 min ago  42 measurements failed (baseline)
25 min ago  41 measurements failed (baseline)
24 min ago  41 measurements failed (baseline)
23 min ago  41 measurements failed (baseline)
22 min ago  41 measurements failed (baseline)
21 min ago  41 measurements failed (baseline)
20 min ago  41 measurements failed (baseline)
19 min ago  41 measurements failed (baseline)
18 min ago  41 measurements failed (baseline)
17 min ago  41 measurements failed (baseline)
16 min ago  41 measurements failed (baseline)
15 min ago  41 measurements failed (baseline)
14 min ago  41 measurements failed (baseline)
13 min ago  41 measurements failed (baseline)
12 min ago  41 measurements failed (baseline)
11 min ago  41 measurements failed (baseline)
10 min ago  41 measurements failed (baseline)
 9 min ago  41 measurements failed (baseline)
 8 min ago  41 measurements failed (baseline)
 7 min ago  41 measurements failed (baseline)
 6 min ago  41 measurements failed (baseline)
 5 min ago  41 measurements failed (baseline)
 4 min ago  41 measurements failed (baseline)
 3 min ago  45 measurements failed (baseline)
 2 min ago  66 measurements failed (raised alarm)
 1 min ago  65 measurements failed (raised alarm)
 0 min ago  65 measurements failed (raised alarm)

 

Comments are closed.