Network fault on 11.12.2023

24.12.2023
On Monday, 11.12.2023, our network was disrupted between 14:24 and 14:31. In this post we take a look at the technical background of this disruption - the first and so far only total outage of our network since the launch of our own AS (58212).

Context

We took our first steps towards operating our own network in 2018, when we were still largely dependent on our data center. Naturally, these first steps were not yet comparable with our current - and by now very stable - network operation. We learned a lot from our mistakes in those first years and put our own AS (autonomous system) into operation at the beginning of 2020, investing a lot of time and money to eradicate the teething troubles of our first setup. Of course, carriers (of which we have several) can fail and Internet exchange points such as DE-CIX (to which we have been directly connected for years) can be disrupted - but such problems resolve themselves by design within seconds or minutes and only ever affect individual paths. Until December 11, 2023, when the dataforest network was completely offline for almost eight minutes in the afternoon, we had not had a single total failure of the current network setup, which has been running for over three and a half years. It almost made it to four.

Monday

A monitoring message concerning our network tends to put the person on call on heightened alert, but it is usually nothing out of the ordinary. In this case, however, it was clear two minutes after the start of the fault that we had a major problem, so all available employees were alerted to handle the many calls to our hotline. This worked well thanks to well-rehearsed escalation processes. As we have many technically interested readers here, we want to share the key results of our analysis transparently. If you are not interested in the details, you will find a brief conclusion below.

If our monitoring reports "all gone" and this can be confirmed, the first thing we do is connect to our management network, which we operate "out of band", i.e. completely independently of our own network. This setup is extremely important to ensure that we never lock ourselves out, whether due to a misconfiguration or a network failure. In an absolute emergency, this network also provides access to the serial interfaces of our routing and switching infrastructure.
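
For illustration, the Junos side of such an out-of-band setup can be built from statements along these lines - the addresses and prefixes below are placeholders, not our actual configuration:

set interfaces fxp0 unit 0 family inet address 192.0.2.10/24     # dedicated management port on the routing engine
set system backup-router 192.0.2.1 destination 198.51.100.0/24   # gateway installed by the kernel, usable even while rpd is not running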

As the management network was accessible, we were able to rule out a power failure - which probably made every administrator breathe a sigh of relief, despite the 100% SLA for power and air conditioning at our maincubes site. A login to the active router then showed that a failover from the primary routing engine to the standby routing engine had taken place - such a failover can happen, but "by design" nothing fails when it does, and until now it never had. Our network design aims for an availability of almost 100%, which essentially means: there is no central component whose failure on the hardware or software side should be able to cause a serious disruption. So what happened?

Emergency analysis

All system-relevant processes were running, all line cards and power supply units were present and operating without malfunction, and no errors were visible - our monitoring records these as well, but it would still have received them via SNMP and therefore with some delay. The only exception was the alarm indicating that the standby routing engine was currently active - and even the primary routing engine, which we had initially assumed to be dead, was still running and had not undergone a "spontaneous reboot" or anything similar. A bad premonition was then confirmed:

root@re1-mx480.mc-fra01.as58212.net> show bgp summary 
error: the routing subsystem is not running

root@re1-mx480.mc-fra01.as58212.net> show route table inet.0
error: the routing subsystem is not running

This is really not a message you want to see. But it made sense in the context of the previous process check, because we already knew that the "Routing Protocol Process" ("rpd" for short) had just been restarted. A validating look at the logs confirmed that it had crashed shortly before and was still recovering. We could foresee that everything would be back online shortly, and so it was. By the time we reached this conclusion, the initial analysis had taken about five minutes from the start of the fault, and two minutes later the first recovery messages arrived from Opsgenie. In the end, we did not have to intervene at all to actually clear the fault - not a pleasant incident, but the self-healing did restore a bit of confidence.
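
For readers who want to retrace such an analysis on their own Junos devices, the checks described above roughly correspond to the following operational commands - a generic list, not a transcript of our session:

show chassis routing-engine      # which routing engine is master/backup, plus uptime and load
show chassis alarms              # active chassis alarms, such as the backup-routing-engine-active alarm mentioned above
show chassis fpc                 # state of the line cards
show system core-dumps           # core files left behind by a crashed daemon such as rpd
show log messages | match rpd    # rpd crash and restart messages in the system log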

Cause(s) of the outage

As described above, a failover to the standby routing engine had taken place, and it was reasonable to assume that this process was also the cause. It quickly became clear that this was not the case - the network actually continued to run for almost exactly five minutes after the failover. Why the failover - and why the five minutes? These were the key questions that needed to be answered.

Once does not count

Our Junos configuration (Junos is the operating system of the network equipment manufacturer Juniper Networks) provides for a switchover to the standby routing engine in the event of a crash of the "rpd" - the process that plays the main role in this incident - even if the rest of the active routing engine is working properly. This is usually (exceptions prove the rule) best practice and will remain so for us. The configuration ensures that everything continues to run smoothly after such a crash: although the line cards do not immediately need "rpd" to forward data packets, routing protocols such as BGP would lose their state, which in turn would lead to at least a partial failure within a very short time. The setting therefore makes sense and did exactly what it was supposed to do here, namely switch routing engines after a crash without any outage. Because the routing engines are permanently synchronized, such a switch is possible at any time without disruption (and we have already performed it several times for maintenance purposes without any problems). The first question was thus answered.
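
In Junos terms, this behaviour is typically built from a handful of redundancy statements along the following lines - a sketch of the relevant knobs, not a copy of our configuration:

set chassis redundancy graceful-switchover                     # GRES: the backup routing engine mirrors kernel and interface state
set routing-options nonstop-routing                            # NSR: the backup rpd mirrors routing protocol state such as BGP sessions
set system commit synchronize                                  # keep both routing engines on the same configuration
set system processes routing failover other-routing-engine     # switch to the other routing engine if rpd itself crashes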

But two times is twice too many

The failover as a result of the rpd crash took place at 14:21. At 14:23, the process also crashed on the standby routing engine. And then disaster struck: the "rpd" on the primary routing engine had not yet been fully restarted, so that routing engine was simply "not ready for switchover", in Junos terms. From that point, there were only 1-2 minutes of tolerance left until all our BGP upstreams had "forgotten" our network and dropped it from their routing tables, as BGP was no longer running on our side. The fact that the line cards continued to forward traffic "disoriented" for a while did not help us either, as our AS gradually disappeared from the Internet.
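
That tolerance window is the BGP hold time: once rpd stops sending keepalives, each upstream tears down the session after its hold timer expires - 90 seconds by default in Junos - and withdraws our prefixes. Illustrative only, with a placeholder group name:

set protocols bgp group UPSTREAMS hold-time 90    # session is torn down if no keepalive or update arrives within 90 seconds
show bgp neighbor                                 # the hold time negotiated with each peer appears in the output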

Root cause and solution

A crash of the rpd can happen from time to time. Very rarely. The fact that it crashed several times in a row and led to a network outage, however, is unusual and had never happened to us before. After intensive analysis of the core dumps and log files available to us, we were able to identify a Junos bug that clearly matches the behavior observed here and has been fixed in a newer release, which we have since installed.
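
For completeness: on a dual-routing-engine system, the installed release can be checked per routing engine with a single command (generic, no output shown here):

show version invoke-on all-routing-engines    # reports the Junos release running on each routing engine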

Bad timing

Back on 03.12.2023, we had a six-minute disruption in a virtualization rack, which affected around 50% of our SolusVM-based vServer host systems - but fortunately no other products such as dedicated servers or colocation. That network problem was due to a misconfiguration. On 10.12.2023 at around 3 am, one of our carriers had a total outage, which affected 80% of our IPv4 prefixes for a good two minutes and was repeated on 20.12.2023 at around 11 am. None of these incidents had anything to do with the network outage that is the subject of this article, but of course we understand that this cluster of incidents has raised general questions about network stability in December for some customers. We are very keen to deliver a long period of stability again - at least in terms of what is within our control.

Conclusion and improvement potential

The fact that our setup had a problem once after almost four years does not shake our basic trust in it: hardware and software are not error-free, such bugs occur (rarely) and can usually be resolved quickly. The Juniper MX series that we use has been regarded on the market as "rock solid" for operating high-availability network infrastructure for almost two decades. We are therefore convinced that the update we have installed really does solve the problem. Nevertheless, we can learn a few lessons from it:

  1. Release of our status page: Admittedly, this is something you and we have been wanting since long before this disruption, but it is reason enough to finally create a sensible solution for communicating outages and maintenance work with status.dataforest.net and to retire the mix of social media and emails. From the beginning of January 2024, this will be the central hub. Of course, the status page can also be subscribed to (initially via email, Slack and Atom/RSS).
  2. Emergency maintenance performed suboptimally: We had announced packet loss in the range of seconds for 17.12.2023, 3:00 am. On the one hand, the announcement came at very short notice, because the bug suddenly reappeared several times the previous evening after a week without any problems (without outages in each case, but we were of course alarmed). In future, we will announce a corresponding software update as soon as possible after an incident and then move it forward if necessary. On the other hand, there was an annoying downtime of a full five minutes, because the router's line cards rebooted completely. This was simply a misjudgement on our part, for which we can only apologize. In future, we will consider whether we should migrate our customers and carriers to a replacement router for such an update to avoid real downtime - or simply plan for a longer downtime and not only announce it explicitly, but also schedule it for an even less critical time. However, if downtime is to be avoided as far as possible, such a migration is by no means a matter of minutes and is therefore not always a realistic option.
  3. Better separation of edge and core routing: This was already in progress regardless of the disruption - in future, we will physically separate our edge routing, i.e. the termination of BGP sessions to our upstreams, from our core routing, where the actual customer VLANs are terminated. This will reduce convergence times in the event of faults and during maintenance work of this kind.

With this in mind: Thank you for reading and Merry Christmas!
