Reachability was restored immediately via local backup transit. However, such an abrupt teardown of BGP sessions always entails some convergence time, which is why it took 1-5 minutes until the situation had stabilized for all servers located in the maincubes datacenter. IPv6 connectivity was affected for a few minutes longer.
Maintenance
To solve the underlying problem permanently, we will carry out maintenance work on the connection between our datacenters on 20.06.2024 between 1 a.m. and 5 a.m. (German time). According to our plan, this will not cause any downtime.
However, critical interventions always carry a residual risk, which is why we are announcing the work as a precaution and carrying it out during the period with the lowest traffic volume. Latencies may also be slightly higher, as some upstreams and peerings will be temporarily unavailable.
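For readers outside Germany, the maintenance window can be converted to UTC. The following is a minimal sketch, assuming that "German time" on this date means Central European Summer Time (UTC+2); the names `CEST` and `window_in_utc` are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: German local time on 20.06.2024 is CEST (UTC+2),
# because summer time is in effect in June.
CEST = timezone(timedelta(hours=2), name="CEST")


def window_in_utc(start_local, end_local):
    """Convert a timezone-aware local maintenance window to UTC."""
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


# Announced window: 20.06.2024, 1 a.m. to 5 a.m. German time.
start = datetime(2024, 6, 20, 1, 0, tzinfo=CEST)
end = datetime(2024, 6, 20, 5, 0, tzinfo=CEST)

start_utc, end_utc = window_in_utc(start, end)
print(start_utc.isoformat(), "to", end_utc.isoformat())
```

Under this assumption, the window runs from 23:00 UTC on 19.06.2024 to 03:00 UTC on 20.06.2024.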
Technical background
Our Interxion PoP is connected to the maincubes datacenter via our own redundant dark fibres, which terminate on the maincubes side on two redundant switches from Juniper Networks. In theory, this means there is no single point of failure, especially because we interconnect the switches using MC-LAG technology, so each device can be managed independently. In such a critical area of our network, we have deliberately avoided solutions in which several switches form a stack that is managed like a single switch ("Virtual Chassis"), as firmware bugs can easily bring down a Virtual Chassis in such a way that it can no longer be managed until the devices are physically restarted (power off, power on). Our devices remained manageable, which is why we were able to restart them normally and restore full redundancy in a very short time.
Nevertheless, the incident showed that in practice our design did not work as well as we would have liked, and we apologize for this.
Our requirement for the connection between our datacenters is that it must never fail completely. Over the last ten days, we have therefore investigated many options and ultimately decided to remove the aggregation switches at fault entirely and instead implement a direct (and, of course, also redundant) router connection with no aggregation switches in between. As the last components required for the upcoming migration of the dark fibres arrived today, we can now announce the maintenance.
You are welcome to follow us on Instagram if you would like a small glimpse of the work.