I’ve always been fascinated by random processes and by the weather in particular. I’ve spent entire holiday breaks exploring the distribution of prime numbers while my family enjoyed the beach. If I could go back in time (or hit the lottery, which I don’t play, so I guess that’s out) I would go back to school to study Computational Meteorology. Sadly, this major simply didn’t exist in the early ’90s. The advances in weather forecasting over the last few decades are simply astounding, and we all benefit from them today. I even recently purchased a small weather station for home, knowing that I could just as easily stick my head outside or check the National Weather Service website for my area.
What does any of this have to do with Networking?
Weather is the result of a complex set of interactions at small scales whose aggregate effects we experience at the macro level. Physics dictates how the individual particles and systems interact, and we can predict with some accuracy how the weather system will play out over the next few days. There is no central coordinating body determining when it will rain or how hard the wind will blow. Computer networks have far fewer variables, and as a system the components use protocols that dictate how those variables behave. Each node should behave in a deterministic fashion, and the end-to-end traffic flow works without any centralized coordination. This interaction of protocols works extremely well and is the reason you can view this page. However, the resiliency of these protocols and networks can mask underlying issues, and seemingly small changes can lead to emergent conditions – the oft-quoted butterfly effect. It is in that dynamism that I see the connection between the larger study of systems in general and the parallels of computer networks to the weather.
Some network engineers tend to build the most complicated beast possible, optimized to perfection, default settings be damned – present company included. While newer tech and fancy GUIs can abstract away some of the manual effort of network configuration, the underlying issues I’ll describe remain. Networking legend Russ White has some thoughts on this in the following posts: https://rule11.tech/its-time-for-a-short-lecture-on-complexity/ and https://rule11.tech/the-resilience-problem/
The network diagram below outlines just some of the protocols in use for a typical MPLS L3VPN connecting a branch office (left) to a Data Center (right). The client is communicating with the server via TLS/TCP/IP, and each of the routing protocols employs timers that dictate how long it will take for a change in a particular state table to be propagated to the rest of the network. Consider the number of variables in play for the client to successfully connect to the server and achieve bidirectional flow:
Further imagine that the server depicted is just the presentation tier of an N-tier application with an unknown number of compute nodes, databases and other dependencies, and you start to get a sense of just how many systems can be involved in even the simplest traffic flows for an application.
Now, take this example bidirectional flow and multiply the number of clients by 1000 and the servers by 10 and you get an idea of a small company network. Now, imagine the Service Provider in the middle is supporting 100 more companies just like this on its infrastructure.
The routing protocols in use will dictate the flow of traffic at a macro level – in most cases these algorithms do not take into account the conditions on the network beyond the reachability of peers and the link speeds. In a steady state the network will converge on some local or global optimum and the client can connect to the server. Technologies like MPLS TE (Traffic Engineering) and newer Software-Defined WAN (SD-WAN) can take the underlying conditions into account to dynamically route traffic over the objectively best path.
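To make the "link speeds only" point concrete, here is a minimal sketch of cost-based shortest-path selection in the style of OSPF. The topology, node names, and reference bandwidth are all made up for illustration; the only thing the path computation ever sees is link speed.

```python
import heapq

# Assumed reference bandwidth (100 Gbps); operators configure this value.
REFERENCE_BW_MBPS = 100_000

def link_cost(link_mbps):
    """OSPF-style cost: reference bandwidth / link bandwidth, minimum 1."""
    return max(1, REFERENCE_BW_MBPS // link_mbps)

def shortest_path(links, src, dst):
    """Dijkstra over link costs. `links` maps (a, b) -> speed in Mbps."""
    graph = {}
    for (a, b), mbps in links.items():
        cost = link_cost(mbps)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        total, node, path = heapq.heappop(queue)
        if node == dst:
            return total, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (total + cost, neighbor, path + [neighbor]))
    return None  # no path between src and dst

# Hypothetical topology: a fast path via C and a slower path via F.
links = {
    ("A", "B"): 1_000,
    ("B", "C"): 10_000,
    ("C", "D"): 10_000,
    ("B", "F"): 1_000,
    ("F", "D"): 1_000,
    ("D", "E"): 1_000,
}

total, path = shortest_path(links, "A", "E")
print(total, path)  # 220 ['A', 'B', 'C', 'D', 'E']
```

Nothing in this computation reads utilization, queue depth, or loss, so the path via C wins whether those 10 Gbps links are idle or saturated. That is the gap traffic engineering tries to fill.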
The Network is Slow
The problem typically appears with no apparent cause and disappears just as mysteriously. One example of this type of behavior, dubbed Route Flapping, shows how the network can converge, quickly de-converge and start the process over again. The flapping can occur over several minutes or with some seemingly random periodicity over several hours or even days. (Route flap damping was developed to deal with, or at least localize, the effects of this condition, but it introduces yet another variable into the mix.) If you’re lucky, the flapping will be due to an easily spotted configuration error or condition. If you’re not, it could be the result of nothing less than a perfect storm of conditions. (storm, get it?) For example, I once worked on a network that was extremely stable as long as there was no traffic on it. It turned out that the topologically best path for a particular flow worked just fine – until there was enough traffic to overrun some obscure receive buffer. Routing protocol packets were then dropped sporadically, the neighbors went down, the routes were withdrawn, and the whole process would start over again.
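For a feel of how flap damping adds its own variable, here is a rough sketch of the penalty mechanism: each flap adds a fixed penalty, the penalty decays exponentially, and the route is suppressed above one threshold and released below another. The numbers below are commonly cited defaults and are assumptions here, not values from the network in the story; `simulate` is a hypothetical helper.

```python
# Assumed damping parameters (commonly cited defaults, not from this network).
HALF_LIFE_S = 15 * 60     # penalty halves every 15 minutes
FLAP_PENALTY = 1000       # added on every flap
SUPPRESS_LIMIT = 2000     # suppress the route above this penalty
REUSE_LIMIT = 750         # release the route below this penalty

def decay(penalty, elapsed_s):
    """Exponential decay of the penalty over elapsed_s seconds."""
    return penalty * 0.5 ** (elapsed_s / HALF_LIFE_S)

def simulate(flap_times_s, end_s, step_s=60):
    """Return (time, penalty, suppressed) samples every step_s seconds.

    Flap times must fall on multiples of step_s in this simplified model.
    """
    flaps = set(flap_times_s)
    penalty = 0.0
    suppressed = False
    timeline = []
    for t in range(0, end_s + 1, step_s):
        if t > 0:
            penalty = decay(penalty, step_s)
        if t in flaps:
            penalty += FLAP_PENALTY
        if penalty > SUPPRESS_LIMIT:
            suppressed = True
        elif penalty < REUSE_LIMIT:
            suppressed = False
        # Between the two limits the previous state is kept (hysteresis).
        timeline.append((t, penalty, suppressed))
    return timeline

# Three flaps two minutes apart, observed for an hour.
timeline = simulate([0, 120, 240], end_s=3600)
suppressed_at = next(t for t, _, s in timeline if s)
released_at = next(t for t, _, s in timeline if t > suppressed_at and not s)
print(suppressed_at, released_at)
```

With these assumed values the third flap pushes the route into suppression, and it stays there for roughly half an hour after the flapping has stopped. The route is now unavailable for reasons that have nothing to do with the current state of the link, which is exactly the kind of hidden state that makes these problems hard to chase.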
Here are some of the routing protocols and interactions in our sample network above:
- LAN to WAN Router – OSPF
- WAN Router (CE) to PE – OSPF
- LDP between PE routers
- BFD monitoring links
- MPLS-TE within the SP network
- PE to PE – MP-BGP
- PE to WAN (CE) router – OSPF
- Spanning Tree in the campus remote office network. (Before you flame me with cries of “routed access rulez!” remember that routed access is just a smaller version of the above.)
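Each protocol in that list notices a dead peer on its own schedule. As a back-of-the-envelope sketch, the worst-case detection time is roughly the send interval times the number of missed packets tolerated. The timer values below are assumptions (common defaults or typical examples), not values taken from the sample network.

```python
# Assumed timer values in milliseconds; real deployments tune these.
TIMERS_MS = {
    "BFD": {"interval": 300, "multiplier": 3},      # example BFD tuning
    "OSPF": {"interval": 10_000, "multiplier": 4},  # hello 10 s, dead 40 s
    "BGP": {"interval": 60_000, "multiplier": 3},   # keepalive 60 s, hold 180 s
}

def detection_time_ms(timer):
    """Worst-case time to declare a silent peer down: interval x multiplier."""
    return timer["interval"] * timer["multiplier"]

def rank_by_detection(timers):
    """Protocols sorted from fastest to slowest failure detection."""
    ranked = [(name, detection_time_ms(t)) for name, t in timers.items()]
    return sorted(ranked, key=lambda pair: pair[1])

for name, ms in rank_by_detection(TIMERS_MS):
    print(f"{name}: {ms / 1000:.1f} s")
```

Under these assumptions the spread between the fastest and slowest detector is a factor of 200, so after a failure different layers of the same path can disagree about the topology for a while.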
Here are some of the variables that can affect performance and/or the logical network topology itself:
- Link utilization
- Buffer utilization
- Hardware forwarding failures
- WAN optimization technologies
- TCP sliding window and global synchronization
- Application conditions that result in Thundering Herd
- Disk IOPS
If I didn’t have a day job, I would attempt to build a model of this network in Mathematica as a system of equations and use it to explore just how sensitive it is to changes and the impact on the flow between the client and server.
Lastly, here’s a great video from Computerphile that describes TCP meltdown – a condition that can arise when running a TCP-based application over a VPN that itself uses TCP for transport.