Traditionally, datacenter network infrastructure for large companies or large compute farms was built on a three-layer hierarchical model, which Cisco calls the “hierarchical inter-networking model”. It consists of core layer switches ($$$) which connect to distribution layer switches ($$) (sometimes called aggregation switches), which in turn connect to access layer switches ($). Access layer switches are frequently located at the top of a rack, so they are also known as top-of-rack (ToR) switches. Most network infrastructure is still laid out this way today.
The Hierarchical Model in a Datacenter Network Switching Architecture
The good news with this hierarchical model is that traffic between two nodes in the same rack, if at Layer 2 of the network stack, is sent with low latency. If the access switches are 10Gb, then the communication can have high throughput as well. Also, this type of configuration allows for a vast number of ports at the access layer.
The bad news is, well, everything else about this model. It’s expensive. East-west communication, say between racks of gear, means that traffic travels to the aggregation layer and frequently to the data center core. These multiple hops, frequently across over-subscribed backplanes, take a very long time: 50 µs or more via traditional vendors and their traditional solutions. Any Layer 3 traffic needs to leave the rack and reach the aggregation tier of switches before being routed, even if it is headed back to the same rack it came from.
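To make the hop problem concrete, here is a minimal sketch of the path described above. The `Host` type, the notion of a "pod" (a group of racks sharing an aggregation switch), and the `switch_path` function are illustrative assumptions, not part of any vendor's tooling, and the model ignores redundant links entirely.

```python
from typing import List, NamedTuple


class Host(NamedTuple):
    pod: int   # hypothetical: the aggregation block this rack hangs off
    rack: int  # the ToR (access) switch within that pod


def switch_path(src: Host, dst: Host, routed: bool = False) -> List[str]:
    """Return the switch tiers a packet crosses in a three-layer hierarchy.

    routed=True models Layer 3 traffic, which must reach the aggregation
    tier before being routed, even when both hosts share a rack.
    """
    if src.pod != dst.pod:
        # Different aggregation blocks: all the way up to the core and back.
        return ["access", "aggregation", "core", "aggregation", "access"]
    if src.rack != dst.rack or routed:
        # Same pod: up to the shared aggregation switch and back down.
        return ["access", "aggregation", "access"]
    # Same rack, Layer 2: switched locally by the ToR switch.
    return ["access"]


same_rack_l2 = switch_path(Host(0, 1), Host(0, 1))
same_rack_l3 = switch_path(Host(0, 1), Host(0, 1), routed=True)
cross_pod = switch_path(Host(0, 1), Host(3, 7))
print(len(same_rack_l2), len(same_rack_l3), len(cross_pod))
```

Under these assumptions, only Layer 2 traffic inside one rack stays on a single switch; everything else crosses three or five.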
And east-west traffic isn’t the exception in modern application deployments, it’s the norm. This traffic is from applications talking to each other, talking to databases, and talking to IP-attached storage.
The problems with this typical hierarchical networking multiply when virtual machines run on the servers. When the servers in the racks are running virtual machine managers and virtual machines, limits abound. East-west traffic is even more prevalent, because virtualization essentially randomizes the locations of the (virtual) servers. With traditional architecture, the datacenter manager could load a rack with components that were likely to communicate with each other (say application servers and database servers). With virtualization those components could be anywhere within the virtualized infrastructure. Virtualization also pushes the limits of network segmentation. For example, the maximum number of VLANs is 4096 (a limit set by the 12-bit VLAN ID in the IEEE 802.1Q standard, with two of those IDs reserved), which can impose artificial limits within a virtualized facility. While a facility might naturally need thousands of VLANs for multi-tenancy, because of the VLAN limit the facility may need to be divided into multiple small virtualization clusters. This limits resource management options, for example preventing a VM from being moved to the least loaded server if that server is in some other cluster.
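The arithmetic behind the VLAN limit is simple enough to sketch. The 12-bit ID width and the two reserved values (0 and 4095) come from IEEE 802.1Q; the `clusters_needed` helper and the 10,000-VLAN figure are hypothetical, used only to illustrate how the limit forces a facility to be split.

```python
import math

VLAN_ID_BITS = 12                      # width of the VLAN ID field in the 802.1Q tag
TOTAL_VLAN_IDS = 2 ** VLAN_ID_BITS     # 4096 possible values
RESERVED_VLAN_IDS = {0, 4095}          # reserved by the standard
USABLE_VLANS = TOTAL_VLAN_IDS - len(RESERVED_VLAN_IDS)


def clusters_needed(vlans_required: int) -> int:
    """Minimum number of separate VLAN domains (clusters) needed,
    assuming each cluster gets its own full VLAN ID space."""
    return math.ceil(vlans_required / USABLE_VLANS)


print(USABLE_VLANS)              # 4094
print(clusters_needed(10_000))   # 3
```

A multi-tenant facility that wants 10,000 isolated segments therefore ends up as at least three clusters, and a VM cannot simply move between them.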
VM migration across VLAN boundaries can be supported by network infrastructure and protocols, for example by using Generic Routing Encapsulation (GRE) to tunnel Layer 2 packets through Layer 3 infrastructure. VMware has its own solution that works with specific components (vDS) providing MAC-in-MAC encapsulation, removing VLANs in favor of Port Group Isolation. These solutions are problematic because of proprietary vendor lock-in, extra overhead, or extra complexity.
More bad news arrives if the datacenter managers want to make any changes to their existing architecture. Once this hierarchical infrastructure is put in place, change is difficult. Another rack of gear not only means another ToR switch, but possibly another aggregation switch or even more ports in the core switch. If an application running in a rack needs more throughput, how is it delivered? Trunking multiple Ethernet connections into a single host helps, but what if the throughput is needed by applications running in other racks? With Spanning Tree Protocol (STP), there are serious limits to how many connections can be added between the switches, leading to bottlenecks above and beyond the existing high latency.
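The STP limitation can be illustrated with a deliberately simplified sketch. Real STP elects a root bridge and picks ports by bridge ID and path cost; the version below just builds a breadth-first tree from a chosen root, which is enough to show that every link outside the loop-free tree carries no traffic. The switch names, the link list, and the `spanning_tree_links` function are all hypothetical, and parallel links between the same pair of switches are not modeled.

```python
from collections import deque
from typing import FrozenSet, Iterable, Set, Tuple

Link = FrozenSet[str]


def spanning_tree_links(
    links: Iterable[Tuple[str, str]], root: str
) -> Tuple[Set[Link], Set[Link]]:
    """Split inter-switch links into forwarding and blocked sets
    using a breadth-first tree from `root` (a simplification of STP)."""
    links = list(links)
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    visited = {root}
    forwarding: Set[Link] = set()
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for neighbor in sorted(adjacency.get(switch, [])):
            if neighbor not in visited:
                visited.add(neighbor)
                forwarding.add(frozenset((switch, neighbor)))
                queue.append(neighbor)

    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Two ToR switches dual-homed to two aggregation switches under one core.
LINKS = [
    ("core", "agg1"), ("core", "agg2"), ("agg1", "agg2"),
    ("agg1", "tor1"), ("agg2", "tor1"),
    ("agg1", "tor2"), ("agg2", "tor2"),
]
forwarding, blocked = spanning_tree_links(LINKS, "core")
print(len(forwarding), "forwarding,", len(blocked), "blocked")
```

With five switches, a loop-free tree can use only four links, so three of the seven links in this toy topology sit idle. Adding more uplinks adds more blocked links, not more throughput.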
And let’s hope that no errors, odd behaviors, connection drops, or performance issues occur in this traditional architecture, because visibility into traffic is limited and debugging is a challenge. Quality of service, traffic prioritization, and packet or traffic capture (for regulations or debugging) are all challenges (or downright impossible). In fact many network admins need to worry not only about mean-time-to-recovery (MTTR) from a problem, but also about MTTI: mean-time-to-innocence.
When there is a problem in the infrastructure, networking is frequently the first area blamed, because it’s difficult to prove that the problem is not in the network.
So far we have ignored desirable networking features such as firewalling and load balancing. Firewalling is typically provided outside of this traditional network infrastructure, or via ad hoc methods. VMware again can firewall within its hypervisors, but what happens if those virtual servers need to talk to a physical database server or NAS storage? In almost all cases that traffic is not firewalled, due to the cost and performance impact of firewalling. Even datacenters that are willing to make those compromises are challenged. If general-purpose firewall features are desired, then all pertinent traffic needs to get to a firewalling device (perhaps a line card in a core router), taking multiple hops and adding latency just to get to the point of being filtered. Specific-purpose firewalls could be added, for example between the database server and all other servers, but the need to firewall other traffic means adding more and more physical firewalls (with resulting cost, complexity, and frailty). Managing such infrastructure, especially if different firewalling methods are used at different tiers, again increases complexity.
Load balancing may seem like a point solution to specific needs, such as sending traffic to multiple web servers. It’s becoming more important, however, as load balancing is being required in areas as diverse as email services (Exchange 2010 and beyond) and NAS storage (NetApp Cluster-mode NAS appliances). Load balancing will likely become more pervasive over time, presenting network infrastructure managers with challenges like those of firewalling.
A more modern design “flattens” this hierarchical network to increase performance for east-west traffic. These networks remove the aggregation layer, requiring more ports in the core layer (with variations depending on the networking vendor). While that does provide an incremental improvement over the previous designs, it still suffers in the areas of flexibility, performance, manageability, functionality and cost.
So far we’ve painted a bleak picture for network designers and administrators. In future posts we’ll explore solutions to the data center networking problems.
On a side note, we’re also happy to announce that we now have a Twitter feed and will be actively tweeting as @pluribusnet.