Part Four of a Five-Part Series on Software-Defined Data Centers in a Multi-Cloud World
In my last post, SDN for Physical and Virtual Networks in Space- and Cost-Constrained Environments, I wrote about a controllerless implementation for SDN automation of the underlay and a virtualized network overlay fabric that leverages the distributed processing power of open networking switches. The result of this approach is a very efficient and highly integrated network automation solution for smaller data center environments where traditional SDN approaches are simply too expensive, consume too much space and power and struggle to span geographically distributed multi-site or edge data center locations.
This novel approach is powerful and necessary but not sufficient; another layer of functionality is required to support comprehensive data center automation. To monitor the network and quickly identify and troubleshoot performance issues, granular telemetry on every flow that traverses the fabric is essential. In fact, major vendors like Cisco, with their Tetration offering, have validated the need for application analytics for modern applications. But these traditional approaches are not optimal for smaller environments, as they require a set of external test access points (TAPs), probes and packet brokers that effectively overlay the network fabric, not to mention a number of servers to execute the analytics. This results, again, in high cost, space and power consumption, as well as additional complexity.
Traditional Application Analytics
Traditional switches and routers switch billions of packets per second between servers and clients at sub-microsecond latencies using custom ASICs but have limited capability to record enough telemetry detail to provide a truly useful picture of network performance over time. It is a very similar story for OpenFlow-based switches, which use merchant silicon but have insufficient telemetry. As such, external TAPs and monitoring networks have to be built to get a sense of what is actually going on in the infrastructure. The figure below shows what monitoring today looks like.
This is where challenges arise. A typical data center network that connects servers runs a combination of 10, 25, 40 and 100 GbE today. These switches typically have many servers connected to them that are pumping traffic at high speed.
Some possible approaches to instrumenting the network today are as follows:
- Provision a copper or fiber optic TAP at every link and divert a copy of every packet to a packet broker fabric, which in turn routes traffic to the monitoring tools. With the fiber optics TAP and passives, every packet is mirrored, and the monitoring tools need to deal with a few Tb/s or 1B+ packets per second from each switch. However, the reality is that this approach is impossibly expensive, and thus no one deploys it.
- Selectively place copper or fiber optic TAPs at uplinks or edge ports. Mirror these edge packets to a packet broker fabric, which in turn routes traffic to the monitoring tools. While this is less costly, it means the inner network becomes a black hole with no visibility. Many of us have learned the hard way over time that without 100% visibility, you can’t fix a problem very efficiently. In addition, even this selective deployment of hardware makes the cost go up dramatically, as more switches are deployed and require monitoring – the monitoring fabric needs more capacity and the monitoring software gets more complex and needs more hardware resources.
- Use the networking switches themselves to selectively sample traffic (e.g., sFlow with standard hardware or NetFlow with proprietary hardware) and send this traffic and flow information to monitoring tools. This approach is built on the premise of sampling, where the sampling rates are typically 1 in 5,000 to 1 in 10,000 packets; sampling more frequently than this runs into scale challenges. This approach is better than nothing, but it does not provide enough raw detail to attain a full picture of the network.
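To put rough numbers on the sampling trade-off, here is a back-of-the-envelope sketch in Python. It assumes each packet is sampled independently with probability 1/N, which simplifies how sFlow-style sampling behaves in practice. The flow sizes are illustrative, not measured, and the function name is made up for this example.

```python
def miss_probability(flow_packets: int, sample_one_in: int) -> float:
    """Chance that a flow of `flow_packets` packets is never sampled.

    Assumption: every packet is sampled independently with
    probability 1 / sample_one_in.
    """
    return (1.0 - 1.0 / sample_one_in) ** flow_packets


if __name__ == "__main__":
    for rate in (5_000, 10_000):
        for packets in (10, 1_000, 100_000):
            p = miss_probability(packets, rate)
            print(f"1-in-{rate} sampling, {packets}-packet flow: "
                  f"{p:.1%} chance of never being seen")
```

Under these assumptions, a short flow of about 10 packets is almost never sampled, and even a 1,000-packet flow is missed more often than not. Only very large flows are reliably observed, which is why sampled data struggles to give a complete picture of individual connections.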
Another Approach to Telemetry and Analytics
As described in the previous post, it makes sense in constrained environments to leverage the distributed processing power of white box switches when possible. Similar to what can be done with SDN and network virtualization, one can write clever software that leverages the CPU and memory of the switch, as well as the packet processing ASIC, to monitor every TCP connection across the fabric at the speed of the network. This includes traffic within the VXLAN tunnels, so that east/west and north/south traffic flows, as well as virtualized workloads, can be tracked to expose important network and application performance characteristics.
Specifically, rich telemetry from the SDN and virtual network fabric can be gathered, where each switch in the fabric collects the metadata for every flow and sends it to an analytics application via REST API. In particular, more recent open networking switches feature dual 10G NICs that run between the CPU and the network processing ASIC, providing plenty of throughput to transport the data to the CPU. The bulk of the processing would happen on the local OS instance, and only metadata would be peeled off and sent to the analytics application, which allows this solution to scale to billions of flows.
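As a rough illustration of what "only metadata" might look like, the sketch below models a per-flow record and serializes a batch as JSON, the kind of body a switch-side agent could send to an analytics application over REST. Every field name here is hypothetical and chosen for readability; this is not the Netvisor ONE schema or API.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FlowRecord:
    # Hypothetical fields for illustration only; not a vendor schema.
    switch: str
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    vlan: int
    state: str
    bytes_out: int
    bytes_in: int
    latency_us: int


def to_payload(records: list[FlowRecord]) -> str:
    """Serialize flow metadata (never packet payloads) as a JSON request body."""
    return json.dumps({"flows": [asdict(r) for r in records]}, sort_keys=True)


if __name__ == "__main__":
    record = FlowRecord(
        switch="leaf-1", src_ip="10.0.0.5", dst_ip="10.0.1.10",
        src_port=51514, dst_port=443, vlan=100, state="EST",
        bytes_out=18_432, bytes_in=2_310_144, latency_us=220,
    )
    print(to_payload([record]))
```

The point of the design is that the record stays a few hundred bytes no matter how many megabytes the flow itself carried, which is what lets the collection side scale with the number of flows and not with traffic volume.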
With this approach one can effectively capture every TCP flow across the fabric at wire speed, including TCP connection states (SYN, SYNACK, EST, FIN, etc.) by service, client, domain and many other options over time, and store the metadata in a repository for deep analysis. Multiple options can also be offered to tag IP addresses, VLANs, MAC addresses and switch ports with metadata/contextual tags, so that flows can be aggregated or filtered based on the custom tags. In addition to flows, of course, it is important to have port telemetry and device diagnostics via a selection of searchable options such as fabric node, switch port, vport (virtual port) and state, including a dashboard of all ports in the fabric. This is extremely valuable, as it provides real-time or historical data analytics to identify performance concerns, root-cause network outages or quickly understand security threats like DDoS attacks.
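The idea of tracking connection states per flow and then rolling them up by custom tags can be sketched in a few lines of Python. This is a deliberately simplified state tracker, not a full TCP state machine and not how any particular product implements it. It assumes each packet arrives already keyed by a normalized (client, server, port) tuple, and the tag names are invented for the example.

```python
from collections import Counter

# TCP header flag bits (see RFC 9293).
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10


def next_state(state: str, flags: int) -> str:
    """Advance a simplified per-connection state from one packet's flags."""
    if flags & RST:
        return "RST"
    if flags & SYN:
        return "SYNACK" if flags & ACK else "SYN"
    if flags & FIN:
        return "FIN"
    if state == "SYNACK" and flags & ACK:
        return "EST"  # final ACK of the three-way handshake
    return state


def track(packets):
    """packets: iterable of (flow_key, tcp_flags). Returns {flow_key: state}."""
    states = {}
    for key, flags in packets:
        states[key] = next_state(states.get(key, "NEW"), flags)
    return states


def count_by_tag(states, tags):
    """Count connections per (tag of server IP, state); unknown IPs are 'untagged'."""
    counts = Counter()
    for (_client, server, _port), state in states.items():
        counts[(tags.get(server, "untagged"), state)] += 1
    return counts


if __name__ == "__main__":
    web = ("10.0.0.5", "10.0.1.10", 443)
    stuck = ("10.0.0.6", "10.0.1.10", 443)
    packets = [(web, SYN), (web, SYN | ACK), (web, ACK), (stuck, SYN)]
    tags = {"10.0.1.10": "web"}  # hypothetical tag
    print(count_by_tag(track(packets), tags))
```

A rising count of connections that never leave the SYN state for a given tag is the kind of signal that could point to an unreachable service or a SYN flood, which is why per-connection state is more useful than byte counters alone.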
Pluribus Netvisor ONE has implemented the novel software approach described above, leveraging the CPU, memory and packet processing ASIC to provide comprehensive flow and switch telemetry. Performance metrics are stored within the fabric and delivered as lightweight metadata that can be viewed using the CLI from the fabric or can be delivered via APIs or IPFIX to other monitoring systems, security information and event management (SIEM) platforms or the Pluribus Insight Analytics solution, which is an optional software module in our UNUM product offering. This solution can store up to 2.5 billion flows over a time window and has an analytics engine and a rich set of reports that allow network operations teams to drill down to a single flow and identify performance issues or bad actors. Its powerful search engine UI and simple query syntax can help isolate and filter specific flows among millions in a fraction of a second. This can help teams quickly identify and rectify performance issues, and it supports regular reporting to senior management.
As applications become more distributed with both east-west and north-south traffic, and services are deployed within private clouds, the ability to monitor each and every connection is of paramount importance for both performance and security reasons. Given the amount of data, traditional sampled network analytics sources do not scale. More traditional packet monitoring solutions designed to overcome this limitation unfortunately require significant hardware overlay infrastructures that are expensive, complex and consume space and power, which is not ideal for smaller data center environments. The best approach is to leverage the distributed processing power of the network switches themselves, with some clever software, to give the data sources and analytics tools the ability to observe every packet and flow at a fraction of the cost of traditional hardware-based solutions.
In my next and final blog in this series, The Importance of Network Segmentation for Security and Multi-Tenancy, I will try to wrap up by talking about some of the unique benefits that can be achieved and services that can be delivered with open networking and a controllerless approach to SDN control of the underlay, network virtualization and granular analytics.
You can find more information on Insight Analytics here.
Webinar replay: If you would like more detail on how Pluribus helps put SDDC and private cloud within reach for every IT team, then watch the replay of our webinar “Realizing the SDDC: Simple, Affordable SDN and Network Virtualization for Any Size Data Center.” In this webinar I am joined by Drew Schulke, VP Product Management, Dell EMC and Alessandro Barbieri, VP Product Management, Pluribus Networks. You can see the replay here.
About the Author
Mike is Chief Marketing Officer of Pluribus Networks. Mike has over 20 years of marketing, product management and business development experience in the networking industry. Prior to joining Pluribus, Mike was VP of Global Marketing at Infinera, where he built a world-class marketing team and helped drive revenue from $400M to over $800M. Prior to Infinera, Mike led product marketing across Cisco’s $6B service provider routing, switching and optical portfolio and launched iconic products such as the CRS and ASR routers. He has also held senior positions at Juniper Networks, Pacific Broadband and Motorola.