Data Center Fabric Monitoring & Visibility: Tools and Options

So far in our blog series on data center network trends, we have focused on options and best practices for building and automating DC fabrics. That leaves one critical missing piece: monitoring and visibility, sometimes also referred to as “observability.”

As every network operations (NetOps) team knows, even the most elegant network architecture built with the best hardware and software will experience faults and performance hiccups that have to be seen to be fixed. When applications slow down or stop running, it’s often the network team that feels the heat first. According to EMA Research, over one-third of network problems are reported by users before NetOps is aware of them.

Keeping users happy requires detecting faults before they cause problems, diagnosing them quickly and minimizing the mean-time-to-repair. And when users mistakenly blame the network for problems, NetOps teams need to establish quickly that the problem is not actually in the network. Network engineers sometimes jokingly refer to this goal as minimizing the “mean-time-to-innocence.”

All joking aside, a good monitoring and visibility framework should not just be focused on network availability and performance to ensure the innocence of the NetOps team. It should also support applications teams and the security operations (SecOps) team in identifying and eliminating application performance problems and security threats.

Polling versus Streaming

Network management systems historically relied on Simple Network Management Protocol (SNMP) polling to gather data on network device state and traffic parameters such as port and link utilization. In larger networks with many devices, this approach forces a trade-off. Polling devices too frequently can overload device processors and create bandwidth bottlenecks, both of which can impact production traffic. Polling less frequently leaves gaps between samples and delays the recognition of faults or performance problems, as does SNMP packet loss.
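To put rough numbers on this trade-off, here is a minimal Python sketch that estimates poller load and worst-case detection delay. The device count, object count, and intervals are illustrative assumptions, not measurements from any particular network.

```python
def polling_load(devices: int, objects_per_device: int, interval_s: float) -> float:
    """Average number of SNMP objects requested per second across the network."""
    return devices * objects_per_device / interval_s


def worst_case_detection_delay(interval_s: float, lost_polls: int = 0) -> float:
    """Longest time a fault can go unseen if it starts just after a poll.

    Each consecutive lost poll (for example, a dropped SNMP response)
    adds one more full interval to the delay.
    """
    return interval_s * (1 + lost_polls)


if __name__ == "__main__":
    # Hypothetical fabric: 500 devices, 200 polled objects each.
    for interval in (300, 60, 10):
        load = polling_load(500, 200, interval)
        delay = worst_case_detection_delay(interval, lost_polls=1)
        print(f"interval={interval:>3}s  load={load:,.0f} objects/s  "
              f"worst-case delay with one lost poll={delay:.0f}s")
```

Shortening the interval from 300 seconds to 10 seconds cuts the worst-case delay by a factor of 30, but it also multiplies the request load by 30, which is the tension described above.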

For these reasons, many organizations are relying more on push-based network telemetry models, in which devices send important data without waiting to be polled. These models employ highly efficient protocols and structured data models that align better with machine-to-machine big data collection and analytics approaches. When the data updates are scheduled at regular intervals, this is often called “streaming telemetry.”
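The difference between scheduled updates and other push styles can be sketched in a few lines of Python. This is a simplified simulation, not any vendor's or protocol's API: the `Update` type, the function names, and the path string are hypothetical. The first function emits on a fixed schedule (the "streaming" case), and the second emits only when the value changes, which is one common way a device can push data outside a fixed schedule.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Update:
    timestamp: int  # seconds
    path: str       # name of the reported data item (hypothetical)
    value: int


def stream_sampled(path: str, readings: List[Tuple[int, int]],
                   interval_s: int) -> Iterator[Update]:
    """Push an update whenever at least interval_s has passed since the last one."""
    next_due = None
    for ts, value in readings:
        if next_due is None or ts >= next_due:
            yield Update(ts, path, value)
            next_due = ts + interval_s


def stream_on_change(path: str,
                     readings: List[Tuple[int, int]]) -> Iterator[Update]:
    """Push an update only when the value differs from the last one sent."""
    last = object()  # sentinel that never equals a real reading
    for ts, value in readings:
        if value != last:
            yield Update(ts, path, value)
            last = value


# (timestamp, value) pairs as the device observes them locally.
READINGS = [(0, 1), (5, 1), (10, 1), (15, 2), (20, 2)]
PATH = "interface/eth1/oper-status"  # illustrative name only

sampled = list(stream_sampled(PATH, READINGS, interval_s=10))
on_change = list(stream_on_change(PATH, READINGS))
```

In both cases the device decides when to send, so the collector never has to issue a request per data item, which is where the efficiency gain over polling comes from.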

Even the IETF, the home of SNMP, has embraced the benefits of moving to network telemetry, but much of the industry momentum is led by hyperscale cloud operators and affiliated groups such as OpenConfig.

Figure 1. Streaming Telemetry Overview
(source: Google/OpenConfig at NANOG71)

A few of these hyperscalers have been pushing hard to move to exclusive use of streaming telemetry, and in 2018 Google even declared provocatively that “SNMP is dead.” However, the reality for most organizations is that SNMP will continue to live on for a variety of reasons, especially to monitor device health and utilization, but telemetry data will play an increasingly important role in their visibility frameworks, especially for traffic flow data.

Traffic Flow Visibility: Embedded Monitoring vs. Out-of-band Networks

The broad adoption of application virtualization and multi-tiered applications means applications are no longer tied to specific devices or locations in the network. Understanding where applications live and how they are communicating becomes harder still as private clouds become more distributed across multiple data center and edge sites with high-availability, active-active architectures, and as applications migrate between sites or run in multiple sites simultaneously. As a result, effective visibility to support application performance and security assurance requires seeing much more than just the state and utilization of network devices and links. It requires visibility into traffic flows.

A traffic flow is simply a set of packet transmissions between an entry point and an exit point in a network, typically between two IP addresses. Figure 2 shows a flow between two physical computing devices, but the flow endpoints may also be virtual machines, containerized applications, storage devices or other devices. Flows can be further categorized by TCP source and destination ports and protocol.

Figure 2. A traffic flow through a fabric between a client and server
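As a concrete illustration of this definition, the following Python sketch groups packets into flows keyed by addresses, protocol, and ports, then counts packets and bytes per flow. The type names, field names, and sample addresses are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


class Packet(NamedTuple):
    key: FlowKey
    length: int  # bytes


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Sum packet and byte counts for every distinct flow key."""
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0}
    )
    for pkt in packets:
        flows[pkt.key]["packets"] += 1
        flows[pkt.key]["bytes"] += pkt.length
    return dict(flows)


# Two packets of a web flow and one packet of a database flow (made-up data).
web = FlowKey("10.0.0.5", "10.0.1.9", "tcp", 51000, 443)
db = FlowKey("10.0.1.9", "10.0.2.7", "tcp", 42000, 5432)
table = aggregate_flows([Packet(web, 1500), Packet(db, 200), Packet(web, 900)])
```

Here `table[web]` holds two packets and 2,400 bytes, while `table[db]` holds one packet and 200 bytes. A flow record exported by a switch carries essentially this kind of per-key summary.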

Embedded Flow Monitoring

Network devices have historically offered TCP/IP flow-monitoring capabilities, such as sFlow and Cisco’s proprietary NetFlow (the basis for the IETF’s IPFIX standard). Due to device CPU limitations, typically only a small percentage of flows could be monitored, and in some cases only a small percentage of packets from each flow could be sampled, especially on higher-speed ports. More comprehensive embedded flow monitoring required very high-end routers with custom ASICs and high-powered CPUs on every line card, which few network operators could cost-justify. As a result, many NetOps teams found that embedded flow monitoring offered limited visibility, making it extremely difficult to understand which applications were using the network and to troubleshoot application issues when they arose.
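A short calculation shows why sampling limits visibility. The sketch below assumes each packet is sampled independently with probability 1/N, which is a simplification of how real samplers behave, and the rate of 1-in-1,000 is an illustrative value.

```python
def estimate_total_packets(sampled_packets: int, sampling_rate: int) -> int:
    """Scale a sampled count back up under 1-in-sampling_rate sampling."""
    return sampled_packets * sampling_rate


def probability_flow_missed(flow_packets: int, sampling_rate: int) -> float:
    """Chance that no packet of a flow is sampled.

    Assumes every packet is sampled independently with
    probability 1 / sampling_rate.
    """
    return (1 - 1 / sampling_rate) ** flow_packets


if __name__ == "__main__":
    rate = 1000  # hypothetical 1-in-1,000 sampling on a high-speed port
    for size in (10, 1_000, 10_000):
        missed = probability_flow_missed(size, rate)
        print(f"{size:>6}-packet flow: {missed:.1%} chance of being invisible")
```

Under these assumptions, a 10-packet flow goes completely unseen about 99% of the time. Scaled-up totals can still be reasonable for large flows, but short transactions largely vanish from the data.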

These trade-offs in embedded flow monitoring are starting to disappear thanks to the increasing power of merchant silicon switching ASICs from Broadcom and others and the availability of higher-powered CPUs on even the most cost-effective disaggregated white box switching and routing platforms. The Pluribus monitoring, telemetry, and analytics solution builds on these hardware advances to provide visibility into 100% of TCP connections across the entire network fabric at full line rate (i.e., with no performance degradation).

Figure 3. Pluribus Monitoring and Analytics Solution

The Netvisor ONE network operating system (NOS), running on disaggregated white box switches, incorporates comprehensive built-in data collection and streams data via Netvisor flow (nvFlow) telemetry to the UNUM management and analytics system. UNUM can be scaled from smaller virtual machine deployments to high-scale, high-availability hardware platforms depending on the network size and customer requirements.

Pluribus UNUM Insight Analytics software provides data visualization and analytics to monitor every flow from every device or virtualized workload to support use cases from network troubleshooting to application performance assurance and security threat isolation.

Out-of-Band Monitoring Networks

As application and security visibility demands have increased, some NetOps and SecOps teams have deployed sophisticated packet processing and analysis tools. These tools ingest “out-of-band” traffic flows replicated from the production network, derived from either a passive traffic Test Access Point (TAP) or from a production network switch that replicates traffic to a special port for monitoring, a technique often referred to as port mirroring or Switched Port Analyzer (SPAN).

As the number of tools increased for different use cases, and the number of production network monitoring locations increased to reduce visibility gaps, new problems arose: how to ensure any traffic flow from any part of the production network can reach any monitoring device or tool, and how to maximize the utilization of these tools, some of which can be extremely expensive.

Enter the Network Packet Broker (NPB), a specialized type of network devoted to routing and processing out-of-band traffic flows from TAP, SPAN, and mirror sources to monitoring tools (Figure 4). At its simplest, an NPB may be just a single switching device, but as networks and tool farms grow, the NPB needs to scale to incorporate more ports and more NPB devices, much like a production network.

Figure 4. Network Packet Broker
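The core job of a packet broker, steering copies of mirrored traffic to the right tools, can be pictured as a match-and-replicate rule table. The Python sketch below is a conceptual model only; the rule fields, tool names, and source names are hypothetical and do not represent any product's configuration syntax.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MirroredPacket:
    source: str    # TAP or SPAN port the copy arrived on
    protocol: str  # e.g. "tcp"
    dst_port: int


@dataclass(frozen=True)
class Rule:
    tool: str                     # monitoring tool that receives matches
    source: Optional[str] = None  # None means "match any"
    protocol: Optional[str] = None
    dst_port: Optional[int] = None

    def matches(self, pkt: MirroredPacket) -> bool:
        return (
            self.source in (None, pkt.source)
            and self.protocol in (None, pkt.protocol)
            and self.dst_port in (None, pkt.dst_port)
        )


def tools_for(pkt: MirroredPacket, rules: List[Rule]) -> List[str]:
    """Return each tool that should get a copy of pkt, in rule order."""
    selected: List[str] = []
    for rule in rules:
        if rule.matches(pkt) and rule.tool not in selected:
            selected.append(rule.tool)
    return selected


RULES = [
    Rule(tool="ids"),                                # sees all traffic
    Rule(tool="apm", protocol="tcp", dst_port=443),  # HTTPS only
    Rule(tool="recorder", source="tap-edge-1"),      # one site only
]

https_copy = MirroredPacket("tap-edge-1", "tcp", 443)
dns_copy = MirroredPacket("span-dc1", "udp", 53)
```

With these rules, `tools_for(https_copy, RULES)` selects all three tools, while `tools_for(dns_copy, RULES)` selects only `ids`. Filtering of this kind is one way each expensive tool can be limited to the traffic it actually needs.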

Many NPBs have been built using special-purpose hardware that can provide high levels of sophisticated packet processing and deep packet inspection. Unfortunately, these high-end NPBs can be costly and complex to deploy and scale as needs change, and as a result they tend to be used only in selected network locations, or not at all. While many enterprises would like the increased visibility that an NPB can provide, according to EMA Research, only 46% of enterprises are actually using them today.

Some of these cost and scalability challenges are being addressed by rethinking the NPB architecture and creating a “software-defined packet broker” (SDPB). SDPBs, such as the Pluribus NPB solution, are built on a foundation of disaggregated network switching, using open networking hardware based on commodity switching silicon and disaggregated network operating system (NOS) software. SDPBs can lower cost and increase scalability, enabling wider deployment for increased visibility and observability.

Some customers may adopt a hybrid approach, using SDPBs to ensure broad coverage and visibility by aggregating traffic flows from every site of a highly distributed data center network, while employing a limited number of higher-cost NPB nodes for specialized packet pre-processing and deep inspection at a centralized “tool farm.”

Summary of Flow Monitoring Options

As the above discussion illustrates, there is no one-size-fits-all approach for flow monitoring and traffic visibility in a data center fabric, but the available options are improving. The limitations of traditional embedded flow monitoring and the high costs of traditional NPBs are being overcome with newer approaches. Table 1 summarizes some of the key options and trade-offs to consider.

For some network operators, the best answer may be advanced embedded flow monitoring and analytics that can provide complete traffic flow visibility without requiring the extra cost of an NPB. For those who do need to complement embedded monitoring with NPB functionality, an SDPB may offer the best option for controlling cost and maximizing visibility.

Table 1. Options for Data Center Fabric Traffic Flow Monitoring

Summary

Achieving pervasive visibility across distributed, multi-site private cloud data center fabrics is becoming increasingly important for NetOps and SecOps teams. Fortunately, with the right network monitoring technologies, this goal is also increasingly achievable and affordable. Advanced embedded flow monitoring and telemetry can be built into the data center fabric to achieve 100% flow visibility across every data center and edge site and every application, and enable sophisticated analytics for use cases from network and application performance monitoring to security threat detection and isolation. If needed, out-of-band monitoring can be added to complement embedded monitoring using software-defined packet brokers for scalable and cost-effective traffic aggregation from every site to centralized filtering and analysis tools.

In our next two blogs, we will bring together all of the trends discussed in this series in two illustrative use cases. The first is an enterprise use case for active-active data centers.

About the Author

Jay Gill

Jay Gill is Senior Director of Marketing at Pluribus Networks, responsible for product marketing and open networking thought leadership. Prior to Pluribus, he guided product marketing for optical networking at Infinera, and held a variety of positions at Cisco focused on growing the company’s service provider business. Earlier in his career, Jay worked in engineering and product development at several service providers including both incumbents and startups. Jay holds a BSEE and MSEE from Stanford and an MBA from UCLA Anderson.