In the last blog we talked about mainstream automation approaches: Linux scripting tools and automation framework platforms, third-party multi-vendor external automation systems and vendor-owned external automation systems. We concluded that while scripting plus Ansible is a popular approach to network automation, it is still a heavy lift for the NetOps team, and scripts need to be continuously updated and managed as vendors of the underlying network solutions release new software versions with new capabilities. We also reviewed external automation solutions, whether third-party or vendor-based, and concluded that they often struggle to keep pace with changes in the underlying vendor network solutions. This in turn forces the NetOps team back to box-by-box CLI or to writing scripts for the features the external systems do not yet automate.
SDN as an Approach to Network Automation
There are many different definitions out there around Software Defined Networking. At Pluribus we view SDN from the value perspective not the architectural perspective. SDN should allow a NetOps team to create a virtualized network that offers single-touch fabric-wide automation where network services can be deployed at the speed of cloud. With this as a first principle, one can start from a clean sheet of paper and consider alternative approaches to network automation. Most come to the same conclusion, which is to design the network operating system and associated networking elements from the beginning with automation in mind. This makes it possible to deliver a much more innovative solution that can provide a quantum leap in network automation capabilities.
This is the approach that Pluribus Networks takes, as does Cisco with Application Centric Infrastructure (ACI). This of course raises the question: why did Cisco invest hundreds of millions of dollars in research and development to build a new SDN-automated data center network solution when they already had a successful portfolio of Nexus data center switches supporting the BGP EVPN fabric approach? Because the Cisco team is smart and realized that if network automation were designed into the networking solution from the beginning, they could change the way network teams operate to deliver services faster, with better consistency and with lower operational effort. This in turn would give Cisco a highly automated solution that customers would desire and allow them to better compete with companies that do not offer SDN automation such as Arista, Juniper, Aruba and Nvidia/Cumulus.
From the Cisco Blog Cisco ACI, What is It?
“As software defined networking becomes more popular and even necessary, Cisco ACI changes the way we’ve traditionally thought about networking. Traditional networking uses an imperative model which basically means we control what the network devices do. We give them commands and expect them to follow them as “written.” ACI uses a declarative control system where we specify what we want the end result to be and the network devices interpret it and do what they need to return that result.”
However, the challenge with ACI is the architecture and the associated cost and scale. Cisco ACI requires special operating system releases, special Nexus ACI Mode switch hardware with custom ASICs and firmware, as well as Application Policy Infrastructure Controllers (APIC). The recommended deployment for ACI for two or more sites is based on Cisco ACI 3.0, which is shown on the right-hand side of the diagram below. This Multi-Site Architecture requires three APICs to be deployed at every site and a controller of controllers called the Multi-site Orchestrator (MSO) to stitch these sites together. Clearly this is a lot of external components to automate the network, components that have to be licensed on Day 0 and that impose additional costs for future features via the Cisco SIA license. These external components also consume space and power unnecessarily and add integration and deployment complexity.
Furthermore, the fundamental ACI architecture requires the leaves (top of rack switches) to register with the spine using a proprietary COOP protocol, and all forwarding decisions are made by the spines. This means that the architecture is closed, requiring that both leaf and spine come from Cisco; it limits the topology to leaf/spine only and typically requires a total hardware refresh of both leaf and spine layers to deploy ACI. This architecture also puts significant strain on the spines, as they need to take complex action on all aggregated packets, and this limits the scalability of the overall fabric solution.
Pluribus has taken a directionally similar approach by delivering underlay and overlay fabric automation using an SDN control plane and a declarative model based on networking “objects”.
However, the Pluribus architecture is much more elegant and as a result more scalable and lower cost:
- The Pluribus Netvisor ONE operating system runs on cost-effective commodity disaggregated switching hardware instead of proprietary hardware eliminating costly custom ASICs and vendor lock-in.
- The Pluribus Adaptive Cloud Fabric SDN control plane is integrated into the Netvisor ONE OS. The ACF SDN control plane leverages the multi-core CPUs, DRAM and solid state drives that are built into every data center switch as a distributed compute platform, eliminating the expense and complexity of 3 or more external controllers at every DC site and the controller-of-controllers required for multi-site data centers.
- The SDN intelligence and distributed state database are contained in the leaf switches and thus the spines are only used for simple layer 3 transport. This results in a solution that is much more scalable and can work with any existing third-party spines. It also allows the network architecture team to design fabrics in topologies other than leaf/spine if required, such as rings, double star and many others.
- This approach also dramatically simplifies stretching across sites because VXLAN tunnels are set up leaf-to-leaf over any layer 3 transport and there is no complex issue around site-to-site controller synchronization because the fabric intelligence is built into the leaf switches directly and the fabric intelligence ensures real time synchronization.
- Eliminating external controllers also makes the solution optimal for distributed edge data center sites which can be power and space constrained.
Effectively this results in SDN automation that is not only superior to Cisco ACI but that also eliminates the complexity and cost of multiple external controllers and the scalability issues of a spine-based SDN architecture.
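To make the imperative versus declarative distinction above concrete, here is a minimal Python sketch of a declarative workflow: the operator states the desired set of network objects and a planner works out the steps. The object names, the `plan` function and the dictionary layout are all hypothetical and chosen for illustration; this is not how Pluribus or ACI is implemented.

```python
from typing import Dict, List, Tuple

# Hypothetical object model: object name -> attributes.
# This is illustrative only, not Pluribus or ACI syntax.
State = Dict[str, Dict[str, str]]


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, object) steps that move `actual` to `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


desired = {
    "vrf-BLUE": {"subnet": "10.1.101.0/24"},
    "vlan-101": {"scope": "fabric"},
}
actual = {
    "vlan-101": {"scope": "local"},
    "vlan-999": {"scope": "fabric"},
}
print(plan(desired, actual))
```

In an imperative model the operator would have to work out and type each of those steps on each device; in a declarative model the operator only edits `desired`.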
BGP EVPN or SDN for DC Fabrics – Which makes the most sense?
This question is often decided based on business objectives or sometimes philosophical principles. From a business objectives standpoint it is quite clear that an SDN, protocol-free approach is dramatically easier to deploy and manage and more scalable than BGP EVPN. However, some NetOps teams prefer BGP EVPN because it is an IETF standard and is implemented by multiple vendors who have not had the resources or have decided not to invest in developing a full SDN fabric solution. Pluribus recently released Netvisor ONE R6.1, which includes a highly automated BGP EVPN implementation as a solution for interoperating with third-party BGP EVPN fabrics as well as for connecting multiple Pluribus fabrics together to create larger fabrics. However, inside each Pluribus fabric we use a highly automated and scalable SDN control plane.
With Pluribus the NetOps team can deploy a network service declaratively, fabric-wide with one or two commands. For example, using the network object ‘VRF’ the NetOps team can use CLI or the equivalent REST API or the graphical UNUM Fabric Manager to deploy a VRF L3 segment fabric-wide with two commands:
vlan-create id 101 scope fabric ports none description BLUE-L2VNI-101 auto-vxlan
subnet-create scope fabric vxlan 500101 vrf BLUE network 10.1.101.0/24 anycast-gw-ip 10.1.101.1
The Pluribus SDN control plane then takes care of deploying the necessary network objects to every switch in the fabric. The implication? As the diagram below shows, in a 32-switch fabric it takes 832 lines of configuration to deploy a similar service using a protocol-based BGP EVPN approach, versus 2 lines of configuration with the Adaptive Cloud Fabric protocol-free SDN control plane. The gap widens as the fabric grows: a DC operator provisioning a 256-switch fabric needs over 5000 lines of config to deploy a single service, while with Pluribus it still only requires 2 commands.
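As a back-of-the-envelope check on those figures, the short Python sketch below derives the per-switch line count from the 32-switch number and projects it to 256 switches. The constant per-switch cost is an assumption made for this illustration (the real count depends on the design), and the function name is hypothetical.

```python
# Figures quoted in the text: one service on a 32-switch BGP EVPN fabric
# takes 832 lines of configuration; the fabric-wide SDN approach takes 2.
EVPN_LINES_32 = 832
SWITCHES_32 = 32
SDN_COMMANDS = 2

# 832 / 32 divides evenly, giving the implied per-switch cost.
lines_per_switch = EVPN_LINES_32 // SWITCHES_32


def evpn_lines(switches: int) -> int:
    """Projected EVPN config lines for one service.

    Assumption (not from the source): the per-switch line count stays
    constant as the fabric grows.
    """
    return lines_per_switch * switches


for n in (32, 256):
    print(f"{n} switches: EVPN ~{evpn_lines(n)} lines, SDN {SDN_COMMANDS} commands")
```

Under that assumption the protocol-based approach costs 26 lines per switch per service, which is consistent with the "over 5000 lines" quoted for 256 switches, while the fabric-scoped approach stays at 2 commands regardless of size.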
Furthermore, if any single switch in the fabric cannot accept the config for the network service that config is rolled back from all switches in the fabric until the issue is rectified on the switch in question or that switch is evicted from the fabric. This ensures consistent deployment of a service or a security policy across the fabric.
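The all-or-nothing behavior described above can be sketched in a few lines of Python. The `Switch` class and `fabric_commit` function are hypothetical stand-ins written for this illustration; they show the general apply-then-roll-back pattern and are not the Pluribus implementation.

```python
from typing import List


class Switch:
    """Hypothetical fabric member used only for this sketch."""

    def __init__(self, name: str, accepts: bool = True) -> None:
        self.name = name
        self.accepts = accepts  # False simulates a switch that rejects config
        self.config: List[str] = []

    def apply(self, change: str) -> bool:
        if not self.accepts:
            return False
        self.config.append(change)
        return True

    def rollback(self, change: str) -> None:
        if change in self.config:
            self.config.remove(change)


def fabric_commit(switches: List[Switch], change: str) -> bool:
    """Apply `change` to every switch, or undo it everywhere on failure."""
    applied: List[Switch] = []
    for sw in switches:
        if sw.apply(change):
            applied.append(sw)
        else:
            # One switch refused: remove the change from those that took it.
            for done in applied:
                done.rollback(change)
            return False
    return True


fabric = [Switch("leaf1"), Switch("leaf2", accepts=False), Switch("leaf3")]
ok = fabric_commit(fabric, "vrf BLUE")
print(ok, [sw.config for sw in fabric])
```

Because `leaf2` rejects the change, `leaf1` is rolled back and no switch ends up with a partial deployment.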
Returning to an earlier point, the advantage of having the SDN automation built into the OS rather than in an external automation solution is that there are typically no synchronization issues between the system doing the automation and the system being automated: they are developed, tested and released together. As a result, new features and enhancements are fully automated, pre-integrated and fully tested, so everything works right out of the box.
Ultimately this means that the NetOps team only needs to understand the declarative language of Pluribus or how to use our UNUM GUI and they do not need to go to great lengths to deeply learn one or more scripting languages. That said, while the Pluribus solution is highly automated it is also highly programmable, with a REST API that is also fabric-wide and 100% equivalent to the CLI. So NetOps teams that want to program the Pluribus Adaptive Cloud Fabric infrastructure with traditional Linux tools can easily do so with tools such as Python, Ansible or the Pluribus UNUM Fabric Manager.
In the next blog my colleague Jay Gill will talk about network visibility and analytics. It is obviously critical to be able to measure performance and have the tools for rapid troubleshooting when anomalies are discovered. Stay tuned.
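For teams that do want to script, here is a minimal Python sketch that renders the two fabric-scoped commands shown earlier in this post from a handful of parameters. The function name, its parameters and the `<VRF>-L2VNI-<vlan>` description convention are assumptions taken from that single example, and how the commands are delivered to the fabric (CLI session, REST API or Ansible) is deliberately left out.

```python
import ipaddress
from typing import List


def service_commands(vlan_id: int, vxlan_id: int, vrf: str,
                     network: str, gateway: str) -> List[str]:
    """Render the vlan-create and subnet-create commands for one L3 segment.

    Hypothetical helper: only the command syntax seen in this post is used.
    """
    # Catch a mistyped gateway before anything is sent to the fabric.
    if ipaddress.ip_address(gateway) not in ipaddress.ip_network(network):
        raise ValueError(f"{gateway} is not inside {network}")
    description = f"{vrf}-L2VNI-{vlan_id}"
    return [
        f"vlan-create id {vlan_id} scope fabric ports none "
        f"description {description} auto-vxlan",
        f"subnet-create scope fabric vxlan {vxlan_id} vrf {vrf} "
        f"network {network} anycast-gw-ip {gateway}",
    ]


cmds = service_commands(101, 500101, "BLUE", "10.1.101.0/24", "10.1.101.1")
for line in cmds:
    print(line)
```

Since the commands are fabric-scoped, the script produces the same two lines whether the fabric has 32 switches or 256.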
Note: Council of Oracle Protocol (COOP) is used to communicate the mapping information (location and identity) to the spine proxy. A leaf switch forwards endpoint address information to the spine switch ‘Oracle’ using Zero Message Queue (ZMQ). Source: Cisco APIC Security Configuration Guide.
About the Author
Mike is Chief Marketing Officer of Pluribus Networks. Mike has over 20 years of marketing, product management and business development experience in the networking industry. Prior to joining Pluribus, Mike was VP of Global Marketing at Infinera, where he built a world class marketing team and helped drive revenue from $400M to over $800M. Prior to Infinera, Mike led product marketing across Cisco’s $6B service provider routing, switching and optical portfolio and launched iconic products such as the CRS and ASR routers. He has also held senior positions at Juniper Networks, Pacific Broadband and Motorola.