Figure 1 shows the CORD network infrastructure. The underlay fabric at the heart of the implementation is responsible for interconnecting all parts of CORD, including access I/O blades, compute nodes (servers), and upstream routers. In addition, the fabric control application that runs on ONOS interacts with a number of other applications to provide CORD services.
Figure 1. CORD Architecture.
In the current implementation, there are actually two sets of ONOS controllers with different responsibilities. The first ONOS cluster (onos-cord) is responsible for the overlay infrastructure (virtual networking and service composition) and the access infrastructure. This cluster hosts the VTN and vOLT applications, respectively.
The second ONOS cluster (onos-fabric) is responsible for controlling the fabric and interfacing with conventional upstream routers. This cluster hosts the Fabric Control and vRouter applications, respectively.
Multicast control is handled by two additional applications, IGMP Snooping and PIM-SSM; the former runs on onos-cord and the latter on onos-fabric. (For simplicity, we show only a single Multicast Control application in Figure 1.)
In principle, all the applications could run on a single ONOS cluster. We chose to split responsibilities to have better isolation and separation of concerns, which was especially helpful during development. However, to simplify the exposition and diagrams, we show all the applications running on a single ONOS cluster in Figure 1.
Services in CORD are composed using the best practices of cloud operations. In CORD, services are instantiated by the network operator using XOS (Figure 5). In turn, XOS presents ONOS with a service graph for subscriber traffic. This graph is then decomposed into flow rules that the VTN application in ONOS programs into the CORD networking infrastructure. Details of XOS and service graphs are discussed in other design notes. This section gives a brief overview of the implementation choices made for realizing service composition in the network infrastructure.
Figure 5. Service Composition.
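To make the decomposition concrete, here is a minimal sketch (not the actual VTN code; all names and the graph contents are illustrative) of how a linear service graph might be flattened into per-hop entries that flow rules are then derived from:

```python
# Illustrative sketch: flattening a linear service graph into per-hop
# (from_vn, to_vn) pairs. These names are hypothetical, not VTN APIs.

def decompose_service_graph(graph):
    """Turn an ordered service graph into (from_vn, to_vn) hops."""
    return list(zip(graph, graph[1:]))

# Hypothetical example: a subscriber's traffic traverses the vSG,
# then a firewall service, then heads out to the Internet.
graph = ["vSG-VN", "firewall-VN", "internet"]
print(decompose_service_graph(graph))
# -> [('vSG-VN', 'firewall-VN'), ('firewall-VN', 'internet')]
```

Each resulting hop would map to flow rules in the OVS pipeline steering traffic from one service VN toward the next.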
In CORD, service composition is implemented using overlay-networks and network virtualization. At a high level,
Services have their own virtual networks (VNs)—the VMs or containers that instantiate a service are part of the same virtual network, and these instances can be created anywhere in the CORD cloud (i.e., on any compute node in any rack).
It is possible to dynamically grow and shrink the number of VMs/containers (and hence the virtual network) that instantiate a service. This feature is essential for achieving scale in the cloud.
Each compute node hosts VMs or containers (belonging to multiple service VNs) that are connected to OVS, which acts as a highly programmable, OpenFlow-controlled hypervisor switch.
Each virtual network (or service) has its own load balancer, distributed across every OVS in the network. The load balancer's job is to select a VM instance of the service from among all the VMs within the service's virtual network.
Service composition walkthrough: Suppose VM S1B, an instance in the Service 1 VN, determines that the next service some traffic must traverse is Service 2. Instead of sending the traffic directly to a VM in Service 2, it sends it to the load balancer for Service 2, which then selects a VM (say S2C). It is also possible that the operator wishes traffic to go from S2C directly to a specific VM, say S3A, instead of through the load balancer for Service 3.
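The per-service load-balancing step in this walkthrough can be sketched as follows. This is only an illustration of hash-based instance selection with an operator override, not the actual OVS mechanism; the registry, the pin table, and all names are hypothetical:

```python
import zlib

# Hypothetical registry of service VNs -> live VM instances.
SERVICE_INSTANCES = {
    "service2": ["S2A", "S2B", "S2C"],
    "service3": ["S3A", "S3B"],
}

# Optional operator pins: (current VM, next service) -> specific instance.
# E.g., traffic from S2C must go straight to S3A, bypassing the load balancer.
PINNED = {("S2C", "service3"): "S3A"}

def next_hop(current_vm, next_service, flow_key):
    """Pick the next VM: honor an operator pin if one exists, otherwise
    hash the flow over the service's instances so each flow sticks to
    one VM for its lifetime."""
    pin = PINNED.get((current_vm, next_service))
    if pin is not None:
        return pin
    instances = SERVICE_INSTANCES[next_service]
    return instances[zlib.crc32(flow_key.encode()) % len(instances)]

print(next_hop("S2C", "service3", "subscriber-42"))  # always S3A (pinned)
```

Hashing on a flow key keeps packets of one flow on the same instance, which matters for stateful services; the pin table models the operator override described above.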
XOS and ONOS’s VTN app coordinate to keep the virtual infrastructure state updated. As VMs are added/deleted for a service, and service chains are created, VTN updates tables in a special purpose OVS pipeline to reflect the desired subscriber service flow. A functional block-diagram of the OVS forwarding pipeline is shown in Figure 6.
Subscriber traffic is NATted by the vSG. After the vSG, subscriber traffic can head directly out to the Internet, or proceed through a series of services before heading out to the Internet. To transport subscriber traffic between services, we use VxLAN encapsulation: VxLAN tunnels interconnect VMs and services. The encapsulated traffic is routed or bridged by the underlay fabric (which we describe next) using infrastructure IP/MAC addresses belonging to the fabric and the compute nodes. Finally, when routing out to the Internet and back, the fabric routes non-encapsulated NATted IP packets, with help from BGP routes learned by the vRouter application.
Figure 6. CORD OvS Pipeline.
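The VxLAN encapsulation described above can be illustrated with the 8-byte VxLAN header itself (per RFC 7348). The sketch below only constructs the header; it is not the OVS datapath code, and the example VNI value is arbitrary:

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid (RFC 7348)

def vxlan_header(vni):
    """Build the 8-byte VxLAN header:
    flags(1) + reserved(3) + VNI(3) + reserved(1).
    The 24-bit VNI identifies the virtual network (here, a service VN)."""
    assert 0 <= vni < 2**24, "VNI is a 24-bit field"
    return (struct.pack("!BBH", VXLAN_FLAG_VNI_VALID, 0, 0) +
            struct.pack("!I", vni << 8))

hdr = vxlan_header(5001)   # 5001 is an arbitrary example VNI
assert len(hdr) == 8
```

The full encapsulation wraps this header (plus outer UDP/IP/Ethernet headers carrying the infrastructure addresses) around the original subscriber frame, which is why the underlay only ever sees fabric and compute-node addresses.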
Fabric Operation as an Underlay
For transporting VxLAN encapsulated traffic between services in the overlay network, the fabric acts as an underlay. The fabric control application’s job is to effectively utilize the cross-sectional bandwidth provided by the leaf-spine architecture. For intra-rack traffic between servers in the same rack, the fabric does regular L2 bridging. For inter-rack traffic between servers in different racks, the fabric does L3 routing (with MPLS). In other words, the underlay fabric behaves like an IP/MPLS network that routes traffic between attached subnets, where each rack has its own subnet. Traffic within a subnet (and therefore within a rack) is L2 bridged.
Figure 7. Fabric as an underlay.
With reference to Figure 7, each rack has its own subnet 10.200.x.0/24. Servers in the same rack are assigned (infrastructure) IP addresses in the same subnet (for example, 10.200.1.11 and 10.200.1.12 in rack 1). Traffic between these servers is bridged by the ToR (leaf) switch they are connected to (s101 in Figure 7). Traffic destined for another rack is routed by the same ToR. Here we use concepts from MPLS Segment Routing: we attach an MPLS label designating the ToR switch in the destination rack, and then hash flows across multiple spines. The spines perform only MPLS label lookups to route the traffic to the destination ToR. A packet walk-through is shown in Figure 8.
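The leaf's bridge-or-route decision described above can be sketched as follows. This is purely illustrative (the real fabric programs this logic into the switch ASIC pipeline); the subnets echo Figure 7, but the label values and switch names for racks 2 and 3 are hypothetical:

```python
import ipaddress
import zlib

# Hypothetical fabric state for leaf switch s101 in rack 1.
LOCAL_SUBNET = ipaddress.ip_network("10.200.1.0/24")
TOR_LABELS = {                                    # globally significant SR labels
    ipaddress.ip_network("10.200.2.0/24"): 102,   # rack 2 ToR (assumed label)
    ipaddress.ip_network("10.200.3.0/24"): 103,   # rack 3 ToR (assumed label)
}
SPINES = ["s201", "s202"]

def forward(dst_ip, flow_key):
    """Bridge intra-rack traffic; for inter-rack traffic, push the
    destination ToR's MPLS label and ECMP-hash the flow to a spine."""
    dst = ipaddress.ip_address(dst_ip)
    if dst in LOCAL_SUBNET:
        return ("bridge", None, None)             # intra-rack: plain L2
    for subnet, label in TOR_LABELS.items():
        if dst in subnet:
            spine = SPINES[zlib.crc32(flow_key.encode()) % len(SPINES)]
            return ("route", label, spine)        # push label, send to spine
    raise ValueError("no route to " + dst_ip)

print(forward("10.200.1.12", "f1"))   # ('bridge', None, None)
```

Note that the spines never appear in the IP lookup at all: once the leaf has pushed the destination ToR's label, a spine only needs a label lookup, which is the edge-core separation discussed next.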
The fabric design choices come out of a desire to have forwarding separation between edge and core (in this case the leaves and spines). Ideally, for a multi-purpose fabric, it is best if the core remains simple and 'same' for all use cases, and only the input to the core changes on a per-use-case basis. In our case, this means the core forwards traffic only on the basis of MPLS labels, whereas the input at the leaf switch can be IPv4, IPv6, multicast, MAC, VLAN, VLAN+MAC, QinQ, or even other MPLS labels.
Figure 8. Overlay and Underlay packet walk-through.
The fact that we use MPLS labels to achieve this edge-core (leaf-spine) separation is immaterial; we could just as well have used VLAN tags. However, in our experience with current ASICs it is easier to use labels than tags, as the two are treated differently in current ASIC pipelines, with different rules to follow and different limitations on usage.
Furthermore, Segment Routing (SR) is simply a convenient way to use MPLS labels. SR gives us globally significant labels, which we assign to each leaf and spine switch. This results in less label state in the network than in traditional MPLS networks, where locally significant labels must be swapped at each node. In addition, SR uses ECMP shortest paths by default, which fits well with a leaf-spine fabric, where leaf switches ECMP-hash traffic across all the spine switches. Finally, SR enables source routing: we can change the path traffic takes through the network simply by changing the label assignment at the source (leaf) switch - there is only one switch to 'touch' instead of the entire 'path' of switches. This is how we have performed elephant-flow traffic engineering in the past using external analytics [REF].
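Source-routed steering of this kind can be sketched as follows: to pin a flow to a particular spine, the source leaf pushes that spine's global label on top of the destination ToR's label, so only the source leaf needs reprogramming. The label values and switch names below are illustrative, not the fabric's actual assignments:

```python
# Globally significant SR labels (illustrative values).
SPINE_LABELS = {"s201": 201, "s202": 202}
TOR_LABELS = {"s102": 102}

def label_stack(dst_tor, pinned_spine=None):
    """Default: a single label (destination ToR), letting ECMP pick a
    spine. To steer an elephant flow, push the chosen spine's label on
    top -- only the source leaf changes, not the whole path."""
    stack = [TOR_LABELS[dst_tor]]
    if pinned_spine is not None:
        stack.insert(0, SPINE_LABELS[pinned_spine])
    return stack

print(label_stack("s102"))           # [102]      -- normal ECMP hashing
print(label_stack("s102", "s202"))   # [202, 102] -- flow steered via s202
```

The pinned spine pops its own label and then routes on the remaining ToR label, so the core forwarding logic is unchanged whether or not a flow is steered.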
In addition, an overlay/underlay architecture simplifies network design, as it introduces a separation of concerns - the underlay hardware fabric can be simple and multi-purpose without knowing anything about virtual networks, while the overlay network can be complex but implemented in software switches where more sophisticated functionality like service-chaining and distributed load-balancing can be achieved. Importantly, this architecture does not mean that we cannot have the overlay and underlay network interact with each other and gain visibility into each other. This is especially true in an SDN network where we have the same controller-based control plane for both overlays and underlays.
Fabric and Multicast
Fabric and vRouter
OpenFlow Data Plane Abstraction (OF-DPA). https://www.broadcom.com/collateral/pb/OF-DPA-PB100-R.pdf
Virtual Subscriber Gateway (vSG). CORD Design Notes (March 2016).
Virtual OLT (vOLT). CORD Design Notes (March 2016).