CrossBow: Solaris Network Virtualization & Resource Provisioning

March 25, 2010 at 6:32 am

1. CrossBow (the name):

It makes sense to explain the relationship between the technology (network virtualization and resource control) and the project name (CrossBow). The crossbow is believed to have been invented around 341 B.C. in China, but its use became prevalent in the Middle Ages, especially once steel was used to make the weapon. More powerful crossbows could penetrate armour at 200 yards and gave the typical horse-mounted knights real nightmares. But the biggest differentiator was simplicity of use: a crossbow could be used effectively after a week of training, while comparable single-shot skill with a longbow could take years of practice.

Similarly, if you look at the existing QOS mechanisms on an end host, they are very difficult to use and normally take a very skilled administrator to apply effectively. Even then, the existing QOS mechanisms come with heavy performance penalties, which is also pretty common with any kind of virtualization. In Solaris land, we have invented a new way of imposing bandwidth resource control as an attribute of a real or virtual NIC, built in as part of the Solaris network stack and carrying no performance penalty. Since the virtualization and resource control aspects are just attributes of the NIC/VNIC (specified when a NIC or virtual NIC is created), a normal user can configure them without needing a doctorate in QOS or virtualization. “CrossBow” was the most suitable name for this project since we are trying to achieve similar results in the field of virtualization and resource control as the weapon did on medieval battlefields.

2. CrossBow (the background):

Crossbow provides the building blocks for network virtualization and resource control by creating virtual stacks around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol (TCP, UDP, SCTP, etc.), or virtual machine technology such as Solaris Containers, Xen, and LDoms.

The project allows the system administrator to carve any physical NIC into multiple virtual NICs (VNICs), which behave much like real NICs and are administered just like real NICs. Each virtual NIC can be assigned its own priority and bandwidth on a shared NIC without causing any performance degradation. The virtual NICs can have their own NIC hardware resources (Rx/Tx rings, DMA channels), MAC addresses, kernel threads, and queues which are private to the VNIC and are not shared across all traffic. In the case of Solaris Containers, a Container can be assigned a virtual stack instance along with one or more virtual NICs. As such, traffic for one VNIC can be totally isolated from other traffic and given limits or guarantees on the amount of bandwidth it can use.

3. Overview:

Project Crossbow extends Solaris' reach in several markets.

3a. OS/Network/Server Consolidation:

These are the application, network, and server consolidation environments where both OS and network virtualization play a big role. This market is typically driven by the cost of owning and managing physical machines and physical networks. The sweet spot for these horizontally scaled environments is the 2-4 socket machine, which appears as a 4-8 CPU machine in the case of x86/x64 systems and a 32-64 CPU machine in the case of Sun's new Niagara-based servers. From a total cost of ownership perspective, these blades have only one physical NIC (1Gb or 10Gb) but are expected to run multiple virtual machines (Xen, Containers, LDoms) which have to share the NIC resources and the available bandwidth.

The problem gets worse because for three decades we have been designing applications to go as fast as possible, leaving any congestion control to the transport layer (if at all). So if one virtual machine is sending UDP-based traffic, other virtual machines on the same system using TCP traffic will suffer badly. Even within the same transport (TCP, for instance), bulk-throughput applications like FTP and HTTP will have a very negative impact on interactive traffic and latency-sensitive applications.

The goal of project Crossbow is to let different virtual machines share the common NIC in a fair manner and to allow system administrators to set preferential policies where necessary (e.g. an ISP selling limited bandwidth on a common pipe) without any performance impact.

3b. Traditional QOS and application consolidation:

Existing host-based QOS mechanisms are very complex to set up and typically come with a sizable performance penalty and increased latency. A big part of the problem is the interrupt-based delivery mechanism for inbound packets, and the fact that QOS is implemented by a separate layer (typically between the NIC driver and IP). The network and transport layers of the host stack are unaware of the QOS layer. The packets are already delivered to host memory by means of interrupts, and the QOS layer must classify the packets into various queues before it can apply the policies. If a packet cannot be processed because the bandwidth usage for its class has been exceeded, it sits in a queue while still consuming system memory.

Project Crossbow integrates stack virtualization and QOS into the stack architecture itself to offer a large subset of QOS-type functionality at zero performance penalty, with simple administrative interfaces. It also integrates diffserv with the stack, so a virtual NIC can set and read diffserv-based labels. Since the Crossbow architecture differentiates traffic based only on layer 2, 3, and 4 headers (i.e. the VLAN tag, local MAC address, local IP address, protocol, and ports), the functionality offered is a subset of existing QOS mechanisms, although it covers 90% of the use cases without any performance penalty. This is the prime reason why project Crossbow refers to the bandwidth-related policies as 'bandwidth resource control' instead of QOS.
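The layer 2/3/4 classification described above amounts to matching packets against tuples of header fields. The sketch below is a hypothetical illustration in Python, not Crossbow code (which does this in NIC hardware or the kernel MAC layer); the `FlowSpec` and `Packet` types are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FlowSpec:
    """A flow policy; any field left as None acts as a wildcard."""
    vlan: Optional[int] = None
    local_mac: Optional[str] = None
    local_ip: Optional[str] = None
    protocol: Optional[str] = None
    local_port: Optional[int] = None

@dataclass
class Packet:
    vlan: int
    local_mac: str
    local_ip: str
    protocol: str
    local_port: int

def matches(spec: FlowSpec, pkt: Packet) -> bool:
    # A packet matches when every non-wildcard field agrees.
    return all(want is None or want == getattr(pkt, field)
               for field, want in vars(spec).items())

def classify(flows, pkt):
    """Return the name of the first matching flow, else None (default lane)."""
    for name, spec in flows:
        if matches(spec, pkt):
            return name
    return None

# Policies akin to 'flowadm add-flow -a transport=tcp,local_port=443 https-1':
flows = [("https-1", FlowSpec(protocol="tcp", local_port=443)),
         ("limit-udp-1", FlowSpec(protocol="udp"))]

pkt = Packet(vlan=0, local_mac="0:e0:81:27:d4:47",
             local_ip="192.168.1.10", protocol="tcp", local_port=443)
print(classify(flows, pkt))  # -> https-1
```

Because only these five header fields drive the match, the model captures why this is a subset of general QOS: no deep packet inspection, no application-level classification.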

3c. Horizontally scaled markets:

This is the market segment made up of low-priced volume servers (typically 2-4 socket machines) offering services that require little or no sharing of data between them. The small servers can be standalone machines in a rack or blades in a chassis. Grids are another way to use volume servers to achieve the output of traditional large SMP machines or mainframes.

In the case of blades which share a common 10Gb NIC to the chassis, Crossbow again provides sharing of the bandwidth in a fair manner. In addition, the Crossbow-provided APIs for network management, virtualization, and bandwidth resource control can be used by third-party management software to propagate a common policy throughout the server farm or all the blades in the chassis. In a Solaris-based homogeneous environment, it is very easy to mark an application or a virtual machine (based on port or IP address) as critical and propagate the same policy through all the machines. The diffserv labels can be added appropriately so that the policy is honored by all machines and network elements in the data center.

4. Technical problems in existing architectures:

As mentioned earlier, host-based QOS systems work as a layer between the NIC driver and the network stack, and as such are pretty inefficient in providing the QOS services required of them. But that is not all.

The existing interrupt-driven packet delivery model precludes any kind of policy enforcement and fair sharing. When a NIC interrupt is raised, it runs at the highest priority and the CPU has to context-switch away from whatever it was processing to deal with the interrupt. Much of the time, the processing of a critical packet is interrupted to deal with the arrival of a non-critical packet.

Anonymous packet processing in the kernel is another major problem in virtualizing the stack and enforcing any kind of bandwidth resource control (including fairness). By the time the stack determines that no one is actually interested in an incoming packet and it needs to be dropped, 80% of the work has already been done. In other words, the cost of dropping unwanted packets is too high.

Everything in the host flows through common queues and is processed by common threads, which makes enforcing policies based on traffic type very difficult. Receiving or transmitting each packet impacts the processing of any other packet on that particular CPU.

In most virtualized environments, the pseudo NIC in the virtual machine has no way of knowing about the capabilities of the real hardware (even simple things like hardware checksum) because of the bridge in between, which ends up hurting performance. In addition, there is no mechanism to share the NIC in a fair manner. The transition of a typical packet from dom0 to domU also causes severe performance problems.

5. CrossBow Architecture:

The Crossbow architecture starts out by integrating network virtualization and resource control into the stack architecture itself. The Solaris 10 network stack was already designed for the next decade: connection-to-CPU affinity is maintained, and the upper stack has tight control over the NIC resources.

Crossbow builds on top of that by pushing the classification of packets based on services, protocols, or virtual machines as far down as possible. If the NIC hardware itself has the ability to divide its on-board memory into segments/queues (known as Rx and Tx rings), which can preferably have their own DMA channels and MSI-X interrupts, the stack programs the NIC classifier to steer packets, based on the configured policies, to different Rx rings. Each Rx/Tx ring is owned by a CPU and a separate kernel queue known as an SRS (soft ring set), which controls the rate of packet arrival into the system based on the configured bandwidth.

The Rx/Tx ring, the associated DMA channel, MSI-X interrupt, serialization queue, CPU, and processing threads are all unique to the service, protocol, or virtual machine in question and can be assigned a unique MAC address and a virtual NIC, which becomes the administrative entity that can be administered like a normal NIC. The NIC classifier steers incoming packets to the correct Rx ring, from where the SRS owning the Rx ring (and VNIC) pulls the packets in polling mode, based on fair sharing of resources or the configured bandwidth. As shown in the figure below, each such path is called a hardware lane; each hardware lane has dedicated hardware and software resources and operates independently of the others. The interrupt mode is used only when the SRS has no packets to process and the Rx ring is empty; each individual Rx ring is dynamically switched between interrupt and polling mode. Incoming packets that exceed the configured bandwidth limit remain in the NIC itself, in their corresponding Rx ring, and are pulled into the system only when they are ready to be processed.

[Figure: Crossbow Hardware Lanes]
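The dynamic switch between interrupt and polling mode, together with the bandwidth cap, can be sketched roughly as follows. This is a toy Python model, not the kernel implementation; the class and field names are invented, and a simple per-tick byte budget stands in for the real bandwidth accounting.

```python
from collections import deque

class SoftRingSet:
    """Toy SRS: pulls packets from its Rx ring under a per-tick byte budget."""

    def __init__(self, max_bytes_per_tick):
        self.ring = deque()            # packet sizes parked in the Rx ring
        self.tokens = 0                # remaining byte budget this tick
        self.max_bytes_per_tick = max_bytes_per_tick
        self.polling = False           # True => NIC interrupts disabled

    def rx_interrupt(self, pkt_bytes):
        """Interrupt path: the first packet flips the ring into polling mode."""
        self.ring.append(pkt_bytes)
        self.polling = True

    def tick(self):
        """Refill the bandwidth budget once per clock tick."""
        self.tokens = self.max_bytes_per_tick

    def poll(self):
        """Pull what the budget allows; excess packets stay parked in the NIC."""
        delivered = []
        while self.ring and self.ring[0] <= self.tokens:
            size = self.ring.popleft()
            self.tokens -= size
            delivered.append(size)
        if not self.ring:
            self.polling = False       # ring drained: re-enable interrupts
        return delivered

srs = SoftRingSet(max_bytes_per_tick=3000)
for size in (1500, 1500, 1500):        # three full-size frames arrive
    srs.rx_interrupt(size)
srs.tick()
print(srs.poll())    # -> [1500, 1500]  (third frame exceeds the budget)
print(srs.polling)   # -> True          (a frame is still parked, keep polling)
```

Note how packets over the limit never enter host memory in this model; they wait in the ring until the next refill, which is the behavior the paragraph above describes.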

The creation of an administrative entity (VNIC) is optional and is typically associated with a virtual machine like Solaris Containers, Xen, or LDoms. For application- or protocol-based resource control, a separate data path is created to provide the isolation and resource control, but no VNIC is configured.

As mentioned above, the VNIC is just an administrative entity. If the classification has already been done by the NIC to a particular Rx ring, the packets are delivered directly to the IP layer by means of function calls when the Rx ring is in interrupt mode, or the SRS residing in the MAC layer pulls the packet chain directly from the Rx ring when in polling mode. In essence, the entire data link layer is bypassed, resulting in improved performance and lower latencies. If the VNIC is placed in promiscuous mode, the data link bypass is abandoned and the Rx ring delivers packets via the VNIC layer, which creates a copy of each packet for the promiscuous stream.

The entire layered architecture is built on function pointers known as 'upcall_func' and 'downcall_func', with corresponding 'upcall_arg' and 'downcall_arg' for context. Every layer provides a pointer to its receive function as 'upcall_func' and a context as 'upcall_arg' to the layer below. Similarly, every layer provides a pointer to its transmit function as 'downcall_func' and a context cookie as 'downcall_arg' to the layer above. This is how the packet path is constructed. Any layer can short-circuit itself out by providing the 'upcall_func' and 'upcall_arg' of the layer above to the layer below (and the same on the transmit side if needed). All context cookies are reference counted: each layer pointing to a cookie takes a reference, ensuring that the data structures are not freed until all references are dropped.
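The wiring described above can be modeled in a few lines. This is a toy Python model of the receive side only (the real code is kernel C, and the layer names here are invented); reference counting is omitted for brevity.

```python
class Layer:
    """One stack layer holding the upcall of the layer above it."""

    def __init__(self, name, upcall_func=None, upcall_arg=None):
        self.name = name
        self.upcall_func = upcall_func   # recv entry point of the layer above
        self.upcall_arg = upcall_arg     # that layer's context cookie

    def recv(self, arg, pkt):
        # Tag the packet with this layer's name, then pass it up
        # (if anyone above registered an upcall).
        pkt = pkt + [self.name]
        if self.upcall_func:
            return self.upcall_func(self.upcall_arg, pkt)
        return pkt

# Wire up ip <- vnic <- mac, each layer handing its recv down as the upcall.
ip = Layer("ip")
vnic = Layer("vnic", upcall_func=ip.recv, upcall_arg=ip)
mac = Layer("mac", upcall_func=vnic.recv, upcall_arg=vnic)

print(mac.recv(mac, []))   # -> ['mac', 'vnic', 'ip']

# Data-link bypass: the VNIC splices itself out by handing the layer below
# the upcall it was given (IP's recv and cookie).
mac.upcall_func, mac.upcall_arg = vnic.upcall_func, vnic.upcall_arg
print(mac.recv(mac, []))   # -> ['mac', 'ip']
```

The point of the pattern is that bypass is just pointer re-wiring: no layer needs to know it has been short-circuited, which is what makes the zero-cost data link bypass possible.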

If the NIC hardware does not have classification capability (unlikely, since most Intel, Broadcom, and Sun 1Gb NICs and pretty much all 10Gb NICs shipping for the past several years have it), or has run out of classification resources, the architecture provides classification in the MAC layer and employs soft rings, which offer functionality similar to the NIC hardware classifier and Rx rings. The NIC hardware layer, coupled with the lower MAC layer and soft rings, is termed the 'pseudo hardware layer'. A request by the administrator to create a new VNIC or flow will always succeed in the pseudo hardware layer, which manages the hardware and software classification capability, Rx rings, and soft rings transparently to the upper layers.

6. The administrative model:

Crossbow introduces a new command called ‘flowadm’ and further augments ‘dladm’ which was introduced as part of the new high performance device driver framework (GLDv3) in Solaris 10.

‘dladm (1M)’ – This is primarily used to create, modify, and destroy VNICs based on MAC addresses. A created VNIC is visible to and managed by ifconfig just like any other NIC and can get its IP address assigned via DHCP if necessary.

The examples below can illustrate this better:


Example 1: Configuring VNICs

To create two VNIC interfaces with vnic-ids 1 and 2
over a single physical device bge0, enter the following
commands:

# dladm create-vnic -l bge0 vnic1
# dladm create-vnic -l bge0 vnic2

The new links will be called vnic1 and vnic2.

Example 2: Configuring VNICs and allocating bandwidth & priority

To create two VNIC interfaces with vnic-ids 1 and 2
over a single physical device bge0, making vnic1 a higher-priority
VNIC using a factory-assigned MAC address with a guarantee
of up to 90% of the bandwidth, and vnic2 a lower-priority VNIC
with a random MAC address and a hard limit of 100Mbps:

# dladm create-vnic -l bge0 -m factory -b 90% -G -p high vnic1
# dladm create-vnic -l bge0 -m random -b 100M -L -p low vnic2

Example 3: Configure a VNIC by choosing a factory MAC address

To create a VNIC interface with vnic-id 1 by first
listing the factory-available MAC addresses and then using one
of them:

# dladm show-dev -l bge0 -m
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 No

# dladm create-vnic -l bge0 -m factory -n 2 vnic1

# dladm show-dev -l bge0
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 Yes

Example 4: Configuring VNICs sharing a MAC address

To create two VNICs with vnic-ids 1 and 2 by first listing the
available factory-assigned MAC addresses and then picking one
that will be shared by the newly created VNICs:

# dladm show-dev -l bge0 -m
bge0
link: up speed: 1000 Mbps duplex: full
MAC addresses:
slot-ident Address In Use
1 0:e0:81:27:d4:47 Yes
2 8:0:20:fe:4e:a5 No

# dladm create-vnic -l bge0 -m shared -n 2 vnic1
# dladm create-vnic -l bge0 -m shared -n 2 vnic2

Example 5: Creating a VNIC with user specified MAC address

To create a VNIC with vnic-id 1 by providing a user-specified
MAC address:

# dladm create-vnic -l bge0 -m 8:0:20:fe:4e:b8 vnic1

‘flowadm (1M)’ – This command is primarily used to provide isolation
and private resources to an application's traffic or to a protocol. In
addition, we can also configure bandwidth limits and guarantees for
the flows. Again, some examples can illustrate the usage better:


Example 1: Create a policy around mission-critical port 443 traffic,
which is the https service.

To create a policy around inbound https traffic on an https server
so that https gets its dedicated NIC hardware and kernel TCP/IP
resources. The policy-id specified is https-1, which is used to
later modify or delete the policy.

# flowadm add-flow -l bge0 -a transport=tcp,local_port=443 https-1

Example 2: Modify an existing policy to add bandwidth resource control

To modify the https-1 policy to add bandwidth control and give it a
high priority:

# flowadm set-flowprop -p maxbw=500M,priority=high https-1

Example 3: Limit the bandwidth usage of UDP protocol

To create a policy for the UDP protocol so that it cannot consume more
than 10% of the available bandwidth. The flow-id is called limit-udp-1.

# flowadm add-flow -l bge0 -a transport=UDP -p maxbw=100M, \
priority=low limit-udp-1

8. Crossbow Observability – Stats, history and APIs:

Apart from the functionality related to network virtualization and bandwidth resource control, Crossbow offers a whole range of new tools and mechanisms to understand bandwidth usage. Administrators can see real-time bandwidth usage for various VNICs or configured flows (via ‘flowadm’) without causing any performance penalties.

The Rx rings, SRS, and squeues dealing with a particular flow keep track of the normal stats, which are pulled by a userland daemon from time to time. The daemon also logs the information to special log files, which allows users to see the history at any given time. A user can request the usage for a past time period to understand the system's behavior.
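Conceptually, the sampling-and-logging scheme described above might look like the following hypothetical Python sketch. The daemon, log format, and query interface are invented for illustration; the real system's counters and on-disk format are not specified here.

```python
log = []   # in the real system this would be an on-disk log file

def sample(counters, now):
    """Daemon side: append a timestamped snapshot of each flow's byte counter."""
    for flow, nbytes in counters.items():
        log.append((now, flow, nbytes))

def usage(flow, start, end):
    """Query side: bytes moved by `flow` between two past sample times."""
    snaps = [b for t, f, b in log if f == flow and start <= t <= end]
    return max(snaps) - min(snaps) if len(snaps) > 1 else 0

# Three samples a minute apart; the counter is cumulative bytes:
sample({"vnic1": 0}, now=0)
sample({"vnic1": 5_000_000}, now=60)
sample({"vnic1": 9_000_000}, now=120)
print(usage("vnic1", 0, 120))   # -> 9000000
```

Because the stats are sampled out of counters the data path already maintains, querying history adds no cost to packet processing, which is why the observability comes without a performance penalty.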

Crossbow will also provide tools to help capacity planning by allowing the system to be put into a capacity planning mode where bandwidth usage for the top traffic is monitored and displayed.

All the observability and administrative interfaces can be accessed by APIs which allow other applications to use and manage the system.

9. Resources:

The Crossbow project page on OpenSolaris is a good source of information: http://www.opensolaris.org/os/project/crossbow

The Crossbow mailing list is where all the day-to-day business of the project is conducted. Anyone can join the mailing list: crossbow-discuss@opensolaris.org.
