<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Sunay Tripathi&#039;s Blog</title>
	<atom:link href="http://sunaytripathi.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://sunaytripathi.wordpress.com</link>
	<description>Sun Distinguished Engineer&#039;s Blog on OS, Networking, Virtualization, Cloud Computing, Solaris Architecture, etc.</description>
	<lastBuildDate>Thu, 05 Jan 2012 07:42:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='sunaytripathi.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Sunay Tripathi&#039;s Blog</title>
		<link>http://sunaytripathi.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://sunaytripathi.wordpress.com/osd.xml" title="Sunay Tripathi&#039;s Blog" />
	<atom:link rel='hub' href='http://sunaytripathi.wordpress.com/?pushpress=hub'/>
		<item>
		<title>How does Openflow and SDN help Virtualization/Cloud</title>
		<link>http://sunaytripathi.wordpress.com/2011/12/21/how-does-openflow-and-sdn-help-virtualizationcloud/</link>
		<comments>http://sunaytripathi.wordpress.com/2011/12/21/how-does-openflow-and-sdn-help-virtualizationcloud/#comments</comments>
		<pubDate>Wed, 21 Dec 2011 03:04:13 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=90</guid>
		<description><![CDATA[Introduction to Software Defined Networking and OpenFlow. I often hear the terms OpenFlow and Software Defined Networking used in many different contexts, ranging from solving something simple and useful to literally solving world hunger (or fixing the world economy, for that matter). I often get asked to explain the various [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction to Software Defined Networking and OpenFlow</h3>
<p>I often hear the terms OpenFlow and Software Defined Networking (SDN) used in many different contexts, ranging from solving something simple and useful to literally solving world hunger (or fixing the world economy, for that matter). I often get asked to explain the various aspects of how OpenFlow is changing our lives. So here goes an explanation of the religion called OpenFlow (and Software Defined Networking) and the various ways it is manifesting itself in our day-to-day life. It is too much to write in one article, so I will make it a series of three. This one focuses on the protocol itself. The second will focus on how people are trying to develop it, along with some end-user perspective that I have accumulated over the last year or so. The last article in the series will discuss the challenges and what we are doing to help.</p>
<h3>Value Proposition</h3>
<p>The basic piece of <a href="http://Openflow.org">OpenFlow</a> is nothing more than a wire protocol that allows one piece of code to talk to another. The idea is that for a typical piece of network equipment, instead of logging in and configuring it via its embedded web or command line interface (the way you configure your home wifi router), a separate program called the <b>Controller</b> configures it over the wire, and you can get that Controller from someone other than the equipment vendor. Now technically, and in the short term, you are probably worse off, because you are getting the equipment from one guy and the management interface from another, and there are bound to be rough edges.</p>
<p>OpenFlow creates a standard around how the management interface, or Controller, talks to the equipment. Equipment vendors can design their equipment without worrying about the management piece, and someone else can create a management piece knowing that it will manage any equipment that supports OpenFlow. People who understand standards ask: what&#8217;s the big deal? I still can&#8217;t do more than what the equipment is designed to do! But that separation is the holy grail of any standard. By creating the standard, you free the guys who make equipment to focus on their expertise and the guys doing management to make the controllers better. This is no different from how computers work today: Intel/AMD create the key chips, vendors like Dell, HP, etc. create the servers, and the Linux community (or BSD, OpenSolaris, etc.) creates the OS, and it all works together to offer a better solution. It achieves one more thing &#8211; it drives the H/W cost lower and creates more competition, while allowing an end user to pick the best H/W (from their point of view) and the best controller based on features, reliability, etc. There is no monopoly, there are plenty of choices, and it&#8217;s all great for the end user.</p>
<p>This matters especially in the networking space, where innovation was lacking for a while and a few companies were used to huge margins because users had no choice. One trend that is fueling the fire behind SDN is virtualization. Both the server and storage sides (H/W and OS) have made good progress on this front, but the network is far behind. By opening up the space, SDN is allowing people like me (who are OS and distributed systems people) to step into this world and drive the same innovation on the network side. So OpenFlow/SDN are great standards for the end user, and people who understand them see the power behind them.</p>
<h3>Key Features</h3>
<p>OpenFlow Spec 1.1.2 is just out with minor improvements, while 1.1.1 has been out for a few months. Most vendors have only implemented 1.0.0. If you look at the spec, you will see the data structures and message syntax needed for a controller to talk to a device it wants to control. Functionality-wise, it can be grouped into the following parts (understand that I am trying to help people who don&#8217;t want to read hundreds of pages of specs):</p>
<ul>
<li>
The <b>device discovery</b> and connection establishment part where you tie in a controller to a device that it wants to control.
</li>
<li>
Creating the <b>Flows</b>. In a typical network, different types of traffic are mixed together, and the packets can be grouped into flows. If you look at the Layer 2 header, packets for the same VLAN can be a flow, packets belonging to a pair of MAC addresses can be a flow, and so on. Similarly, packets belonging to an IP subnet, or to an IP address plus TCP/UDP port (service), can be termed a flow. Any combination of Layer 2, 3 and 4 headers that allows us to uniquely identify a packet stream on the wire is termed a flow, and the OpenFlow protocol makes special efforts to specify these flows. An OpenFlow controller can specify a flow to a switch, applying to specific ports or to all ports, and ask the switch to take special actions when it matches a packet to that flow.
</li>
<li>
<b>Action</b> on matching a flow. As part of specifying the flow, the protocol allows the controller to specify what action to take when a packet matches the flow. Actions range from copying the packet and decrementing the Time to Live to changing or adding a QoS label. But the most important action (in my view) is the ability to direct the original packet (or a copy) to a specific port or to the controller itself.
 </li>
<li>
<b>Flow Table</b> where the flows are created. On an actual device, this is typically the TCAM, where the flow is instantiated and applied to incoming packets. Most devices are pretty limited here and can typically support only a very small set of flows today. The protocol allows for specifying multiple tables and the ability to pipeline across those tables, but given the state of today&#8217;s and mid-term hardware, a single table is all we can work with.
</li>
<li>
The last piece is the <b>Counters</b>. Most devices support port-level counters, which OpenFlow controllers can read. In addition, the protocol supports flow-level counters, but the current set of devices is very limited on that front as well.
</li>
</ul>
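<p>To make the pieces above concrete, here is a minimal Python sketch of a flow table with matches, actions and counters. It is a toy model, not the OpenFlow wire format: the field names (<code>in_port</code>, <code>vlan</code>, <code>ip_dst</code>, <code>tcp_dst</code>), the action strings and the <code>capacity</code> limit standing in for a small TCAM are all assumptions made for illustration.</p>

```python
from dataclasses import dataclass
from typing import Optional

# Toy model only: field names and action strings are illustrative assumptions,
# not the structures defined in the OpenFlow specification.
FIELDS = ("in_port", "vlan", "ip_dst", "tcp_dst")


@dataclass(frozen=True)
class Match:
    """Header fields to match; None means the field is a wildcard."""
    in_port: Optional[int] = None
    vlan: Optional[int] = None
    ip_dst: Optional[str] = None
    tcp_dst: Optional[int] = None

    def matches(self, packet: dict) -> bool:
        for name in FIELDS:
            wanted = getattr(self, name)
            if wanted is not None and packet.get(name) != wanted:
                return False
        return True


@dataclass
class FlowEntry:
    match: Match
    actions: list            # e.g. ["output:2"] or ["controller"]
    priority: int = 0
    packet_count: int = 0    # flow-level counters
    byte_count: int = 0


class FlowTable:
    """Single table with a hard size limit, standing in for a small TCAM."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = []

    def add(self, entry: FlowEntry) -> None:
        if len(self.entries) >= self.capacity:
            raise OverflowError("flow table full")
        self.entries.append(entry)
        # Highest priority first, so lookup can stop at the first hit.
        self.entries.sort(key=lambda e: e.priority, reverse=True)

    def lookup(self, packet: dict) -> list:
        for entry in self.entries:
            if entry.match.matches(packet):
                entry.packet_count += 1
                entry.byte_count += packet.get("length", 0)
                return entry.actions
        # Assumed table-miss policy for this sketch: hand the packet to the controller.
        return ["controller"]
```

<p>A controller in this model would call <code>add()</code> to create flows and read <code>packet_count</code> and <code>byte_count</code> back as the flow-level counters. The small <code>capacity</code> is the point: with only a handful of entries available, which flows get installed becomes a real design decision.</p>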
<h3>Putting it all together</h3>
<p>Now that we understand the components, we can see how it works. A controller (which is a piece of code) running on a standard server box starts up and discovers a device that it wants to manage. In today&#8217;s world, that device is typically an Ethernet switch. Once connected, the controller puts the device under its control, sets flows with actions, and reads status from the device.</p>
<p>As an example, assume that a user is experimenting with a new Layer 3 protocol. He can add a flow that makes the switch redirect all matching packets to the controller, where each packet gets modified appropriately and is sent back out through a specific egress port on the device. This is much easier to implement because the controller itself is a piece of code running on a standard OS, so adding code to it to do something experimental is pretty easy. The most powerful thing here is that the user is not impacting the rest of the network and doesn&#8217;t need his/her own dedicated network.</p>
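<p>The experimental-protocol example can be sketched in a few lines of Python. Everything here is hypothetical: the rule format, the <code>EGRESS_PORT</code> value and the "decrement a hop count in the first byte" rewrite are stand-ins for whatever the experimental protocol would really do. The EtherType 0x88B5 is one of the values IEEE reserves for local experimental use.</p>

```python
from dataclasses import dataclass
from typing import Optional

EXPERIMENTAL_ETHERTYPE = 0x88B5   # IEEE local experimental EtherType
EGRESS_PORT = 7                   # hypothetical egress port on the switch


@dataclass
class PacketOut:
    """What the controller asks the switch to transmit."""
    port: int
    payload: bytes


def install_redirect_rule(flow_rules: list) -> None:
    """Ask the switch to send every experimental packet to the controller."""
    flow_rules.append({
        "match": {"eth_type": EXPERIMENTAL_ETHERTYPE},
        "actions": ["controller"],
    })


def handle_packet_in(eth_type: int, payload: bytes) -> Optional[PacketOut]:
    """Controller-side handler for packets the switch redirected to us."""
    if eth_type != EXPERIMENTAL_ETHERTYPE:
        return None                       # not ours, leave it alone
    if not payload or payload[0] == 0:
        return None                       # hop count exhausted: drop
    # Stand-in for the experimental protocol logic: decrement a hop count
    # carried in the first payload byte, then forward.
    rewritten = bytes([payload[0] - 1]) + payload[1:]
    return PacketOut(EGRESS_PORT, rewritten)
```

<p>Only packets matching the one installed rule ever reach this code, which is why the experiment does not disturb the rest of the traffic on the switch.</p>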
<p>My own favorite (one that we have experimented with) is a debugging application for a data center or enterprise, where a user needs to debug his own client/server application. The user can try to capture packets on the multiple machines running his clients and server, but the easier thing is to set a flow on the switch based on the server IP address and TCP port (for the service), with an action that sends a copy of all matching packets to the controller with a timestamp. This allows the user to debug his application much more easily.</p>
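<p>A rough Python sketch of that debugging setup follows. The packet dictionaries, the <code>MirrorRule</code> and <code>DebugController</code> names and the injectable <code>clock</code> are assumptions for illustration; a real controller would receive the copies from the switch over the OpenFlow channel rather than through a direct function call.</p>

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class MirrorRule:
    """Flow keyed on the server IP address and TCP port of the service."""
    server_ip: str
    tcp_port: int

    def matches(self, pkt: dict) -> bool:
        to_server = (pkt["dst_ip"] == self.server_ip
                     and pkt["dst_port"] == self.tcp_port)
        from_server = (pkt["src_ip"] == self.server_ip
                       and pkt["src_port"] == self.tcp_port)
        return to_server or from_server


@dataclass
class DebugController:
    rule: MirrorRule
    clock: Callable[[], float] = time.time      # injectable for testing
    capture: list = field(default_factory=list)

    def on_switch_packet(self, pkt: dict) -> dict:
        """Record a timestamped copy of matching packets.

        The original packet is returned untouched, modelling the switch
        forwarding it as usual while the copy goes to the controller.
        """
        if self.rule.matches(pkt):
            self.capture.append((self.clock(), dict(pkt)))
        return pkt
```

<p>Because every client and the server pass through the same switch, the capture ends up as one ordered list with one clock, instead of several per-machine traces that have to be merged afterwards.</p>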
<p>Again, the power of OpenFlow and Software Defined Networking is that they allow people to innovate and let someone solve their own problem by writing simple code (or using code provided by others). It&#8217;s important to keep in mind that the switch is a really powerful device, since everything goes through it, and allowing it to be controlled by C, Java, or Perl code is very powerful. The control moves from the switch designer to application developers (to the discomfort of the switch vendors <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />).</p>
<h3>So finally, how does it help Virtualization and Cloud?</h3>
<p>This is the reason why I am so excited and ended up spending time writing this blog. The key premise in the world of virtualization is dynamic control of resource utilization. Network utilization and SLAs are important, but the key part we need to solve is the utilization of servers. The holy grail is a large pool of servers, each running 20-50 virtual machines, controlled by software that optimizes for CPU/memory utilization. The key issue is that virtual machines are grouped together in terms of the application they run or the application developer that controls them. To prevent a free-for-all, they are typically tied together with some VLAN and ACLs, have a network identity in terms of IP/MAC addresses, and carry SLA/QoS settings. For the controlling software to migrate a VM freely, it needs to manage the VM&#8217;s network parameters on the target switch port as well. And this is where the current generation of switches fails: they require human intervention to configure the various network parameters on the switch to match the VM.</p>
<p>So even when a VM can migrate freely under software control, the move still requires human intervention on the network side. With OpenFlow, the software that orchestrates server utilization by scheduling VMs based on policies/SLAs can also set the matching network policies without human intervention.</p>
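<p>As a rough illustration, the sketch below shows an orchestrator moving a VM&#8217;s network profile along with the VM. The <code>VmNetworkProfile</code>, <code>Switch</code> and <code>migrate_vm</code> names are hypothetical; in a real deployment the <code>install</code> and <code>remove</code> calls would translate into flow add and delete messages sent by the controller.</p>

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VmNetworkProfile:
    """The network identity and policy that must follow the VM."""
    mac: str
    ip: str
    vlan: int
    qos_class: str


@dataclass
class Switch:
    name: str
    # port number -> list of profiles currently configured on that port
    port_rules: dict = field(default_factory=dict)

    def install(self, port: int, profile: VmNetworkProfile) -> None:
        self.port_rules.setdefault(port, []).append(profile)

    def remove(self, port: int, profile: VmNetworkProfile) -> None:
        # Raises ValueError if the profile was never on this port.
        self.port_rules.get(port, []).remove(profile)


def migrate_vm(profile: VmNetworkProfile,
               src: Switch, src_port: int,
               dst: Switch, dst_port: int) -> None:
    """Move the network policy as part of the VM migration itself."""
    # Configure the target first so the VM never lands on an unconfigured port.
    dst.install(dst_port, profile)
    src.remove(src_port, profile)
```

<p>The ordering in <code>migrate_vm</code> is a design choice of this sketch: installing on the target before removing from the source trades a short period of duplicated state for never having the VM reachable on a port with no policy.</p>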
<p>Just the way a typical server OS has a policy-driven scheduler that controls the various application threads on dozens of CPUs (yes, even a low-end dual-socket server has 6 cores per socket, each with multiple hardware threads), OpenFlow allows us to build a combined server/storage/network scheduler that can optimize VM placement based on configured policies.</p>
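<p>One way to picture such a combined scheduler is a placement function that treats network resources as first-class constraints next to CPU and memory. The sketch below is an assumption-laden toy: the resource fields, the idea of counting free flow-table entries on the attached switch, and the best-fit policy are all made up for illustration.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    free_cpus: int
    free_mem_gb: int
    free_uplink_mbps: int      # spare bandwidth on the host's switch port
    free_flow_entries: int     # spare flow-table entries on the attached switch


@dataclass
class VmRequest:
    cpus: int
    mem_gb: int
    bandwidth_mbps: int
    flow_entries: int


def fits(host: Host, vm: VmRequest) -> bool:
    """A host qualifies only if server AND network resources are available."""
    return (host.free_cpus >= vm.cpus
            and host.free_mem_gb >= vm.mem_gb
            and host.free_uplink_mbps >= vm.bandwidth_mbps
            and host.free_flow_entries >= vm.flow_entries)


def place(vm: VmRequest, hosts: list) -> Optional[Host]:
    """Best-fit policy: pick the qualifying host left with the least slack."""
    candidates = [h for h in hosts if fits(h, vm)]
    if not candidates:
        return None
    return min(candidates,
               key=lambda h: (h.free_cpus - vm.cpus, h.free_mem_gb - vm.mem_gb))
```

<p>A scheduler that only looked at CPU and memory could pick a host whose switch has no flow-table space left; here that host is filtered out before the policy is applied.</p>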
<p>Again, OpenFlow is just a wire protocol and a pseudo-standard, but it allows people like me to add huge value in ways that weren&#8217;t possible before. In the next article, we will go deeper into what people are trying to build and look at some more specific use cases. Stay tuned and Happy Holidays!!</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2011/12/21/how-does-openflow-and-sdn-help-virtualizationcloud/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>
	</item>
		<item>
		<title>Network 2.0: Virtualization without Limits</title>
		<link>http://sunaytripathi.wordpress.com/2011/06/19/network-2-0-virtualization-without-limits/</link>
		<comments>http://sunaytripathi.wordpress.com/2011/06/19/network-2-0-virtualization-without-limits/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 18:49:53 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=80</guid>
		<description><![CDATA[So the theme of the day is network virtualization, software defined networks, and taking virtualization to its logical conclusion, i.e. server, storage and network in a giant resource pool that can be allocated/assigned any which way. That is easier said than done, though. Server and storage virtualization were a bit simpler since we were dealing with [...]]]></description>
			<content:encoded><![CDATA[<p>So the theme of the day is network virtualization, software defined networks, and taking virtualization to its logical conclusion, i.e. server, storage and network in a giant resource pool that can be allocated/assigned any which way. That is easier said than done, though. Server and storage virtualization were a bit simpler, since we were dealing with one OS that needed to provide the right abstraction layer. The H/W resource pool (disk, CPU, network, memory, etc.) was managed by a single OS, so provisioning it between various virtual machines or storage pools was a bit simpler. The network, by definition, is useful only when multiple devices are connected, and trying to treat them as a single resource pool is harder. A virtual network has to deal with not just links, bandwidth, latency and queues but also higher-level functionality like routing, load balancing, firewalling, DNS, DHCP, VPN, etc. And we haven&#8217;t even talked about how this will all hook up with virtual machines and virtual storage pools in an easy manner. Now, before you argue that every component is already virtualized (which is very true), I would argue that this still doesn&#8217;t give me a virtual network. It is the same as someone wanting dinner and instead being served raw potatoes, onions, tomatoes, eggs, etc. and shown the stove to make his own omelette.</p>
<p>So having pioneered virtual switching and resource control in the server OS (Solaris, to be specific &#8211; the project was called <a href="http://sunaytripathi.wordpress.com/2010/03/25/crossbow-solaris-network-virtualization-resource-provisioning/trackback">Crossbow</a>, which I started in 2003 and which was integrated into OpenSolaris in 2008), I set out to do the same for larger networks in the form of <a href="http://www.pluribusnetworks.com">Pluribus Networks Inc</a> and to apply the hard lessons we learned from enterprise customers. This is what we call <b>Network 2.0: Virtualization without Limits</b>. The real reason it is a tough problem to solve is that switching needs to be very high performance and low latency. That forces all the switching functionality into a highly complicated ASIC, which does all the hard work of shuffling 1.2 terabits per second of data at sub-microsecond latencies and as such doesn&#8217;t need much software on top. The embedded OS controlling the switch is mostly used just for configuring the switch chip through a CLI (command line interface) that allows the administrator to control and configure each component on the switch, but almost nothing else. So when we started playing with some of the prototype next-generation boxes that our friends at Fulcrum and Broadcom gave us, we kept asking whether we could have a real OS running the chip so that it could do something more useful. We asked our friends again if there was some way to put a full-fledged OS on top (being the OS person I have been for most of my life). And that was when I realized that to solve the network virtualization problem, we really need an OS on the chip that understands resource pools and virtualization. But a single switch by itself is not very interesting, so we need an OS that controls all the switches. Hmm &#8211; one OS that controls them all (borrowing from LOTR, which reminds me to ask Peter Jackson whatever happened to the prequel)!! So before we could even start building anything more complicated, we built a network hypervisor that has semantics similar to a tightly coupled cluster but controls a collection of switches and scales from one instance to a hundred-plus instances.</p>
<p>The Network OS is finally taking life and is able to treat the network exactly as one giant resource pool. Please don&#8217;t confuse the Network OS with a typical management layer that manages a collection of devices. We do still need a management layer to configure and manage the OS, but policy enforcement, congestion control and resource management across all devices are done by the OS. In the same way, a server cluster doesn&#8217;t get rid of the management layer but actually gives it something that is more manageable.</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2011/06/19/network-2-0-virtualization-without-limits/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>
	</item>
		<item>
		<title>Solaris as an Open Source alternative to Linux</title>
		<link>http://sunaytripathi.wordpress.com/2010/10/23/solaris-as-an-open-source-alternative-to-linux/</link>
		<comments>http://sunaytripathi.wordpress.com/2010/10/23/solaris-as-an-open-source-alternative-to-linux/#comments</comments>
		<pubDate>Sat, 23 Oct 2010 09:23:06 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Solaris]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=72</guid>
		<description><![CDATA[When I left Solaris after the Sun/Oracle merger, it was because I wanted to try some new things in life, possibly based on OpenSolaris. I had led Solaris in the networking and network virtualization space for a long time and wanted to make a bigger mark in that space compared to what Oracle might have wanted. [...]]]></description>
			<content:encoded><![CDATA[<p>When I left Solaris after the Sun/Oracle merger, it was because I wanted to try some new things in life, possibly based on OpenSolaris. I had led Solaris in the networking and network virtualization space for a long time and wanted to make a bigger mark in that space compared to what Oracle might have wanted. But my hope was that Solaris as an open source operating system would continue to prosper and that I could possibly use OpenSolaris as a base for whatever I decided to do next. Well, the exodus from Solaris has continued over the past few months, and now <a href="http://blogs.sun.com/mws">Mike</a> has also decided to call it quits. Mike was one of my counterparts, running the storage side of the house (other leaders in the storage and filesystem space, like <a href="http://blogs.sun.com/bonwick">Jeff</a> and <a href="http://dtrace.org/blogs/bmc">Bryan</a>, had already bailed out of Solaris a few months after I left).</p>
<p>So at this point, I am forced to consider the fact that Solaris and OpenSolaris are on the brink of death unless something serious is done about it. Having spent so much time and energy in the last 15 years on Solaris (including bringing it back to life after the last tech bust, when Solaris had been labeled <em>Slowlaris</em>), I would personally like to see it go on. Given the richness of Solaris and what it offers to developers, the open source community doesn&#8217;t deserve to lose it. We had relentlessly added APIs for all the networking and virtualization code (<em>Crossbow and Zones</em>, to name a few) in the past few years. Creating VNICs, walking links, creating Zones, etc. is very easy from a developer&#8217;s point of view (more on what&#8217;s there for developers some other day).</p>
<p>So the question I have been pondering for the last few weeks is what it takes to create a truly vibrant open source kernel as an alternative to Linux. During the Sun days, we had tried to set up Solaris as an open source alternative to Linux, and we moved all development, process, architectural review, etc. into the open, but somehow the community never truly believed us. But now, with Oracle having closed the source of the OS and struggling to keep it alive, there might truly be an opportunity for OpenSolaris to be reborn as a true open source alternative to Linux. There seems to be some effort already in the form of <a href="http://illumos.org">illumos</a>, led by Garrett, and an <a href="http://openindiana.org">OpenIndiana</a> distro. Garrett works for Nexenta, which has a business based on Solaris, so I do believe they can throw some resources at keeping it running.</p>
<h2>What does an open source OS need?</h2>
<p>There are several things needed in the short term that don&#8217;t take too many resources. In no particular order of importance:</p>
<ul>
<li><strong>Drivers for new devices</strong><br />
I have seen Garrett personally dish out drivers faster than people can install and test them, so he can at least keep one part of the OS alive, i.e. drivers for new devices.</li>
<li><strong>Packaging, Delivery and Install</strong><br />
Then there is the packaging, delivery etc. which someone has to pick up. Perhaps the OpenIndiana guys can make that their core competency. Maybe they can finish the <strong>IPS</strong> system and make changes for the file based URI that it had already started to go towards. Things like:&nbsp;</p>
<ul>
<li>Allow someone to make a non-network clone of the repo (at least for the truly open source packages, including the kernel). This allows enough people to feel that they truly have control over all aspects of the kernel without needing a repo server that they don&#8217;t control. Maybe something as simple as installing a server from the network repo along the lines of
<pre>% pkg image-create file:///var/pkg file:///my_repo
% pkg image-create -g origin_server file:///my_repo</pre>
</li>
<li>Allowing someone to save a copy of a package locally that he can later install. Something like this
<pre>% pkg clone [-g origin_server] pkg_fmri_pattern /local_path</pre>
</li>
<li> And allow someone to install the saved package, overriding the dependency check if needed.
<pre>% pkg install [-f] file:///net/hostname/local_path
% pkg install [-f] [-g origin_server] pkg_fmri_pattern</pre>
</li>
</ul>
<p>Of course, the directions above are just some thoughts &#8211; the details need to be refined based on input from end users. The hosting etc. necessary for repo servers for delivery is perhaps the easy part. The install is another story altogether, given that a large amount of the code for automatic install has not even been open sourced, but I think we can get by for a few years before that has to be addressed.</li>
<li><strong>Bug fixing</strong><br />
Again, given the Solaris talent now outside of Oracle, this is easy. For Bryan, who works at Joyent, which again depends on Solaris, doing some P1 fixes is easy. I can do some critical fixes when necessary, and there are so many others now outside of Oracle that we need to reach out to.</li>
<li><strong>New Platform Support</strong><br />
As new chips come out, you need to add support for them at a minimum. Now we are getting into some tricky business. I have been discussing the idea of an OpenSolaris Foundation with some companies that support such open source initiatives. I have discussed it with some of the ex-Sun DEs and Sr. Engineers as well, and it seems appealing to people. Two things needed to happen for this to take off. One was Oracle having truly killed OpenSolaris so there is a clean fork (which I think has already happened). The other is a harder problem, and it is the last thing on my list.</li>
<li><strong>Mission and Innovation</strong><br />
The open source OS for the sake of another OS is not very palatable to people who can fund the foundation. I got suggestions around the mobile space, for which I don&#8217;t think Solaris is the ideal OS (yet). There seems to be interest in moving Solaris toward being a distributed OS in the cloud space. This gels with what I have been working on along with a few others &#8211; a distributed network operating system geared towards clouds. Maybe moving my effort on top of Solaris would provide the mission in one direction (and my own requirement for using Solaris has now been met, i.e. Oracle should kill it so there is no ambiguity). There is a revolution happening in that space already with OpenFlow, TRILL, etc., and I am trying to figure out how to break the last barrier in the datacenter.</li>
</ul>
<p>So to answer the question <em>What have I been up to?</em> &#8211; now you know. And to answer the question <em>Will I allow my work on Solaris to die?</em> I guess the answer is a resounding <strong>No</strong>. <em>Will I port things like Crossbow and Zones to FreeBSD?</em> &#8211; if OpenSolaris truly dies, then hell yes. And before someone points out that Zones is really the same as BSD Jails, you should look again carefully.</p>
<p>So Solaris users and Solaris lovers, I would love to hear your thoughts.</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2010/10/23/solaris-as-an-open-source-alternative-to-linux/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>
	</item>
		<item>
		<title>Its not a goodbye. Leaving Oracle but not Solaris!!</title>
		<link>http://sunaytripathi.wordpress.com/2010/04/02/its-not-a-goodbye-leaving-oracle-but-not-solaris/</link>
		<comments>http://sunaytripathi.wordpress.com/2010/04/02/its-not-a-goodbye-leaving-oracle-but-not-solaris/#comments</comments>
		<pubDate>Fri, 02 Apr 2010 07:17:52 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=67</guid>
		<description><![CDATA[This is probably one of the most difficult entries I have ever written. I have decided to leave my job at Oracle. Don&#8217;t have a forward destination yet but I intend to take some time thinking about it before I take the next step. I am leaving Oracle but I will still be involved with [...]]]></description>
			<content:encoded><![CDATA[<p>This is probably one of the most difficult entries I have ever written. I have decided to leave my job at Oracle. Don&#8217;t have a forward destination yet but I intend to take some time thinking about it before I take the next step. I am leaving Oracle but I will still be involved with Solaris and OpenSolaris in some form or the other. Having spent 14 years writing million+ lines of code and architecting some of the most complex subsystems, I don&#8217;t intend to just walk away.</p>
<p>The last 2-3 days have been a very emotional journey for me. I thought I was a very strong-willed person but it was amazing how many times I came close to tears when so many people stopped by. All I can say is that I am so grateful that the community feels that I have done something useful (both personally and professionally) for Solaris. The journey has been nothing but wonderful and I will surely miss everyone. But I have learned one thing in the last several years &#8211; to not say goodbye ever because our paths will cross again!!</p>
<p>Best of luck to everyone in the Solaris community who help it to be the best operating system on the planet and especially to the people who help write it. Keep the flag flying!! With Oracle, Solaris will reach more places &#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2010/04/02/its-not-a-goodbye-leaving-oracle-but-not-solaris/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>
	</item>
		<item>
		<title>Network in a box</title>
		<link>http://sunaytripathi.wordpress.com/2010/03/25/60/</link>
		<comments>http://sunaytripathi.wordpress.com/2010/03/25/60/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 07:10:42 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=60</guid>
		<description><![CDATA[Crossbow Virtual Wire allows us to create a full-fledged network comprising Hosts, Switches and Routers as a Virtual Network on a laptop. The Virtual Network is created using OpenSolaris project Crossbow Technology and the hosts etc. are created using Solaris Zones (a lightweight virtualization technology). All the steps necessary to create the virtual topology are explained.]]></description>
			<content:encoded><![CDATA[<p style="text-align:center;">Demo/Workshop/Learning Module</p>
<h2 style="text-align:center;">Crossbow: Network Virtualization &amp; Resource Control</h2>
<h2 style="text-align:center;">http://www.opensolaris.org/os/project/crossbow</h2>
<p style="text-align:center;">
<h3 style="text-align:center;">Sunay.Tripathi@Sun.Com</h3>
<h3>Objective</h3>
<p>Create a real network comprising Hosts, Switches and Routers as a Virtual Network on a laptop. The Virtual Network is created using OpenSolaris project Crossbow Technology and the hosts etc. are created using Solaris Zones (a lightweight virtualization technology). All the steps necessary to create the virtual topology are explained.</p>
<p>Users can work through this hands-on demo/workshop and the exercises at the end to gain expertise in</p>
<ul>
<li>Configuring IPv4 and IPv6 networks</li>
<li>Hands-on experience with OpenSolaris</li>
<li>Configuring and managing a real Router</li>
<li>IP Routing technologies including RIP, OSPF and BGP</li>
<li>Debugging configuration and connectivity issues</li>
<li>Network performance and bottleneck analysis</li>
</ul>
<p>The users of this module need not have access to a real network, router and switches. All they need is a laptop or desktop running OpenSolaris Project Crossbow snapshot 2/28/2008 or later which can be found at http://www.opensolaris.org/os/project/crossbow/snapshots.</p>
<h3>Introduction</h3>
<p>Crossbow (Network Virtualization and Resource Control) allows users to create a Virtual Network with fixed link speeds in a box. Multiple subnets connected via a Virtual Router are easy to configure. This allows network administrators to do a full network configuration and verify IP addresses, subnet masks, and router ports and addresses. They can test connectivity and link speeds and, when fully satisfied, instantiate the configuration on the real network.</p>
<p>Another big application is to debug problems by simulating a real network in a box. If network administrators are having issues with connectivity or performance, they can create a virtual network and debug their issues using snoop, kernel stats and dtrace. They don&#8217;t need to use the expensive H/W based network analyzers.</p>
<p>The network developers and researchers working with protocols (like high speed TCP) can use OpenSolaris to write their implementation and then try it out with other production implementations. They can debug and fine tune their  protocol quite a bit before sending even a single packet on the real network.</p>
<p>Note1: Users can use Solaris Zones, Xen or ldom guests to create the virtual hosts while Crossbow provides the virtual network building blocks.</p>
<p>Note2: The Solaris protocol code executed for a virtual network or for Solaris acting as a real router or host is common all the way to the bottom of the MAC layer. In the case of virtual networks, the device driver code for a physical NIC is the only code that is not needed.</p>
<h3>Try it Yourself</h3>
<p>Let&#8217;s do a simple exercise. As part of this exercise, you will learn</p>
<ul>
<li>How to configure a virtual network having two subnets connected via a Virtual Router using Crossbow and Zones</li>
<li>How to set the various link speeds to simulate a multiple-speed network</li>
<li>How to do some performance runs to verify connectivity</li>
</ul>
<p>What you need</p>
<p>A laptop or machine running Crossbow snapshot or OpenSolaris 2009.06 or later</p>
<h3>Virtual Network Example</h3>
<p>Let&#8217;s take a physical network. The example in Fig 1a represents the real network, showing how my desktop connects to the lab servers. The desktop is on the 20.0.0.0/24 network while the server machines (host1 and host2) are on the 10.0.0.0/24 network. In addition, host1 has a 10/100 Mbps NIC, limiting its connectivity to 100 Mbps.</p>
<div id="attachment_61" class="wp-caption aligncenter" style="width: 862px"><a href="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1a.gif"><img class="size-full wp-image-61" title="Fig 1a" src="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1a.gif?w=455" alt=""   /></a><p class="wp-caption-text">Fig 1a (Physical Topology)</p></div>
<p>We will represent the network shown in Fig 1a on my Crossbow enabled laptop as  a Virtual Network. We use Zones to act as host1, host2 and the Router while the global zone (gz) acts as the client (as a user exercise, create another client zone and assign VNIC6 to it to act as a client).</p>
<div id="attachment_62" class="wp-caption aligncenter" style="width: 869px"><a href="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1b.gif"><img class="size-full wp-image-62" title="Fig 1B" src="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1b.gif?w=455" alt=""   /></a><p class="wp-caption-text">Fig 1B (P2V topology)</p></div>
<p>Note 3: The Crossbow MAC layer itself does the switching between the VNICs. The etherstub is created as a dummy device to connect the various virtual NICs. Users can think of an etherstub as a Virtual Switch: this helps visualize the virtual network as a replacement for a physical network where each physical switch is replaced by a virtual switch (implemented by a Crossbow etherstub).</p>
<h3>Create the Virtual Network</h3>
<p>Let&#8217;s start by creating the 2 virtual switches using the dladm command<br />
<code><br />
gz# dladm create-etherstub vswitch1<br />
gz# dladm create-etherstub vswitch3</code></p>
<p>gz# dladm show-etherstub<br />
LINK<br />
vswitch1<br />
vswitch3</p>
<p>Create the necessary Virtual NICs. VNIC1 will later be limited to a speed of 100 Mbps while the others have no limit</p>
<p>gz# dladm create-vnic -l vswitch1 vnic1<br />
gz# dladm create-vnic -l vswitch1 vnic2<br />
gz# dladm create-vnic -l vswitch1 vnic3</p>
<p>gz# dladm create-vnic -l vswitch3 vnic6<br />
gz# dladm create-vnic -l vswitch3 vnic9</p>
<p>gz# dladm show-vnic</p>
<p>LINK        OVER             SPEED  MACADDRESS         MACADDRTYPE<br />
vnic1       vswitch1      &#8211; Mbps  2:8:20:8d:de:b1    random<br />
vnic2       vswitch1      &#8211; Mbps  2:8:20:4a:b0:f1    random<br />
vnic3       vswitch1      &#8211; Mbps  2:8:20:46:14:52    random<br />
vnic6       vswitch3      &#8211; Mbps  2:8:20:bf:13:2f    random<br />
vnic9       vswitch3      &#8211; Mbps  2:8:20:ed:1:45     random</p>
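<p>The dladm steps above are repetitive enough to script. Below is a minimal Python sketch, not part of Crossbow, that prints the same create-etherstub and create-vnic commands from a small topology table. The TOPOLOGY dictionary and the build_commands helper are hypothetical names introduced for illustration; only the dladm subcommands come from the steps above.</p>

```python
# Hypothetical helper: only the dladm subcommands come from the tutorial.
# Maps each etherstub (virtual switch) to the VNICs created over it.
TOPOLOGY = {
    "vswitch1": ["vnic1", "vnic2", "vnic3"],
    "vswitch3": ["vnic6", "vnic9"],
}


def build_commands(topology):
    """Return dladm commands: all etherstubs first, then the VNICs on them."""
    commands = []
    for etherstub in topology:
        commands.append(f"dladm create-etherstub {etherstub}")
    for etherstub, vnics in topology.items():
        for vnic in vnics:
            commands.append(f"dladm create-vnic -l {etherstub} {vnic}")
    return commands


if __name__ == "__main__":
    # Print only; nothing is executed on the system.
    for command in build_commands(TOPOLOGY):
        print(command)
```

<p>The script only prints the commands, so you can review the output before running it in the global zone.</p>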
<p>Create the hosts and assign them the VNICs. Also create the Virtual Router and assign it VNIC3 and VNIC9 over vswitch1 and vswitch3 respectively. Both the Virtual Router and Hosts are created using Zones in this example but you can easily use Xen or logical domains.</p>
<p>Create a base Zone which we can clone.<br />
<code><br />
gz# zfs create -o mountpoint=/vnm rpool/vnm<br />
gz# chmod 700 /vnm</code></p>
<p>gz# zonecfg -z vnmbase<br />
vnmbase: No such zone configured<br />
Use &#8216;create&#8217; to begin configuring a new zone.<br />
zonecfg:vnmbase&gt; create<br />
zonecfg:vnmbase&gt; set zonepath=/vnm/vnmbase<br />
zonecfg:vnmbase&gt; set ip-type=exclusive<br />
zonecfg:vnmbase&gt; add inherit-pkg-dir<br />
zonecfg:vnmbase:inherit-pkg-dir&gt; set dir=/opt<br />
zonecfg:vnmbase:inherit-pkg-dir&gt; end<br />
zonecfg:vnmbase&gt; add inherit-pkg-dir<br />
zonecfg:vnmbase:inherit-pkg-dir&gt; set dir=/etc/crypto<br />
zonecfg:vnmbase:inherit-pkg-dir&gt; end<br />
zonecfg:vnmbase&gt; verify<br />
zonecfg:vnmbase&gt; commit<br />
zonecfg:vnmbase&gt; exit</p>
<p>This part takes 15-20 minutes</p>
<p>gz# zoneadm -z vnmbase install</p>
<p>Now let&#8217;s create the 2 hosts and the Virtual Router as follows</p>
<p>gz# zonecfg -z host1<br />
host1: No such zone configured<br />
Use &#8216;create&#8217; to begin configuring a new zone.<br />
zonecfg:host1&gt; create<br />
zonecfg:host1&gt; set zonepath=/vnm/host1<br />
zonecfg:host1&gt; set ip-type=exclusive<br />
zonecfg:host1&gt; add inherit-pkg-dir<br />
zonecfg:host1:inherit-pkg-dir&gt; set dir=/opt<br />
zonecfg:host1:inherit-pkg-dir&gt; end<br />
zonecfg:host1&gt; add inherit-pkg-dir<br />
zonecfg:host1:inherit-pkg-dir&gt; set dir=/etc/crypto<br />
zonecfg:host1:inherit-pkg-dir&gt; end<br />
zonecfg:host1&gt; add net<br />
zonecfg:host1:net&gt; set physical=vnic1<br />
zonecfg:host1:net&gt; end<br />
zonecfg:host1&gt; verify<br />
zonecfg:host1&gt; commit<br />
zonecfg:host1&gt; exit</p>
<p>gz# zoneadm -z host1 clone vnmbase</p>
<p>gz# zoneadm -z host1 boot</p>
<p>gz# zlogin -C host1</p>
<p>Connect to the console and go through the sysid config. For this example, we assign 10.0.0.1/24 as IP address for vnic1. You can specify this during sysidcfg. For default route, specify 10.0.0.3 as the default route. You can say &#8216;none&#8217; for naming service, IPv6, kerberos etc for the purpose of this example.</p>
<p>Similarly create host2 and configure it with vnic2 i.e.<br />
<code><br />
gz# zonecfg -z host2<br />
host2: No such zone configured<br />
Use 'create' to begin configuring a new zone.<br />
zonecfg:host2&gt; create<br />
zonecfg:host2&gt; set zonepath=/vnm/host2<br />
zonecfg:host2&gt; set ip-type=exclusive<br />
zonecfg:host2&gt; add inherit-pkg-dir<br />
zonecfg:host2:inherit-pkg-dir&gt; set dir=/opt<br />
zonecfg:host2:inherit-pkg-dir&gt; end<br />
zonecfg:host2&gt; add inherit-pkg-dir<br />
zonecfg:host2:inherit-pkg-dir&gt; set dir=/etc/crypto<br />
zonecfg:host2:inherit-pkg-dir&gt; end<br />
zonecfg:host2&gt; add net<br />
zonecfg:host2:net&gt; set physical=vnic2<br />
zonecfg:host2:net&gt; end<br />
zonecfg:host2&gt; verify<br />
zonecfg:host2&gt; commit<br />
zonecfg:host2&gt; exit</code></p>
<p>gz# zoneadm -z host2 clone vnmbase<br />
gz# zoneadm -z host2 boot<br />
gz# zlogin -C host2</p>
<p>Connect to the console and go through the sysid config. For this example, we assign 10.0.0.2/24 as IP address for vnic2. You can specify this during sysidcfg. For default route, specify 10.0.0.3 as the default route. You can say &#8216;none&#8217; for naming service, IPv6, kerberos etc for the purpose of this example.</p>
<p>Let&#8217;s now create the Virtual Router as<br />
<code><br />
gz# zonecfg -z vRouter<br />
vRouter: No such zone configured<br />
Use 'create' to begin configuring a new zone.<br />
zonecfg:vRouter&gt; create<br />
zonecfg:vRouter&gt; set zonepath=/vnm/vRouter<br />
zonecfg:vRouter&gt; set ip-type=exclusive<br />
zonecfg:vRouter&gt; add inherit-pkg-dir<br />
zonecfg:vRouter:inherit-pkg-dir&gt; set dir=/opt<br />
zonecfg:vRouter:inherit-pkg-dir&gt; end<br />
zonecfg:vRouter&gt; add inherit-pkg-dir<br />
zonecfg:vRouter:inherit-pkg-dir&gt; set dir=/etc/crypto<br />
zonecfg:vRouter:inherit-pkg-dir&gt; end<br />
zonecfg:vRouter&gt; add net<br />
zonecfg:vRouter:net&gt; set physical=vnic3<br />
zonecfg:vRouter:net&gt; end<br />
zonecfg:vRouter&gt; add net<br />
zonecfg:vRouter:net&gt; set physical=vnic9<br />
zonecfg:vRouter:net&gt; end<br />
zonecfg:vRouter&gt; verify<br />
zonecfg:vRouter&gt; commit<br />
zonecfg:vRouter&gt; exit</code></p>
<p>gz# zoneadm -z vRouter clone vnmbase<br />
gz# zoneadm -z vRouter boot<br />
gz# zlogin -C vRouter</p>
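<p>The three zone configurations differ only in the zone name and the VNICs assigned. As a rough sketch, the Python below generates a zonecfg command file per zone that could be passed to zonecfg with its -f option instead of typing the session interactively. The ZONES table, INHERITED_DIRS list and zonecfg_script helper are hypothetical names for illustration, and the sketch assumes each inherited directory gets its own inherit-pkg-dir resource.</p>

```python
# Hypothetical generator for zonecfg command files; the zonecfg subcommands
# themselves are the ones used in the interactive sessions above.
ZONES = {
    "host1": ["vnic1"],
    "host2": ["vnic2"],
    "vRouter": ["vnic3", "vnic9"],
}
INHERITED_DIRS = ["/opt", "/etc/crypto"]


def zonecfg_script(name, vnics, root="/vnm"):
    """Build the text of a zonecfg command file for one exclusive-IP zone."""
    lines = ["create", f"set zonepath={root}/{name}", "set ip-type=exclusive"]
    # Assumption: one inherit-pkg-dir resource per directory.
    for directory in INHERITED_DIRS:
        lines += ["add inherit-pkg-dir", f"set dir={directory}", "end"]
    # One net resource per VNIC assigned to the zone.
    for vnic in vnics:
        lines += ["add net", f"set physical={vnic}", "end"]
    lines += ["verify", "commit"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for zone, vnics in ZONES.items():
        print(f"# command file for zone {zone}")
        print(zonecfg_script(zone, vnics))
```

<p>Saving the output for a zone to a file (say host1.cfg, a name chosen here) and running zonecfg -z host1 -f host1.cfg should give the same configuration as the interactive session; the clone, boot and zlogin steps stay as shown.</p>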
<p>Connect to the console and go through the sysid config. For this example, we assign 10.0.0.3/24 as IP address for vnic3 and 20.0.0.1/24 as the IP address for vnic9. You can specify this during sysidcfg. For default route, specify &#8216;none&#8217;. You can say &#8216;none&#8217; for naming service, IPv6, kerberos etc for the purpose of this example.</p>
<p>Let&#8217;s enable forwarding on the Virtual Router to connect the 10.x.x.x and 20.x.x.x networks.<br />
<code><br />
vRouter# svcadm enable network/ipv4-forwarding:default<br />
</code><br />
Note 5: The above is done inside virtual router. Make sure you are in the window where you did the zlogin -C vRouter above</p>
<p>Now let&#8217;s bring up VNIC6 and configure it, including setting up routes in the global zone. Alternatively, you can create another host called host3 as the client on the 20.x.x.x network by creating a host3 zone and assigning it an unused address on 20.0.0.0/24 (20.0.0.1 is already used by the Virtual Router).</p>
<p>Let&#8217;s configure VNIC6. Open an xterm in the global zone<br />
<code><br />
gz# ifconfig vnic6 plumb 20.0.0.3/24 up<br />
gz# route add 10.0.0.0 20.0.0.1<br />
gz# ping 10.0.0.1<br />
10.0.0.1 is alive</code></p>
<p>gz# ping 10.0.0.2<br />
10.0.0.2 is alive</p>
<p>Similarly, log in to host1 and/or host2 and verify connectivity<br />
<code><br />
host1# ping 20.0.0.3<br />
20.0.0.3 is alive<br />
host1# ping 10.0.0.2<br />
10.0.0.2 is alive<br />
</code></p>
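<p>The ping results above follow from the address plan: host1 and host2 share a subnet, while the global zone reaches them only through the Virtual Router. The short Python sketch below checks this with the standard ipaddress module. The ADDRESSES table simply restates the addresses assigned in this example; same_subnet and needs_router are hypothetical helper names.</p>

```python
import ipaddress

# Addresses assigned in this example (interface name in parentheses).
ADDRESSES = {
    "host1 (vnic1)": "10.0.0.1/24",
    "host2 (vnic2)": "10.0.0.2/24",
    "vRouter (vnic3)": "10.0.0.3/24",
    "vRouter (vnic9)": "20.0.0.1/24",
    "gz (vnic6)": "20.0.0.3/24",
}


def same_subnet(first, second):
    """True when both interface addresses fall in the same IP network."""
    net_a = ipaddress.ip_interface(first).network
    net_b = ipaddress.ip_interface(second).network
    return net_a == net_b


def needs_router(source, destination):
    """True when traffic between the two endpoints must be forwarded."""
    return not same_subnet(ADDRESSES[source], ADDRESSES[destination])


if __name__ == "__main__":
    pairs = [
        ("host1 (vnic1)", "host2 (vnic2)"),
        ("gz (vnic6)", "host1 (vnic1)"),
        ("gz (vnic6)", "vRouter (vnic9)"),
    ]
    for source, destination in pairs:
        via = "via vRouter" if needs_router(source, destination) else "direct"
        print(f"{source} to {destination}: {via}")
```

<p>If a ping between two endpoints fails, a check like this helps separate an addressing mistake from a forwarding or routing problem.</p>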
<h3>Set up Link Speed</h3>
<p>What we configured above are unlimited B/W links. We can configure a link speed on any of the links. For this example, let&#8217;s configure a link speed of 100 Mbps on VNIC1<br />
<code><br />
gz# dladm set-linkprop -p maxbw=100 vnic1<br />
</code><br />
We could have configured the link speed (or B/W limit) while creating the vnic itself by adding the &#8220;-p maxbw=100&#8221; option to the create-vnic command.</p>
<h3>Test the performance</h3>
<p>Start &#8216;<em>netserver</em>&#8217; (or a tool of your choice) in host1 and host2. You will have to install the tools in the relevant places<br />
<code><br />
host1# /opt/tools/netserver &amp;<br />
host2# /opt/tools/netserver &amp;<br />
gz# /opt/tools/netperf -H 10.0.0.2<br />
TCP STREAM TEST to 10.0.0.2 : histogram<br />
Recv   Send    Send<br />
Socket Socket  Message  Elapsed<br />
Size   Size    Size     Time     Throughput<br />
bytes  bytes   bytes    secs.    10^6bits/sec<br />
49152  49152  49152    10.00    2089.87</code></p>
<p>gz# /opt/tools/netperf -H 10.0.0.1<br />
TCP STREAM TEST to 10.0.0.1 : histogram<br />
Recv   Send    Send<br />
Socket Socket  Message  Elapsed<br />
Size   Size    Size     Time     Throughput<br />
bytes  bytes   bytes    secs.    10^6bits/sec<br />
49152  49152  49152    10.00     98.78</p>
<p>Note6: Since 10.0.0.2 is assigned to VNIC2, which has no limit, we get the maximum speed possible. 10.0.0.1 is configured over VNIC1, which is assigned to host1 and whose link speed we just set to 100 Mbps, and that is why we get only 98.78 Mbps.</p>
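<p>To put the two netperf numbers in perspective, here is a small worked calculation in Python. It uses only the throughput figures reported above; the 1 GB transfer size is an arbitrary value picked for illustration, and the helper names are hypothetical.</p>

```python
# Throughput figures reported by netperf above, in Mbps (10^6 bits/sec).
UNLIMITED_MBPS = 2089.87   # gz to host2 over vnic2, no limit
LIMITED_MBPS = 98.78       # gz to host1 over vnic1, maxbw=100
MAXBW_MBPS = 100.0         # configured limit on vnic1


def utilization(measured_mbps, cap_mbps):
    """Fraction of the configured limit that the measured run achieved."""
    return measured_mbps / cap_mbps


def transfer_seconds(size_bytes, rate_mbps):
    """Seconds needed to move size_bytes at a steady rate_mbps."""
    return (size_bytes * 8) / (rate_mbps * 1_000_000)


if __name__ == "__main__":
    one_gigabyte = 10 ** 9  # illustrative transfer size, not from the test
    print(f"share of limit used: {utilization(LIMITED_MBPS, MAXBW_MBPS):.2%}")
    print(f"unlimited / limited: {UNLIMITED_MBPS / LIMITED_MBPS:.1f}x")
    print(f"1 GB over vnic1: {transfer_seconds(one_gigabyte, LIMITED_MBPS):.1f} s")
    print(f"1 GB over vnic2: {transfer_seconds(one_gigabyte, UNLIMITED_MBPS):.1f} s")
```

<p>The limited link achieves almost all of its configured 100 Mbps, which suggests the limit itself, rather than the virtual network, is what bounds the host1 result.</p>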
<h3>Cleanup</h3>
<p><code><br />
gz# zoneadm -z host1 halt<br />
gz# zoneadm -z host1 uninstall</code></p>
<p>Delete the zone</p>
<p>gz# zonecfg -z host1</p>
<p>zonecfg:host1&gt; delete<br />
Are you sure you want to delete zone host1 (y/[n])? y<br />
zonecfg:host1&gt; exit</p>
<p>In the same way, delete the host2 and vRouter zones. Make sure you don&#8217;t delete vnmbase since recreating it takes time.</p>
<p>gz# ifconfig vnic6 unplumb</p>
<p>After you have deleted the zones, you can delete the vnics and etherstubs as follows</p>
<p># dladm delete-vnic vnic1			/* Delete VNIC */<br />
# dladm delete-vnic vnic2<br />
# dladm delete-vnic vnic3<br />
# dladm delete-vnic vnic6<br />
# dladm delete-vnic vnic9</p>
<p># dladm delete-etherstub vswitch3		/* Delete etherstub */<br />
# dladm delete-etherstub vswitch1</p>
<p>Make sure that VNICs are unplumbed (ifconfig vnic6 unplumb) and not assigned to a zone (delete the zone first) before you can delete them. You need to delete all the vnics on the etherstub before you can delete the etherstub.</p>
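<p>The ordering rules above (zones first, then VNICs, then etherstubs) can be captured in a short script. This Python sketch only prints the cleanup commands in a safe order; the table names and the teardown_commands helper are hypothetical, and the commands are the ones used in this section. Note that zonecfg delete still asks for confirmation when run this way.</p>

```python
# Hypothetical tables describing what this tutorial created.
ZONES = ["host1", "host2", "vRouter"]
PLUMBED_IN_GLOBAL_ZONE = ["vnic6"]
TOPOLOGY = {
    "vswitch1": ["vnic1", "vnic2", "vnic3"],
    "vswitch3": ["vnic6", "vnic9"],
}


def teardown_commands(zones, plumbed, topology):
    """Return cleanup commands ordered so each dependency is removed first."""
    commands = []
    # 1. Zones must go first, because they hold VNICs.
    for zone in zones:
        commands.append(f"zoneadm -z {zone} halt")
        commands.append(f"zoneadm -z {zone} uninstall")
        commands.append(f"zonecfg -z {zone} delete")
    # 2. Unplumb VNICs that are in use in the global zone.
    for vnic in plumbed:
        commands.append(f"ifconfig {vnic} unplumb")
    # 3. Delete every VNIC before touching any etherstub.
    for vnics in topology.values():
        for vnic in vnics:
            commands.append(f"dladm delete-vnic {vnic}")
    # 4. Etherstubs are now empty and can be deleted.
    for etherstub in topology:
        commands.append(f"dladm delete-etherstub {etherstub}")
    return commands


if __name__ == "__main__":
    for command in teardown_commands(ZONES, PLUMBED_IN_GLOBAL_ZONE, TOPOLOGY):
        print(command)
```

<p>The vnmbase zone is deliberately left out of ZONES so that it survives the cleanup.</p>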
<h3>User Exercises</h3>
<p>Now that you are familiar with the concepts and technology, you are ready to do some experiments of your own. Clean up the machine as mentioned above. The exercises below will help you master IP routing, configuring networks, and debugging performance bottlenecks.</p>
<ul>
<li>1. Recreate the Virtual Network as shown in Fig 1b but this time create an additional zone called client and assign vnic6 to that client zone.</li>
</ul>
<pre>client Zone	vRouter	        host1	host2
|		  |  |		  |       |
---- vswitch3 -----  -------- vswitch1-----</pre>
<p>Run all your connectivity tests after using zlogin to log in to the client zone. Now change all IPv4 addresses to IPv6 addresses and verify that the client and hosts still have connectivity.</p>
<ul>
<li>2. Leave the Virtual Network as in 1, but configure OSPF in vRouter instead of the default RIP. Verify that you still have connectivity. Note the steps needed to configure OSPF.</li>
</ul>
<ul>
<li>3. Configure the 20.0.0.0 and 10.0.0.0 networks as two separate autonomous networks, assign them unique AS numbers and configure unique BGP domains. Verify that connectivity still works. Note the steps needed to configure BGP domains.</li>
</ul>
<ul>
<li>4. Clean up everything and recreate the virtual network in 1 above but instead of statically assigning the IP addresses to hosts and clients, configure a DHCP server on the vRouter to hand out addresses on subnet 10.0.0.0/24 on vnic3 and addresses on 20.0.0.0/24 on vnic9. While creating the hosts and clients, configure them to get their IP address through DHCP.</li>
</ul>
<ul>
<li>5. Clean up everything and recreate the virtual network in 1 above. Add an additional router, vRouter2, which has a vnic on each of the 2 etherstubs.</li>
</ul>
<pre>                               vRouter1
                              / 	 \
                   20.0.0.0/24 	         10.0.0.0/24
                              \	         /
                               vRouter2</pre>
<p>This provides a redundant path from client to the hosts. Experiment with running different routing protocols and assign different weight to each path and see what path you take from client to host (use traceroute to detect). Now configure the routing protocol on two vRouters to be OSPF and play with link speeds and see how the path changes. Note the configuration and observations.</p>
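<p>Before experimenting, it can help to predict which path a cost-based protocol should choose. The Python sketch below runs a plain shortest-path search (Dijkstra) over the two-router topology. The link costs are made-up numbers; how a real routing daemon derives its costs from link speed is implementation specific and is not modelled here. The two_router_graph and shortest_path helpers are hypothetical names.</p>

```python
import heapq


def two_router_graph(r1_cost, r2_cost):
    """Client and host1 joined by two routers; costs are illustrative only."""
    return {
        "client": {"vRouter1": r1_cost, "vRouter2": r2_cost},
        "vRouter1": {"client": r1_cost, "host1": 1},
        "vRouter2": {"client": r2_cost, "host1": 1},
        "host1": {"vRouter1": 1, "vRouter2": 1},
    }


def shortest_path(graph, source, target):
    """Return (total_cost, path) for the cheapest path, or None if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


if __name__ == "__main__":
    for costs in [(10, 5), (5, 10)]:
        total, path = shortest_path(two_router_graph(*costs), "client", "host1")
        print(f"costs {costs}: {' to '.join(path)} (total cost {total})")
```

<p>Compare the predicted path with what traceroute reports from the client as you change weights and link speeds.</p>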
<ul>
<li>6. Clean up. Let&#8217;s now introduce another Virtual Router between the two subnets i.e.</li>
</ul>
<pre>client Zone	 vRouter1	 vRouter2	host1	     host2
|		   |  |            |   |	  |	       |
---- vswitch3 ------   -vswitch2----   -----vswitch1------------
    20.0.0.0/24	      30.0.0.0/24	      10.0.0.0/24</pre>
<p>Now set the link (VNIC) between vRouter1 and etherstub2 (vswitch2) to be 75 Mbps. Use SNMP from the client to retrieve the stats from vRouter1 and check where the packets are getting dropped when you run netperf from the client to host2.</p>
<p>Remove the limit set earlier and instead set a link speed of 75 Mbps on the link between etherstub2 (vswitch2) and vRouter2. Again use SNMP to get the stats on vRouter1. Do you see results similar to the previous run? If not, can you explain why?</p>
<h3>Conclusion &amp; More resources</h3>
<p>Use the real example and configure the virtual network to get familiar with the techniques used. At this point, have a look at your network and try to create a virtual network.</p>
<p>Get more details on the OpenSolaris Crossbow page</p>
<p><strong>http://www.opensolaris.org/os/project/crossbow</strong></p>
<p>You can find high level presentations, architectural documents, man pages etc  at http://www.opensolaris.org/os/project/crossbow/Docs</p>
<p><strong>Join the crossbow-discuss@opensolaris.org mailing list at</strong></p>
<p><strong>http://www.opensolaris.org/os/project/crossbow/discussions</strong></p>
<p>Send in your questions or your configuration samples and we will add them to the use case examples.</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2010/03/25/60/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1a.gif" medium="image">
			<media:title type="html">Fig 1a</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/crossbow_workshop_fig1b.gif" medium="image">
			<media:title type="html">Fig 1B</media:title>
		</media:content>
	</item>
		<item>
		<title>CrossBow: Solaris Network Virtualization &amp; Resource Provisioning</title>
		<link>http://sunaytripathi.wordpress.com/2010/03/25/crossbow-solaris-network-virtualization-resource-provisioning/</link>
		<comments>http://sunaytripathi.wordpress.com/2010/03/25/crossbow-solaris-network-virtualization-resource-provisioning/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 06:32:51 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=48</guid>
		<description><![CDATA[Crossbow provides network virtualization and resource provisioning to the OpenSolaris Networking Stack. An architect&#8217;s overview of the stack to understand the high-level working and why it works and performs way better than other architectures.]]></description>
			<content:encoded><![CDATA[<h2>1. CrossBow (the name):</h2>
<p>It makes some sense to explain the relationship between the technology (Network Virtualization and Resource Control) and the project name (CrossBow). It is believed that the crossbow was invented in 341 B.C. in China, but its use became prevalent in the Middle Ages, especially once steel was used to make the weapon. More powerful crossbows could penetrate armour at 200 yards and gave the typical horse-mounted knight real nightmares. But the biggest differentiator was the simplicity of their use. A crossbow could be used effectively after a week of training, while a comparable single-shot skill with a longbow could take years of practice.</p>
<p>Similarly, if you take a look at the existing QOS mechanisms on an end host, they are very difficult to use and normally take a very skilled administrator to use effectively. Even then, the existing QOS mechanisms come with heavy performance penalties, which is also pretty common with any kind of virtualization. In Solaris land, we have invented a new way of imposing bandwidth resource control as an attribute of a real or a virtual NIC such that it is built in as part of the Solaris network stack and comes without any performance penalties. Since the virtualization and/or resource control aspects are just attributes of the NIC/VNIC (specified when a NIC or Virtual NIC is created), a normal user can configure them without needing a doctorate in QOS or virtualization. &#8220;CrossBow&#8221; was the most suitable name for this project since we are trying to achieve similar results in the field of virtualization and resource control as the weapon did in medieval times on the battlefield.</p>
<h2>2. CrossBow (the background):</h2>
<p>Crossbow provides the building blocks for network virtualization and resource control by creating virtual stacks around any service (HTTP, HTTPS, FTP, NFS, etc.), protocol (TCP, UDP, SCTP, etc.), or Virtual machines like Containers, Xen and ldoms.</p>
<p>The project allows the system administrator to carve any physical NIC into multiple virtual NICs, which are pretty similar to real NICs and are administered just like real NICs. Each Virtual NIC can be assigned its own priority and bandwidth on a shared NIC without causing any performance degradation. The virtual NICs can have their own NIC hardware resources (Rx/Tx rings, DMA channels), MAC addresses, kernel threads and queues, which are private to the VNIC and are not shared across all traffic. In the case of Solaris Containers, the Container can be assigned a virtual Stack Instance as well, along with one or more virtual NICs. As such, traffic for one VNIC can be totally isolated from other traffic and assigned any kind of limits or guarantees on the amount of bandwidth it can use.</p>
<h2>3. Overview:</h2>
<p>Project Crossbow extends the reach of Solaris in several markets.</p>
<h3>3a. OS/Network/Server Consolidation:</h3>
<p>The application, network and server consolidation environments are where both OS and network virtualization play a big role. This market is typically driven by the cost of owning and managing physical machines and physical networks. The sweet spot for these horizontally scaled environments is the 2-4 socket machines, which appear as 4-8 CPU machines in the case of x86/x64 systems and 32-64 CPU machines in the case of SUN&#8217;s new Niagara based servers. From a total cost of ownership perspective, these blades have only one physical NIC (1Gb or 10Gb) but are trying to run multiple virtual machines (Xen, Containers, ldoms) which have to share the NIC resources and the available bandwidth.</p>
<p>The problem gets worse because for 3 decades we have been designing applications to go as fast as possible, with any congestion control being the job of the transport layer (if at all). So if one virtual machine is using UDP based traffic, then other virtual machines on the same system using TCP traffic will suffer badly. Even within the same transport (TCP for instance), bulk throughput applications like ftp/http etc. will have a very negative impact on interactive traffic and latency sensitive applications.</p>
<p>The goal of project Crossbow is to let different virtual machines share the common NIC in a fair manner and to allow system administrators to set preferential policies where necessary (e.g. the ISP selling limited B/W on a common pipe) without any performance impact.</p>
<h3>3b. Traditional QOS and application consolidation:</h3>
<p>Existing host based QOS mechanisms are very complex to set up and typically come with a sizable performance penalty and an increase in latency. A big part of the problem is the interrupt based delivery mechanism for inbound packets and the QOS being implemented by a separate layer (typically between the NIC driver and IP). The network and transport layers of the host stack are unaware of the QOS layer. The packets are already delivered to host memory by means of interrupts, and the QOS layer needs to classify the packets to various queues before it can apply the policies. If a packet cannot be processed because the bandwidth usage for that class is exceeded, it sits in a queue while still consuming system memory.</p>
<p>Project Crossbow integrates stack virtualization and QOS as part of the stack architecture itself to offer a large subset of QOS type functionality with no performance penalty and simple administrative interfaces. It also integrates diffserv with the stack, where a virtual NIC can set and read the diffserv based labels. Since the Crossbow architecture is limited to differentiating traffic based on layer 2, 3, and 4 headers only (i.e. the VLAN tag, local MAC address, local IP address, protocol, and ports), the functionality offered is a subset of existing QOS mechanisms, although it covers 90% of the use cases without any performance penalty. This is the prime reason why project Crossbow refers to the bandwidth related policies as &#8216;Bandwidth resource control&#8217; instead of QOS.</p>
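<p>To make the classification criteria concrete, here is a minimal sketch in C of matching a packet against a flow description built from those layer 2, 3 and 4 fields. All names (&#8216;flow_desc_t&#8217;, &#8216;pkt_hdrs_t&#8217;, &#8216;flow_match&#8217; and the FLOW_* bits) are hypothetical and chosen for illustration; they are not the Crossbow kernel interfaces, and the sketch assumes IPv4 only.</p>
<pre>
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Bits saying which fields of the flow description must match. */
#define FLOW_VLAN   0x01
#define FLOW_MAC    0x02
#define FLOW_IP     0x04
#define FLOW_PROTO  0x08
#define FLOW_PORT   0x10

/* Header fields already parsed out of a packet (hypothetical layout). */
typedef struct pkt_hdrs {
    uint16_t vlan_tag;      /* layer 2 */
    uint8_t  local_mac[6];  /* layer 2 */
    uint32_t local_ip;      /* layer 3, IPv4 only in this sketch */
    uint8_t  protocol;      /* layer 3, e.g. 6 = TCP, 17 = UDP */
    uint16_t local_port;    /* layer 4 */
} pkt_hdrs_t;

typedef struct flow_desc {
    uint32_t   mask;        /* FLOW_* bits that must match */
    pkt_hdrs_t match;       /* values to compare against */
} flow_desc_t;

/* Return 1 if every field selected in the mask matches the packet. */
static int
flow_match(const flow_desc_t *fd, const pkt_hdrs_t *ph)
{
    if ((fd-&gt;mask &amp; FLOW_VLAN) &amp;&amp; fd-&gt;match.vlan_tag != ph-&gt;vlan_tag)
        return (0);
    if ((fd-&gt;mask &amp; FLOW_MAC) &amp;&amp;
        memcmp(fd-&gt;match.local_mac, ph-&gt;local_mac, 6) != 0)
        return (0);
    if ((fd-&gt;mask &amp; FLOW_IP) &amp;&amp; fd-&gt;match.local_ip != ph-&gt;local_ip)
        return (0);
    if ((fd-&gt;mask &amp; FLOW_PROTO) &amp;&amp; fd-&gt;match.protocol != ph-&gt;protocol)
        return (0);
    if ((fd-&gt;mask &amp; FLOW_PORT) &amp;&amp; fd-&gt;match.local_port != ph-&gt;local_port)
        return (0);
    return (1);
}
</pre>
<p>A flow for inbound https traffic would set FLOW_PROTO and FLOW_PORT with protocol 6 and local port 443; anything that needs inspection beyond these headers is outside what such a classifier can express.</p>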
<h3>3c. Horizontally scaled markets:</h3>
<p>This is the market segment made up of low priced volume servers (typically 2-4 socket machines) offering services that require little or no sharing of data between them. The small servers can be standalone machines in a rack or blades in a chassis. Grids are another way to use volume servers to achieve the output of traditional large SMP machines or mainframes.</p>
<p>In the case of blades which share a common 10Gb NIC to the chassis, Crossbow again provides fair sharing of bandwidth. In addition, the Crossbow provided APIs for network management, virtualization and bandwidth resource control can be used by third party management software to propagate a common policy throughout the server farm or all the blades in the chassis. In a homogeneous Solaris based environment, it is very easy to mark an application or a virtual machine (based on port or IP address) as critical and propagate the same policy through all the machines. The diffserv labels can be added appropriately such that the policy is honored by all machines and by the network elements in the middle.</p>
<h2>4. Technical problems in existing architectures:</h2>
<p>As mentioned earlier, host based QOS systems work as a layer between the NIC driver and the network stack, and as such are pretty inefficient in providing the QOS services required of them. But that is not all.</p>
<p>The existing interrupt driven packet delivery model precludes any kind of policy enforcement and fair sharing. When a NIC interrupt is raised, it runs at the highest priority and the CPU has to context switch away from whatever it was processing to deal with the interrupt. Most of the time, the processing of a critical packet is interrupted to deal with the arrival of a non critical packet.</p>
<p>The anonymous packet processing in the kernel is another major problem in virtualizing the stack and enforcing any kind of bandwidth resource control (including fairness). 80% of the work is already done for an incoming packet by the time the stack determines that no one is actually interested in the packet and that it needs to be dropped. In other words, the cost of dropping unwanted packets is too high.</p>
<p>Everything in the host flows through common queues and is processed by common threads, which makes enforcing policies based on traffic type very difficult. Receiving or transmitting each packet impacts the processing of every other packet on that particular CPU.</p>
<p>In most virtualized environments, the pseudo NIC in the virtual machine has no way of knowing the capabilities of the real hardware (even simple things like hardware checksum) because of the bridge in between, and performance suffers as a result. In addition, there is no mechanism to share the NIC in a fair manner. The transition of a typical packet from dom0 to domU also causes severe performance problems.</p>
<h2>5. Crossbow Architecture:</h2>
<p>The Crossbow architecture starts out by integrating network virtualization and resource control as part of the stack architecture. The Solaris 10 network stack has already been designed for the next decade: connection to CPU affinity is maintained and the upper stack has tight control over the NIC resources.</p>
<p>Crossbow builds on top of that by pushing the classification of packets based on services, protocols or virtual machines as far down as possible. If the NIC hardware itself has the ability to divide on board memory into segments/queues (known as Rx and Tx rings), which preferably have their own DMA channels and MSI-X interrupts, the stack programs the NIC classifier to classify packets to different Rx rings based on configured policies. Each Rx/Tx ring is owned by a CPU and a separate kernel queue known as an SRS, which controls the rate of packet arrival into the system based on configured bandwidth.</p>
<p>The Rx/Tx ring, the associated DMA channel, MSI-X interrupt, the serialization queue, the CPU, and the processing threads are all unique to the service, protocol or virtual machine in question and can be assigned a unique MAC address and a virtual NIC, which becomes the administrative entity that can be administered like a normal NIC. The NIC classifier steers incoming packets to the correct Rx ring, from where the SRS owning the Rx ring (and VNIC) pulls the packets in polling mode based on fair sharing of resources or configured bandwidth. As shown in the figure below, each such path is called a hardware lane; each hardware lane has dedicated hardware and software resources and operates independently of the others. Interrupt mode is used only when the SRS has no packets to process and the Rx ring is empty. Each individual Rx ring is dynamically switched between interrupt and polling mode. Incoming packets that exceed the configured bandwidth limit remain in the NIC itself in their corresponding Rx ring and are pulled into the system only when they are ready to be processed.</p>
<div id="attachment_53" class="wp-caption aligncenter" style="width: 465px"><a href="http://sunaytripathi.files.wordpress.com/2010/03/xbow_hw_lanes1.gif"><img src="http://sunaytripathi.files.wordpress.com/2010/03/xbow_hw_lanes1.gif?w=455&#038;h=189" alt="" title="Crossbow Hardware Lanes" width="455" height="189" class="size-full wp-image-53" /></a><p class="wp-caption-text">Crossbow Hardware Lanes</p></div>
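<p>The per ring switching between interrupt and polling mode can be sketched as follows. This is a simplified, single threaded model in C and not the Crossbow implementation: &#8216;rx_ring_t&#8217;, &#8216;ring_rx&#8217;, &#8216;srs_poll&#8217; and the idea of a byte budget per polling pass are assumptions made for illustration.</p>
<pre>
typedef enum { RING_INTR, RING_POLL } ring_mode_t;

typedef struct rx_ring {
    ring_mode_t   mode;     /* current delivery mode of this ring */
    unsigned long pending;  /* bytes waiting in the NIC Rx ring */
} rx_ring_t;

/*
 * NIC side: a packet lands in the ring. Returns 1 if an interrupt is
 * raised, which only happens while the ring is in interrupt mode.
 */
static int
ring_rx(rx_ring_t *r, unsigned long bytes)
{
    r-&gt;pending += bytes;
    if (r-&gt;mode == RING_POLL)
        return (0);         /* the SRS will pull it later */
    r-&gt;mode = RING_POLL;    /* one interrupt, then switch to polling */
    return (1);
}

/*
 * SRS side: pull at most 'budget' bytes in this pass. Whatever exceeds
 * the budget stays in the NIC ring and uses no host memory.
 */
static unsigned long
srs_poll(rx_ring_t *r, unsigned long budget)
{
    unsigned long n = (r-&gt;pending &lt; budget) ? r-&gt;pending : budget;

    r-&gt;pending -= n;
    if (r-&gt;pending == 0)
        r-&gt;mode = RING_INTR;    /* ring is empty: re-enable interrupts */
    return (n);
}
</pre>
<p>In this model the budget passed to &#8216;srs_poll&#8217; stands in for the configured bandwidth limit: a lane with a small budget simply leaves more of its traffic waiting in its own ring, without affecting the other lanes.</p>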
<p>
The creation of an administrative entity (VNIC) is optional and typically associated with a virtual machine like Solaris containers, Xen or ldoms. For application or protocol based resource control, a separate data path is created to provide the isolation and resource control but a VNIC is not configured.</p>
<p>As mentioned above, the VNIC is just an administrative entity. If the classification has already been done by the NIC to a particular Rx ring, the packets are delivered directly to the IP layer by means of function calls when the Rx ring is in interrupt mode, or the SRS residing in the MAC layer pulls the packet chain directly from the Rx ring when in polling mode. In essence, the entire data link layer is bypassed, resulting in improved performance and lower latencies. If the VNIC is placed in promiscuous mode, the data link bypass is abandoned and the Rx ring delivers packets via the VNIC layer, which creates a copy of the packet for the promiscuous stream.</p>
<p>The entire layered architecture is built on function pointers known as &#8216;upcall_func&#8217; and &#8216;downcall_func&#8217; with corresponding &#8216;upcall_arg&#8217; and &#8216;downcall_arg&#8217; for context. Every layer provides a pointer to its receive function as &#8216;upcall_func&#8217; and a context as &#8216;upcall_arg&#8217; to the layer below. Similarly, every layer provides a pointer to its transmit function as &#8216;downcall_func&#8217; and a context cookie as &#8216;downcall_arg&#8217; to the layer above. This is how the packet path is constructed. Any layer can short circuit itself out by providing the &#8216;upcall_func&#8217; and &#8216;upcall_arg&#8217; of the layer above to the layer below (and the same for the transmit side if needed). All context cookies for a layer are reference counted: each layer that points to a cookie holds a reference, which ensures that data structures don&#8217;t get freed until all references are dropped.</p>
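<p>The receive side of this scheme, including the short circuit, can be sketched in a few lines of C. Only the names &#8216;upcall_func&#8217; and &#8216;upcall_arg&#8217; come from the description above; the types, the two toy layers and &#8216;vnic_get_upcall&#8217; are hypothetical, and the reference counting of context cookies is left out to keep the sketch short.</p>
<pre>
typedef void (*upcall_func_t)(void *upcall_arg, const char *pkt);

/* What a layer hands to the layer below: its receive entry point. */
typedef struct upcall {
    upcall_func_t upcall_func;
    void         *upcall_arg;
} upcall_t;

typedef struct ip_state {
    int rx_count;           /* packets seen by the IP layer */
} ip_state_t;

typedef struct vnic_state {
    int      rx_count;      /* packets seen by the VNIC layer */
    upcall_t up;            /* receive entry point of the layer above */
} vnic_state_t;

static void
ip_recv(void *arg, const char *pkt)
{
    ip_state_t *ip = arg;

    (void) pkt;
    ip-&gt;rx_count++;
}

static void
vnic_recv(void *arg, const char *pkt)
{
    vnic_state_t *vnic = arg;

    vnic-&gt;rx_count++;
    vnic-&gt;up.upcall_func(vnic-&gt;up.upcall_arg, pkt);    /* pass upward */
}

/*
 * Return the upcall that the layer below should use. With 'bypass' set,
 * the VNIC short circuits itself out by handing down the upcall of the
 * layer above instead of its own.
 */
static upcall_t
vnic_get_upcall(vnic_state_t *vnic, int bypass)
{
    upcall_t self;

    self.upcall_func = vnic_recv;
    self.upcall_arg = vnic;
    return (bypass ? vnic-&gt;up : self);
}
</pre>
<p>With the bypass in place, the layer below calls straight into &#8216;ip_recv&#8217; and the VNIC layer adds no per packet cost, which is the data link bypass described earlier.</p>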
<p>In case the NIC hardware does not have classification capability (unlikely, since most Intel, Broadcom and Sun 1Gb NICs and pretty much all 10Gb NICs shipping for the past several years have this capability) or has run out of classification resources, the architecture provides a classification capability in the MAC layer and employs soft rings, which are similar in functionality to the NIC hardware classifier and Rx rings. The NIC hardware layer coupled with the lower MAC layer and soft rings is termed the &#8216;Pseudo Hardware layer&#8217;. A request by an administrator to create a new VNIC or flow will always return successfully from the pseudo hardware layer. The pseudo hardware layer manages the hardware and software classification capability, Rx rings and soft rings transparently to the upper layers.</p>
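<p>The &#8220;always succeeds&#8221; behavior of the pseudo hardware layer can be pictured with a small allocation sketch in C. The structure and function names here are hypothetical and only model the fallback from hardware Rx rings to soft rings; they say nothing about how the real MAC layer tracks its resources.</p>
<pre>
typedef enum { RING_HW, RING_SOFT } ring_kind_t;

typedef struct pseudo_hw {
    int hw_rings_total;     /* Rx rings the NIC classifier can feed */
    int hw_rings_used;
    int soft_rings_used;    /* rings fed by the software classifier */
} pseudo_hw_t;

/*
 * Allocate a ring for a new VNIC or flow. The request never fails:
 * a hardware ring is used while one is free, a soft ring otherwise.
 */
static ring_kind_t
pseudo_hw_alloc_ring(pseudo_hw_t *ph)
{
    if (ph-&gt;hw_rings_used &lt; ph-&gt;hw_rings_total) {
        ph-&gt;hw_rings_used++;
        return (RING_HW);
    }
    ph-&gt;soft_rings_used++;
    return (RING_SOFT);
}
</pre>
<p>The caller never needs to know which kind of ring it received, which is what keeps the upper layers independent of the NIC&#8217;s capabilities.</p>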
<h2>6. The administrative model:</h2>
<p>Crossbow introduces a new command called &#8216;flowadm&#8217; and further augments  &#8216;dladm&#8217; which was introduced as part of the new high performance device driver framework (GLDv3) in Solaris 10. </p>
<p>&#8216;dladm (1M)&#8217; &#8211; This is primarily used to create, modify and destroy VNICs based on MAC or IP addresses. A created VNIC is visible to and managed by ifconfig just like any other NIC and can get its IP address assigned via DHCP if necessary.</p>
<p>The examples below can illustrate this better:<br />
<br />
<code><br />
     Example 1: Configuring VNICs</p>
<p>     To create two VNIC interfaces with vnic-ids 1 and 2<br />
     over a single physical device bge0, enter the following<br />
     commands:</p>
<p>     # dladm create-vnic -l bge0 vnic1<br />
     # dladm create-vnic -l bge0 vnic2<br />
     The new links will be called vnic1 and vnic2.</p>
<p>     Example 2: Configuring VNICs and allocating bandwidth &amp; priority</p>
<p>     To create two VNIC interfaces with vnic-ids 1 and 2<br />
     over a single physical device bge0 and make vnic1 a higher<br />
     priority VNIC using a factory assigned MAC address with a guarantee<br />
     to use up to 90% of the bandwidth and vnic2 a lower priority VNIC<br />
     with a random MAC address and a hard limit of 100Mbps:</p>
<p>     # dladm create-vnic -l bge0 -m factory -b 90% -G -p high vnic1<br />
     # dladm create-vnic -l bge0 -m random -b 100M -L -p low vnic2 </p>
<p>     Example 3: Configure a VNIC by choosing a factory MAC address</p>
<p>     To create a VNIC interface with vnic-id 1 by first<br />
     listing the available factory MAC addresses and then using one<br />
     of them:</p>
<p>     # dladm show-dev -l bge0 -m<br />
     bge0<br />
            link: up        speed: 1000   Mbps       duplex: full<br />
	     MAC addresses:<br />
			slot-ident	Address			In Use<br />
			1		0:e0:81:27:d4:47	Yes<br />
			2		8:0:20:fe:4e:a5		No</p>
<p>     # dladm create-vnic -l bge0 -m factory -n 2 vnic1</p>
<p>     # dladm show-dev -l bge0<br />
     bge0<br />
            link: up        speed: 1000   Mbps       duplex: full<br />
	     MAC addresses:<br />
			slot-ident	Address			In Use<br />
			1		0:e0:81:27:d4:47	Yes<br />
			2		8:0:20:fe:4e:a5		Yes</p>
<p>     Example 4: Configuring VNICs sharing a MAC address</p>
<p>     To create two VNICs with vnic-id 1 and 2 by first listing the<br />
     available factory assigned MAC addresses and then picking one<br />
     that will be shared by the newly created VNICs</p>
<p>     # dladm show-dev -l bge0 -m<br />
     bge0<br />
            link: up        speed: 1000   Mbps       duplex: full<br />
	     MAC addresses:<br />
			slot-ident	Address			In Use<br />
			1		0:e0:81:27:d4:47	Yes<br />
			2		8:0:20:fe:4e:a5		No</p>
<p>     # dladm create-vnic -l bge0 -m shared -n 2 vnic1<br />
     # dladm create-vnic -l bge0 -m shared -n 2 vnic2</p>
<p>     Example 5: Creating a VNIC with user specified MAC address</p>
<p>     To create a VNIC with vnic-id 1 by providing a user specified<br />
     MAC address</p>
<p>     # dladm create-vnic -l bge0 -m 8:0:20:fe:4e:b8 vnic1<br />
</code></p>
<p>&#8216;flowadm (1M)&#8217; &#8211; This command is primarily used to provide isolation and private resources to an application&#8217;s traffic or to a protocol. In addition, we can also configure bandwidth limits and guarantees for the flows. Again, some examples can illustrate the usage better:<br />
<br />
<code><br />
     Example 1: Create a policy around mission critical port 443 traffic<br />
     which is https service.</p>
<p>     To create a policy around inbound https traffic on an https server<br />
     so that https gets its dedicated NIC hardware and kernel TCP/IP<br />
     resources. The policy-id specified is https-1 which is used to<br />
     later modify or delete the policy.</p>
<p>     # flowadm add-flow -l bge0 -a transport=tcp,local_port=443 https-1</p>
<p>     Example 2: Modify an existing policy to add bandwidth resource control</p>
<p>     To modify https-1 policy to add bandwidth control and give it a<br />
     high priority</p>
<p>     # flowadm set-flowprop -p maxbw=500M,priority=high  https-1</p>
<p>     Example 3: Limit the bandwidth usage of UDP protocol</p>
<p>     To create a policy for UDP protocol so that it can not consume more<br />
     than 10% of available bandwidth. The flow-id is called limit-udp-1.</p>
<p>     # flowadm add-flow -l bge0 -a transport=udp \<br />
       -p maxbw=100M,priority=low limit-udp-1</p>
<p></code></p>
<h2>8. Crossbow Observability &#8211; Stats, history and APIs:</h2>
<p>Apart from the functionality related to network virtualization and bandwidth resource control, Crossbow offers a whole range of new tools and mechanisms to understand bandwidth usage. Administrators can see real time bandwidth usage for various VNICs or configured flows (via &#8216;flowadm&#8217;) without causing any performance penalty.</p>
<p>The Rx rings, SRS and squeues dealing with a particular flow keep track of normal stats, which are pulled by a user land daemon from time to time. The daemon also logs the information in special log files, which allows users to see the history at any given time. A user can request usage for a time period in the past to understand the system behavior.</p>
<p>Crossbow will provide more tools to help capacity planning by allowing the system to be put in a capacity planning mode where bandwidth usage for top traffic is monitored and displayed.</p>
<p>All the observability and administrative interfaces can be accessed by  APIs which allow other applications to use and manage the system.</p>
<h2>9. Resources:</h2>
<p>Crossbow project page on OpenSolaris is a good source of information http://www.opensolaris.org/os/project/crossbow</p>
<p>The Crossbow mailing list is where all the day to day business for the  project is conducted. Anyone can join the mailing list crossbow-discuss@opensolaris.org. </p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2010/03/25/crossbow-solaris-network-virtualization-resource-provisioning/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/xbow_hw_lanes1.gif" medium="image">
			<media:title type="html">Crossbow Hardware Lanes</media:title>
		</media:content>
	</item>
		<item>
		<title>Solaris 10 Networking &#8211; The Magic Revealed</title>
		<link>http://sunaytripathi.wordpress.com/2010/03/25/solaris-10-networking-the-magic-revealed/</link>
		<comments>http://sunaytripathi.wordpress.com/2010/03/25/solaris-10-networking-the-magic-revealed/#comments</comments>
		<pubDate>Thu, 25 Mar 2010 05:41:18 +0000</pubDate>
		<dc:creator>sunaytripathi</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sunaytripathi.wordpress.com/?p=8</guid>
		<description><![CDATA[Sunay's overview of the Solaris 10 networking stack, written as the architect of the redesign. The article describes in detail how packets flow through the stack, the new GLDv3 device driver interface, why Solaris 10 performs so much better than its predecessors, and how to get even more.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sunaytripathi.wordpress.com&amp;blog=12524723&amp;post=8&amp;subd=sunaytripathi&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<ol id="mozToc"> <!--mozToc h2 1 h3 2--></p>
<li><a href="#mozTocId106028">Background</a></li>
<li><a href="#mozTocId592593.25">Solaris 10 stack</a>
<ol>
<li><a href="#mozTocId984166.25">Overview </a></li>
<li><a href="#mozTocId533719">Vertical perimeter </a></li>
<li><a href="#mozTocId267870">IP classifier</a></li>
<li><a href="#mozTocId874133.25">Synchronization mechanism</a></li>
</ol>
</li>
<li><a href="#mozTocId533350">TCP </a>
<ol>
<li><a href="#mozTocId900018.25">Socket</a></li>
<li><a href="#mozTocId976985">Bind</a></li>
<li><a href="#mozTocId872316.25">Connect</a></li>
<li><a href="#mozTocId546724.25">Listen</a></li>
<li><a href="#mozTocId5895.001953125">Accept</a></li>
<li><a href="#mozTocId261866">Close</a></li>
<li><a href="#mozTocId42499.015625">Data path</a></li>
<li><a href="#mozTocId809090.25">TCP Loopback</a></li>
</ol>
</li>
<li><a href="#mozTocId117060.03125">UDP</a>
<ol>
<li><a href="#mozTocId842262.25">UDP packet drop within the stack</a></li>
<li><a href="#mozTocId771014">UDP Module</a></li>
<li><a href="#mozTocId344689.125">UDP and Socket interaction</a></li>
<li><a href="#mozTocId497358">Synchronous STREAMS</a></li>
<li><a href="#mozTocId959941">STREAMs fallback</a></li>
</ol>
</li>
<li><a href="#mozTocId98206">IP</a>
<ol>
<li><a href="#mozTocId955763">Plumbing NICs</a></li>
<li><a href="#mozTocId342636.125">IP Network MultiPathing (IPMP)</a></li>
<li><a href="#mozTocId315410">Multicast</a></li>
</ol>
</li>
<li><a href="#mozTocId767708">Solaris 10 Device Driver framework</a>
<ol>
<li><a href="#mozTocId727795.25">GLDv2 and Monolithic DLPI drivers (Solaris 9 and before) </a></li>
<li><a href="#mozTocId946165.25">GLDv3 &#8211; A New Architecture </a></li>
<li><a href="#mozTocId342621">GLDv3 Link aggregation architecture</a></li>
<li><a href="#mozTocId368836">Checksum offload</a></li>
</ol>
</li>
<li><a href="#mozTocId400304.125">Tuning for performance:</a></li>
<li><a href="#mozTocId109935.03125">Future</a></li>
<li><a href="#mozTocId371329.125">Acknowledgments</a></li>
</ol>
<h2><a class="mozTocH2" name="mozTocId106028"></a>1 Background</h2>
<p>The networking stack of Solaris 1.x was a BSD variant and was pretty similar to the BSD Reno implementation. The BSD stack worked fine for low end machines, but Solaris wanted to satisfy the needs of low end customers as well as enterprise customers and as such migrated to the AT&amp;T SVR4 architecture, which became Solaris 2.x.</p>
<p>With Solaris 2.x, the networking stack went through a makeover and transitioned from a BSD style stack to a STREAMS based stack. The STREAMS framework provided an easy message passing interface which allowed one STREAMS module to interact with another STREAMS module. Using the STREAMS inner and outer perimeters, the module writer could provide mutual exclusion without making the implementation complex. The cost of setting up a STREAM was high, but the number of connection setups per second was not an important criterion and connections were usually long lived. When connections were long lived (NFS, ftp, etc.), the cost of setting up a new stream was amortized over the life of the connection.</p>
<p>During the late 90s, servers became heavily SMP, running a large number of CPUs. The cost of switching processing from one CPU to another became high as the mid to high end machines became more NUMA centric. Since STREAMS by design did not have any CPU affinity, packets for a particular connection moved around between different CPUs. It was apparent that Solaris needed to move away from the STREAMS architecture.</p>
<p>The late 90s also saw the explosion of the web, and the increase in processing power meant a large number of short lived connections, making connection setup time equally important. With Solaris 10, the networking stack went through one more transition where the core pieces (i.e. socket layer, TCP, UDP, IP, and device driver) used an IP classifier and serialization queue to improve connection setup time, scalability, and packet processing cost. STREAMS are still used to provide the flexibility that ISVs need to implement additional functionality.</p>
<h2><a class="mozTocH2" name="mozTocId592593.25"></a>2 Solaris 10 stack</h2>
<p>Let&#8217;s have a look at the new framework and its key components.</p>
<h3><a class="mozTocH3" name="mozTocId984166.25"></a>Overview</h3>
<p>The pre Solaris 10 stack uses STREAMS perimeters and kernel adaptive mutexes for multi-threading. TCP uses a STREAMS QPAIR perimeter, UDP uses a STREAMS QPAIR with PUTSHARED, and IP uses a PERMOD perimeter with PUTSHARED, with various TCP, UDP, and IP global data structures protected by mutexes. The stack is executed by user-land threads executing various system calls, by the network device driver read-side interrupt or device driver worker thread, and by STREAMS framework worker threads. The current perimeter is a per module, per protocol stack layer, or horizontal perimeter. This can, and often does, lead to a packet being processed on more than one CPU and by more than one thread, leading to excessive context switching and poor CPU data locality. The problem is compounded by the various places a packet can get queued under load and the various threads that finally process the packet.</p>
<p>The &#8220;FireEngine&#8221; approach is to merge all protocol layers into one STREAMS module which is fully multi threaded. Inside the merged module, instead of per data structure locks, a per CPU synchronization mechanism called the &#8220;vertical perimeter&#8221; is used. The &#8220;vertical perimeter&#8221; is implemented using a serialization queue abstraction called &#8220;squeue&#8221;. Each squeue is bound to a CPU and each connection is in turn bound to a squeue, which provides any synchronization and mutual exclusion needed for the connection specific data structures.</p>
<p>The connection (or context) lookup for inbound packets is done outside the perimeter, using an IP connection classifier, as soon as the packet reaches IP. Based on the classification, the connection structure is identified. Since the lookup happens outside the perimeter, we can bind a connection to an instance of the vertical perimeter or &#8220;squeue&#8221; when the connection is initialized and process all packets for that connection on the squeue it is bound to, maintaining better cache locality. More details about the vertical perimeter and classifier are given in later sections. The classifier also becomes the database for storing the sequence of function calls necessary for all inbound and outbound packets. This allows the Solaris networking stack to change from the current message passing interface to a BSD style function call interface. The string of functions created on the fly (event-list) for processing a packet for a connection is the basis for an eventual new framework in which other modules and third party high performance modules can participate.</p>
<h3><a class="mozTocH3" name="mozTocId533719"></a>Vertical perimeter</h3>
<p>Squeue guarantees that only a single thread can process a given connection at any given time thus serializing access to the TCP connection structure by multiple threads (both from read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs.</p>
<p>Vertical perimeter or squeue by themselves just provide packet serialization and mutual exclusion for the data structures, but by creating per CPU perimeter and binding a connection to the instance attached to the CPU processing interrupts, we can guarantee much better data locality.</p>
<p>We could have chosen between a per connection perimeter and a per CPU perimeter, i.e. an instance per connection or per CPU. The overheads involved with a per connection perimeter, together with thread contention, give lower performance and made us choose a per CPU instance. For a per CPU instance, we had the choice of queuing the connection structure for processing or instead queuing the packet itself and storing the connection structure pointer in the packet. The former approach leads to some interesting starvation scenarios where packets for a connection keep arriving, and the overheads of preventing such situations lowered performance. Queuing the packets themselves preserves ordering and is much simpler, and is thus the approach we have taken for FireEngine.</p>
<p>As mentioned before, each connection instance is assigned to a single squeue and is thus only processed within the vertical perimeter. As an squeue is processed by a single thread at a time, all data structures used to process a given connection from within the perimeter can be accessed without additional locking. This improves both the CPU and thread context data locality of access to the connection meta data, the packet meta data, and the packet payload data. In addition, this allows the removal of per device driver worker thread schemes, which are problematic in solving a system wide resource issue, and allows additional strategic algorithms to be implemented to best handle a given network interface based on the throughput of the network interface and the system throughput (e.g. fanning out per connection packet processing to a group of CPUs). The thread entering the squeue may either process the packet right away or queue it for later processing by another thread or the worker thread. The choice depends on the squeue entry point and on the state of the squeue. Immediate processing is only possible when no other thread has entered the same squeue. The squeue is represented by the following abstraction:</p>
<pre>
typedef struct squeue_s {
	int		sq_flag;	/* Flags telling the squeue status */
	kmutex_t	sq_lock;	/* Lock to protect the flag etc. */
	mblk_t		*sq_first;	/* First packet */
	mblk_t		*sq_last;	/* Last packet */
	thread_t	sq_worker;	/* The worker thread for the squeue */
} squeue_t;
</pre>
<p>It&#8217;s important to note that squeues are created on a per H/W execution pipeline basis, i.e. per core, hyper thread, etc. The stack processing of the serialization queue (and the H/W execution pipeline) is limited to one thread at a time, but this actually improves performance: the new stack ensures that there are no waits for any resources such as memory or locks inside the vertical perimeter, and allowing more than one kernel thread to time share the H/W execution pipeline has more overhead than allowing only one thread to run uninterrupted.</p>
<ul>
<li>Queuing Model &#8211; The queue is strictly FIFO (first in first out) for both the read and write side, which ensures that no particular connection is penalized or starved. A read side or a write side thread enqueues its packet at the end of the chain. It can then be allowed to process the packet or signal the worker thread based on the processing model below.</li>
<li>Processing Model &#8211; After enqueueing its packet, if another thread is already processing the squeue, the enqueuing thread returns and the packet is drained later based on the drain model. If the squeue is not being processed and there are no packets queued, the thread can mark the squeue as being processed (represented by &#8216;sq_flag&#8217;), and processes the packet. Once it completes processing the packet, it removes the &#8216;processing in progress&#8217; flag and makes the squeue free for future processing.</li>
<li>Drain Model &#8211; A thread, which was successfully able to process its own packet, can also drain any packets that were enqueued while it was processing the request. In addition, if the squeue is not being processed but there are packets already queued, then instead of queuing its packet and leaving, the thread can drain the queue and then process its own packets.</li>
</ul>
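<p>The three models above can be sketched as a single entry function. This is a simplified, single threaded model in C rather than the Solaris code: the names &#8216;sq_model_t&#8217;, &#8216;pkt_t&#8217;, &#8216;SQ_PROC&#8217; and &#8216;sq_enter&#8217; are hypothetical, the per squeue lock that protects the flag and the list in the kernel is omitted, processing is reduced to stamping each packet with its processing order, and the drain is not time bounded.</p>
<pre>
#include &lt;stddef.h&gt;

typedef struct pkt {
    struct pkt *next;
    int         seq;        /* order in which the packet was processed */
} pkt_t;

#define SQ_PROC 0x1         /* some thread is processing this squeue */

typedef struct sq_model {
    int    sq_flag;
    pkt_t *sq_first;        /* first packet */
    pkt_t *sq_last;         /* last packet */
    int    sq_nproc;        /* packets processed so far */
} sq_model_t;

static void
sq_enqueue(sq_model_t *sq, pkt_t *p)
{
    p-&gt;next = NULL;
    if (sq-&gt;sq_last != NULL)
        sq-&gt;sq_last-&gt;next = p;
    else
        sq-&gt;sq_first = p;
    sq-&gt;sq_last = p;
}

static pkt_t *
sq_dequeue(sq_model_t *sq)
{
    pkt_t *p = sq-&gt;sq_first;

    if (p != NULL) {
        sq-&gt;sq_first = p-&gt;next;
        if (sq-&gt;sq_first == NULL)
            sq-&gt;sq_last = NULL;
    }
    return (p);
}

/*
 * Enter the squeue with one packet. Returns 1 if the caller processed
 * and drained the queue, 0 if the packet was only queued.
 */
static int
sq_enter(sq_model_t *sq, pkt_t *p)
{
    sq_enqueue(sq, p);              /* queuing model: strictly FIFO */
    if (sq-&gt;sq_flag &amp; SQ_PROC)
        return (0);                 /* busy: drained later by the owner */

    sq-&gt;sq_flag |= SQ_PROC;         /* processing model: claim the squeue */
    while ((p = sq_dequeue(sq)) != NULL)
        p-&gt;seq = ++sq-&gt;sq_nproc;    /* drain model: oldest packet first */
    sq-&gt;sq_flag &amp;= ~SQ_PROC;
    return (1);
}
</pre>
<p>Because the caller always enqueues first and then drains from the head, a thread that finds older packets waiting processes them before its own, so ordering is preserved without any per connection lock.</p>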
<p>The worker thread is always allowed to drain the entire queue. Choosing the correct drain model is quite complicated. The choices are:</p>
<ul>
<li> &#8220;always queue&#8221;,</li>
<li> &#8220;process your own packet if you can&#8221;,</li>
<li> &#8220;time bounded process and drain&#8221;.</li>
</ul>
<p>These options can be independently applied to the read thread and the write thread.</p>
<p>Typically, draining by an interrupt thread should always be the time bounded &#8220;process and drain&#8221;, while the write thread can choose between &#8220;process your own&#8221; and the time bounded &#8220;process and drain&#8221;. For Solaris 10, the write thread behavior is a tunable with the default being &#8220;process your own&#8221;, while the read side is fixed to &#8220;time bounded process and drain&#8221;.</p>
<p>The signaling of the worker thread is another option worth exploring. If the packet arrival rate is low and a thread is forced to queue its packet, then the worker thread should be allowed to run as soon as the entering thread finishes processing the squeue when there is work to be done.</p>
<p>On the other hand, if the packet arrival rate is high, it may be desirable to delay waking up the worker thread hoping for an interrupt to arrive shortly after to complete the drain. Waking up the worker thread immediately when the packet arrival rate is high creates unnecessary contention between the worker and interrupt threads.</p>
<p>The default for Solaris 10 is delayed wakeup of the worker thread. Initial experiments on available servers showed that the best results are obtained by waking up the worker thread after a 10ms delay.</p>
<p>Placing a request on the squeue requires a per-squeue lock to protect the state of the queue, but this doesn&#8217;t introduce scalability problems because the lock is distributed across CPUs and is only held for a short period of time. We also utilize optimizations that avoid context switches while still preserving the single-threaded semantics of squeue processing. We create an instance of an squeue per CPU in the system and bind the worker thread to that CPU. Each connection is then bound to a specific squeue and thus to a specific CPU as well.</p>
<p>The binding of an squeue to a CPU can be changed but binding of a connection to an squeue never changes because of the squeue protection semantics. In the merged TCP/IP case, the vertical perimeter protects the TCP state for each connection. The squeue instance used by each connection is chosen either at the &#8220;open&#8221;, &#8220;bind&#8221; or &#8220;connect&#8221; time for outbound connections or at &#8220;eager connection creation time&#8221; for inbound ones.</p>
<p>The choice of the squeue instance depends on the relative speeds of the CPUs and the NICs in the system. There are two cases:</p>
<ul>
<li>CPU is faster than the NIC: the incoming connections are assigned to the &#8220;squeue instance&#8221; of the interrupted CPU. For the outbound case, connections are assigned to the squeue instance of the CPU the application is running on.</li>
<li>NIC is faster than the CPU: a single CPU is not capable of handling the NIC. The connections are bound randomly across all available squeues.</li>
</ul>
<p>For Solaris 10, the determination of whether the NIC is faster or slower than the CPU is made by the system administrator by tuning the global variable &#8216;ip_squeue_fanout&#8217;. The default is &#8216;no fanout&#8217;, i.e. assign the incoming connection to the squeue attached to the interrupted CPU. For the purposes of taking a CPU offline, the worker thread bound to this CPU removes its binding and restores it when the CPU comes back online. This allows the DR functionality to work correctly. When packets for a connection arrive on multiple NICs (and thus interrupt multiple CPUs), they are always processed on the squeue the connection was originally established on. In Solaris 10, the vertical perimeter is provided only for TCP based connections. The interface to the vertical perimeter is at the TCP and IP layer, after determining that it is a TCP connection. Solaris 10 updates will introduce the general vertical perimeter for any use.</p>
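<p>The squeue choice for an incoming connection can be pictured with a small sketch. The helper toy_pick_squeue() and its parameters are hypothetical; it only assumes one squeue per CPU, indexed by CPU number, and a caller-supplied random value for the fanout case.</p>

```c
#include <assert.h>

/*
 * Hypothetical helper, not the Solaris implementation.
 * fanout == 0 models the default 'no fanout' setting: use the squeue of
 * the interrupted CPU. Otherwise spread connections across all squeues
 * using a caller-supplied random number.
 * Returns the chosen squeue index, or -1 if there are no CPUs.
 */
int toy_pick_squeue(int fanout, int interrupted_cpu, int ncpus, unsigned rnd)
{
    if (ncpus <= 0)
        return -1;
    if (!fanout)
        return interrupted_cpu;
    return (int)(rnd % (unsigned)ncpus);
}
```

<p>Whatever squeue is picked here stays with the connection for its lifetime, which is why the choice is made only once, at connection setup.</p>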
<p>The squeue APIs look like:<br />
<code><br />
squeue_t 	*squeue_create(squeue_t *, uint32_t, processorid_t, void (*)(), void *, clock_t, pri_t);<br />
void 		squeue_bind(squeue_t *, processorid_t);<br />
void 		squeue_unbind(squeue_t *);<br />
void 		squeue_enter(squeue_t *, mblk_t *, void (*)(), void *);<br />
void 		squeue_fill(squeue_t *, mblk_t *, void (*)(), void *);<br />
</code></p>
<p>squeue_create() instantiates a new squeue, and squeue_bind()/squeue_unbind() bind it to or unbind it from a particular CPU. Squeues, once created, are never destroyed. squeue_enter() is used to try to access the squeue, and the entering thread is allowed to process and drain the squeue based on the models discussed above. squeue_fill() is used just to queue a packet on the squeue, to be processed by the worker thread or other threads.</p>
<h3><a class="mozTocH3" name="mozTocId267870"></a>IP classifier</h3>
<p>The IP connection fanout mechanism consists of 3 hash tables: a 5-tuple hash table {protocol, remote and local IP addresses, remote and local ports} to keep fully qualified (ESTABLISHED) TCP connections, a 3-tuple lookup consisting of protocol, local address and local port to keep the listeners, and a single-tuple lookup for protocol listeners. As part of the lookup, a connection structure (a superset of all connection information) is returned. This connection structure is called &#8216;conn_t&#8217; and is abstracted below.<br />
<code><br />
typedef struct conn_s {<br />
kmutex_t conn_lock; 		/* Lock for conn_ref */<br />
uint32_t conn_ref; 		/* Reference counter */<br />
uint32_t conn_flags; 		/* Flags */<br />
struct ill_s *conn_ill; 	/* The ill packets are coming on */<br />
struct ire_s *conn_ire; 	/* ire cache for outbound packets */<br />
tcp_t 	*conn_tcp;		/* Pointer to tcp struct */<br />
void 	*conn_ulp; 		/* Pointer for upper layer */<br />
edesc_pf conn_send; 		/* Function to call on write side */<br />
edesc_pf conn_recv; 		/* Function to call on read side */<br />
squeue_t *conn_sqp; 		/* Squeue for processing */<br />
/* Address and Ports */<br />
struct {<br />
in6_addr_t connua_laddr;	 /* Local address */<br />
in6_addr_t connua_faddr; 	/* Remote address. */<br />
} connua_v6addr;<br />
#define 	conn_src V4_PART_OF_V6(connua_v6addr.connua_laddr)<br />
#define 	conn_rem V4_PART_OF_V6(connua_v6addr.connua_faddr)<br />
#define 	conn_srcv6 connua_v6addr.connua_laddr<br />
#define 	conn_remv6 connua_v6addr.connua_faddr<br />
union {<br />
/* Used for classifier match performance */<br />
uint32_t conn_ports2;<br />
struct {<br />
in_port_t tcpu_fport; 	/* Remote port */<br />
in_port_t tcpu_lport; 	/* Local port */<br />
} tcpu_ports;<br />
} u_port;<br />
#define 	conn_fport u_port.tcpu_ports.tcpu_fport<br />
#define	 conn_lport u_port.tcpu_ports.tcpu_lport<br />
#define 	conn_ports u_port.conn_ports2<br />
uint8_t 	conn_protocol; 	/* protocol type */<br />
kcondvar_t 	conn_cv;<br />
} conn_t;<br />
</code></p>
<p>The interesting member to note is the pointer to the squeue or vertical perimeter. The lookup is done outside the perimeter and the packet is processed/queued on the squeue the connection is attached to. Also, conn_recv and conn_send point to the read side and write side functions respectively. The read side function can be &#8216;tcp_input&#8217; if the packet is meant for TCP.</p>
<p>Also, the connection fan-out mechanism has provisions for supporting wildcard listeners, i.e. INADDR_ANY. Currently, the connected and bind tables are primarily for TCP and UDP only. A listener entry is made during a listen() call. The entry is made into the connected table after the three-way handshake is complete for TCP.</p>
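<p>The three-level lookup can be sketched as follows. This is a simplified model, not the real classifier: toy_conn_t, toy_classifier_t and toy_classify() are hypothetical names, linear arrays stand in for the hash tables, only IPv4 is handled, and a local address of 0 is assumed to mean a wildcard listener.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for conn_t; only the fields needed for matching. */
typedef struct toy_conn {
    uint8_t  protocol;
    uint32_t laddr, faddr;   /* local and remote IPv4 address (0 = wildcard) */
    uint16_t lport, fport;   /* local and remote port */
} toy_conn_t;

/* Linear tables stand in for the three hash tables. */
typedef struct toy_classifier {
    const toy_conn_t *connected; size_t nconnected;  /* 5-tuple entries */
    const toy_conn_t *bound;     size_t nbound;      /* 3-tuple listeners */
    const toy_conn_t *proto;     size_t nproto;      /* protocol listeners */
} toy_classifier_t;

/* pkt describes the packet's addressing from the receiver's point of view. */
const toy_conn_t *toy_classify(const toy_classifier_t *c, const toy_conn_t *pkt)
{
    size_t i;

    /* 1. Fully qualified 5-tuple match (established connections). */
    for (i = 0; i < c->nconnected; i++) {
        const toy_conn_t *e = &c->connected[i];
        if (e->protocol == pkt->protocol && e->laddr == pkt->laddr &&
            e->faddr == pkt->faddr && e->lport == pkt->lport &&
            e->fport == pkt->fport)
            return e;
    }
    /* 2. Listener match: protocol, local address (or wildcard), local port. */
    for (i = 0; i < c->nbound; i++) {
        const toy_conn_t *e = &c->bound[i];
        if (e->protocol == pkt->protocol && e->lport == pkt->lport &&
            (e->laddr == pkt->laddr || e->laddr == 0))
            return e;
    }
    /* 3. Protocol-only listener. */
    for (i = 0; i < c->nproto; i++) {
        if (c->proto[i].protocol == pkt->protocol)
            return &c->proto[i];
    }
    return NULL;
}
```

<p>The order matters: the most specific table is consulted first, so a packet for an established connection never reaches the listener that originally accepted it.</p>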
<p>The IP classifier APIs look like:<br />
<code><br />
conn_t 	*ipcl_conn_create(uint32_t type, int sleep);<br />
void 	         ipcl_conn_destroy(conn_t *connp);<br />
int 	         ipcl_proto_insert(conn_t *connp, uint8_t protocol);<br />
int 	         ipcl_proto_insert_v6(conn_t *connp, uint8_t protocol);<br />
conn_t 	*ipcl_proto_classify(uint8_t protocol);<br />
int 	         ipcl_bind_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, uint16_t lport);<br />
int 	         ipcl_bind_insert_v6(conn_t *connp, uint8_t protocol, const in6_addr_t *src, uint16_t lport);<br />
int 	         ipcl_conn_insert(conn_t *connp, uint8_t protocol, ipaddr_t src, ipaddr_t dst, uint32_t ports);<br />
int 	         ipcl_conn_insert_v6(conn_t *connp, uint8_t protocol, in6_addr_t *src, in6_addr_t *dst, uint32_t ports);<br />
void 	        ipcl_hash_remove(conn_t *connp);<br />
conn_t 	*ipcl_classify_v4(mblk_t *mp);<br />
conn_t 	*ipcl_classify_v6(mblk_t *mp);<br />
conn_t 	*ipcl_classify(mblk_t *mp);<br />
</code></p>
<p>The names of the functions are self-explanatory.</p>
<h3><a class="mozTocH3" name="mozTocId874133.25"></a>Synchronization mechanism</h3>
<p>Since the stack is fully multi-threaded (barring the per-CPU serialization enforced by the vertical perimeter), it uses a reference-based scheme to ensure that connection instances are available when needed. The reference count is implemented by the &#8216;conn_t&#8217; member &#8216;conn_ref&#8217; and protected by &#8216;conn_lock&#8217;. The prime purpose of the lock is not to protect the bulk of &#8216;conn_t&#8217; but just the reference count. Each time some entity takes a reference to the data structure (stores a pointer to the data structure for later processing), it increments the reference count by calling the CONN_INC_REF macro, which acquires &#8216;conn_lock&#8217;, increments &#8216;conn_ref&#8217; and drops &#8216;conn_lock&#8217;. Each time the entity drops its reference to the connection instance, it does so using the CONN_DEC_REF macro.</p>
<p>For an established TCP connection, there are guaranteed to be 3 references on it. Each protocol layer has a reference on the instance (one each for TCP and IP) and the classifier itself has a reference since it is an established connection. Each time a packet arrives for the connection and the classifier looks up the connection instance, an extra reference is placed, which is dropped when the protocol layer finishes processing that packet. Similarly, any timers running on the connection instance hold a reference to ensure that the instance is still around whenever the timer fires. The memory associated with the connection instance is freed once the last reference is dropped.</p>
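<p>A minimal user-space sketch of this reference scheme is shown below. It is not the kernel macros themselves: toy_conn_ref_t, toy_conn_inc_ref() and toy_conn_dec_ref() are hypothetical names, a C11 atomic_flag spin lock stands in for the kmutex_t &#8216;conn_lock&#8217;, and a flag stands in for actually freeing the memory.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy stand-in for the reference-counted part of conn_t. */
typedef struct toy_conn_ref {
    atomic_flag conn_lock;   /* assumed substitute for kmutex_t conn_lock */
    uint32_t    conn_ref;    /* reference counter */
    int         freed;       /* set when the last reference is dropped */
} toy_conn_ref_t;

void toy_conn_init(toy_conn_ref_t *c)
{
    atomic_flag_clear(&c->conn_lock);
    c->conn_ref = 0;
    c->freed = 0;
}

/* Acquire the lock, bump the count, drop the lock. */
void toy_conn_inc_ref(toy_conn_ref_t *c)
{
    while (atomic_flag_test_and_set(&c->conn_lock))
        ;                    /* spin until the lock is free */
    c->conn_ref++;
    atomic_flag_clear(&c->conn_lock);
}

/* Drop one reference; the holder of the last one releases the instance. */
void toy_conn_dec_ref(toy_conn_ref_t *c)
{
    uint32_t left;

    while (atomic_flag_test_and_set(&c->conn_lock))
        ;
    left = --c->conn_ref;
    atomic_flag_clear(&c->conn_lock);
    if (left == 0)
        c->freed = 1;        /* real code would free the memory here */
}
```

<p>With this model, an established connection sits at a count of 3 (TCP, IP and the classifier), rises to 4 while a looked-up packet is being processed, and is released only when every holder has dropped its reference.</p>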
<h2><a class="mozTocH2" name="mozTocId533350"></a>3 TCP</h2>
<p>Solaris 10 provides the same view for TCP as previous releases, i.e. TCP appears as a clone device, but it is actually a composite, with the TCP and IP code merged into a single D_MP STREAMS module. The merged TCP/IP module&#8217;s STREAMS entry points for open and close are the same as IP&#8217;s entry points, viz. ip_open and ip_close. Based on the major number passed during open, IP decides whether the open corresponds to a TCP open or an IP open. The put and service STREAMS entry points for TCP are tcp_wput, tcp_wsrv and tcp_rsrv. The tcp_wput entry point simply serves as a wrapper routine and enables sockfs and other modules from the top to talk to TCP using STREAMS. Note that tcp_rput is missing since IP calls TCP functions directly. IP&#8217;s STREAMS entry points remain unchanged.</p>
<p>The operational part of TCP is fully protected by the vertical perimeter, which is entered through the squeue_* primitives as illustrated in Fig 4. Packets flowing from the top enter TCP through the wrapper function tcp_wput, which then tries to execute the real TCP output processing function tcp_output after entering the corresponding vertical perimeter. Similarly, packets coming from the bottom try to execute the real TCP input processing function tcp_input after entering the vertical perimeter. There are multiple entry points into TCP through the vertical perimeter.</p>
<p><a href="http://sunaytripathi.files.wordpress.com/2010/03/fig4.gif"><img class="aligncenter size-medium wp-image-4" title="Fig4" src="http://sunaytripathi.files.wordpress.com/2010/03/fig4.gif?w=478&#038;h=845" alt="" width="478" height="845" /></a></p>
<pre>
tcp_input - All inbound data packets and control messages
tcp_output - All outbound data packets and control messages
tcp_close_output - On user close
tcp_timewait_output - timewait expiry
tcp_rsrv_input - Flowcontrol relief on read side.
tcp_timer - All tcp timers</pre>
<h4>The Interface between TCP and IP</h4>
<p>FireEngine changes the interface between TCP and IP from the existing STREAMS based message passing interface to a function-call based interface, in both the control and data paths. On the outbound side, TCP passes a fully prepared packet directly to IP by calling ip_output while inside the vertical perimeter.</p>
<p>Similarly, control messages are also passed directly as function arguments. ip_bind_v{4, 6} receives a bind message as an argument, performs the required action and returns a result mp to the caller. TCP directly calls ip_bind_v{4, 6} in the connect(), bind() and listen() paths. IP still retains all its STREAMS entry points, but TCP (/dev/tcp) becomes a real device driver, i.e. it can&#8217;t be pushed over other device drivers.</p>
<p>The basic protocol processing code was unchanged. Let&#8217;s have a look at common socket calls and see how they interact with the framework.</p>
<h3><a class="mozTocH3" name="mozTocId900018.25"></a>Socket</h3>
<p>A socket open of TCP or an open of /dev/tcp eventually calls into ip_open. The open then calls into the IP connection classifier and allocates the per-TCP endpoint control block, which is already integrated with the conn_t. It chooses the squeue for this connection. In the case of an internal open, i.e. by sockfs for an acceptor stream, almost nothing is done, and we delay doing useful work till accept time.</p>
<h3><a class="mozTocH3" name="mozTocId976985"></a>Bind</h3>
<p>tcp_bind eventually needs to talk to IP to figure out whether the address passed in is valid. FireEngine TCP prepares this request as usual in the form of a TPI message. However, this message is passed directly as a function argument to ip_bind_v{4, 6}, which returns the result as another message. The use of messages as parameters is helpful in leveraging the existing code with minimal change. The port hash table used by TCP to validate binds still remains in TCP, since the classifier has no use for it.</p>
<h3><a class="mozTocH3" name="mozTocId872316.25"></a>Connect</h3>
<p>The changes in tcp_connect are similar to tcp_bind. The full bind() request is prepared as a TPI message and passed as a function argument to ip_bind_v{4, 6}. IP calls into the classifier and inserts the connection in the connected hash table. The conn_ hash table in TCP is no longer used.</p>
<h3><a class="mozTocH3" name="mozTocId546724.25"></a>Listen</h3>
<p>This path is part of tcp_bind. The tcp_bind prepares a local bind TPI message and passes it as a function argument to ip_bind_v{4, 6}. IP calls the classifier and inserts the connection in the bind hash table. The listen hash table of TCP does not exist any more.</p>
<h3><a class="mozTocH3" name="mozTocId5895.001953125"></a>Accept</h3>
<p>The pre Solaris 10 accept implementation did the bulk of the connection setup processing in the listener context. The three way handshake was completed in listener&#8217;s perimeter and the connection indication was sent up the listener&#8217;s STREAM. The messages necessary to perform the accept were sent down on the listener STREAM and the listener was single threaded from the point of sending the T_CONN_RES message to TCP till sockfs received the acknowledgment. If the connection arrival rate was high, the ability of pre Solaris 10 stack to accept new connections deteriorated significantly.</p>
<p>Furthermore, there was some additional TCP overhead involved, which contributed to the slower accept rate. When sockfs opened an acceptor STREAM to TCP to accept a new connection, TCP was not aware that the data structures necessary for the new connection had already been allocated. So it allocated new structures and initialized them, only for them to be freed later as part of the accept processing. Another major problem with the pre Solaris 10 design was that packets for a newly created connection arrived on the listener&#8217;s perimeter. This required a check for every incoming packet, and packets landing on the wrong perimeter needed to be sent to their correct perimeter, causing additional delay.</p>
<p>The FireEngine model establishes an eager connection (an incoming connection is called eager till accept completes) in its own perimeter as soon as a SYN packet arrives, thus making sure that packets always land on the correct connection. As a result it is possible to completely eliminate the TCP global queues. The connection indication is still sent to the listener on the listener&#8217;s STREAM but the accept happens on the newly created acceptor STREAM (thus, there is no need to allocate data structures for this STREAM) and the acknowledgment can be sent on the acceptor STREAM. As a result, sockfs doesn&#8217;t need to become single threaded at any time during the accept processing.</p>
<p>The new model was carefully implemented because the new incoming connection (eager) exists only because there is a listener for it and both eager and listener can disappear at any time during accept processing as a result of eager receiving a reset or listener closing.</p>
<p>The eager starts out by placing a reference on the listener so that the eager&#8217;s reference to the listener is always valid even though the listener might close. When a connection indication needs to be sent after the three way handshake is completed, the eager places a reference on itself so that it can close on receiving a reset but any reference to it is still valid. The eager sends a pointer to itself as part of the connection indication message, which is sent via the listener&#8217;s STREAM after checking that the listener has not closed. When the T_CONN_RES message comes down the newly created acceptor STREAM, we again enter the eager&#8217;s perimeter and check that the eager has not closed because of receiving a reset before completing the accept processing. For TLI/XTI based applications, the T_CONN_RES message is still handled on the listener&#8217;s STREAM and the acknowledgment is sent back on the listener&#8217;s STREAM, so there is no change in behavior.</p>
<h3><a class="mozTocH3" name="mozTocId261866"></a>Close</h3>
<p>Close processing in TCP no longer has to wait till the reference count drops to zero, since references to the closing queue and references to the TCP are now decoupled. Close can return as soon as all references to the closing queue are gone. The TCP data structures themselves may continue to stay around as a detached TCP in most cases. The release of the last reference to the TCP frees up the TCP data structure.</p>
<p>A user initiated close only closes the stream. The underlying TCP structures may continue to stay around. The TCP then goes through the FIN/ACK exchange with the peer after all user data is transferred and enters the TIME_WAIT state where it stays around for a certain duration of time. This is called a detached TCP. These detached TCPs also need protection to prevent outbound and inbound processing from happening at the same time on a given detached TCP.</p>
<h3><a class="mozTocH3" name="mozTocId42499.015625"></a>Data path</h3>
<p><span id="more-8"></span><!--more--><br />
TCP does not even need to call IP to transmit the outbound packet in the most common case, if it can access the IRE. With a merged TCP/IP we have the advantage of being able to access the cached ire for a connection, and TCP can putnext the data directly to the link layer driver based on the information in the IRE. FireEngine does exactly the above.</p>
<h3><a class="mozTocH3" name="mozTocId809090.25"></a>TCP Loopback</h3>
<p>TCP Fusion is a protocol-less data path for loopback TCP connections in Solaris 10. The fusion of two local TCP endpoints occurs at connection establishment time. By default, all loopback TCP connections are fused. This behavior may be changed by setting the system wide tunable do_tcp_fusion to 0. Various conditions on both endpoints need to be met for fusion to be successful:</p>
<ul>
<li>They must share a common squeue.</li>
<li>They must be TCP and not &#8220;raw socket&#8221;.</li>
<li>They must not require protocol-level processing, i.e. IPsec or IPQoS policy is not present for the connection.</li>
</ul>
<p>If fusion fails, we fall back to the regular TCP data path; if it succeeds, both endpoints proceed to use tcp_fuse_output() as the transmit path. tcp_fuse_output() enqueues application data directly onto the peer&#8217;s receive queue; no protocol processing is involved. After enqueueing the data, the sender can either push the data up the receiver&#8217;s read queue by calling putnext(9F), or simply return and let the receiver retrieve the enqueued data via the synchronous STREAMS entry point. The latter path is taken if synchronous STREAMS is enabled. It gets automatically disabled if sockfs no longer resides directly on top of the TCP module due to a module insertion or removal.</p>
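<p>The fusion conditions listed above amount to a simple predicate. The sketch below is illustrative only: toy_endpoint_t and toy_can_fuse() are hypothetical, and each condition is reduced to a single field.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the state that matters for fusion. */
typedef struct toy_endpoint {
    int  squeue_id;    /* which squeue the endpoint is bound to */
    bool is_tcp;       /* false for a raw socket */
    bool has_policy;   /* IPsec or IPQoS policy present */
} toy_endpoint_t;

/* Returns true if two loopback endpoints may be fused. */
bool toy_can_fuse(bool do_tcp_fusion,
                  const toy_endpoint_t *a, const toy_endpoint_t *b)
{
    if (!do_tcp_fusion)
        return false;                      /* tunable set to 0 */
    if (a->squeue_id != b->squeue_id)
        return false;                      /* must share a squeue */
    if (!a->is_tcp || !b->is_tcp)
        return false;                      /* no raw sockets */
    if (a->has_policy || b->has_policy)
        return false;                      /* needs protocol processing */
    return true;
}
```

<p>When the predicate is false, nothing special happens: the connection simply uses the regular TCP data path.</p>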
<p>Locking in TCP Fusion is handled by the squeue and the mutex tcp_fuse_lock. One of the requirements for fusion to succeed is that both endpoints need to be using the same squeue. This ensures that neither side can disappear while the other side is still sending data. By itself, the squeue is not sufficient for guaranteeing safe access when synchronous STREAMS is enabled. The reason is that tcp_fuse_rrw() doesn&#8217;t enter the squeue, and its access to tcp_rcv_list and other fusion-related fields needs to be synchronized with the sender. tcp_fuse_lock is used for this purpose.</p>
<h4>Rate Limit for Small Writes</h4>
<p>Flow control for TCP Fusion in synchronous stream mode is achieved by checking the size of the receive buffer and the number of data blocks, each against its own limit. This is different from regular STREAMS flow control, where the cumulative size check dominates the data block count check (the STREAMS queue high water mark typically represents bytes). Each enqueue triggers a notification to the receiving process; a build-up of data blocks indicates a slow receiver, and the sender should be blocked or informed at the earliest moment instead of wasting further system resources. In effect, this is equivalent to limiting the number of outstanding segments in flight.</p>
<p>The minimum number of allowable enqueued data blocks defaults to 8 and is changeable via the system wide tunable tcp_fusion_burst_min to either a higher value or to 0 (the latter disables the burst check).</p>
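<p>The two-limit check can be sketched as below. This is a guess at the shape of the test, not the actual fusion code: toy_fuse_flow_blocked() and its parameters are hypothetical, and max_blocks is assumed to be a block-count limit derived from tcp_fusion_burst_min, with 0 meaning the burst check is disabled.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical flow-control test for a fused receiver.
 * queued_bytes / rcv_hiwat : bytes on the receive queue and its byte limit
 * queued_blocks / max_blocks: data blocks queued and their limit (0 = off)
 * Returns true if the sender should be flow-controlled.
 */
bool toy_fuse_flow_blocked(size_t queued_bytes, size_t rcv_hiwat,
                           unsigned queued_blocks, unsigned max_blocks)
{
    if (queued_bytes >= rcv_hiwat)
        return true;                 /* byte limit reached */
    if (max_blocks != 0 && queued_blocks >= max_blocks)
        return true;                 /* too many small writes queued */
    return false;
}
```

<p>The second test is what catches a stream of tiny writes: the byte count stays far below the high water mark, but the block count reaches its limit quickly.</p>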
<h2><a class="mozTocH2" name="mozTocId117060.03125"></a>4 UDP</h2>
<p>Apart from the framework improvements, Solaris 10 made additional changes in the way UDP packets move through the stack. The internal code name for the project was &#8220;Yosemite&#8221;. Pre Solaris 10, the UDP processing cost was evenly divided between per packet processing cost and per byte processing cost. The per packet processing cost was generally due to STREAMS, the stream head processing, and packet drops in the stack and driver. The per byte processing cost was due to the lack of H/W checksum and unoptimized code branches throughout the network stack.</p>
<h3><a class="mozTocH3" name="mozTocId842262.25"></a>UDP packet drop within the stack</h3>
<p>Although UDP is supposed to be unreliable, local area networks have become pretty reliable and applications tend to assume that there will be no packet loss in a LAN environment. This assumption was largely true, but the pre Solaris 10 stack was not very effective in dealing with UDP overload and tended to drop packets within the stack itself.</p>
<p>On inbound, packets were dropped at more than one layer throughout the receive path. For UDP, the most common and obvious place is at the IP layer, due to the lack of resources needed to queue the packets. Another important yet less apparent place of packet drops is at the network adapter layer. This type of drop is fairly common when the machine is dealing with a high rate of incoming packets.</p>
<h4>UDP sockfs</h4>
<p>The UDP sockfs extension (sockudp) is an alternative path to socktpi used for handling sockets-based UDP applications. It provides a more direct channel between the application and the network stack by eliminating the stream head and the TPI message-passing interface. This allows direct data and function access throughout the socket and transport layers. The stack becomes more efficient and, coupled with UDP H/W checksum offload (even for fragmented UDP), this ensures that UDP packets are rarely dropped within the stack.</p>
<h3><a class="mozTocH3" name="mozTocId771014"></a>UDP Module</h3>
<p>UDP is a fully multi-threaded module running under the same protection domain as IP. This allows for a tighter integration of the transport (UDP) with the layers above and below it: socktpi can make direct calls to UDP and, similarly, UDP may make direct calls to the data link layer. In the post GLDv3 world, the data link layer may also make direct calls to the transport. In addition, utility functions can be called directly instead of using a message-based interface.</p>
<p>UDP needs exclusive operation on a per-endpoint basis when executing functions that modify the endpoint state. udp_rput_other() deals with packets with IP options, and processing these packets ends up having to update the endpoint&#8217;s option related state. udp_wput_other() deals with control operations from the top, e.g. connect(3SOCKET), that need to update the endpoint state. In the STREAMS world this synchronization was achieved by using shared inner perimeter entry points, and by using qwriter_inner() to get exclusive access to the endpoint.</p>
<p>The Solaris 10 model uses an internal, STREAMS-independent perimeter to achieve the above synchronization and is described below:</p>
<ul>
<li>udp_enter() &#8211; Enter the UDP endpoint perimeter.</li>
<li>udp_become_writer() &#8211; Become exclusive on the UDP endpoint. Specifies a function that will be called exclusively, either immediately or later when the perimeter is available exclusively.</li>
<li>udp_exit() &#8211; Exit the UDP endpoint perimeter.</li>
</ul>
<p>Entering UDP from the top or from the bottom must be done using udp_enter(). As in the general case, no locks may be held across these perimeters. When finished with the exclusive mode, udp_exit() must be called to get out of the perimeter.</p>
<p>To support this, the new UDP model employs two modes of operation, namely UDP_MT_HOT mode and UDP_SQUEUE mode. In the UDP_MT_HOT mode, multiple threads may enter a UDP endpoint concurrently. This is used for sending or receiving normal data and is similar to the putshared STREAMS entry points. Control operations and other special cases call udp_become_writer() to become exclusive on a per-endpoint basis, and this results in transitioning to the UDP_SQUEUE mode. The squeue by definition serializes access to the conn_t. When there are no more pending messages on the squeue for the UDP connection, the endpoint reverts to UDP_MT_HOT mode. In between, when not all MT threads of an endpoint have finished, messages are queued in the endpoint and the UDP is in one of two transient modes, i.e. UDP_MT_QUEUED or UDP_QUEUED_SQUEUE mode.</p>
<p>While in stable modes, UDP keeps track of the number of threads operating on the endpoint. The udp_reader_count variable represents the number of threads entering the endpoint as readers while it is in UDP_MT_HOT mode. Transitioning to UDP_SQUEUE happens when there is only a single reader, i.e. when this counter drops to 1. Likewise, udp_squeue_count represents the number of threads operating on the endpoint&#8217;s squeue while it is in UDP_SQUEUE mode. The mode transition to UDP_MT_HOT happens after the last thread exits the endpoint.</p>
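<p>The mode transitions can be pictured with a deliberately simplified state machine. Everything here is an assumption made for illustration: the toy_ names are hypothetical, the two transient modes are collapsed into one, no messages are actually queued, and there is no locking.</p>

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    TOY_UDP_MT_HOT,      /* many threads may be inside concurrently */
    TOY_UDP_MT_QUEUED,   /* transient: a writer waits for readers to leave */
    TOY_UDP_SQUEUE       /* access serialized through the squeue */
} toy_udp_mode_t;

typedef struct toy_udp {
    toy_udp_mode_t mode;
    unsigned reader_count;   /* threads inside outside of squeue mode */
    unsigned squeue_count;   /* operations pending on the squeue */
} toy_udp_t;

/* Switch to squeue mode once the would-be writer is the only reader left. */
void toy_udp_try_switch(toy_udp_t *u)
{
    if (u->mode == TOY_UDP_MT_QUEUED && u->reader_count == 1) {
        u->reader_count = 0;
        u->squeue_count = 1;
        u->mode = TOY_UDP_SQUEUE;
    }
}

/* Enter as a reader; fails when the endpoint is not in MT HOT mode. */
bool toy_udp_enter(toy_udp_t *u)
{
    if (u->mode != TOY_UDP_MT_HOT)
        return false;        /* a real caller would queue its message */
    u->reader_count++;
    return true;
}

/* Called by a thread that has already entered and now needs exclusivity. */
void toy_udp_become_writer(toy_udp_t *u)
{
    u->mode = TOY_UDP_MT_QUEUED;
    toy_udp_try_switch(u);
}

/* Leave the perimeter, either as a reader or as the exclusive writer. */
void toy_udp_exit(toy_udp_t *u)
{
    if (u->mode == TOY_UDP_SQUEUE) {
        if (u->squeue_count > 0 && --u->squeue_count == 0)
            u->mode = TOY_UDP_MT_HOT;    /* nothing pending: back to hot */
        return;
    }
    if (u->reader_count > 0)
        u->reader_count--;
    toy_udp_try_switch(u);
}
```

<p>In this sketch, with two readers inside, a call to toy_udp_become_writer() parks the endpoint in the transient mode; the switch to squeue mode happens only when the other reader exits and the count drops to 1.</p>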
<p>Though UDP and IP are running in the same protection domain, they are still separate STREAMS modules. Therefore, STREAMS plumbing is kept unchanged and a UDP module instance is always pushed above IP. Although this causes an extra open and close for every UDP endpoint, it provides backwards compatibility for some applications that rely on such plumbing geometry to do certain things, e.g. issuing I_POP on the stream to obtain direct access to IP.</p>
<p>The actual UDP processing is done within the IP instance. The UDP module instance does not possess any state about the endpoint and merely acts as a dummy module, whose presence is to keep the STREAMS plumbing appearance unchanged.</p>
<p>Solaris 10 allows for the following plumbing modes:</p>
<ul>
<li>Normal &#8211; IP is first opened and later UDP is pushed directly on top. This is the default action that happens when a UDP socket or device is opened.</li>
<li>SNMP &#8211; UDP is pushed on top of a module other than IP. When this happens it will support only SNMP semantics.</li>
</ul>
<p>These modes imply that we don&#8217;t support any intermediate module between IP and UDP; in fact, Solaris has never supported such a scenario in the past, as the inter-layer communication semantics between IP and transport modules are private.</p>
<h3><a class="mozTocH3" name="mozTocId344689.125"></a>UDP and Socket interaction</h3>
<p>A significant event that takes place during the socket(3SOCKET) system call is the plumbing of the modules associated with the socket&#8217;s address family and protocol type. A TCP or UDP socket will most likely result in sockfs residing directly atop the corresponding transport module. Pre Solaris 10, the socket layer used STREAMS primitives to communicate with the UDP module. Solaris 10 introduced a function-call interface, which eliminated the need to use a T_UNITDATA_REQ message for metadata during each transmit from sockfs to UDP. Instead, data and its ancillary information (i.e. remote socket address) could be provided directly to an alternative UDP entry point, therefore avoiding the extra allocation cost.</p>
<p>For transport modules, being directly beneath sockfs allows for synchronous STREAMS to be used. This enables the transport layer to buffer incoming data to be later retrieved by the application (via synchronous STREAMS) when a read operation is issued, therefore shortening the receive processing time.</p>
<h3><a class="mozTocH3" name="mozTocId497358"></a>Synchronous STREAMS</h3>
<p>Synchronous STREAMS is an extension to the traditional STREAMS interface for message passing and processing. It was originally added as part of the combined copy and checksum effort. It offers a way for the entry point of the module or driver to be called in a synchronous manner with respect to user I/O requests. In traditional STREAMS, the stream head is the synchronous barrier for such requests. Synchronous STREAMS provides a mechanism to move this barrier from the stream head down to a module below.</p>
<p>The TCP implementation of synchronous STREAMS in pre Solaris 10 was complicated, due to several factors. A major factor was the combined checksum and copyin/copyout operations. In Solaris 10, TCP wasn&#8217;t dependent on checksum during copyin/copyout, so the mechanism was greatly simplified for use with loopback TCP and UDP on the read side. The synchronous STREAMS entry points are called during requests such as read(2) or recv(3SOCKET). Instead of sending the data upstream using putnext(9F), these modules enqueue the data in their internal receive queues and allow the send thread to return sooner. This avoids calling strrput() to enqueue the data at the stream head from within the send thread context, therefore allowing for better dynamics &#8211; reducing the amount of time taken to enqueue and signal/poll-notify the receiving application allows the send thread to return faster to do further work, i.e. things are less serialized than before.</p>
<p>Each time data arrives, the transport module schedules for the application to retrieve it. If the application is currently blocked (sleeping) during a read operation, it will be unblocked to allow it to resume execution. This is achieved by calling STR WAKEUP SET() on the stream. Likewise, when there is no more data available for the application, the transport module will allow it to be blocked again during the next read attempt, by calling STR WAKEUP CLEAR(). Any new data that arrives before then will override this state and cause subsequent read operation to proceed.</p>
<p>An application may also be blocked in poll(2) until a read event takes place, or it may be waiting for a SIGPOLL or SIGIO signal if the socket used is non-blocking. Because of this, the transport module delivers the event notification and/or signals the application each time it receives data. This is achieved by calling STR_SENDSIG() on the corresponding stream.</p>
<p>As part of the read operation, the transport module delivers data to the application by returning it from its read side synchronous STREAMS entry point. In the case of loopback TCP, the synchronous STREAMS read entry point returns the entire content (byte stream) of its receive queue to the stream head; any remaining data will be re-enqueued at the stream head awaiting the next read. For UDP, the read entry point returns only one message (datagram) at a time.</p>
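<p>The difference between the two read entry points can be sketched in user-space C. This is only an illustrative model, not the Solaris code: rcvq_t, rcvq_put(), rcvq_read_dgram() and rcvq_read_stream() are hypothetical names, and a small fixed array stands in for the transport module's internal receive queue.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RCVQ_MAX 8      /* hypothetical queue depth */
#define MSG_MAX  64     /* hypothetical maximum message size */

/* Hypothetical stand-in for a transport module's internal receive queue. */
typedef struct rcvq {
    char   data[RCVQ_MAX][MSG_MAX];
    size_t len[RCVQ_MAX];
    size_t head;        /* index of the oldest message */
    size_t count;       /* number of queued messages */
} rcvq_t;

/* Send side: enqueue in the transport's own queue and return at once. */
int rcvq_put(rcvq_t *q, const void *buf, size_t len)
{
    assert(q != NULL && buf != NULL);
    if (q->count == RCVQ_MAX || len > MSG_MAX)
        return -1;
    size_t tail = (q->head + q->count) % RCVQ_MAX;
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    return 0;
}

/* UDP-like read: return exactly one datagram (excess bytes are dropped). */
size_t rcvq_read_dgram(rcvq_t *q, void *out, size_t outlen)
{
    if (q->count == 0)
        return 0;
    size_t n = q->len[q->head] < outlen ? q->len[q->head] : outlen;
    memcpy(out, q->data[q->head], n);
    q->head = (q->head + 1) % RCVQ_MAX;
    q->count--;
    return n;
}

/* TCP-like read: drain queued messages as one contiguous byte stream.
 * Messages that do not fit stay queued in this model. */
size_t rcvq_read_stream(rcvq_t *q, void *out, size_t outlen)
{
    size_t total = 0;
    while (q->count > 0 && total + q->len[q->head] <= outlen) {
        memcpy((char *)out + total, q->data[q->head], q->len[q->head]);
        total += q->len[q->head];
        q->head = (q->head + 1) % RCVQ_MAX;
        q->count--;
    }
    return total;
}
```

<p>In this model the sender only pays for rcvq_put(); the copy to the reader happens later, in the reader's own context, which is the point of moving the synchronous barrier below the stream head.</p>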
<h3><a class="mozTocH3" name="mozTocId959941"></a>STREAMS fallback</h3>
<p>By default, direct transmission and read side synchronous STREAMS optimizations are enabled for all UDP and loopback TCP sockets when sockfs is directly above the corresponding transport module. Several cases require these features to be disabled; when this happens, message exchange between sockfs and the transport module must be done through putnext(9F). The cases are as follows:</p>
<ul>
<li>Intermediate Module &#8211; A module is configured to be autopushed at open time on top of the transport module via autopush(1M), or is I_PUSH&#8217;d on a socket via ioctl(2).</li>
<li>Stream Conversion &#8211; The imaginary sockmod module is I_POP&#8217;d from a socket, causing it to be converted from a socket endpoint into a device stream.</li>
</ul>
<p>(Note that the I_INSERT or I_REMOVE ioctl is not permitted on a socket endpoint and therefore a fallback is not required to handle it.)</p>
<p>If a fallback is required, sockfs will notify the transport module that direct mode is disabled. The notification is sent down by the sockfs module in the form of an ioctl message, which indicates to the transport module that putnext(9F) must now be used to deliver data upstream. This allows for data to flow through the intermediate module and it provides for compatibility with device stream semantics.</p>
<h2><a class="mozTocH2" name="mozTocId98206"></a>5 IP</h2>
<p>As mentioned before, all the transport layers have been merged into the IP module, which is fully multithreaded and acts as a pseudo device driver as well as a STREAMS module. The key change in IP was the removal of IP client functionality and of the multiplexing of the inbound packet stream. The new IP Classifier (which is still part of the IP module) is responsible for classifying the inbound packets to the correct connection instance. The IP module is still responsible for network layer protocol processing and for plumbing and managing the network interfaces.</p>
<p>Let&#8217;s have a quick look at how plumbing of network interfaces, multipathing, and multicast work in the new stack.</p>
<h3><a class="mozTocH3" name="mozTocId955763"></a>Plumbing NICs</h3>
<p>Plumbing is a long sequence of operations involving message exchanges between IP, ARP and device drivers. Most set ioctls are typically involved in plumbing operations. A natural model is to serialize these ioctls one per ill. For example plumbing of hme0 and qfe0 can go on in parallel without any interference. But various set ioctls on hme0 will all be serialized.</p>
<p>Another possibility is to fine-grain even further and serialize operations per ipif rather than per ill. This will be beneficial only if many ipifs are hosted on an ill, and if the operations on different ipifs don&#8217;t have any mutual interference. Yet another possibility is to completely multithread all ioctls using standard Solaris MT techniques. But this is needlessly complex and does not add much value. It is hard to hold locks across the entire plumbing sequence, which involves waits and message exchanges with drivers or other modules. Not much is gained in performance or functionality by allowing multiple set ioctls on an ipif at the same time, since these are purely non-repetitive control operations. Broadcast ires are created on a per ill basis rather than per ipif basis. Hence trying to bring up more than one ipif simultaneously on an ill involves extra complexity in the broadcast ire creation logic.</p>
<p>On the other hand, serializing plumbing operations per ill lends itself easily to the existing IP code base. During the course of plumbing, IP exchanges messages with the device driver and ARP. The messages received from the underlying device driver are also handled exclusively in IP. This is convenient since we can&#8217;t hold standard mutex locks across the putnext in trying to provide mutual exclusion between the write side and read side activities. Instead of the all-exclusive PERMOD syncq, this effect can be easily achieved by using a per ill serialization queue.</p>
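<p>A per ill serialization queue can be sketched as follows. This is a single-threaded, user-space model under stated assumptions: ill_t here is a hypothetical toy structure, and ill_enter(), ill_exit() and count_op() are made-up names, not the Solaris IP interfaces. A real kernel implementation would also need a lock around the queue itself.</p>

```c
#include <assert.h>
#include <stddef.h>

#define ILL_PEND_MAX 8          /* hypothetical limit on queued operations */

typedef void (*ill_op_t)(void *arg);

/* Hypothetical toy ill: one serialization queue per interface. */
typedef struct ill {
    const char *name;
    int         busy;                   /* an exclusive op is in progress */
    ill_op_t    pend_op[ILL_PEND_MAX];  /* operations waiting their turn */
    void       *pend_arg[ILL_PEND_MAX];
    int         head, count;
} ill_t;

/* Start op on this ill, or queue it behind the op already in progress.
 * Returns 1 if started, 0 if queued, -1 if the queue is full. */
int ill_enter(ill_t *ill, ill_op_t op, void *arg)
{
    assert(ill != NULL && op != NULL);
    if (ill->busy) {
        if (ill->count == ILL_PEND_MAX)
            return -1;
        int tail = (ill->head + ill->count) % ILL_PEND_MAX;
        ill->pend_op[tail] = op;
        ill->pend_arg[tail] = arg;
        ill->count++;
        return 0;
    }
    ill->busy = 1;
    op(arg);
    return 1;
}

/* Called when the current operation has fully completed (for example,
 * after the driver's reply arrived). Starts the next queued op, if any. */
void ill_exit(ill_t *ill)
{
    if (ill->count == 0) {
        ill->busy = 0;
        return;
    }
    ill_op_t op = ill->pend_op[ill->head];
    void *arg = ill->pend_arg[ill->head];
    ill->head = (ill->head + 1) % ILL_PEND_MAX;
    ill->count--;
    op(arg);                    /* the ill stays busy for this op */
}

/* Example operation used for illustration: counts how often it ran. */
void count_op(void *arg)
{
    (*(int *)arg)++;
}
```

<p>Two set ioctls on hme0 run one after the other, while an ioctl on qfe0 proceeds independently because it has its own queue.</p>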
<h3><a class="mozTocH3" name="mozTocId342636.125"></a>IP Network MultiPathing (IPMP)</h3>
<p>IPMP operations are all driven around the notion of an IPMP group. Failover and failback operations operate between two ills, usually part of the same IPMP group. The ipifs and ilms are moved between the ills. This involves bringing down the source ill and could involve bringing up the destination ill. Bringing down or bringing up ills affects broadcast ires. Broadcast ires need to be grouped per IPMP group to suppress duplicate broadcast packets that are received. Thus broadcast ire manipulation affects all members of the IPMP group. Setting IFF_FAILED or IFF_STANDBY causes evaluation of all ills in the IPMP group and causes regrouping of broadcast ires. Thus serializing IPMP operations per IPMP group lends itself easily to the existing code base. An IPMP group includes both the IPv4 and IPv6 ills.</p>
<h3><a class="mozTocH3" name="mozTocId315410"></a>Multicast</h3>
<p>Multicast joins operate on both the ilg and ilm structures. Multiple threads operating on an ipc (socket) trying to do multicast joins need to synchronize when operating on the ilg. Multiple threads potentially operating on different ipcs (socket endpoints) trying to do multicast joins could eventually end up trying to manipulate the ilm simultaneously and need to synchronize on the access to the ilm. Both are amenable to standard Solaris MT techniques.</p>
<p>Considering all the above, i.e. plumbing, IPMP and multicast, the common denominator is to serialize all the exclusive operations on a per IPMP group basis or, if IPMP is not enabled, on a per phyint basis. For example, the hme0 v4 and hme0 v6 ills taken together share a phyint. Of these, multicast has a potentially higher degree of multithreading, but it has to coexist with the other exclusive operations. For example, we don&#8217;t want a thread to create or delete an ilm when a failover operation is already in progress trying to move ilms between two ills. So the lowest common denominator is to serialize multicast joins per physical interface or IPMP group.</p>
<h2><a class="mozTocH2" name="mozTocId767708"></a>6 Solaris 10 Device Driver framework</h2>
<p>Let&#8217;s have a quick look at how network device drivers were implemented pre Solaris 10 and why they needed to change with the new Solaris 10 stack.</p>
<h3><a class="mozTocH3" name="mozTocId727795.25"></a>GLDv2 and Monolithic DLPI drivers (Solaris 9 and before)</h3>
<p>Pre Solaris 10, the network stack relies on DLPI providers, which are normally implemented in one of two ways. The following illustration (Fig 5) shows a stack based on a so-called monolithic DLPI driver and a stack based on a driver utilizing the Generic LAN Driver (GLDv2) module.</p>
<p><a href="http://sunaytripathi.files.wordpress.com/2010/03/fig5.gif"><img src="http://sunaytripathi.files.wordpress.com/2010/03/fig5.gif?w=463&#038;h=485" alt="" title="Fig5" width="463" height="485" class="aligncenter size-medium wp-image-5" /></a></p>
<p>The GLDv2 module essentially behaves as a library. The client still talks to the driver instance bound to the device but the DLPI protocol processing is handled by calling into the GLDv2 module, which will then call back into the driver to access the hardware. Using the GLD module has a clear advantage in that the driver writer need not re-implement large amounts of mostly generic DLPI protocol processing. Layer two (Data-Link) features such as 802.1q Virtual LANs (VLANs) can also be implemented centrally in the GLD module allowing them to be leveraged by all drivers. The architecture still poses a problem though when considering how to implement a feature such as 802.3ad link aggregation (a.k.a. trunking) where the one-to-one correspondence between network interface and device is broken.</p>
<p>Both GLDv2 and monolithic drivers depend on DLPI messages and communicate with upper layers via the STREAMS framework. This mechanism was not very effective for link aggregation or 10Gb NICs. With the new stack, a better mechanism was needed which could ensure data locality and allow the stack to control the device drivers at a much finer granularity to deal with interrupts.</p>
<h3><a class="mozTocH3" name="mozTocId946165.25"></a>GLDv3 &#8211; A New Architecture</h3>
<p>Solaris 10 introduced a new device driver framework called GLDv3 (internal name &#8220;project Nemo&#8221;) along with the new stack. Most of the major device drivers were ported to this framework, and all future and 10Gb device drivers will be based on it. This framework also provides a STREAMS based DLPI layer for backward compatibility (to allow external, non-IP modules to continue to work).</p>
<p>GLDv3 architecture virtualizes layer two of the network stack. There is no longer a one-to-one correspondence between network interfaces and devices. The illustration below (Fig. 6) shows multiple devices registered with a MAC Services Module (MAC). It also shows two clients: one traditional client that communicates via DLPI to a Data-Link Driver (DLD) and one that is kernel based and simply makes direct function calls into the Data-Link Services Module (DLS).</p>
<p><a href="http://sunaytripathi.files.wordpress.com/2010/03/fig6.gif"><img src="http://sunaytripathi.files.wordpress.com/2010/03/fig6.gif?w=406&#038;h=559" alt="" title="Fig6" width="406" height="559" class="aligncenter size-medium wp-image-6" /></a></p>
<h4><a class="mozTocH3" name="mozTocId113922"></a>GLDv3 Drivers</h4>
<p>GLDv3 drivers are similar to GLD drivers. The driver must be linked with a dependency on misc/mac and misc/dld. It must call<br />
mac_register() with a pointer to an instance of the following structure to register with the MAC module:<br />
<code><br />
typedef struct mac {<br />
	const char	*m_ident;<br />
	mac_ext_t	*m_extp;<br />
	struct mac_impl	*m_impl;<br />
	void		*m_driver;<br />
	dev_info_t	*m_dip;<br />
	uint_t		m_port;<br />
	mac_info_t	m_info;<br />
	mac_stat_t	m_stat;<br />
	mac_start_t	m_start;<br />
	mac_stop_t	m_stop;<br />
	mac_promisc_t	m_promisc;<br />
	mac_multicst_t	m_multicst;<br />
	mac_unicst_t	m_unicst;<br />
	mac_resources_t	m_resources;<br />
	mac_ioctl_t	m_ioctl;<br />
	mac_tx_t	m_tx;<br />
} mac_t;<br />
</code><br />
This structure must persist for the lifetime of the registration, i.e. it cannot be de-allocated until after mac_unregister() is called. A GLDv3 driver&#8217;s _init(9E) entry point is also required to call mac_init_ops() before calling mod_install(9F), and its _fini(9E) entry point is required to call mac_fini_ops() after calling mod_remove(9F).</p>
<p>The important members of this &#8216;mac_t&#8217; structure are:</p>
<ul>
<li>&#8216;m_impl&#8217; &#8211; This is used by the MAC module to point to its private data. It must not be read or modified by a driver.</li>
<li>&#8216;m_driver&#8217; &#8211; This field should be set by the driver to point at its private data. This value will be supplied as the first argument to the driver entry points.</li>
<li>&#8216;m_dip&#8217; &#8211; This field must be set to the dev_info_t pointer of the driver instance calling mac_register().</li>
<li> &#8216;m_stat&#8217; -<br />
<code><br />
typedef uint64_t	(*mac_stat_t)(void *, mac_stat_t);<br />
</code><br />
This entry point is called to retrieve a value for one of the statistics defined in the mac_stat_t enumeration (below). All values should be stored and returned in 64-bit unsigned integers. Values will not be requested for statistics that the driver has not explicitly declared to be supported.
</li>
<li>&#8216;m_start&#8217; -<br />
<code><br />
typedef	int (*mac_start_t)(void *);<br />
</code><br />
This entry point is called to bring the device out of the reset/quiesced state that it was in when the interface was registered. No packets will be submitted by the MAC module for transmission and no packets should be submitted by the driver for reception before this call is made. If this function succeeds then zero should be returned. If it fails then an appropriate errno value should be returned.
</li>
<li>&#8216;m_stop&#8217; -<br />
<code><br />
typedef void (*mac_stop_t)(void *);<br />
</code><br />
This entry point should stop the device and put it in a reset/quiesced state such that the interface can be unregistered. No packets will be submitted by the MAC for transmission once this call has been made and no packets should be submitted by the driver for reception once it has completed.
</li>
<li>&#8216;m_promisc&#8217; -<br />
<code><br />
typedef int (*mac_promisc_t)(void *, boolean_t);<br />
</code><br />
This entry point is used to set the promiscuity of the device. If the second argument is B_TRUE then the device should receive all packets on the media. If it is set to B_FALSE then only packets destined for the device&#8217;s unicast address and the media broadcast address should be received.
</li>
<li>&#8216;m_multicst&#8217; -<br />
<code><br />
typedef int (*mac_multicst_t)(void *, boolean_t, const uint8_t *);<br />
</code><br />
This entry point is used to add and remove addresses to and from the set of multicast addresses for which the device will receive packets. If the second argument is B_TRUE then the address pointed to by the third argument should be added to the set. If the second argument is B_FALSE then the address pointed to by the third argument should be removed.
</li>
<li>&#8216;m_unicst&#8217; -<br />
<code><br />
typedef int (*mac_unicst_t)(void *, const uint8_t *);<br />
</code><br />
This entry point is used to set a new device unicast address. Once this call is made then only packets with the new address and the media broadcast address should be received unless the device is in promiscuous mode.
</li>
<li>&#8216;m_resources&#8217; -<br />
<code><br />
typedef void (*mac_resources_t)(void *, boolean_t);<br />
</code><br />
This entry point is called to request that the driver register its individual receive resources or Rx rings.</li>
<li>&#8216;m_tx&#8217; -<br />
<code><br />
typedef mblk_t *(*mac_tx_t)(void *, mblk_t *);<br />
</code><br />
This entry point is used to submit packets for transmission by the device. The second argument points to one or more packets contained in mblk_t structures. Fragments of the same packet will be linked together using the b_cont field. Separate packets will be linked by the b_next field in the leading fragment. Packets should be scheduled for transmission in the order in which they appear in the chain. Any remaining chain of packets that cannot be scheduled should be returned. If m_tx() does return packets that cannot be scheduled the driver must call mac_tx_update() when resources become available. If all packets are scheduled for transmission then NULL should be returned.
</li>
<li>&#8216;m_info&#8217; &#8211; This is an embedded structure defined as follows:<br />
<code><br />
typedef struct mac_info {<br />
		uint_t		mi_media;<br />
		uint_t		mi_sdu_min;<br />
		uint_t		mi_sdu_max;<br />
		uint32_t	mi_cksum;<br />
		uint32_t	mi_poll;<br />
		boolean_t	mi_stat[MAC_NSTAT];<br />
		uint_t		mi_addr_length;<br />
		uint8_t		mi_unicst_addr[MAXADDRLEN];<br />
		uint8_t		mi_brdcst_addr[MAXADDRLEN];<br />
	} mac_info_t;<br />
</code><br />
mi_media is set to the media type; mi_sdu_min is the minimum payload size; mi_sdu_max is the maximum payload size; mi_cksum describes the checksum capabilities of the device; mi_poll indicates whether the driver supports polling; mi_addr_length is set to the length of the addresses used by the media; mi_unicst_addr is set to the unicast address of the device at the point at which mac_register() is called; mi_brdcst_addr is set to the broadcast address of the media; mi_stat is an array of boolean values, one per statistic in the following enumeration, declaring which statistics the driver supports:<br />
<code><br />
typedef enum {<br />
		MAC_STAT_IFSPEED = 0,<br />
		MAC_STAT_MULTIRCV,<br />
		MAC_STAT_BRDCSTRCV,<br />
		MAC_STAT_MULTIXMT,<br />
		MAC_STAT_BRDCSTXMT,<br />
		MAC_STAT_NORCVBUF,<br />
		MAC_STAT_IERRORS,<br />
		MAC_STAT_UNKNOWNS,<br />
		MAC_STAT_NOXMTBUF,<br />
		MAC_STAT_OERRORS,<br />
		MAC_STAT_COLLISIONS,<br />
		MAC_STAT_RBYTES,<br />
		MAC_STAT_IPACKETS,<br />
		MAC_STAT_OBYTES,<br />
		MAC_STAT_OPACKETS,<br />
		MAC_STAT_ALIGN_ERRORS,<br />
		MAC_STAT_FCS_ERRORS,<br />
		MAC_STAT_FIRST_COLLISIONS,<br />
		MAC_STAT_MULTI_COLLISIONS,<br />
		MAC_STAT_SQE_ERRORS,<br />
		MAC_STAT_DEFER_XMTS,<br />
		MAC_STAT_TX_LATE_COLLISIONS,<br />
		MAC_STAT_EX_COLLISIONS,<br />
		MAC_STAT_MACXMT_ERRORS,<br />
		MAC_STAT_CARRIER_ERRORS,<br />
		MAC_STAT_TOOLONG_ERRORS,<br />
		MAC_STAT_MACRCV_ERRORS,<br />
		MAC_STAT_XCVR_ADDR,<br />
		MAC_STAT_XCVR_ID,<br />
		MAC_STAT_XVCR_INUSE,<br />
		MAC_STAT_CAP_1000FDX,<br />
		MAC_STAT_CAP_1000HDX,<br />
		MAC_STAT_CAP_100FDX,<br />
		MAC_STAT_CAP_100HDX,<br />
		MAC_STAT_CAP_10FDX,<br />
		MAC_STAT_CAP_10HDX,<br />
		MAC_STAT_CAP_ASMPAUSE,<br />
		MAC_STAT_CAP_PAUSE,<br />
		MAC_STAT_CAP_AUTONEG,<br />
		MAC_STAT_ADV_CAP_1000FDX,<br />
		MAC_STAT_ADV_CAP_1000HDX,<br />
		MAC_STAT_ADV_CAP_100FDX,<br />
		MAC_STAT_ADV_CAP_100HDX,<br />
		MAC_STAT_ADV_CAP_10FDX,<br />
		MAC_STAT_ADV_CAP_10HDX,<br />
		MAC_STAT_ADV_CAP_ASMPAUSE,<br />
		MAC_STAT_ADV_CAP_PAUSE,<br />
		MAC_STAT_ADV_CAP_AUTONEG,<br />
		MAC_STAT_LP_CAP_1000FDX,<br />
		MAC_STAT_LP_CAP_1000HDX,<br />
		MAC_STAT_LP_CAP_100FDX,<br />
		MAC_STAT_LP_CAP_100HDX,<br />
		MAC_STAT_LP_CAP_10FDX,<br />
		MAC_STAT_LP_CAP_10HDX,<br />
		MAC_STAT_LP_CAP_ASMPAUSE,<br />
		MAC_STAT_LP_CAP_PAUSE,<br />
		MAC_STAT_LP_CAP_AUTONEG,<br />
		MAC_STAT_LINK_ASMPAUSE,<br />
		MAC_STAT_LINK_PAUSE,<br />
		MAC_STAT_LINK_AUTONEG,<br />
		MAC_STAT_LINK_DUPLEX,<br />
		MAC_STAT_LINK_STATE,<br />
		MAC_NSTAT	/* must be the last entry */<br />
	} mac_stat_t;<br />
</code><br />
The macros MAC_MIB_SET(), MAC_ETHER_SET() and MAC_MII_SET() are provided to set all the values in each of the three groups respectively to B_TRUE.</li>
</ul>
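<p>The m_tx() contract described above (transmit in order, return whatever could not be scheduled) can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the mblk_t below is a two-field stand-in for the real STREAMS structure, and toy_drv_t, frag_count() and toy_m_tx() are hypothetical names. The later call to mac_tx_update() is only mentioned in a comment.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the STREAMS mblk_t: only the two link fields. */
typedef struct mblk {
    struct mblk *b_next;    /* next packet in the chain */
    struct mblk *b_cont;    /* next fragment of the same packet */
} mblk_t;

/* Hypothetical driver state: one Tx descriptor is used per fragment. */
typedef struct toy_drv {
    int tx_free;            /* free Tx descriptors */
    int tx_sent;            /* packets handed to the hardware */
} toy_drv_t;

/* Count the fragments of one packet by walking b_cont. */
static int frag_count(const mblk_t *mp)
{
    int n = 0;
    for (; mp != NULL; mp = mp->b_cont)
        n++;
    return n;
}

/* Same shape as mac_tx_t: schedule packets in chain order and return
 * the unsent remainder, or NULL if every packet was scheduled. */
mblk_t *toy_m_tx(void *arg, mblk_t *chain)
{
    toy_drv_t *drv = arg;
    assert(drv != NULL);
    while (chain != NULL) {
        int need = frag_count(chain);
        if (need > drv->tx_free)
            break;          /* out of descriptors: stop, preserve order */
        mblk_t *next = chain->b_next;
        chain->b_next = NULL;   /* detach the packet from the chain */
        drv->tx_free -= need;
        drv->tx_sent++;
        chain = next;
    }
    /* A real driver that returns non-NULL must call mac_tx_update()
     * once descriptors become available again. */
    return chain;
}
```

<p>The caller keeps the returned remainder and resubmits it later, so packet ordering is preserved even when the ring fills up in the middle of a chain.</p>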
<h4><a class="mozTocH3" name="mozTocId518643.125"></a>MAC Services (MAC) module</h4>
<p>Some key Driver Support Functions:</p>
<ul>
<li>&#8216;mac_resource_add&#8217; -<br />
<code><br />
extern mac_resource_handle_t mac_resource_add(mac_t *,	mac_resource_t *);<br />
</code><br />
Various members are defined as<br />
<code><br />
		typedef void (*mac_blank_t)(void *, time_t, uint_t);<br />
		typedef mblk_t *(*mac_poll_t)(void *, uint_t);<br />
		typedef enum {<br />
       			MAC_RX_FIFO = 1<br />
		} mac_resource_type_t;<br />
		typedef struct mac_rx_fifo_s {<br />
			mac_resource_type_t	mrf_type;	/* MAC_RX_FIFO */<br />
			mac_blank_t		mrf_blank;<br />
			mac_poll_t		mrf_poll;<br />
			void 			*mrf_arg;<br />
			time_t			mrf_normal_blank_time;<br />
			uint_t			mrf_normal_pkt_cnt;<br />
		} mac_rx_fifo_t;<br />
		typedef union mac_resource_u {<br />
			mac_resource_type_t	mr_type;<br />
			mac_rx_fifo_t		mr_fifo;<br />
		} mac_resource_t;<br />
</code><br />
This function should be called from the m_resources() entry point to register individual receive resources (commonly ring buffers of DMA descriptors) with the MAC module. The returned mac_resource_handle_t value should then be supplied in calls to mac_rx(). The second argument to mac_resource_add() specifies the resource being added. Resources are specified by the mac_resource_t structure. Currently only resources of type MAC_RX_FIFO are supported. MAC_RX_FIFO resources are described by the mac_rx_fifo_t structure.</p>
<p>This mac_blank function is meant to be used by upper layers to control the interrupt rate of the device. The first argument is the device context meant to be used as the first argument to poll_blank.</p>
<p>The other fields mrf_normal_blank_time and mrf_normal_pkt_cnt specify the default interrupt interval and packet count threshold, respectively. These parameters may be used as the second and third arguments to mac_blank when the upper layer wants the driver to revert to the default interrupt rate.</p>
<p>The upper layer controls the interrupt rate by calling poll_blank with different arguments. It can raise or lower the rate by passing a multiple of these values as the last two arguments of mac_blank. Setting these values to zero disables the interrupts, and the NIC is deemed to be in polling mode.</p>
<p>mac_poll is the driver-supplied function used by the upper layer to retrieve a chain of packets (up to the maximum count specified by the second argument) from the Rx ring that corresponds to the mrf_arg supplied earlier to mac_resource_add (and passed as the first argument to mac_poll).</p></li>
<li>&#8216;mac_resource_update&#8217; -<br />
<code><br />
extern void mac_resource_update(mac_t *);<br />
</code><br />
Invoked by the driver when the available resources have changed.</li>
<li>&#8216;mac_rx&#8217; -<br />
<code><br />
extern void mac_rx(mac_t *, mac_resource_handle_t, mblk_t *);<br />
</code><br />
This function should be called to deliver a chain of packets, contained in mblk_t structures, for reception. Fragments of the same packet should be linked together using the b_cont field. Separate packets should be linked using the b_next field of the leading fragment. If the packet chain was received by a registered resource then the appropriate mac_resource_handle_t value should be supplied as the second argument to the function. The protocol stack will use this value as a hint when trying to load-spread across multiple CPUs. It is assumed that packets belonging to the same flow will always be received by the same resource. If the resource is unknown or is unregistered then NULL should be passed as the second argument.</li>
</ul>
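<p>The interplay of the mac_blank and mac_poll callbacks described above can be modelled in a few lines of user-space C. This is a hedged sketch, not driver code: toy_ring_t, toy_blank() and toy_poll() are hypothetical, the mblk_t is a two-field stand-in, unsigned int replaces uint_t, and the unit of the blank time is left unspecified because the text does not state it.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Minimal stand-in for the STREAMS mblk_t: only the two link fields. */
typedef struct mblk {
    struct mblk *b_next;    /* next packet in the chain */
    struct mblk *b_cont;    /* next fragment of the same packet */
} mblk_t;

/* Hypothetical per-ring driver state, the value passed as mrf_arg. */
typedef struct toy_ring {
    mblk_t      *rx_head;       /* packets waiting in the Rx ring */
    time_t       blank_time;    /* current interrupt interval */
    unsigned int pkt_cnt;       /* current packet count threshold */
    int          polling;       /* 1 when interrupts are disabled */
} toy_ring_t;

/* Same shape as mac_blank_t: record the requested interrupt settings.
 * Zero for both values means interrupts off, i.e. polling mode. */
void toy_blank(void *arg, time_t blank_time, unsigned int pkt_cnt)
{
    toy_ring_t *r = arg;
    assert(r != NULL);
    r->blank_time = blank_time;
    r->pkt_cnt = pkt_cnt;
    r->polling = (blank_time == 0 && pkt_cnt == 0);
}

/* Same shape as mac_poll_t: unlink and return up to max packets. */
mblk_t *toy_poll(void *arg, unsigned int max)
{
    toy_ring_t *r = arg;
    assert(r != NULL);
    mblk_t *head = r->rx_head;
    mblk_t *last = NULL;
    while (max-- > 0 && r->rx_head != NULL) {
        last = r->rx_head;
        r->rx_head = last->b_next;
    }
    if (last == NULL)
        return NULL;            /* nothing pending, or max was zero */
    last->b_next = NULL;        /* terminate the returned chain */
    return head;
}
```

<p>An upper layer that sees a backlog would call toy_blank(ring, 0, 0) and then drain the ring with toy_poll(); once the backlog clears, it would restore the normal values recorded at registration time.</p>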
<h4><a class="mozTocH3" name="mozTocId939221"></a>Data-Link Services (DLS) Module</h4>
<p>The DLS module provides a Data-Link Services interface analogous to DLPI. The DLS interface is a kernel-level functional interface as opposed to the STREAMS message based interface specified by DLPI. This module provides the interfaces necessary for the upper layer to create and destroy a data link service. It also provides the interfaces necessary to plumb and unplumb the NIC. The plumbing and unplumbing of a NIC for GLDv3 based device drivers is unchanged from the older GLDv2 or monolithic DLPI device drivers. The major changes are in the data paths, which allow direct calls, packet chains and much finer grained control over the NIC.</p>
<h4><a class="mozTocH3" name="mozTocId765093.25"></a>Data-Link Driver (DLD)</h4>
<p>The Data-Link Driver provides a DLPI interface using the interfaces provided by the DLS and MAC modules. The driver is configured using IOCTLs passed to a control node. These IOCTLs create and destroy separate DLPI provider nodes. This module deals with the DLPI messages necessary to plumb/unplumb the NIC and provides backward compatibility for the data path via STREAMS for clients that are not GLDv3 aware.</p>
<h3><a class="mozTocH3" name="mozTocId342621"></a>GLDv3 Link aggregation architecture</h3>
<p>The GLDv3 framework provides support for Link Aggregation as defined by IEEE 802.3ad. The key design principles while designing this facility were:</p>
<ul>
<li>Allow GLDv3 MAC drivers to be aggregated without code change</li>
<li>The performance of non-aggregated devices must be preserved</li>
<li>The performance of aggregated devices should be cumulative of line rate for each member i.e. minimal overheads due to aggregation</li>
<li>Support both manual configuration and Link Aggregation Control protocol (LACP)</li>
</ul>
<p>GLDv3 link aggregation is implemented by means of a pseudo driver called &#8216;aggr&#8217;. It registers virtual ports corresponding to link aggregation groups with the GLDv3 MAC layer. It uses the client interface provided by the MAC layer to control and communicate with aggregated MAC ports, as illustrated below in Fig 7. It also exports a pseudo &#8216;aggr&#8217; device driver which is used by the &#8216;dladm&#8217; command to configure and control the link aggregated interface. Once a MAC port is configured to be part of a link aggregation group, it cannot be simultaneously accessed by other MAC clients such as the DLS layer. The exclusive access is enforced by the MAC layer. LACP is implemented by the &#8216;aggr&#8217; driver, which has access to individual MAC ports or links.</p>
<p><a href="http://sunaytripathi.files.wordpress.com/2010/03/fig7.gif"><img src="http://sunaytripathi.files.wordpress.com/2010/03/fig7.gif?w=415&#038;h=636" alt="" title="Fig7" width="415" height="636" class="aligncenter size-medium wp-image-7" /></a></p>
<p>The GLDv3 aggr driver acts as a normal MAC module to the upper layer and appears as a standard NIC interface which, once created with &#8216;dladm&#8217;, can be configured and managed by &#8216;ifconfig&#8217;. The &#8216;aggr&#8217; module registers each MAC port that is part of the aggregation with the upper layer using the &#8216;mac_resource_add&#8217; function, so that the data paths and interrupts from each MAC port can be independently managed by the upper layers (see Section 8b). In short, the aggregated interface is managed as a single interface with possibly one IP address, while the data paths are managed as individual NICs by unique CPUs/Squeues. This provides aggregation capability to Solaris with near zero overheads and linear scalability with respect to the number of MAC ports that are part of the aggregation group.</p>
<h3><a class="mozTocH3" name="mozTocId368836"></a>Checksum offload</h3>
<p>Solaris 10 improved the H/W checksum offload capability further to improve overall performance for most applications. A 16-bit one&#8217;s complement checksum offload framework has existed in Solaris for some time. It was originally added as a requirement for Zero Copy TCP/IP in Solaris 2.6 but was never extended until recently to handle other protocols. Solaris defines two classes of checksum offload:</p>
<ul>
<li>Full &#8211; Complete checksum calculation in the hardware, including pseudo-header checksum computation for TCP and UDP packets. The hardware is assumed to have the ability to parse protocol headers.</li>
<li>Partial &#8211; &#8220;Dumb&#8221; one&#8217;s complement checksum based on start, end and stuff offsets describing the span of the checksummed data and the location of the transport checksum field, with no pseudo-header calculation ability in the hardware.</li>
</ul>
<p>Adding support for non-fragmented IPv4 cases (unicast or multicast) is trivial for both transmit and receive, as most modern network adapters support either class of checksum offload with minor differences in the interface. The IPv6 cases are not as straightforward, because very few full-checksum network adapters are capable of handling checksum calculation for TCP/UDP packets over IPv6.</p>
<p>The fragmented IP cases have similar constraints. On transmit, checksumming applies to the unfragmented datagram. In order for an adapter to support checksum offload, it must be able to buffer all of the IP fragments (or perform the fragmentation in hardware) before finally calculating the checksum and sending the fragments over the wire; until then, checksum offloading for outbound IP fragments cannot be done. On the other hand, the receive fragment reassembly case is more flexible since most full-checksum (and all partial-checksum) network adapters are able to compute and provide the checksum value to the network stack. During the fragment reassembly stage, the network stack can derive the checksum status of the unfragmented datagram by combining the per-fragment values.</p>
<p>Things were simplified by not offloading checksum when IP options were present. For partial-checksum offload, certain adapters limit the start offset to a width sufficient for simple IP packets. When the length of protocol headers exceeds such a limit (due to the presence of options), the start offset will wrap around, causing incorrect calculation. For full-checksum offload, none of the capable adapters is able to correctly handle the IPv4 source routing option.</p>
<p>When transmit checksum offload takes place, the network stack will associate eligible packets with ancillary information needed by the driver to offload the checksum computation to hardware.</p>
<p>In the inbound case, the driver has full control over the packets that get associated with hardware-calculated checksum values. Once a driver advertises its capability via DL_CAPAB_HCKSUM, the network stack will accept full and/or partial-checksum information for IPv4 and IPv6 packets. This process happens for both non-fragmented and fragmented payloads.</p>
<p>Fragmented packets will first need to go through the reassembly process because checksum validation happens for fully reassembled datagrams. During reassembly, the network stack combines the hardware-calculated checksum value of each fragment.</p>
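<p>Both the &#8220;dumb&#8221; partial checksum and the combining of per-fragment values rest on 16-bit one&#8217;s complement arithmetic, which can be sketched in portable C. The function names ocsum(), partial_cksum() and cksum_add() are hypothetical and this is not the Solaris implementation. One assumption to note: whatever software has left in the checksum field is included in the sum, which is one way a pseudo-header sum could be folded in.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit one's complement sum of buf[start..end), taken as big-endian
 * 16-bit words. Intended for packet-sized buffers (up to 64 KB). */
uint16_t ocsum(const uint8_t *buf, size_t start, size_t end)
{
    uint32_t sum = 0;
    size_t i;
    assert(buf != NULL && start <= end);
    for (i = start; i + 1 < end; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < end)
        sum += (uint32_t)buf[i] << 8;   /* odd trailing byte, zero padded */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);     /* end-around carry */
    return (uint16_t)sum;
}

/* Partial offload model: sum the span [start, end) and store the
 * complemented result at the stuff offset. No header parsing at all. */
void partial_cksum(uint8_t *pkt, size_t start, size_t end, size_t stuff)
{
    uint16_t ck = (uint16_t)~ocsum(pkt, start, end);
    pkt[stuff] = (uint8_t)(ck >> 8);
    pkt[stuff + 1] = (uint8_t)(ck & 0xff);
}

/* Combine two one's complement sums, e.g. of two adjacent fragments.
 * Valid when the first span has an even length, which holds for every
 * IP fragment except the last because fragment payloads are multiples
 * of 8 bytes. */
uint16_t cksum_add(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + b;
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

<p>After partial_cksum() has stuffed the field, summing the same span again yields 0xffff, which is the usual receive-side validity check. Combining fragment sums with cksum_add() gives the same value as summing the reassembled payload in one pass.</p>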
<h4><a class="mozTocH3" name="mozTocId1859"></a>&#8216;dladm&#8217; &#8211; New command for datalink administration</h4>
<p>Over time, &#8216;ifconfig&#8217; has become severely overloaded trying to manage various layers in the stack. Solaris 10 introduced the &#8216;dladm&#8217; command to manage the data link services and ease the burden on &#8216;ifconfig&#8217;. The dladm command operates on three kinds of object:</p>
<ul>
<li>&#8216;link&#8217; &#8211; Data-links, identified by a name</li>
<li>&#8216;aggr&#8217; &#8211; Aggregations of network devices, identified by a key</li>
<li>&#8216;dev&#8217; &#8211; Network devices, identified by concatenation of a driver name and an instance number.</li>
</ul>
<p>The key of an aggregation must be an integer value between 1 and 65535. Some devices do not support configurable data-links or aggregations. The fixed data-links provided by such devices can be viewed using dladm but not configured.</p>
<p>The GLDv3 framework allows users to select the outbound load balancing policy across the members of an aggregation while configuring the aggregation. The policy specifies which dev object is used to send packets. A policy consists of a list of one or more layer specifiers separated by commas. A layer specifier is one of the following:</p>
<ul>
<li>L2 &#8211; Select outbound device according to source and destination MAC addresses of the packet.</li>
<li>L3 &#8211; Select outbound device according to source and destination IP addresses of the packet.</li>
<li>L4 &#8211; Select outbound device according to the upper layer protocol information contained in the packet. For TCP and UDP, this includes source and destination ports. For IPsec, this includes the SPI (Security Parameters Index).</li>
</ul>
<p>For example, to use upper layer protocol information, the following policy can be used:</p>
<p>-P L4</p>
<p>To use the source and destination MAC addresses as well as the source and destination IP addresses, the following policy can be used:</p>
<p>-P L2,L3</p>
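<p>As a rough illustration of how such a policy could map a packet to a member device, here is a minimal Python sketch. It is not the GLDv3 implementation: the function name, the packet field names, the device names and the use of CRC32 as the hash are all assumptions made for the example, and the IPsec SPI case is left out.</p>

```python
import zlib

# Header fields consulted for each layer specifier (hypothetical field names).
LAYER_FIELDS = {
    "L2": ("src_mac", "dst_mac"),
    "L3": ("src_ip", "dst_ip"),
    "L4": ("src_port", "dst_port"),
}

def select_device(policy, packet, devices):
    """Pick an outbound device for a packet under an aggregation policy.

    policy  -- list of layer specifiers, e.g. ["L2", "L3"]
    packet  -- dict of header fields
    devices -- list of member device names
    """
    key_parts = []
    for layer in policy:
        for name in LAYER_FIELDS[layer]:
            key_parts.append(str(packet[name]))
    # CRC32 is only a stand-in; the real framework's hash is not shown here.
    digest = zlib.crc32("|".join(key_parts).encode("ascii"))
    return devices[digest % len(devices)]

packet = {
    "src_mac": "00:03:ba:00:00:01", "dst_mac": "00:03:ba:00:00:02",
    "src_ip": "192.168.1.10", "dst_ip": "192.168.1.20",
    "src_port": 32768, "dst_port": 80,
}
devices = ["bge0", "bge1"]  # hypothetical aggregation members

# Equivalent of "-P L2,L3": only MAC and IP addresses influence the choice.
chosen = select_device(["L2", "L3"], packet, devices)
```

<p>Because the key is built only from the layers named in the policy, all packets that share those header fields leave through the same device, which keeps a given flow in order.</p>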
<p>The framework also supports the Link Aggregation Control Protocol (LACP) for GLDv3 based aggregations, which &#8216;dladm&#8217; controls via the &#8216;lacp-mode&#8217; and &#8216;lacp-timer&#8217; options. The &#8216;lacp-mode&#8217; can be set to &#8216;off&#8217;, &#8216;active&#8217; or &#8216;passive&#8217;.</p>
<p>When a new device is inserted into a system, a default non-VLAN data-link is created for it during reconfiguration boot or DR (dynamic reconfiguration). The configuration of all objects persists across reboots.</p>
<p>In the future, &#8216;dladm&#8217; and its private file where all persistent information is stored (&#8216;/etc/datalink.conf&#8217;) will be used to manage device specific parameters which are currently managed via &#8216;ndd&#8217;, driver specific configuration files and /etc/system.</p>
<h2><a class="mozTocH2" name="mozTocId400304.125"></a>7. Tuning for performance:</h2>
<p>The Solaris 10 stack is tuned to give stellar out-of-the-box performance irrespective of the H/W used. The secret lies in techniques like dynamically switching between interrupt and polling mode. When the load is manageable, the NIC is allowed to interrupt per packet, which gives very good latencies; when the load is very high, the stack switches to polling mode for better throughput and well bounded latencies. The defaults are also carefully picked based on the H/W configuration. For instance, the &#8216;tcp_conn_hash_size&#8217; tunable was very conservative before Solaris 10: the default value of 512 hash buckets was selected based on the lowest supported configuration (in terms of memory). Solaris 10 looks at the free memory at boot time to choose the value for &#8216;tcp_conn_hash_size&#8217;. Similarly, when a connection is &#8216;reaped&#8217; from the time wait state, the memory associated with the connection instance is not freed instantly (again depending on the total system memory available) but is instead put on a &#8216;free_list&#8217;. If new connections arrive within a given period, TCP tries to reuse memory from the &#8216;free_list&#8217;; otherwise the &#8216;free_list&#8217; is periodically cleaned up.</p>
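<p>The interrupt versus polling switch described above can be pictured with a small Python model. This is only a sketch of the idea, not the Solaris code: the class name, the backlog threshold and the rule for switching back are assumptions chosen for illustration.</p>

```python
class RxRing:
    """Toy receive ring that flips between interrupt and polling mode."""

    def __init__(self, poll_threshold=4):
        self.poll_threshold = poll_threshold  # hypothetical backlog limit
        self.backlog = 0
        self.mode = "interrupt"

    def packet_arrived(self):
        """Account for one packet delivered by the NIC."""
        self.backlog += 1
        # Under heavy load, stop taking per-packet interrupts and poll instead.
        if self.backlog >= self.poll_threshold:
            self.mode = "polling"

    def drain(self, batch):
        """Process up to 'batch' queued packets; return how many were handled."""
        handled = min(batch, self.backlog)
        self.backlog -= handled
        # Once the backlog is gone, per-packet interrupts give the best latency.
        if self.backlog == 0:
            self.mode = "interrupt"
        return handled

ring = RxRing(poll_threshold=4)
ring.packet_arrived()
light_load_mode = ring.mode      # one queued packet: still interrupt driven

for _ in range(5):
    ring.packet_arrived()
heavy_load_mode = ring.mode      # backlog crossed the threshold

handled = ring.drain(batch=16)
idle_mode = ring.mode            # backlog cleared
```

<p>In this model a lightly loaded ring stays interrupt driven, while a burst pushes it into polling until the backlog is drained.</p>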
<p>In spite of these features, it is sometimes necessary to tweak some tunables to deal with extreme cases or specific workloads. We discuss below some tunables that control the stack behaviour. Care should be taken to understand their impact, otherwise the system might become unstable. It is important to note that for the bulk of applications and workloads, the defaults will give the best results.</p>
<ul>
<li>&#8216;ip_squeue_fanout&#8217; &#8211; Controls whether incoming connections from one NIC are fanned out across all CPUs. A value of 0 means incoming connections are assigned to the squeue attached to the interrupted CPU. A value of 1 means the connections are fanned out across all CPUs. The latter is required when the NIC is faster than a single CPU (say a 10Gb NIC) and multiple CPUs need to service the NIC. Set via /etc/system by adding the following line:
<pre>
set ip:ip_squeue_fanout=1
</pre>
</li>
<li>&#8216;ip_squeue_bind&#8217; &#8211; Controls whether worker threads are bound to specific CPUs. When bound (the default), they give better locality. The non-default value (don&#8217;t bind) should be chosen only when processor sets are to be created on the system. Unset via /etc/system by adding the following line:
<pre>
set ip:ip_squeue_bind=0
</pre>
</li>
<li>&#8216;tcp_squeue_wput&#8217; &#8211; Controls the write side squeue drain behavior.
<ul>
<li>1 &#8211; Try to process your own packets but don&#8217;t try to drain the squeue</li>
<li>2 &#8211; Try to process your own packets as well as any queued packets.</li>
</ul>
<p>The default value is 2 and can be changed via /etc/system by adding</p>
<pre>
set ip:tcp_squeue_wput=1
</pre>
<p>This value should be set to 1 when the number of CPUs is far greater than the number of active NICs and the platform has inherently higher memory latencies, where the chances of an application thread doing the squeue drain and getting pinned are high.</p></li>
<li>&#8216;ip_squeue_wait&#8217; &#8211; Controls the amount of time in &#8216;ms&#8217; a worker thread will wait before processing queued packets, in the hope that the interrupt or writer thread will process the packet. For servers which see enough traffic, the default of 10ms is good, but for machines which see more interactive traffic (like desktops) where latency is an issue, the value should be set to 0 via /etc/system by adding
<pre>
set ip:ip_squeue_wait=0
</pre>
</li>
<li>In addition, some protocol level tuning, like changing the max_buf, high and low water marks, etc., is beneficial, especially on large memory systems.</li>
</ul>
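<p>To make the effect of &#8216;ip_squeue_fanout&#8217; concrete, the following Python sketch models how a new connection might be assigned to a CPU&#8217;s squeue under each setting. The function name and the modulo based spreading are assumptions for illustration only; the text above does not say how the stack actually distributes connections.</p>

```python
def pick_squeue_cpu(fanout, interrupted_cpu, conn_id, num_cpus):
    """Return the CPU whose squeue should own a new incoming connection.

    fanout          -- value of the ip_squeue_fanout tunable (0 or 1)
    interrupted_cpu -- CPU that took the NIC interrupt
    conn_id         -- hypothetical per-connection number used for spreading
    num_cpus        -- number of CPUs available to service the NIC
    """
    if fanout == 0:
        # Keep the connection on the squeue of the interrupted CPU.
        return interrupted_cpu
    # Assumed spreading rule: simple modulo over the available CPUs.
    return conn_id % num_cpus

# Eight connections arriving on a NIC whose interrupt lands on CPU 2.
no_fanout = [pick_squeue_cpu(0, 2, conn, 4) for conn in range(8)]
with_fanout = [pick_squeue_cpu(1, 2, conn, 4) for conn in range(8)]
```

<p>With fanout disabled every connection lands on the interrupted CPU, which favours locality; with fanout enabled the same connections are spread over all four CPUs, which is what a fast NIC needs.</p>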
<h2><a class="mozTocH2" name="mozTocId109935.03125"></a>8. Future</h2>
<p>The future direction of the Solaris networking stack will continue to build on better vertical integration between layers, which will further improve locality and performance. With the advent of chip multithreading and multi-core CPUs, the number of parallel execution pipelines will continue to increase even on low end systems. A typical 2 CPU machine today is dual core, providing 4 execution pipelines, and will soon have hyperthreading as well.</p>
<p>NICs are also becoming more advanced, offering multiple interrupts via MSI-X, small classification capabilities, multiple DMA channels, and various stateless offloads like large segment offload.</p>
<p>Future work will continue to leverage these H/W trends, including support for TCP offload engines, Remote Direct Memory Access (RDMA), and iSCSI. Some other specific things that are being worked on:</p>
<ul>
<li>Network stack virtualization &#8211; With the industry wide trend of server consolidation and running multiple virtual machines on the same physical instance, it is important that the Solaris stack can be virtualized efficiently.</li>
<li>B/W Resource control &#8211; The same trend that is driving network virtualization is also driving the need to efficiently control the bandwidth usage of various applications and virtual machines on the same box.</li>
<li>Support for high performance 3rd party modules &#8211; The current Solaris 10 framework is still private to modules from Sun. STREAMS based modules are the only option for ISVs, and they miss the full potential of the new framework.</li>
<li>Forwarding performance &#8211; Work is being done to further improve the Solaris forwarding performance.</li>
<li>Network security with performance &#8211; The world is becoming increasingly complex and hostile. It is no longer possible to choose between performance and security; both are a requirement. Solaris was always very strong in security, and Solaris 10 makes great strides in enabling security without sacrificing performance. Focus will continue on enhancing IPfilter performance and functionality, and on a whole new approach to detecting Denial of Service attacks and dealing with them.</li>
</ul>
<h2><a class="mozTocH2" name="mozTocId371329.125"></a>9. Acknowledgments</h2>
<p>Many thanks to Thirumalai Srinivasan, Adi Masputra, Nicolas Droux, and Eric Cheng for contributing parts of this text. Thanks are also due to all the members of the Solaris networking community for their help.</p>
]]></content:encoded>
			<wfw:commentRss>http://sunaytripathi.wordpress.com/2010/03/25/solaris-10-networking-the-magic-revealed/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/f71c841e7597eabdfd65d9f454e3a92d?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">sunaytripathi</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/fig4.gif?w=169" medium="image">
			<media:title type="html">Fig4</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/fig5.gif?w=286" medium="image">
			<media:title type="html">Fig5</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/fig6.gif?w=221" medium="image">
			<media:title type="html">Fig6</media:title>
		</media:content>

		<media:content url="http://sunaytripathi.files.wordpress.com/2010/03/fig7.gif?w=195" medium="image">
			<media:title type="html">Fig7</media:title>
		</media:content>
	</item>
	</channel>
</rss>
