Tuesday, January 10. 2012
Brendan Gregg wrote a really interesting article about tracing ZFS: "Activity of the ZFS ARC". It's really worth a read.
Sunday, January 1. 2012
Okay … it's 2012 … and according to some people the world will end this year. But what's really happening? It's the Mayan version of the Y2038 problem. While a signed 32-bit integer counting the seconds since 1970 will overflow at 03:14:07 UTC on Tuesday, 19 January 2038 and send us back to December 1901, the same thing happens on 22 December 2012: the Mayan calendar will send us from 13.0.0.0.0 to 0.0.0.0.0. And sorry, the Y2038 problem has more potential to kill us all than the Mayan calendar, because nobody used the Mayan calendar in embedded systems for nuclear weapons, nuclear power stations, isolation fields for strangelets created in the LHC or air traffic control. With signed 32-bit integers I'm not so sure.
Wednesday, December 28. 2011
Work in Progress - this entry will change often in the next days and weeks
A few days^H^H^H^Hweeks ago, I wrote about simulating the cloud that is most often labeled "network" or "intranet" and sometimes "internet". This wouldn't be c0t0d0s0.org without an article explaining how you can configure it. This article explains how to simulate a complete network on a single host, with routers, switches, dynamic routing protocols and so on.
Scope
First I want to set the expectations right. I don't want to simulate a cloud in the sense of cloud computing here. I'm thinking about something more complex:
I'm talking about the simulation of the cloud that often hides a lot of complexity and many traps in architecture diagrams.
A word of caution first
This article uses an invisible feature. You don't see that it's there, because it isn't in the man page and it isn't in the help output of the dladm command. But it's there: the commands dladm create-simnet and dladm modify-simnet. As they are undocumented, I assume they can disappear without any notice; officially they aren't there. Don't complain here when they disappear, and don't complain to Oracle. Of course there is no support. You know the game. Consider it an artifact, like a diagnostic socket labeled "Only for factory use", or like the test wiring that exists in every technical product and is used only for testing before the product leaves the factory. Never ever use it in production.
Why am I writing about this "feature" here? Because it's useful, and because there are a multitude of hints that this function exists, all of them public. The zonestat documentation at docs.oracle.com mentions a "simnet" type, and from there you are just a Google search away from PSARC case 2009/200. The source code at src.opensolaris.org shows it as well. From there, a bit of curiosity is enough to find out everything else used in this text.
About this article
I stumbled over this command for the first time when I was searching for something in the dladm source at src.opensolaris.org. A month ago my former colleague Brian Utterback reminded me of it and I thought "let's check if this is still working". To my astonishment, it still worked.
Writing this article is taking virtually forever. Because of my broken ankle I took painkillers that made me somewhat drowsy, and this drowsiness slowed everything down. So I decided to write this article under your observation to finally get it out of the door. Thus it's a work in progress.
simnet
So far I've only mentioned simnets. What are simnets? For in-depth information I'd like to point you to the PSARC case; it's available in the caselog on opensolaris.org. In short: simnets are simulated networks. They are a mechanism to test networking protocols, and in this example we will use them for exactly this purpose: testing networking.
Okay, let's assume you are the admin of FUBAR Inc. and you want to recreate your network in a box. You have offices in Hamburg, London, Singapore, New York and San Francisco. In each office you have a multi-legged router: one interface connects to a switch for the internal network with servers and clients, and the other interfaces of the router connect to routers in other offices. As a picture says more than a thousand words, I will just summarize the network with this figure:

Configuring it
Of course the routers and the servers will be zones. However, we have to recreate the network topology as well, and that's the point where we use the simnet non-feature.
First we need the switches in our offices. Those are really easy to configure as bridges:
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-bridge london
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-bridge hamburg
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-bridge singapore
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-bridge newyork
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-bridge sanfrancisco
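If you rebuild this lab more often, a small loop saves some typing. This is a minimal sketch in plain POSIX sh, not part of the original setup: the `run` helper and the `DRY_RUN` variable are my own hypothetical names. With the default `DRY_RUN=1` it only prints the `dladm` commands, so you can review them before running it with `DRY_RUN=0` in the global zone.

```shell
#!/bin/sh
# Offices of FUBAR Inc.; one bridge ("switch") per office.
SITES="london hamburg singapore newyork sanfrancisco"

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create one bridge per office with dladm create-bridge.
create_switches() {
    for site in $SITES; do
        run dladm create-bridge "$site"
    done
}

create_switches
```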
Now I need some switch ports. First I create the ports that connect each switch to its router.
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonsw1_255
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sanfranciscosw1_255
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet newyorksw1_255
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgsw1_255
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet singaporesw1_255
Now I create some additional switch ports to connect the servers.
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonsw1_1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonsw1_2
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgsw1_1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet singaporesw1_1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet newyorksw1_1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sanfranciscosw1_1
Ports meant for a bridge are nice, but they also have to be connected to the bridge.
root@cloudinabox:/opt/cloudsimulation/zones# dladm add-bridge -l londonsw1_1 -l londonsw1_2 -l londonsw1_255 london
root@cloudinabox:/opt/cloudsimulation/zones# dladm add-bridge -l hamburgsw1_1 -l hamburgsw1_255 hamburg
root@cloudinabox:/opt/cloudsimulation/zones# dladm add-bridge -l singaporesw1_1 -l singaporesw1_255 singapore
root@cloudinabox:/opt/cloudsimulation/zones# dladm add-bridge -l sanfranciscosw1_1 -l sanfranciscosw1_255 sanfrancisco
root@cloudinabox:/opt/cloudsimulation/zones# dladm add-bridge -l newyorksw1_1 -l newyorksw1_255 newyork
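The port creation and the add-bridge step follow a fixed naming scheme (`<site>sw1_255` for the router uplink, `<site>sw1_<n>` for servers), so they can be combined per office. Again a hedged sketch: `build_switch`, `run` and `DRY_RUN` are hypothetical names of mine, and by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: build_switch <site> <number-of-server-ports>
# Creates <site>sw1_255 (router uplink) and <site>sw1_1..n (server ports),
# then attaches all of them to the bridge named <site>.
build_switch() {
    site=$1
    nservers=$2
    run dladm create-simnet "${site}sw1_255"
    ports=""
    i=1
    while [ "$i" -le "$nservers" ]; do
        run dladm create-simnet "${site}sw1_${i}"
        ports="$ports -l ${site}sw1_${i}"
        i=$((i + 1))
    done
    # $ports is left unquoted on purpose: it holds "-l <port>" word pairs.
    run dladm add-bridge $ports -l "${site}sw1_255" "$site"
}

build_switch london 2
for office in hamburg singapore newyork sanfrancisco; do
    build_switch "$office" 1
done
```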
Now let's create all the interfaces we need for the routers.
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonrouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgrouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgrouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgrouter3
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet newyorkrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet newyorkrouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sinrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sinrouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sinrouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sforouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sforouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sforouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sforouter3
And of course we need interfaces for all the servers:
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonsrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet londonsrv2
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet hamburgsrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet singaporesrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet sanfranciscosrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm create-simnet newyorksrv1
Now we have to create logical cables … lots of them. First the routers to their switches, then the routers to each other, and finally the servers to their switches.
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p londonsw1_255 londonrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p hamburgsw1_255 hamburgrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p singaporesw1_255 sinrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p newyorksw1_255 newyorkrouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p sanfranciscosw1_255 sforouter0
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p hamburgrouter1 londonrouter1
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p sforouter1 hamburgrouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p sinrouter1 hamburgrouter3
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p sforouter3 sinrouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p newyorkrouter1 sforouter2
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p hamburgsw1_1 hamburgsrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p singaporesw1_1 singaporesrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p newyorksw1_1 newyorksrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p londonsw1_1 londonsrv1
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p londonsw1_2 londonsrv2
root@cloudinabox:/opt/cloudsimulation/zones# dladm modify-simnet -p sanfranciscosw1_1 sanfranciscosrv1
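With this many cables it's easy to plug one simnet in twice. One way to guard against that, sketched under my own assumptions, is to keep the whole cabling as a list of "peer link" pairs (the same order as the `dladm modify-simnet -p <peer> <link>` calls above), check that no name shows up twice, and only then wire it. `CABLES`, `check_cables`, `wire_cables`, `run` and `DRY_RUN` are hypothetical names; the default only prints the commands.

```shell
#!/bin/sh
# Every cable once, as "<peer> <link>" for dladm modify-simnet -p.
CABLES="
londonsw1_255 londonrouter0
hamburgsw1_255 hamburgrouter0
singaporesw1_255 sinrouter0
newyorksw1_255 newyorkrouter0
sanfranciscosw1_255 sforouter0
hamburgrouter1 londonrouter1
sforouter1 hamburgrouter2
sinrouter1 hamburgrouter3
sforouter3 sinrouter2
newyorkrouter1 sforouter2
hamburgsw1_1 hamburgsrv1
singaporesw1_1 singaporesrv1
newyorksw1_1 newyorksrv1
londonsw1_1 londonsrv1
londonsw1_2 londonsrv2
sanfranciscosw1_1 sanfranciscosrv1
"

# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# In this lab every simnet has exactly one peer, so a name that appears
# in more than one cable is a wiring mistake.
check_cables() {
    dups=$(printf '%s\n' "$1" | awk 'NF == 2 { print $1; print $2 }' | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "simnet used in more than one cable: $dups" >&2
        return 1
    fi
    return 0
}

# Validate the list, then wire each pair (empty lines are skipped).
wire_cables() {
    check_cables "$1" || return 1
    printf '%s\n' "$1" | while read -r peer link; do
        [ -n "$link" ] || continue
        run dladm modify-simnet -p "$peer" "$link"
    done
}

wire_cables "$CABLES"
```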
Phew … that's all on the networking side.
The active configuration should look something like this:
root@cloudinabox:/home/jmoekamp# dladm show-link
LINK CLASS MTU STATE OVER
net1 phys 1500 unknown --
net2 phys 1500 up --
net0 phys 1500 unknown --
london0 bridge 1500 up londonsw1_1 londonsw1_2 londonsw1_255
hamburg0 bridge 1500 up hamburgsw1_1 hamburgsw1_255
singapore0 bridge 1500 up singaporesw1_1 singaporesw1_255
newyork0 bridge 1500 up newyorksw1_1 newyorksw1_255
sanfrancisco0 bridge 1500 up sanfranciscosw1_1 sanfranciscosw1_255
londonsw1_255 simnet 1500 up londonrouter0
sanfranciscosw1_255 simnet 1500 up sforouter0
newyorksw1_255 simnet 1500 up newyorkrouter0
hamburgsw1_255 simnet 1500 up hamburgrouter0
singaporesw1_255 simnet 1500 up sinrouter0
londonsw1_1 simnet 1500 up londonsrv1
londonsw1_2 simnet 1500 up londonsrv2
hamburgsw1_1 simnet 1500 up hamburgsrv1
singaporesw1_1 simnet 1500 up singaporesrv1
newyorksw1_1 simnet 1500 up newyorksrv1
sanfranciscosw1_1 simnet 1500 up sanfranciscosrv1
londonrouter0 simnet 1500 up londonsw1_255
londonrouter1 simnet 1500 up hamburgrouter1
hamburgrouter0 simnet 1500 up hamburgsw1_255
hamburgrouter1 simnet 1500 up londonrouter1
hamburgrouter2 simnet 1500 up sforouter1
hamburgrouter3 simnet 1500 up sinrouter1
newyorkrouter0 simnet 1500 up newyorksw1_255
newyorkrouter1 simnet 1500 up sforouter2
sinrouter0 simnet 1500 up singaporesw1_255
sinrouter1 simnet 1500 up hamburgrouter3
sinrouter2 simnet 1500 up sforouter3
sforouter0 simnet 1500 up sanfranciscosw1_255
sforouter1 simnet 1500 up hamburgrouter2
sforouter2 simnet 1500 up newyorkrouter1
sforouter3 simnet 1500 up sinrouter2
londonsrv1 simnet 1500 up londonsw1_1
londonsrv2 simnet 1500 up londonsw1_2
hamburgsrv1 simnet 1500 up hamburgsw1_1
singaporesrv1 simnet 1500 up singaporesw1_1
sanfranciscosrv1 simnet 1500 up sanfranciscosw1_1
newyorksrv1 simnet 1500 up newyorksw1_1
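In this listing every simnet should point to its peer and the peer should point back. Instead of eyeballing 32 lines, a small awk filter can check that; this is a sketch of mine (`check_peers` is a hypothetical name). Run it in the global zone: inside a zone the OVER column shows "?", which this check would report as unpaired.

```shell
#!/bin/sh
# Read "dladm show-link" output on stdin and report every simnet whose
# peer is missing or does not point back to it. Bridges and physical
# links are skipped. Exit status 0 means all simnets are paired.
check_peers() {
    awk '
        $2 == "simnet" { peer[$1] = $5 }
        END {
            bad = 0
            for (link in peer) {
                p = peer[link]
                # "in" tests membership without creating array entries
                if (p == "" || !(p in peer) || peer[p] != link) {
                    printf "not paired: %s -> %s\n", link, p
                    bad = 1
                }
            }
            exit bad
        }'
}

# Usage in the global zone:
#   dladm show-link | check_peers && echo "all simnets are paired"
```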
Zone Creation
Okay, now we have to create the zones.
mkdir -p /opt/cloudsimulation/zones
zfs create rpool/zones
zfs set mountpoint=/zones rpool/zones
First we create a lot of control files, which we will feed to zonecfg later on. I created the /opt/cloudsimulation/zones directory to hold them. Of course it's useful to have a separate ZFS filesystem for the zones, so that the zone creation process can give a new zone the data it needs simply by cloning a filesystem instead of copying it.
/opt/cloudsimulation/zones/templateserver
create -b
set zonepath=/zones/templateserver
set brand=solaris
set autoboot=false
set ip-type=exclusive
/opt/cloudsimulation/zones/londonrouter:
create -b
set zonepath=/zones/londonrouter
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=londonrouter0
end
add net
set configure-allowed-address=true
set physical=londonrouter1
end
/opt/cloudsimulation/zones/hamburgrouter
create -b
set zonepath=/zones/hamburgrouter
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=hamburgrouter0
end
add net
set configure-allowed-address=true
set physical=hamburgrouter1
end
add net
set configure-allowed-address=true
set physical=hamburgrouter2
end
add net
set configure-allowed-address=true
set physical=hamburgrouter3
end
/opt/cloudsimulation/zones/singaporerouter:
create -b
set zonepath=/zones/singaporerouter
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=sinrouter0
end
add net
set configure-allowed-address=true
set physical=sinrouter1
end
add net
set configure-allowed-address=true
set physical=sinrouter2
end
/opt/cloudsimulation/zones/sanfranciscorouter:
create -b
set zonepath=/zones/sanfranciscorouter
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=sforouter0
end
add net
set configure-allowed-address=true
set physical=sforouter1
end
add net
set configure-allowed-address=true
set physical=sforouter2
end
add net
set configure-allowed-address=true
set physical=sforouter3
end
/opt/cloudsimulation/zones/newyorkrouter:
create -b
set zonepath=/zones/newyorkrouter
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=newyorkrouter0
end
add net
set configure-allowed-address=true
set physical=newyorkrouter1
end
Whoever is wondering about the IATA shorthands sfo and sin that I've used instead of the long names as for the other cities: Quagga doesn't seem to like interface names longer than 16 characters.
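If you add more offices, a quick length check on the planned interface names avoids that surprise. This is a hedged sketch: `check_names` is a hypothetical helper, and `MAXLEN=16` simply mirrors the observation above; I haven't verified where the limit comes from. Note that "singaporerouter0" has exactly 16 characters, so if the real limit counts a terminating NUL byte, `MAXLEN=15` is the safer choice.

```shell
#!/bin/sh
# Flag interface names longer than MAXLEN characters. The default of 16
# follows the observation in the text; it is an assumption, not a
# documented Quagga limit.
MAXLEN=${MAXLEN:-16}

check_names() {
    rc=0
    for name in "$@"; do
        if [ "${#name}" -gt "$MAXLEN" ]; then
            echo "too long (${#name} > $MAXLEN): $name"
            rc=1
        fi
    done
    return "$rc"
}

# The long San Francisco name is flagged, the IATA variant is not.
check_names sanfranciscorouter0 sforouter0 || echo "rename the flagged interfaces"
```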
Okay. Now we have to create all the zones. That's easy: as I said, I will just feed the control files into zonecfg with the -f option.
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z templateserver -f /opt/cloudsimulation/zones/templateserver
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z templaterouter -f /opt/cloudsimulation/zones/templaterouter
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z londonrouter -f /opt/cloudsimulation/zones/londonrouter
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z singaporerouter -f /opt/cloudsimulation/zones/singaporerouter
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z hamburgrouter -f /opt/cloudsimulation/zones/hamburgrouter
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z sanfranciscorouter -f /opt/cloudsimulation/zones/sanfranciscorouter
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z newyorkrouter -f /opt/cloudsimulation/zones/newyorkrouter
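The router control files only differ in the zone name and the list of simnets, so they could also be generated. A minimal sketch, assuming the layout of the files shown above; `router_zonecfg` is a hypothetical helper that prints to stdout, for example `router_zonecfg sanfranciscorouter sforouter0 sforouter1 sforouter2 sforouter3 > /opt/cloudsimulation/zones/sanfranciscorouter`.

```shell
#!/bin/sh
# usage: router_zonecfg <zonename> <simnet>...
# Print a zonecfg command file in the same layout as the router control
# files above: exclusive-IP zone, one "add net" resource per simnet.
router_zonecfg() {
    zone=$1
    shift
    cat <<EOF
create -b
set zonepath=/zones/$zone
set brand=solaris
set autoboot=false
set ip-type=exclusive
EOF
    for link in "$@"; do
        cat <<EOF
add net
set configure-allowed-address=true
set physical=$link
end
EOF
    done
}

# Example: print the control file for the New York router.
router_zonecfg newyorkrouter newyorkrouter0 newyorkrouter1
```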
Okay, first we install the template zone. We do a full install here, and that's pretty much its only purpose: to have one installed baseline zone as the starting point for all other zones. This may take a while; depending on your system you may opt for a coffee or two.
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z templateserver install
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T184237Z.templateserver.install
Image: Preparing at /zones/templateserver/root
Install Log: /system/volatile/install.4469/install_log
AI Manifest: /tmp/manifest.xml.oBayTi
SC Profile: /usr/share/auto_install/sc_profiles/enable_sci.xml
Zonename: templateserver
Installation: Starting ...
Creating IPS image
Installing packages from:
solaris
origin: http://pkg.oracle.com/solaris/release/
DOWNLOAD PKGS FILES XFER (MB)
Completed 167/167 32062/32062 175.8/175.8
PHASE ACTIONS
Install Phase 44313/44313
PHASE ITEMS
Package State Update Phase 167/167
Image State Update Phase 2/2
Installation: Succeeded
Note: Man pages can be obtained by installing pkg:/system/manual
done.
Done: Installation completed in 1423,641 seconds.
Next Steps: Boot the zone, then log into the zone console (zlogin -C)
to complete the configuration process.
Log saved in non-global zone as /zones/templateserver/root/var/log/zones/zoneadm.20111217T184237Z.templateserver.install
We never boot this one; it's just there to ease the next steps.
Okay, now we prepare the real zones. You don't have to do the next steps, but they save you from logging into each zone and going through the same dialog windows every time. The trick that spares us the sysconfig dialog in each router is simple: you can create an XML file containing the necessary data and pass it to the cloning of the zone.
Important: I want to make the resulting XML file as generic as possible, so I won't configure networking via this process, although it is possible. As the tool has a character-based user interface, I will guide you through the dialog with some pictures.
root@cloudinabox:/opt/cloudsimulation/zones# sysconfig create-profile -o template.xml






After leaving the last screen, you should end up with a file whose content is similar to this:
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="sysconfig">
<service version="1" type="service" name="system/config-user">
<instance enabled="true" name="default">
<property_group type="application" name="root_account">
<propval type="astring" name="login" value="root"/>
<propval type="astring" name="password" value="$5$35worB11$/EeCnO5t2zOHhasRQeWeVyWuGLFFUFLQGmOhKPX82m2"/>
<propval type="astring" name="type" value="role"/>
</property_group>
<property_group type="application" name="user_account">
<propval type="astring" name="login" value="radmin"/>
<propval type="astring" name="password" value="$5$XztZ799F$GVL48echivvJcPl.BRcVvnn3/M8Z7L6LhmyVPP04J/2"/>
<propval type="astring" name="type" value="normal"/>
<propval type="astring" name="description" value="routeradm"/>
<propval type="count" name="gid" value="10"/>
<propval type="astring" name="shell" value="/usr/bin/bash"/>
<propval type="astring" name="roles" value="root"/>
<propval type="astring" name="profiles" value="System Administrator"/>
<propval type="astring" name="sudoers" value="ALL=(ALL) ALL"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="system/timezone">
<instance enabled="true" name="default">
<property_group type="application" name="timezone">
<propval type="astring" name="localtime" value="UTC"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="system/environment">
<instance enabled="true" name="init">
<property_group type="application" name="environment">
<propval type="astring" name="LANG" value="C"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="system/identity">
<instance enabled="true" name="node">
<property_group type="application" name="config">
<propval type="astring" name="nodename" value="jamphfhn"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="system/keymap">
<instance enabled="true" name="default">
<property_group type="system" name="keymap">
<propval type="astring" name="layout" value="German"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="system/console-login">
<instance enabled="true" name="default">
<property_group type="application" name="ttymon">
<propval type="astring" name="terminal_type" value="sun-color"/>
</property_group>
</instance>
</service>
<service version="1" type="service" name="network/physical">
<instance enabled="true" name="default">
<property_group type="application" name="netcfg"/>
</instance>
</service>
</service_bundle>
Before you ask: the password for radmin and root is n0mn0mn0m. And jamphfhn just stands for "just a meaningless placeholder for hostname".
Okay, I will create another template zone, because a routing zone has some special properties that a zone acting as a server doesn't need, and I don't want such properties in the server zones.
First I just take the template.xml file and substitute the hostname. I could simply do it in vi, but for a tutorial a simple shell line is more efficient.
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/templaterouter/ > templaterouter.xml
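Every router zone later needs its own profile made the same way, so the substitution can be done for all of them in one go. A small sketch under my own naming (`make_profiles` is hypothetical); it expects template.xml in the current directory and writes `<zonename>.xml` next to it.

```shell
#!/bin/sh
# usage: make_profiles <zonename>...
# For each zone, copy template.xml (current directory) to <zonename>.xml
# with the placeholder hostname jamphfhn replaced by the zone name.
make_profiles() {
    if [ ! -f template.xml ]; then
        echo "template.xml not found in $(pwd)" >&2
        return 1
    fi
    for zone in "$@"; do
        sed "s/jamphfhn/$zone/g" template.xml > "$zone.xml"
    done
}

# Example (run in /opt/cloudsimulation/zones):
#   make_profiles templaterouter londonrouter hamburgrouter \
#       singaporerouter sanfranciscorouter newyorkrouter
```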
I use the newly created templaterouter.xml as input for the zone clone command.
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z templaterouter clone -c /opt/cloudsimulation/zones/templaterouter.xml templateserver
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T193101Z.templaterouter.clone
Log saved in non-global zone as /zones/templaterouter/root/var/log/zones/zoneadm.20111217T193101Z.templaterouter.clone
As the system just creates a ZFS clone, the command should return after a short time. Now we can boot the zone and log into its console with zlogin -C.
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z templaterouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C templaterouter
[Connected to zone 'templaterouter' console]
Hostname: unknown
Hostname: templaterouter
templaterouter console login: radmin
Password:
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@templaterouter:~$
radmin@templaterouter:~$ sudo bash
Password:
Dec 17 19:36:33 templaterouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
root@templaterouter:/home/radmin#
I wrote earlier that the template for the router contains some additional stuff. First I need a telnet client; it will become obvious later on why I need it:
root@templaterouter:/home/radmin# pkg install pkg://solaris/network/telnet
Packages to install: 1
Create boot environment: No
Create backup boot environment: No
DOWNLOAD PKGS FILES XFER (MB)
Completed 1/1 8/8 0.1/0.1
PHASE ACTIONS
Install Phase 22/22
PHASE ITEMS
Package State Update Phase 1/1
Image State Update Phase 2/2
Okay, now let's install Quagga. Quagga is a suite of daemons implementing dynamic routing protocols:
root@templaterouter:/home/radmin# pkg install quagga
Packages to install: 1
Create boot environment: No
Create backup boot environment: No
Services to change: 3
DOWNLOAD PKGS FILES XFER (MB)
Completed 1/1 89/89 2.7/2.7
PHASE ACTIONS
Install Phase 132/132
PHASE ITEMS
Package State Update Phase 1/1
Image State Update Phase 2/2
Loading smf(5) service descriptions: 2/2
Okay, now we have to configure some basics that are the same for all routers in the network.
First we activate forwarding. This enables the operating system to accept packets on one interface and forward them out of another interface.
root@templaterouter:/home/radmin# routeadm -e ipv4-forwarding
ipv4-routing tells the system to start the routing protocol daemons. When you have a default router configured, it's disabled by default; when there isn't one, this setting is enabled by default.
root@templaterouter:/home/radmin# routeadm -e ipv4-routing
Okay, now we have to do some Quagga configuration. I want to use Quagga with OSPF, so two services are important for me: zebra and ospf. Zebra is the layer the Quagga suite uses to interact with the system. Why is it called zebra? I assume it's history: the old GNU routing protocol daemon suite was called GNU Zebra, and Quagga is its follow-on project, as Zebra is no longer developed. So what do we configure here?
Both daemons offer a command line for interacting with the daemon. We configure both to accept connections only from 127.0.0.1 (aka localhost). The zebra daemon has its console on port 2602, the ospf daemon listens on port 2601. These two ports are the reason we need telnet on our routers: you access the consoles via telnet.
root@templaterouter:/home/radmin# routeadm -m zebra:quagga vty_port="2602"
root@templaterouter:/home/radmin# routeadm -m ospf:quagga vty_port="2601"
root@templaterouter:/home/radmin# routeadm -m zebra:quagga vty_address="127.0.0.1"
root@templaterouter:/home/radmin# routeadm -m ospf:quagga vty_address="127.0.0.1"
With this command we tell Solaris to use OSPF as the routing protocol for IPv4.
root@templaterouter:/home/radmin# routeadm -s routing-svcs=ospf:quagga -e ipv4-routing
Now we have to activate the new settings:
root@templaterouter:/home/radmin# routeadm -u
You should now get some weird SMF error messages saying that some services couldn't start up. That's normal, because there are no configuration files for the Quagga suite yet. Don't worry about it; just shut the zone down now.
root@cloudinabox:/home/jmoekamp# zoneadm -z templaterouter halt
Okay, now we have derived our template for the router zones from the generic template for zones. We will use this template to install all the router zones.
Okay, I just wrote about Quagga config files. I want to prepare them now, so that I can simply copy them into the zones before starting them up and thus avoid the error messages. We need a lot of them.
- London
/opt/cloudsimulation/zones/zebra.london.conf
hostname londonrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/zebra.log
line vty
/opt/cloudsimulation/zones/ospfd.london.conf
hostname londonrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/ospf.log
!
!
!
interface lo0
!
interface londonrouter0
!
interface londonrouter1
!
router ospf
redistribute connected
network 10.1.1.0/24 area 0.0.0.0
!
line vty
!
- Hamburg
/opt/cloudsimulation/zones/zebra.hamburg.conf
hostname hamburgrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/zebra.log
line vty
/opt/cloudsimulation/zones/ospfd.hamburg.conf
hostname hamburgrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/ospf.log
!
!
!
interface lo0
!
interface hamburgrouter0
!
interface hamburgrouter1
!
interface hamburgrouter2
!
interface hamburgrouter3
!
router ospf
redistribute connected
network 10.1.1.0/24 area 0.0.0.0
network 10.1.2.0/24 area 0.0.0.0
network 10.1.3.0/24 area 0.0.0.0
!
line vty
!
- Singapore
/opt/cloudsimulation/zones/zebra.singapore.conf
!
! Zebra configuration saved from vty
! 2011/12/12 20:20:13
!
hostname sinrouter
password nomnomnom
enable password nomnomnom
log file /var/adm/quagga/zebra.log
!
interface lo0
!
line vty
!
/opt/cloudsimulation/zones/ospfd.singapore.conf
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/ospf.log
!
interface lo0
!
interface sinrouter0
!
interface sinrouter1
!
interface sinrouter2
!
router ospf
redistribute connected
network 10.1.2.0/24 area 0.0.0.0
network 10.1.4.0/24 area 0.0.0.0
!
line vty
!
- San Francisco
/opt/cloudsimulation/zones/zebra.sanfrancisco.conf
hostname sforouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/zebra.log
line vty
/opt/cloudsimulation/zones/ospfd.sanfrancisco.conf
!
! Zebra configuration saved from vty
! 2011/12/11 04:30:44
!
hostname sforouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/ospf.log
!
!
!
interface lo0
!
interface sforouter0
!
interface sforouter1
!
interface sforouter2
!
interface sforouter3
!
router ospf
redistribute connected
network 10.1.5.0/24 area 0.0.0.0
network 10.1.4.0/24 area 0.0.0.0
network 10.1.3.0/24 area 0.0.0.0
!
line vty
!
- New York
/opt/cloudsimulation/zones/zebra.newyork.conf
hostname newyorkrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/zebra.log
line vty
/opt/cloudsimulation/zones/ospfd.newyork.conf
!
! Zebra configuration saved from vty
! 2011/12/11 04:30:44
!
hostname newyorkrouter
password nomnomnom
enable password nonnomnom
log file /var/adm/quagga/ospf.log
!
!
!
interface lo0
!
interface newyorkrouter0
!
interface newyorkrouter1
!
router ospf
redistribute connected
network 10.1.5.0/24 area 0.0.0.0
!
line vty
!
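The ospfd.conf files above share one structure: hostname, passwords, log file, one stanza per interface, and a `router ospf` block with the transit networks in area 0.0.0.0. A hedged sketch of a generator for that structure; `ospfd_conf` is a hypothetical helper, and the password defaults simply copy the values used in the files above (override them via the environment).

```shell
#!/bin/sh
# Lab passwords as used in the files above; override via the environment.
PASSWORD=${PASSWORD:-nomnomnom}
ENABLE_PASSWORD=${ENABLE_PASSWORD:-nonnomnom}

# usage: ospfd_conf <hostname> "<interfaces>" "<networks>"
# Print an ospfd.conf: one interface stanza per name, every network in
# area 0.0.0.0, connected routes redistributed.
ospfd_conf() {
    host=$1
    ifaces=$2
    nets=$3
    printf 'hostname %s\n' "$host"
    printf 'password %s\n' "$PASSWORD"
    printf 'enable password %s\n' "$ENABLE_PASSWORD"
    printf 'log file /var/adm/quagga/ospf.log\n!\n'
    printf 'interface lo0\n!\n'
    for i in $ifaces; do
        printf 'interface %s\n!\n' "$i"
    done
    printf 'router ospf\nredistribute connected\n'
    for n in $nets; do
        printf 'network %s area 0.0.0.0\n' "$n"
    done
    printf '!\nline vty\n!\n'
}

# Example: the San Francisco router.
ospfd_conf sforouter "sforouter0 sforouter1 sforouter2 sforouter3" \
    "10.1.5.0/24 10.1.4.0/24 10.1.3.0/24"
```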
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/londonrouter/ > londonrouter.xml
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z londonrouter clone -c /opt/cloudsimulation/zones/londonrouter.xml templaterouter
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T205338Z.londonrouter.clone
Log saved in non-global zone as /zones/londonrouter/root/var/log/zones/zoneadm.20111217T205338Z.londonrouter.clone
root@cloudinabox:/opt/cloudsimulation/zones# cp zebra.london.conf /zones/londonrouter/root/etc/quagga/zebra.conf
root@cloudinabox:/opt/cloudsimulation/zones# cp ospfd.london.conf /zones/londonrouter/root/etc/quagga/ospfd.conf
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z londonrouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C londonrouter
[Connected to zone 'londonrouter' console]
londonrouter console login: radmin
Password:
Last login: Sat Dec 17 19:35:39 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@londonrouter:~$ sudo bash
Password:
Dec 17 20:57:58 londonrouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
root@londonrouter:/home/radmin#
# ipadm create-ip londonrouter0
# ipadm create-ip londonrouter1
# ipadm create-addr -T static -a 10.0.10.254/24 londonrouter0/v4
# ipadm create-addr -T static -a 10.1.1.254/24 londonrouter1/v4
# svcadm restart zebra
# svcadm restart ospf
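Inside each router zone the pattern is always the same: create the IP interfaces, add a static IPv4 address per interface, restart zebra and ospf. A sketch that takes an "interface address" table; `configure_router`, `run` and `DRY_RUN` are hypothetical names of mine, and with the default `DRY_RUN=1` it only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: configure_router "<interface> <address/prefix>" lines
# Create an IP interface and a static IPv4 address object <if>/v4 for
# every line, then restart zebra and ospf as in the transcript above.
configure_router() {
    printf '%s\n' "$1" | while read -r ifname addr; do
        [ -n "$addr" ] || continue
        run ipadm create-ip "$ifname"
        run ipadm create-addr -T static -a "$addr" "$ifname/v4"
    done
    run svcadm restart zebra
    run svcadm restart ospf
}

# Example: the London router.
configure_router "londonrouter0 10.0.10.254/24
londonrouter1 10.1.1.254/24"
```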
root@londonrouter:/home/radmin# dladm show-link
LINK CLASS MTU STATE OVER
londonrouter0 simnet 1500 up ?
londonrouter1 simnet 1500 up ?
root@londonrouter:/home/radmin# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
londonrouter0/v4 static ok 10.0.10.254/24
londonrouter1/v4 static ok 10.1.1.254/24
lo0/v6 static ok ::1/128
root@londonrouter:/home/radmin# telnet localhost 2601
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to londonrouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al
User Access Verification
Password:
londonrouter> show ip ospf interface
lo0 is up
ifindex 1, MTU 8232 bytes, BW 0 Kbit
OSPF not enabled on this interface
londonrouter0 is up
ifindex 3, MTU 1500 bytes, BW 0 Kbit
OSPF not enabled on this interface
londonrouter1 is up
ifindex 2, MTU 1500 bytes, BW 0 Kbit
Internet Address 10.1.1.254/24, Area 0.0.0.0
MTU mismatch detection:enabled
Router ID 10.1.1.254, Network Type BROADCAST, Cost: 10
Transmit Delay is 1 sec, State DR, Priority 1
Designated Router (ID) 10.1.1.254, Interface Address 10.1.1.254
No backup designated router on this network
Multicast group memberships: OSPFAllRouters OSPFDesignatedRouters
Timer intervals configured, Hello 10s, Dead 40s, Wait 40s, Retransmit 5
Hello due in 2.263s
Neighbor Count is 0, Adjacent neighbor count is 0
londonrouter> exit
Connection to londonrouter closed by foreign host.
root@londonrouter:/home/radmin# telnet localhost 2602
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to londonrouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
User Access Verification
Password:
londonrouter> show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
I - ISIS, B - BGP, > - selected route, * - FIB route
C>* 10.0.10.0/24 is directly connected, londonrouter0
O 10.1.1.0/24 [110/10] is directly connected, londonrouter1, 00:04:28
C>* 10.1.1.0/24 is directly connected, londonrouter1
C>* 127.0.0.0/8 is directly connected, lo0
londonrouter>exit
root@londonrouter:/home/radmin# ~.
[Connection to zone 'londonrouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/hamburgrouter/ > hamburgrouter.xml
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z hamburgrouter clone -c /opt/cloudsimulation/zones/hamburgrouter.xml templaterouter
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T212009Z.hamburgrouter.clone
Log saved in non-global zone as /zones/hamburgrouter/root/var/log/zones/zoneadm.20111217T212009Z.hamburgrouter.clone
root@cloudinabox:/opt/cloudsimulation/zones# cp ospfd.hamburg.conf /zones/hamburgrouter/root/etc/quagga/ospfd.conf
root@cloudinabox:/opt/cloudsimulation/zones# cp zebra.hamburg.conf /zones/hamburgrouter/root/etc/quagga/zebra.conf
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z hamburgrouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C hamburgrouter
[Connected to zone 'hamburgrouter' console]
Hostname: hamburgrouter
hamburgrouter console login: radmin
Password:
Last login: Sat Dec 17 19:35:39 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@hamburgrouter:~$ sudo bash
Password:
Dec 17 21:23:45 hamburgrouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
root@hamburgrouter:/home/radmin#
root@hamburgrouter:/home/radmin# ipadm create-ip hamburgrouter0
root@hamburgrouter:/home/radmin# ipadm create-ip hamburgrouter1
root@hamburgrouter:/home/radmin# ipadm create-ip hamburgrouter2
root@hamburgrouter:/home/radmin# ipadm create-ip hamburgrouter3
root@hamburgrouter:/home/radmin# ipadm create-addr -T static -a 10.0.11.254/24 hamburgrouter0/v4
root@hamburgrouter:/home/radmin# ipadm create-addr -T static -a 10.1.1.1/24 hamburgrouter1/v4
root@hamburgrouter:/home/radmin# ipadm create-addr -T static -a 10.1.3.1/24 hamburgrouter2/v4
root@hamburgrouter:/home/radmin# ipadm create-addr -T static -a 10.1.2.1/24 hamburgrouter3/v4
root@hamburgrouter:/home/radmin# svcadm restart ospf
root@hamburgrouter:/home/radmin# ipadm show-addr
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
hamburgrouter0/v4 static ok 10.0.11.254/24
hamburgrouter1/v4 static ok 10.1.1.1/24
hamburgrouter2/v4 static ok 10.1.3.1/24
hamburgrouter3/v4 static ok 10.1.2.1/24
lo0/v6 static ok ::1/128
root@hamburgrouter:/home/radmin# netstat -nr
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
10.0.10.0 10.1.1.254 UG 1 0
10.0.11.0 10.0.11.254 U 2 0 hamburgrouter0
10.1.1.0 10.1.1.1 U 3 9 hamburgrouter1
10.1.2.0 10.1.2.1 U 2 0 hamburgrouter3
10.1.3.0 10.1.3.1 U 2 0 hamburgrouter2
127.0.0.1 127.0.0.1 UH 2 0 lo0
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 2 0 lo0
root@hamburgrouter:/home/radmin#
root@hamburgrouter:/home/radmin# telnet localhost 2602
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to hamburgrouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
User Access Verification
Password:
hamburgrouter> show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
I - ISIS, B - BGP, > - selected route, * - FIB route
O>* 10.0.10.0/24 [110/20] via 10.1.1.254, hamburgrouter1, 00:01:04
C>* 10.0.11.0/24 is directly connected, hamburgrouter0
O 10.1.1.0/24 [110/10] is directly connected, hamburgrouter1, 00:01:09
C>* 10.1.1.0/24 is directly connected, hamburgrouter1
O 10.1.2.0/24 [110/10] is directly connected, hamburgrouter3, 00:01:09
C>* 10.1.2.0/24 is directly connected, hamburgrouter3
O 10.1.3.0/24 [110/10] is directly connected, hamburgrouter2, 00:01:09
C>* 10.1.3.0/24 is directly connected, hamburgrouter2
C>* 127.0.0.0/8 is directly connected, lo0
hamburgrouter> exit
Connection to hamburgrouter closed by foreign host.
root@hamburgrouter:/home/radmin#
root@hamburgrouter:/home/radmin# ~.
[Connection to zone 'hamburgrouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/singaporerouter/ > singaporerouter.xml
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z singaporerouter clone -c /opt/cloudsimulation/zones/singaporerouter.xml templaterouter
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T223455Z.singaporerouter.clone
Log saved in non-global zone as /zones/singaporerouter/root/var/log/zones/zoneadm.20111217T223455Z.singaporerouter.clone
root@cloudinabox:/opt/cloudsimulation/zones# cp ospfd.singapore.conf /zones/singaporerouter/root/etc/quagga/ospfd.conf
root@cloudinabox:/opt/cloudsimulation/zones# cp zebra.singapore.conf /zones/singaporerouter/root/etc/quagga/zebra.conf
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z singaporerouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C singaporerouter
[Connected to zone 'singaporerouter' console]
singaporerouter console login: radmin
Password:
Last login: Sat Dec 17 19:35:39 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@singaporerouter:~$ sudo bash
Password:
Dec 17 22:37:58 singaporerouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
ipadm create-ip sinrouter0
ipadm create-ip sinrouter1
ipadm create-ip sinrouter2
ipadm create-addr -T static -a 10.0.12.254/24 sinrouter0/v4
ipadm create-addr -T static -a 10.1.2.254/24 sinrouter1/v4
ipadm create-addr -T static -a 10.1.4.1/24 sinrouter2/v4
root@singaporerouter:/home/radmin# svcadm restart zebra
root@singaporerouter:/home/radmin# svcadm restart ospf
root@singaporerouter:/home/radmin# telnet localhost 2602
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to singaporerouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
User Access Verification
Password:
sinrouter> show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
I - ISIS, B - BGP, > - selected route, * - FIB route
O>* 10.0.10.0/24 [110/20] via 10.1.2.1, sinrouter1, 00:00:34
O>* 10.0.11.0/24 [110/20] via 10.1.2.1, sinrouter1, 00:00:34
C>* 10.0.12.0/24 is directly connected, sinrouter0
O>* 10.1.1.0/24 [110/20] via 10.1.2.1, sinrouter1, 00:00:35
O 10.1.2.0/24 [110/10] is directly connected, sinrouter1, 00:00:44
C>* 10.1.2.0/24 is directly connected, sinrouter1
O>* 10.1.3.0/24 [110/20] via 10.1.2.1, sinrouter1, 00:00:35
O 10.1.4.0/24 [110/10] is directly connected, sinrouter2, 00:00:44
C>* 10.1.4.0/24 is directly connected, sinrouter2
C>* 127.0.0.0/8 is directly connected, lo0
sinrouter> exit
Connection to singaporerouter closed by foreign host.
root@singaporerouter:/home/radmin# ~.
[Connection to zone 'singaporerouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/sanfranciscorouter/ > sanfranciscorouter.xml
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z sanfranciscorouter clone -c /opt/cloudsimulation/zones/sanfranciscorouter.xml templaterouter
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T224355Z.sanfranciscorouter.clone
Log saved in non-global zone as /zones/sanfranciscorouter/root/var/log/zones/zoneadm.20111217T224355Z.sanfranciscorouter.clone
root@cloudinabox:/opt/cloudsimulation/zones# cp ospfd.sanfrancisco.conf /zones/sanfranciscorouter/root/etc/quagga/ospfd.conf
root@cloudinabox:/opt/cloudsimulation/zones# cp zebra.sanfrancisco.conf /zones/sanfranciscorouter/root/etc/quagga/zebra.conf
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z sanfranciscorouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C sanfranciscorouter
[Connected to zone 'sanfranciscorouter' console]
sanfranciscorouter console login: radmin
Password:
Last login: Sat Dec 17 19:35:39 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@sanfranciscorouter:~$ sudo bash
Password:
Dec 17 22:46:49 sanfranciscorouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
ipadm create-ip sforouter0
ipadm create-ip sforouter1
ipadm create-ip sforouter2
ipadm create-ip sforouter3
ipadm create-addr -T static -a 10.0.13.254/24 sforouter0/v4
ipadm create-addr -T static -a 10.1.3.254/24 sforouter1/v4
ipadm create-addr -T static -a 10.1.5.1/24 sforouter2/v4
ipadm create-addr -T static -a 10.1.4.254/24 sforouter3/v4
root@sanfranciscorouter:/home/radmin# telnet localhost 2602
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to sanfranciscorouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
User Access Verification
Password:
sforouter> show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
I - ISIS, B - BGP, > - selected route, * - FIB route
O>* 10.0.10.0/24 [110/20] via 10.1.3.1, sforouter1, 00:00:25
O>* 10.0.11.0/24 [110/20] via 10.1.3.1, sforouter1, 00:00:25
O>* 10.0.12.0/24 [110/20] via 10.1.4.1, sforouter3, 00:00:25
C>* 10.0.13.0/24 is directly connected, sforouter0
O>* 10.1.1.0/24 [110/20] via 10.1.3.1, sforouter1, 00:00:26
O>* 10.1.2.0/24 [110/20] via 10.1.3.1, sforouter1, 00:00:26
* via 10.1.4.1, sforouter3, 00:00:26
O 10.1.3.0/24 [110/10] is directly connected, sforouter1, 00:00:26
C>* 10.1.3.0/24 is directly connected, sforouter1
O 10.1.4.0/24 [110/10] is directly connected, sforouter3, 00:00:35
C>* 10.1.4.0/24 is directly connected, sforouter3
O 10.1.5.0/24 [110/10] is directly connected, sforouter2, 00:00:35
C>* 10.1.5.0/24 is directly connected, sforouter2
C>* 127.0.0.0/8 is directly connected, lo0
sforouter> exit
Connection to sanfranciscorouter closed by foreign host.
root@sanfranciscorouter:/home/radmin# ~.
[Connection to zone 'sanfranciscorouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/newyorkrouter/ > newyorkrouter.xml
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z newyorkrouter clone -c /opt/cloudsimulation/zones/newyorkrouter.xml templaterouter
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111217T225139Z.newyorkrouter.clone
Log saved in non-global zone as /zones/newyorkrouter/root/var/log/zones/zoneadm.20111217T225139Z.newyorkrouter.clone
root@cloudinabox:/opt/cloudsimulation/zones# cp ospfd.newyork.conf /zones/newyorkrouter/root/etc/quagga/ospfd.conf
root@cloudinabox:/opt/cloudsimulation/zones# cp zebra.newyork.conf /zones/newyorkrouter/root/etc/quagga/zebra.conf
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z newyorkrouter boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C newyorkrouter
[Connected to zone 'newyorkrouter' console]
newyorkrouter console login: radmin
Password:
Last login: Sat Dec 17 19:35:39 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@newyorkrouter:~$ sudo bash
Password:
Dec 17 22:54:33 newyorkrouter sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
ipadm create-ip newyorkrouter0
ipadm create-ip newyorkrouter1
ipadm create-addr -T static -a 10.0.14.254/24 newyorkrouter0/v4
ipadm create-addr -T static -a 10.1.5.254/24 newyorkrouter1/v4
root@newyorkrouter:/home/radmin# telnet localhost 2602
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to newyorkrouter.
Escape character is '^]'.
Hello, this is Quagga (version 0.99.8).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
User Access Verification
Password:
newyorkrouter> show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
I - ISIS, B - BGP, > - selected route, * - FIB route
O>* 10.0.10.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:22
O>* 10.0.11.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:22
O>* 10.0.12.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:22
O>* 10.0.13.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:22
C>* 10.0.14.0/24 is directly connected, newyorkrouter0
O>* 10.1.1.0/24 [110/30] via 10.1.5.1, newyorkrouter1, 00:00:23
O>* 10.1.2.0/24 [110/30] via 10.1.5.1, newyorkrouter1, 00:00:23
O>* 10.1.3.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:23
O>* 10.1.4.0/24 [110/20] via 10.1.5.1, newyorkrouter1, 00:00:23
O 10.1.5.0/24 [110/10] is directly connected, newyorkrouter1, 00:00:25
C>* 10.1.5.0/24 is directly connected, newyorkrouter1
C>* 127.0.0.0/8 is directly connected, lo0
newyorkrouter> exit
Connection to newyorkrouter closed by foreign host.
root@newyorkrouter:/home/radmin# ~.
[Connection to zone 'newyorkrouter' console closed]
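Apart from the names, the global-zone part of the clone sequence above is the same for every router. A small shell helper can print it, which makes the steps less error-prone when more sites are added. This is only a sketch. clone_router is a hypothetical name. It assumes the router zone has already been configured with zonecfg, and it assumes the directory layout and file names used in this article (template.xml with the placeholder jamphfhn, ospfd.SITE.conf and zebra.SITE.conf). It only prints the commands. The ipadm configuration inside each zone is still done by hand.

```shell
#!/bin/sh
# Hypothetical helper: print the global-zone commands used in this article to
# create one router zone as a clone of the "templaterouter" zone.
# ZONE is the zone name (e.g. hamburgrouter), SITE is the suffix of the
# prepared Quagga config files (e.g. hamburg for ospfd.hamburg.conf).
clone_router() {
  zone=$1
  site=$2
  dir=/opt/cloudsimulation/zones
  # Replace the placeholder name in the system configuration profile template.
  echo "sed s/jamphfhn/$zone/ $dir/template.xml > $dir/$zone.xml"
  echo "zoneadm -z $zone clone -c $dir/$zone.xml templaterouter"
  # Install the per-site OSPF and zebra configuration before the first boot.
  echo "cp $dir/ospfd.$site.conf /zones/$zone/root/etc/quagga/ospfd.conf"
  echo "cp $dir/zebra.$site.conf /zones/$zone/root/etc/quagga/zebra.conf"
  echo "zoneadm -z $zone boot"
}
```

For example, clone_router hamburgrouter hamburg prints the five commands used for Hamburg. After reviewing them, you could pipe them into sh in the global zone.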
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C newyorkrouter
[Connected to zone 'newyorkrouter' console]
root@newyorkrouter:/home/radmin#
root@newyorkrouter:/home/radmin#
root@newyorkrouter:/home/radmin# ping 10.0.10.254
10.0.10.254 is alive
root@newyorkrouter:/home/radmin# traceroute 10.0.10.254
traceroute: Warning: Multiple interfaces found; using 10.1.5.254 @ newyorkrouter1
traceroute to 10.0.10.254 (10.0.10.254), 30 hops max, 40 byte packets
1 10.1.5.1 (10.1.5.1) 0.116 ms 0.083 ms 0.038 ms
2 10.1.3.1 (10.1.3.1) 0.072 ms 0.048 ms 0.041 ms
3 10.0.10.254 (10.0.10.254) 0.065 ms 0.077 ms 0.047 ms
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C sanfranciscorouter
[Connected to zone 'sanfranciscorouter' console]
root@sanfranciscorouter:/home/radmin#
root@sanfranciscorouter:/home/radmin#
root@sanfranciscorouter:/home/radmin# ipadm disable-if sforouter1
ipadm: persistent operation not supported for disable-if
root@sanfranciscorouter:/home/radmin# ipadm disable-if -t sforouter1
root@sanfranciscorouter:/home/radmin# ~.
bash: ~.: command not found
root@sanfranciscorouter:/home/radmin# ~.
[Connection to zone 'sanfranciscorouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C newyorkrouter
[Connected to zone 'newyorkrouter' console]
root@newyorkrouter:/home/radmin# traceroute 10.0.10.254
traceroute: Warning: Multiple interfaces found; using 10.1.5.254 @ newyorkrouter1
traceroute to 10.0.10.254 (10.0.10.254), 30 hops max, 40 byte packets
1 10.1.5.1 (10.1.5.1) 0.114 ms 0.141 ms 0.132 ms
2 10.1.4.1 (10.1.4.1) 0.072 ms 0.046 ms 0.041 ms
3 10.1.2.1 (10.1.2.1) 0.065 ms 0.066 ms 0.048 ms
4 10.0.10.254 (10.0.10.254) 0.073 ms 0.068 ms 0.052 ms
root@newyorkrouter:/home/radmin# ~.
[Connection to zone 'newyorkrouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones#
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C sanfranciscorouter
[Connected to zone 'sanfranciscorouter' console]
root@sanfranciscorouter:/home/radmin#
root@sanfranciscorouter:/home/radmin# ipadm enable-if -t sforouter1
root@sanfranciscorouter:/home/radmin#
root@sanfranciscorouter:/home/radmin# ~.
[Connection to zone 'sanfranciscorouter' console closed]
root@cloudinabox:/opt/cloudsimulation/zones#
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C newyorkrouter
[Connected to zone 'newyorkrouter' console]
root@newyorkrouter:/home/radmin# traceroute 10.0.10.254
traceroute: Warning: Multiple interfaces found; using 10.1.5.254 @ newyorkrouter1
traceroute to 10.0.10.254 (10.0.10.254), 30 hops max, 40 byte packets
1 10.1.5.1 (10.1.5.1) 0.361 ms 0.044 ms 0.037 ms
2 10.1.3.1 (10.1.3.1) 0.061 ms 0.046 ms 0.042 ms
3 10.0.10.254 (10.0.10.254) 0.070 ms 0.052 ms 0.048 ms
root@newyorkrouter:/home/radmin# ~.
[Connection to zone 'newyorkrouter' console closed]
Put something like this into the file /opt/cloudsimulation/zones/londonsrv1
zonecfg -z londonsrv1 export
create -b
set zonepath=/zones/londonsrv1
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=londonsrv1
end
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/londonsrv1/ > londonsrv1.xml
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z londonsrv1 -f londonsrv1
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z londonsrv1 clone -c /opt/cloudsimulation/zones/londonsrv1.xml templateserver
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111218T043435Z.londonsrv1.clone
Log saved in non-global zone as /zones/londonsrv1/root/var/log/zones/zoneadm.20111218T043435Z.londonsrv1.clone
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z londonsrv1 boot
root@cloudinabox:/opt/cloudsimulation/zones# zlogin -C londonsrv1
[Connected to zone 'londonsrv1' console]
londonsrv1 console login: radmin
Password:
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@londonsrv1:~$ sudo bash
Password:
Dec 18 04:53:49 londonsrv1 sudo: radmin : TTY=console ; PWD=/home/radmin ; USER=root ; COMMAND=/usr/bin/bash
root@londonsrv1:/home/radmin# ipadm create-ip londonsrv1
root@londonsrv1:/home/radmin# ipadm create-addr -T static -a 10.0.10.10/24 londonsrv1/v4
root@londonsrv1:/home/radmin# route -p add default 10.0.10.254
add net default: gateway 10.0.10.254
add persistent net default: gateway 10.0.10.254
root@londonsrv1:/home/radmin# ping 10.0.10.254
10.0.10.254 is alive
root@londonsrv1:/home/radmin# traceroute 10.0.13.254
traceroute to 10.0.13.254 (10.0.13.254), 30 hops max, 40 byte packets
1 10.0.10.254 (10.0.10.254) 0.238 ms 0.051 ms 0.044 ms
2 10.1.1.1 (10.1.1.1) 0.098 ms 0.057 ms 0.053 ms
3 10.0.13.254 (10.0.13.254) 0.072 ms 0.059 ms 0.061 ms
root@londonsrv1:/home/radmin# ~.
[Connection to zone 'londonsrv1' console closed]
root@cloudinabox:/opt/cloudsimulation/zones# cat template.xml | sed s/jamphfhn/newyorksrv1/ > newyorksrv1.xml
root@cloudinabox:/opt/cloudsimulation/zones# cat londonsrv1 | sed s/londonsrv1/newyorksrv1/ > newyorksrv1
root@cloudinabox:/opt/cloudsimulation/zones# vi newyorksrv1
root@cloudinabox:/opt/cloudsimulation/zones# zonecfg -z newyorksrv1 -f newyorksrv1
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z newyorksrv1 clone -c /opt/cloudsimulation/zones/newyorksrv1.xml templateserver
A ZFS file system has been created for this zone.
Progress being logged to /var/log/zones/zoneadm.20111218T050558Z.newyorksrv1.clone
Log saved in non-global zone as /zones/newyorksrv1/root/var/log/zones/zoneadm.20111218T050558Z.newyorksrv1.clone
root@cloudinabox:/opt/cloudsimulation/zones# zoneadm -z newyorksrv1 boot
root@cloudinabox:/home/jmoekamp# zlogin -C newyorksrv1
[Connected to zone 'newyorksrv1' console]
newyorksrv1 console login: radmin
Password:
Last login: Sun Dec 18 05:38:10 on console
Oracle Corporation SunOS 5.11 11.0 November 2011
radmin@newyorksrv1:~$ sudo bash
Password:
root@newyorksrv1:/home/radmin# ipadm create-ip newyorksrv1
root@newyorksrv1:/home/radmin# ipadm create-addr -T static -a 10.0.14.10/24 newyorksrv1/v4
root@newyorksrv1:/home/radmin# route -p add default 10.0.14.254
add net default: gateway 10.0.14.254
add persistent net default: gateway 10.0.14.254
root@newyorksrv1:/home/radmin# ping 10.0.14.254
10.0.14.254 is alive
root@newyorksrv1:/home/radmin# traceroute 10.0.10.10
traceroute to 10.0.10.10 (10.0.10.10), 30 hops max, 40 byte packets
1 10.0.14.254 (10.0.14.254) 0.132 ms 0.051 ms 0.044 ms
2 10.1.5.1 (10.1.5.1) 0.068 ms 0.085 ms 0.058 ms
3 10.1.3.1 (10.1.3.1) 0.070 ms 0.057 ms 0.054 ms
4 10.1.1.254 (10.1.1.254) 0.110 ms 0.063 ms 0.058 ms
5 10.0.10.10 (10.0.10.10) 0.085 ms 0.069 ms 0.065 ms
root@newyorksrv1:/home/radmin#
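Every server zone uses the same zonecfg command file with only the name changed. Above, the file was copied with sed and then edited with vi for newyorksrv1. A hypothetical helper could generate the file directly instead. This is a sketch. Like the londonsrv1 example, it assumes the zone path lives under /zones and that a datalink with the same name as the zone already exists in the global zone.

```shell
#!/bin/sh
# Hypothetical helper: print a zonecfg command file for a server zone,
# mirroring the londonsrv1 example. The exclusive-IP zone gets one net
# resource whose physical link has the same name as the zone.
server_zonecfg() {
  cat <<EOF
create -b
set zonepath=/zones/$1
set brand=solaris
set autoboot=false
set ip-type=exclusive
add net
set configure-allowed-address=true
set physical=$1
end
EOF
}
```

Running server_zonecfg newyorksrv1 > newyorksrv1 and then zonecfg -z newyorksrv1 -f newyorksrv1 would replace the sed and vi steps.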
Switches for Hamburg-MAN
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw1_250
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw1_251
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw2_250
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw2_251
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw3_250
root@cloudinabox:/home/jmoekamp# dladm create-simnet hamburgsw3_251
root@cloudinabox:/home/jmoekamp# dladm modify-simnet -p hamburgsw2_250 hamburgsw1_251
root@cloudinabox:/home/jmoekamp# dladm modify-simnet -p hamburgsw3_250 hamburgsw2_251
root@cloudinabox:/home/jmoekamp# dladm modify-simnet -p hamburgsw1_250 hamburgsw3_251
root@cloudinabox:/home/jmoekamp# dladm create-bridge hamburgharbour
root@cloudinabox:/home/jmoekamp# dladm create-bridge hamburgairport
root@cloudinabox:/home/jmoekamp# dladm add-bridge -l hamburgsw1_250 -l hamburgsw1_251 hamburg
root@cloudinabox:/home/jmoekamp# dladm add-bridge -l hamburgsw2_250 -l hamburgsw2_251 hamburgairport
root@cloudinabox:/home/jmoekamp# dladm add-bridge -l hamburgsw3_250 -l hamburgsw3_251 hamburgharbour
root@cloudinabox:/home/jmoekamp# dladm show-bridge -l hamburgharbour
LINK STATE UPTIME DESROOT
hamburgsw3_250 fo
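The Hamburg switches follow a regular pattern. Each switch has two simnet ports. Port _251 of switch I is wired to port _250 of switch I+1, and the last switch wraps around to the first. The two ports of each switch are then joined by a bridge. The sketch below prints the corresponding commands for a ring of N switches. The helper name simnet_ring and the numbered bridge names PREFIXbrI are hypothetical; the article itself uses site names such as hamburgairport. The same caveat applies as for the article: create-simnet and modify-simnet are undocumented. Because the ring forms a loop, it relies on spanning tree in the bridges. As far as I know, STP is the default protection mode of dladm create-bridge.

```shell
#!/bin/sh
# Hypothetical helper: print dladm commands for a ring of N (>= 2) emulated
# switches, following the Hamburg example. Each link name ends in a digit,
# as Solaris datalink names must.
simnet_ring() {
  prefix=$1
  n=$2
  # Two simnet ports per switch.
  i=1
  while [ "$i" -le "$n" ]; do
    echo "dladm create-simnet ${prefix}sw${i}_250"
    echo "dladm create-simnet ${prefix}sw${i}_251"
    i=$((i + 1))
  done
  # Wire port _251 of switch i to port _250 of the next switch (wrapping).
  i=1
  while [ "$i" -le "$n" ]; do
    next=$((i % n + 1))
    echo "dladm modify-simnet -p ${prefix}sw${next}_250 ${prefix}sw${i}_251"
    i=$((i + 1))
  done
  # One bridge per switch joins its two ports.
  i=1
  while [ "$i" -le "$n" ]; do
    echo "dladm create-bridge ${prefix}br${i}"
    echo "dladm add-bridge -l ${prefix}sw${i}_250 -l ${prefix}sw${i}_251 ${prefix}br${i}"
    i=$((i + 1))
  done
}
```

With simnet_ring hamburg 3, the simnet and modify-simnet lines match the Hamburg transcript above, including the wrap-around pairing of hamburgsw3_251 with hamburgsw1_250.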
Saturday, December 10. 2011
In the past I wrote quite often about something I call systemic features: features that fit together so seamlessly that they create possibilities beyond the sum of the individual features. One of these systemic features is the simulation of the cloud. I don't mean the thing most people associate with the word cloud (the grid with a credit card checkout). I mean the cloud-shaped icon labeled "Network" or "Internet" in many architectural diagrams. It sits between the client and the application and often resembles the "a miracle happens here" box in many architectures.
It's not new: I talked about this in mid-November at the DOAG conference in Nuremberg, and I've been playing around with it at customers and privately for a while now.
Many customers have networks as large and as complex as the internet of a smaller country perhaps 15 years ago. The interesting question is: How can you test your application's resiliency against failures in this cloud-shaped icon? How does your application react when your network is doing its high-availability magic?
And interestingly Solaris 11 can help you here. The thoughts behind this are pretty simple.
- A router is a computer that runs an operating environment tailor-made for networking, but in the end it's a computer with an OS (yes, I know, hardware offloading makes this a little more complex, but in the end that's how it is).
- A zone is a virtual operating environment.
- Each zone can have its own set of routes.
- Each zone can have its own set of firewall rules.
- Each zone can have its own set of processes.
- Routing protocols are nothing more than processes collecting information from the network and configuring the routing table.
- You can install a vast array of dynamic routing protocols in a zone.
- I can have up to 8192 zones (given enough memory).
- In Solaris 11 I can emulate switches (etherstubs).
- I can limit bandwidth in Solaris 11 out of the box with Crossbow.
When I combine all these features, I can set up a vast array of zones that do nothing but take each incoming packet on an interface, route it along a multitude of paths between each other, and send it out on an outgoing interface. Even when the systems in your environment are placed in many separate networks, you can still use a single system with many network cards. Alternatively, you can use something called server-on-a-stick (a single high-bandwidth connection to a VLAN-trunking-capable switch, using the switch ports as a fan-out).
So to emulate a complex corporate network, all I have to do is configure a lot of etherstubs and many VNICs, replicate the physical bandwidths with the maxbw property on the VNICs, and set up a lot of zones. I might also translate the routers' ACLs into rules for the Solaris firewall. Finally, I install the routing daemons and configure them like the real routers (timeouts and so on).
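The steps in the previous paragraph can be sketched for one emulated line. An etherstub acts as the wire. Two VNICs on it, capped with the maxbw link property, become the interfaces of the router zones at either end. The helper name wan_link and all link names are hypothetical. The sketch only prints the dladm commands so they can be reviewed before they are run in the global zone. It assumes that maxbw accepts a value with a unit suffix such as 2M.

```shell
#!/bin/sh
# Hypothetical helper: print the dladm commands for one emulated line.
# STUB is the etherstub acting as the wire. VNIC_A and VNIC_B are the
# router-side interfaces. BW is the line speed (e.g. 2M or 155M).
wan_link() {
  stub=$1
  a=$2
  b=$3
  bw=$4
  echo "dladm create-etherstub $stub"
  # Cap both ends so the emulated line cannot exceed the real bandwidth.
  echo "dladm create-vnic -l $stub -p maxbw=$bw $a"
  echo "dladm create-vnic -l $stub -p maxbw=$bw $b"
}
```

For a 2 MBit/s line between London and Hamburg, wan_link lonham_stub0 lonham_vnic0 hamlon_vnic0 2M | sh would create the stub and both VNICs. Each VNIC is then handed to its router zone with an add net resource (set physical=...), just like londonsrv1 above.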
Now I can test how my applications react when the network starts to converge to a new topology because some lines have failed. I can test which topology my network will converge to after a line outage (which is nothing more than a deny-all firewall rule). I can test the impact when the network converges in such a way that my traffic flows over a 2 MBit/s line instead of a 155 MBit/s line. For even more complex failure modes, I can use the htbx driver to introduce additional latency, packet drops or packet reordering as shown in this article. In essence, you can emulate your complete internal network in a single box. With Zones and Crossbow in Solaris 11, the overhead is so low (in the end it is still just one kernel) that you can emulate reality rather than a simplified view. You don't have to emulate with separate hardware or with many independent operating system instances in virtual machines.
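A line outage expressed as a deny-all firewall rule could look like the following sketch for IP Filter inside a router zone. The helper name outage_rules is hypothetical. The sketch assumes that IP Filter is enabled in that zone (svcadm enable network/ipfilter). The rules use the usual ipf.conf syntax to drop everything in both directions on the given interfaces.

```shell
#!/bin/sh
# Hypothetical helper: print IP Filter rules that drop all traffic, inbound
# and outbound, on each interface given as an argument. This emulates a cut
# line while the interface itself stays up.
outage_rules() {
  for ifname in "$@"; do
    echo "block in quick on $ifname all"
    echo "block out quick on $ifname all"
  done
}
```

For example, outage_rules sforouter1 > /tmp/outage.conf followed by ipf -f /tmp/outage.conf would cut the San Francisco link to Hamburg. ipf -r -f /tmp/outage.conf should remove the rules again. Unlike ipadm disable-if, this leaves the interface up, so OSPF only notices the outage when hellos stop arriving. With the Dead interval of 40s shown in the show ip ospf interface output, the neighbours should drop the adjacency after about 40 seconds.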
In the end, you could simply use a single Solaris system, put it between all your test systems and use it as an emulation device for your corporate network. It simulates the cloud-shaped icon in your architectural diagrams.
Wednesday, October 5. 2011
I think with a few words as an introduction and this video everybody should understand Quicksort. They just dance the algorithm:
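For those who prefer code to dance, here is a minimal sketch of the same idea in POSIX shell (the function name quicksort is just illustrative). It picks the first element as the pivot and partitions the remaining elements into smaller and not-smaller ones. It then sorts both parts recursively and prints them around the pivot. It handles integers only, and it builds new lists instead of partitioning in place like the classic version. It is meant to show the algorithm, not to be fast.

```shell
#!/bin/sh
# Print the integer arguments in ascending order, one per line.
# The body runs in a subshell ( ... ), so each recursion level has its own
# pivot/smaller/larger variables.
quicksort() (
  if [ "$#" -eq 0 ]; then
    exit 0
  fi
  pivot=$1
  shift
  smaller=
  larger=
  # Partition the remaining elements around the pivot.
  for x in "$@"; do
    if [ "$x" -lt "$pivot" ]; then
      smaller="$smaller $x"
    else
      larger="$larger $x"
    fi
  done
  # Unquoted on purpose: word splitting turns the lists back into arguments.
  quicksort $smaller
  echo "$pivot"
  quicksort $larger
)
```

quicksort 3 1 2 5 4 prints the numbers 1 to 5, one per line.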
Tuesday, October 4. 2011
Here is a nice example of the power of boot environments. Boot environments are something like writable snapshots of your operating system installation. As you may already assume, they are based on ZFS snapshots and the clone functionality, which is possible because ZFS is used as the root filesystem.
So: please don't try this at home. If you do try it, don't try it on any Solaris 11 Express installation of any value. Better yet, don't try it at all. I don't want to hear any stories about deleting your ERP system by accident because you used the wrong terminal window. Leave that to trained professional stunt admins with the right equipment (Solaris 11 Express).
Assume you have a system configured with all your applications, and everything is running fine. You think it would be nice to have something like a frozen state of this situation. No problem. The following will do the trick:
# beadm create rescuenet
# init 6
When you reboot your system you will see it as a new entry in the grub menu.
Okay, but first boot into the old environment starting with "Oracle Solaris ..." by selecting it in the grub menu (it should already be selected, unless you have used beadm activate). Now I will drop the atomic bomb on your installation:
# rm --no-preserve-root -rf /
Essentially, we've just nuked the installation. After a moment the system should freeze. Reset the system and boot again via grub into the boot environment starting with "Oracle Solaris ...".
Okay ... on a normal system this would send you to the tapes. With Solaris 11: reset the system and boot into the boot environment "rescuenet" by selecting it in grub.
Tada! Just creating a boot environment with a single command after a config change may save your butt later ... and by the way ... this even works in zones. They know the concept of boot environments, too.
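Following that advice, here is a minimal sketch of a helper that creates a boot environment with a timestamp in its name after each config change, so the names sort chronologically. The function name snapshot_be, the default prefix config and the DRYRUN switch are hypothetical conventions of this sketch. Only beadm create itself comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper: create a boot environment named PREFIX-YYYYMMDD-HHMMSS.
# With DRYRUN=1, print the beadm command instead of running it.
snapshot_be() {
  name="${1:-config}-$(date +%Y%m%d-%H%M%S)"
  if [ "${DRYRUN:-0}" = 1 ]; then
    echo "beadm create $name"
  else
    beadm create "$name"
  fi
}
```

For example, snapshot_be after-quagga-setup would create something like after-quagga-setup-20111004-153000.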
Monday, October 3. 2011
Just a short hint: the What's New document of Solaris 10 Update 9 states that support for IPoIB Connected Mode has been added in this release. However, you have to search a bit to find out how to activate it. The necessary step is documented in the man page of the ibd driver. Let's assume you have two instances of the ibd driver running (ibd0 and ibd1). In this case you have to change one line at the end of the /kernel/drv/ibd.conf file to enable_rc=1,1; and then reload the ibd driver or reboot the system. After that, your ibd devices should show an MTU of 65520 bytes instead of 2044.
PS: The process for Solaris 11 is better, as you just use dladm for it. Besides, connected mode is the default there anyway. In Solaris 10, unreliable datagram mode was kept as the default, because one of the rules in Solaris is that such changes between updates have to be opt-in.
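The edit can be scripted. This is a sketch under assumptions. It assumes that the existing line in ibd.conf has the form enable_rc=...; on a line of its own (for example enable_rc=0,0; for two instances). The helper name enable_ibd_cm is hypothetical, and the helper keeps a copy of the original file as ibd.conf.orig. It needs ksh or bash, because Solaris 10's /bin/sh does not understand $((...)).

```shell
#!/bin/sh
# Hypothetical helper: rewrite the enable_rc line of an ibd.conf-style file
# so that connected mode is enabled for N ibd instances (enable_rc=1,...,1;).
enable_ibd_cm() {
  conf=$1
  n=$2
  vals=1
  i=1
  # Build "1,1,...,1" with one entry per ibd instance.
  while [ "$i" -lt "$n" ]; do
    vals="$vals,1"
    i=$((i + 1))
  done
  # Keep a backup, then edit via a copy (Solaris 10's sed has no -i).
  cp -p "$conf" "$conf.orig" &&
    sed "s/^enable_rc=.*;\$/enable_rc=$vals;/" "$conf" > "$conf.new" &&
    mv "$conf.new" "$conf"
}
```

enable_ibd_cm /kernel/drv/ibd.conf 2 produces the enable_rc=1,1; line from the text. Afterwards, the driver still has to be reloaded or the system rebooted.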
Tuesday, September 20. 2011
I'm following a discussion at the moment in which someone has wreaked some havoc on his data. This discussion inspired me to write this: -f. The force switch. Personally, I believe -f should be protected by a key that you only get when you can explain the whole subsystem that has such a switch and the reason why you need -f.
- -f is not about forcing round pegs into square holes.
- -f is about forcing pegs known to be square into holes known to be square that don't fit, because some idiot dented the edges of the hole.
- -f is not about "I know better than the machine. This command is correct". Believe me ... in almost all cases the system has a point in preventing you from doing something.
- -f is about telling the system "I'm fully aware of what I'm doing at the moment"
- -f is about the system telling you "Everything from here on is even more your fault than usual"
There are situations where -f is justified. However, only use it when you know the following seven things:
- You know that the command should work.
- You know what the command normally does.
- You know why a command that should work doesn't work.
- You know that you can't repair the issue behind the "command doesn't work" by any means other than -f.
- You know that the chance of doing greater harm to the data is low enough to risk it.
- You know that your backup is working, if -f can harm persistent data.
- You know that your restore is working, if -f can harm persistent data.
Unsure about just a single point? Then don't use -f until you are sure.
Monday, August 15. 2011
Sometimes you “know” the problem from the first moment. But sometimes the fix suggested by your gut feeling is perceived as a large change, so you have to find the smoking gun, the undeniable proof for your hypothesis.
This is the story of such a search. It started with a telephone call from a colleague who had got my name from another colleague. The setup: an Oracle database running on a Solaris system, with the datafiles and logs located on a Veritas File System. The customer saw massive delays (in the range of hundreds of seconds) when executing certain commands. One of these commands was “truncate table”.
A hypothesis - but the proof?
And in the beginning there was a red herring.
semtimedop(28, 0xFFFFFFFF7FFFD644, 1, 0xFFFFFFFF7FFFD630) Err#11 EAGAIN
In this case the thread tries to perform an operation on a semaphore, but it isn't able to. However, semtimedop has a timeout. When the timeout is reached before the operation could be performed, the call returns with error 11 (EAGAIN). All the timeouts were consistent with the waiting times seen from the perspective of the SQL commands.
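To see how often this happened and on which semaphore, the truss output can be summarized. Here is a minimal sketch with a hypothetical helper named sem_timeouts. It counts semtimedop calls that failed with Err#11 EAGAIN per semaphore ID, which is the first argument of semtimedop. It relies on a POSIX awk, so on Solaris 10 you may want to substitute nawk for the old /usr/bin/awk.

```shell
#!/bin/sh
# Hypothetical helper: count semtimedop() calls that timed out (Err#11 EAGAIN)
# in truss output, grouped by semaphore ID. Reads the files given as
# arguments, or stdin if there are none.
sem_timeouts() {
  awk 'index($0, "semtimedop(") && index($0, "Err#11 EAGAIN") {
    # Everything after "semtimedop(" starts with the semaphore ID.
    args = substr($0, index($0, "semtimedop(") + length("semtimedop("))
    split(args, a, ",")
    count[a[1]]++
  }
  END { for (id in count) print id, count[id] }' "$@"
}
```

Given only the truss line quoted above, it prints 28 1.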
Obviously the customer and the other parties involved were tempted to see this as the problem, but they already suspected that it might just be the harbinger of bad news. After a short look into the truss files, I was pretty sure that they were right with their doubts in regard to passing the . It was just the harbinger of bad news.
After a short amount of research I suspected that we were dealing with a locking problem. There was just one catch: vxfs. First, I seldom work with it, so it's not really my area of expertise.
One point that diverted the customer's attention from the locking issue was a small but important difference. The customer knew that Oracle likes Direct I/O. With UFS, "Direct I/O" does a little more than just making the I/O direct by disabling buffering: it also removes the inode r/w lock mandated by POSIX rules. The customer knew that about UFS Direct I/O and thus activated Direct I/O on vxfs. Accordingly, I found lines like this one in the mount output:
/oracle/importantdatabase/oradata1 on /dev/vx/dsk/importantdatabase/oradata1 read/write/setuid/devices/mincache=direct/convosync=direct/delaylog/largefiles/ioerror=mwdisable/mntlock=VCS/dev=51836b0 on Thu Mar 17 20:14:11 2011
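Whether direct I/O is active can be read from the mount options. Here is a small sketch with a hypothetical helper named has_mount_opt. It takes one mount line like the one above, splits the option field (the fourth whitespace-separated field) at the slashes, and succeeds only if the requested option is present. The same POSIX-awk caveat as for nawk on Solaris 10 applies.

```shell
#!/bin/sh
# Hypothetical helper: has_mount_opt LINE OPTION succeeds (exit 0) if OPTION
# is one of the slash-separated options in the fourth field of a mount line.
has_mount_opt() {
  printf '%s\n' "$1" | awk -v want="$2" '{
    n = split($4, opts, "/")
    for (i = 1; i <= n; i++)
      if (opts[i] == want) found = 1
  }
  END { exit !found }'
}
```

Applied to the line above, has_mount_opt succeeds for both mincache=direct and convosync=direct.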
However, I still suspected a lock contention problem, and I had a reason for it: Direct I/O on vxfs isn't the same as on UFS. In vxfs, Direct I/O is really just the direct part. It doesn't enable concurrent I/O (I'll explain that moniker later) to a file, because the removal of the inode r/w lock isn't part of the feature. For that, you have to use either Quick I/O (QIO) or the ODM module for vxfs. Neither feature was activated. That was the moment I told the customer "Hey, choose the ODM module for vxfs or QIO, activate it and the problem should go away". Both remove that lock contention and are thus a big help in getting better Oracle performance on vxfs. Just to clear up a misunderstanding: ODM (Oracle Disk Manager) is an API of Oracle, not of Veritas. Oracle's Direct NFS (dNFS) is implemented as an ODM module as well.
The problem: you used to pay extra for both features, and neither of them is really cheap. So before making the change, the customer wanted to know that I was right with my diagnosis. (According to the release notes, ODM and QIO are now part of Storage Foundation in all editions except Basic.)
I wrote of two problems, but have described only one so far. Normally, inode rwlock contention problems like this are quite easy to find. But not in this case. vxfs differs from UFS in a multitude of ways. It doesn't use the locking primitives of Solaris but has its own instead. Thus, all values reported by my preferred diagnosis tools were pretty useless. Damn … how are you supposed to find problems when your instruments can't show them? Without instrumentation, troubleshooting is just guesswork and experience.
At this point, I asked a question on a mail alias (it's great to have people on internal aliases who have forgotten more about Solaris than I know). That and some research via Google yielded the same result within a few minutes: vxfs`vx_rwsleep_rec_lock is the function implementing (and waiting on) the POSIX inode rw lock. Now I was back in the game and could use all the nice tools of the operating system I prefer.
Digging in the dirtI asked the customer to put a dtrace script into a script that is executed in the moment of the wait:
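bash-3.00# cat lock.d
/* end the trace after roughly 60 seconds */
tick-1s
/ i++ >= 60 /
{
exit(0);
}
/* remember when a thread enters the vxfs inode rwlock wait */
fbt:vxfs:vx_rwsleep_rec_lock:entry
{
self->in = timestamp;
}
/* on return, aggregate the time spent (ns) per kernel stack */
fbt:vxfs:vx_rwsleep_rec_lock:return
/self->in/
{
@locked[stack()] = quantize(timestamp - self->in);
self->in = 0;
}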
The result was interesting, as it clearly showed a peak of 307 events in the range 34359738368 nanoseconds (34.36 seconds) to 68719476735 nanoseconds (68.72 seconds).
vxfs`vx_write_common+0x2a8
vxfs`vx_write+0x28
genunix`fop_write+0x20
genunix`pwrite+0x22c
unix`syscall_trap+0xac
value ------------- Distribution ------------- count
512 | 0
1024 | 1
2048 |@@@@@ 160
4096 |@@@@@@@@@@@@@@@@ 528
8192 |@@@@ 144
16384 |@@ 57
32768 | 0
65536 | 8
131072 | 0
262144 | 2
524288 | 1
1048576 | 4
2097152 | 7
4194304 | 8
8388608 | 13
16777216 |@ 23
33554432 |@ 25
67108864 |@ 23
134217728 | 1
268435456 | 0
536870912 | 0
1073741824 | 0
2147483648 | 0
4294967296 | 0
8589934592 | 0
17179869184 | 0
34359738368 |@@@@@@@@@ 307
68719476736 | 0
This was especially interesting, as the same DTrace script didn't show such a peak at times when the system ran flawlessly:
vxfs`vx_write_common+0x2a8
vxfs`vx_write+0x28
genunix`fop_write+0x20
genunix`pwrite+0x22c
unix`syscall_trap+0xac
value ------------- Distribution ------------- count
512 | 0
1024 | 3
2048 |@@@@@@@@ 77
4096 |@@@@@@@@@@@@@ 122
8192 |@@@@@@@@@ 89
16384 |@@@@@@@@ 77
32768 | 0
65536 | 0
131072 | 0
262144 | 0
524288 | 0
1048576 |@ 9
2097152 | 1
4194304 | 1
8388608 | 0
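DTrace's quantize() sorts values into power-of-two buckets: a row labelled v counts values from v up to 2v-1. A small Python sketch (hypothetical function names) reproduces the conversion used above for the 34359738368 row. It only handles positive values, whereas quantize() also has buckets for zero and negative values.

```python
def quantize_bucket(value_ns: int) -> int:
    """Lower bound of the power-of-two bucket a positive value falls into."""
    if value_ns < 1:
        raise ValueError("this sketch only handles positive values")
    return 1 << (value_ns.bit_length() - 1)


def bucket_range_seconds(lower_ns: int) -> tuple:
    """(lowest, highest) wait time in seconds covered by a bucket row."""
    return lower_ns / 1e9, (2 * lower_ns - 1) / 1e9


lo, hi = bucket_range_seconds(quantize_bucket(40_000_000_000))  # a 40 s wait
print(f"{lo:.2f} s to {hi:.2f} s")  # 34.36 s to 68.72 s
```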
Okay ... well ... next step: which parts of the system were executing this vxfs`vx_rwsleep_rec_lock function? I could have used DTrace for this task as well, but I wanted some additional insight in one step. Thus I used a nice little command of the Solaris modular debugger:
# echo "::threadlist -v" | mdb -k
The output is quite long on a loaded Solaris system. It prints something like this for each thread:
ffffff02d153ce40 ffffff02cefdb540 ffffff02cefff1a0 1 59 ffffff02d153d01e
PC: _resume_from_idle+0xf1 CMD: /lib/svc/bin/svc.startd
stack pointer for thread ffffff02d153ce40: ffffff000f91ccc0
[ ffffff000f91ccc0 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait_sig_swap_core+0x174()
cv_wait_sig_swap+0x18()
cv_waituntil_sig+0x13c()
lwp_park+0x157()
syslwp_park+0x31()
sys_syscall32+0xff()
I hate multi-line output when searching for patterns. There is nothing better than two monitors, a terminal stretched across both, and the two gibberish-grepping implementations on the front of your skull. But this works best if each event is on a single line.
So I did some grepsed-fu on it:
echo "::threadlist -v" | mdb -k | sed 's:^$:§:' | tr -d '\n' | tr '§' '\n' > mdbout
Each thread is now on a single line. Yeah … perhaps there is a more elegant way to do this, but that was the first one that came to my mind:
00000300646655e0 600a6796cb8 600d62e14d8 1 60 600e64f4520 PC: cv_wait+0x38 CMD: ora_dbwriter7_importantdatabase stack pointer for thread 300646655e0: 2a1075f4da1 [ 000002a1075f4da1 cv_wait+0x38() ] vx_rwsleep_rec_lock+0x70() vx_write_common+0x2a8() vx_write+0x28() fop_write+0x20() pwrite+0x22c() syscall_trap+0xac()
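The same flattening can be sketched in Python, assuming only that thread blocks in `::threadlist -v` output are separated by blank lines (the same assumption the sed/tr pipeline makes). `join_thread_blocks` is a hypothetical helper; it also squeezes repeated blanks the way `tr -s ' '` does in the next steps. The sample input is shortened and partly made up.

```python
import re


def join_thread_blocks(text: str) -> list:
    """Collapse each blank-line-separated block into one line and squeeze
    runs of whitespace into single spaces."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = """ffffff02d153ce40 ffffff02cefdb540 ffffff02cefff1a0 1 59 ffffff02d153d01e
  PC: _resume_from_idle+0xf1    CMD: /lib/svc/bin/svc.startd
  stack pointer for thread ffffff02d153ce40: ffffff000f91ccc0
  [ ffffff000f91ccc0 _resume_from_idle+0xf1() ]
    swtch+0x145()

00000300646655e0 600a6796cb8 600d62e14d8 1 60 600e64f4520
  PC: cv_wait+0x38    CMD: ora_dbwriter7_importantdatabase
    vx_rwsleep_rec_lock+0x70()
"""
lines = join_thread_blocks(sample)
print(len(lines))             # 2
print(lines[1].split()[9])    # ora_dbwriter7_importantdatabase (field 10 for cut)
```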
Just a quick check.
# cat mdbout | grep "vx_rwsleep" | wc -l
1008
At the moment of the hang, 1008 threads were in vxfs`vx_rwsleep_rec_lock. That was interesting. Even more interesting was the list of commands that had threads in this function. The command name is field 10 of each line in the concatenated threadlist output.
# grep "vx_rwsleep_rec_lock" mdbout | tr -s ' ' | cut -d " " -f 10 | sort | uniq
ora_dbwriter1_importantdatabase
ora_dbwriter3_importantdatabase
ora_dbwriter5_importantdatabase
ora_dbwriter7_importantdatabase
When you further dig down into the large heap of data:
# grep "ora_dbwriter3_importantdatabase" mdbout | wc -l
258
# grep "ora_dbwriter3_importantdatabase" mdbout | grep -v "vx_rwsleep_rec_lock" | wc -l
7
Of all 258 threads belonging to ora_dbwriter3_importantdatabase, just seven weren't in the vx_rwsleep_rec_lock function.
At that moment I thought: that isn't a smoking gun, that's a smoking howitzer.
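The grep/cut/sort/uniq steps above can be rolled into one pass. This Python sketch (hypothetical names; the thread lines are made up, in the concatenated format shown earlier) counts, per command, all threads and those sitting in vx_rwsleep_rec_lock. It looks up the word after `CMD:` instead of relying on a fixed column.

```python
from collections import Counter

LOCK_FUNCTION = "vx_rwsleep_rec_lock"


def command_of(thread_line: str) -> str:
    """Word following 'CMD:' in a joined threadlist line, '' if missing."""
    fields = thread_line.split()
    if "CMD:" in fields:
        i = fields.index("CMD:")
        if i + 1 < len(fields):
            return fields[i + 1]
    return ""


def lock_waiters_by_command(thread_lines) -> tuple:
    """Per command: total threads and threads blocked in LOCK_FUNCTION."""
    total, blocked = Counter(), Counter()
    for line in thread_lines:
        cmd = command_of(line)
        total[cmd] += 1
        if LOCK_FUNCTION in line:
            blocked[cmd] += 1
    return total, blocked


# Made-up, shortened thread lines in the concatenated format.
threads = [
    "a1 p1 l1 1 60 w1 PC: cv_wait+0x38 CMD: ora_dbwriter3_importantdatabase "
    "vx_rwsleep_rec_lock+0x70() vx_write_common+0x2a8()",
    "a2 p1 l2 1 60 w2 PC: cv_wait+0x38 CMD: ora_dbwriter3_importantdatabase "
    "vx_rwsleep_rec_lock+0x70() vx_write_common+0x2a8()",
    "a3 p1 l3 1 60 w3 PC: _resume_from_idle+0xf1 CMD: ora_dbwriter3_importantdatabase "
    "swtch+0x145()",
    "a4 p2 l4 1 59 w4 PC: _resume_from_idle+0xf1 CMD: /lib/svc/bin/svc.startd "
    "lwp_park+0x157()",
]
total, blocked = lock_waiters_by_command(threads)
for cmd in sorted(blocked):
    print(f"{cmd}: {blocked[cmd]} of {total[cmd]} threads in {LOCK_FUNCTION}")
```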
An attempt to explain
Most threads executing this function belonged to the database writers. When you think about it, that's not so astonishing, especially given the nature of an rwlock. First: there is an rwlock for each inode in a filesystem. Its function: multiple readers can hold the lock and thus read concurrently from the file, but only one writer can hold it and thus write to the file. Equally important: you can't write to the file as long as one or more readers are in the code path protected by the rwlock for this file, and no one can read from the file as long as there is a writer in the protected code path.
In really basic rwlock implementations this can lead to writer starvation: it's hard for a writer to get the lock, because all readers have to relinquish the rwlock and no new readers may enter before the writer gets it. For this reason, the Solaris threads implementation tends to favour writers over readers. However, when you have many writers, it may take a long time until the backlog of writes has been worked off. Blindly preferring writers is not a solution either, because then readers would starve, which is even more problematic, as reads are always synchronous by nature. As I have written elsewhere: while a system can choose the time of a physical write to some extent, it can't choose the time of a read. A function can't continue as long as its data isn't available. But that's out of scope for this article. The name concurrent I/O was coined for the capability to read and write a file in parallel.
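To make the writer-preference idea concrete, here is a generic readers-writer lock sketched in Python with `threading.Condition`. It is not how the Solaris kernel or vxfs implement their locks, and the class name is hypothetical. It only shows the rule described above: many readers or one writer, and new readers wait as soon as a writer is queued, so writers can't starve (at the price that a steady stream of writers can hold readers back).

```python
import threading


class WriterPreferringRWLock:
    """Many readers OR one writer; queued writers block new readers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0          # readers currently holding the lock
        self._writer = False       # True while a writer holds the lock
        self._writers_waiting = 0  # writers queued for the lock

    def acquire_read(self):
        with self._cond:
            # new readers wait while a writer holds the lock or is queued
            while self._writer or self._writers_waiting > 0:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            # a writer needs exclusive access: no readers, no other writer
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```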
I just wrote that it can take a moment until the backlog of writes has been executed. In this case it was even worse: the inode r/w lock adds insult to injury, because it basically limits you to a single write I/O operation at a time per file, no matter how many HBAs and how many disks you have in your system. And so a moment becomes a while. This holds even when the changes in the file are totally unrelated, e.g. changing a block belonging to the user table stored in it and another block of the article database, or reading a block of the customer database into the SGA while writing the new salary of the promoted assistant. You can't do this in parallel due to the inode rw lock. With many updates in your workload, it's not that astonishing that an increasing number of database writer threads start twiddling their thumbs while waiting for their turn to write to the file.
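A back-of-the-envelope model of why the backlog hurts so much: if at most N writes to one file can be in flight, a backlog drains in ceil(backlog / N) rounds of one write latency each. The numbers below are hypothetical, not measurements from the customer system, and the model (with the hypothetical function `drain_time_s`) ignores everything except that limit.

```python
import math


def drain_time_s(backlog: int, latency_ms: float, in_flight: int) -> float:
    """Seconds to work off `backlog` writes when at most `in_flight` writes to
    the file can run at once, each taking `latency_ms` (idealized model)."""
    rounds = math.ceil(backlog / in_flight)
    return rounds * latency_ms / 1000.0


# Hypothetical workload: 2000 dirty blocks for one datafile, 5 ms per write.
print(drain_time_s(2000, 5.0, 1))   # 10.0  one write at a time (inode rwlock)
print(drain_time_s(2000, 5.0, 32))  # 0.315 32 writes in flight
```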
You may ask yourself why the heck there is such a mechanism. The r/w lock is mandatory in order to be POSIX compliant. You need it to ensure write ordering and consistent reads when updates occur in parallel to reads. Obviously you really want such protection when working with files. However, especially with databases, a file is just a container for a large heap of things. Independent things. And then things are different.
For this reason, there were some developments in the database realm to get rid of the inode rwlock and put this mechanism elsewhere. Oracle allows you to use a raw disk, so it has to handle consistent reads and write ordering itself anyway, and as it's aware of the inner structure of the heaps of data, it can do so with a much finer granularity than per inode and thus per file. In this case, the inode r/w lock is just a bottleneck without any use.
For this reason, UFS Direct I/O, for example, offers a mode that removes the lock. That doesn't mean the write ordering and consistency protections are gone. They just live in a layer that knows more about the structure inside the file and can thus do a better job. vxfs has similar mechanisms: QIO and ODM don't use such per-inode locking. They work differently from UFS Direct I/O, but as a former chancellor of the Federal Republic of Germany said: outcome matters.
One question was still open: why was this problem reproducible with a TRUNCATE TABLE command? That's pretty easy to answer, but you have to dig into the internals of Oracle. When Oracle executes a TRUNCATE TABLE command, it checkpoints the database. In such a situation it writes all dirty blocks from the SGA into the database datafiles. This must be done for recovery purposes.
Such checkpointing may trigger a storm of writes via the database writer, especially when you have an SGA with a lot of dirty blocks. The checkpoint has to complete before the TRUNCATE TABLE executes. And then we arrive at another red herring at the end: it's not the TRUNCATE TABLE command that was slow ... it's the checkpoint occurring before it. You can check this pretty easily: when a TRUNCATE TABLE takes too long for your taste, trigger a checkpoint manually (for example with ALTER SYSTEM CHECKPOINT) and run the TRUNCATE TABLE directly afterwards. TRUNCATE TABLE still does a checkpoint, but as you've already cleaned the SGA of dirty buffers, it doesn't have much to write. It should run much faster now.
Conclusion
In the end I had to tell the customer that, in essence, everything worked as designed. It would be a bug if the system behaved even a little bit differently. However, that's seldom the answer a customer wants to hear.
So: the solution for the issue? It's as old as it's easy: get rid of the inode rwlock. Get concurrent I/O, either by using raw disks, by using ASM, by using UFS with Direct I/O, or by using ODM or QIO for vxfs. I can only reiterate something I've already said: when you put your Oracle database files into a filesystem, you want to use direct I/O and concurrent I/O!
Friday, August 12. 2011
Apropos awesomeness.
MOVE from Rick Mereki on Vimeo.
Friday, August 12. 2011
Some of the shots are just breathtaking ...
Timelapse - The City Limits from Dominic on Vimeo.
Thursday, July 21. 2011
My colleague Christophe Pauliat, Principal Sales Consultant at Oracle, came up with a really nifty way to migrate his Solaris-based notebook from a smaller disk to a larger one. I will copy his mail verbatim here, because I think it's extremely useful. It somewhat resembles the "workaround" for ZFS resizing, but Christophe takes it significantly further and does it for boot disks.
OS: Solaris 11 express 2010.11 + SRU8
Steps:
1) Copy data + OS on the new HDD
a) connection of the new 500 GB HDD as an external USB HDD (using a USB external HDD box)
b) creation of a Solaris 2 partition with fdisk and make it active (bootable)
# fdisk /dev/rdsk/c4t0d0p0
c) with the format command, create a partition s0 with all cylinders except cylinder 0
d) Mirroring the existing ZFS pool (rpool) to the new HDD
# zpool attach -f rpool c1t0d0s0 c4t0d0s0
notes:
- c1t0d0 is the 80 GB HDD (old HDD)
- c4t0d0 is the 500 GB HDD (new HDD)
- the option -f is necessary to bypass the warning "partition 0 overlaps partition 2"
e) wait for the sync to be finished (with zpool status)
f) Install Grub on the new HDD
# installgrub -m /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c4t0d0s0
g) Split the pool rpool by detaching the new HDD to create a new pool
# zpool split rpool rpool2 c4t0d0s0
note: I chose not to detach the old HDD because I wanted it to be usable in case of problems
2) Shutdown OS and laptop, disconnect the USB external HDD and replace the internal 80 GB HDD with the new one
3) Rename the new pool rpool2 to rpool
- Boot from a Solaris 11 Express LiveCD or from the network using AI
note: In my case, I used an AI server I had installed before (Solaris 11 express 2010.11 with no SRU)
- zpool import rpool2 rpool to rename the pool
- zpool export rpool to export it so that there is no warning in step 4
4) Boot from the new HDD
- It works just fine, but the pool size is still the size of the old HDD (80 GB), although it uses a 500 GB partition (c4t0d0s0)
5) Increase the pool size to use the whole partition
# zpool set autoexpand=on rpool
The autoexpand property does a large part of the trick. The size of a mirrored pool is always the size of its smallest disk. When you have an 80 GB and a 500 GB disk, the size of the pool is 80 GB. Remove the 80 GB disk: the smallest disk is now 500 GB, and the size of the pool is now 500 GB as well, as long as you've activated autoexpand.
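The size rule from the last paragraph as an idealized Python sketch (hypothetical function name; sizes in GB, ignoring labels, metadata and the skipped cylinder 0): a mirror is as large as its smallest member, and without autoexpand it keeps its previous size after the small member is gone.

```python
def mirror_size_after_detach(sizes_gb, detached_index, autoexpand):
    """Idealized size of a mirror vdev after one member is removed."""
    before = min(sizes_gb)  # a mirror is as large as its smallest member
    remaining = [s for i, s in enumerate(sizes_gb) if i != detached_index]
    return min(remaining) if autoexpand else before


print(mirror_size_after_detach([80, 500], 0, autoexpand=True))   # 500
print(mirror_size_after_detach([80, 500], 0, autoexpand=False))  # 80
```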
Sunday, July 17. 2011
As I wrote before, you can reach my Google+ profile via the shortcut http://c0t0d0s0.org/+ (http://moellenkamp.org/+ works as well). This is done with a small addition to my Apache configuration.
The relevant lines are:
RewriteEngine On
RewriteRule ^/\+$ https://plus.google.com/110984680346995237069 [R,L]
Of course you have to enable mod_rewrite first, and you have to use the URL of your own profile. The easiest way to get it is to copy the link location with your browser while hovering over the photo on the home page of your Google+ account.
Really simple.
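If you want to check which paths the pattern catches before touching the Apache configuration, Python's `re` module treats this particular pattern the same way as Apache's PCRE: `^/\+$` only matches the path `/+`. This sketch tests the regex alone, not how mod_rewrite applies it to a request URL.

```python
import re

# Same pattern as in the RewriteRule: an anchored "/" followed by a literal "+".
PLUS_PATTERN = re.compile(r"^/\+$")

for path in ["/+", "/+/", "/plus", "/archives/+", "/"]:
    print(f"{path!r:15} {bool(PLUS_PATTERN.match(path))}")
```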
Wednesday, July 6. 2011
An overwhelming number of ZFS installations work with just a bunch of disks, perhaps in a JBOD or in the server itself. However, there are installations that use disk arrays with RAID controllers. Some of those installations even use a single LUN. I don't think that this is a good idea (for example because without redundancy ZFS can only detect corruption, not repair it), but that's a different story I don't want to discuss here.
There is a slight change in the default parameters of ZFS in Solaris 10 Update 9. It's related to the parameter zfs:zfs_vdev_max_pending. This parameter controls how many I/O requests can be pending per vdev. For example, when your OS sees 100 disks and zfs:zfs_vdev_max_pending is 2, you have at most 200 requests outstanding. When the same 100 disks are hidden behind your storage controller, which just shows a single LUN, you will have, as you may have guessed, at most 2 pending requests.
You may think that you could increase the queue depth endlessly, but as usual this is a trade-off and not that easy: longer queues may increase the latency of the individual commands. Experience showed that a certain queue depth delivered the best performance on most installations.
However, the installed landscape changes, and sometimes you have to adjust things. Exactly this happened a while ago in OpenSolaris, and it seems that this change has moved into Solaris 10. The default for zfs:zfs_vdev_max_pending is 10 at the moment. You can check this:
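The arithmetic of this example as a tiny Python sketch (hypothetical function name; it models only the per-vdev limit described above, nothing else in the ZFS I/O scheduler):

```python
def max_outstanding_ios(visible_devices: int, max_pending: int) -> int:
    """Upper bound of I/Os ZFS keeps queued: max_pending per device the OS sees."""
    return visible_devices * max_pending


print(max_outstanding_ios(100, 2))  # 200: 100 disks visible to the OS
print(max_outstanding_ios(1, 2))    # 2: the same 100 disks behind one LUN
print(max_outstanding_ios(1, 10))   # 10: single LUN, new default
print(max_outstanding_ios(1, 35))   # 35: single LUN, old default
```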
# echo zfs_vdev_max_pending::print | mdb -k
0xa
#
0xa in decimal is 10.
And this is a wise choice for most installations out there. But it was different in older versions. I checked it on U7, and I asked my Twitter/Facebook contacts to do a quick check on U8, as I was too lazy to install it:
# echo zfs_vdev_max_pending::print | mdb -k
0x23
#
0x23 in decimal is 35 and 35 was the default up to Update 8 of Solaris 10.
So essentially the queues are shallower than before. For JBODs this is most often a good thing, as each vdev and thus each LUN has its own queue of 10 pending I/Os. For a single LUN hiding many disks, it sometimes isn't. So how do you change it back to the old value?
You can change it dynamically:
# echo zfs_vdev_max_pending/W0t35 | mdb -kw
To make this change boot-persistent you have to add a line to /etc/system:
set zfs:zfs_vdev_max_pending = 35
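mdb reads unprefixed numbers as hexadecimal, which is why the value is printed as 0xa or 0x23 and why the write above uses 0t35 for decimal 35. A small Python sketch of these prefix rules as I understand them from the mdb documentation (0x hex, 0t decimal, 0o octal, 0i binary); `parse_mdb_number` is a hypothetical helper, not part of mdb.

```python
def parse_mdb_number(token: str) -> int:
    """Parse an mdb-style integer; unprefixed input is taken as hexadecimal."""
    bases = {"0x": 16, "0t": 10, "0o": 8, "0i": 2}
    prefix = token[:2].lower()
    if prefix in bases:
        return int(token[2:], bases[prefix])
    return int(token, 16)  # mdb's default input base


print(parse_mdb_number("0xa"))   # 10, the new default
print(parse_mdb_number("0x23"))  # 35, the old default
print(parse_mdb_number("0t35"))  # 35, as written with /W0t35
```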
With very large numbers of disks behind your controller forming a single LUN, an even higher value may sometimes be appropriate.
How do you know if this decreased queue depth is a problem for you at all? The command iostat will help you:
jmoekamp@hivemind:~$ iostat -xdn
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
6,3 1,9 525,9 31,2 0,1 0,0 16,4 6,0 2 3 c3d0
17,1 1,0 1676,0 8,0 0,2 0,1 11,4 4,8 4 4 c3d1
6,4 1,9 525,8 31,2 0,1 0,0 14,1 4,8 2 2 c4d0
17,1 1,0 1675,9 8,0 0,2 0,1 12,9 4,7 4 4 c4d1
0,0 0,0 0,0 0,0 0,0 0,0 0,0 0,0 0 0 gsdbc
jmoekamp@hivemind:~$
If the actv column is at or near the value of zfs:zfs_vdev_max_pending, raising the parameter is worth a try. Otherwise it isn't.
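A Python sketch of this check (hypothetical names; the 80% threshold is an arbitrary assumption for "near", and the c5t0d0 line in the sample is made up). It reads `iostat -xdn` output, accepts ',' as decimal separator as in the German-locale output above, and lists devices whose actv is close to the configured queue depth.

```python
def busy_devices(iostat_text: str, max_pending: float, threshold: float = 0.8):
    """Devices whose actv is at least threshold * max_pending."""
    result, header = [], None
    for line in iostat_text.splitlines():
        fields = line.split()
        if "actv" in fields:
            header = fields  # column names, e.g. r/s w/s ... actv ... device
            continue
        if header is None or len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        actv = float(row["actv"].replace(",", "."))
        if actv >= threshold * max_pending:
            result.append((row["device"], actv))
    return result


sample = """                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    6,3    1,9  525,9   31,2  0,1  0,0   16,4    6,0   2   3 c3d0
   17,1    1,0 1676,0    8,0  0,2  0,1   11,4    4,8   4   4 c3d1
  850,0  210,0 54400,0 13440,0 0,0  9,7    0,0   11,2   0 100 c5t0d0
"""
print(busy_devices(sample, max_pending=10))  # [('c5t0d0', 9.7)]
```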
Sunday, July 3. 2011
I was asked in a comment whether Solaris supports power management for the processor in the HP N36L MicroServer. The answer is yes.
The best way to check this is via kstat. If kstat lists multiple supported frequencies, the processor supports power management:
jmoekamp@hivemind:~$ kstat -m cpu_info | grep "supported_frequencies_Hz"
supported_frequencies_Hz 800000000:1100000000:1300000000
supported_frequencies_Hz 800000000:1100000000:1300000000
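The same check as a Python sketch (hypothetical helper names), assuming input in the `NAME VALUE` form that the grep above produces, with the frequencies separated by colons:

```python
def supported_frequencies(kstat_text: str) -> list:
    """One sorted list of supported frequencies (Hz) per CPU instance."""
    result = []
    for line in kstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "supported_frequencies_Hz":
            result.append(sorted(int(f) for f in fields[1].split(":")))
    return result


def cpu_pm_capable(kstat_text: str) -> bool:
    # more than one supported frequency per CPU means it can be clocked down
    freqs = supported_frequencies(kstat_text)
    return bool(freqs) and all(len(f) > 1 for f in freqs)


sample = """supported_frequencies_Hz 800000000:1100000000:1300000000
supported_frequencies_Hz 800000000:1100000000:1300000000
"""
print(cpu_pm_capable(sample))  # True
```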
However, before Solaris actually uses it, you have to configure power management. First, add or change the following lines in /etc/power.conf:
cpupm enable
cpu_deep_idle enable
cpu-threshold 10s
Afterwards, run the command pmconfig once. Now let the system idle for 10 seconds and check the frequencies it runs at:
jmoekamp@hivemind:~# kstat -m cpu_info | grep "current_clock_Hz"
current_clock_Hz 800000000
current_clock_Hz 800000000
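An idle system with power management active should sit at the lowest supported frequency. A Python sketch (hypothetical helper names) compares `current_clock_Hz` with the lowest value of `supported_frequencies_Hz` per CPU. It assumes both statistics come from the same `kstat -m cpu_info` run, so that the n-th line of each belongs to the same CPU instance.

```python
def stat_values(kstat_text: str, stat: str) -> list:
    """Value column of every 'NAME VALUE' line named `stat`."""
    values = []
    for line in kstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == stat:
            values.append(fields[1])
    return values


def idle_at_lowest_clock(kstat_text: str) -> bool:
    current = [int(v) for v in stat_values(kstat_text, "current_clock_Hz")]
    lowest = [min(int(f) for f in v.split(":"))
              for v in stat_values(kstat_text, "supported_frequencies_Hz")]
    # pair the n-th current clock with the n-th frequency list (same CPU)
    return bool(current) and len(current) == len(lowest) and all(
        c == low for c, low in zip(current, lowest))


sample = """current_clock_Hz 800000000
supported_frequencies_Hz 800000000:1100000000:1300000000
current_clock_Hz 800000000
supported_frequencies_Hz 800000000:1100000000:1300000000
"""
print(idle_at_lowest_clock(sample))  # True
```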