Resolving slow ssh login performance problems on OpenIndiana

For some time, my production systems were plagued with intermittently slow login performance over ssh. In day-to-day operations it was rarely an issue, except for a couple of zones where it was persistent. It bit us the most when we tried to deploy mass configuration changes.

Naturally, I looked at the usual suspects.

  • DNS lookups [LookupClientHostnames, VerifyReverseMapping]
  • SSH login DoS protection [MaxStartups]
  • Kerberos (mis)configuration [GSSAPIAuthentication]
  • DNS Server Timeouts
  • Network drops causing issues for any of these services
  • Unintentional interactions in PAM

None of those panned out. I spent hours trying to prove the problem was one of those features and it was completely futile. So, I resorted to denial. Maybe the problem was just in my head and no one else would notice. Denial never works. Just ask the tax man.

The other night, some cron jobs that ran on one server but executed work on a bunch of other servers failed to restart or start some processes. For whatever reason, my interest in solving the problem was renewed.

Here are the symptoms of my problem:

root@heimdall:~# ssh -v web003
Sun_SSH_1.5, SSH protocols 1.5/2.0, OpenSSL 0x009080ff
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Rhosts Authentication disabled, originating port will not be trusted.
debug1: ssh_connect: needpriv 0
debug1: Connecting to web003 [W.X.Y.Z] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/identity type -1
debug1: identity file /root/.ssh/id_rsa type 1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.5
debug1: match: Sun_SSH_1.5 pat Sun_SSH_1.5*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-Sun_SSH_1.5
debug1: use_engine is 'yes'
debug1: pkcs11 engine initialized, now setting it as default for RSA, DSA, and symmetric ciphers
debug1: pkcs11 engine initialization complete
debug1: Failed to acquire GSS-API credentials for any mechanisms (No credentials were supplied, or the credentials were unavailable or inaccessible)
debug1: SSH2_MSG_KEXINIT sent

At this point, the system hangs for anywhere from seconds to minutes. To figure out what was going on on the server side, I used truss.

Since web003 is a regular zone, I chose to work from the global zone. The first thing I did was find the PID of the parent sshd process on the web003 zone.


root@www002:~# ps -efZ |grep ssh |grep web003
web003.a root 13880 3502 0 11:30:12 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13967 13965 0 11:30:55 ? 0:00 /usr/lib/ssh/sshd
web003.a root 3502 1 0 18:28:06 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13881 13880 0 11:30:12 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13965 3502 0 11:30:55 ? 0:00 /usr/lib/ssh/sshd
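If you do this often, the listener can be picked out programmatically. Below is a hypothetical portable sketch; the sample data stands in for live `ps -efZ | grep web003` output, and the trick is that the listener's parent (column 4) is init, PID 1.

```shell
# Hypothetical helper: from ps -ef style lines for a zone's sshd
# processes, print the PID (column 3) of the listener, i.e. the
# process whose parent PID (column 4) is init (1). The sample data
# stands in for live `ps -efZ` output.
ps_sample='web003.a root 13880 3502 0 11:30:12 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13967 13965 0 11:30:55 ? 0:00 /usr/lib/ssh/sshd
web003.a root 3502 1 0 18:28:06 ? 0:00 /usr/lib/ssh/sshd'

echo "$ps_sample" | awk '$4 == 1 { print $3 }'   # -> 3502
```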

There it is: PID 3502, owned by init. Now we'll use truss to see what is going on.


root@www002# truss -pedf 3502

[some truss output removed in the interest of brevity]

14426: 30.1188 mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFE820000
14426: 30.1188 memcntl(0xFE830000, 71660, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
14426: 30.1198 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 5
14426: 30.1199 brk(0x080EA228) = 0
14426: 30.1200 brk(0x080EC228) = 0
3502: pollsys(0x08047470, 2, 0x00000000, 0x00000000) (sleeping...)
14426: connect(5, 0x08047180, 16, SOV_DEFAULT) (sleeping...)
14424: pollsys(0x080453B0, 1, 0x00000000, 0x00000000) (sleeping...)

Now we see that the process hangs in this connect() call. What in the world is this connect() call doing? Why is ssh connecting to something? Is this a DNS lookup over TCP? Could it be something else? I had to investigate, so I changed the truss options to show the arguments passed to connect().


root@www002# truss -pdef -tall -v connect 3502
[truss data removed]
14970: 21.6030 brk(0x080EC228) = 0
3502: pollsys(0x08047470, 2, 0x00000000, 0x00000000) (sleeping...)
14968: pollsys(0x080453B0, 1, 0x00000000, 0x00000000) (sleeping...)
14970: connect(5, 0x08047180, 16, SOV_DEFAULT) (sleeping...)
14970: AF_INET name = 127.0.0.1 port = 30003
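The destination can also be pulled straight out of a saved capture. A small sketch, with a sample line standing in for the real truss output above:

```shell
# Pull the destination port from a truss AF_INET line; the port is the
# last whitespace-separated field. The sample line stands in for a
# saved capture of the real truss output.
truss_line='14970:        AF_INET  name = 127.0.0.1  port = 30003'
echo "$truss_line" | awk '{ print $NF }'   # -> 30003
```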

Okay, now we see that ssh is trying to connect to port 30003 on localhost. That is useful information, but why is it doing that? A quick check of /etc/services revealed nothing, which seems like an omission. Googling port 30003 eventually pointed me to a software package called TrouSerS, an implementation of the "Trusted Platform Module," or TPM, software stack. What in the world is it anyway?

My colleague had a great description of it:

Looks like some sort of “Trusted Computing” TPM bigbrotherware that was invented to appease Hollywood and Microsoft.

Welp, Hollywood and Microsoft don't pay my bills, so how do I disable it? I found via Google that on OpenIndiana the TrouSerS package runs as the service tcsd.


root@www002:~# svcs tcsd
STATE STIME FMRI
disabled 18:18:09 svc:/application/security/tcsd:default

The service is disabled by default. Hrm. Is this a dead end? Why is ssh connecting to it, then? Perhaps it is part of the cryptographic framework. Those crypto guys are geniuses and a little sneaky. What can be learned about the default configuration?


root@web003:~# cryptoadm list

User-level providers:
Provider: /usr/lib/security/$ISA/pkcs11_kernel.so
Provider: /usr/lib/security/$ISA/pkcs11_softtoken.so
Provider: /usr/lib/security/$ISA/pkcs11_tpm.so

Kernel software providers:
des
aes
arcfour
blowfish
ecc
sha1
sha2
md4
md5
rsa
swrand

Kernel hardware providers:

Holy Cow Batman, there it is! TPM, right there, installed as a user-level provider in the cryptographic framework. Son of a biscuit. So, I tried to disable the module with:


cryptoadm disable provider=/usr/lib/security/\$ISA/pkcs11_tpm.so

but that didn't work. So I rebooted the zone. That made no difference. Here is what finally did:


cryptoadm uninstall provider=/usr/lib/security/\$ISA/pkcs11_tpm.so

Don't freak out. This doesn't uninstall anything from your system. It simply does what "cryptoadm disable" doesn't seem to do: it disables the feature by removing it from the crypto framework.

So here are the results.

Initially, web003 performed like this:

root@heimdall:~# time ssh web003 hostname
web003

real 3m3.169s
user 0m0.034s
sys 0m0.009s

Can you believe that? Three minutes to log into a system? Bleh.

And after removing Trousers from the crypto framework, it looks like this:

root@heimdall:~# time ssh web003 hostname
web003

real 0m0.373s
user 0m0.037s
sys 0m0.012s

Yes, that is 373 milliseconds compared to 3 minutes, 3 seconds, and 169 milliseconds.
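The ratio between the two runs is easy to check with awk:

```shell
# Before: 3m3.169s; after: 0.373s. The speedup as a rounded ratio.
awk 'BEGIN { printf "%.0fx\n", (3*60 + 3.169) / 0.373 }'   # -> 491x
```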

I think we can put a nail in the coffin of this problem. It must be a bug that the cryptographic framework invokes a service that is disabled by default. I guess mistakes happen. Managing a large project like IllumOS or OpenIndiana is definitely a challenge.

Posted in Uncategorized

Intel 910 SSD ZFS benchmark results on OpenIndiana 151a1 for 8k IOPs using filebench and pgbench

These are the benchmark results of my testing the Intel 910 with pgbench and filebench.

The test parameters are outlined in each run.
For filebench, the size of the file was set to 20GB, with the primary cache disabled and no L2ARC configured. The system configuration is as follows:

  • OpenIndiana 151a1
  • Intel SR2625URLXR with 96GB of RAM
  • 2 Intel L5630 low power CPUs @ 2.133Ghz
  • Intel 910 800GB SSD
  • 64 GB Crucial M4 SSD as boot device
  • LSI SAS3801E-R as boot controller (MPT2-based cards will not work as boot controllers with the Intel 910 without at least firmware rev P15, which was released after this benchmark was produced)

8k-randomread test, 20GB file, primary and secondary caches set to none, compression set to none, ashift=9, running for 120 seconds


Threads Total ops ops/s ops type r/w throughput CPU/op latency
1 111951 924 924/0 7.2MB/s 1530us 1.1ms
2 286665 2365 2365/0 18.5MB/s 741us 0.8ms
4 634501 5236 5236/0 40.9MB/s 458us 0.8ms
8 1260642 10404 10404/0 81.3MB/s 359us 0.8ms
16 2330167 19233 19233/0 150.3MB/s 315us 0.8ms
32 3761698 31061 31061/0 242.7MB/s 313us 1.0ms
64 5186173 42923 42923/0 335.3MB/s 304us 1.5ms
128 5752484 47771 47771/0 373.2MB/s 289us 2.7ms
256 6021613 49956 49956/0 390.0MB/s 287us 5.1ms
512 6095361 50600 50600/0 395.3MB/s 284us 10ms
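As a sanity check on the table, throughput should simply be ops/s times the 8KiB block size. Here it is for the 512-thread row:

```shell
# Cross-check: 50600 ops/s at 8KiB (8192 bytes) per op, expressed
# in MiB/s, should match the table's 395.3MB/s figure.
awk 'BEGIN { printf "%.1f MB/s\n", 50600 * 8192 / 1048576 }'   # -> 395.3 MB/s
```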

8k-randomread test, 20GB file, primary and secondary caches set to none, compression set to LZJB, ashift=9, running for 120 seconds

Threads Total ops ops/s ops type r/w throughput CPU/op latency
1 147908 1220 1220/0 9.5MB/s 1197us 0.8ms
2 407808 3366 3366/0 26.3MB/s 549us 0.6ms
4 986512 8143 8143/0 63.6MB/s 329us 0.5ms
8 2006674 16560 16560/0 129.4MB/s 260us 0.5ms
16 3777101 31191 31191/0 243.7MB/s 238us 0.5ms
32 6193553 51197 51197/0 400MB/s 247us 0.6ms
64 7876073 65415 65415/0 511.0MB/s 230us 0.9ms
128 8062341 67005 67005/0 523.5MB/s 221us 1.4ms
256 7159659 59432 59432/0 464.3MB/s 230us 1.2ms
512 6932314 57481 57481/0 449.0MB/s 234us 1.0ms

8k random read test, compression set to none, primary and secondary caches set to none, ashift=12

Threads Total ops ops/s ops type r/w throughput CPU/op latency
1 97325 803.1 803/0 6.3MB/s 1641us 1.2ms
2 209902 1732 1732/0 13.5MB/s 896us 1.1ms
4 395048 3260 3260/0 25.5MB/s 595us 1.2ms
8 686443 5664 5664/0 44.2MB/s 440us 1.4ms
16 1044447 8620 8620/0 67.3MB/s 350us 1.8ms
32 1332033 10993 10993/0 85.9MB/s 306us 2.9ms
64 1457660 12034 12034/0 94.0MB/s 289us 5.3ms
128 1505032 12421 12421/0 97.0MB/s 293us 10.3ms
256 1527138 12609 12609/0 98.5MB/s 287us 20.2ms
512 1546034 12772 12772/0 99.7MB/s 282us 39.9ms

8k OLTP, no compression, 20GB file, ashift=9

shadow Threads Total ops ops/s ops type r/w throughput CPU/op latency
100 3255652 26912 13391/13384 210.1MB/s 358us 3.9ms
200 3332651 27606 13737/13728 215.5MB/s 361us 6.5ms
400 3306165 27390 13638/13611 213.7MB/s 356us 7.5ms


8k OLTP with LZJB, ashift=9

shadow Threads Total ops ops/s ops type r/w throughput CPU/op latency
100 4111776 34015 16919/16922 265.6MB/s 241us 1.5ms
200 4087097 33796 16817/16806 263.8MB/s 253us 2.2ms
400 4096267 33884 16896/16842 264.4MB/s 244us 1.5ms

pgbench results: primary and secondary caches set to none, compression set to off

scaling factor  number of rows  clients  threads  transactions processed  transactions/client  tps with connection time  tps without connection time
100 10M 96 32 9600/9600 100 5289.97 5599.76
200 20M 96 32 9600/9600 100 4993.52 5253.86
500 50M 96 32 9600/9600 100 4407.32 4627.92
1000 100M 96 32 9600/9600 100 4438.84 4659.63
2000 200M 96 32 9600/9600 100 4173.09 4365.93
5000 500M 96 32 9600/9600 100 4013.53 4194.34
10000 1B 96 32 9600/9600 100 3822.07 3979.56
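For reference, pgbench sizes its accounts table at 100,000 rows per scaling-factor unit, which is where the row counts in these tables come from:

```shell
# pgbench_accounts holds 100,000 rows per unit of scaling factor (-s),
# so scale 100 -> 10M rows, 1000 -> 100M, 10000 -> 1B.
for s in 100 1000 10000; do
  awk -v s="$s" 'BEGIN { printf "scale %5d -> %d rows\n", s, s * 100000 }'
done
```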

pgbench results: primary and secondary caches set to none, compression=lzjb

scaling factor  number of rows  clients  threads  transactions processed  transactions/client  tps with connection time  tps without connection time
100 10M 96 32 9600/9600 100 4646.00 4898.47
200 20M 96 32 9600/9600 100 4685.83 4936.45
500 50M 96 32 9600/9600 100 4136.88 4318.00
1000 100M 96 32 9600/9600 100 4243.42 4428.62
2000 200M 96 32 9600/9600 100 3842.88 4009.80
5000 500M 96 32 9600/9600 100 3684.58 3839.77
10000 1B 96 32 9600/9600 100 3665.28 3819.95

pgbench results: ZFS primary cache enabled, secondary cache set to none, compression=lzjb

scaling factor (-s)  number of rows  clients (-c)  threads (-j)  transactions processed  transactions/client (-t)  tps with connection time  tps without connection time
100 10M 96 32 960000/960000 10000 5950.14 5953.34
200 20M 96 32 960000/960000 10000 5755.83 5758.70
500 50M 96 32 960000/960000 10000 5513.13 5516.04
1000 100M 96 32 960000/960000 10000 5145.42 5147.79
2000 200M 96 32 960000/960000 10000 5069.97 5062.48
5000 500M 96 32 960000/960000 10000 4606.50 4608.49
10000 1B 96 32 960000/960000 10000 4470.64 4472.39
Posted in Benchmarks, filebench, Intel 910 SSD, Intel Server, Performance, SAN, SSD cache, ZFS

Crucial M4 SSD SMART failure detection

Crucial/Micron is kind enough to trigger a SMART failure on their M4 SSD drives when they reach about 88% of their rated wear level. This is tracked in SMART attribute ID #202. The precise trigger appears to be a value of 2700 in ID #173, 'Wear_Levelling_Count.' This is a nice feature to have, as it allows you to reliably detect when your drives are about to fail.

Today I called Crucial support to open an RMA on four of my failing M4 drives. They put two people on the phone who kept telling me there is no way I could wear out a disk in just one year. They were quite convinced that I have a garbage collection problem. It is quite sad that the technicians are so poorly trained on their products.

Below is an example of how to get the relevant metrics out of your system.


# for i in `zpool status |grep 0751 | awk '{ print $1 }'`; do echo $i; /usr/local/sbin/smartctl -a -d sat,0 /dev/rdsk/$i |egrep -i "wear_level|perc_rate|power_on_hours"; done
9 Power_On_Hours 0x0032 100 100 001 Old_age Always - 11028
173 Wear_Levelling_Count 0x0033 010 010 010 Pre-fail Always FAILING_NOW 2704
202 Perc_Rated_Life_Used 0x0018 012 012 001 Old_age Offline - 88

...

The M4s are built on 24nm NAND flash, which means they are only rated for about 3,000 write/erase cycles. As you can see from Wear_Levelling_Count, the device is now within 10% of its predicted end of life.
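A quick bit of arithmetic backs this up: assuming the 3,000-cycle rating, the raw counter of 2704 is already past the 88% mark at which attribute #202 trips.

```shell
# Wear_Levelling_Count raw value against the assumed 3,000-cycle
# rating, as a percentage of rated life consumed.
awk 'BEGIN { printf "%.1f%% of rated cycles\n", 100 * 2704 / 3000 }'   # -> 90.1% of rated cycles
```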

I was completely unable to convince them that the wear level has nothing to do with garbage collection. Given that, and the fact that I value my time, I told them I would do as they asked and power the drive up for a day without the SATA cable attached. We all know what is going to happen: exactly bupkis. I am fully shocked that Crucial doesn't recognize their own failure thresholds as grounds for an RMA. In fact, these folks really had no idea what I was talking about.

I will keep you posted on how this goes.

*update 1/17/2013*

I called back, got a different service rep and she happily took back all four disks. That is how it should have gone the first time.

j.

Posted in Uncategorized

Using vlan tagging with OpenIndiana and Juniper switches

Setting up VLANs with OpenIndiana is shockingly simple. I will show you how to do it.

My environment already uses link aggregation groups (LAGs). I use Juniper EX4200 switches set up as a virtual chassis. The two gigE links from each server go to two different EX4200 chassis. That way, the server can survive a chassis failure.

This simple stanza on the switch's aggregated ethernet interface sets it all up on the switch side. You will notice that the switch is configured to support jumbo frames, and in this case we allow membership in VLANs 300 and 400 over the trunk.


root@brdr0.sf0# show ae4
description "www004 aggr0 ge-0/0/2 ge-1/0/0/2";
mtu 9014;
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        port-mode trunk;
        vlan {
            members [ 300 400 ];
        }
    }
}

So that takes care of the switch side. Now, there is a minor modification to the VNICs on the server side.

My old vnic creation line looked like this:

dladm create-vnic -l aggr0 web7

but the new vnic creation line has to reflect which VLAN the VNIC is associated with:

dladm create-vnic -l aggr0 -v 300 web7

You can validate your handiwork using dladm:

root@www004:~# dladm show-vnic
LINK OVER SPEED MACADDRESS MACADDRTYPE VID
web7 aggr0 1000 2:8:20:9:9e:c random 300
www4 aggr0 1000 2:8:20:ed:bc:6f random 300
flkrb0 aggr0 1000 2:8:20:47:ab:f1 random 400

Now my zones can share the gift of different broadcast domains.
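If you have several VNICs to recreate along with their VLAN IDs, a dry-run loop helps. This is a hypothetical sketch (names and VLANs from my setup) that only prints the dladm commands, so it can be reviewed before piping the output to sh on the host:

```shell
# Dry-run VNIC creation: each entry is name:vlan-id. The loop only
# echoes the dladm commands; pipe to sh on the host to execute.
for pair in web7:300 www4:300 flkrb0:400; do
  name=${pair%%:*}   # text before the colon
  vid=${pair##*:}    # text after the colon
  echo "dladm create-vnic -l aggr0 -v $vid $name"
done
```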

Posted in crossbow, IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, Juniper, LACP, Link Aggregation, OpenIndiana, Uncategorized, Zones

Intel 910 SSD compatibility issues with other LSI mpt2 based products – Part I

I purchased an Intel 910 SSD to validate the device before deploying it in my production environment. I was shocked to discover that it would not play nice with my LSI 2003, 2008, and 2308 based cards. LSI is the reference point for enterprise storage. If your gear doesn't play nice with LSI, you simply aren't in the enterprise space. But how could Intel develop a product based on the LSI 2008 chip that is fully incompatible with other LSI cards? It is a mystery to me. I have the following LSI products and got the same result with each: 9207-8i, 9211-8i, 9205-8e.

It seems that the Intel product has been configured to prohibit acting as a boot device. This setting may somehow bleed over to the LSI cards, which then disable themselves as boot devices. This happens despite the fact that the LSI cards are configured to "boot from BIOS or OS."

My reference platform is an Intel SR2625URLXR 2U server. I have tickets open with both Intel and LSI. The Intel SSD team denies any responsibility and has pushed the ticket off to the server team. The server team, at last report, was trying to get their hands on a 910 SSD to validate the issue. You would think the server team would have a stockpile of Intel server products at their disposal. Apparently, they don't. What is up with that?

On the other hand, LSI was more accepting of the problem. The tech I worked with seemed concerned that during POST the LSI card reported itself as "disabled" even though it was clearly set to boot from "BIOS or OS" in the configuration utility. The LSI tech said something like, and I am paraphrasing here, "I need to get someone with a budget to buy one of these cards." He followed it up with: this may take a long time. That is not what you want to hear from the world's leading storage vendor. The tech requested screenshots of both the card-disabled message and the configuration screen indicating that it was in fact enabled.

I am now playing wait and see. In the meantime, either be prepared to validate the Intel 910 SSD with your LSI gear, or, if you cannot afford the risk (the 800GB model is going for $4,000), it might be wise to simply stay away until the issues have been worked out.

I will continue to post progress on this issue.

j.

Posted in Intel 910 SSD

ZFS: Performance Tuning for Scrubs and Resilvers

My environment is write heavy. I have hundreds of millions of phones sending me events continuously. They are like billions and billions of star-like bubbles invading my spinning rust.

Consequently, when a disk fails or racks up excessive CRC errors, a resilver job is kicked off. Similarly, I scrub the storage pools monthly and occasionally turn up a bad disk that is silently corrupting data. The problem is that, with the default settings, the resilver or scrub job reports that it needs two or three months to complete. That doesn't work for me. If you are reading this, you might have a similar problem.

I would prefer to throttle the inbound data and restore the system to full data redundancy. Around here, data integrity is everything. To complete scrubs and resilvers quickly, I have had to tweak ZFS in an Evil way. First, we'll talk about what to put into /etc/system; then I will show you how to make these changes take effect immediately.

The /etc/system changes are straightforward:

* Prioritize resilvering by setting the delay to zero
set zfs:zfs_resilver_delay = 0

* Prioritize scrubs by setting the delay to zero
set zfs:zfs_scrub_delay = 0

* set the maximum number of in-flight IOs to a reasonable value - this number will vary by environment
set zfs:zfs_top_maxinflight = 128

* resilver for five seconds per TXG
set zfs:zfs_resilver_min_time_ms = 5000

The default value of zfs_resilver_min_time_ms is 3000 (three seconds). Giving resilver five seconds per transaction group lets the mirror rebuild more quickly.

Now, to make these changes take effect immediately, so you don't have to reboot, you can execute the following:

echo zfs_resilver_delay/W0 | mdb -kw
echo zfs_scrub_delay/W0 | mdb -kw
echo zfs_top_maxinflight/W80 | mdb -kw
echo zfs_resilver_min_time_ms/W1388 | mdb -kw

These variables and more can be found in dsl_scan.c
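mdb takes those write values in hex, so it is worth double-checking the radix conversions against the decimal values used in /etc/system. printf does the job:

```shell
# Radix sanity check: the /etc/system values are decimal, the mdb
# write arguments are hex.
printf '%x\n' 128 5000     # decimal -> hex: 80, 1388
printf '%d\n' 0x80 0x1388  # hex -> decimal: 128, 5000
```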

These modifications took my scrub/resilver operations from approximately two months down to approximately two to three hours, depending on the individual system.

Happy Scrubbing….

Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, Uncategorized, ZFS

Reconfiguring OI/Solaris 11 for full reboot instead of Fast Reboot

Buried deep in the SMF repository is a service named 'boot-config.' It controls how your system reboots when you type 'reboot.' Sounds simple, right? Good.

svcs(1) reports the service as follows:

jason@heimdall:/home/jason/smf% svcs boot-config
STATE STIME FMRI
online Mar_23 svc:/system/boot-config:default

The rub for me was that my Intel SR2625 (S5520UR) based servers would not reboot properly with the default setting, which is, of course, "fast reboot." Fast reboot basically allows Solaris to restart in place without resetting the motherboard and starting from scratch. This is both fast and efficient, if it works. The problem for me is, it just didn't work. The systems would start to boot OI but then spit out some messages about 32-bit address space, and that's where the joy stopped.

NOTICE: unsupported 32-bit IO address on pci-pci bridge

32-bit IO address not supported

The workaround for me was to tell OpenIndiana to do a full reboot to avoid this problem. It simply involves flipping a boolean property to false. Below is the diff.


jason@heimdall:/home/jason/smf% gdiff boot-config.dist.smf boot-config.smf
43c43
< <propval name='fastreboot_default' type='boolean' value='true'/>
---
> <propval name='fastreboot_default' type='boolean' value='false'/>

This is a one-line change. Simply feed it back into the repo and your system will completely restart, resetting the BIOS and coming all the way back up. If this makes sense to you, feel free to stop reading. The balance of the article goes into the mechanics of how to update the repo.

First, dump the boot-config entry to a file.

root@heimdall:/home/jason/smf# svccfg export boot-config > /tmp/boot-config.smf

Now edit the aforementioned line. Then push the update back into the repo as follows:


% svccfg validate /tmp/boot-config.smf
% svccfg import /tmp/boot-config.smf

Voila! You are now set for a full reboot.

Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, Intel Server, OpenIndiana

¡Lo Hicimos! We Did It! We broke the bonds of Amazon (AWS, EC2)

Moving out of Amazon is no small feat.

After a couple of months of preparation, we finally moved out of Amazon Web Services (AWS, a.k.a. the Roach Motel) on December 12th, 2011, and not a moment too soon. In the end, after a trial run, we successfully moved out of AWS in less than three hours. This was no small feat, considering we had to transfer, and keep synchronized, terabytes of data from a source on the other side of the country.

Our new installation is fantastic. It is an excellent blend of carrier-class technology and commodity hardware. This delicate balance gives us fabulous data-handling capabilities while maintaining very low operating costs.

Why is the platform so special?

  • It is built on OpenIndiana of course.
  • It leverages ZFS to the max.
  • DDRdrive X1s for logzilla — blazing fast synchronous write performance.
  • Aggregated gigE links utilizing jumbo frames everywhere.
  • Each LACP member connects to a separate physical switch within a virtual chassis. If a physical switch chassis fails, the servers connected to that physical chassis continue to operate on the other chassis at gigE speeds.
  • Full tilt on DRAM: every slot is populated with the maximum-size module.
  • Gobs of 15K RPM SAS disks per data storage device leveraging multiple Sanima-SC/Newisys NDS-2241 storage chassis per server.
  • Since we are on ZFS, we can hot swap the disks to SSDs when suitable enterprise grade devices come available.
  • There is more SSD-based cache (L2ARC) per server than the size of the existing data set, so there is plenty of room to grow in read ops.
  • Obviously, remote out-of-band management: KVM, SMASH interface, with all the bells and whistles.
  • Fully redundant power, everywhere.
  • Should a server fail, the disks owned by that server can be imported on a partner system (thank you SAS!), and a zone booted to continue operating the services provided by the down partner. Genius!
Posted in Uncategorized

How to PXE Boot Systems on LACP (802.3ad) using Juniper Switches

The real trick here is that Juniper supports an option called 'force-up.' Since PXE images are generally small and dumb, most operating systems are unable to leverage LACP during the boot process. Historically this meant the switch had to be reconfigured for straight ethernet switching, then configured back to LACP once the OS was installed. The worst bit is the time lost coordinating between the neteng and sysadmin teams in a large organization (such as AOL). This is no longer an issue. I will illustrate how to avoid this problem.

To get started, we will first add some interfaces to a LAG. Assume the first interface on the server happens to be connected to ge-2/0/1, the second interface is on ge-0/0/1, and both live on vlan100 with 9000 byte jumbo frames. Given this, we use the following instructions for the Juniper:


configure
edit interfaces
set ge-0/0/1 ether-options 802.3ad ae2
set ge-2/0/1 ether-options 802.3ad ae2
set ge-2/0/1 ether-options 802.3ad lacp force-up

set ae2 mtu 9014
set ae2 aggregated-ether-options lacp active
set ae2 aggregated-ether-options periodic fast
set ae2 unit 0 family ethernet-switching vlan members vlan100

As I mentioned before, the real trick here is 'force-up.' Force-up tells the switch to ignore the absence of LACP BPDUs and keep the link up. This is quite handy for PXE boot, but runs some risk if the switch holds the LAG member up while the host is misconfigured. Generally, the convenience outweighs the risk.

Now, your system will PXE boot normally. Below is how to further configure your IllumOS derived system for LACP and jumbo frames. Assuming you have igb interfaces, it looks something like this:


# ifconfig igb0 unplumb
# ifconfig igb1 unplumb
# dladm create-aggr -l igb0 -l igb1 -L passive aggr0
# dladm set-linkprop -p mtu=9000 aggr0
# echo `hostname` > /etc/hostname.aggr0
# svcadm restart network/physical:default

[ technically there is no restart argument for network/physical but it yields the desired action anyway]

Now, let's verify our handiwork…

# dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       igb0         yes          yes  yes  yes  no        no
--          igb1         yes          yes  yes  yes  no        no

Success! Now you are off to the races…

PS: In my example here, I use Juniper EX4200 switches. My ethernet ports from the servers connect to two different physical chassis within a single virtual chassis. This ensures that even if a physical chassis fails, the server will continue to operate at gigE speeds on the surviving member link.

Give me a shout out if this was helpful to you….

Posted in Juniper, OpenIndiana

How to configure Time-Slider/autosnap without using the GUI

This is a straightforward process with a twist. The twist is setting an un-obvious ZFS property. Here are the highlights:

  1. Configure Snapshot Properties in SMF repo
  2. Set super secret ZFS property to enable snapshots
  3. Enable related services
  4. Voila!

Okay, here we go. First, let’s dump the manifest and adjust how many one hour snapshots we are going to hold.

# svccfg export auto-snapshot > /tmp/auto-snapshot.smf
# vi /tmp/auto-snapshot.smf

... locate this stanza for the hourly ...

<instance name='hourly' enabled='true'>
  <property_group name='zfs' type='application'>
    <propval name='interval' type='astring' value='hours'/>
    <propval name='keep' type='astring' value='23'/>
    <propval name='period' type='astring' value='1'/>
  </property_group>
  <property_group name='general' type='framework'>
    <property name='action_authorization' type='astring'/>
    <property name='value_authorization' type='astring'/>
  </property_group>
</instance>

I want to hold three days of hourly snapshots, so I changed the value of 'keep' from 23 to 71. My stanza now looks like this:

<instance name='hourly' enabled='true'>
  <property_group name='zfs' type='application'>
    <propval name='interval' type='astring' value='hours'/>
    <propval name='keep' type='astring' value='71'/>
    <propval name='period' type='astring' value='1'/>
  </property_group>
  <property_group name='general' type='framework'>
    <property name='action_authorization' type='astring'/>
    <property name='value_authorization' type='astring'/>
  </property_group>
</instance>
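The keep values line up with a simple hours-minus-one rule (my reading: the count excludes the snapshot currently being taken). One day gives the stock value of 23, three days gives 71:

```shell
# keep = hours of retention - 1 (assumed: the newest hourly snapshot
# makes up the difference).
echo $(( 1 * 24 - 1 ))   # -> 23 (stock, one day)
echo $(( 3 * 24 - 1 ))   # -> 71 (three days)
```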

Okay, great, we have a custom rule set. Now, let’s import it into the SMF repo.

# svccfg import /tmp/auto-snapshot.smf

With the repo updated, we need to set the super secret ZFS property and then enable the services. One of my pools is named 'data', so to set the auto-snapshot property on it, I execute this command:

# zfs set com.sun:auto-snapshot=true data

This setting propagates down, so snapshots will be created for every file system in the pool. By delegating file systems to users with 'zfs allow -u joe_user data/some/filesystem' users can control which of their file systems get snapshots by maintaining their own com.sun:auto-snapshot properties.

Now, let’s review the services and the auto-snap property.

root@db012:~# svcs -a |egrep "auto-snap|slider"
disabled Nov_21 svc:/application/time-slider/plugin:rsync
disabled Nov_21 svc:/application/time-slider/plugin:zfs-send
disabled Nov_21 svc:/system/filesystem/zfs/auto-snapshot:daily
disabled Nov_21 svc:/system/filesystem/zfs/auto-snapshot:frequent
disabled Nov_21 svc:/system/filesystem/zfs/auto-snapshot:hourly
disabled Nov_21 svc:/system/filesystem/zfs/auto-snapshot:monthly
disabled Nov_21 svc:/system/filesystem/zfs/auto-snapshot:weekly
disabled 20:55:28 svc:/application/time-slider:default

root@db012:~# zfs get com.sun:auto-snapshot data
NAME PROPERTY VALUE SOURCE
data com.sun:auto-snapshot - -

So, all the services are off and the property isn’t set. So let’s fix that up now.

# zfs set com.sun:auto-snapshot=true data
# svcadm enable auto-snapshot:hourly
# svcadm enable auto-snapshot:frequent
# svcadm enable time-slider

Now, let's go check on our handiwork…

root@db012:/tmp# zfs get -r creation data/zones |grep @zfs-auto-snap
data/zones@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard0012a.apsalar.com@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard0012a.apsalar.com/local@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard0012a.apsalar.com/mysql@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com/ROOT@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com/ROOT/zbe@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com/apsalar@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com/local@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard009b.apsalar.com/postgres@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/ROOT@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/ROOT/zbe@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/apsalar@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/local@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/mysql@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/shard012a.apsalar.com/postgres@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
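To sanity-check retention later, you can count snapshots per dataset from output like the above. A portable sketch, with sample lines standing in for the real listing:

```shell
# Count snapshots per dataset: split each line on '@' so the dataset
# name is field 1, then tally. Sample lines stand in for real
# `zfs get -r creation` output.
sample='data/zones@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/a@zfs-auto-snap_daily-2011-12-19-21h05 creation Mon Dec 19 21:05 2011 -
data/zones/a@zfs-auto-snap_hourly-2011-12-19-22h00 creation Mon Dec 19 22:00 2011 -'

echo "$sample" | awk -F@ '{ count[$1]++ } END { for (d in count) print d, count[d] }' | sort
```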

Voila! ¡Lo hicimos!

If for some reason you didn't get hourly snapshots immediately, try restarting time-slider.


# svcadm restart time-slider

Posted in OpenIndiana, ZFS, ZFS Fun Fact