OpenIndiana 151 was released last week

The Release Notes list a number of new features, fixes, and bugs.

In the good news category:

  • KVM – Kernel-based Virtual Machine, which uses the Intel VT extensions. KVM can now run Linux, BSD, or Windows images as guest operating systems without VirtualBox
  • QEMU
  • Notably ZFS gets a new aclmode property
  • Zone administration enhancements
  • svcs gets a new flag to view service log files (YES!)
  • Driver support for Areca ARC-1880 RAID (why not buy LSI?)
  • COMSTAR gets UNMAP support
  • iostat -E is fixed to display serial number

Here is the bad news:

  • mega_sas driver can hang (yikes!)
  • Samba is broken out of the box
  • ZFS time-slider is broken due to a missing zfssnap-roleadd in SMF (you can fix this yourself)
  • Adaptec 6xxx support is missing altogether

Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, OpenIndiana, Uncategorized

Making Seagate 2TB Green drives compatible with your LSI 1068E based controller

So I purchased a number of Seagate 2TB Green drives for my home file server. This new set of drives is to act as a backup to the primary “production” zpool and hold backups of various systems in the house.

The drives are essentially jumperless and advertised to operate in some fully automatic fashion whether connected to 3Gb/s or 6Gb/s topologies. Unfortunately, only about half of my drives were identified by the controller, and the ones that were identified linked up at only 1.5Gb/s. This is piss-poor automation from both the premier manufacturer of storage HBAs and the premier manufacturer of “spinning rust.”

My controller is an LSI 1068E based HBA with 8 external ports (SAS3801E). A quick look on LSI’s site indicated that they have updated firmware that lets my controller interact with 6Gb/s drives. This comes at a small cost: you lose functionality with 1.5Gb/s devices. I don’t have any 1.5Gb/s devices, but my storage chassis is old enough to be rated only for 3.0Gb/s and 1.5Gb/s. Naturally, the rub is that my disk chassis may not be up to 6Gb/s or, worse, may appear to operate and then brown out later. Since qualifying the chassis is a whole other project, I decided to avoid the firmware update and change my controller settings to make it friendlier to 6Gb/s drives.

I rebooted into the HBA CLI and found no way to adjust link settings. Can you say disappointing? Okay, back to the OS. Once OpenIndiana rebooted, I whipped out the Swiss Army knife of legacy LSI products, ‘lsiutil’.

Here is the main menu:

Main menu, select an option:  [1-99 or e/p/w or 0 to quit]

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
22.  Reset bus
23.  Reset target
42.  Display operating system names for devices
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

From here, I selected option 13, “Change SAS IO Unit settings,” then selected all ports, and then set the minimum link speed to 3.0Gb/s.

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 13

SATA Maximum Queue Depth:  [0 to 255, default is 32]
Device Missing Report Delay:  [0 to 2047, default is 0]
Device Missing I/O Delay:  [0 to 255, default is 0]

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     1.5      3.0    Enabled    Disabled  Auto
   1    Enabled     1.5      3.0    Enabled    Disabled  Auto
   2    Enabled     1.5      3.0    Enabled    Disabled  Auto
   3    Enabled     1.5      3.0    Enabled    Disabled  Auto
   4    Enabled     1.5      3.0    Enabled    Disabled  Auto
   5    Enabled     1.5      3.0    Enabled    Disabled  Auto
   6    Enabled     1.5      3.0    Enabled    Disabled  Auto
   7    Enabled     1.5      3.0    Enabled    Disabled  Auto

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit] 8
Link:  [0=Disabled, 1=Enabled, or RETURN to not change]
MinRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change] 1
MaxRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change]
Initiator:  [0=Disabled, 1=Enabled, or RETURN to not change]
Target:  [0=Disabled, 1=Enabled, or RETURN to not change]
Port configuration:  [1=Auto, 2=Narrow, 3=Wide, or RETURN to not change]

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      3.0    Enabled    Disabled  Auto
   1    Enabled     3.0      3.0    Enabled    Disabled  Auto
   2    Enabled     3.0      3.0    Enabled    Disabled  Auto
   3    Enabled     3.0      3.0    Enabled    Disabled  Auto
   4    Enabled     3.0      3.0    Enabled    Disabled  Auto
   5    Enabled     3.0      3.0    Enabled    Disabled  Auto
   6    Enabled     3.0      3.0    Enabled    Disabled  Auto
   7    Enabled     3.0      3.0    Enabled    Disabled  Auto

Persistence:  [0=Disabled, 1=Enabled, default is 1]
Physical mapping:  [0=None, 1=DirectAttach, 2=EnclosureSlot, default is 0]

To verify my handiwork, I enabled the expert menu options (e) and then selected option 68 from the main menu.

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 68

Current Port State
------------------
SAS1068E's links are 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G, 3.0 G, down, down

Software Version Information
----------------------------
Current active firmware version is 01170200 (1.23.02)
Firmware image's version is MPTFW-01.23.02.00-IT
  LSI Logic
x86 BIOS image's version is MPTBIOS-6.18.01.00 (2007.08.08)
FCode image's version is MPT SAS FCode Version 1.00.45 (2007.04.13)

Firmware Settings
-----------------
SAS WWID:                       500605b000b29730
Multi-pathing:                  Disabled
SATA Native Command Queuing:    Enabled
SATA Write Caching:             Enabled
SATA Maximum Queue Depth:       32
Device Missing Report Delay:    0 seconds
Device Missing I/O Delay:       0 seconds
Phy Parameters for Phynum:      0    1    2    3    4    5    6    7
  Link Enabled:                 Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
  Link Min Rate:                3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0
  Link Max Rate:                3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0
  SSP Initiator Enabled:        Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes
  SSP Target Enabled:           No   No   No   No   No   No   No   No
  Port Configuration:           Auto Auto Auto Auto Auto Auto Auto Auto
Target IDs per enclosure:       1
Persistent mapping:             Enabled
Physical mapping type:          None
Target ID 0 reserved for boot:  No
Starting slot (direct attach):  0
Target IDs (physical mapping):  0
Interrupt Coalescing:           Enabled, timeout is 16 us, depth is 4

Persistent Mappings
-------------------
Persistent entry 0 is valid, Bus 0 Target 0 is PhysId 1e605d31927d5f2e
Persistent entry 1 is valid, Bus 0 Target 1 is PhysId 1e605d31808b602e
Persistent entry 2 is valid, Bus 0 Target 16 is PhysId 1e605d31727e6051
Persistent entry 3 is valid, Bus 0 Target 17 is PhysId 1e605d3181705e50
Persistent entry 4 is valid, Bus 0 Target 18 is PhysId 1e605d318c854935
Persistent entry 5 is valid, Bus 0 Target 19 is PhysId 1f605d2e7e7f5e43

Voila. All six disks are present. If yours come up at 1.5Gb/s, try power cycling the disks (in whatever chassis they are in) and running the report again.

If you have trouble finding lsiutil, that is not surprising: LSI has buried it in the download section for its Fibre Channel controllers. Talk about silly. Here is a deep link: LSI_Util

Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, NAS, OpenIndiana, Uncategorized, ZFS

Using OpenIndiana on Dedicated Hardware to Blow the Doors Off of the Amazon Cloud

I wrote this article back in August of 2011 but for whatever reason did not publish it.

So I took a job at a mobile engagement management company (now some seven months ago) as the head of operations (for those of you drinking the Kool-Aid, devops).

We track billions of events, summarize data, and report it along with value added suggestions to optimize the advertising campaigns of our paying customers. The challenge of course is scaling the organization. We hit a brick wall at AWS. AWS was becoming none of the things that were on the bill of sale. It was slooow. It was no longer cheap. It wasn’t reliable and it certainly wasn’t easy given the reliability problems.

One reason scaling was difficult is that the organization was sucked into the cloud. It was just like the USS Enterprise fighting off the planet killing monster in ‘The Doomsday Machine’. Once you get into the Doomsday machine, it is very very hard to get out. The cloud can be the same way.

Look up in the sky, it is a bird, it’s a plane, it’s a cloud thing. You see billboards for it at bus stops (e.g. 2nd & Howard). The Cloud, advertised to leap tall buildings in a single bound, cut your costs, make you toast in the morning, yes the cloud. That undefined, highly variable, cloud thing. What is it? No one can quite tell you because it is just a marketing term.

Isn’t the cloud supposed to be good? That’s what all the talking heads say on TV. Consultants certainly pitch ‘the cloud.’ Even Microsoft would have you believe that the cloud can fix their software. Everything runs better in the cloud, doesn’t it?

I had an idea: I would measure it. Measuring is good, right? Making observations is fundamental to the scientific process. It yields hard data from which intelligent decisions can be made.

What did I find when I ran iostat in the cloud? I found goose eggs.

bash-3.2$ iostat -nMx 5 5
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.1  0.1    0.0    0.0   0   1 c7d1
    0.0    0.0    0.0    0.0  0.1  0.2    0.0    0.0   1   1 c7d2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d2

The cloud didn’t seem very forthcoming in sharing its activities.

I knew I was hammering whatever AWS presented to me as a disk. I certainly didn’t like seeing goose eggs in my normal reporting tools — Yuck.

The cloud can’t be measured with a lot of common tools. Why? It simply does not map well onto low-level reporting tools. Most system-level tools such as iostat, vmstat, or anything else that looks at raw disk I/O are simply broken. The first question that came to mind was: how am I going to measure what is going on in these systems?

Thankfully, two tools did work: fsstat and zpool iostat (DTrace worked of course, but that is a whole other discussion). fsstat at least gave me some insight into what was going on in terms of throughput. To measure throughput of ZFS using fsstat, you can use this incantation: fsstat zfs 1. That will show you the absolute numbers reported during the reporting interval (1 second, in this case). So this was the beginning of enlightenment.

Next I decided to run a test. A simple test. A test designed to stress the storage subsystem and at the same time mimic batch jobs. I used pgbench and the mighty for() loop.

The setup was easy. I created a database with one billion rows (scale factor 10,000) and I set pgbench loose on it. It looked similar to this:

createdb pgbench_1b
pgbench -i -s 10000 pgbench_1b

for i in `seq 1 10000`; do
  pgbench -T 58 -c 30 -j 10 -N pgbench_1b
  sleep 2;
done | tee ~/58s-pgbench_1b.$$

So I let that run for hours. Don’t worry if you don’t speak Postgres: with -N, each pgbench transaction updates one account row, reads it back, and inserts a history row, so my test is write-heavy. Initially, AWS returned 110 TPS, but when I checked back just four hours later, it was only doing about 60 TPS. My database operation was crushing the AWS write-back cache and exposing their real throughput.
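Because the loop tees every pgbench report into a single log file, the per-run numbers can be pulled back out afterwards. Here is a minimal sketch, assuming the log contains pgbench's usual summary lines of the form "tps = N (including connections establishing)". The function name tps_per_run is made up for illustration and is not part of pgbench.

```shell
# Hypothetical helper: print "run-number TPS" for each pgbench run in a log.
# Assumes pgbench summary lines such as:
#   tps = 108.40 (including connections establishing)
# Lines for "excluding connections" are ignored.
tps_per_run() {
    awk '/^tps = / && /including/ { n++; printf "%d %.0f\n", n, $3 }'
}
```

Fed the tee'd log on stdin (for example tps_per_run < ~/58s-pgbench_1b.PID, where PID is whatever $$ expanded to), it prints one line per 58-second run, which makes a slow decline easy to spot.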

I ran a similar setup on some eval gear sitting on my desk, graciously provided by DataOn Storage (with a $10k security deposit). I ran the same sequence of commands. The initial result was 1100 TPS for the first minute. After that, something strange happened. My setup didn’t get slower over time like AWS. It got faster. At four hours, it was doing 3500 TPS.

Making the performance calculation was simple. 3500 / 60 is about 58. The setup on my desk was 58x faster than Amazon? Whaaaat? I was paying big money to Amazon each month and this box valued at $13,500 is 58x faster than one of their instances? Wow. Amazon is screwing me. Is Amazon screwing you?
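For anyone who wants to redo the division with their own numbers, here is a small sketch. The helper name speedup is hypothetical; the inputs are the four-hour TPS figures quoted above.

```shell
# Hypothetical helper: ratio of two TPS figures, to one decimal place.
speedup() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

speedup 3500 60    # desktop eval box vs. AWS at the four-hour mark; prints 58.3
```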

By now you want to know what is under the hood.

1 Intel S5520UR motherboard
2 Intel L5630 40w CPUs
96GB of RAM
1 Newisys NDS-2241 disk shelf with 24 drives configured as 11 mirrors (raid10ish style) and two hot spares
1 Kingston 128GB Value SSD for L2ARC
3 Crucial 128GB M4 for L2ARC
1 Intel x25-m to boot from
1 DDRdrive X1 for a log device
1 LSI 9205-8e SAS controller to talk to the Newisys shelf
1 LSI 9211-8i SAS controller to handle the SSD drives

Running at full speed the setup comes in at about 420 watts. I have had car stereos that used more power than that and they didn’t make me money, but I digress.

Some of you bit monkeys are probably wondering what I think of the Crucial M4. All I can tell you is, for 8k writes, it does 3000 IOPs +/- 2 IOPs very very consistently as reported by filebench. It is probably twice as fast as the Kingston series V at the same price point (normalized for total storage) and much much more predictable. It is MLC and has a 3 year warranty with no wear level restrictions. So far, I like it. That’s all I have on the M4 at this time.

So, why was my setup getting faster while AWS was getting slower? It is all in the cache, baby. ZFS was plumbing my L2ARC with disk blocks containing the b-tree structures and other related metadata required to execute the inserts. When the test first began, the Seagate disks were doing 180 reads/second/disk and about 30 or so writes/second/disk. At four hours, the reads were down to about 20 reads/second/disk but the SSDs were turning in 300-700 reads/second each, while writes to the Seagate drives were up in the 220 writes/second range with some variation. Go L2ARC!

The throughput will likely go higher with a larger L2ARC, as by this time my 192GB read cache was full. I suspect another $200 M4 would boost the writes even more.

So my next question was, “What happens if I go to 2 billion rows?” I mean, who of us collects less data?

To answer this, I created a 2BN row table with pgbench on AWS and set pgbench loose on it. Once again, AWS returned 108 TPS in the first iteration. Then it went as follows:

Minute  TPS including connections  TPS excluding connections
1       108                        108
2       101                        101
3       95                         95
4       70                         71
5       58                         58
6       65                         66

After that, performance hovered somewhere around 60-70 TPS. The results were similar to the 1BN row test but required significantly less time to achieve them.

Which would you rather have, one pimp machine for $13.5k or 58 m1.large instances for about $40k/month?

I bought a cage full of pimp machines and bailed out of AWS. Read how I transferred and maintained synchronization of terabytes of data between AWS on the east coast and a data center in the San Francisco Bay Area.

Posted in AWS, Benchmarks, IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, OpenIndiana, Postgres, SSD cache, ZFS

My Last Day at AOL

Today was my last day at AOL. It was a surprisingly long and uneventful day. However, I was unexpectedly flagged as a ‘Legal Hold.’ Apparently someone over there thought “I knew too much” or had too much access. At any rate, the PTBs wanted a chance to archive my two desktops and my MacBook Pro. The PTBs saw fit to gift my MacBook to me, at least after it goes through the wringer. While I will certainly miss AOL and all the wonderful people (and the benefits a big company has to offer [like an office with a door and a window]), the new gig is at an agile startup which has already been sold on using OpenIndiana, so at least I won’t have that uphill battle. Since I have been an advocate of the right tool for the right job, I will probably be introducing some Linux in the OI environment. I want to give the Zenoss monitoring system a whirl and the source code looked semi-unfriendly to Solaris, but hey, that’s what VirtualBox is for.

The new role should provide more opportunity to write about the technical details of managing Solaris derived systems. While we likely won’t be using the built-in load balancer that became available in b147, I would like to focus on network engineering aspects a little more and pause on storage. So much has been done and written about storage already that I feel I am a bit late to the party. Cursory searches have not yielded any useful information on Solaris/OpenIndiana load balancing. I intend to look into it more deeply and write an article about it soon.

That’s all for now.

j.

Posted in Uncategorized | Leave a comment

Intel SSD 710 Endurance Results & SSD Rants by Artur Bergman

 

Here is some leaked data on the new Intel 710 SSD. It is good news for those using hybrid storage pools, such as the ZFS pools available on OpenIndiana, Solaris, and Nexenta. If your pockets are deep enough, you could even create zpools consisting of only model 710 SSDs with a reasonable expected life span. The 710 SSDs have a super cap, which makes them write safe.

The new hardware has been endurance tested with pre-release firmware. The largest model can support writing 300GiB/day in 16KB blocks for 19 years! I am focusing on 16KB because this is the size of an InnoDB write in MySQL. This represents a significant increase in write-erase cycles as compared to previous models.

Algebra tells us that it should be able to support writing up to 1.2TiB/day for three years using 16KB blocks. This makes it a viable direct disk replacement in database servers. Serving as L2ARC in ZFS it will last a very long time. I like the sound of this.
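The algebra here is a straight proportion, sketched below under the assumption that endurance is a fixed budget of total bytes written, so a shorter service life allows proportionally more writes per day. The helper name tib_per_day is hypothetical. Under that assumption, the 16KB figure (300GiB/day for 19 years) works out to roughly 1.86TiB/day over three years, so the 1.2TiB/day quoted above appears to leave some headroom.

```shell
# Hypothetical helper: scale a rated daily write volume to a shorter service life.
# Assumption: endurance is a fixed total-bytes-written budget (linear scaling).
# Args: rated GiB/day, rated lifetime in years, target lifetime in years.
# Prints the allowable write volume in TiB/day.
tib_per_day() {
    awk -v gib="$1" -v rated="$2" -v target="$3" \
        'BEGIN { printf "%.2f\n", gib * rated / target / 1024 }'
}

tib_per_day 300 19 3    # 16KB blocks, three-year life; prints 1.86
```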

Intel 710 series SSDs use 25 nm MLC NAND flash, backed by 64 MB of DRAM for write cache. They come in capacities of 100, 200, and 300 GB. Like many other SSDs, the transfer rates promise to be in the 270 MB/s range for reads and 210 MB/s for writes.

Estimated Life (years) at 300 GiB/day
Write Block Size    4K    8K    16K   32K
300 GiB Capacity    11    13    19    35
240 GiB Capacity    13    26    35    62

Another upside is performance.

Using 16K blocks, the device turns out 1000 IOPS for random writes. This provides about 16.7x the write performance of a 7200 RPM hard drive [calculated as: 1000 IOPS / (7200 RPM / 60 s/minute / 2) ≈ 16.667]

IOPS (100% random, 100% writes, QD=32)
Write Block Size    4KB    8KB    16KB   32KB
300GiB Capacity     2100   1100   1000   1300
240GiB Capacity     2600   2200   2100   2700

 

Intel 710 SSD performance and endurance test results

Posted in SSD cache, Uncategorized, ZFS

All About Adaptec Arcconf, ZFS, and OpenIndiana

This is a quick article about using Adaptec RAID controllers with OpenIndiana. In this case, I am using an Adaptec 5805. The driver appears to handle hardware-based RAID as well as JBOD. This is an improvement over the Oracle-shipped driver in Solaris or OpenSolaris. OpenIndiana does not ship with the StorMan package, which provides arcconf, so you will have to fetch StorMan from Adaptec yourself and install it.

Like most RAID command line tools, arcconf has a myriad of options to support your ZFS NAS server. We are going to cover the typical operations listed below. My hope is that these examples give readers a feel for how to use arcconf.

  1. How to destroy all JBODs
  2. How to create a hardware RAID10 logical disk
  3. How to check status of a logical disk
  4. Using iostat to check the existence of a logical drive
  5. How to lay down a UFS file system on a logical disk
  6. How to destroy a logical disk
  7. How to get a list of all physical drives
  8. How to turn all physical drives into JBODs suitable for ZFS
  9. How to create a hardware RAID6 logical disk
  10. How to verify the status of a logical disk
  11. How to start a verify_fix operation
  12. How to extract the event log from the RAID controller in XML format
  13. How to extract the event log from the RAID controller in tabular format
  14. How to extract the device log from the RAID controller in XML format
  15. How to extract the device log from the RAID controller in tabular format
  16. How to convert time stamps in the RAID controller to real time
  17. How to rescan the Bus for new or removed drives
  18. How to convert a RAID10 volume to RAID5 (I never recommend using RAID5)
  19. Replace a failed JBOD disk

Destroy all JBODS

 

root@vs-lm1577:/opt/StorMan# ./arcconf delete 1 JBOD ALL noprompt
Controllers found: 1

All data in JBOD 0,8 will be lost.
Deleting: JBOD 0,8

All data in JBOD 0,9 will be lost.
Deleting: JBOD 0,9

All data in JBOD 0,10 will be lost.
Deleting: JBOD 0,10

All data in JBOD 0,11 will be lost.
Deleting: JBOD 0,11

// there are 24 disks on box, we'll stop here as by now you get the idea

Command completed successfully.

Create RAID10 Volume

I recommend using ZFS for your mirrors instead of the RAID controller. I am sure you have your reasons for wanting to do this, but ZFS is better than hardware RAID.

root@vs-lm1577:/opt/StorMan# ./arcconf create 1 LOGICALDRIVE \
 Name TestVolume Max 10 0,8 0,9 0,10 0,11 noprompt
Controllers found: 1

Creating logical device: TestVolume

Command completed successfully.

 

Check Status of Existing Volume

 

Check status of raid volume 0

root@vs-lm1577:/opt/StorMan# ./arcconf getconfig 1 LD 0
Controllers found: 1
----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
Logical device name                      : TestVolume
RAID level                               : 10
Status of logical device                 : Optimal
Size                                     : 1906678 MB
Stripe-unit size                         : 256 KB
Read-cache mode                          : Enabled
MaxIQ preferred cache setting            : Disabled
MaxIQ cache setting                      : Disabled
Write-cache mode                         : Enabled (write-back)
Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
Partitioned                              : Yes
Protected by Hot-Spare                   : No
Bootable                                 : Yes
Failed stripes                           : No
Power settings                           : Disabled
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Group 0, Segment 0                       : Present (0,8)      WD-WMATV4097675
Group 0, Segment 1                       : Present (0,9)      WD-WMATV4111023
Group 1, Segment 0                       : Present (0,10)      WD-WMATV4095249
Group 1, Segment 1                       : Present (0,11)      WD-WMATV4109136

 

Using iostat to Check Existence of a Logical Drive

 

In OpenIndiana, iostat -En sees the RAID10 volume

root@vs-lm1577:/opt/StorMan# iostat -En

c1t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: Adaptec  Product: RAID 5805        Revision: V1.0 Serial No:
Size: 1999.31GB <1999307276288 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

 

How to Lay Down a UFS File System on a Logical Disk

root@vs-lm1577:/opt/StorMan# newfs /dev/dsk/c1t0d0s0
newfs: construct a new file system /dev/rdsk/c1t0d0s0: (y/n)? y
Warning: 304 sector(s) in last cylinder unallocated
/dev/rdsk/c1t0d0s0:     3904880336 sectors in 635560 cylinders of 48 tracks, 128 sectors
1906679.9MB in 4445 cyl groups (143 c/g, 429.00MB/g, 448 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 878752, 1757472, 2636192, 3514912, 4393632, 5272352, 6151072, 7029792,
7908512,
Initializing cylinder groups:
...............................................................................
.........
super-block backups for last 10 cylinder groups at:
3896557984, 3897436704, 3898315424, 3899194144, 3900072864, 3900951584,
3901830304, 3902709024, 3903587744, 3904466464
root@vs-lm1577:/opt/StorMan# mkdir /tmp/a
root@vs-lm1577:/opt/StorMan# mount /dev/dsk/c1t0d0s0 /tmp/a
root@vs-lm1577:/opt/StorMan# df -h /tmp/a
Filesystem            Size  Used Avail Use% Mounted on
/dev/dsk/c1t0d0s0     1.9T  257M  1.8T   1% /tmp/a

Destroy a Logical Disk

root@vs-lm1577:/opt/StorMan# ./arcconf delete 1 logicaldrive 0 noprompt
Controllers found: 1

WARNING: logical device 0 may contain a partition.
All data in logical device 0 will be lost.
Deleting: logical device 0 ("TestVolume")

Command completed successfully.

Get list of all physical drives

root@vs-lm1577:/opt/StorMan# ./arcconf  getconfig 1 PD | \
grep "Reported Channel,Device(T:L)" |grep -v 0:0 | \
cut -d: -f 3 |cut -d'(' -f 1

0,8
0,9
0,11
0,12
0,13
0,14
0,15
0,16
0,17
0,18
0,19
0,21
0,22
0,23
0,24
0,25
0,26
0,27
0,28
0,29
0,31

How to turn all physical drives into JBODs suitable for ZFS

root@vs-lm1577:/opt/StorMan# list=`./arcconf  getconfig 1 PD |\
 grep "Reported Channel,Device(T:L)" |grep -v 0:0 |cut -d: -f 3 | \
cut -d'(' -f 1 | sed 's/,/ /g'` && ./arcconf create 1 jbod $list

Controllers found: 1
Created JBOD: 0,8
Created JBOD: 0,9
Created JBOD: 0,11
Created JBOD: 0,12
Created JBOD: 0,13
Created JBOD: 0,14
Created JBOD: 0,15
Created JBOD: 0,16
Created JBOD: 0,17
Created JBOD: 0,18
Created JBOD: 0,19
Created JBOD: 0,21
Created JBOD: 0,22
Created JBOD: 0,23
Created JBOD: 0,24
Created JBOD: 0,25
Created JBOD: 0,26
Created JBOD: 0,27
Created JBOD: 0,28
Created JBOD: 0,29
Created JBOD: 0,31

Command completed successfully.

How to create a hardware RAID6 logical disk

root@vs-lm1577:/opt/StorMan# ./arcconf create 1 logicaldrive max 6 $list
Controllers found: 1

Do you want to add a logical device to the configuration?
Press y, then ENTER to continue or press ENTER to abort: y

Creating logical device: Device 0

Command completed successfully.

How to start a verify_fix operation

Here is what to do when your array has an ‘impacted’ status. Basically this means the array build process has not completed. You can verify this with the following command:

root@vs-lm1577:/opt/StorMan# ./arcconf getstatus 1
Controllers found: 1
Logical device Task:
Logical device                 : 0
Task ID                        : 100
Current operation              : Build/Verify
Status                         : In Progress
Priority                       : High
Percentage complete            : 0

Here we see that while we created the giant RAID6 volume, there is still background initialization going on. The volume is healthy, but performance is degraded while it completes the initialization.

This is how you start a build/verify check on your volume. This fails in my example as the job is already running.

root@vs-lm1577:/opt/StorMan# ./arcconf task start 1 logicaldrive 0 verify_fix noprompt
Controllers found: 1
Task 'Build/Verify' is already running on this device.  Aborting

Command aborted.

How to extract the event log from the RAID controller in XML format

root@vs-lm1577:/opt/StorMan# ./arcconf getlogs 1 event
Controllers found: 1

Command completed successfully.

How to extract the event log from the RAID controller in tabular format

root@vs-lm1577:/opt/StorMan# ./arcconf getlogs 1 device tabular
Controllers found: 1

ControllerLog
controllerID ..................... 0
type ............................. 0
time ............................. 1308362117
version .......................... 3
tableFull ........................ false

How to convert time stamps in the RAID controller to real time

root@vs-lm1577:/opt/StorMan# perl -e 'print scalar localtime(shift),"\n";' 1308362117
Fri Jun 17 18:55:17 2011

How to rescan the Bus for new or removed drives

root@vs-lm1577:/opt/StorMan# ./arcconf rescan 1
Controllers found: 1

Command completed successfully.

How to convert a RAID10 volume to RAID5

Here we are using the above RAID10. First, let’s view our RAID10 volume:

root@vs-lm1577:/opt/StorMan# ./arcconf getconfig 1 ld 0
Controllers found: 1
----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
   Logical device name                      : TestVolume
   RAID level                               : 10
   Status of logical device                 : Optimal
   Size                                     : 1906678 MB
   Stripe-unit size                         : 256 KB
   Read-cache mode                          : Enabled
   MaxIQ preferred cache setting            : Disabled
   MaxIQ cache setting                      : Disabled
   Write-cache mode                         : Enabled (write-back)
   Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
   Partitioned                              : No
   Protected by Hot-Spare                   : No
   Bootable                                 : Yes
   Failed stripes                           : No
   Power settings                           : Disabled
   --------------------------------------------------------
   Logical device segment information
   --------------------------------------------------------
   Group 0, Segment 0                       : Present (0,8)      WD-WMATV4097675
   Group 0, Segment 1                       : Present (0,9)      WD-WMATV4111023
   Group 1, Segment 0                       : Present (0,10)      WD-WMATV4095249
   Group 1, Segment 1                       : Present (0,11)      WD-WMATV4109136

Yep, it is still there.

Now, let’s convert it to RAID5

root@vs-lm1577:/opt/StorMan# ./arcconf modify 1 from 0 to max \
5 0 8 0 9 0 10 0 11 noprompt

Controllers found: 1
Reconfiguring logical device: TestVolume

Command completed successfully.

and now verify…

root@vs-lm1577:/opt/StorMan# ./arcconf getconfig 1 ld 0
Controllers found: 1
----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
Logical device name                      : TestVolume
RAID level                               : 5
Status of logical device                 : Logical device Reconfiguring
Size                                     : 2860022 MB
Stripe-unit size                         : 256 KB
Read-cache mode                          : Enabled
MaxIQ preferred cache setting            : Disabled
MaxIQ cache setting                      : Disabled
Write-cache mode                         : Enabled (write-back)
Write-cache setting                      : Enabled (write-back) when protected by battery/ZMM
Partitioned                              : No
Protected by Hot-Spare                   : No
Bootable                                 : Yes
Failed stripes                           : No
Power settings                           : Disabled
--------------------------------------------------------
Logical device segment information
--------------------------------------------------------
Segment 0                                : Present (0,8)      WD-WMATV4097675
Segment 1                                : Present (0,9)      WD-WMATV4111023
Segment 2                                : Present (0,10)      WD-WMATV4095249
Segment 3                                : Present (0,11)      WD-WMATV4109136

Yep, it worked as advertised.
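
Note that the status line above still reads "Logical device Reconfiguring", so the conversion keeps running in the background. A minimal sketch of how one might wait for it to finish is below. The function names, the ARCCONF variable, and the 60 second poll interval are illustrative assumptions, not part of arcconf itself; the only thing taken from the output above is the "Status of logical device" line.

```shell
#!/bin/bash
# ld_status: read "arcconf getconfig <controller> ld <n>" output on stdin
# and print only the value of the "Status of logical device" line.
ld_status() {
    sed -n 's/^ *Status of logical device *: *//p'
}

# wait_for_reconfigure: poll until the status no longer mentions
# "Reconfiguring", then print the final status.
# ARCCONF and POLL_SECONDS are hypothetical knobs for this sketch.
wait_for_reconfigure() {
    local ctrl=$1 ld=$2 status
    while :; do
        status=$("${ARCCONF:-./arcconf}" getconfig "$ctrl" ld "$ld" | ld_status)
        case $status in
            *Reconfiguring*) sleep "${POLL_SECONDS:-60}" ;;
            *) echo "$status"; return 0 ;;
        esac
    done
}
```

With the functions loaded, something like `wait_for_reconfigure 1 0` would block until controller 1, logical device 0 reports a different status.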

Replace a Failed JBOD Disk

See my previous article on how to replace a failed JBOD disk.


 

Posted in Adaptec, arcconf, OpenIndiana, UNIX CLI, ZFS | 3 Comments

ZFS: How to check disk based write cache status

 

One of my OpenIndiana origin servers suddenly started writing very slowly. The cause is still under investigation. One thing I wanted to verify was that the on-board disk write cache was enabled, which is required for ZFS to do its job well.

Normally, it is rather painful to determine the disk cache state. You have to run the format command with the -e switch, then select the cache menu, then the write-cache menu, then the display option. When you are done with all that, you have to exit all those menus, select another disk, and iterate through your topology. If you have any reasonable number of disks, this type of verification run by hand operates at ‘ludicrous speed.’

I put a very simple tool together to probe the status of the disk write cache on a per controller basis. The tool is not fully automagic. The operator will still have to determine which controller to scan and pass it as an argument.

However, it works like this…

root@vs-lm1577:/home/jmatthew# ./format.script c1
trying disk c1t72d0  Write Cache is enabled
trying disk c1t73d0  Write Cache is enabled
trying disk c1t74d0  Write Cache is enabled
trying disk c1t75d0  Write Cache is enabled
trying disk c1t76d0  Write Cache is enabled
trying disk c1t77d0  Write Cache is enabled
trying disk c1t78d0  Write Cache is enabled
trying disk c1t79d0  Write Cache is enabled
trying disk c1t80d0  Write Cache is enabled
trying disk c1t81d0  Write Cache is enabled
trying disk c1t82d0  Write Cache is enabled
trying disk c1t83d0  Write Cache is enabled
trying disk c1t84d0  Write Cache is enabled
trying disk c1t85d0  Write Cache is enabled
trying disk c1t86d0  Write Cache is enabled
trying disk c1t87d0  Write Cache is enabled
trying disk c1t88d0  Write Cache is enabled
trying disk c1t89d0  Write Cache is enabled
trying disk c1t90d0  Write Cache is enabled
trying disk c1t91d0  Write Cache is enabled
trying disk c1t92d0  Write Cache is enabled
trying disk c1t93d0  Write Cache is enabled
trying disk c1t94d0  Write Cache is enabled
trying disk c1t95d0  Write Cache is enabled

 
And of course, here is the source for your viewing pleasure…
 


#!/bin/bash
#
# written by jason matthews
# jason_\@_broken.net
#
# kid tested, mother approved
#
# this script scans disks associated to a storage controller
# and displays the status of the disk based on board write cache.
# At time of writing, this script works with all modern Solaris,
# OpenSolaris, OpenIndiana, IllumOS type distributions.
#
# This script is offered with no warranty, expressed or implied.
# Operate at your own risk
#
# The script takes one argument, the controller number such as
# c1, c2, c3, etc
#
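# usage: ./format.script c1

if [ -z "$1" ]; then
    echo "usage: $0 controller (e.g. c1)"
    exit 1
fi

# discard stderr (closing it outright can make commands fail on write)
exec 2>/dev/null

controller=$1
# anchor the match so that c1 does not also match c10, c11, ...
disk_list=$(iostat -En | grep "^${controller}[td]" | cut -d" " -f 1)

# menu sequence fed to format: cache, write cache, display, then quit out
printf 'cache\nwrite\ndisplay\nq\nq\nq\n' > /tmp/command-sequence.txt

for disk in $disk_list; do
    echo -n "trying disk $disk  "
    format -f /tmp/command-sequence.txt -e "$disk" | grep -i "write cache is"
done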
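
Since the operator still has to pick the controller by hand, here is a rough sketch of how the controller names could be derived from the device names instead. The helper name list_controllers is made up for illustration, and it assumes device names of the usual cXtYdZ or cXdZ form in the first column of iostat -En.

```shell
#!/bin/bash
# list_controllers: read device names (one per line, e.g. c1t72d0) on stdin
# and print each controller prefix (c1, c2, ...) exactly once.
# Lines that do not look like cXtYdZ or cXdZ are ignored.
list_controllers() {
    sed -n 's/^\(c[0-9][0-9]*\)[td][0-9].*/\1/p' | sort -u
}
```

A possible way to use it: `for c in $(iostat -En | cut -d" " -f1 | list_controllers); do ./format.script "$c"; done`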
Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, OpenIndiana, Performance, ZFS

The Art of Parallelization with GNU xargs and pssh

As you can imagine, running a for() loop on 1000+ systems is fraught with potential issues and is very, very slow. Running commands in parallel greatly speeds things up and keeps normal network timeouts from bringing your work to a grinding halt.

I typically work in environments where I manage thousands of systems (at present, they outnumber me 1564 to 1). I always use a configuration management system such as CFengine, Puppet, or Chef to manage the systems. However, there is still an occasional need to run one-off commands.

Generally, my first approach is to use GNU xargs (often named gxargs on standard Unix systems). There is a handy -P switch that allows for parallelization.

Here is an example of how to run a simple command locally in parallel.

seq 1200 1299 | xargs -P 32 -I NUMBER mkdir NUMBER

Using this incarnation, we create 100 directories, keeping 32 instances running at a time until completion. Awesome, now we can apply this to ssh and run lots of commands remotely, right?
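
Applied to ssh, the same idea might look something like the sketch below. The function name fan_out and the RSH variable are illustrative only. BatchMode and ConnectTimeout are standard OpenSSH options, used here so that a dead host fails fast instead of prompting. If GNU xargs is installed as gxargs on your system, substitute accordingly.

```shell
#!/bin/bash
# Sketch: run one remote command on every host in a list, N at a time.
# RSH and fan_out are hypothetical names; RSH can be overridden for a dry run.
RSH=${RSH:-"ssh -o BatchMode=yes -o ConnectTimeout=10"}

fan_out() {
    local hostlist=$1 jobs=$2
    shift 2
    export RSH
    # xargs substitutes each line of the host list for HOST.
    # sh -c then receives the host as $1 and the remote command as the rest.
    xargs -P "$jobs" -I HOST sh -c '
        host=$1; shift
        $RSH "$host" "$@"
    ' sh HOST "$@" < "$hostlist"
}
```

For example, `fan_out /tmp/hostlist 32 uptime` would run uptime on every host in /tmp/hostlist, 32 at a time, with all output landing on the terminal in whatever order the hosts answer.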

While I often use this invocation for simple ssh commands, I find that it falls short for running more complex operations, particularly when you need to maintain coherent logs for later review.

When you do this, the order of the output is not guaranteed. In fact, it is just a jumble of lines, and if the commands you are running have multiple output lines, it is downright unintelligible.

So what can you do about this if you need to maintain coherent log files from each machine?

In a case like this, I sometimes use pssh. pssh takes a number of arguments including (-h) a path to a file containing target hostnames or IP addresses, (-o) a path to a directory which holds log files of the data sent to standard out, (-t) a timeout, and (-p) the number of concurrent threads, among others.

[jmatthew@vs-lm960 pssh-1.4.3]# pssh -h /tmp/hostlist -o /tmp/outfiles hostname
[1] 00:04:46 [SUCCESS] vs-lm496 22
[2] 00:04:46 [SUCCESS] vs-lm1204 22
[3] 00:04:46 [SUCCESS] vs-lm1203 22
[4] 00:04:46 [SUCCESS] vs-lm1201 22
[5] 00:04:46 [SUCCESS] vs-lm1202 22
[6] 00:04:46 [SUCCESS] vs-lm1200 22
[root@vs-lm960 pssh-1.4.3]# ls -l /tmp/outfiles
total 24
drwxr-xr-x  2 jmatthew aolusers 160 May 27 00:04 ./
drwxrwxrwt 10 jmatthew aolusers 1440 May 27 00:04 ../
-rw-r--r--  1 jmatthew aolusers 25 May 27 00:04 vs-lm1200
-rw-r--r--  1 jmatthew aolusers 25 May 27 00:04 vs-lm1201
-rw-r--r--  1 jmatthew aolusers 25 May 27 00:04 vs-lm1202
-rw-r--r--  1 jmatthew aolusers 25 May 27 00:04 vs-lm1203
-rw-r--r--  1 jmatthew aolusers 25 May 27 00:04 vs-lm1204
-rw-r--r--  1 jmatthew aolusers 24 May 27 00:04 vs-lm496
[jmatthew@vs-lm960 pssh-1.4.3]# cat /tmp/outfiles/vs-lm1200
vs-lm1200.websys.aol.com

Right away we see that pssh allows us to run commands in parallel and maintain logs which can be reviewed later. This is a definite advantage when working over multiple systems.
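
One simple review step is to check which hosts never produced a log. The sketch below assumes the host list contains bare hostnames, one per line, and that the output files are named after the hosts as in the listing above. The function name report_missing is made up for this example.

```shell
#!/bin/bash
# report_missing: print every host from the host list that has no
# non-empty output file in the pssh output directory.
report_missing() {
    local hostlist=$1 outdir=$2 host
    while read -r host; do
        [ -z "$host" ] && continue            # skip blank lines
        [ -s "$outdir/$host" ] || echo "$host"
    done < "$hostlist"
}
```

Running `report_missing /tmp/hostlist /tmp/outfiles` would then list the hosts worth a second look. Keep in mind that a command which legitimately prints nothing also leaves an empty file.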

At the time of writing, the latest source code can be downloaded from Google Tools.

Hopefully this saves you as much time as it saves me.

Posted in Parallelization, Performance, UNIX CLI | 1 Comment

Adjusting disk time outs on errors

Here is a very interesting post about adjusting disk timeouts to minimize disk subsystem brownouts on your favorite OpenIndiana, Solaris, or Nexenta based system.

The techniques described can be applied to SATA drives as well, as it boils down to tweaks in the sd driver config file. The author indicates that SATA drives may need longer timeouts, but I am not yet convinced of that. Discussion regarding the topic is encouraged. What do you know about the topic?

Posted in IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, OpenIndiana, ZFS

Simple COMSTAR iSCSI, FCoE, FC config view

Someone reached out to me indicating they were having problems masking their iSCSI LUNs. It occurred to me that there is no simplified means to quickly view all the iSCSI related STMF configuration data. I don’t count dumping the STMF database and parsing the XML as simplified.

Instead, I put this script together in a couple of minutes. It quickly dumps the highlights in a human readable format. It should help speed up configuration validation.

Use in good health.

j.

#!/bin/bash
# jason _\@_ broken.net
# 20May2011 - Kid tested, mother approved

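echo "--- List View and Associated Host Groups IQNs ---"
for lu in $(stmfadm list-lu | cut -d" " -f3); do
   # run list-view once and reuse the output for display and parsing
   view=$(stmfadm list-view -l "$lu")
   echo "$view"
   host_group=$(echo "$view" | grep "Host group" | cut -d: -f 2)
   # $host_group is left unquoted on purpose: word splitting trims the
   # leading blank that cut leaves behind
   client_iqn=$(stmfadm list-hg -v $host_group | grep -v "Host Group")
   echo Client IQN $client_iqn

   echo
done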

echo --- Target group config ----

stmfadm list-tg -v
echo

echo --- LU List \(verbose\) ---
stmfadm list-lu -v

Here is the sample output from my home server.

root@caprica:~# ./show-comstar-iscsi-config.sh
--- List View and Associated Host Groups IQNs ---
View Entry: 0
    Host group   : ragnarok
    Target group : ragnarok
    LUN          : 0
Client IQN Member: iqn.1991-05.com.microsoft:ragnarok

View Entry: 0
    Host group   : hasone
    Target group : hasone
    LUN          : 0
Client IQN Member: iqn.1991-05.com.microsoft:hasone

View Entry: 0
    Host group   : cylon
    Target group : cylon
    LUN          : 0
Client IQN Member: iqn.1991-05.com.microsoft:cylon

View Entry: 0
    Host group   : cylon
    Target group : cylon
    LUN          : 1
Client IQN Member: iqn.1991-05.com.microsoft:cylon

--- Target group config ----
Target Group: hasone
        Member: iqn.1986-03.com.sun:02:db5c7e1a-aace-e5c5-9d5c-b4694339bdf1
Target Group: ragnarok
        Member: iqn.1986-03.com.sun:02:8b68bc8d-5a7b-cd0f-ff0d-8b5d63e79cd5
Target Group: cylon
        Member: iqn.1986-03.com.sun:02:8da9779e-ee50-6287-d7da-97b2928928ea

--- LU List (verbose) ---
LU Name: 600144F0080027461FAE4DA3E3EB0001
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/data/iscsi/ragnarok/backup000
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/data/iscsi/ragnarok/backup000
    Meta File         : not set
    Size              : 2146409906176
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : OI
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
LU Name: 600144F0080027B4FACE4DAA70AB0001
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/data/iscsi/hasone/backup000
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/data/iscsi/hasone/backup000
    Meta File         : not set
    Size              : 2146409906176
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : OI
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
LU Name: 600144F0080027B4FACE4DB911710002
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/rdsk/data/iscsi/cylon/data000
    View Entry Count  : 1
    Data File         : /dev/zvol/rdsk/data/iscsi/cylon/data000
    Meta File         : not set
    Size              : 2096103424
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : OI
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Enabled
    Access State      : Active
LU Name: 600144F0080027B72E824DC3A2A60001
    Operational Status: Online
    Provider Name     : sbd
    Alias             : /dev/zvol/dsk/data/ragnarok-data-clone-for-cylon
    View Entry Count  : 1
    Data File         : /dev/zvol/dsk/data/ragnarok-data-clone-for-cylon
    Meta File         : not set
    Size              : 2146409906176
    Block Size        : 512
    Management URL    : not set
    Vendor ID         : OI
    Product ID        : COMSTAR
    Serial Num        : not set
    Write Protect     : Disabled
    Writeback Cache   : Disabled
    Access State      : Active
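
The verbose LU list gets long quickly. As a rough sketch, the output can be boiled down to one line per LU (name, data file, size, writeback cache state), which makes an odd one out such as the last LU above easier to spot. The function name summarize_lus is illustrative, and it assumes the field labels shown in the sample output and an awk that accepts a regular expression as the field separator (nawk or gawk).

```shell
#!/bin/bash
# summarize_lus: read "stmfadm list-lu -v" output on stdin and print
# one line per LU: name, data file, size in bytes, writeback cache state.
summarize_lus() {
    awk -F': *' '
        /^LU Name/ {
            # a new LU starts: flush the previous one, if any
            if (name != "") print name, file, size, wc
            name = $2; file = ""; size = ""; wc = ""
        }
        /^ *Data File/       { file = $2 }
        /^ *Size/            { size = $2 }
        /^ *Writeback Cache/ { wc = $2 }
        END { if (name != "") print name, file, size, wc }
    '
}
```

Usage would be along the lines of `stmfadm list-lu -v | summarize_lus`.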
Posted in COMSTAR, IllumOS, OpenIndiana, OpenSolaris, Nexenta, & Solaris, iSCSI, NAS, OpenIndiana, SAN, Uncategorized, ZFS | 1 Comment