For some time my production systems have been plagued with intermittent slow login performance over ssh. In day to day operations, it was rarely an issue, except for a couple of zones where it was always persistent. It really seemed to bite us the most when we tried to deploy mass configuration changes.
Naturally, I looked at the usual suspects.
- DNS lookups [LookupClientHostnames, VerifyReverseMapping]
- SSH login DoS protection [MaxStartups]
- Kerberos (mis)configuration [GSSAPIAuthentication]
- DNS Server Timeouts
- Network drops casing issues for any of these services
- Unintentional interactions in PAM
None of those panned out. I spent hours trying to prove the problem was one of those features and it was completely futile. So, I resorted to denial. Maybe the problem was just in my head and no one else would notice. Denial never works. Just ask the tax man.
The other night some cronjobs that ran on one server but executed work on a bunch of other servers failed to restart/pstart some processes. For whatever reason, my interest in solving the problem was renewed.
Here are the symptoms of my problem:
root@heimdall:~# ssh -v web003
Sun_SSH_1.5, SSH protocols 1.5/2.0, OpenSSL 0x009080ff
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: Rhosts Authentication disabled, originating port will not be trusted.
debug1: ssh_connect: needpriv 0
debug1: Connecting to web003 [W.X.Y.Z] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/identity type -1
debug1: identity file /root/.ssh/id_rsa type 1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: Remote protocol version 2.0, remote software version Sun_SSH_1.5
debug1: match: Sun_SSH_1.5 pat Sun_SSH_1.5*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-Sun_SSH_1.5
debug1: use_engine is 'yes'
debug1: pkcs11 engine initialized, now setting it as default for RSA, DSA, and symmetric ciphers
debug1: pkcs11 engine initialization complete
debug1: Failed to acquire GSS-API credentials for any mechanisms (No credentials were supplied, or the credentials were unavailable or inaccessible
)
debug1: SSH2_MSG_KEXINIT sent
At this point, the system hangs for seconds to minutes. In order to try figure out what was going on the server side i used truss.
Since web003 is a regular zone, I chose to work from the global zone. The first thing I did was find the PID of the parent sshd process on the web003 zone.
root@www002:~# ps -efZ |grep ssh |grep web003
web003.a root 13880 3502 0 11:30:12 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13967 13965 0 11:30:55 ? 0:00 /usr/lib/ssh/sshd
web003.a root 3502 1 0 18:28:06 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13881 13880 0 11:30:12 ? 0:00 /usr/lib/ssh/sshd
web003.a root 13965 3502 0 11:30:55 ? 0:00 /usr/lib/ssh/sshd
There it is. PID 3502 owned by init. Now, we’ll use truss to see what is going on.
root@www002# truss -pedf 3502
[some truss output removed for interest of brevity]
14426: 30.1188 mmap(0x00000000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFE820000
14426: 30.1188 memcntl(0xFE830000, 71660, MC_ADVISE, MADV_WILLNEED, 0, 0) = 0
14426: 30.1198 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, 0x00000000, SOV_DEFAULT) = 5
14426: 30.1199 brk(0x080EA228) = 0
14426: 30.1200 brk(0x080EC228) = 0
3502: pollsys(0x08047470, 2, 0x00000000, 0x00000000) (sleeping...)
14426: connect(5, 0x08047180, 16, SOV_DEFAULT) (sleeping...)
14424: pollsys(0x080453B0, 1, 0x00000000, 0x00000000) (sleeping...)
Now we see that the process hangs at this connect call. What in the world is this connect() call doing? Why is ssh connecting to something? Is this a DNS lookup over TCP? Could it be something else? I had to investigate this, so I changed the options to truss to get the arguments passed to connect().
root@www002# truss -pdef -tall -v connect 3502
[truss data removed]
14970: 21.6030 brk(0x080EC228) = 0
3502: pollsys(0x08047470, 2, 0x00000000, 0x00000000) (sleeping...)
14968: pollsys(0x080453B0, 1, 0x00000000, 0x00000000) (sleeping...)
14970: connect(5, 0x08047180, 16, SOV_DEFAULT) (sleeping...)
14970: AF_INET name = 127.0.0.1 port = 30003
Okay, now we see that ssh is trying to connect to port 30003 on localhost. That is useful information, but why is it doing that? A quick check of /etc/services revealed nothing — which seems like an omission. Googling port 30003 eventually pointed me to a software package called Trousers which is “Trusted Platform Module” or TPM. What in the world is it anyway?
My colleague had a great description of it:
Looks like some sort of “Trusted Computing” TPM bigbrotherware that was invented to appease Hollywood and Microsoft.
Welp, Hollywood and Microsoft don’t pay my bills so how do I disable it? I found via Google that in OpenIndiana the Trousers package is branded as the service tcsd
root@www002:~# svcs tcsd
STATE STIME FMRI
disabled 18:18:09 svc:/application/security/tcsd:default
The service is disabled by default. Hrm. Is this a dead end? Why is ssh connecting to this again? Perhaps it is part of the cryptographic framework. Those crypto guys are genius and a little sneaky. What can be learned about the default configuration?
root@web003:~# cryptoadm list
User-level providers:
Provider: /usr/lib/security/$ISA/pkcs11_kernel.so
Provider: /usr/lib/security/$ISA/pkcs11_softtoken.so
Provider: /usr/lib/security/$ISA/pkcs11_tpm.so
Kernel software providers:
des
aes
arcfour
blowfish
ecc
sha1
sha2
md4
md5
rsa
swrand
Kernel hardware providers:
Holy Cow Batman, there it is! TPM, right there installed in the kernel as a module. Son of a biscuit. So, I tried to disable the module with:
cryptoadm disable provider=/usr/lib/security/\$ISA/pkcs11_tpm.so
but that didnt work. So I rebooted the zone. That made no difference. Here is what I did to make the difference:
cryptoadm uninstall provider=/usr/lib/security/\$ISA/pkcs11_tpm.so
Don’t freak out. This doesnt uninstall anything from your system. It simply does what “cryptoadm disable” doesnt seem to do. It disables the feature by removing it from the crypto framework.
So here are the results.
Initially, web003 performed like this:
root@heimdall:~# time ssh web003 hostname
web003
real 3m3.169s
user 0m0.034s
sys 0m0.009s
Can you believe that? Three minutes to login into a system? Bleh.
And after removing Trousers from the crypto framework, it looks like this:
root@heimdall:~# time ssh web003 hostname
web003
real 0m0.373s
user 0m0.037s
sys 0m0.012s
Yes, that is 373 miliseconds compared to 3 minutes, 3 seconds, and 169 miliseconds.
I think we can put a nail in the coffin of this problem. It must be a bug that the cryptographic framework invokes a service that is disabled by default. I guess mistakes happen. Managing a large project like IllumOS or OpenIndiana is definitely a challenge.

