Sunday, October 27, 2013

Troubleshooting SSH connections in Ansible


I'm a firm believer that how an application is deployed and configured is just as important as how it's written, and is as much the responsibility of the devs as the ops folks. It's become clear to me that if the delivery and configuration process is accessible and transparent to developers, it dramatically affects their design decisions and very often reduces the overall complexity of the systems they build.

Making this transparent is not easy and requires some serious devops tooling. Over the years I've used my fair share of the usual configuration management suspects, Chef, Puppet et al., trying to find the perfect tool to bridge the gap between dev and ops. Lately I've been working a lot with Ansible, and in it I think I've finally found the perfect solution. So much so that in an incredibly short time it's become the cornerstone of our continuous delivery strategy.

Ansible's elegant simplicity makes it incredibly accessible; it's devops for developers, devops for people without time for devops. Simplicity is a key tenet of the project and it shows in all aspects of the tool. Its installation is trivial, a glaring contrast to the Rube Goldberg configuration required to stand up a typical Chef server installation. Its YAML playbooks read almost like English, capturing in no uncertain terms the intent of what's about to be executed. Anyone can read them and in little or no time understand what's going on.

Like Git, Ansible trades the complexity of a proprietary transport mechanism for the robust, battle-tested power of ssh. This is extremely effective and clever, but it is also generally a pain point for new users trying out Ansible for the first time. What follows is meant to ease this pain; it's more about troubleshooting ssh than Ansible, but should outline some of the more common errors you'll encounter.

Your key is not known

If you're using ssh keys and you start seeing login issues related to ssh, your first action should be to check that you're using the right key. Super obvious, I know, but you'd be surprised how many times this is the problem. If you're passing the key at the command line, make sure that it's the right one; naming keys after the user really helps here. If you're using an ssh-agent to store your keys, check that the key has been added, as you may easily be on a different machine or shell where the key has not yet been added.
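
If you're going the command-line route, a quick sanity check is an Ansible ping with the key passed explicitly; the host, inventory and key path below are placeholders:
ansible yourhost -i hosts -u ansible --private-key=keys/ansible -m ping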

The following will give a list of the keys that have been added to the agent:
$ ssh-add -l
2048 06:c9:5c:14:de:83:00:94:ec:15:e5:c9:4e:86:4f:a6 /Users/sgargan/devroot/projects/devops/ansible/keys/ansible (RSA)
2048 dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6 /Users/sgargan/.vagrant.d/insecure_private_key (RSA)
If there are keys you use frequently, you should make aliases to add them, or better yet reference them in a custom .ssh/config entry so they are automatically known.
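
For example, a minimal ~/.ssh/config entry along these lines (host and key path are placeholders) means the right key is picked up without any extra flags:
Host yourhost
    User ansible
    IdentityFile ~/.ssh/ansible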


The key is missing from authorized_keys

For key-based access, the user's public key must be present on the server to authenticate the connection. With ssh this means the public key must be in their .ssh/authorized_keys. You can copy the key to the server and just cat it onto the bottom of the existing file:
cd ~user
cat newkey.pub >> .ssh/authorized_keys

Alternatively, ssh-copy-id can be used to do essentially the same from the local machine. In general, adding keys by hand is tedious and you should automate it as much as possible. Better yet is to have an Ansible role that is run with password access to bootstrap the host with users and corresponding keys before removing password access.
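
To illustrate the ssh-copy-id approach mentioned above, it looks something like this (host and key path are placeholders):
ssh-copy-id -i ~/.ssh/ansible.pub ansible@yourhost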


Your ssh-agent is not running

If you try to run ansible and get the rather uninformative, generic ssh failure error, check that your ssh agent is running:
export | grep SSH
SSH_AGENT_PID=14738
SSH_AUTH_SOCK=/tmp/ssh-U4z3bbdQJiqx/agent.14737
SSH_CLIENT='192.168.10.26 59808 22'
SSH_CONNECTION='192.168.10.26 59808 10.0.30.103 22'
SSH_TTY=/dev/pts/0
A handy script for managing the ssh agent can be found here
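
If the agent isn't running, something along these lines starts one in the current shell and adds your key (the key path is a placeholder):
eval "$(ssh-agent -s)"     # start an agent and export its environment into this shell
ssh-add ~/.ssh/ansible     # add the key to the agent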


Known hosts conflicts

Repeatedly spinning up VMs with new ssh installations can cause conflicts in your known_hosts, where your local install thinks that the IP should be associated with a different ssh key. If your Ansible run fails at any stage, your first action should be to test the key by logging in via straight ssh:

ssh ansible@yourhost -i <ansible_key> -vvv
The -v's crank up the logging and give invaluable debugging information.

You may then see something like:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
1b:c6:87:9e:12:88:1d:70:4e:a1:27:68:71:eb:98:4d.
Please contact your system administrator.
Add correct host key in /Users/sgargan/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/sgargan/.ssh/known_hosts:87

You can use the following to remove the offending key (87 is the line number reported above):
sed -i "87d" ~/.ssh/known_hosts
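
ssh-keygen can also remove all entries for a host by name or IP, which saves hunting for line numbers:
ssh-keygen -R yourhost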

Alternatively, though not recommended for obvious security reasons, you can disable host key checking by default for Ansible by editing /etc/ansible/ansible.cfg or ~/.ansible.cfg:
[defaults]
host_key_checking = False
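
If memory serves, the same setting can also be toggled per run with an environment variable rather than the config file (the playbook name is a placeholder):
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook site.yml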

SSHD config

More insidious, and harder to troubleshoot, are errors in the sshd config on your target server. If you try to log in using a key and it's failing for some reason other than the usual ones, it's probably related to some issue with the sshd configuration on the target server. I personally ran into this when the ansible user on the target machine was set up to use zsh but the shell had not been installed.
To debug you'll need to log into the server by some other mechanism and start the ssh daemon in debug mode on some other port.
/usr/sbin/sshd -d -p 1234
Now try logging in while watching this console:

ssh -i <your_key> ansible@yourhost -p 1234
Any issues with the login or the ssh config will be immediately evident.
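
It's also worth validating the config itself; sshd has a test mode that checks the config for errors without starting the daemon:
/usr/sbin/sshd -t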


Ansible debug logging

On the extremely rare occasions that I've had trouble with hanging Ansible scripts, Ansible's detailed logging has come to the rescue. You enable it similarly to how it's done with ssh, via an increasing number of -v's; the more v's, the more detailed the logging.
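
For example (the playbook name is just a placeholder):
ansible-playbook site.yml -vvv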

It will show you which users are being used for the executions and which scripts are being run. If you're having real trouble, you can log in and run them locally to see what's going on. Typically hangs end up being issues with waiting for input on the server side, e.g. sudo passwords or other input.

I ran into this as the vagrant user trying to sudo to postgres to create a database. Running with -vvv gave the following verbose output, in which you can see the users and the script being executed. From this output I was able to tell which users were involved, log into the machine, and see what happens when the script is run manually. Often you'll find that something has prompted for input and is blocking the execution.

TASK: [create the killer-app database] *******************************************
<10.0.0.3> ESTABLISH CONNECTION FOR USER: vagrant
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'mkdir -p /tmp/ansible-1381193065.3-124473528207558 && chmod a+rx /tmp/ansible-1381193065.3-124473528207558 && echo /tmp/ansible-1381193065.3-124473528207558'"]
<10.0.0.3> REMOTE_MODULE postgresql_db db=killer-app encoding='UTF-8'
<10.0.0.3> PUT /var/folders/vl/zh7s26bd7f140kngj4p_ymxw0000gn/T/tmphF3QuO TO /tmp/ansible-1381193065.3-124473528207558/postgresql_db
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'chmod a+r /tmp/ansible-1381193065.3-124473528207558/postgresql_db'"]
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', '/bin/sh -c \'sudo -k && sudo -H -S -p "[sudo via ansible, key=ioxngtjrdsjbunfdmjmppjfuefskdrew] password: " -u postgres /bin/sh -c \'"\'"\'/usr/bin/python /tmp/ansible-1381193065.3-124473528207558/postgresql_db\'"\'"\'\'']
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'rm -rf /tmp/ansible-1381193065.3-124473528207558/ >/dev/null 2>&1'"]
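
If a task is hung, the staged module is usually still sitting in the temp directory shown in the log, so you can log in and run it by hand to see whatever prompt is blocking it; the host, user and path here are lifted straight from the output above:
ssh vagrant@10.0.0.3
sudo -u postgres /usr/bin/python /tmp/ansible-1381193065.3-124473528207558/postgresql_db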


Sudo failures

Be careful changing the hostname of the target host. If you make a change, make a corresponding change to the localhost entry in /etc/hosts. Sudo will try to look up the host before taking action; if the hostname does not match the entry in /etc/hosts, it will go out to DNS and eventually complain that it can't determine the hostname. Ansible may fail the step as a result.
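
For example, if the hostname is changed to webserver01 (a made-up name for illustration), /etc/hosts should carry a matching entry; on a Debian-style layout that looks something like:
127.0.0.1   localhost
127.0.1.1   webserver01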