I'm a firm believer that how an application is deployed and configured is just as important as how it's written, and is as much the responsibility of the devs as the ops folks. It's become clear to me that if the delivery and configuration process is accessible and transparent to developers, it dramatically affects their design decisions and very often reduces the overall complexity of the systems they build.
Making this transparent is not easy and requires some serious devops tooling. Over the years I've used my fair share of the usual configuration management suspects, Chef, Puppet et al., trying to find the perfect tool to bridge the gap between dev and ops. Lately I've been working a lot with Ansible, and in it, I finally think I've found the perfect solution. So much so that in an incredibly short time it's become the cornerstone of our continuous delivery strategy.
Ansible's elegant simplicity makes it incredibly accessible; it's devops for developers, devops for people without time for devops. Simplicity is a key tenet of the project and it shows in all aspects of the tool. Its installation is trivial, a glaring contrast to the Rube Goldberg configuration required to stand up a typical Chef server installation. Its YAML playbooks read almost like English, capturing in no uncertain terms the intent of what's about to be executed. Anyone can read them and in little or no time understand what's going on.
Like Git, Ansible trades the complexity of a proprietary transport mechanism for the robust and battle-tested power of ssh. This is extremely effective and clever, but it is also generally a pain point for new users trying out Ansible for the first time. What follows is meant to ease this pain; it's more about troubleshooting ssh than Ansible, but should outline some of the more common errors you'll encounter.
Your key is not known
If you're using ssh keys and you start seeing login issues related to ssh, your first action should be to check that you're using the right key. Super obvious I know, but you'd be surprised how many times this is the problem. If you're passing the key at the command line, make sure that it's the right one; naming keys after the user really helps here. If you're using an ssh-agent to store your keys, check that the key has been added, as you may easily be on a different machine or shell where the key has not yet been added. The following will give a list of the keys that have been added to the agent:
$ ssh-add -l
2048 06:c9:5c:14:de:83:00:94:ec:15:e5:c9:4e:86:4f:a6 /Users/sgargan/devroot/projects/devops/ansible/keys/ansible (RSA)
2048 dd:3b:b8:2e:85:04:06:e9:ab:ff:a8:0a:c0:04:6e:d6 /Users/sgargan/.vagrant.d/insecure_private_key (RSA)
The key is missing from authorized_keys
For keypair-based access by a user to a host, the user's public key must be present on the server to authenticate the connection. With ssh this means the public key must be in their .ssh/authorized_keys. You can copy the key to the server and just cat it onto the bottom of the existing file:
cd ~user
cat newkey.pub >> .ssh/authorized_keys
Alternatively, ssh-copy-id can be used to do essentially the same from the local machine. In general, adding keys by hand is tedious and you should automate this as much as possible. Better yet is to have an Ansible role that is run with password access to bootstrap the host with users and corresponding keys, before removing the password access.
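As a sketch of the manual route, the following installs a public key with the strict permissions sshd insists on (it will silently ignore authorized_keys if the file or its directory is group- or world-writable). The directory and key below are throwaway stand-ins for illustration:

```shell
# Stand-ins: a throwaway "home" directory and a fake public key line.
# In practice USER_HOME is the target user's home and newkey.pub the key you copied up.
USER_HOME=$(mktemp -d)
echo "ssh-rsa AAAAB3NzaC1yc2E... ansible@control" > "$USER_HOME/newkey.pub"

mkdir -p "$USER_HOME/.ssh"
chmod 700 "$USER_HOME/.ssh"                                 # sshd ignores the dir if it's too open
cat "$USER_HOME/newkey.pub" >> "$USER_HOME/.ssh/authorized_keys"
chmod 600 "$USER_HOME/.ssh/authorized_keys"                 # the key file must be private too
```

The permissions are the part people forget; a key that "should" work but doesn't is very often just a too-permissive authorized_keys.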
Your ssh-agent is not running
If you try to run Ansible and get the rather uninformative generic ssh failure error, check that your ssh agent is running:
export | grep SSH
SSH_AGENT_PID=14738
SSH_AUTH_SOCK=/tmp/ssh-U4z3bbdQJiqx/agent.14737
SSH_CLIENT='192.168.10.26 59808 22'
SSH_CONNECTION='192.168.10.26 59808 10.0.30.103 22'
SSH_TTY=/dev/pts/0
A handy script for managing the ssh agent can be found here.
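A minimal sketch of what such a script boils down to: start an agent if one isn't already reachable, then load your key (the key path is illustrative):

```shell
# Start an agent if SSH_AUTH_SOCK isn't set, then try to load a key.
if [ -z "$SSH_AUTH_SOCK" ]; then
    eval "$(ssh-agent -s)"            # exports SSH_AUTH_SOCK and SSH_AGENT_PID
fi
ssh-add ~/.ssh/ansible 2>/dev/null || echo "could not add key - check the path"
ssh-add -l || echo "agent has no identities yet"
```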
Known hosts conflicts
Repeatedly spinning up VMs with new ssh installations can cause conflicts in your known hosts, where your local install thinks that the IP should be associated with a different ssh key. If your Ansible run fails at any stage, your first action should be to test the key by logging in via straight ssh:
ssh ansible@yourhost -i <ansible_key> -vvv
The -v's crank up the logging and give invaluable debugging information. You may then see:
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
1b:c6:87:9e:12:88:1d:70:4e:a1:27:68:71:eb:98:4d.
Please contact your system administrator.
Add correct host key in /Users/sgargan/.ssh/known_hosts to get rid of this message.
Offending RSA key in /Users/sgargan/.ssh/known_hosts:87
You can use the following to remove the offending key; the line number (87 here) comes from the end of the warning:
sed -i "87d" ~/.ssh/known_hosts
Alternatively, though not recommended for obvious security reasons, you can disable host key checking by default for Ansible by editing /etc/ansible/ansible.cfg or ~/.ansible.cfg:
[defaults]
host_key_checking = False
SSHD config
More innocuous, and harder to troubleshoot, are errors in the sshd config on your target server. If you try to log in using a key and it's failing for some reason other than the usual suspects, it's probably related to some issue with the sshd configuration on the target server. I personally ran into this when the ansible user on the target machine was set up to use the Z shell, but the shell had not been installed. To debug, you'll need to log into the server by some other mechanism and start the ssh daemon in debug mode on some other port:
/usr/sbin/sshd -d -p 1234
Now if you try logging in while watching this console:
ssh -i <your_key> ansible@yourhost -p 1234
any issues with the login or the ssh config will be immediately evident.
Ansible debug logging
On the extremely rare occasions that I've had trouble with hanging Ansible scripts, Ansible's detailed logging has come to the rescue. You enable it similarly to how it's done with ssh, via an increasing number of -v's; the more v's, the more detailed the logging. It will show you which users are being used for the executions and which scripts are being executed. If you're having real trouble, you can log in and run them locally to see what's going on. Typically hangs end up being issues with waiting for input on the server side, e.g. sudo passwords or other input.
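If you want to keep that verbose output around rather than scroll back through a terminal, Ansible can also write its log to a file via the log_path setting in ansible.cfg (the path below is just an example):

```
[defaults]
log_path = /tmp/ansible-debug.log
```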
I ran into this as the vagrant user trying to sudo to postgres to create a db. Running with -vvv gave the following verbose output, in which you can see the users and the script being executed. From this output I was able to tell which users were involved, log into the machine and see what happens when the script gets executed manually. Often you'll find that something has prompted for input and is blocking the execution.
TASK: [create the killer-app database] *******************************************
<10.0.0.3> ESTABLISH CONNECTION FOR USER: vagrant
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'mkdir -p /tmp/ansible-1381193065.3-124473528207558 && chmod a+rx /tmp/ansible-1381193065.3-124473528207558 && echo /tmp/ansible-1381193065.3-124473528207558'"]
<10.0.0.3> REMOTE_MODULE postgresql_db db=killer-app encoding='UTF-8'
<10.0.0.3> PUT /var/folders/vl/zh7s26bd7f140kngj4p_ymxw0000gn/T/tmphF3QuO TO /tmp/ansible-1381193065.3-124473528207558/postgresql_db
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'chmod a+r /tmp/ansible-1381193065.3-124473528207558/postgresql_db'"]
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', '/bin/sh -c \'sudo -k && sudo -H -S -p "[sudo via ansible, key=ioxngtjrdsjbunfdmjmppjfuefskdrew] password: " -u postgres /bin/sh -c \'"\'"\'/usr/bin/python /tmp/ansible-1381193065.3-124473528207558/postgresql_db\'"\'"\'\'']
<10.0.0.3> EXEC ['ssh', '-tt', '-q', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/Users/sgargan/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'StrictHostKeyChecking=no', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=vagrant', '-o', 'ConnectTimeout=10', '10.0.0.3', "/bin/sh -c 'rm -rf /tmp/ansible-1381193065.3-124473528207558/ >/dev/null 2>&1'"]