Frustrations with the inherent differences between a deterministic configuration management system like Puppet and whatever Ansible is supposed to be.


I’ve had a lot of different titles over my past few jobs, but whether HR chooses to label you a “Software Engineer”, a “DevOps Engineer”, or just a “Systems Administrator”, the job functions have pretty much been the same - codifying the configuration of a fleet of (mostly Linux) servers. There are tons of tools to do this with, and I don’t pretend to have used even a large number of them, but I have worked with a few.

I’ve worked with hand-tuned shell scripts, CFEngine gobbledygook, Puppet, Terraform, and now, at my latest job, Ansible. I’ve dabbled with Ansible before, but that was just for little one-off tasks where we had to touch a lot of servers to do something that didn’t need to be coded in Puppet. This is my first job where it’s our primary configuration tool. And I’m not that impressed with it for this task.

I prefer having the desired state of a server’s configuration coded and managed by a tool like Puppet that figures out how to get the server from “where it is” to “where I want it to be”. I worked with Puppet for so long that I still find myself thinking that way. Of course, this approach kind of assumes that the server’s configuration doesn’t change all that often, or at least that it can be changed in predictable ways.

I’m sure Ansible can be used to get a server into a desired configuration state and keep it there, but it feels a lot more like a big hammer than a scalpel. Maybe I’m just not familiar enough with it yet.
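To be fair, Ansible can express the same desired-state intent if you stick to idempotent modules. Here’s a minimal sketch of what I mean - the nginx package, the template, and the host group are placeholders for illustration, not anything from our codebase:

```yaml
---
# Desired-state style: describe what should be true and let the modules
# decide whether anything actually needs to change.
- hosts: proxies        # hypothetical host group
  become: true
  tasks:
    - name: Ensure nginx is installed
      package:
        name: nginx
        state: present

    - name: Ensure the base config is in place
      template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
        owner: root
        group: root
        mode: "0644"
      notify: reload nginx

    - name: Ensure nginx is enabled and running
      service:
        name: nginx
        state: started
        enabled: true

  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded
```

Run it twice and the second run should report zero changes, which is about as close to the Puppet mental model as Ansible gets.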

Background

In my current role, we manage a fleet of close to 3000 hosts, which are basically just Linux servers at the bottom of the stack. The “configuration” that we manage is generally at the higher levels of the stack, in the proxy/load-balancer layers. There are modules(?) in Ansible that interact with this software layer, but our code base doesn’t use them for various reasons. Instead we run a lot of shell commands that load config files or make specific changes to the config. This is generally sufficient for managing the base systems and common configs. However, there is a whole other layer of “customer” configs sitting above that, managed by an in-house automation system. A customer can have dozens or even hundreds of endpoints defined - each with custom settings and behaviors - and each of those endpoints can be spread across multiple hosts in multiple data centers. They can even be moved automatically or on demand in the case of outages or maintenance.
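To make that contrast concrete, here is roughly what I mean. The file paths, the proxyctl command, and the host group are all invented for illustration; our real playbooks are messier than this:

```yaml
---
- hosts: proxies        # hypothetical host group
  become: true
  tasks:
    # What a lot of our code looks like today: imperative shell steps
    # that rewrite the config and kick the service on every run.
    - name: Load the staged proxy config and reload (shell style)
      shell: |
        cp /var/staging/proxy.cfg /etc/proxy/proxy.cfg
        proxyctl reload

    # The module-based equivalent: only copies when the file differs,
    # and only reloads when the copy actually changed something.
    - name: Ensure the proxy config matches the staged copy (module style)
      copy:
        src: /var/staging/proxy.cfg
        dest: /etc/proxy/proxy.cfg
        remote_src: true
      notify: reload proxy

  handlers:
    - name: reload proxy
      command: proxyctl reload
```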

This poses a problem: the configuration is basically split into two layers. The “base” layer is what we try to manage with Ansible. The upper “customer” layer is managed by the in-house automation system. Unfortunately, there are lots of dependencies between the layers, and the delineation line is not always that clear.

Cruft

The service that we manage has been around long enough that significant cruft has built up. We’ve gone through several major releases of the network operating system our hosts run. We’re really behind on Ansible versions: we still use playbooks and roles written for both 2.3 and 2.8. Luckily, the Python version on our hosts is relatively recent (>=3.6).

As someone new to Ansible, I find it frustrating to look up documentation only to discover that the version we’re using doesn’t support feature X, or behaves entirely differently than the current version. Lots of examples on the web are written for the latest versions of Ansible. As I write more playbooks and roles for this ancient version, I’m sure I’ll get used to its idiosyncrasies, but it is a chore.
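A concrete example of the mismatch: most current examples on the web use fully qualified collection names (which need Ansible 2.10+) and the loop keyword (2.5+), neither of which works in our 2.3-era playbooks. The package list here is just filler:

```yaml
# What current documentation tends to show
# (needs 2.10+ for the FQCN, 2.5+ for loop):
- name: Ensure debugging tools are present
  ansible.builtin.package:
    name: "{{ item }}"
    state: present
  loop:
    - tcpdump
    - strace

# What the same task has to look like in our oldest playbooks:
- name: Ensure debugging tools are present
  package:
    name: "{{ item }}"
    state: present
  with_items:
    - tcpdump
    - strace
```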

Unlearning Puppet

I guess the biggest challenge for me has been unlearning Puppet. I got used to a mindset where we were able to manage the entire config of a server (or at least all the parts that we cared about) in the Puppet DSL. You could be sure that running puppet agent on a box would get it to the desired state, and you could easily determine what that desired state should be by looking at the Puppet code.

Perhaps in some environments Ansible can be used this way, but in my current role there is always going to be a disconnect between the actual state of the server and what we have configured in Ansible. Our Ansible code does not and cannot manage the entire config on a box.

When asked to make changes in our system, my first reaction is to look for ways to modify the codebase to implement the change. That’s not always possible in our environment. The change may not be part of the base config at all, or the part of our codebase that managed it was only valid at initial deployment and the actual state has drifted since. Sometimes it is quicker and easier to just run some shell commands to accomplish what is being asked for. I need to be more aware of that fact when planning my changes.
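In practice that often means reaching for an ad-hoc run instead of a new playbook. Something like this, where the host pattern and the command are made up for illustration:

```
# One-off change across a group of hosts, no playbook involved
ansible proxies -b -m shell -a "proxyctl set-timeout 30 && proxyctl reload"
```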