
Systems of Operation

I have, to misquote J. R. R. Tolkien, a cordial dislike of overly rigid classification systems. The fewer the dimensions, the worse they tend to be. The classic two-by-two grid, so beloved of management consultants, is a frequent offender. I suspect I am not alone, as most such systems quickly get complicated by the addition of precise placement along each axis, devolving into far more granular coordinate systems on at least one plane, rather than the original four simple boxes. But surely the worst of the lot are simple binary choices, this or that, no gradations on the spectrum allowed.

We have perhaps more than our fair share of these divisions in tech, or perhaps it makes sense that we have more than other fields? (That's a joke, because binary.) Anyway, one of the recurring binary splits is the one between development and operations. That it is a false binary is clear from the fact that these days, the grey area at the intersection, DevOps, gets far more consideration than either extreme. And yet, as it is with metaphors and allegories (back to JRRT!), so it is with classifications: all of them are wrong, but some of them are useful.

The Dev/Ops dichotomy is a real one, no matter how blurred the intersection has got, because it is based in a larger division. People tend to prefer either the work of creation, architecting and building, or the work of maintaining, running and repairing. The first group get visibility and recognition, so certain personality traits cluster at this end of the spectrum — flashy and extrovert, dismissive of existing constraints. At the opposite end, we find people who value understanding a situation deeply, including how it came to be a certain way, and who act within it to achieve their goals.

I am trying to avoid value judgments, but I think it is already clear where my own sympathies lie. Someone I have worked with for a long time subscribes to Isaiah Berlin's analogy: the fox knows many things, but the hedgehog knows one big thing. I am an unashamed fox: I know a little about a lot, I love accumulating knowledge even if I do not have an immediate obvious use for it, and I never saw a classification system I did not immediately question and find the corner-cases of. These traits set me up to be a maintainer and an extender rather than a creator.

I value the work of maintenance; designing a new thing starting with a clean sheet is an indulgence, while working within the constraints of an existing situation and past choices to reach my objectives is a discipline that requires understanding both of my own goals and those of others who have worked on the same thing in the past. In particular, good maintainers extend their predecessors the grace of assuming good intent. Even if a particular choice seems counter-intuitive or sub-optimal, this attitude does the courtesy of assuming there was a good and valid reason for making it, or a constraint which prevented the more obvious choice.

Embrace Failure — But Not Too Tightly

There are many consequences to this attitude. One is embracing failure as an opportunity for learning. The best way to learn how something works is often to break it and then fix it — but please don't blame me if you break prod! Putting something back together is the best way to truly understand how different components fit one another and interact with one another in ways that may or may not be planned in the original design. It is also often a way of finding unexpected capabilities and new ways of assembling the same bits into something new. I did both back when I was a sysadmin — broke prod (only the once) and learned from fixing things that were broken.

Embracing failure also does not mean that we should allow it to happen; in fact the maintainer mindset assumes failure and values redundancy over efficiency or elegance of design. Healthy systems are redundant, both to tolerate failure and to enable maintenance. I had a car with a known failure mode, but unfortunately the fix was an engine-out job, making preventative maintenance uneconomical. The efficiency of the design choice to use plastic tubing and route it in a hot spot under the engine ultimately came back to bite me in the shape of a late-night call to roadside assistance and an eye-watering bill.

Hyperobjects In Time

There is one negative aspect to the maintainer mindset, beyond the lack of personal recognition; people get awards for the initial design, not for keeping it operating afterwards. Lack of maintenance (or of the right sort of maintenance) is not immediately obvious, especially to hedgehog types. It is not the sort of one big thing that they tend to focus on. Instead, it is more of a hyperobject, visible only if you take a step back and add a time dimension. Don't clean the kitchen floor for a day, it's probably fine. Leave it for a week, it's nasty, and probably attracting pests. I know this from my own student days, when my flatmates explored the boundaries of entropy with enthusiasm.

Hyperobjects extend through additional dimensions beyond the usual three. In the same way that a cube is a three-dimensional object whose faces are two-dimensional squares, a hypercube or tesseract is a four-dimensional object whose faces are all three-dimensional cubes. This sort of thing can give you a headache to think about, but does make for cool screensaver visualisations. In this particular formulation, the fourth dimension is time; deferred maintenance is visible only by looking at its extent in time, while its projection into our everyday dimensions seems small and inconsequential when viewed in isolation.

These sorts of hyperobjects are difficult for hedgehogs to reason about precisely because they do not fit neatly into their two-by-two grids and one big thing. They can even sneak up on foxes because there is always something else going on, so the issues can remain undetected, hidden by other things, until some sort of failure mode is encountered. If that failure can be averted or at least minimised, maintainer foxes can learn something from it and modify the system so that it can be maintained more easily and avoid the failure recurring.

All of these reflections are grounded in my day job. I own a large and expanding library of content, which is continuously aging and becoming obsolete, and must be constantly maintained to remain useful. Leave one document untouched for a month or so, and it's probably fine; the drift is minimal, a note here or there. Leave it for a year, and it's basically as much work to bring it back up to date as it would be to rewrite it entirely. It's easy to forget this factor in the constant rush of everyday work, so it's important to have systems to remind us of the true extent of problems left unaddressed.
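One way to picture such a reminder system is a small script that sorts documents by how long they have gone unreviewed. This is only a sketch: the `Document` record, the function names, and the 30-day and 365-day thresholds are my own illustrative assumptions, loosely taken from the "month or so" and "year" above, not a description of any real tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class Document:
    # Hypothetical record: a title plus the date somebody last checked it.
    title: str
    last_reviewed: date


def staleness(doc: Document, today: date) -> str:
    """Classify a document by how long it has gone without review."""
    age_days = (today - doc.last_reviewed).days
    if age_days <= 30:      # assumed threshold: "a month or so" is fine
        return "fine"
    if age_days < 365:      # assumed threshold: under a year means drift
        return "drifting"
    return "rewrite"        # a year or more: as much work as starting over


def review_queue(docs: List[Document], today: date) -> List[Tuple[str, str, int]]:
    """Return (title, status, age in days) for stale documents, oldest first."""
    flagged = [
        (doc.title, staleness(doc, today), (today - doc.last_reviewed).days)
        for doc in docs
        if staleness(doc, today) != "fine"
    ]
    return sorted(flagged, key=lambda row: row[2], reverse=True)
```

Run on a schedule, something like this makes the time dimension visible: the queue grows quietly while everything still looks fine day to day.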

In my case, all of this rapidly-obsolescing content is research about competitors. This is also where the intellectual honesty comes in: it's important to recognise that creators of competing technology may have had good reasons for making the choices they made, even when they result in trade-offs that seem obviously worse. In the same way, someone who adopted a different technology probably did so for reasons that were good and valid for their time and place, and dismissing those reasons as irrelevant will not help to persuade them to consider a change. This is known as "calling someone's baby ugly", and tends to provoke much the same negative emotional reaction as insulting someone's actual offspring.

Good competitive positioning is not about pitching the One True Way and explaining all the ways in which other approaches are Wrong. Instead, it's about trying to understand what the ultimate goal is or was for all of the other participants in the conversation, and engaging with those goals honestly. Of course I have an agenda, I'm not just going to surrender because someone made a choice years ago — but I can put my agenda into effect more easily by understanding how it fits with someone else's agenda, by working with the existing complicated system as it is, rather than trying to raze it to the ground and start again to build a more perfect design, whatever the people who rely on the existing system might think.

I value the work of maintainers, the people who keep the lights on, at least as much as that of the initial designers. And I know that every maintainer is also a little bit of a designer, in the same way that every good designer is also thinking at least a little bit about maintenance. Maybe that is my One Big Thing?

The curve points the way to our future


Just a few days ago, I wrote a post about how technology and services do not stand still. Whatever model we can come up with based on how things are right now, it will soon be obsolete, unless our model can accommodate change.

One of the places where we can see that is with the adoption curve of Docker and other container architectures. Anyone who thought that there might be time to relax, having weathered the virtualisation and cloud storms, is in for a rude awakening.

Who is using Docker?

Sure, the latest Docker adoption survey still shows that most adoption is in development, with 47% of respondents classifying themselves as "Developer or Dev Mgr", and a further 15% as "DevOps or Release Eng". In comparison, only 12% of respondents were in "SysAdmin / Ops / SRE" roles.

Also, 56% of respondents are from companies with fewer than 100 employees. This makes sense: long-established companies have too much history to be able to adopt the hot new thing in a hurry, no matter what benefits it might promise.

What does happen is that small teams within those big companies start using the new cool tech in the lab or for skunkworks projects. Corporate IT can maybe ignore these science experiments for a while, but eventually, between the pressure of those research projects going into production, and new hires coming in from smaller startups that have been working with the new technology stack for some time, they will have to figure out how they are going to support it in production.

Shipping containers

If the teams in charge of production operations have not been paying attention, this can turn into "Good news for Dev, bad news for Ops", as my colleague Sahil wrote on the official Moogsoft blog. When it comes to Docker specifically, one important factor for Ops is that containers tend to be very short-lived, continuing and accelerating the trend that VMs introduced. Where physical servers had a lifespan of years, VMs might last for months - but containers have been reported to have a lifespan four times shorter than that of VMs.

That’s a huge change in operational tempo. Given that shorter release cycles and faster scaling (up and down) in response to demand are among the main benefits that people are looking for from Docker adoption, this rapid churn of containers is likely to continue and even accelerate.

VMs were sometimes used for short-duration tasks, but far more often they were actually forklifted physical servers, and shoe-horned into that operational model. This meant that VMs could sometimes have a longer lifespan than physical servers, as it was possible for them simply to be forgotten.

Container-based architectures are sufficiently different that there is far less risk of this happening. Also, the combination of experience and generational turnover means that IT people are far more comfortable with the cloud as an operational model, so there is less risk of backsliding.

The Bow Wave

The legacy enterprise IT departments that do not keep up with the new operational tempo will find themselves in the position of the military, struggling to adapt to new realities because of its organisational structure. Armed forces set up for Cold War battles of tanks, fighters and missiles struggle to deal with insurgents armed with cheap AK-47s and repurposed consumer technology such as mobile phones and drones.

In this analogy, shadow IT is the insurgency, able to pop up from nowhere and be just as effective as - if not more so than - the big, expensive technological solutions adopted by corporate. On top of that, the spiralling costs of supporting that technological legacy will force changes sooner or later. This is known as the "bow wave" of technological renewal:

"A modernization bow wave typically forms as the overall defense budget declines and modernization programs are delayed or stretched in the future," writes Todd Harrison of the Center for Strategic and International Studies. He continues: "As this happens the underlying assumption is that funding will become available to cover these deferred costs." These delays push costs into the future, like a ship’s bow pushes a wave forward at sea.

(from here)

What do we do?

The solution is not to throw out everything in the data centre, starting from the mainframe. Judiciously adapted, upgraded, and integrated, old tech can last a very long time. There are B-52 bombers that have hosted three generations from the same family. In the same way, ancient systems like SABRE have been running since the 1960s, and still (eventually) underpin every modern Web 3.0 travel-planning web site you care to name.

What is required is actually something much harder: thought and consideration.

Change is going to happen. It’s better to make plans up front that allow for change, so that we can surf the wave of change. Organisations that wipe out trying to handle (or worse, resist) change that they had not planned for may never surface again.

Not NoOps, but SmartOps

Or, Don't work harder, work smarter

I have always been irritated by some of the more extreme rhetoric around DevOps. I especially hate the way DevOps often gets simplified into blaming everything that went wrong in the past on the Ops team, and explicitly minimising their role in the future. At its extreme, this tendency is encapsulated by the NoOps movement.

This is why I was heartened to read "There is no such thing as NoOps", by the reliably acerbic IT Skeptic.

Annoyingly, in terms of the original terminology, I quite agree that we need to get rid of Ops. Back in the day, there was a distinction between admins and ops. The sysadmins were the senior people, who had deep skills and experience, and generally spent their time planning and analysing rather than executing. The operators were typically junior roles, often proto-sysadmins working through an apprenticeship.

Getting rid of ops in this meaning of the word makes perfect sense. The major cause of outages is human error, and not necessarily the fairly obvious moment when the poor overworked op realises, one oh-no-second after hitting Enter, that the login was not where they thought it was. What leads to these human-mediated outages is complexity, so the issue is the valid change that is made here but not there, or the upgrade that happened to one component but did not flow down to later stages of the lifecycle. These are the types of human error that can either cause failures on deployment, or those more subtle issues which only show up under load, or every second Thursday, or only when the customer's name has a Y in it.
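The "made here but not there" problem is at least mechanically detectable. As a rough sketch, assuming each environment's settings can be flattened into a simple key/value dictionary (the `find_drift` name and the data shape are hypothetical, chosen only for illustration), a comparison might look like this:

```python
from typing import Dict, Optional


def find_drift(
    environments: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return every setting whose value differs, or is missing, across environments."""
    # Collect every setting name seen in any environment.
    all_keys = set()
    for settings in environments.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        # A missing setting is recorded as None so it counts as a difference.
        values = {env: settings.get(key) for env, settings in environments.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift
```

Given `dev`, `staging` and `prod` dictionaries, the result lists only the settings that disagree, with the value each environment holds. It will not catch the every-second-Thursday bugs, but it does surface the change that never made it to the next stage.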

There have been many attempts to reduce the incidence of these moments by enforcing policies, review, and procedures. However, by not eliminating the weakest link in the chain - the human one - none of these well-meaning attempts have succeeded. Instead of saying "it will work this time, really!", we should aim to eliminate downtime and improve performance by removing every possible human intervention and hand-over, and instead allowing one single original design to propagate everywhere automatically.

So yes, we get rid of ops by automating their jobs - what I once heard a sysadmin friend describe to a colleague as "monkey-compatible tasks", basically low-value-added, tactical, hands-on-keyboard activity. However, that does not mean that there is no role for IT! It simply means that IT's role is no longer in execution, or in other words, as the bottleneck in every request.

Standard requests should not require hands-on-keyboard intervention from IT.

This is what all these WhateverOps movements are about: preventing IT from becoming a bottleneck to other departments, whether the developers in the case of DevOps, or the GRC team in the case of SecOps that I keep banging on about lately, or whatever other variation you like.

IT still has a very important role to play, but it is not the operator's role, it is the sysadmin's role: to plan, to strategise, to have a deep understanding of the infrastructure. Ultimately, IT's role is to advise other teams on how best to achieve their goals, and to emplace and maintain the automation that lets them do that - much as sysadmins in the past would have worked to train their junior operators to deliver on requests.

The thing is, sysadmins themselves can't wait to rid themselves of scut work. Nothing would make them happier! But the state of the art today makes that difficult to achieve. DevOps et al are the friend of IT, not its enemy, at least when they're done right. Done wrong, they are the developer's enemy too.

In that sense, I say yes to NoOps - but let's not throw the baby out with the bathwater! Any developer trying to do completely without an IT team will soon find that they no longer have any time to develop, because they are so busy with all this extraneous activity, managing their infrastructure[1], keeping it compliant, updating components, and all the thousand and one tasks IT performs to keep the lights on.

  1. No, Docker, "the cloud", or whatever fad comes next will not obviate this problem; there will always be some level of infrastructure that needs to be looked after. Even if it works completely lights-out in the normal way of things, someone will need to understand it well enough to fix it when (not if) it breaks. That person is IT, no matter which department they sit in. 


I've been blogging a lot about messaging lately, which I suppose is to be expected from someone in marketing. In particular, I have been focusing on how messaging can go wrong.

The process I outlined in "SMAC My Pitch Up" went something like this:

  • Thought Leaders (spit) come up with a cool new concept
  • Thought Leaders discuss the concept amongst themselves, coming up with jargon, abbreviations, and acronyms (oh my!)
  • Thought Leaders launch the concept on an unsuspecting world, forgetting to translate from jargon, abbreviations and acronyms
  • Followers regurgitate half-understood jargon, abbreviations and acronyms
  • Much clarity is lost

Now the cynical take is that the Followers are doing this in an effort to be perceived as Thought Leaders themselves - and there is certainly some of that going on. However, my new corollary to the theory is that many Followers are not interested in the concept at all. They are name-checking the concept to signal to their audience that they are aware of it and gain credibility for other initiatives, not to jump on the bandwagon of the original concept. This isn't the same thing as "cloudwashing", because that is at least about cloud. This is about using the cloud language to justify doing something completely different.

This is how we end up with actual printed books purporting to explain what is happening in the world of mobile and social. By the time the text is finalised it's already obsolete, never mind printed and distributed - but that's not the point. The point is to be seen as someone knowledgeable about up-to-date topics so that other, more traditional recommendations gain some reflected shine from the new concept.

The audience is in on this too. There will always be rubes taken in by a silver-tongued visionary with a high-concept presentation, but a significant part of the audience is signalling - to other audience members and to outsiders who are aware of their presence in that audience - that they too are aware of the new shiny concept.

It's cover - a way of saying "it's not that I don't know what the kids are up to, it's that I have decided to do something different". This is how I explain the difficulties in adoption of new concepts such as cloud computing[1] or DevOps. It's not the operational difficulties - breaking down the silos, interrupting the blamestorms, reconciling all the differing priorities; it's that many of the people talking about those topics are using them as cover for something different.

Images from Morguefile, which I am using as an experiment.

  1. Which my fingers insist on typing as "clod computing", something that is far more widespread but not really what we should be encouraging as an industry. 

DevOps is killing us

I came across this interesting article about the changes that DevOps brings to the developer role. Because of my sysadmin background, I had tended to focus on the Ops side of DevOps. I had simply not realised that developers might object to DevOps!

I knew sysadmins often didn’t like DevOps, of course. Generalising wildly, sysadmins are not happy with DevOps because it means they have to give non-sysadmins access to the systems. This is not just jealousy (although there is often some of that), but a very real awareness that incentives are not necessarily aligned. Developers want change, sysadmins want stability.

Actually, that point is important. Let me emphasise it some more.

Developers want change, sysadmins want stability

Typical pre-DevOps scenario: developers code up an application, and it works. It passes all the testing: functional, performance, and user-acceptance. Now it's time to deploy it in production - and suddenly the sysadmins are being difficult, complaining about processes running as root and world-writable directories, or talking about maintenance windows for the deployment. Developers just want the code that they have spent all this time working on to get out there, and the sysadmins are in the way.

From the point of view of the sysadmins, it’s a bit different. They just got all the systems how they like them, and now developers are asking for the keys? Not only that, but their stuff is all messy, with processes running as root, world-writable directories, and goodness knows what. When the sysadmins point out these issues and propose reasonable corrections, the devs get all huffy, and before you know it, the meeting has turned into a blamestorm.


The DevOps movement attempts to address this by getting developers more involved in operations, meaning that instead of throwing their code over the proverbial wall between Dev and Ops, they have to handle not just deployment but also support and maintenance of that code. In other words, developers have to start carrying pagers.

The default sysadmin assumption is that developers can't wait to get the root password and go joy-riding in their carefully maintained datacenter - and because I have a sysadmin background, sell to sysadmins, and hang out with sysadmin types, I had unconsciously bought into that. However, now that someone points it out, it does make sense that developers would not want to take up that pager…