What are common strategies and tools used by tech companies to manage configuration of hundreds of servers?

(I originally answered this on Quora but I figured it also merited a blog post.)

There are two elements to answer here: strategy and tools. Strategies are hard. Lists of tools are easy. What I can tell you also varies greatly depending on what you define as “strategy”. I’m going to try to answer mostly the strategy element because to me it’s the more interesting domain. I’ll also list some tools I’ve seen in common use.

Firstly, though, I’d query: why “tech companies”? What is a “tech” company? These days few companies don’t rely on technology to function - every bank in the world, for example, is a technology company in some way … but I digress. Pretty much every big company (and lots of smaller ones too) now has hundreds, if not thousands, of servers. The organisations (banks, web companies, pharmaceuticals, universities, defence, telecommunications, etc.) that know they need to solve these problems (and employ the smart people who can solve them) don’t tend to attack the problem any differently from technology companies. I’m going to try to cover strategy and tools that suit medium to large organisations with lots of IT assets.

The strategic approaches to the problem of “how do we manage lots of new servers” vary greatly. Which strategies you adopt can also depend greatly on the tactical realities of your organisation: for example, the mix of physical/virtual hosts, organisational structure and culture, security model, application architecture, geographic deployment/distribution and a number of other variables. I’m going to present the broad items I think you need to cover if you want to successfully improve the management of your hosts and then list some tools that might help. Naturally your mileage may vary depending on the variables listed above.

1. A workflow/life cycle.

You should map out the life cycle of your IT assets, including hosts, and include in it things like procurement for physical assets, software licensing and outsourcing. There is no point making it really easy and fast to provision a host if the rest of your life cycle crawls. Depending on your role, you may not be able to solve all the problems in the life cycle, but you should be able to highlight and drill down on your components of it. Remember, sometimes improving a process can improve a situation as much as, or more than, introducing a tool. Also remember, introducing a tool to a broken process doesn’t fix the process. Start at a high level and cover each step - for example, a life cycle might include the following steps (fewer or more, depending on the organisation):

Customer requests new host
IT identifies requirements
IT buys/provisions host
IT builds host
IT Security tests host
IT Operations deploys host
Customer starts using host
Customer finishes with host
IT Operations decommissions host
IT Security, IT Procurement/Asset Management, Audit confirm host is decommissioned and data purged.

Identify the elements of that life cycle that you control and manage, and work to make them streamlined, efficient, cost-effective and capable of producing a good quality product. I can’t emphasise enough how important it is to deliver quality and cost-effective services. Measure each step of the process, and attach metrics and service levels to each one. Metrics are crucial both to validate your progress/state and to measure cost, time and quality savings. Some metrics might include:

Time since request
Time in each stage
Percentage of acceptance criteria passed upon customer hand-over
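
As a rough illustration, the three metrics above could be calculated from nothing more than a timestamp recorded each time a host enters a new stage. This is a minimal sketch, not a prescription: the stage names are loosely borrowed from the example life cycle, and the function names, the sample host and the acceptance criteria are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical stage names, loosely based on the example life cycle.
STAGES = ["requested", "requirements", "provisioned", "built",
          "security_tested", "deployed"]


def time_in_each_stage(timestamps):
    """Return {stage: timedelta} for every stage the host has already left.

    `timestamps` maps a stage name to the moment the host entered it.
    """
    durations = {}
    for current, following in zip(STAGES, STAGES[1:]):
        if current in timestamps and following in timestamps:
            durations[current] = timestamps[following] - timestamps[current]
    return durations


def time_since_request(timestamps, now):
    """Elapsed time since the customer asked for the host."""
    return now - timestamps["requested"]


def acceptance_pass_rate(results):
    """Percentage of acceptance criteria that passed at customer hand-over."""
    if not results:
        return 0.0
    passed = sum(1 for ok in results.values() if ok)
    return 100.0 * passed / len(results)


# Made-up sample data for a single host.
host = {
    "requested": datetime(2024, 1, 1, 9, 0),
    "requirements": datetime(2024, 1, 2, 9, 0),
    "provisioned": datetime(2024, 1, 5, 9, 0),
}
criteria = {"ssh_access": True, "monitoring": True,
            "backups": False, "patched": True}

durations = time_in_each_stage(host)
elapsed = time_since_request(host, now=datetime(2024, 1, 8, 9, 0))
pass_rate = acceptance_pass_rate(criteria)
```

Even something this small is enough to show where a request spends most of its time, which is usually the first thing you need to know before setting service levels.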

Remember to reach out to the other people in the life cycle and get them involved too. Don’t do this in an aggressive way: “We’re making IT asset management better and we’ve identified your piece of it is a problem.” Approach them as fellow service providers with the same customers: “We want to deliver a faster and better service to our customers and we’d like to get you involved. It’s a win-win if the customers get what they want faster and cheaper.” Having demonstrable results of the improvements will also go a long way to showing people it can be done. As another digression, people who get this kind of “cross-functional” collaboration working also tend to be viewed as “rock stars” in corporate organisations. That doesn’t hurt at bonus time.

2. Context & Information

Approaching the problem of managing a large number of hosts that are currently unmanaged is daunting. The “where do I start?” question is hard to answer. Too often you can’t answer this question because you don’t know how a host or application is configured. Or you don’t know why a certain file or package is present. This all points to a classic IT problem: lack of context (or information). Any successful configuration management program is going to require more or better information. You need to approach any project with the view that gathering and storing information is crucial. You need to be able to analyse the infrastructure you want to manage and understand what it does and what it communicates with.

Don’t go overboard though - lots of these projects end up in “data gathering limbo” or “analysis paralysis”. Start with a small component, application or group of applications. Build a process and/or tools to break them down and analyse them. Record the results. I am reluctant to use the term Configuration Management Database, because too many of these try to bend your process to the tool rather than support the best process for your organisation, but this is the broad objective. Model the host or application and its configuration and then roll out your process and tool(s). Actually attacking a problem or component and executing on it often tells you much more about potential future challenges than further analysis would.

Then accelerate from this - if your process works then start to scale it to other components and applications. Having a good track record on execution helps with this. And at the end … you have both a better managed environment and a solid store of data to help you manage (and monitor, and back up, and secure, etc.) that environment.
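
To make “model the host and record the results” a little more concrete, here is a minimal sketch of what one record might hold. Everything in it is an assumption for illustration: the `HostRecord` class, its fields and the sample host are hypothetical, not a format taken from any particular tool or CMDB.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """One host, what it runs, why, and what it talks to (hypothetical model)."""
    name: str
    role: str
    # package name -> reason it is installed ("" when nobody knows yet)
    packages: dict = field(default_factory=dict)
    # names of the other hosts or services this host communicates with
    communicates_with: list = field(default_factory=list)

    def unexplained_packages(self):
        """Packages with no recorded reason: the missing context to chase up."""
        return sorted(name for name, reason in self.packages.items()
                      if not reason.strip())


# Made-up example host.
web01 = HostRecord(
    name="web01",
    role="web server",
    packages={
        "nginx": "serves the customer portal",
        "openssl": "TLS for nginx",
        "telnet": "",
    },
    communicates_with=["db01", "ldap01"],
)

missing_context = web01.unexplained_packages()
```

The point of recording a reason alongside each package is that the gaps become visible: a list of unexplained packages is a much easier starting point than “where do I start?”.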

3. Choose the right playing field.

I personally think the best types of projects to start with in this area are “green fields”: new projects, where a slightly higher risk profile or some experimentation is potentially more acceptable and where there is less of the “we’ve always done it this way” resistance. These projects also tend to have higher profiles in the organisation and you can leverage this marketing: “Ann used the new IT life management design and x, y, and z tools to make Project NewProduct deliver their IT components in just two weeks! We saved $manydollars and launched on time. Maybe there is something in this…” The same thing done in a legacy environment might not pick up the same marketing gloss but can be a useful approach too. This is particularly true in organisations where the costs of legacy operating system and application maintenance are huge. For example, there are lots of people with Windows NT 4.0 hosts still running mission-critical services. If you provide a solution that makes this cost go down - both in terms of people and dollars - then that can make you very popular.

4. Buy-in from management and the IT organisation

It’s often perceived as easier in a big organisation to build a “skunkworks” (http://en.wikipedia.org/wiki/Sku…) - a team of people or bunch of tools operating outside the corporate structure who can explore “radical” innovation or change. Lots of IT management and tool uplift projects start this way. But ultimately they need to grow beyond that both because skunkworks don’t tend to scale and because they quickly encounter resistance from the more formal elements of the organisation. To get the right improvements you need to have buy-in from your management and your customers. They need to support the risks you’re taking, fund the changes you want and support you when things go wrong. The best way to get this support is to demonstrate real cost benefits. Don’t start with technology messages. Use:

Dollars,
Time, and
Quality.

Measure the cost of the way you currently manage things, then model how the new method will work and what the difference is. Do the same for time. And quality. Your customers aren’t going to care about the new cool tool you might want to introduce. They are going to sit up and pay attention if you tell them their applications will be delivered faster, cheaper and at better quality than before. This makes you, your management and the IT organisation look good. That sort of looking good gets you bonuses, promotions and pay rises. I personally am a big fan of all three.

Finally, a lot of IT management change is cultural. Don’t underestimate the power of messages like “But we’ve always done it this way” to derail your efforts. Get people across teams involved and communicate as widely as possible. Make people understand that you’re trying to make things better. We IT people tend to be a stubborn and cynical lot. But we also tend to be very keen on things that make our lives easier. So sell your improvement on the basis that it’ll make my life easier, and that it’ll get me onto the cool projects I want to do rather than on activities like provisioning and configuration management that tend to be boring time sinks. If you can present that message then I am much more likely to help you rather than hinder you.
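
The dollars-and-time comparison described in this section does not need to be sophisticated to be persuasive. Here is a minimal sketch of the arithmetic. Every figure is invented purely for illustration (host counts, hours per host, hourly rate, tooling cost), and the `annual_cost` function is a hypothetical helper, so substitute your own measured numbers.

```python
def annual_cost(hosts_per_year, hours_per_host, hourly_rate, tooling_cost=0.0):
    """Yearly cost of building hosts: staff time plus any tooling spend."""
    return hosts_per_year * hours_per_host * hourly_rate + tooling_cost


# Invented figures: 200 hosts a year at $80 per staff hour.
HOSTS_PER_YEAR = 200
HOURLY_RATE = 80

# Current method: 16 hours of hands-on work per host, no tooling spend.
current = annual_cost(HOSTS_PER_YEAR, 16, HOURLY_RATE)

# Proposed method: 2 hours per host, plus $50,000 a year for tooling.
proposed = annual_cost(HOSTS_PER_YEAR, 2, HOURLY_RATE, tooling_cost=50_000)

dollars_saved = current - proposed
hours_saved = HOSTS_PER_YEAR * (16 - 2)

print(f"Current:  ${current:,.0f} per year")
print(f"Proposed: ${proposed:,.0f} per year")
print(f"Saving:   ${dollars_saved:,.0f} and {hours_saved:,} staff hours per year")
```

Quality is harder to put a number on, but something like the percentage of acceptance criteria passed at hand-over, measured before and after, can sit alongside these two figures.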

5. Tools

Lastly, you have tools. Choosing the right tool is hard. Keeping in mind that tools are not the “Holy Grail” and won’t immediately solve all of your problems is often harder. I’d recommend gathering real-world case studies and talking to other people in your industry who have done some of these things. Cost and performance are obviously key issues. If you’re going to grow (and most companies’ host count is rising rapidly) then choose a tool that scales in terms of both cost and performance. Demand to see real-world demos, not canned screencasts. Ask to talk to customers and ask them “what sucked about tool x”, because that’s often more important than what was great. Also ask them if they have metrics and data they can share - that’s often a useful leg up on a business case.

Finally, very much from my experience and certainly not discounting other tools not listed, the major players in the enterprise configuration management space are: Bladelogic, CfEngine, Opsware, Puppet, Tivoli and tools like SCCM in the Microsoft Windows world. If you include Cloud, WebOps and start-ups you see a much wider variety of tools. This is a very configuration management-centric list.