Troubleshoot Your Problems by Going Back to Basics
Whenever I do consulting work for a customer who is having serious problems, I tend to look at the basics first. And I mean the extreme basics. Things like computer names, network routes, DNS resolution, Active Directory membership, etc. I’m going to stereotype a bit here and say that the people I work with who’ve been around a while tend to accept it and let me do my thing. The younger ones get irritated; they don’t seem to understand why I need to double-check the things they’ve already done.
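Checks like these can even be scripted. Here’s a minimal sketch, assuming nothing about any particular environment (the expected hostname and DNS name are whatever you’d plug in yourself), that verifies two of the extreme basics: what the machine thinks its own name is, and whether DNS can resolve a given name at all.

```python
import socket

def check_basics(expected_hostname, dns_name):
    """Verify the extreme basics: does this machine's name match what
    we expect, and does DNS resolve the given name to an address?"""
    report = {"hostname_ok": socket.gethostname() == expected_hostname}
    try:
        report["resolved_ip"] = socket.gethostbyname(dns_name)
    except socket.gaierror:
        # DNS resolution failed -- one of the fundamentals is broken.
        report["resolved_ip"] = None
    return report

# Example: check against this machine's own name and a name that
# should always resolve locally.
print(check_basics(socket.gethostname(), "localhost"))
```

Nothing fancy, but that’s the point: a failing check here means every assumption built on top of it is suspect.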
To some extent, I can understand this. After all, at some level I’m saying that I don’t trust anything they’ve done or any information that they’ve provided to me. The truth is that I don’t. I’m not saying outright that they’re liars, or that they don’t know what they’re talking about. It’s just that sometimes, the implications of certain configurations can have side effects that they may not be aware of. Let me give you an example.
One of the products I work with is called the Altiris Client Management Suite. I had a customer who swore up and down that the software didn’t work and they were going to throw it out. I came in for a week and sat down to review their processes because, let’s be honest, software like this never just “doesn’t work”. There’s always a reason, and usually that reason is that the customer isn’t following all of the rules and recommendations. That’s a bit nicer than saying they don’t know what they’re doing, but humor me here, I’m trying to be polite.
Before I get started, let me get a tiny bit technical to make sure you understand the issue that I suspected was present. When the Altiris software is installed, it generates a GUID that uniquely identifies it to the main server, which acts as a central console for all of the computers being managed. When this GUID is duplicated on multiple computers, the central console can get confused about which computers it is managing. It doesn’t use the machine name, the serial number, or anything else. Every computer contains its own GUID and uses that GUID to check in for policy updates.
When the GUID is duplicated, the environment ends up with multiple computers that each identify themselves as the same computer. At best, this means that inaccurate inventory information is being reported back to the main console. At worst, the environment becomes almost unmanageable and virtually nothing works.
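To make the failure mode concrete, here’s a hedged sketch (the field names and GUID are hypothetical, not Altiris’s actual data model) of a console that keys its inventory on the agent GUID alone. Two physically distinct machines cloned from the same image collapse into a single record, and each check-in silently overwrites the last.

```python
# Hypothetical console inventory keyed on agent GUID alone.
inventory = {}

def check_in(guid, hostname, serial):
    """Record an agent check-in. The GUID is the only identity key,
    so a duplicated GUID overwrites the previous machine's record."""
    inventory[guid] = {"hostname": hostname, "serial": serial}

# Two distinct machines deployed from the same image share one GUID:
CLONED_GUID = "3f2a0c1e-0000-0000-0000-000000000000"  # placeholder value
check_in(CLONED_GUID, "PC-ACCOUNTING", "SN-1001")
check_in(CLONED_GUID, "PC-WAREHOUSE", "SN-2002")

# The console now believes it manages exactly one computer,
# and its inventory reflects whichever machine checked in last.
print(len(inventory))  # 1
```

This is also why the symptoms look so erratic: which machine the console “sees” depends entirely on check-in order.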
Back to my example: I started with my standard line of questions regarding how they were imaging computers, when they were installing the client agents, and so on. Remember when I said that some customers get irritated at these questions? For three days, I followed various leads trying to determine the problem, and each day, new computers appeared on the network with duplicate IDs. Standard practice at many companies is to create what is called a standard image of the client computers and deploy that same image to every new computer.
If the Altiris agent is installed into that image, the GUID will be duplicated on the network. If it is installed afterward, each computer will create a new GUID and there’s no problem. There’s also a mechanism to reset the GUID during the imaging process, but I knew that step wasn’t being taken: the customer insisted the agent was not installed into the image, and if that were true, there would have been no reason to take it.
Repeatedly I asked for proof that the agent was not installed into the base image. Finally, at the end of the third day, I think someone got irritated enough that, in order to shut me up, they checked the image. It was at this time that they learned the Altiris agent was installed into it, which was causing their problems.
Joel Spolsky alludes to a resolution to this problem in his article “Seven steps to remarkable customer service” with step number two, “Suggest blowing out the dust”. Unfortunately, it doesn’t work so well when you’ve already asked the customer a yes-or-no question, they’ve given you an answer, and there isn’t a way for them to back down gracefully.
That’s a pretty long-winded way of getting to this, but what I’m trying to say is that when you run into technical problems, you need to be assured that the fundamental building blocks your assumptions are based on are valid. This is harder than it sounds, especially when you’re working with other people and you need them to double-check everything that they’ve said and done.
However, I’m not immune to needing a dose of my own medicine. A few months ago, I ran into some pretty serious network routing issues and couldn’t figure them out. If you’ve ever dealt with setting up network routes, you know how difficult they can be to debug. In my lab I have a VMware ESX server that hosts a private domain segmented from the rest of my network. I use it primarily for testing purposes, and I route all traffic into and out of it through the domain controller in that network. For some reason, the network routes stopped working.
I spent nearly two hours trying to figure out why these machines couldn’t reach Google when I could do it without a problem from my desktop. Finally, I got so frustrated that I decided to reboot my router. In the process of pulling the power cable, I watched, dumbfounded, as the cable from my ESX server swung back and forth next to the router, quite obviously not plugged in. I plugged it in and everything worked just fine.
Maybe it’s not stupidity to assume that network cables don’t unplug themselves, but stranger things have happened. The next time things go haywire and stuff just isn’t working, go back to the basics. If the fundamental assumptions are wrong, any correct conclusion that you arrive at is little more than a lucky guess.
A result based on bad basics isn’t just a lucky guess: once the underlying problem is rectified, your “solution” may become the new problem.
I have noticed a general rule that seems to apply to software support: the more serious the symptom, the more basic the problem, and vice versa.