I’ve spent enough time doing Systems Administration that it’s time to vent a bit, point out some common fallacies, and explain why they are anti-patterns, both for being a successful SysAdmin and for the overall health of a tech organization.
“Just restart the service/host.”
If companies wanted this kind of response to a problem, they would just set up an event-driven service that restarts services or hosts whenever a failure is detected. A restart just masks the problem.
- Ask “Why did this fail?”
- If you’re not asking why, you’re behaving much like the oft-maligned “Windows Admin” or “Bastard Operator From Hell (BOFH)”–restarting services or servers and not digging in deeper.
- Look at logs.
- If you’re not looking at logs for a service or a host, you’re missing a huge part of the work that makes Systems Administrators one of the most important roles in a technology organization.
- Look at service and/or host statistics.
- Back up your assertions with data, otherwise you’re just conjecturing.
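As a rough illustration of backing an assertion with data from logs, here is a minimal Python sketch that counts error messages so you can say which failure dominates instead of guessing. The log format (`<timestamp> <LEVEL> <message>`), the sample lines, and the `summarize_errors` helper are all assumptions made up for this example; adapt the pattern to whatever your service actually writes.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<ISO timestamp> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\S+)\s+(DEBUG|INFO|WARN|ERROR|FATAL)\s+(.*)$")


def summarize_errors(lines):
    """Count ERROR/FATAL messages, grouping messages that differ only in numbers."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        _, level, message = match.groups()
        if level in ("ERROR", "FATAL"):
            # Collapse digit runs to "N" so "after 30s" and "after 45s" group together.
            counts[re.sub(r"\d+", "N", message)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:05:12 ERROR upstream timeout after 30s",
    "2024-01-01T10:07:40 ERROR upstream timeout after 45s",
    "2024-01-01T10:09:03 FATAL out of memory",
]

for message, count in summarize_errors(sample).most_common():
    print(count, message)
```

Even a crude count like this turns “it keeps failing” into “two of the three errors are upstream timeouts,” which is a claim someone can act on.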
“It’s not important/it’s intermittent.”
Bullshit. If it’s important enough to page someone, it’s important enough to debug and keep from happening again. If it’s intermittent, there’s still a cause. Don’t stop at the symptoms; find the source.
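One cheap way to start chasing an intermittent failure is to see whether it clusters in time. The sketch below buckets failure timestamps by hour of day. The timestamps are invented and `failures_by_hour` is a hypothetical helper; in practice you would feed it the times from your paging or alerting history.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(timestamps):
    """Return {hour_of_day: failure_count} for ISO 8601 timestamp strings."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return dict(sorted(counts.items()))


# Made-up page times, for illustration only.
page_times = [
    "2024-03-01T02:03:11",
    "2024-03-02T02:01:47",
    "2024-03-04T14:30:00",
    "2024-03-05T02:04:29",
]

print(failures_by_hour(page_times))
```

If most of the failures land in the same hour, as they do in this invented data, the next question is what else runs at that time: a cron job, a backup, a log rotation, a traffic spike.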
“I’m too busy with other stuff.”
Again, bullshit. If you’re too busy, say so. It’s your job to tell someone (your manager, your peers, your employees, whoever) that you’re over-subscribed. Anything worth doing is worth doing right, with as close to your full attention as possible, so speak up.
“I don’t know how to do this.”
Then you’ve just been gifted with a learning opportunity. The best SysAdmins embrace learning opportunities rather than avoid them.
- Read documentation.
- Look at logs.
- Debug the problem. Try to duplicate it. See how it differs given different conditions.
- Understand the services and hosts that you’re responsible for.
- Be thorough and verbose in your investigation.
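The “try to duplicate it under different conditions” step can be made systematic. This is only a sketch: `find_failing_conditions` and `fake_check` are hypothetical names, and `fake_check` stands in for whatever real probe you would run against your service (a request, a query, a job submission).

```python
from itertools import product


def find_failing_conditions(check, conditions):
    """Run check(**combo) for every combination of conditions; return the combos that fail."""
    names = list(conditions)
    failures = []
    for values in product(*(conditions[name] for name in names)):
        combo = dict(zip(names, values))
        try:
            ok = check(**combo)
        except Exception:
            ok = False  # treat an exception as a failed run
        if not ok:
            failures.append(combo)
    return failures


def fake_check(payload_kb, concurrency):
    """Hypothetical stand-in for a real probe: fails only under large payloads AND high concurrency."""
    return not (payload_kb >= 512 and concurrency >= 8)


failing = find_failing_conditions(
    fake_check,
    {"payload_kb": [64, 512], "concurrency": [1, 8]},
)
print(failing)
```

Writing the conditions down as a grid also gives you the verbose record of what you tried, which is most of what a good bug report needs.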
“This is never going to be fixed, so why bother?”
Your entire job is predicated on giving decision-makers context about the impact of bugs, flaws, and failures. If you’re not doing that, or aren’t interested in doing it, then why are you even working?
- File bugs/stories/incidents.
- Provide data to back up your assertions/findings.
- Dig in. Look at logs and statistics for commonalities and outliers.
- If people aren’t listening, take it up to the next level. Find the Product Owners or Product Managers.
- Explain why it’s a problem and what they could do to resolve it.
- Talk about the cost of the problem in terms of payroll hours wasted (this generally gets people’s attention).
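To put a number on the payroll-hours argument, a back-of-the-envelope calculation is usually enough. Every figure below is invented for illustration, and `payroll_cost` is a hypothetical helper; substitute your own incident counts and a loaded hourly rate that your organization would accept.

```python
def payroll_cost(incidents_per_month, hours_per_incident, people_involved, hourly_rate):
    """Return (hours, cost) spent per month on a recurring problem."""
    hours = incidents_per_month * hours_per_incident * people_involved
    return hours, hours * hourly_rate


# Illustrative numbers only: 6 pages a month, 1.5 hours each, 2 people pulled in.
hours, cost = payroll_cost(
    incidents_per_month=6,
    hours_per_incident=1.5,
    people_involved=2,
    hourly_rate=75.0,
)
print(f"{hours:.1f} hours/month, about ${cost:,.2f}/month")
```

This deliberately ignores context-switching and lost sleep, so treat the result as a floor rather than the real cost.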
These are definitely common sins, but this is by no means a complete list. What do you think? Have any you’d like to share?