A while back I decided I wanted to put down on paper some of the lessons I learned working as a systems admin, including time in an enterprise environment with thousands of servers. Some lessons I learned in small shops, some I learned at bigger shops. This is going to be the first in my “Ops 101” series, and part of a broader series of essays about lessons I’ve learned.
Process. Procedures. We all hate doing them. Without them, however, all the other stuff never gets done. Sure, you’ll start off documenting the changes you make and checking in that script, but unless there’s a process in place and accountability to back that process up, a couple of months down the line you’re going to be asking, “Wait, where does that script run again?” and “That was upgraded? Since when?!” Without good process, you’ll lose all of those good habits that you’ll thank god you have when things start to break.
Change management basics
Here is everything you need for a successful change management system:
- A meeting. Sorry: you’ll just have to live with it. 30 minutes a day won’t kill you.
- A ticketing system
- Signoff from the ops team that they will use it 100% of the time
- Signoff from the rest of the company that they will only escalate things using the ticketing system
Changes fall into three basic buckets: Emergency Change Requests (ECR), (Scheduled) Change Requests (CR), and Standard Operating Procedures (SOP). These should be fairly self-explanatory: ECRs happen in emergencies. If an army of zombies breaks into the office, your first thought should rightly be: how do I deal with the zombies? Afterwards, providing you live, you would create an ECR. This will enable the next guy to do a quick search for “zombies” in the ticketing system and see that there is an emergency shotgun hidden behind the UPS in the server room, thus not having to go through the harrowing ordeal of sacrificing all of those sales guys before remembering it was there. He could, instead, just sacrifice *some* of the sales guys.
CRs will come up in a lot of different ways. Client Services will request things. Deployments will need to be done. Sysadmins will think of better ways to do things. Change happens. Anything you don’t need to do *right now* goes into the CR bucket. You look at these in your change meeting, decide if and when they should be executed, and decide whether the process of the change can be improved. Then you dole them out to your various system admins to do the actual work.
SOPs are the basic “can you run that script that you put together that fixes the mailserver again?” type stuff. You do it often enough that it’s “no big deal”. The ticket is just there a) for tracking purposes and b) so you can look it up when the guy who runs the script is out. These will eventually be a good chunk of the stuff in your wiki, but more on that later.
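The three buckets can be sketched as a tiny triage helper. This is only an illustration: the function name classify_change and its two yes/no arguments are made up for this example and are not part of any ticketing system.

```shell
#!/bin/sh
# Hypothetical helper: classify_change URGENT ROUTINE
#   URGENT  - "yes" if the work has to happen right now
#   ROUTINE - "yes" if it is a known procedure you run often
# Prints ECR, SOP or CR, following the three buckets described above.
classify_change() {
    urgent=$1
    routine=$2
    if [ "$urgent" = "yes" ]; then
        # Emergencies come first: act now, document afterwards.
        echo "ECR"
    elif [ "$routine" = "yes" ]; then
        # Frequently repeated, "no big deal" work.
        echo "SOP"
    else
        # Everything that does not need to happen right now.
        echo "CR"
    fi
}
```

Calling classify_change no no prints CR, which is where most requests should end up before the change meeting.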
The Change Review Meeting
ECRs generally get phone approval from someone in management and then are documented afterward. They generally are followed by a meeting to explain what the heck happened, and a formal root cause analysis.
The agenda is simple: Is it approved? Who’s doing it? When do we want to do it? Next. This should be a quick meeting. It won’t be. Who should be there: the Senior Systems Admin, the Director of Operations and a representative from each of the other teams. Dev and CS, at least, should have a seat at the table; others are probably optional.
There are many ticketing systems out there, and as you grow you might want to look at purchasing a commercial one or making modifications to your existing one, but given the price tag of free and how widely used it is, I’d recommend RT [i] from bestpractical.com. It’s simple, it’s good and it works.
RPM Install page [ii]
Current version-release: 3.4.5-2
Summary: This aims to be the solution for an easy RPM install of RT on RHEL4/CentOS4. (Although these packages have been reported to run under Fedora Core 4, it seems Fedora now has its own; see the section below.)
Old releases: rt-3.0.10-3 is still available under the 3.0.x directory.
WARNING: These packages were built on the assumption that SELinux is turned off (*Any help on making them support both modes would be great!!!*).
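Since the packages assume SELinux is off, it may be worth checking the current mode before installing. The following is a minimal sketch: selinux_is_off is a hypothetical helper name, and treating a missing getenforce command as "SELinux off" is an assumption of this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the RT packages.
# Succeeds only when the given mode string is "Disabled".
selinux_is_off() {
    [ "$1" = "Disabled" ]
}

# getenforce prints Enforcing, Permissive or Disabled.
# Assumption: if getenforce is missing, the SELinux tools are not
# installed, so the system is treated as having SELinux off.
if command -v getenforce >/dev/null 2>&1; then
    mode=$(getenforce)
else
    mode="Disabled"
fi

if selinux_is_off "$mode"; then
    echo "SELinux is off; this matches what the packages assume."
else
    echo "SELinux mode is $mode; the packages were not built for this." >&2
fi
```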
rt
It was built with mysql and apache2/modperl2 (2.0.1). It has no patches at the moment, but may get some to correct known problems. To see details at any time, run:
rpm -qp --changelog rt-<version>-<release>.noarch.rpm
rt-mail-dispatcher
This is a setup for an RT mail dispatcher using sendmail and procmail. It is based on the assumption that you use one domain for all your RT queues, e.g. @rt.yourdomain.com.
This allows you to setup queues in RT, using the following convention syntax:
without having to reconfigure your mail settings every time.
‘postmaster’ is reserved to be RFC822 compliant and should be set up correctly; it defaults to user postmaster. You can always change it to be an RT queue as well.
With yum (http://linux.duke.edu/projects/yum/download.ptml)
RT’s three-step install procedure:
- Download the file: http://campus.fct.unl.pt/paulomatos/rt/repository/3.4.x/rt-3.4.x.repo
- Copy it to /etc/yum.repos.d/ or run: cat rt-3.4.x.repo >> /etc/yum.conf
- Run: yum install rt rt-mail-dispatcher
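The choice in the second step (drop the repo file into /etc/yum.repos.d/, or append it to /etc/yum.conf) could be scripted roughly as follows. This is a sketch under assumptions: install_repo_file is a hypothetical name, and preferring the repos directory whenever it exists is a design choice of this example, not something the install page prescribes.

```shell
#!/bin/sh
# Hypothetical helper: install_repo_file FILE [REPOS_DIR] [YUM_CONF]
# Copies FILE into REPOS_DIR when that directory exists; otherwise
# appends its contents to YUM_CONF. The defaults are the paths named
# in the steps above.
install_repo_file() {
    repo_file=$1
    repos_dir=${2:-/etc/yum.repos.d}
    yum_conf=${3:-/etc/yum.conf}
    if [ -d "$repos_dir" ]; then
        cp "$repo_file" "$repos_dir/"
    else
        cat "$repo_file" >> "$yum_conf"
    fi
}

# Example usage (needs root, so it is left commented out):
# install_repo_file rt-3.4.x.repo
```

The optional second and third arguments exist mainly so the function can be tried against scratch directories before touching the real yum configuration.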
Note: Depending upon which Perl modules you had installed in the past, you may have to update before installing via yum. If a whole lot of dependency errors display when you run yum install, then type the following:
yum install rt rt-mail-dispatcher
Just download everything to a directory and do:
rpm -Uvh *.rpm
A user pointed out to me that he was in such a hurry to try it out that he lost the messages that appeared after install. He also suggested I create a file with those messages inside. Meanwhile, here they are:
generate an editable site config file.
You must now configure RT by editing /etc/rt/RT_SiteConfig.pm and
you will definitely need to set RT’s database password before continuing.
Not doing so could be very
After that, you need to initialize RT’s database by running
If something goes wrong you can always drop everything, by executing
You must now configure some things by editing /var/rt/home/.procmailrc,
i – http://bestpractical.com/