
How to make a product "sysadmin-friendly"?

CHIMIT (2007, 2008, 2009) is a conference for researchers who study the habits and workflows of IT workers in an effort to find ways to make them more productive (they call this "human factors in IT"). Anyone trying to make my work easier is alright in my book.

At the most recent conference I moderated a panel of system administrators who had been in the audience watching the first day of presentations. It was our turn to "speak up" about what we had seen.

One of the useful things that came out of this panel was a list of "signs that a product was designed to be easy for system administrators to install and maintain."

Here is a short version of the list:

  • has a command line interface
  • has an API so it can be remotely administered
  • has a "silent install" mode so it can be cleanly deployed
  • has a config file that is ASCII so it can be stored in a revision control system; and the same file can be input INTO the system.
  • has a clearly defined way to do backups and restores.
  • has a clean way to monitor for up/down issues (know when there is an emergency) AND vital statistics that relate to scaling/latency (know how to debug slowness) AND historical monitoring (be able to predict far in advance when we need to buy more capacity)
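To make the config-file item concrete, here is a minimal sketch of a tool whose command line can dump its settings as plain text and read the same text back in. Everything named here is an assumption for illustration: the program name `exampled`, the `--dump-config` and `--load-config` flags, and the settings themselves do not come from any real product.

```python
import argparse
import configparser
import io

# Hypothetical settings for an imaginary daemon called "exampled".
DEFAULTS = {"server": {"listen_port": "8080", "log_level": "INFO"}}


def dump_config(settings):
    """Render settings as plain text that diffs cleanly in revision control."""
    parser = configparser.ConfigParser()
    parser.read_dict(settings)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


def load_config(text):
    """Parse the same text format back into a dict of sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def main(argv=None):
    """Command line entry point: load a config file, dump one, or both."""
    cli = argparse.ArgumentParser(prog="exampled")
    cli.add_argument("--dump-config", action="store_true",
                     help="print the active configuration and exit")
    cli.add_argument("--load-config", metavar="FILE",
                     help="read configuration from FILE")
    args = cli.parse_args(argv)
    settings = DEFAULTS
    if args.load_config:
        with open(args.load_config, encoding="ascii") as fh:
            settings = load_config(fh.read())
    if args.dump_config:
        print(dump_config(settings), end="")
    return settings
```

The point of the round trip is that whatever the tool writes out, it can also take back in, so the file in revision control is the real configuration rather than a report about it.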

What would you add to this list?

Posted by Tom Limoncelli in Academic study of SA

15 Comments

I'd add a couple things:

* Be amenable to packaging or installation in another directory. I'm thinking of a couple packages I've had to work with (one programming library, one protein modelling tool) that insist on being run/linked to from the directory where they're unpacked or compiled. Want to turn it into an RPM, or copy it into /opt? Sorry...not supported. (The programming library, in particular, produced a spaghetti-like mess of symbolic links, only some of which used relative paths, during compilation. You ended up with a 1.5 GB directory that couldn't be moved.)

* Do not insist on being run as root -- and if you do, be explicit about it. The documentation for one scientific package hinted that you could run it as an ordinary, non-privileged user -- and since it was basically a web-based search engine, there's certainly nothing requiring root privileges. I set up everything to run as a separate user, only to find out that much of the package's initialization assumed you were running as root and hard-coded that everywhere ("ssh root@othermachine -t 'install_me'"). My bad for not checking beforehand; their bad for using root and not saying so.

* For bonus points, generate an installation config file that's based on the interactive one you just did. I'm thinking of how Red Hat/CentOS will drop a kickstart file into your newly-installed system, which lets you use that as a starting point for new installs or further customization. I wish that Red Hat Directory Server did the same.

* Must contain configurable automatic log rotation and purging.

* should be able to change log verbosity easily
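The two logging requests above (configurable rotation and purging, and easy verbosity changes) might look like this in a Python-based product. This is only a sketch using the standard library's `RotatingFileHandler`; the logger name `exampled` and the parameter names are assumptions, and a real product would read them from its config file.

```python
import logging
import logging.handlers


def build_logger(path, level="INFO", max_bytes=1_000_000, keep=5):
    """Create a logger whose size, retention and verbosity are all settings.

    max_bytes: rotate the file once it grows past this size.
    keep:      number of rotated files to retain; older ones are purged.
    level:     verbosity name such as "DEBUG", "INFO" or "WARNING".
    """
    logger = logging.getLogger("exampled")  # hypothetical product name
    logger.setLevel(level.upper())
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=keep)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.handlers = [handler]  # replace any handlers from an earlier call
    return logger
```

With `keep=2` the directory never holds more than the live file plus two rotated copies, so nobody has to write a cron job to clean up after the product.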

I would add...if your software is going to be licensed, don't tie it to something physical on that machine. I've seen licensed software that refused to run on any machine with a different MAC address than the one it was registered for. No big deal until the machine fails and you're scrambling for a replacement.

Further to 'silent install': I want OS packages so I can easily back the software out, upgrade it, or manage with Puppet/CfEngine/Chef.

I think it's a bit late in the day to put up with shar files, etc.

* Have versioned protocols (if required)
* Have an operational manual detailing major component operations, configuration options, and more, distributed in text with the package
* Detail how/if hot upgrades can be done
* Detail any clustering semantics/high availability
* Detail how state is maintained and where
* Have sane defaults that the usage message/man page explains

The installation files should be available in a fairly up to date version. I don't want to patch or "slipstream" service packs into the installation media. That's the vendor's job.

If the product has a mechanism to control licensing, it should be very easy to use, support automated installs and not require "activation" with the vendor.

Logging should work:
- Easy to centralize (e.g. syslog)
- Relevant IDs logged, so objects can be tracked in the system.
- Each log entry should have some meaning without depending on another entry.
- Logging should be focused on what the sysadmin needs, not what the developer of the product needs to debug.
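As a rough illustration of the first three points, the sketch below builds log messages that carry their own IDs and sends them to syslog with Python's standard `SysLogHandler`. The collector address, the logger name and the field names (`queue_id`, `user_id`) are assumptions made up for the example.

```python
import logging
import logging.handlers


def describe_event(action, **ids):
    """Build one self-contained message: the action plus every relevant ID."""
    fields = " ".join(f"{key}={value}" for key, value in sorted(ids.items()))
    return f"{action} {fields}".strip()


def syslog_logger(address=("localhost", 514)):
    """Return a logger that forwards to a syslog collector over UDP.

    The address is an assumption; point it at your central log host.
    """
    logger = logging.getLogger("exampled.syslog")  # hypothetical name
    handler = logging.handlers.SysLogHandler(address=address)
    handler.setFormatter(
        logging.Formatter("exampled: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would be something like `syslog_logger().info(describe_event("message_delivered", queue_id="A1B2", user_id=1004))`. Each entry names the objects involved, so it still means something when read on its own.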

The product should follow the operating system's best practices for updates. RPM/YUM for RedHat, Deb/Apt for Debian, MSI/MSP for Windows...

The vendor should have a way to warn the customers about security issues. A separate mailing list or RSS feed are good examples.

Network applications should have a short document with port numbers and such, to give to "the firewall guys".

The documentation should be available, linkable and findable on the web. If someone blogs about a solution to a problem, they should be able to link directly to the relevant documentation.

* Provide detailed procedures for changing passwords/keys of service accounts and process-to-process communications.
* Document all configuration options with context as to their expected use. Provide examples of the expected scenarios in which every non-default option would be useful.
* Support multiple, flexible options for user provisioning, deprovisioning, single sign-on, and multi-factor authentication.
* Security logs should be configurable to easily ascertain who did what and when; for both user and admin activities; including reads, data creation, writes, deletes, permissions changes, metadata changes, etc.; user ID numbers in addition to user names; and a privacy option to disable logging of names, IPs, user-agents, etc.
* For web interfaces, generate a self-signed certificate during install and only enable SSL connections by default. Make it easy to replace the self-signed certificate with a trusted one through the web-based admin interface.
* Check and warn on unexpected or risky file system permissions during install and start.
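The last item, checking permissions at install and start, can be quite small. A minimal sketch for Unix-like systems, assuming the product knows which of its own files matter (the function name and the choice of checks are illustrative, not a complete audit):

```python
import os
import stat


def risky_permissions(path):
    """Return warnings for one file; an empty list means nothing suspicious."""
    mode = os.stat(path).st_mode
    warnings = []
    if mode & stat.S_IWOTH:
        warnings.append(f"{path} is world-writable")
    if mode & stat.S_IWGRP:
        warnings.append(f"{path} is group-writable")
    return warnings
```

A start-up script could run this over the config file and any key material, log each warning, and carry on, leaving the decision to the admin.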

Don't assume that the software can write all over the filesystem, or even in its own directory. Even better, don't assume it can make *any* changes to its install directory ever.

Don't incorporate pieces of lots of applications or support libraries provided by the system. Alternatively, *include* specific versions of software your software is dependent on so it can be installed and run independent of the OS. (I've got packages that are annoying in both directions.)

If it has a license manager or lock files, allow you to specify where those pieces are installed or where they write those files. (I have scientific software that uses its own license management system (different and nonstandard for each package, natch) and that insists on writing lockfiles to /var/tmp, even though that's not the standard location on my OS, which means I've had to spend a bunch of time writing scripts to make sure those files don't get "cleaned up" by the system.)

Remember that logging is an interface/API for consumption by systems administrators. Enable run-time manipulation of logging levels without restarting the application. Ensure logging is easily parsable by line-based filtering tools to enable easy searching; multi-line logs such as stack traces require adding state detection into parsing scripts. Allow correlation of events in complex systems via logs (e.g. log thread IDs, process IDs) so that it's possible to trace through the context before an issue occurred.
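A sketch of what that could look like with Python's standard `logging` module: a formatter that folds stack traces onto a single line, a format string that includes process and thread IDs, and a helper that changes verbosity on a running logger. The ` | ` separator and the names `OneLineFormatter`, `LINE_FORMAT` and `set_level` are my own choices for illustration.

```python
import logging

# Process and thread IDs let you correlate entries from a busy system.
LINE_FORMAT = "%(asctime)s pid=%(process)d tid=%(thread)d %(levelname)s %(message)s"


class OneLineFormatter(logging.Formatter):
    """Keep every entry on one line so grep and awk see whole events."""

    def format(self, record):
        text = super().format(record)
        # The base class appends tracebacks on new lines; fold them in.
        return text.replace("\n", " | ")


def set_level(logger, name):
    """Change verbosity on a running logger, with no restart required."""
    logger.setLevel(name.upper())
```

How `set_level` gets called at run time (a signal, an admin command, a re-read config file) is up to the product; the important part is that it does not need a restart.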

If possible don't require a desktop based client programme to administer it.

If you use a web-based management interface, make sure it works with IE, Firefox and Safari, on Windows, Linux and Mac desktops. Avoid using Java in the browser.

If the program uses http then it should follow the http_proxy environment variable and/or a configured http proxy.
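In Python, for instance, the standard library already knows how to read `http_proxy` and friends from the environment, so honoring it takes very little. A minimal sketch, where `proxy_for` and its `configured` parameter are hypothetical names for the product's own lookup:

```python
import urllib.request


def proxy_for(scheme="http", configured=None):
    """Pick the proxy to use for a URL scheme.

    An explicitly configured proxy wins; otherwise fall back to the
    environment (http_proxy, https_proxy, ...). Returns None when
    neither is set.
    """
    if configured:
        return configured
    return urllib.request.getproxies().get(scheme)
```

Letting the configured value override the environment is a design choice made for this example; either order is fine as long as it is documented.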

Please make it easy to run multiple copies of the app on the same server listening to different IPs/ports. Make sure all directories are configurable.
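One way to get there is to make every address and path a start-up option with a sensible default. The sketch below assumes an imaginary daemon called `exampled`; the option names and default paths are made up for illustration.

```python
import argparse
from pathlib import Path


def parse_instance_args(argv):
    """Parse per-instance settings so two copies never share anything."""
    cli = argparse.ArgumentParser(prog="exampled")  # hypothetical daemon
    cli.add_argument("--listen", default="127.0.0.1",
                     help="IP address to bind")
    cli.add_argument("--port", type=int, default=8080,
                     help="TCP port to bind")
    cli.add_argument("--data-dir", type=Path,
                     default=Path("/var/lib/exampled"))
    cli.add_argument("--log-dir", type=Path,
                     default=Path("/var/log/exampled"))
    cli.add_argument("--pid-file", type=Path,
                     default=Path("/var/run/exampled.pid"))
    return cli.parse_args(argv)
```

A second instance is then just a second command line, for example `--port 8082 --data-dir /srv/b --log-dir /srv/b/log --pid-file /srv/b/exampled.pid`.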

Have a simple healthcheck method that can be accessed by simple healthcheck scripts or load balancers.
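For an HTTP-speaking product this can be a single URL that returns 200 when healthy and 503 when not. A minimal sketch using Python's built-in `http.server`; the `/healthz` path and the `is_healthy` callback are assumptions, and what "healthy" means is for the product to decide.

```python
import http.server
import threading


class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Answer GET /healthz with 200 "ok" or 503 "unhealthy"."""

    def do_GET(self):
        if self.path == "/healthz":
            ok = self.server.is_healthy()
            body = b"ok\n" if ok else b"unhealthy\n"
            self.send_response(200 if ok else 503)
        else:
            body = b"not found\n"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Stay quiet: load balancers poll this every few seconds.
        pass


def start_health_server(is_healthy, host="127.0.0.1", port=0):
    """Serve the healthcheck in a background thread; port 0 picks a free port."""
    server = http.server.ThreadingHTTPServer((host, port), HealthHandler)
    server.is_healthy = is_healthy  # callback supplied by the application
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A monitoring script or load balancer only has to look at the status code, which is about as simple as a check can get.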

If the connection to a remote resource stops working, then drop the connection and restart it. Failure to do this causes all sorts of problems with load balancers and firewalls that occasionally purge long-lived connections.
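In code, "drop and restart" could look roughly like the sketch below. The class name and the `connect` factory are hypothetical; `connect` stands in for whatever opens a fresh socket-like object with `sendall()` and `close()`.

```python
import time


class ReconnectingClient:
    """Send data over a connection, replacing it whenever it goes stale."""

    def __init__(self, connect, max_attempts=3, delay=0.5):
        self._connect = connect  # hypothetical factory for new connections
        self._conn = None
        self._max_attempts = max_attempts
        self._delay = delay

    def _drop(self):
        """Close and forget the current connection, ignoring close errors."""
        if self._conn is not None:
            try:
                self._conn.close()
            except OSError:
                pass
            self._conn = None

    def send(self, payload):
        """Try to send; on failure drop the connection and open a new one."""
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                self._conn.sendall(payload)
                return
            except OSError as exc:
                last_error = exc
                self._drop()
                time.sleep(self._delay)
        raise ConnectionError("giving up after repeated failures") from last_error
```

The important behavior is that a failed send never leaves the client holding a dead socket: the old connection is closed before the next attempt.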

On a server, air intake, buttons and lights go on the front. On the back you have cables and air exhaust. Network management and serial (9 pin, 9600 8N1) are useful.

Use a racking system that supports different types and depths of racks, and that can be easily installed by one or two people (I like HP's) in minutes with minimal screws, nuts and bolts.

Netapp has/had a webpage with how many sites/servers were running each version of their software with average uptime for each version. This is a great way to encourage sites to upgrade by showing them that version works on other sites.

Fully qualified datestamps in logs. I don't want to have to guess whether you are logging in localtime or UTC or something else.
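For example, an ISO 8601 timestamp with an explicit UTC offset leaves nothing to guess. A small sketch in Python; the helper name `log_timestamp` is made up, and logging in UTC is just one reasonable choice (local time is fine too, as long as the offset is printed).

```python
from datetime import datetime, timezone


def log_timestamp(moment=None):
    """Return an ISO 8601 timestamp in UTC with the offset spelled out."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")
```

Called with `datetime(2009, 11, 9, 3, 0, tzinfo=timezone.utc)`, this returns `2009-11-09T03:00:00+00:00`.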

Do not run anything as root unless it is genuinely needed. If something needs to run as root, isolate it from all non-root, so it can be run separately by the system administrator. Document exactly why you need root.

Do not require root to install user space code. Do not ever use OS packages for user space applications. If you must package it into a package manager, use an application package manager separate from the OS package manager so that someone who isn't authorized for full root can do the install.

Do not presume that any portions running as root will live in the same directory as anything else, they are likely to be quarantined to prevent abuse.

Make all authentication mechanisms pluggable so the site can plug in something better than passwords.

Don't lock to a single version of java|perl that comes with the OS and expect that version never gets patched. If you must use java|perl, make it trivially relocatable or provide your own copy.

A Windows app should put all of its files under c:\program files\%parent folder%.

No files in \Windows or its subfolders.

If it's software, and has an external configuration file that's editable by the user (basically, anything that reads its config from /etc/):

* Re-read the configuration file on SIGHUP, SIGUSR1 or SIGUSR2
* Document this behaviour in the man page, along with any configuration settings that aren't re-read for any reason
* Log when this happens via the configured/default logging interface
* In the man page, document and justify why this *isn't* the case *if* it isn't.

* Re-read different areas of the configuration file on SIGHUP, SIGUSR1 or SIGUSR2, if it makes sense to separate a "minor restart" from a "major restart"
* Document which signals correspond to re-reading which configuration variables in the man page
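A bare-bones version of the SIGHUP behavior described above, sketched in Python for a Unix-like system. The `key = value` file format, the class name and the logger name are assumptions for the example; a real daemon would also validate the new settings before swapping them in.

```python
import logging
import signal


def load_key_values(path):
    """Read a simple "key = value" config file, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="ascii") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                values[key.strip()] = value.strip()
    return values


class ReloadableConfig:
    """Hold the current settings and re-read them when asked."""

    def __init__(self, path, loader=load_key_values):
        self.path = path
        self._loader = loader
        self.values = loader(path)

    def reload(self, signum=None, frame=None):
        """Signal-handler-compatible: re-read the file and log that we did."""
        self.values = self._loader(self.path)
        logging.getLogger("exampled").info("re-read configuration from %s", self.path)

    def install(self):
        """Re-read the configuration on SIGHUP (call from the main thread)."""
        signal.signal(signal.SIGHUP, self.reload)
```

After `install()`, a plain `kill -HUP <pid>` picks up edits to the file, and the log entry tells the admin it actually happened.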

This actually lines up with two posts I've been working on for my (very neglected) blog, dealing with some appliances I've been working with lately...

* If it *needs* something to be done with client software, write in Java or something else cross-platform. Working in an all Linux/Unix shop, there's nothing worse than buying an appliance *because it runs Linux* and then finding I need Windows-only software to administer it.

* Do things logically in line with your base system. If your product runs Linux, present an interface and documentation that can be understood by Linux people - don't deviate from LSB without good reason, don't rename normal binaries, don't invent new terms without explaining the common terms they relate to.

* MOST IMPORTANTLY: Keep the insides of your software/systems logical. There's nothing worse than having an appliance fail at 3 AM, using that undocumented method to get root that one of the field techs showed you, and then finding out that the innards look NOTHING like the CentOS 5.2 it was based on.

A corollary to that is: If you're selling a proprietary appliance, your customers already signed a license agreement. Unless you're prepared to provide 24/7 on-site support, give them root.
