The LOPSA-East talks schedule was published yesterday. It is broken into 4 tracks: DevOps, Infrastructure, Career Development and "General". I'm impressed! The DevOps Track has a lot of good culture talks, best practices, and big names like Mandi Walls. The Infrastructure Track has case studies as well as talks about how to do it yourself. The Professional/Career Talks Track has a mix of sessions for both junior and senior people. The "General" Track has a huge diversity: network (the hardware kind), networking (the community kind), "lightning talks" and more.
There are also a lot of excellent training classes, which I'll write about in another post. Plus there will be 2 keynote speakers announced soon.
Registration is open! Sign up today!
If you are taking any of my classes, you can use the discount code "CasIT14-Presenter-guest" to get an additional 10% discount on a 2-day registration for the conference.
The code expires at midnight on Saturday, 1 MAR 2014.
You don't have to be taking any of my classes to use the code.
What's with the trend of making user interfaces that hide until you mouse over them and then they spring out at you?
How did every darn company hop on this trend at the same time? Is there a name for this school of design? Was there a trendy book that I missed? Is there some UI blog encouraging this?
For example, look at the new Gmail editor. To find half the functions, you need to be smart enough or lucky enough to move the mouse over the right part of the editor for them to appear. Microsoft, Facebook, and all the big names are just as guilty.
I get it. The old way was to show everything but "grey out" the parts that weren't appropriate at the time. People are baffled by seeing all the options, even the ones they can't use. I get it. I really do. Showing only what people can actually use should, in theory, be a lot better.
However, we've gone too far in the other direction. I recently tried to help someone use a particular web-based system and he literally couldn't find the button I was talking about because our mice were hovering over different parts of the screen and we were seeing different user interface elements.
Most importantly the new user interfaces are "jumpy". When you move the mouse across the screen (say, to click on the top menu bar) the windows you pass over all jump and flip and pop out at you. It is unnerving. As someone that already has a nervous and jittery personality, I don't need my UI to compete with me for being more jumpy, nervous and jittery.
I'm not against innovation. I like the fact that these designs give the user more "document space" by moving clutter out of the way. I understand that too many choices is stifling to people. I read The Paradox of Choice before most people. I swear... I get it!
But shouldn't there be a "reveal all" button that shows all the buttons, or changes the color of all the "hover areas", for those of us who didn't think of moving the mouse to the-left-side-of-the-screen-but-not-the-top-one-inch-because-for-some-reason-that-isn't-part-of-the-hover-space-oh-my-god-that-prevented-me-from-finding-an-option-for-two-months?
Why can't there be a way to achieve these goals without making a user interface that is jumpy and jittery?
User interfaces should exude confidence. They should be so responsive that they snap. Applications that are jumpy and jittery look nervous, uncomfortable, and unsure.
I can't trust my data to a UI that looks that way.
The first issue of The FreeBSD Journal has finally shipped!
I got to read an early draft of the first issue and I was quite impressed by the content. It was a great way to learn what's new and interesting with FreeBSD plus read extended articles about specific FreeBSD technologies such as ZFS, DTrace and more.
Even if you don't use FreeBSD, this is a great way to learn about Unix in general and expand your knowledge of advanced computing technologies.
The Journal is a brand new, professionally produced, on-line magazine available from the various app stores, including Apple iTunes, Google Play, and Amazon Kindle.
Issue #1 is dedicated to FreeBSD 10 and contains articles not only about the latest release but also about running FreeBSD on the ARM-based Beagle Bone Black, working with ZFS, and all about our new compiler tool chain.
The Journal is guided by an editorial board made up of people you know from across the FreeBSD community including: John Baldwin, Daichi Goto, Joseph Kong, Dru Lavigne, Michael W. Lucas, Kirk McKusick, George Neville-Neil, Hiroki Sato, and Robert Watson. It is published 6 times a year.
You can subscribe by going to any of the following links:
The Journal is supported by the FreeBSD Foundation.
I was reminded of this excellent blog post by Leon Fayer of OmniTI.
As software developers, we often think our job is to develop software, but, really, that is just the means to an end, and the end is to empower business to reach their goals. Your code may be elegant, but if it doesn't meet the objectives (be they time or business) it doesn't f*ing work.
Likewise, I've seen sysadmin projects that spent so much time in the planning stage that they were never going to ship unless someone stood up and said, "We've planned enough. I'm going to start coding whether you like it or not." Yes, that means some aspect of the design wasn't perfect. Yet the suggestion that more planning would eliminate all design imperfections is simply hubris. (If not hubris, it is a sign that one's OCD or OCD-like tendencies are being used as a cowardly excuse not to get started.)
But what I really want to write about today is...
"The big project that won't ship".
There once was a team that had a large software base. One part of it was obsolete and needed to be rewritten. It was written in an unsupported language. It didn't have half the features it needed. It didn't even have a GUI.
There were two proposals:
One was to refactor and recode bits of it until the system was replaced. Along the way every few weeks the results would see the light of day. There were many milestones: add a read-only "viewer" GUI, build a better data storage system, refactor the old code to use the new GUI, enhance the GUI to include full editing, etc.
The competing proposal was to assign 4 developers to build a replacement system. They'd be given 2 years to write the new system from scratch. During that time they'd be protected and, essentially, hidden. The justification was that the old system was so broken that any kind of Frankenstein half-old, half-new system would be flatly impossible, or at least a drag on efficiency. It would be more efficient to code it "pure" without constantly dealing with the old system.
Management approved the competing proposal. A year and a half later the project hadn't gotten anywhere. When people were needed for other projects, management looked around and decided to steal the 4 engineers. This is because it is good management to take resources away from low-priority projects and put them on high-priority projects. Any project, no matter how noble, that has shown no results for 18 months is lower priority than a project with a burning need. In fact, the definition of a low-priority project is one whose results you can wait 2 years for.
The project was cancelled and 1.5 years of work was thrown away. 4 engineers times 18 months... at least a million dollars down the tube.
Meanwhile the person that proposed the incremental project had gone forward in parallel with the first milestone: a simple enhancement to the existing system that solved the biggest complaint of the system. It talked to the old datastore and would have to be re-engineered when the new datastore was finally available, but it worked and solved a very serious problem. It was a "half measure" but served its purpose.
The person who created the "half measure" had been scolded for wasting time on a parallel project. Yet the "big" project was cancelled and this "half measure" is still in use today. At least he had the grace not to say "I told you so."
The biggest "cost" to a company is opportunity cost. That is, the loss of $$$ from not taking action. By shipping early and often you grab opportunity.
Imagine a factory that made widgets for 24 months, stored them in a warehouse, and then started shipping them all at once. That would be crazy. A factory sells what they make as soon as they are manufactured. Software companies used to write code for years and then ship it. That was crazy. Now you make a minimum viable product, ship that, and use the knowledge gained to make the next iteration.
My career advice is to only do projects that produce usable output every few weeks or months. Being on a project that will not show any results for a year or more is a good way to hide from management, and being invisible is a career killer. For software projects this means setting early milestones for some kind of minimum viable product. For purely operational projects, make sure you can announce milestones or progress (number of machines converted, milliseconds of latency improvement, etc.).
At StackExchange there is a big project coming up related to how we provision new machines. While a "green field" approach would be nice, I'm looking into how we can refactor the current cruddy bits so that we can do this project incrementally. The biggest problem is that we have a crappy CMDB with no API. Everything seems to touch that one element and replacing it is going to be a pain. (I'd like to evaluate Flipkart/HostDb if anyone has opinions, let me know.) However I think we can restructure the project into 5 independent milestones. By "independent" I mean they can be done in any order with the other 4 requiring minimal refactoring as a result.
This approach has a few benefits: we get the value of each milestone as it happens; certain milestones can be done in parallel by different sub-teams; and if the first few completed milestones make the process "good enough", we don't have to do the rest.
The schedule of talks and tutorials has been published!
- Talks: http://lopsa-east.org/2014/lopsa-east-14-talks/
- Tutorials: http://lopsa-east.org/2014/lopsa-east-14-training-schedule/
I'm glad to announce that I'll be teaching 2 tutorials and giving 2 talks: "Tom's Top 5 Time Management Tips" and "Book Preview: The Practice of Cloud Administration".
My tutorials include "Evil Genius 101", which was standing-room only last year, plus "Intro to Time Management for System Administrators", which hasn't been taught at LOPSA-East in quite a few years.
Registration opens soon. I look forward to seeing you at this year's conference!
I'm excited to see that long-time sysadmin and author Æleen Frisch will be the keynote speaker at this year's Cascadia IT conference, Seattle, March 7-8! If you don't recognize her name, check your bookshelf. You probably have a few of her books! http://casitconf.org/
There is still time to register. There are still a few seats left in the tutorials "Evil Genius 101" and "Team Time Management & Collaboration". Don't wait, register today!
There are also dozens of other excellent tutorials and talks. Plus, there are a lot of networking opportunities.
Hope to see you there!
A friend of mine told me about a situation where a cron job took longer to run than usual. As a result, the next instance of the job started before the first had finished, and they had two cron jobs running at once. The result was garbled data and an outage.
The problem is that they were using the wrong tool. Cron is good for simple tasks that run rarely. It isn't even good at that. It has no console, no dashboard, no dependency system, no API, no built-in way to have jobs start at random times, and it's a pain to monitor. All of these issues are solved by CI systems like Jenkins (free), TeamCity (commercial), or any of a zillion other similar systems. Not that cron is all bad... just pick the right tool for the job.
Some warning signs that a cron job will overrun itself: If it has any dependencies on other machines, chances are one of them will be down or slow and the job will take an unexpectedly long time to run. If it processes a large amount of data, and that data is growing, eventually it will grow enough that the job will take longer to run than you had anticipated. If you find yourself editing longer and longer crontab lines, that alone could be a warning sign.
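One cheap way to catch that growth early is to record how long each run takes. Here is a minimal sketch of a wrapper you could call from the crontab; the function name and the log path are made up for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper: run the real job (given as arguments) and append
# its runtime to a log, so a slowly growing runtime becomes visible long
# before the job overruns its cron interval.
log_duration() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$(date -u +%FT%TZ) cmd=\"$*\" duration=$((end - start))s status=$status" \
        >> "${DURATION_LOG:-/tmp/job_duration.log}"
    return $status
}

# In a crontab you would put this function in a small script and run, e.g.:
#   log_duration process_the_thing
```

A quick look at the log tells you whether the runtime is creeping toward the cron interval.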
I tend to use cron only for jobs that have few or no dependencies (say, they depend only on the local machine) and run daily or less often. That's fairly safe.
There are plenty of jobs that are too small for a CI system like Jenkins but too big for cron. So what are some ways to prevent this problem of cron job overrun?
It is tempting to use locks to solve the problem. Tempting but bad. I once saw a cron job that paused until it could grab a lock. The problem was that when the job overran, there was now an additional process waiting on the lock. They ended up with zillions of processes all waiting on the lock. Unless the job magically started taking less time to run, the backlog would never clear, and that wasn't going to happen. Eventually the process table filled and the machine crashed. Their next solution (which was worse) was to check for the lock and exit if it existed. This solved the first problem but created a new one: the lock jammed, every instance of the job exited, and the processing was no longer being done. That was "fixed" by adding monitoring to alert if the process wasn't running. So, each solution added more complexity. Solving problems by adding more and more complexity makes me a sad panda.
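For what it's worth, if a job must stay in cron, a kernel-managed advisory lock via flock(1) from util-linux avoids the jammed-lock failure mode: the kernel releases the lock when the holding process exits, even on a crash, so there is no stale lock file to clean up. A sketch (the lock file path is an assumption):

```shell
#!/bin/sh
# Skip this run if the previous one is still going. Unlike a lock file
# that must be deleted, flock's lock is released automatically when the
# holder exits, so it cannot jam.
exec 9> /tmp/myjob.lock          # lock file path is an assumption
if ! flock -n 9; then
    echo "previous run still in progress; skipping this run" >&2
    exit 0
fi
# ... do the actual work here ...
```

This still doesn't fix the underlying problem (the job is outgrowing its interval), but it fails safely while you fix that.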
The best solution I've seen is to simply not use cron when doing frequent, periodic, big processes. Just write a service that does the work, sleeps a little bit, and repeats.
while true ; do
    process_the_thing
    sleep 600
done
Simple. Yes, you need a way to make sure that it hasn't died, but there are plenty of "watcher" scripts out there. You probably have one already in use. Yes, it isn't going to run precisely n times per hour, but usually that's not needed.
You should still monitor whether the work is being done. However, you should monitor whether results are being generated rather than whether the process is running. By checking for something at a high level of abstraction (i.e. "black box testing"), you detect whether the script stopped running, the program has a bug, there's a network outage, or anything else that could go wrong. If you only monitor whether the script is running, then all you know is whether the script is running.
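A sketch of that kind of black-box check, assuming the job writes a results file (the path and the 15-minute threshold in the example are made up):

```shell
#!/bin/sh
# Black-box monitoring sketch: alert when the results file goes stale.
# This catches a dead script, a buggy program, or a network outage alike.
check_freshness() {
    # $1 = results file, $2 = maximum acceptable age in seconds
    mtime=$(stat -c %Y "$1" 2>/dev/null || echo 0)   # GNU stat assumed
    age=$(( $(date +%s) - mtime ))
    if [ "$age" -gt "$2" ]; then
        echo "CRITICAL: $1 is ${age}s stale"
        return 2
    fi
    echo "OK: $1 updated ${age}s ago"
}

# example: check_freshness /var/data/results.out 900
```

Run that from your monitoring system and you're alerting on results, not on process tables.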
And before someone posts a funny comment like "Maybe you should write a cron job that restarts it if it isn't running": very funny.
I make a distinction between tool building and automation. Tool building improves a manual task so that it can be done better. Automation eliminates the task. A process is automated when a person does not have to do it any more. Once a process is automated a system administrator's role changes from doing the task to maintaining the automation.
There is a discussion on Snopes about this photo. It looks like the machine magically picks and places bricks. Sadly it does not.
If you watch this video, you see that it requires people to select and place the bricks. It is a better tool. It doesn't eliminate the skilled work of a bricklayer, but it assists them so that their job is easier and the results are better. Watch the video:
(The music was very soothing, wasn't it?)
This machine doesn't automate the process of bricklaying. It helps the bricklayer considerably.
In a typical cloud computing environment every new machine must be configured for its role in the service. The manual process might involve loading the operating system, installing certain packages, editing configuration files, running commands and starting services. An SA could write a script that does these things. For each new machine the SA runs the script and the machine is configured. This is an improvement over the manual process. It is faster, less error-prone, and the resulting machines will be more consistently configured. However, an SA still needs to run the script, so the process of setting up a new machine is not automated.
Automated processes do not require system administrator action. To continue our example, an automated solution means that when the machine boots, it discovers its identity, configures itself, and becomes available to provide service. There is no role for an SA who configures machines. The SA's role transforms into maintaining the automation that configures machines.
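As a concrete (and entirely hypothetical) sketch of "discovers its identity": suppose the site's naming convention encodes the role in the hostname, so "web03" means role "web". A boot script could then look something like this; the naming convention and the final hand-off step are assumptions:

```shell
#!/bin/sh
# Hypothetical boot-time self-configuration sketch.
role_from_hostname() {
    echo "$1" | sed 's/[0-9]*$//'    # strip the trailing instance number
}

role=$(role_from_hostname "$(hostname -s)")
echo "configuring this machine for role: $role"
# here you would hand $role to the configuration management system,
# which installs packages, edits config files, and starts services
```

The point is that no SA runs anything: the machine boots, figures out what it is, and configures itself.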
Cloud administrators often maintain the systems that make up a service delivery platform. To give each new developer access an SA might have to create accounts on several systems. The SA can create the accounts manually, write a script that creates the accounts, or write a job that runs periodically to check if new developers are listed in the human resources database and then automatically create the new accounts. In this last case, the SA no longer creates accounts---the human resources department does. The SA's job is maintaining the account creation automation.
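A sketch of that last approach, written as a dry run (the HR export format, one username per line, is an assumption; a real job would call useradd and repeat on each system):

```shell
#!/bin/sh
# Hypothetical account-sync sketch: read usernames from an HR export and
# report which ones lack local accounts. A real job would create them.
sync_accounts() {
    while read -r user; do
        if ! getent passwd "$user" >/dev/null 2>&1; then
            echo "would create account for $user"
        fi
    done
}

# example: sync_accounts < hr_developers.txt   # assumed export file name
```

Run periodically, this makes account creation an HR action, not an SA action.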
SAs often repeat particular tasks: configuring machines, creating accounts, building software packages, testing new releases, deploying new releases, deciding that more capacity is needed, providing that capacity, failing over services, moving services, and growing or reducing capacity. All of these tasks can be improved with better tools. Good tools are a stepping stone to automation. The real goal is full automation.
Another advantage of full automation is that it enables SAs to collect statistics about defects, or in IT terms: failures. If certain situations tend to make the automation fail, those situations can be tracked and investigated. Often automation is incomplete and certain edge cases require manual intervention. Those cases can also be tracked, categorized, and the more pervasive ones become prioritized for automation.
Tool building is good, but full automation is required for scalable cloud computing.