SMS is nearly "free" for telecom carriers
The protocol is called "SS7" (Signaling System 7). Like most telco protocols it is difficult to parse and ill-defined. This is how telcos keep new competition from starting. They hype SS7 as something so complicated that only rocket scientists could ever understand it. Of course, it is an ITU standard, so it isn't a secret how it works. You just have to pay a lot of money to get a copy of the standard. In fact, once Cisco had a working SS7 software stack the downfall of Lucent/AT&T/others was only years away. Heck, Cisco published a book demystifying SS7. It turns out the emperor had no clothes and Cisco wanted everyone to know. SS7 is big and scary, but only as bad as most protocols. I guess SMTP or SNMP would be scary too if you had never seen a protocol before. (Remember that non-audio networks are still "new" to the telecom world, or at least their executives.)
SS7 is all about setting up "connections". When I dial a number, SS7 packets are sent out that query databases to translate the phone number I want to dial to a physical address to connect to, then an SS7 query goes out to request that all the phone switches from point A to point B allocate bandwidth and start letting audio through. The nomenclature dates back to what was used when phone calls were set up by ladies sitting in front of switchboards.
What makes international dialing work is that there are SS7 gateways between all the carriers. They don't charge each other for this bandwidth because it is just the cost of doing business. The logs of what calls are actually made are used to create billing records, and the carriers do charge each other for the actual calls. Thus, there is no charge for the SS7 packets between AT&T and O2 (O2 is a big cell provider in Europe), but O2 does back-bill AT&T for the phone call that was made. (This is called "Settlement" and my previous employer processed 80% of the world's settlement records on behalf of the phone companies.)
Setting up a connection for an SMS would be silly. An entire connection for just a 160-character message? No way. That's more trouble than it is worth. Therefore, SMS is the only service where the actual service is provided over SS7. The 160-character limit (140 bytes of 7-bit text) comes from a limit in SS7 packet size.
However, the phone companies don't really do anything for free. The SMS records are used to construct billing data and the companies certainly do back-bill each other for SMS carried by each other's networks. If you SMS from AT&T to O2, there is settlement going on after the fact. However, SMS between two AT&T customers has no real cost.
"Multimedia SMS" (photos) are not sent over SS7, though SS7 is used to set up and tear down the connection just like a phone call. If they were smart they'd use SS7 to just transmit an email address and then send the photo over the internet. It would probably be cheaper. (Though, when has a telco had a well-run email system? Sigh.)
So, SMS is "free" because it rides on the back of pre-existing infrastructure. The "cost" is due to the false economics created to "extract value" out of the system (i.e. "charge money").
If they were doing it all from scratch, they could probably run it all over the internet for "free" too. Heck, it wouldn't be much bandwidth even if people learned to type 100x faster.
Why was SMS permitted to use SS7 unlike any other service? The real reason, I'm told, wasn't entirely technical. It was due to the fact that the telcos thought that nobody would actually use the service. Little did they know that it would catch on among teens and then spread!
More info:
- Google for [ss7 protocol]
- Comparison of SS7 to TCP/IP
- Wikipedia article on SS7
- Cisco's SS7 book (also available on O'Reilly's Safari Books Online)
Amazon's Kindle
I got a demo of Amazon's Kindle the other day and was very impressed. I hadn't realized that it had a built-in cellphone-based data connection so you could always download more content. The speed was a little slow, but for reading a book I think it was perfect. I'm considering getting one.
Today I got email from Amazon reminding me that if I shill for them on my blog, readers can get a $100 discount. You just have to apply for an Amazon credit card and use this link.
Do I feel bad about shilling for Amazon? Well, not if it gets my readers a $100 discount. It is a product that friends of mine are happy with and I'm impressed by the demos I've seen.
April showers bring May Flowers... but May brings...
April Showers bring May Flowers. What does May bring? Three-day weekends that make A/C units fail!
This is a good time to call your A/C maintenance folks and have them do a check-up on your units. Check for loose or worn belts and other problems. If you've added more equipment since last summer your unit may now be underpowered. Remember that if your computers consume 50 kW of power, your A/C units should be using about the same (or more) to cool those computers. That's the laws of physics speaking; I didn't invent that rule. The energy it takes to create heat equals the energy required to remove that much heat.
Why do A/C units often fail on a 3-day weekend? During the week the office building has its own A/C. The computer room's A/C only has to remove the heat generated by the equipment in the room. On the weekends the building's A/C is powered off and now the 6 sides (4 walls, floor and ceiling) of the computer room are getting hot. Heat seeps in. Now the computer room's A/C unit has more work to do.
A 3-day weekend is 84 hours (Friday 6pm until Tuesday 6am). That's a lot of time to be running continuously. Belts wear out. Underpowered units overheat and die. Unlike a home A/C unit which turns on for a few minutes out of every hour, a computer-room A/C unit ("industrial unit") runs 40-50 minutes out of every hour. Something running that much has to be specially engineered.
Most countries have a 3-day weekend in May. By the 2nd or 3rd day the A/C unit is working as much as a typical day during the summer. If your computer room doesn't survive that weekend, imagine a summer full of days just like it.
To prevent a cooling emergency make sure that your monitoring system is also watching the heat and humidity of your room. There are many SNMP-accessible units for less than $100. If you detect temperatures of 38 degrees C you should be alerted. Moreover, if that rises to 40 within 30 minutes, it is unlikely that the temperature will go down on its own. You can reduce some of the heat in the room by simply shutting down some non-essential machines (The Practice of System and Network Administration has tips about creating a "shutdown list"). Having the ability to remotely power off machines can save you a trip to the office. Lacking that, shutting down a machine will make it generate less heat even if it is still powered up. Sitting at a "press any key to boot" prompt often generates little heat compared to a machine that is actively processing. If powering off the non-critical machines isn't enough, shut down critical equipment but not the equipment involved in letting you access the monitoring systems (usually the network equipment). That way you can bring things back up remotely. Of course, as a last resort you'll need to power off those bits of equipment too.
Having a cooling emergency? Cooling units can be rented on an emergency basis to help you through a failed cooling unit, or to supplement a cooling unit that is underpowered. There are many companies looking to help you out with a rental unit.
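Here is a minimal sketch of the alerting rule described above, in Python. It is only an illustration: the function name `assess`, the status strings, and the idea of feeding it (minute, temperature) pairs are my assumptions, not part of any particular monitoring product. The 38 and 40 degree thresholds and the 30-minute window come from the text.

```python
def assess(readings, warn_c=38.0, critical_c=40.0, window_min=30):
    """Classify a series of (minute, temp_c) readings, oldest first.

    Returns "ok", "alert" (room is at or above warn_c), or "escalate"
    (room climbed from warn_c to critical_c within window_min minutes,
    so it is unlikely to cool down on its own).
    """
    warn_time = None  # minute at which the room first crossed warn_c
    status = "ok"
    for minute, temp in readings:
        if temp < warn_c:
            # Room recovered; forget the earlier crossing.
            warn_time = None
            status = "ok"
            continue
        if warn_time is None:
            warn_time = minute
        status = "alert"
        if temp >= critical_c and minute - warn_time <= window_min:
            status = "escalate"
    return status
```

In a real setup the readings would come from whatever SNMP-accessible sensor you bought, and "escalate" is the point where you start working through your shutdown list.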
If you have a small room that needs to be cooled (a telecom closet that now has a rack of machines) I've had good luck with a $300 unit available at Walmart. For $300 it isn't great, but I can buy one in less than an hour without having to wait for management to approve the purchase. Heck, for $300 you can buy two and still be below the spending limit of a typical IT manager. The Sunpentown 1200 and the Amcor 12000E are models that one can purchase for about $600 that re-evaporates any water condensation and exhausts it with the hot air. Not having to empty a bucket of water every day is worth the extra cost. The unit is intended for home use, so don't try to use it as a permanent solution. (Not that I didn't use it for more than a year at one company.) It has one flaw... after a power outage it defaults to being off. I guess that is typical of a consumer unit. Be sure to put a big sign on it that explains exactly what to do to turn it back on after a power outage. (The sign I made says step by step what buttons to press, and what color each LED should be if it is running properly. I then had a non-system administrator test the process.)
In summary: test your A/C units now. Monitor them, especially on the weekends. Be ready with a backup plan if your A/C unit breaks. Do all this and you can prevent an expensive and painful meltdown.
HostDB 1.002 released!
A few years ago I released HostDB, my simple system for generating DNS domains. The LISA paper that announced it was called: HostDB: The Best Damn host2DNS/DHCP Script Ever Written.
I just released 1.002 which adds some new features that make it easier to generate MX records for domain names with no A records, and not generate NS records for DNS masters. Other bug fixes and improvements are included.
HostDB is released under the GPL and supported by the community on the HostDB-fans mailing list. This recent release includes patches contributed by Sebastian Heidl.
Easier Xen management with Google Ganeti
Managing Xen instances is a drag. So my buddies in the Google Zürich office built a system for managing them. Now life is great! The team I manage has put Xen clusters all over the world, all managed with Ganeti. It rocks. I'm proud to see it is available to everyone now under a GPLv2 license.
When I first heard the name, I thought it sounded like a new kind of Italian dessert. But what do you expect from a guy with a last name like "Limoncelli"?
Hardware password recovery
Hardware didn't used to have passwords. Your lawnmower didn't have a password, your car didn't have a password, and your waffle iron didn't have a password.
But now things are different. Hardware is much smarter and now often requires a password. Connecting to the console of a Cisco router asks for a password. A Toyota Prius has an all-software entry system.
My first experience with being "locked out" of my own hardware was a Cisco router in 1991. Luckily every Cisco device has a way to work around the password. In fact, Cisco maintains the Password Recovery Process web site that links to the procedure for every device they've ever made. Now one might say, "but if there is a way to work around a password, it isn't very secure, is it?" These procedures all require some kind of physical access. A Sun workstation requires you to press L1 and "A" at the same time, and these keys are only on a physical console. Some appliances require you to boot them while holding down a button. Requiring that the machine be powered up while holding down a certain button means you have access to the power switch, and if you have access to the power switch you can perform one heck of a denial of service attack. If you have physical access there are worse things you can do than reset the password, like smash the box with a sledge hammer.
In the 2nd edition of The Practice of System and Network Administration one of the updates we needed to include was our discussion of physical access to machines. The first edition was written mostly in 1999/2000 and at the time remote access to consoles was something that only Unix servers did, and Windows servers were just starting to get KVM switches that permitted over-the-network access (IP-KVMs). The 2nd edition tries to treat the issue more even-handedly since both Windows and Unix communities now recognize the benefit of remote consoles. We also re-emphasize the importance of security in KVM and other remote-console access systems. Hardware designers assume that physical access will be restricted. Adding a remote-console system means attackers no longer need physical access to attack the console.
I've always looked for the ability to reset hardware passwords on any new equipment I buy. I make sure there are three ways to access these instructions. On my "sysadmins wiki" I make a link to the instructions on the vendor's web site, and I copy those instructions into the wiki itself so that I have a recent copy just in case I need them when I don't have internet access. My third copy is a printed copy that I tape to the side of the device (be careful not to block the vents).
Not every company realizes the importance of this. I recently bought a used Sony tape library (LIB-162) and couldn't find a way to reset the password. Luckily I didn't need the password for the basic functionality required to do backups. However, to get access to the web-based administration system or do software upgrades one needs the password. The previous owner doesn't know the password, so I am stuck.
The manual says that one can't reset the password and I should call my dealer. I figured they just didn't want the information spread around, so I contacted Sony and they gave me the right people to speak with. A very helpful person named Lucia informed me that I could send in the device and for only $899 they would reset the password. That seemed unreasonable, but escalating it brought me no joy. John Marshall, Customer Service/Support Manager at Sony (not to be confused with the Chief Justice nor the famous percussionist) was very polite and friendly, but was not able to tell me the secrets to doing the process myself. I even offered to sign a non-disclosure if the process was secret. No luck. He offered to reduce the rate to $699 but that was unsatisfactory.
More and more of the products that sysadmins deal with are sold as appliances. It's a relatively new industry (I'm being sarcastic) so companies are still figuring out the "norms" that customers expect. I'm not sure what the entire list should be, but I know that the ability to reset the configuration without spending $899 should be one of them.
So in the meantime I'll be using only the minimal features of the device, which is OK because I had planned on using this purely for a hobby project. However, I can't recommend that anyone purchase products from Sony Storage until they stop designing their products this way. There is too much business risk in a product like the Sony LIB-162 AIT tape library at any price.
Anti-spam trick: Grey listing
There is an anti-spam technique called "Grey Listing" which has almost completely eliminated spam from my main server. What's left still goes through my SpamAssassin and Amavis-new filters, but they have considerably less work to do.
The technique is more than a year old but I've only installed a greylist plug-in recently and I'm impressed at how well it works. I hope by writing this article other people that have procrastinated will decide to install a greylist system.
(For those that want technical specifics, I'm using Postfix plus Postgrey. If you use FreeBSD, just do "portinstall mail/postgrey" assuming you are already using Postfix. Sendmail users, please post some comments directing people to the Milter equivalent!)
So how does grey listing work?
Well, you know that a "black list" is a list of sites you block, and a "white list" is a list of sites that you always permit. A grey list is somewhere in between.
The basic principle is that spammers don't retry an email that couldn't be delivered. There are two kinds of "can't be delivered" (actually, more than that but two are important here). One is a "hard failure"... the email can't be delivered and nothing is going to fix it. For example, you are trying to send email to an account that doesn't exist. The second type is a "soft failure", which is a problem that is temporary. In other words, a disk is full, or there is some kind of system problem that will be fixed soon. If you get a "hard error" the email is bounced. If you get a "soft failure" the sending server is supposed to wait a bit of time and retry. That's why when you run out of disk space email stops flowing, but when you fix the problem (delete that out-of-control log file or whatever) you suddenly get a flood of backlogged email.
Spammers don't retry sending email whether it is a hard or soft failure. When you are sending email to tens of millions of addresses, it's too difficult to keep track of failure codes. Besides, even if they don't get their spam sent to 20% of their list, they're still sending it to millions of addresses. Good enough, eh?
So here's how grey listing works. The first time someone tries to send you email, send a "soft error" result code. If they retry more than 5 minutes later, then actually accept it. If they are a spammer you'll never get a retry. If they are legitimate then you'll get a retry.
Implementing this is extremely simple. When someone tries to send email, gather 3 items of information: the source IP address, the From:, the To:. Maintain a database of these 3-tuples. If you haven't seen that 3-tuple before, send the "soft failure" code. If you have seen that 3-tuple already and it was more than 5 minutes ago, accept the message.
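To make the 3-tuple logic concrete, here is a small Python sketch. It is not how Postgrey is implemented; the class name `Greylist`, the in-memory dictionary, and the "DEFER"/"ACCEPT" return values are all made up for illustration. A real mail server would answer "DEFER" with a temporary (4xx) SMTP failure and keep the database on disk.

```python
import time


class Greylist:
    """Toy greylist keyed on (source IP, sender, recipient)."""

    def __init__(self, delay_seconds=300):
        self.delay = delay_seconds  # the 5-minute wait from the text
        self.first_seen = {}        # 3-tuple -> time of first attempt

    def check(self, ip, sender, recipient, now=None):
        """Return "DEFER" (soft failure) or "ACCEPT" for one delivery attempt."""
        now = time.time() if now is None else now
        key = (ip, sender, recipient)
        if key not in self.first_seen:
            # Never seen this 3-tuple: remember it and ask the sender to retry.
            self.first_seen[key] = now
            return "DEFER"
        if now - self.first_seen[key] > self.delay:
            # A retry after the waiting period: behaves like a real mail server.
            return "ACCEPT"
        # Retried too quickly; keep deferring.
        return "DEFER"
```

A legitimate server that retries at, say, the 10-minute mark gets "ACCEPT" on its second attempt. A spammer that never retries never gets past the first "DEFER".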
It's amazingly simple yet it seems to be blocking about 80% of my spam right now.
Now, you may be thinking, "I can't have a 5-minute delay on all my email! That's crazy!" Well don't worry. Systems like Postgrey take this all one step further. For example, if 5 emails get through in the last month, Postgrey decides this IP address must be ok and adds it to a list that is "white listed".
Thus, the system tunes itself. Common senders immediately get into the whitelist (Yahoo, gmail, and so on). Sites that disappear eventually get expired from the list because you don't hear from them in 30 days. That makes the database self-cleaning. All maintenance is automatic.
I can't believe I didn't install this years ago!
--Tom
P.S. I've also added "reject_non_fqdn_hostname" to the Postfix variable "smtpd_helo_restrictions". That means that when an SMTP server issues a "HELO hostname" the email is rejected if "hostname" isn't a FQDN. This rejects about 80% of the spam I'm getting... most of which just sends "HELO friend". I haven't had any complaints from users about false-positives since I implemented this a month ago. This technique reduced spam by 80% and Postgrey reduced spam by a different-but-overlapping 80%. When both are enabled, I receive very little spam. Enough for Amavis-new and SpamAssassin to take care of easily.
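If you're wondering what "isn't a FQDN" means in practice, here is a rough Python sketch of that kind of test. This is my own approximation, not Postfix's code: the function name `looks_like_fqdn` is invented, and the real restriction has details this ignores (for example, I don't handle bracketed address literals here).

```python
import re

# One DNS label: letters, digits and hyphens, 1-63 characters,
# and it may not start or end with a hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def looks_like_fqdn(helo_name):
    """Return True if the HELO name has at least two valid dot-separated labels."""
    # Allow a single trailing dot, as in "mail.example.com."
    name = helo_name[:-1] if helo_name.endswith(".") else helo_name
    labels = name.split(".")
    return len(labels) >= 2 and all(_LABEL.match(label) for label in labels)
```

With this test, "HELO friend" fails because there is only one label, while "HELO mail.example.com" passes.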
Today's Unix Security Trivia
If you write to a file that is SUID (or SGID) the SUID (and SGID) bits on the file are removed as a security precaution against tampering (unless uid 0 is doing the writing).
(See FreeBSD 5.4 source code, sys/ufs/ffs/ffs_vnops.c:739)
The Jifty buzz
Everyone that has seen me speak knows that I love RT for tracking user requests. I was IMing with the author of RT today and he said that for his next product he realized he should first write a good tool that lets him make AJAXy applications without having to do all the work manually. He's done that, and it's called Jifty. Now he's building apps based on that. The first one has as many features as RT but is 1/10th the code base. Awesome! Sounds like Jifty is going to be a big hit! (You can find Jifty in CPAN already.)
Oh, and what's the new app called? Hiveminder.
Let the rumors fly! :-)
Compressing logs is good
It's obvious but I didn't think of one particular reason why until the end of this journey.
[ This is a first draft. Feedback is greatly appreciated.]
It's obvious but I didn't think of it until the end. Compressed logs are good. Really good.
I just had a "disk full" situation on /var. No problem, a little "du -sk *" and I identify the problem. /var/log is huge, nearly the whole 1G disk allocated to /var. I do "du -sk /var/log/*" and discover that FreeBSD's default Apache installation puts all its logs in "/var/log/web/" and that is the real culprit.
No worries. I made /var too small and I'll solve this the way I always do: Move the directory to another place and make a symbolic link.
I'm documenting what I did because it might be educational to people new to such things.
First some background...
I don't have this problem on my other server because there I have a custom Apache config that puts everything in /home/apache (/home/apache/logs, /home/apache/conf, /home/apache/this and /home/apache/that). /home is huge, so I don't worry too much. I was caught off-guard on this server. As you see, for more than one reason.
One of my annoyances with Apache is that every operating system has a different layout of where the various Apache files are kept. I am a bear of little brain, so on most machines I create a directory called /home/web and then make symbolic links in it for conf, logs and htdocs that point to wherever that OS decided to put the configuration files, the logs, and the documents. It really saves me a lot of time.
On my Solaris server I custom-compile Apache and made my own "layout" file so that when I build a new release it will be configured for my particular "/home/apache" layout. I've done this for so long that I often forget that not everyone does this.
When I set up my new server with FreeBSD I was so happy with the "ports" system for installing things like Apache that I forgot to note that it was putting the logs on /var/log/web, which is a small partition on that system. Actually, I didn't "forget to note" this. I noticed it enough that I made a symbolic link from /home/web/logs to /var/log/web. So obviously I noticed it, but I didn't stop to think, "Is that a good place for logs?" and now that that disk has filled I realize that I should have taken the time back then to store the logs someplace with more room.
Anyway...
I have three policies for how I store weblogs. First, I use a custom log so that I record referral data. Second, I never throw away logs. Third, I keep the logs for each virtual host in a different file. Thus, I have everythingsysadmin.com-access_log (and -error_log) as well as, for example, whatexit.org-access_log (and -error_log). The log for everythingsysadmin.com was extremely big, thus my disk space problem.
Moving the files with minimal downtime
I'm very particular about this kind of surgery. I want to move the data without corrupting it, I don't want to make mistakes, and I want my web server to be down for the least amount of time.
Therefore I rsync'ed the data "live", then shut down Apache, rsync'ed the data again, moved the log directory, created the symbolic link, and restarted Apache. This minimized the outage, which was especially important since my book was mentioned on Slashdot today and I was expecting a decent number of hits.
Here's what I did in more detail:
mkdir -p /home/web/logs.new ; cd /var/log/web && rsync -avP . /home/web/logs.new/.
We're copying live data. That's bad. The moment we're done with the copy more will appear. However, we'll mitigate that. Read on.
Notice the caution in this statement. We make the new directory and use "-p" so that if it already exists it won't be an error. Then the "cd" is joined to the "rsync" with "&&". This means "don't do the second command if the first command failed." In other words, if I mistyped "cd /var/log/web" then the rsync won't be attempted. This is good because normally a failed "cd" might have left me in "/home/root" and I wouldn't want all those files copied to /home/web/logs.new. Also I don't use the "-R" (relative) option to rsync. Instead, I make sure that the source and destination directories already exist, and then specify them both as "." (or "blah/blah/blah/."). I do this because "rsync" has different behavior depending on whether it had to create the destination directory or not. That's bad. Bad and confusing. Bad, confusing and frustrating when I'm developing a command that I'm going to run more than once.
As another precaution, I did that command in a window I opened just for that purpose. I'm going to run that sequence of commands a few times, so now I can just use command history to run it over and over. Why is this important? Because I don't want to re-type this long sequence of commands every time I do the process. I want it 100% repeatable. I could make it into a shell script, but that's overkill.
Also note that I'm not copying the data to "/home/web/logs" but to "/home/web/logs.new". That's because /home/web/logs is a symbolic link to /var/log/web, and it would be silly to copy things to where they already are. Scripts and cronjobs might be accessing /home/web/logs so I don't want to muck with it until I'm ready.
While that's copying I used a different window to construct this command:
cd /home/web && mv logs logs.old ; mv logs.new logs ;
cd /var/log && mv web web.old ; ln -s ../../home/web/logs web
(I actually constructed this as one long line, but it is split here to be more readable.)
(To be clear, I typed this command but didn't execute it.)
The first part of this moves /home/web/logs out of the way and moves the newly copied log directory into place. The second part of this moves my current log directory to "web.old" and makes a symbolic link to the new location.
Now I move my mouse to that other window and repeat the rsync command. This time it should run a lot quicker because rsync is an incremental copy. If it sees that the data hasn't changed much, it only copies "what's new". (And if you want to know how it does that, read this amazing transcript of a lecture by the author.)
The second copy went very quickly just as I expected. That's a good sign. If it took a long time I'd start checking to see if I had mistyped the command. If it happened instantly, I'd be worried because it should be fast but not instant.
Now I'm ready to make "the big switch".
Here's what I did:
- Re-run the rsync.
- Immediately do an "apachectl stop" (this shuts down the web server).
- Re-run the rsync again. This time it should be extremely fast, nearly a noop.
- Press ENTER on that command line that switches around the directories and symlinks.
- Test! "cd /home/web/logs" and "cd /var/log/web" and make sure you get the expected results.
- Restart Apache with "apachectl start".
- Test the web sites I host to make sure they're still working.
And during that 7-step process, don't forget to breathe. It has to happen quickly, but not if "fast" means "I'm going to make mistakes".
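For readers who think better in code than in shell one-liners, here is the "big switch" part of the process as a Python sketch. It is not what I ran (I used the mv and ln commands described earlier); the function name `switch_log_dirs` and its two parameters are invented so that you can try it in a scratch directory instead of on a live /var/log.

```python
import os


def switch_log_dirs(home_web, var_log):
    """Swap the freshly copied logs into place and link the old path to them.

    Expects home_web/logs (the old symlink), home_web/logs.new (the rsync'ed
    copy) and var_log/web (the original log directory) to exist.
    """
    logs = os.path.join(home_web, "logs")
    web = os.path.join(var_log, "web")
    # Part one: move the old "logs" entry aside, then move the copy into place.
    os.rename(logs, logs + ".old")
    os.rename(logs + ".new", logs)
    # Part two: move the original directory aside and leave a relative
    # symlink behind so anything still using the old path keeps working.
    os.rename(web, web + ".old")
    os.symlink(os.path.relpath(logs, var_log), web)
```

Just like the shell version, this should only be run while the web server is stopped, and the "Test!" step still applies afterwards.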
So what about compression?
Well, when I was setting up this FreeBSD server I was very impressed by the "ports" collection (which is like RPMs from Linux, except it doesn't suck). So impressed that I forgot that there was more work to be done.
I have a script that rotates the weblogs when they get too big. It's a tricky task because I want to rotate them when they get to a certain size, not every so-and-so days. However, if you rotate the -access_log you have to rotate the -error_log too. The files are then compressed, but only after being rotated. I wrote a script that I use on my Solaris server.
I copied the script over to this server, checked it for portability issues, and ran it. Since the files had not been rotated or compressed in ages, it rotated nearly every file and then started compressing them. Web logs compress down to 1% or 2% of their original size. It's quite impressive.
The "disk full" problem was, fundamentally, that the script wasn't running. If the logs aren't compressed, they take 1Gig of space instead of 10Meg. In fact, at 10Meg they could have stayed in the original place. However, I didn't notice that until the entire process was done.
Oh well. Hindsight really is 20/20!
P.S. On the other hand, having them on /home is much better than /var for other reasons. I tend to be a little more careful about backing up /home.
Update:
Why not newsyslog.conf? An excellent question.
First, I already had a script that did exactly what I wanted. I want all my servers to have the same, repeatable process.
Secondly, the script is able to move the -error_log file if any -access_log is moved. I don't think newsyslog.conf can do parallel moves.
Lastly, I don't use a ".0", ".1", ".2" system. Instead, I use .YYYYMMDD:HHMMSS. That way I can process logs more easily. Since I'm keeping them forever, I don't want to rotate the files (doing n renames for n files); I just want to do 1 rename for each file. I don't think newsyslog.conf can do that (though I haven't done a lot of research).
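To show what I mean by "parallel moves" and timestamp suffixes, here is a Python sketch of the idea. It is not my actual script; the function names `rotate_pair` and `compress` and their parameters are invented for this example, and it leaves out the step where you tell Apache to reopen its log files between the rename and the compression.

```python
import gzip
import os
import shutil
import time


def rotate_pair(log_dir, vhost, max_bytes, now=None):
    """Rotate <vhost>-access_log and <vhost>-error_log together.

    Nothing happens unless the access log has reached max_bytes.
    Returns the list of new file names, which may be empty.
    """
    access = os.path.join(log_dir, f"{vhost}-access_log")
    if not os.path.exists(access) or os.path.getsize(access) < max_bytes:
        return []
    stamp = time.strftime("%Y%m%d:%H%M%S", time.localtime(now))
    rotated = []
    for kind in ("access_log", "error_log"):
        live = os.path.join(log_dir, f"{vhost}-{kind}")
        if os.path.exists(live):
            target = f"{live}.{stamp}"
            os.rename(live, target)  # one rename per file, no .0/.1/.2 shuffle
            rotated.append(target)
    return rotated


def compress(path):
    """Gzip a rotated log and remove the uncompressed copy."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return path + ".gz"
```

Because the suffix is the rotation time, older files never need to be renamed again, which matters when you keep logs forever.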
Raised Floors not sufficient for datacenters?
techtarget.com reports:
Two different vendors are promoting more aggressive cooling systems for modern racks.
Monad, Microsoft's answer to Bash
Ars Technica has an excellent article about MSH.
If you love perl and/or bash, you'll be interested in reading this tutorial. It gives some excellent examples that explain the language.
Always the friendly sysadmin
"When I see a person I don't recognize in the office, I always smile, stop, introduce myself, and ask for the person's name. I then ask to read it off his ID badge 'to help me remember it. I'm a visual learner.' New people think I'm being friendly. I'm really checking for trespassers." This and other great tips can be found here.
Solaris users: Blastwave.org needs our help!
A while back I recommended BlastWave as a great source of pre-built binaries for Solaris. Their service has saved me huge amounts of time.
Sadly, they are running low on funds. It's expensive to keep a high-profile web site like this up and running. Corporate donors are particularly needed.
I just donated $50. I hope you consider donating to them too. Otherwise, in less than 48 hours, they may have to shut down.
Solaris package tip
Since I'm more of an OS X/FreeBSD/Linux person lately, I've gotten a bit out of touch with Solaris administration. I was quite pleasantly surprised to find CSW - Community SoftWare for Solaris which includes hundreds of pre-built packages for Solaris. More importantly, it provided the three I really needed and didn't have time to build. :-)
The system is really well constructed. I highly recommend it to everyone. Give this project your support!

