June 2015 Archives


[This is still only draft quality but I think it is worth publishing at this point.]

Internally at Stack Exchange, Inc. we've been debating the value of certain file formats: YAML, JSON, INI and the new TOML format just to name a few.

[If you are unfamiliar with TOML, it is Tom's Obvious, Minimal Language. "Tom", in this case, is Tom Preston-Werner, founder and former CEO of GitHub. The file format has not yet reached version 1.0 and is still changing. However, I do like it a lot. Also, the name of the format IS MY FREAKIN' NAME which is totally awesome. --Sincerely, Tom L.]

No one format is perfect for all situations. However, while debating the pros and cons of these formats, something dawned on me: one group of formats is for humans and another is for machines. The reason there will never be a "winner" in this debate is that no single format can be both human-friendly and machine-friendly.

Maybe this is obvious to everyone else but I just realized:

  1. The human-friendly group is easy to add comments to, tolerant of ambiguity, and often weakly typed (only differentiating between ints and strings).

  2. The machine-friendly group is difficult (or impossible) to add comments to, less forgiving about formatting, and often strongly typed.

As an example of being unforgiving about formatting, JSON doesn't permit a comma on the last line of a list.

This is valid JSON:

   {
     "a": "apple",
     "alpha": "bits",
     "j": "jax"
   }

This is NOT valid JSON:

   {
     "a": "apple",
     "alpha": "bits",
     "j": "jax",
   }

Can you see the difference? Don't worry if you missed it because it just proves you are a human being. The difference is the "j" line has a comma at the end. This is forbidden in JSON. This catches me all the time because, well, I'm human.
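You can see this strictness in action by feeding both snippets to a parser. Here is a small sketch using Python's standard json module:

```python
import json

valid = '{"a": "apple", "alpha": "bits", "j": "jax"}'
invalid = '{"a": "apple", "alpha": "bits", "j": "jax",}'  # trailing comma

# The first string parses without complaint.
json.loads(valid)

# The strict JSON grammar rejects the trailing comma outright.
try:
    json.loads(invalid)
except json.JSONDecodeError as err:
    print("rejected:", err)
```

The parser doesn't warn or guess; it simply refuses the input, which is exactly the machine-friendly (and human-unfriendly) behavior described above.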

It also distracts me because diffs are a lot longer as a result. If I add a new value, such as "p": "pebbles" the diff looks very different:

$ diff x.json  xd.json 
<    "j": "jax"
>    "j": "jax",
>    "p": "pebbles"

However if JSON did permit a trailing comma (which it doesn't), the diffs would look shorter and be more obvious.

$ diff y.json yd.json 
>    "p": "pebbles",

This is not just a personal preference. This has serious human-factors consequences in an operational environment. It is difficult to safely operate a large complex system, and one of the ways we protect ourselves is by diffing versions of configuration files. We don't want to be visually distracted by little things like having to mentally de-dup the "j" line.

The other difference is around comments. One camp permits them and the other doesn't. In operations we often need to temporarily comment out a few lines, or include ad hoc messages. Operations people communicate by leaving breadcrumbs and todo items in files. Rather than commenting out some lines, I could delete them and use version control to bring them back, but that is much more work.

Also, I often write code in comments for the future. For example, as part of preparation for a recent upgrade, we added the future configuration lines to a file but commented them out. By including them, they could be proofread by coworkers.

It was suggested that if we used JSON we could simply add a key to the data structure called "ignore" and update the code to ignore any hashes with that key. That's a lot of code to change. Another suggestion was to add a key called "comment" with a value that is the comment. This is what a lot of JSON users end up doing. However, the comments we needed to add don't fit into that paradigm. For example, we wanted to add comments like "Ask so-and-so to document the history of why this is set to false" and "Keep this list sorted alphabetically". Neither of those comments could be integrated into the JSON structures that existed.
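To make the "comment key" workaround concrete, here is a rough Python sketch. The strip_comments helper and the config contents are hypothetical, not code from our systems:

```python
import json

def strip_comments(node):
    """Recursively drop any "comment" keys before the data is used."""
    if isinstance(node, dict):
        return {k: strip_comments(v) for k, v in node.items() if k != "comment"}
    if isinstance(node, list):
        return [strip_comments(v) for v in node]
    return node

config = json.loads("""
{
  "comment": "Keep this list sorted alphabetically",
  "hosts": ["a.example.com", "b.example.com"]
}
""")

print(strip_comments(config))
# {'hosts': ['a.example.com', 'b.example.com']}
```

Note the limitation: a comment can only live where a key/value pair can live, so a remark *about* a neighboring line (like "keep this list sorted") has no natural home.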

On the other hand, strictly formatted formats like JSON are, in theory, faster to parse. Supporting ambiguity slows things down and leads to other problems. And JSON is so widely supported that its ubiquity alone is a reason to use it.

Some formats have typed data, others assume all data are strings, and others distinguish between integer and string but go no further. YAML, if you implement the entire standard, has a complex way of representing specific types and even supports repeated nodes via references (anchors and aliases). All of that turns YAML's beautifully simple format into a nightmare unsuitable for human editing.
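The typing difference is easy to demonstrate with Python's standard library: the INI-style configparser hands every value back as a string, while the json module preserves types. (A small illustration of the two camps, not an endorsement of either format.)

```python
import configparser
import json

# INI-style parsing: every value comes back as a string.
ini = configparser.ConfigParser()
ini.read_string("[server]\nport = 8080\nverbose = true\n")
print(repr(ini["server"]["port"]))      # '8080' -- a string

# JSON parsing: values keep their declared types.
data = json.loads('{"port": 8080, "verbose": true}')
print(repr(data["port"]))               # 8080 -- an int
print(repr(data["verbose"]))            # True -- a bool
```

With the INI file, deciding that "8080" is a port number and "true" is a boolean is the application's job; with JSON, the parser has already decided.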

I'm not going to say "format XYZ is the best and should be used in all cases" however I'd like to summarize the attributes of each format:

*  Attribute                      JSON     YAML                TOML        INI
-  ---------                      ----     ----                ----        ---
M  Formal standard                YES      YES                 soon        no
M  Strongly typed                 YES      YES                 string/int  no
M  Easy to implement the
   entire standard
H  Awesome name!                  no       no                  YES         no
H  Permits comments               no       start of line only  YES         usually
H  Diffs neatly                   no       YES (I think)       YES         YES
H  Can be updated without losing
   format or comments             yes-ish  NO                  soon        NO

The * column indicates if this quality is important for machines (M) or humans (H). NOTE: This chart is by no means complete.

Personally I'm trying to narrow the file formats in our system down to two: one used for machine-to-machine communication (that is still human readable), and the other that is human-generated (or at least human-updated) for machine consumption (like configuration files). (Technically there's a 3rd need: Binary format for machine-to-machine communication, such as ProtoBufs or CapnProto.)

I'm very optimistic about TOML and look forward to seeing it get to a 1.0 standard. Of course, the fact that I am "Tom L." sure makes me favor this format. I mean, how could I not like that, eh?

Update: 2015-07-01: Updated table (TOML is typed), and added row for "Awesome name".

I literally never thought I'd see this day arrive.

In 1991/1992 I was involved in passing the LGB anti-discrimination law in New Jersey. When it passed in January 1992, I remember a reporter quoting one of our leaders that marriage was next. At the time I thought Marriage Equality would be an impossible dream, something that wouldn't happen in my lifetime. Well, less than a quarter-century later, it has finally happened.

In the last few years more than 50% of the states approved marriage equality, and soon it became a foregone conclusion. States are the "laboratories of democracy," and with 26 states (IIRC) having marriage equality, it's about time to declare that the experiment is a success.

There were always predictions that marriage equality would somehow "ruin marriage," but in the last decade of individual states having marriage equality not a single example has come forward. What has come forward has been example after example of problems from not having marriage equality. The Oscar-winning documentary "Freeheld" is about one such example. Having different laws in different states doesn't just create confusion; it hurts families.

"Human progress is neither automatic nor inevitable", wrote Martin Luther King Jr. It is not automatic: it doesn't "just happen", it requires thousands of little steps.

This day only happened because of thousands of activists working for many years, plus hundreds of thousands of supporters, donors, and millions of "like" buttons clicked.

A lot of people make jokes about lawyers but I never do. No civil rights law or court decision ever happens without a lawyer writing legislation or arguing before a court. The legal presentations given in Obergefell v. Hodges were top notch. Implementing the decision requires operational changes that will require policy makers, legal experts, and community activists to work together.

This is really an amazing day.

Posted by Tom Limoncelli in Community, Politics

Recently we were having the most difficult time planning what should have been a simple upgrade. There is a service we use to collect monitoring information (scollector, part of Bosun). We were making a big change to the code, and the configuration file format was also changing.

The new configuration file format was incompatible with the old format.

We were concerned with a potential Catch-22 situation. Which do we upgrade first, the binary or the configuration file? If we put the new RPM in our Yum repo, machines that upgrade to this package will not be able to read their configuration file and that's bad. If we convert everyone's configuration file first, any machine that restarts (or if the daemon is restarted) will find the new configuration file and that would also be bad.

The configuration files (old and new) are generated by the same configuration management system that deploys the new RPMs (we use Puppet at Stack Exchange, Inc.). So, in theory we could specify particular RPM package versions and make sure that everything happens in a coordinated manner. Then the only problem would be newly installed machines, which would be fine because we could pause that for an hour or two.

But then I realized we were making a lot more work for ourselves by ignoring the old Unix adage: If you change the file format, change the file name. The old file was called scollector.conf; the new file would be scollector.toml. (Yes, we're using TOML).

Now that the new configuration file would have a different name, we simply had Puppet generate both the old and new file. Later we could tell it to upgrade the RPM on machines as we slowly roll out and test the software. By doing a gradual upgrade, we verify functionality before rolling out to all hosts. Later we would configure Puppet to remove the old file.
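The rollout logic boils down to "prefer the new name if it exists." A minimal sketch of that idea (the helper and directory path are hypothetical; this is not scollector's actual startup code):

```python
import os

def pick_config(dirpath):
    """Prefer the new-format file when present; otherwise fall back
    to the old one. Because Puppet generates both files, old binaries
    (which only know about scollector.conf) and new binaries can
    coexist during a gradual rollout."""
    new = os.path.join(dirpath, "scollector.toml")
    old = os.path.join(dirpath, "scollector.conf")
    return new if os.path.exists(new) else old
```

Once every host runs the new binary, the old file can be removed and the fallback branch never fires again.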

This reminds me of the fstab situation in Solaris many years ago. Solaris 1.x had an /etc/fstab file just like Linux does today. However, Solaris 2.x radically changed the file format (mostly for the better). They could have kept the filename the same, but they followed the adage, and for good reason. Many utilities and home-grown scripts manipulate the /etc/fstab file. They would all have had to be rewritten. It is better for them to fail with a "file not found" error right away than to plow ahead and modify the file incorrectly.

This technique, of course, is not required if a file format changes in an upward-compatible way. In that case, the file name can stay the same.

I don't know why I didn't think of this much earlier. I've done it many times before. However, the fact that I didn't think of it this time made me think it was worth blogging about.

Posted by Tom Limoncelli in Technical Tips

My talk and two tutorial proposals have been accepted at the Usenix LISA Conference!

  • Talk:
    • Transactional system administration is killing us and must be stopped
  • Tutorials:
    • How To Not Get Paged: Managing Oncall to Reduce Outages
    • Introduction to Time Management for busy Devs and Ops

The schedule isn't up yet at http://www.usenix.org/lisa15 but Usenix is encouraging speakers to post to social media early this year.

See you in Washington DC Nov 8-13, 2015!

P.S. You can follow LISA on various social networks.

Update: 2015-06-16 I changed the title to "some of my proposals" instead of "my proposals". To be clear, I had many rejections this year, I just don't blog about those. That said, I think LISA is a better conference when it increases its speaker diversity and you can't do that if the same few people give a lot of talks.

Posted by Tom Limoncelli in LISA, Usenix

Thanks, QCon New York!

I had a great time at QCon New York last week. It was my first time there and my first time speaking too. The audience was engaged and had great questions. I did a book-signing at the Pearson booth and it was fun meeting readers (and future readers) of our books.

Videos of all talks will be available soon. For now you can view the slides. You should also check out the talk by David Fullerton, VP of Engineering of Stack Exchange (my boss's boss), who gave a great talk titled "Scaling Stack Overflow: Keeping it Vertical by Obsessing Over Performance".

Hope to see you next year!

Posted by Tom Limoncelli in Speaking

I'll be speaking at QCon in their "Architecting for Failure" track. My talk is titled "Fail Better: Radical Ideas from the Practice of Cloud Computing". More about the conference at qconnewyork.com. Hope to see you there!

Posted by Tom Limoncelli in Appearances
