[This is still only draft quality but I think it is worth publishing at this point.]

Internally at Stack Exchange, Inc. we've been debating the value of certain file formats: YAML, JSON, INI and the new TOML format just to name a few.

[If you are unfamiliar with TOML, it is Tom's Obvious, Minimal Language. "Tom", in this case, is Tom Preston-Werner, co-founder and former CEO of GitHub. The file format has still not reached version 1.0 and is still changing. However, I do like it a lot. Also, the name of the format IS MY FREAKIN' NAME which is totally awesome. --Sincerely, Tom L.]

No one format is perfect for all situations. However, while debating the pros and cons of these formats, something did dawn on me: one group is for humans and another is for machines. The reason there will never be a "winner" in this debate is that you can't have a single format that is both human-friendly and machine-friendly.

Maybe this is obvious to everyone else but I just realized:

  1. The human-friendly group is easy to add comments to, is tolerant of ambiguity, and is often weakly typed (only differentiating between ints and strings).

  2. The machine-friendly group makes it difficult (or impossible) to add comments, is less forgiving about formatting, and is often strongly typed.

As an example of being unforgiving about formatting, JSON doesn't permit a comma after the last item of a list or object.

This is valid JSON (enclosing braces omitted):

   "a": "apple", 
   "alpha": "bits", 
   "j": "jax"

This is NOT valid JSON:

   "a": "apple", 
   "alpha": "bits", 
   "j": "jax",

Can you see the difference? Don't worry if you missed it because it just proves you are a human being. The difference is the "j" line has a comma at the end. This is forbidden in JSON. This catches me all the time because, well, I'm human.
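
To check this yourself, here is a minimal sketch using Python's standard json module. The two fragments above are wrapped in braces so that each is a complete JSON object; the helper name parses is only for illustration.

```python
import json

# The two examples from above, wrapped in braces to form complete objects.
valid = '{"a": "apple", "alpha": "bits", "j": "jax"}'
invalid = '{"a": "apple", "alpha": "bits", "j": "jax",}'  # trailing comma


def parses(text):
    """Return True if Python's standard json module accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(valid))    # True
print(parses(invalid))  # False
```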

It also distracts me because diffs are a lot longer as a result. If I add a new value, such as "p": "pebbles", the diff touches the "j" line as well:

$ diff x.json  xd.json 
<    "j": "jax"
>    "j": "jax",
>    "p": "pebbles"

However, if JSON did permit a trailing comma (which it doesn't), the diff would be shorter and more obvious:

$ diff y.json yd.json 
>    "p": "pebbles",

This is not just a personal preference. This has serious human-factors consequences in an operational environment. It is difficult to safely operate a large complex system, and one of the ways we protect ourselves is by diff'ing versions of configuration files. We don't want to be visually distracted by little things like having to mentally de-dup the "j" line.
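
The size of the two diffs can be compared with Python's difflib. This is only a sketch: the "lenient" variant is a hypothetical JSON-like format that tolerates a trailing comma, and the helper name diff_lines is made up for this example.

```python
import difflib


def diff_lines(before, after):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(before, after, lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]  # skip file headers


# Strict JSON: the last item must not end with a comma.
strict_before = ['{', '   "a": "apple",', '   "alpha": "bits",',
                 '   "j": "jax"', '}']
strict_after = ['{', '   "a": "apple",', '   "alpha": "bits",',
                '   "j": "jax",', '   "p": "pebbles"', '}']

# Hypothetical lenient format: every item ends with a comma.
lenient_before = ['{', '   "a": "apple",', '   "alpha": "bits",',
                  '   "j": "jax",', '}']
lenient_after = ['{', '   "a": "apple",', '   "alpha": "bits",',
                 '   "j": "jax",', '   "p": "pebbles",', '}']

print(len(diff_lines(strict_before, strict_after)))    # 3
print(len(diff_lines(lenient_before, lenient_after)))  # 1
```

In the strict case the "j" line shows up twice (once removed, once re-added with a comma) even though its value never changed.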

The other difference is around comments. One camp permits them and the other doesn't. In operations we often need to temporarily comment out a few lines or include ad hoc messages. Operations people communicate by leaving breadcrumbs and to-do items in files. Rather than commenting out some lines I could delete them and use version control to bring them back, but that is much more work. Also, I often write code in comments for the future. For example, as part of preparation for a recent upgrade, we added the future configuration lines to a file but commented them out. By including them, they could be proofread by coworkers.

It was suggested that if we used JSON we would simply add a key to the data structure called "ignore" and update the code to ignore any hashes with that key. That's a lot of code to change to support that. Another suggestion was that we add a key called "comment" with a value that is the comment. This is what a lot of JSON users end up doing. However, the comments we needed to add don't fit into that paradigm. For example, we wanted to add comments like, "Ask so-and-so to document the history of why this is set to false" and "Keep this list sorted alphabetically". Neither of those comments could be integrated into the JSON structures that existed.
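
For illustration, the "ignore" and "comment" workarounds might look something like the sketch below. The server list, the key names beyond "ignore" and "comment", and the function active_entries are all hypothetical; this is not the code we actually run.

```python
import json

# Hypothetical config: one entry is "commented out" with an "ignore" key,
# another carries a "comment" key because JSON has no real comments.
config_text = """
[
  {"name": "db01", "port": 5432, "comment": "Keep this list sorted alphabetically"},
  {"name": "web01", "port": 80},
  {"name": "web02", "port": 8080, "ignore": true}
]
"""


def active_entries(entries):
    """Drop entries flagged with "ignore" and strip any "comment" keys."""
    result = []
    for entry in entries:
        if entry.get("ignore"):
            continue
        result.append({k: v for k, v in entry.items() if k != "comment"})
    return result


servers = active_entries(json.loads(config_text))
print([s["name"] for s in servers])  # ['db01', 'web01']
```

Note the awkwardness: a comment about the whole list has to be attached to one arbitrary entry, and every consumer of the file has to know to call something like active_entries.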

On the other hand, strictly specified formats like JSON are, in theory, faster to parse. Supporting ambiguity slows things down and leads to other problems. JSON is also so widely supported that this alone is often reason enough to use it.

Some formats have typed data, others assume all data are strings, and others distinguish between integer and string but go no further. YAML, if you implement the entire standard, has a complex way of representing specific types (tags) and even supports repetition with pointers (anchors and aliases). All of that turns YAML's beautifully simple format into a nightmare unsuitable for human editing.
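
The typing difference is easy to see with Python's standard library. This sketch assumes the INI dialect understood by Python's configparser (INI has no single formal standard), and the section and key names are made up.

```python
import configparser
import json

ini_text = """
[server]
port = 8080
debug = false
"""

json_text = '{"server": {"port": 8080, "debug": false}}'

ini = configparser.ConfigParser()
ini.read_string(ini_text)
json_cfg = json.loads(json_text)

ini_port = ini["server"]["port"]
json_port = json_cfg["server"]["port"]

# INI: every value comes back as a string.
print(type(ini_port).__name__, repr(ini_port))    # str '8080'
# JSON: the type is carried by the file format itself.
print(type(json_port).__name__, repr(json_port))  # int 8080

# With INI, the program reading the file must decide what type it wants.
print(ini["server"].getint("port"))       # 8080
print(ini["server"].getboolean("debug"))  # False
```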

I'm not going to say "format XYZ is the best and should be used in all cases"; however, I'd like to summarize the attributes of each format:

*  Quality                                 JSON     YAML                TOML        INI
M  Formal standard                         YES      YES                 soon        no
M  Strongly typed                          YES      YES                 string/int  no
M  Easy to implement the entire standard
H  Awesome name!                           no       no                  YES         no
H  Permits comments                        no       start of line only  YES         usually
H  Diffs neatly                            no       YES (I think)       YES         YES
H  Can be updated without losing           yes-ish  NO                  soon        NO
   format or comments

The * column indicates if this quality is important for machines (M) or humans (H). NOTE: This chart is by no means complete.

Personally I'm trying to narrow the file formats in our system down to two: one used for machine-to-machine communication (that is still human readable), and the other that is human-generated (or at least human-updated) for machine consumption (like configuration files). (Technically there's a 3rd need: Binary format for machine-to-machine communication, such as ProtoBufs or CapnProto.)

I'm very optimistic about TOML and look forward to seeing it get to a 1.0 standard. Of course, the fact that I am "Tom L." sure makes me favor this format. I mean, how could I not like that, eh?

Update: 2015-07-01: Updated table (TOML is typed), and added row for "Awesome name".

4 Comments

Wow. That is a lot of things I disagree with, all wrapped into one blog post.

But I'll stick to the facts: sensible JSON parsers allow trailing commas and comments, which seems to nullify most of your argument here. A lot of your points in this post seem to be summarized by "our JSON implementation is deficient and also we're using JSON for things it wasn't meant to be used for". So, uh, don't do that?

Cool! I welcome the debate. I'm sure there is a lot I can learn.

What implementation do you use that permits trailing commas and comments?

I don't see those features in: python, golang, or the "jq" command.

"sensible JSON parsers" that allow trailing commas are not JSON parsers. They parse a superset of JSON maybe or are JSON like, they are not JSON parsers.

Using nonstandard things and labeling it as a standard thing is like putting antifreeze in a Mountain Dew bottle in your refrigerator for safe keeping. You know it's poison and everyone else may end up killing themselves on it. But hey, they are both green liquids, what's the problem?

The value of a standard is in its standardness. If a standard is non-standard then it's not so useful and ends up getting you in unexpected places, in unexpected ways, at unexpected times... exactly the sort of things operations is typically tasked with working to avoid.

Well, except for the fact that those "sensible" JSON parsers aren't parsing JSON anymore, but some other language, since the JSON specification doesn't allow trailing commas.

Also, what about comments? Having to quote strings (and especially keys)?

How many extensions to JSON are you willing to make before just using a format that is more suitable for configuration?
