[This is still only draft quality but I think it is worth publishing at this point.]
Internally at Stack Exchange, Inc. we've been debating the value of certain file formats: YAML, JSON, INI and the new TOML format just to name a few.
[If you are unfamiliar with TOML, it is Tom's Obvious, Minimal Language. "Tom", in this case, is Tom Preston-Werner, founder and former CEO of GitHub. The file format has not yet reached version 1.0 and is still changing. However, I do like it a lot. Also, the name of the format IS MY FREAKIN' NAME which is totally awesome. --Sincerely, Tom L.]
No one format is perfect for all situations. However while debating the pros and cons of these formats something did dawn on me: one group is for humans and another is for machines. The reason there will never be a "winner" in this debate is that you can't have a single format that is both human-friendly and machine-friendly.
Maybe this is obvious to everyone else but I just realized:
The human-friendly group is easy to add comments to, is tolerant of ambiguity, and is often weakly typed (only differentiating between ints and strings).
The machine-friendly group is difficult (or impossible) to add comments to, is less forgiving about formatting, and is often strongly typed.
As an example of being unforgiving about formatting, JSON doesn't permit a comma on the last line of a list.
This is valid JSON:
{
"a": "apple",
"alpha": "bits",
"j": "jax"
}
This is NOT valid JSON:
{
"a": "apple",
"alpha": "bits",
"j": "jax",
}
Can you see the difference? Don't worry if you missed it because it just proves you are a human being. The difference is the "j" line has a comma at the end. This is forbidden in JSON. This catches me all the time because, well, I'm human.
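If you don't trust your eyes, a strict parser will settle it. Here is a minimal sketch using Python's standard json module; the variable and function names are mine, chosen only for illustration.

```python
import json

# The two documents from above, collapsed onto one line each.
valid = '{"a": "apple", "alpha": "bits", "j": "jax"}'
invalid = '{"a": "apple", "alpha": "bits", "j": "jax",}'  # trailing comma


def parses(text):
    """Return True if Python's json module accepts the text as valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(valid))    # the well-formed document
print(parses(invalid))  # the one with the trailing comma
```

The first call reports success and the second reports failure, which is exactly the kind of check a machine is good at and a tired human is not.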
It also distracts me because diffs are a lot longer as a result. If I add a new value, such as "p": "pebbles", the diff looks very different:
$ diff x.json xd.json
4c4,5
< "j": "jax"
---
> "j": "jax",
> "p": "pebbles"
However if JSON did permit a trailing comma (which it doesn't), the diffs would look shorter and be more obvious.
$ diff y.json yd.json
4a5
> "p": "pebbles",
This is not just a personal preference. This has serious human-factors consequences in an operational environment. It is difficult to safely operate a large complex system, and one of the ways we protect ourselves is by diff'ing versions of configuration files. We don't want to be visually distracted by little things like having to mentally de-dup the "j" line.
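To put a number on the distraction, here is a rough sketch that counts how many lines a diff touches when "p": "pebbles" is appended, once under strict JSON rules and once in a hypothetical JSON-like dialect that tolerates a trailing comma. It uses Python's difflib rather than the diff command, and the helper name is my own.

```python
import difflib


def touched_lines(before, after):
    """Count lines that difflib.ndiff marks as removed ("- ") or added ("+ ")."""
    return sum(1 for line in difflib.ndiff(before, after)
               if line.startswith(("- ", "+ ")))


# Strict JSON: the last entry must not end with a comma.
strict_before = ['{', '"a": "apple",', '"alpha": "bits",', '"j": "jax"', '}']
strict_after = ['{', '"a": "apple",', '"alpha": "bits",', '"j": "jax",',
                '"p": "pebbles"', '}']

# Hypothetical dialect that allows a trailing comma (not valid JSON).
loose_before = ['{', '"a": "apple",', '"alpha": "bits",', '"j": "jax",', '}']
loose_after = ['{', '"a": "apple",', '"alpha": "bits",', '"j": "jax",',
               '"p": "pebbles",', '}']

print(touched_lines(strict_before, strict_after))  # 3
print(touched_lines(loose_before, loose_after))    # 1
```

Three touched lines versus one for the same logical change. Multiply that by every list in every config file and the review burden adds up.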
The other difference is around comments. One camp permits them and the other doesn't. In operations we often need to temporarily comment out a few lines or include ad hoc messages. Operations people communicate by leaving breadcrumbs and to-do items in files. Rather than commenting out some lines I could delete them and use version control to bring them back, but that is much more work. Also, I often write code in comments for the future. For example, as part of preparation for a recent upgrade, we added the future configuration lines to a file but commented them out. Including them meant coworkers could proofread them.
It was suggested that if we used JSON we would simply add a key called "ignore" to the data structure and update the code to ignore any hashes with that key. That's a lot of code to change. Another suggestion was that we add a key called "comment" whose value is the comment. This is what a lot of JSON users end up doing. However, the comments we needed to add don't fit into that paradigm. For example, we wanted to add comments like "Ask so-and-so to document the history of why this is set to false" and "Keep this list sorted alphabetically". Neither of those comments could be integrated into the JSON structures that existed.
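For readers who haven't seen the "comment" key workaround, this is roughly what it looks like. Treat it as a sketch: the config keys, the comment text, and the strip_comments helper are all made up for illustration and are not from our actual systems.

```python
import json

# Hypothetical config using the "comment"-key convention.
config_text = """
{
  "comment": "Temporary value until the upgrade is finished",
  "feature_enabled": false,
  "servers": ["alpha", "beta", "gamma"]
}
"""


def strip_comments(node):
    """Recursively drop every "comment" key from parsed JSON objects."""
    if isinstance(node, dict):
        return {key: strip_comments(value)
                for key, value in node.items() if key != "comment"}
    if isinstance(node, list):
        return [strip_comments(item) for item in node]
    return node


config = strip_comments(json.loads(config_text))
print(config)
```

Notice what is missing: the "servers" value is a list of plain strings, so there is no key on which to hang a note like "Keep this list sorted alphabetically" without changing the shape of the data. Every consumer of the file also has to know to skip the "comment" key.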
On the other hand, strictly specified formats like JSON are, in theory, faster to parse. Supporting ambiguity slows things down and leads to other problems. JSON is also so widely supported that this alone is often reason enough to use it.
Some formats have typed data, others assume all data are strings, and others distinguish between integer and string but go no further. YAML, if you implement the entire standard, has a complex way of representing specific types and even supports repetition with pointers (anchors and aliases). All of that turns YAML's beautifully simple format into a nightmare unsuitable for human editing.
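The typing difference is easy to see side by side. The sketch below loads the same three made-up settings from JSON and from INI using Python's standard json and configparser modules. The section and key names are hypothetical, and other INI libraries may behave differently.

```python
import configparser
import json

# Same hypothetical settings, once as JSON and once as INI.
json_cfg = json.loads('{"port": 8080, "debug": false, "name": "web"}')

ini = configparser.ConfigParser()
ini.read_string("[server]\nport = 8080\ndebug = false\nname = web\n")
section = ini["server"]

# JSON carries the types in the file itself.
print(type(json_cfg["port"]).__name__, type(json_cfg["debug"]).__name__)

# INI hands back strings; the reader decides what they mean.
print(type(section["port"]).__name__, type(section["debug"]).__name__)

# The conversion happens in code, not in the file.
print(section.getint("port"), section.getboolean("debug"))
```

With JSON the file says what is a number and what is a boolean. With INI, at least in this parser, everything is a string until the code asks for something else.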
I'm not going to say "format XYZ is the best and should be used in all cases." However, I'd like to summarize the attributes of each format:
* | Format | JSON | YAML | TOML | INI |
---|---|---|---|---|---|
M | Formal standard | YES | YES | soon | no |
M | Strongly typed | YES | YES | string/int | no |
M | Easy to implement the entire standard | YES | no | YES | YES |
H | Awesome name! | no | no | YES | no |
H | Permits comments | no | start of line only | YES | usually |
H | diffs neatly | no | YES (I think) | YES | YES |
H | Can be programmatically updated without losing format or comments | yes-ish | NO | soon | NO |
The * column indicates whether this quality is important for machines (M) or humans (H). NOTE: This chart is by no means complete.
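The last row of the chart deserves a concrete illustration. Here is a small sketch of a programmatic update to an INI file using Python's standard configparser. The file contents are invented, and this shows only one parser's behavior, not INI tooling in general.

```python
import configparser
import io

# Hypothetical INI file with a human-written note at the top.
original = """\
# Keep this list sorted alphabetically.
[servers]
alpha = 10.0.0.1
beta = 10.0.0.2
"""

parser = configparser.ConfigParser()
parser.read_string(original)
parser["servers"]["gamma"] = "10.0.0.3"  # the programmatic update

# Write the updated config back out to an in-memory buffer.
buffer = io.StringIO()
parser.write(buffer)
rewritten = buffer.getvalue()
print(rewritten)
```

The new key is there, but the comment is gone, because this parser discards comments when it reads the file. That is the trade-off the row is describing: a format can be pleasant for humans to annotate and still be hostile to tools that need to rewrite it.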
Personally I'm trying to narrow the file formats in our system down to two: one used for machine-to-machine communication (that is still human readable), and the other that is human-generated (or at least human-updated) for machine consumption (like configuration files). (Technically there's a 3rd need: Binary format for machine-to-machine communication, such as ProtoBufs or CapnProto.)
I'm very optimistic about TOML and look forward to seeing it get to a 1.0 standard. Of course, the fact that I am "Tom L." sure makes me favor this format. I mean, how could I not like that, eh?
Update: 2015-07-01: Updated table (TOML is typed), and added row for "Awesome name".