Prompted by a thread on the IETF's JSON WG mailing list, I find myself wondering (again) why the current JSON landscape does not have an established schema language, and what it would take to change that. There is a long-expired Internet Draft, but as shown in the interesting "Foundations of JSON Schema" presentation at the recent WWW2016 conference, this draft is not well-specified, and as a result, current implementations differ in their behavior.
As a side note, it seems to me as if the draft also suffers from a lack of focus. It has grammar aspects to it, but also has quite a bit of functionality on a higher level, talking about hyperlinks and associated information, which to me seems to be a different concern. Not that I wouldn't consider hypermedia as important, but it just seems like an odd thing thrown into the mix.
But let's back up and think about why we might even want a JSON schema language. The three counter-arguments most often voiced are the following:
- Schemas are incomplete as a formal description of a JSON-based format, and thus trying to fully capture the semantics of a JSON-based language is futile. Documentation is needed anyway, so why write a schema?
- Schemas are too heavyweight and as a result, developers are overwhelmed by them. XML is often cited here, bringing up the heavyweight XSD language as a proof of how complex a schema language can be.
- Developers do not like schemas, neither as the ones writing them, nor as the ones reading them. Document your API with examples and that will be easier and good enough for everybody involved.
Of course, all these arguments have some truth to them, but my thinking is that a well-designed JSON schema language could create more value than problems in all of these areas (and of course everybody is free to ignore such a language). I cannot present such a language here, but I think it could be a great contribution to the JSON technology landscape, and in particular to API and service design and documentation. So what would such a language look like? Here are some ideas:
- No JSON syntax: While eating your own dog food is a noble gesture, it's not always a great idea. XSD serves as a cautionary tale: The XML syntax is a complete disaster to work with, and part of the reason why RELAX NG thrives is that it has a syntax that looks reasonable (we tried the same approach for XSD but the idea never took off). Being able to parse/process/validate a schema language with itself is kind of a cute exercise, but it is not necessary for creating a good language, and in fact may distract from the more important goal of creating a syntax that can be read and written without jumping through too many hoops. In addition, JSON has no comments (even though Hjson might help here), and having a schema language without comments is not a great idea.
- Grammar only: The only aspect covered by the language should be grammar-based constraints. This means that the language can only express structural constraints, and not higher-level ones such as key/value constraints, or application-specific ones such as the hypermedia support mentioned above. This creates a focused language that is easy to explain, learn, and apply.
- Good datatype support: JSON often contains typed data, and being able to express constraints on the data is important. This is a tricky one as highlighted by the XSD Part 2: Datatypes specification, which does nothing other than datatypes and in itself is a rather complex beast. I find myself wondering if there is any reasonably simple solution for this. Even RDF punted and decided to simply reuse the XSD datatypes.
- Wildcard support: It can help format (and thus API) evolution quite a bit when a format is clear about its extensibility points (a point I made here writing about "Robustness and Extensibility"). A schema language should make it possible to clearly identify these extensibility points, and maybe even specify a processing model for them.
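To make the grammar-only, datatype, and wildcard ideas a bit more concrete, here is a minimal Python sketch of what checking against such a language could boil down to. Everything in it is an assumption made up for illustration: the rule layout (`members`, `open`), the `PERSON_RULE` example, and the `conforms` function are hypothetical and do not correspond to any existing schema language. The rule is written as a Python dict only to keep the sketch short, not as a suggestion for a syntax.

```python
import json

# Hypothetical rule: required members with their expected types, plus an
# "open" flag that marks the object as an explicit extensibility point.
PERSON_RULE = {
    "members": {"name": str, "age": int},
    "open": True,  # unknown members are allowed here (wildcard)
}

def conforms(value, rule):
    """Return True if value is a JSON object with the structure the rule describes."""
    if not isinstance(value, dict):
        return False
    for name, expected in rule["members"].items():
        if name not in value:
            return False
        member = value[name]
        # bool is a subclass of int in Python, so reject it for non-bool members.
        if isinstance(member, bool) and expected is not bool:
            return False
        if not isinstance(member, expected):
            return False
    if not rule["open"]:
        # A closed object must not carry members the rule does not name.
        return set(value) <= set(rule["members"])
    return True

document = json.loads('{"name": "Ada", "age": 36, "nickname": "countess"}')
print(conforms(document, PERSON_RULE))
```

The extra `nickname` member is accepted only because the rule declares the object open. With `"open": False` the same document would be rejected, which is the kind of explicit statement about extensibility that the wildcard idea is after.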
Is that all it takes to be a useful and promising JSON schema language? Probably not, and it would be interesting to hear what else might be useful to be considered for such a focused language. But assuming that these points would be addressed, let's briefly revisit the counter-arguments from above:
- Schemas are incomplete: There is no doubt that any schema needs additional documentation. But a readable, concise schema with comments can itself be a useful and helpful part of that documentation, in particular when it is machine-readable and thus can be used for simple tasks such as validating test data as a first line of defense.
- Schemas are complicated: A focused grammar-based language is not too hard to understand, and understanding the idea behind a grammar and why it matters is not overly complex. Other issues beyond grammars should be left as an exercise to other languages, either layered schema languages, or a DSL such as API documentation/description languages.
- Schemas are not wanted: Schemas can be a useful tool when they are used and can serve as part of the documentation. They make it possible to define basic structures, and, when consistently supported by tools, can be used to make working with JSON easier. Nobody is forced to use them, but in particular in the ever-growing complex landscape of APIs, having a bit more rigor might turn out to be useful in the long run.
In a recent blog post, Tim Bray wrote about JSON schema languages, and added some more specific ideas on what such a language might look like. The overall gist is that the language should be well-defined and thus implemented consistently, and that error reporting of tools should provide useful feedback (error location). That makes sense, and for something like grammar checking, that shouldn't be too hard to get right.
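As a rough illustration of why error locations are not hard to get right for grammar checking, here is a small Python sketch that carries a path along while descending into the document. The rule notation (a dict for objects, a one-element list for arrays, a Python type for leaf values) and the `check` function are assumptions invented for this example; the slash-separated locations are only loosely modeled on JSON Pointer.

```python
def check(value, rule, path=""):
    """Return a list of (location, message) pairs; an empty list means valid."""
    where = path or "(root)"
    errors = []
    if isinstance(rule, dict):
        # Object rule: every named member must be present and valid.
        if not isinstance(value, dict):
            return [(where, "expected object")]
        for name, sub_rule in rule.items():
            if name not in value:
                errors.append((f"{path}/{name}", "missing member"))
            else:
                errors.extend(check(value[name], sub_rule, f"{path}/{name}"))
    elif isinstance(rule, list):
        # Array rule: every item must match the single item rule.
        if not isinstance(value, list):
            return [(where, "expected array")]
        for index, item in enumerate(value):
            errors.extend(check(item, rule[0], f"{path}/{index}"))
    else:
        # Leaf rule: a Python type; bool must not pass as int.
        is_stray_bool = isinstance(value, bool) and rule is not bool
        if is_stray_bool or not isinstance(value, rule):
            errors.append((where, f"expected {rule.__name__}"))
    return errors

rule = {"user": {"name": str, "tags": [str], "age": int}}
document = {"user": {"name": "Ada", "tags": ["x", 3]}}
for location, message in check(document, rule):
    print(location, message)
# prints:
# /user/tags/1 expected str
# /user/age missing member
```

Because the checker only walks the structure, the location of each problem falls out of the traversal for free, which is one more argument for keeping the language grammar-focused.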
I am curious to see how this space will develop. The subject comes up regularly, and over the years various languages have been proposed, but so far none has made any inroads. My guess is that aspects of a schema language will show up at least in DSLs such as API documentation/description languages. It would be better if such a concern were separated into a general JSON schema language, but we will see if we have one by 2020. My guess is that we will.
And finally, as a pet peeve: If you create a schema language for language X, please don't call it "X Schema". XSD initially did this (in an attempt to become not just one, but the only true XML schema language), and all it did was create lasting linguistic confusion. The "JSON Schema" draft has gone the same way. It's like calling your cat "cat" in an attempt to be funny. It's mildly funny in the beginning, but makes it hard to have meaningful conversations that aren't full of ambiguities. Meaningful names make conversations much easier. Please use them.
Update May 2: I just learned about JSON Content Rules (JCR) and an additional specification for JCR Co-Constraints, and they seem to be close to the "how" described above: non-JSON syntax, grammar-focused, datatype support, wildcard support, and a cleanly layered specification for additional features. If you are shopping around for JSON schema languages, have a look.