Monday, September 15, 2014

Jayson Skima - Validating JavaScript Object Notation Data

A schema is simply a pattern. A pure form. Computationally it can be used to try to match an instance against the pattern or create an instance from it. That's how XML schema, and even DTD, were traditionally used - mostly for validation but also as an easy way to create a fill-in-the-blanks type of template. Since JSON has been taking over XML by storm, the need for a schema eventually (in my case, finally!) overpowered the simplicity minimalist instinct of JSON lovers. Thus was born JSON Schema. Judging from the earliest activity on the Google Group where the specification committee hung out (https://groups.google.com/forum/#!forum/json-schema), the initial draft was done in 2008 and it was very much inspired by XML Schema. Which is not necessarily a bad thing. For one, the syntax used to define a schema is just JSON. A quick example:

{
"type":"object",
"properties":{
  "firstName":{"type":"string"},
  "age":{"type":"number"}
},
"required":["firstName", "lastName"]
}

That is a complete, perfectly valid and useless schema, defined to validate some imaginary data describing people. People data will be the running example theme in this post, and if you prefer shopping cart orders, well, sorry to disappoint. So, when we validate against that schema, here is what we are enforcing:
  1. The thing must be a JSON object because we put "type":"object".
  2. If it has a firstName property, the value of that property must be a string.
  3. The value of the age property, if present, must be an number.
  4. The properties firstName and lastName are required.
Fairly straightforward. Nevermind that we haven't defined the format (i.e. the sub-schema) for the lastName property, we are still requiring its presence, we just don't care what the value is going to be. So, this is how it goes: a schema is a JSON object where you specify various constraints. If all the constraints are satisfied by a given JSON datum, the schema matches, otherwise it doesn't. The standard standardizes on what are the possible constraint, but with a few extra keywords to structure large, complex schemas.

Why Do I Care?

I can tell why I care then you can decide for yourself. When working with any sort of data structure, if you can't just assume that the structure has the expected form, it becomes very annoying, paranoia strikes at every corner and you become defensive and all sorts of mental disorders can ensue. That's why we use strongly typed languages. But if you, like me, have been doing the unorthodox thing and using JSON as your main data structure instead of spitting out a jar full of beans for your domain model then this sort of trouble befalls you.

A few years ago I made the decision to drop the beans for a project and since then I've been using the strategy in a few other, smaller scale projects where it works even better. But the popular Java libraries for JSON are a disaster. There has already been one JSR 353 about JSON, a "whatever" API, no wonder it seems dead on arrival, almost as bad as Jackson and Gson. And now Java 9 is promising a "lightweight JSON API" which looks like it might actually be well-designed, albeit it has different goals than what I need and simplicity is not one of them. So I wrote mJson. It is a small, 1 Java file, JSON library. I wanted something simple, elegant and powerful. The first two I think I've achieved, the "powerful" part is only half-way there. For instance, many people expect JSON libraries to have JSON<-> POJOs mappings and mJson doesn't, though it has extension points to do your own easily (frankly it takes 1/2 day to implement this stuff if you need it so much).

Modeling with beans offers the type checker's help on validating that structures have the desired form. If you are using JSON only to convert it to Java beans, I suppose the mapping process is a roundabout way to validate, to a certain extent. Otherwise, you either consent to live with the risk of bugs or to the extra bloat needed to defensively code against a structure that may be broken. To avoid these problems, you can write a schema, sort of like your own type rules and make use of it at strategic points in the program. Like when you are getting data from the internet. Or when you are testing a new module. Not that I'm advocating going for JSON + Schema instead of Java POJOs in all circumstances. But you should try it some time, see where it makes sense. By the way, in addition to a being a validation tool schemas are essentially metadata that represents your model (just like XML Schema).
Good. Now I want to give you a quick...

Crash Course on JSON Schema

First, the constraints are organized by the type of JSON. which is probably your starting point to describe how a JSON looks like:
{"type": "string"|"object"|"array"|"number"|"boolean"|"null"|"integer"|"any"}
As you can see, there are two additional possible types besides what you already expected: to avoid floats and to allow any type (which is the same as omitting a type constraint altogether).

Now, given the type of a JSON entity, there are further constraints available. Let's start with object. With properties you describe the object's properties' format in the form of sub-schemas for each possible property you want to talk about. You don't have to list all of them but if your set is exhaustive, you can state that with the additionalProperties keyword:
{
  "properties": { "firstName": { "type":"string"}, etc.... },
  "additionalProperties":false
}
That keywords is actually quite versatile. Here we are disallowing any other properties besides the ones explicitly stated. If instead we want to allow the object to have other properties, we can set it to true, or not set it altogether. Or, the value of the additionalProperties keyword can alternatively be a sub-schema that specifies the form of all the extra properties in the object.

We saw how to specify required properties in the example above. Two other options constrain the size of an object: minProperties and maxProperties. And for super duper flexibility in naming, you can use regular expression patterns for property names - this could be useful if you have some format made of letters and numbers, or UUIDs for example. The keyword for that is patternProperties:
{
  "patternProperties": { "j_[a-zA-Z]+[0-9]+": { "type":"object"} },
  "minProperties":1,
  "maxProperties":100
}
The above allows 1 to 100 properties whose names follow a j_letters_digits pattern. That's it about objects. That's the biggy.

Validating arrays is mainly about validating their elements so you provide a sub-schema for the elements with the items keyword. Either you give a single schema or an array of schemas. A single schema will apply to all elements of an array while an array has to match element for element (a.k.a. tuple typing). That's the basis. Here are the extras: we have minItems and maxItems to control the size of the array; we have additionalItems which only applies when items is an array and it controls what to do with the extra elements when there are some. Similarly to the additionalProperties keyword, you can put false to disallow extra elements or supply a schema to validate them. Finally you can require that all items of an array be distinct with the uniqueItems keyword. Example:
{
  "items": { "type": "string" },
  "uniqueItems":true, 
  "minItems":10,
  "additionalItems":false
}
Here we are mandating exactly 10 unique strings in an array. That's it for areas. Numbers and strings are pretty simple. For number you can define range and divisibility. They keywords are minimum (for >=), maximum (for <=), exclusiveMinimum (if true, minimum means >), exclusiveMaximum (if true, maximum means <). Strings can be validated through a regex with the pattern keyword and by constraining their length with minLength and maxLength. I hope you don't need examples to see string and number validation in action. The regular expression syntax accepted is ECMA 262 (http://www.ecma-international.org/publications/standards/Ecma-262.htm).

Notice that there aren't any logical operators so far. In a previous iteration of the JSON Schema spec (draft 3), some of those keywords admitted areas as values with the interpretation of an "or". For example, the type could be ["string", "number"] indicating that a value can be either a string or a number. Those have been abandoned in favor of a comprehensive set of logical operators to combine schema into more complex validating behavior. Let's go through them: "and" is allOf, "or" is anyOf , "xor" is oneOf, "not" is not. Those are literally to be interpreted as standard logic: not has to be a sub-schema which must not match for validation to succeed, allOf has to be array of schemas and all of them have to match for the JSON to be accepted. Similarly anyOf is an array of which at least one has to match while oneOf means that exactly one of the schemas in the array must match the target JSON. For example to enforce that a person is married, we could declare that it must either have a husband or a wife property, but not both:
{ "oneOf":[
  {"required":["husband"]},
  {"required":["wife"]}
}
If you have a predefined list of values, you could use enum. For example, a gender property has to be either "male" or "female":
{ 
  "properties":{
    "gender":{"enum":["male", "female"]}
  }
}

With that, you know almost everything there is to know about JSON Schema. Almost. Above I mentioned "a few extra keywords to structure large complex schemas". I exaggerated. Actually there is only one such keyword: $ref (the related keyword id is not really needed). $ref allows you to refer to schemas defined elsewhere instead of having to spell out again the same constructs. For example if there is a standard format for address somewhere on the internet, with a schema defined for it and if that schema can be obtained at http://standardschemas.org/address (a made up url), you could do:
{ 
  "properties":{
    "address":{"$ref":"http://standardschemas.org/address"}
  }
}
The fun part of $ref is that the URI can be relative to the current schema and and you can use JSON pointer in a URI fragment (the part after the pound # sign) to refer to a portion of the schema within a document! JSON Pointer is a small RFC (http://tools.ietf.org/html/rfc6901) that specs out Unix path-like expressions to navigate through properties and arrays in a complex JSON structure. For example the expression /children/0/gendre refers to the gendre of the first element in a children array property. Note that only the slash is used, no brackets or dots and that's perfectly enough. If you want to escape a slash inside a property name write ~1 and to escape a tilde write ~0. To gets your hands on some presumably rock solid zip code validation, for example, you could do:
{ 
  "properties":{
    "zip":{"$ref":"http://standardschemas.org/address#/postalCode"}
  }
}
So that means you can define the schemas for your RESTful API at a standard location and publish those and/or refer to them on your API responses. Any JSON validator has to be capable of fetching the right sub-schema and a good implementation will cache them so you don't have to worry about network hops. A reference URI can be relative to the current schema so if you have other schemas on the same base location, they can refer to each other irrespective of where they are deployed. As a special case to that, you can resolve fragments relative to the current schema. For example:
{ 
  "myschemas": {
     "properName": { "type":"string", "pattern":"[A-Z][a-z]+"}
  }
  "properties":{
    "firstName":{ "$ref":"#/myschemas/properName"},
    "lastName":{ "$ref":"#/myschemas/properName"}
  }
}
Because the JSON Schema specification allows properties that are not keywords, we can just pick up a name, like myschemas here, as a placeholder for sub-schemas that we want to reuse. So we've defined that a proper name must start with a capital letter followed by one or more lowercase letters, and then we can reuse that anywhere we want. This is such a common pattern then the specification has defined a keyword to place such sub-schemas. This is the definitions keyword which must appear at the top-level, has no role in validation per se, but is just a placeholder for inline schemas. So the above example should be properly rewritten as:
{ 
  "definitions": {
     "properName": { "type":"string", "pattern":"[A-Z][a-z]+"}
  }
  "properties":{
    "firstName":{ "$ref":"#/definitions/properName"},
    "lastName":{ "$ref":"#/definitions/properName"}
  }
}
To sum up, using the $ref keyword and the definitions placeholder is all you need to structure large schemas, split them into smaller ones, possibly in different documents, refer to standardized schemas over the internet etc.

Resources

Now to make use of JSON schema, there aren't actually that many implementations available yet. The popular (and bloated) Jackson supports draft 3 so far, and this part doesn't seem actively maintained. One of the JSON Schema spec authors has implemented full support on top of Jackson: https://github.com/fge/json-schema-validator, so you should know about that implementation especially if you are already a Jackson user. But if you are not, I want to point you to another option available since recently:mJson 1.3 which supports JSON Schema Draft 4 validation:

Json schema = Json.read(new URL("http://mycompany.com/schemas/model");
Json data = Json.object("firstName", "John", "lastName", "Smith").set("children", Json.array().add(/* etc... */));
Json errors = schema.validate(data);
for (Json e: errors.asJsonList())
   System.out.println("JSON validation error:" + e);

In all fairness, some of the other libraries also have support for generating JSON based on a schema, with default values specified by the default keyword which I haven't covered here. mJson doesn't do that yet, but if there's demand I'll put it in. The keywords I haven't covered are title, description (meta data keywords not used during validation) and id. To become and expert, you can always read the spec. Here it is, alongside some other resources:



For Dessert

To part ways, I want to leave you with a little gem, one more resource. Somebody came up with a much more concise language for describing JSON structures, It's called Orderly, it compiles into JSON Schema and I haven't tried it. If you do, please report back. It's at http://orderly-json.org/  and it looks like this:

object {
  string name;
  string description?;
  string homepage /^http:/;
  integer {1500,3000} invented;
}*;