Above is a recorded talk I gave on ‘Beyond JSON: Fantastic Serialization Formats and Where to Find Them’ for API Craft. Slides are available here.

This blog post is an aspirational transcript for the talk. Keep reading for more!

Today, JSON (JavaScript Object Notation) is the de facto serialization format for exchanging data between HTTP-connected services. Several features make JSON a useful general-purpose format: it’s human readable, it’s easy to learn, and it benefits from the ubiquity of JavaScript.

In this talk, let’s look beyond JSON. We’ll learn about three different serialization formats (JSON, MessagePack, and Protocol Buffers) and discover the benefits unique to each.

Serialization

A Web API lets software systems communicate with one another over the network.

Serialization is a key step in this communication process. What language or dialect do these software systems communicate in?

Serialization is the process of translating object state into a format that can be transmitted and reconstructed later.
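As a rough illustration of that definition, here is a minimal round trip using Ruby’s standard json library. The Cereal struct, its fields, and the serialize and deserialize helpers are made up for this sketch; they are not from the talk.

```ruby
require 'json'

# Hypothetical in-memory object whose state we want to transmit.
Cereal = Struct.new(:name, :taste, :calories)

# Serialization: translate object state into a transmittable string.
def serialize(cereal)
  JSON.generate(cereal.to_h)
end

# Deserialization: reconstruct an equivalent object from that string.
def deserialize(payload)
  data = JSON.parse(payload)
  Cereal.new(data['name'], data['taste'], data['calories'])
end

original = Cereal.new('Oat Rings', 'good', 110)
payload  = serialize(original)
restored = deserialize(payload)

puts payload
puts restored == original
```

The payload is just text, so it can be sent over HTTP or written to disk, and the receiver rebuilds an equal object from it.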

There are several applications of object serialization:

Communication

If you have two machines that are running the same code, and they need to communicate, an easy way is for one machine to build an object with the information it would like to transmit, serialize that object, and send it to the other machine. It’s not the best method for communication, but it gets the job done.

Persistence

If you want to store the state of a particular operation in a database, it can be easily serialized to a byte array, and stored in the database for later retrieval.

In the context of APIs, we focus more on the Communication use case.

A Common Language

Say that we have a handful of software systems, each written in a different language and/or running on a different platform:

Each system has its own native types and data structures, and different ways of encoding them. When any two systems need to communicate with each other, we’ll need some kind of intermediary ‘translator’ between them.

However, using the approach above means we’ll need a translator for each pair of software systems. That’s too much work!

Instead, we can opt for a smarter approach:

By having systems agree on a common serialization protocol, each system only needs to translate to and from that one shared format, so we can bypass the need for pairwise translators entirely.
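To put rough numbers on the two approaches, the sketch below counts translators under one simplifying assumption: a pairwise translator works in both directions, so there is one per unordered pair of systems. The method names are invented for this example.

```ruby
# Number of translators needed when every pair of systems has its own
# dedicated (bidirectional) translator: n * (n - 1) / 2.
def pairwise_translators(n)
  n * (n - 1) / 2
end

# With a shared format, each system only needs its own
# serializer/deserializer for that format: n in total.
def shared_format_translators(n)
  n
end

[3, 5, 10].each do |n|
  puts "#{n} systems: #{pairwise_translators(n)} pairwise vs " \
       "#{shared_format_translators(n)} shared"
end
```

The pairwise count grows quadratically while the shared-format count grows linearly, which is why the gap widens quickly as systems are added.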

Serialization Formats

In this article, we’ll look at three different serialization protocols: JSON, MessagePack, and Protocol Buffers.

Along the way we will examine the following key attributes:

  • Human readability
  • Types and validation
  • Interface definition
  • Documentation
  • Performance
  • Schema evolution

JSON

JSON is widespread and human readable (which is why it is also used as an alternative to YAML). Browsers can consume the data directly without a separate unpacking step.

When serializing data from statically typed languages, however, JSON has two drawbacks: the obvious one of runtime inefficiency, and the counterintuitive one of forcing you to write more code to access data, because its dynamically typed values have to be checked and converted by hand.

In this context, it is only a better choice for systems that have little or no information ahead of time about what data needs to be stored.

There is no built-in type system for JSON, but you can use JSON Schema to add schemas and type validation.
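To give a feel for what schema validation adds, here is a deliberately tiny sketch. The schema uses real JSON Schema keywords (type, required, properties), but the validation_errors helper is hypothetical and understands only those three keywords; it is not a JSON Schema implementation, and a real project would use a dedicated validator library.

```ruby
require 'json'

# A schema written in the JSON Schema vocabulary.
SCHEMA = JSON.parse(<<~SCHEMA_JSON)
  {
    "type": "object",
    "required": ["taste"],
    "properties": {
      "taste":    { "type": "string" },
      "calories": { "type": "integer" }
    }
  }
SCHEMA_JSON

# Maps the JSON Schema type names used above to Ruby classes.
TYPE_CHECKS = {
  'object'  => Hash,
  'string'  => String,
  'integer' => Integer
}.freeze

# Hypothetical mini-validator: handles only type, required, and properties.
def validation_errors(data, schema)
  return ['expected an object'] unless data.is_a?(TYPE_CHECKS.fetch(schema['type']))

  errors = []
  schema.fetch('required', []).each do |key|
    errors << "missing required property '#{key}'" unless data.key?(key)
  end
  schema.fetch('properties', {}).each do |key, rule|
    next unless data.key?(key)

    expected = TYPE_CHECKS.fetch(rule['type'])
    errors << "'#{key}' should be #{rule['type']}" unless data[key].is_a?(expected)
  end
  errors
end

p validation_errors(JSON.parse('{"taste": "good", "calories": 110}'), SCHEMA)
p validation_errors(JSON.parse('{"calories": "lots"}'), SCHEMA)
```

The first document passes with no errors; the second is reported as missing taste and as having a non-integer calories value.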

MessagePack

MessagePack is like JSON, but with an efficient binary encoding.

MessagePack-ed data takes up less space:

Compared to JSON, in MessagePack there are additional packing and unpacking steps:

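require 'msgpack' # third-party gem: gem install msgpack
require 'json'

cereal = { taste: :good }

# JSON serialization, for comparison
json = cereal.to_json

# MessagePack serialization (packing)
msg = cereal.to_msgpack

# MessagePack deserialization (unpacking);
# symbols come back as strings by default
MessagePack.unpack(msg)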
Unlike JSON and YAML, MessagePack is not meant to be human readable! It is a binary format, which means that it represents its information as arbitrary bytes, not necessarily bytes that map to printable characters. The benefit of doing so is that its serializations often take up significantly less space than their YAML and JSON counterparts.
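To see where the savings come from, the sketch below hand-encodes one small map following the MessagePack specification and compares it with the JSON text. The pack_small_string_map helper is hypothetical and covers only the fixmap and fixstr cases (at most 15 entries, strings of at most 31 bytes); real code would use the msgpack gem instead.

```ruby
require 'json'

# Hand-encodes a small Hash with String keys and values.
def pack_small_string_map(hash)
  raise ArgumentError, 'too many entries for fixmap' if hash.size > 15

  bytes = [0x80 | hash.size].pack('C') # fixmap header: 1000xxxx
  hash.each do |key, value|
    [key, value].each do |str|
      raw = str.b
      raise ArgumentError, 'string too long for fixstr' if raw.bytesize > 31

      bytes << [0xa0 | raw.bytesize].pack('C') << raw # fixstr header: 101xxxxx
    end
  end
  bytes
end

cereal = { 'taste' => 'good' }
json   = JSON.generate(cereal)
packed = pack_small_string_map(cereal)

puts "JSON:        #{json.bytesize} bytes  #{json}"
puts "MessagePack: #{packed.bytesize} bytes  #{packed.unpack1('H*')}"
```

JSON spends bytes on braces, quotes, and the colon, while MessagePack replaces them with one-byte headers that carry the type and length.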

Although this does rule out MessagePack as a configuration file format, it makes it very attractive to those building fast, distributed systems.

Just for the record, MessagePack was not made for consumer-facing APIs.

Protocol Buffers

Where browsers and JavaScript are not consuming the data directly – particularly in the case of internal services – structured formats, such as Google’s Protocol Buffers, are a better choice than JSON for encoding data.

We should be able to capture information about our data models in our messages as well!

Protocol Buffers involve an interface description language that describes the structure of some data, and a program that generates source code from that description for generating or parsing a stream of bytes that represents the structured data.

You define how you want your data to be structured once, then use the protoc command-line compiler to generate source code that writes and reads your structured data to and from a variety of data streams, in a variety of languages.

In Protocol Buffers you have to explicitly tag every field with a number, and those numbers are stored along with the fields’ values in the binary representation.
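As a sketch of how a tag number ends up in the bytes, the code below encodes a single integer field by hand, following the Protocol Buffers wire format: a key of (tag << 3) | wire_type, followed by the value as a base-128 varint. The helper names are invented for this example, and it handles only non-negative integers; in practice the generated code does this for you.

```ruby
# Encodes a non-negative integer as a base-128 varint: 7 bits per byte,
# least-significant group first, high bit set on every byte except the last.
def encode_varint(value)
  bytes = []
  loop do
    byte = value & 0x7f
    value >>= 7
    if value.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80)
  end
  bytes.pack('C*')
end

WIRE_TYPE_VARINT = 0

# A field is written as a key followed by its value. The key packs the
# tag number and the wire type together: (tag << 3) | wire_type.
def encode_varint_field(tag, value)
  encode_varint((tag << 3) | WIRE_TYPE_VARINT) + encode_varint(value)
end

encoded = encode_varint_field(1, 150)
puts encoded.unpack1('H*')
```

Field 1 with the value 150 comes out as the three bytes 08 96 01. The field name never appears on the wire, only its tag number.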

Thus, as long as you never change the meaning of a number in a subsequent schema version, you can still decode a record encoded in a different schema version. If the decoder sees a tag number that it doesn’t recognize, it can simply skip it.
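The skipping behaviour can be sketched with a small hand-written decoder. This is not the generated Protocol Buffers code: it handles only varint (wire type 0) and length-delimited (wire type 2) fields, and the field names in KNOWN_FIELDS are hypothetical.

```ruby
require 'stringio'

# Reads one base-128 varint from the stream.
def read_varint(io)
  result = 0
  shift = 0
  loop do
    byte = io.readbyte
    result |= (byte & 0x7f) << shift
    return result if (byte & 0x80).zero?

    shift += 7
  end
end

# Fields this (older) decoder knows about, keyed by tag number.
KNOWN_FIELDS = { 1 => :calories, 2 => :taste }.freeze

# Decodes a message, skipping any tag number it does not recognize.
def decode(bytes)
  io = StringIO.new(bytes)
  record = {}
  until io.eof?
    key = read_varint(io)
    tag = key >> 3
    wire_type = key & 0x07
    value =
      case wire_type
      when 0 then read_varint(io)
      when 2 then io.read(read_varint(io))
      else raise ArgumentError, "wire type #{wire_type} not handled in this sketch"
      end
    name = KNOWN_FIELDS[tag]
    record[name] = value if name # unknown tag: value was read, then dropped
  end
  record
end

# Written with a newer schema: tags 1 and 2, plus a new tag 3 (varint).
message = "\x08\x96\x01\x12\x04good\x18\x01".b
p decode(message)
```

The wire type tells the decoder how many bytes the value occupies, so it can step over tag 3 without knowing what the field means and still recover calories and taste.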

New fields can be introduced easily, and intermediate servers that don’t need to inspect the data can simply parse it and pass it through without needing to know about all the fields.

However, there are a few gotchas with schema evolution:

  • you must not change the tag numbers of any existing fields.
  • you must not add or delete any required fields.
  • you may delete optional or repeated fields.
  • you may add new optional or repeated fields but you must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).
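The rules above can be sketched as a check that compares two schema versions. Everything here is hypothetical: the hash layout (tag number to field name and label, plus a reserved list of tags retired by deleted fields) and the evolution_problems helper are invented for illustration and are not part of any Protocol Buffers tooling.

```ruby
V1 = {
  fields: { 1 => { name: 'taste',    label: :optional },
            2 => { name: 'calories', label: :optional } },
  reserved: []
}.freeze

# Deletes 'calories' (retiring tag 2) and adds 'brand' on a fresh tag.
V2 = {
  fields: { 1 => { name: 'taste', label: :optional },
            3 => { name: 'brand', label: :optional } },
  reserved: [2]
}.freeze

# Moves 'taste' to a new tag and adds a required field on a retired tag.
V2_BAD = {
  fields: { 5 => { name: 'taste', label: :optional },
            2 => { name: 'brand', label: :required } },
  reserved: []
}.freeze

def evolution_problems(old, new)
  problems = []
  old_by_name = old[:fields].to_h { |tag, field| [field[:name], tag] }
  new_by_name = new[:fields].to_h { |tag, field| [field[:name], tag] }
  used_tags   = old[:fields].keys + old[:reserved]

  old[:fields].each do |tag, field|
    name = field[:name]
    if new_by_name.key?(name)
      new_tag = new_by_name[name]
      problems << "#{name}: tag changed from #{tag} to #{new_tag}" if new_tag != tag
    elsif field[:label] == :required
      problems << "#{name}: required field deleted"
    end
  end

  new[:fields].each do |tag, field|
    name = field[:name]
    next if old_by_name.key?(name)

    problems << "#{name}: required field added" if field[:label] == :required
    problems << "#{name}: reuses tag #{tag}" if used_tags.include?(tag)
  end
  problems
end

p evolution_problems(V1, V2)
p evolution_problems(V1, V2_BAD)
```

The first comparison reports no problems, while the second flags the changed tag, the added required field, and the reused tag number.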

In Closing