Dattle

A simple data specification format and its design decisions

Dattle is a new data specification format. Dattle sits in the same camp as JSON and EDN. Data described in Dattle is human readable and self-describing, i.e. you don't need an out-of-band schema to parse the data.

Examples

A demonstration of all Dattle syntax:

{"nil" nil
 "true" true
 "false" false
 "string" "UTF-8 \"escaping\""
 ["vector"] ["one" "two" true false]
 {"map" "example"} {"key" "value"
                    "name" "value"}}

Files containing data expressed in Dattle should have the extension .dt.

Motivation

I'm mulling over the design of a new programming language. I don't think it's a coincidence that JSON became so popular as JavaScript became so popular. There's a huge advantage to having both a data format and a language which 'click' together. Data is the more fundamental thing, so I want to start there.

Goals

  • Human readable because it will be embedded into a programming language
  • Parseable without out-of-band schemas to allow the data to stand alone
  • Small set of composable data types to discourage coupling and reduce edge cases
  • Minimal syntax and simple grammar to reduce complexity and fragmentation in parsing

Riffing from EDN

The primary influence here is Extensible Data Notation (EDN) which is the data language of the programming language Clojure. This is another example of data format and language pairing. Compared to JavaScript/JSON this relationship is hardcore: most everything in Clojure is written in EDN.

EDN gets a lot right:

  • Unlike JSON, there's less noise. Map key-value pairs don't need the colon (:) between the keys and values. We certainly don't need the trailing comma (,).
  • Simple data structures, simple data equality. Unlike JSON, where all map keys must be strings, map keys can be any data value. This lets us use maps in more places as they don't have to be dictionaries. For example, JSON can't do integer keyed maps.
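
To make the integer-key point concrete, here is a small sketch using Python's standard json module (my choice of language for illustration; nothing here is part of Dattle). The map goes in with integer keys and comes back with string keys, so the round trip does not preserve the data.

```python
import json

by_id = {1: "one", 2: "two"}      # an integer-keyed map

encoded = json.dumps(by_id)       # the encoder converts the keys to strings
decoded = json.loads(encoded)

print(encoded)                    # {"1": "one", "2": "two"}
print(decoded == by_id)           # False: the keys came back as "1" and "2"
```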

EDN has a bunch I don't need though:

  • The noise is optional, but it is still allowed. So there's nothing stopping anyone from doing {"a", "b", "c", "d"} which is just confusing. Optionality has very little room to stand in a data format. People like to do stringify(parse(data)) to clean data up into its proper form, so I want to reduce the number of allowable-but-not-proper forms. Dattle disallows commas outside of strings. This removes that piece of optionality.
  • EDN has a bunch of data structures, relative to JSON. These include the string-like symbols and keywords, and the collection types lists, vectors, and sets. It can be difficult to know when to use what, and some languages struggle to elegantly represent these types uniquely. In languages that aren't Clojure it leads to a lot of verbosity akin to code-generated Protobufs. In Dattle, we keep it simple: we only have strings and we only have lists (vectors in EDN).
  • EDN is extensible. You can decorate the plain data with reader tags to give more context to pieces of data. These readers can determine the final output of the data, equality rules, etc. This is a means to be more self-describing. This extensibility is too complicated for a data format. It results in a lack of portability due to reader fragmentation (along with hard-to-predict parsing performance), and in people ignoring the reader system entirely to stay compatible with as many tools as possible. Dattle has no extension system.

Specification

Data expressed in Dattle is defined using the following elements. Dattle is case sensitive, e.g. TRUE and True are not valid replacements for true.

  • true and false for booleans
  • Strings. UTF-8, always double quoted, and supports escape sequences.
  • Vectors. The types of elements of a vector can be heterogeneous. These should be represented in the language's native array, random-access, deque-like structure. These are lists, but the use of "vector" is preferred to avoid association with linked-lists and Lisp cons.
  • Maps. The types of keys and values of a map can be heterogeneous. These should be represented in the language's native data structure for arbitrary key-value pairs if one exists. If such a structure does not exist, represent the map as a vector of key-value pairs.
  • nil has three intended use cases:
    • to communicate the intent to remove a value from a map
    • to communicate the absence of a value in a vector
    • to achieve either of the above when Dattle is used in streaming applications where chunks of Dattle are being processed.
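
To show how little grammar there is, here is a minimal recursive-descent parser sketch in Python. It is not the reference implementation and it does not follow the BNF line by line. Several details are my assumptions because the text above doesn't pin them down: whitespace is space, tab, carriage return, or newline; the escape sequences are \" \\ \n and \t; and DattleMap is a hypothetical wrapper that stores a map as key-value pairs, since Python dicts can't use vectors or maps as keys.

```python
from typing import Any, NamedTuple


class DattleMap(NamedTuple):
    """A map kept as ordered key-value pairs (hypothetical helper type)."""
    pairs: tuple


WHITESPACE = " \t\r\n"                                   # assumed separators
ESCAPES = {'"': '"', "\\": "\\", "n": "\n", "t": "\t"}   # assumed escape set
LITERALS = {"nil": None, "true": True, "false": False}


def parse(text: str) -> Any:
    """Parse exactly one Dattle value from text."""
    value, pos = _value(text, _skip(text, 0))
    if _skip(text, pos) != len(text):
        raise ValueError(f"unexpected trailing input after offset {pos}")
    return value


def _skip(text: str, pos: int) -> int:
    """Return the index of the next non-whitespace character."""
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos


def _value(text: str, pos: int):
    """Parse one value starting at pos; return (value, next position)."""
    if pos >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[pos]
    if ch == '"':
        return _string(text, pos + 1)
    if ch == "[":
        return _items(text, pos + 1, "]")
    if ch == "{":
        items, pos = _items(text, pos + 1, "}")
        if len(items) % 2:
            raise ValueError("map has a key without a value")
        return DattleMap(tuple(zip(items[0::2], items[1::2]))), pos
    for word, literal in LITERALS.items():
        end = pos + len(word)
        # Case sensitive, and the word must not run into more letters.
        if text.startswith(word, pos) and not text[end:end + 1].isalpha():
            return literal, end
    # Anything else, including a comma, is rejected here.
    raise ValueError(f"unexpected character {ch!r} at offset {pos}")


def _items(text: str, pos: int, closer: str):
    """Parse whitespace-separated values until the closing bracket."""
    items = []
    while True:
        pos = _skip(text, pos)
        if pos >= len(text):
            raise ValueError(f"missing {closer!r}")
        if text[pos] == closer:
            return items, pos + 1
        item, pos = _value(text, pos)
        items.append(item)


def _string(text: str, pos: int):
    """Parse the body of a string whose opening quote was already consumed."""
    out = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            return "".join(out), pos + 1
        if ch == "\\":
            pos += 1
            if pos >= len(text) or text[pos] not in ESCAPES:
                raise ValueError(f"bad escape at offset {pos}")
            ch = ESCAPES[text[pos]]
        out.append(ch)
        pos += 1
    raise ValueError("unterminated string")


print(parse('["one" "two" true nil]'))   # ['one', 'two', True, None]
```

Vectors come back as Python lists and maps as DattleMap, so a map with a vector or map as a key parses the same way as any other. Input such as {"a", "b"} raises a ValueError because the comma is not part of the grammar.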

Dattle has a BNF specification.

Decisions

Some of the key decisions that make Dattle what it is.

Numbers not found

Notably absent from the specification are numbers. Both EDN and JSON have number types. It is an understatement to say numbers are common in data formats.

However, numbers go against the design goals because they are complex, both in terms of problem space and syntax. There are hundreds of different number memory representations and syntaxes to pick from. For simplicity, Dattle does not make a decision here.

Both JSON and EDN chose which to support based on the influences of their languages. JSON numbers are 64-bit floating point numbers, matching JavaScript. JavaScript has since added support for BigInt values of arbitrary size. EDN supports integers and floating point numbers, each with an opt-in higher-precision form, rooted in Java's long and double.

Mathematical operations are not performed amongst inert data. Numbers are a convenience in a data format. This convenience forces formats to choose between:

  • Support a single set of blessed number types and nothing else. This is what both JSON and reader-less EDN do. JSON chose a single type, EDN chose multiple. For each supported option, the grammar gets more and more complicated. For example, supporting floating point numbers tends to result in absorbing all of that grammar into your own.
  • Blow out your data format with an extension system. In EDN, you can reach for the reader tags to do a bit better: #custom/uint8 4 such that the 4 is represented as the uint8 type the programming language or parser may or may not support.

Numbers can be represented using strings in Dattle: "123". This seems like the right outcome in terms of being explicit, given the complex nature of numbers. Both producer and consumer of Dattle will need to be conscious of the memory/string representation of their numbers:

number = 6
dattle = stringify({ amount: stringifyFloat(number) })
// => '{"amount" "6"}'

result = parse(dattle)

// the consumer parses the string back into the number type it expects
assert(parseFloat(result.get('amount')) == number)

For anyone who has worked with floating point numbers or underspecified number formats before, this need to be explicit comes as a huge relief.
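
As a rough illustration of what "being explicit" buys, here is a Python sketch using the standard decimal module. The encode_amount and decode_amount helpers are hypothetical, as is the convention that an "amount" string holds a base-10 decimal; that agreement lives between producer and consumer, not in Dattle.

```python
from decimal import Decimal


def encode_amount(amount: Decimal) -> str:
    """Producer side: write the number as a base-10 string."""
    return str(amount)


def decode_amount(text: str) -> Decimal:
    """Consumer side: read the string back as the same number type."""
    return Decimal(text)


price = Decimal("19.99")
wire = encode_amount(price)            # travels in Dattle as the string "19.99"

# Both sides agreed on decimal, so the value survives the round trip.
assert decode_amount(wire) == price

# A consumer that silently reads the same string as a binary double
# ends up with a slightly different value.
assert Decimal(float(wire)) != price
```

Because the value is a string, the choice between Decimal and float is visible in the consumer's code instead of being made by the parser.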

Dattle is UTF-8

With numbers left undecided, it seems incongruent that strings aren't also undecided. After all, strings have just as many different representations and are certainly complicated. However, strings are so complicated that for human readable formats we all seem to have arrived at the decision to use Unicode.

UTF-8 was chosen with the expectation that, since Dattle is human readable and used in programming contexts, the majority of strings will be ASCII while the format still needs to support more complex characters. Otherwise, UTF-16 would also have been fine.
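
A quick way to see the trade-off is to compare encoded sizes in Python. The sample strings are made up for illustration; the point is only that ASCII-heavy text costs one byte per character in UTF-8 and two in UTF-16.

```python
ascii_key = "amount"                  # typical programming-context string
mixed = "na\u00efve caf\u00e9"        # "naïve café": 8 ASCII + 2 non-ASCII characters

print(len(ascii_key.encode("utf-8")))      # 6
print(len(ascii_key.encode("utf-16-le")))  # 12
print(len(mixed.encode("utf-8")))          # 12
print(len(mixed.encode("utf-16-le")))      # 20
```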

Comments not found

Another missing feature is the ability to write comments in Dattle. They aren't supported in the spirit of reducing the number of allowable-but-not-proper forms. Comments are neither composable nor data. Having comments would lead people to develop half-baked extension systems inside their Dattle files.

Creating a "Dattle but with comments" would be trivial given how limited the grammar is. No one can stop others from tacking comments on to their data format. So let's talk about it here to avoid fragmentation.

Files containing Dattle with comments have the file extension .dtc, and the comment syntax is as follows:

# <comment> #

# <multi-line
  comment> #

# <comment with a hash \# in it> #

The hash # is chosen for consistency with the other single character "container" characters for strings, vectors, and maps:

"abcd"
["a" "b" "c" "d"]
{"a" "b" "c" "d"}
# abcd #

The spaces between the comment text and the hashes are required. We want to discourage extension systems using comments, and # marker # deters that urge more than #marker#.

We also don't support unclosed comments like # comment <end-of-line> as the Dattle grammar has no notion of end-of-line itself. End-of-line behaviors get weird for parsing done on streams of data as the parser now also needs to factor those in.
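
One way to handle .dtc files is to strip the comments and hand the result to an ordinary Dattle parser. The Python sketch below does that under a few assumptions of mine: a # inside a string is ordinary text, a backslash inside a string or comment escapes the next character, "spacing" means any whitespace character, and each comment is replaced by a single space so neighbouring values stay separated. strip_comments is a hypothetical name.

```python
def strip_comments(text: str) -> str:
    """Return .dtc text with every `# ... #` comment replaced by one space."""
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            end = _string_end(text, pos)
            out.append(text[pos:end])      # copy strings through untouched
            pos = end
        elif ch == "#":
            pos = _comment_end(text, pos)
            out.append(" ")
        else:
            out.append(ch)
            pos += 1
    return "".join(out)


def _string_end(text: str, start: int) -> int:
    """Index just past the string that opens at start."""
    pos = start + 1
    while pos < len(text):
        if text[pos] == "\\":
            pos += 2                       # skip the escaped character
        elif text[pos] == '"':
            return pos + 1
        else:
            pos += 1
    raise ValueError("unterminated string")


def _comment_end(text: str, start: int) -> int:
    """Index just past the comment that opens at start."""
    pos = start + 1
    if not text[pos:pos + 1].isspace():
        raise ValueError("whitespace is required after the opening #")
    pos += 1
    while pos < len(text):
        if text[pos] == "\\":
            pos += 2                       # \# does not close the comment
        elif text[pos] == "#":
            if not text[pos - 1].isspace():
                raise ValueError("whitespace is required before the closing #")
            return pos + 1
        else:
            pos += 1
    raise ValueError("unclosed comment")


print(strip_comments('["a" # note # "b"]').split())   # ['["a"', '"b"]']
```

Because there is no end-of-line rule, the same loop handles single-line and multi-line comments, and an unclosed comment is an error instead of running to the end of the line.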

The "Dattle" name

Dattle was chosen because it:

  • is not a real word,
  • should sound distinctly recognizable to programmers,
  • should be easy to search for online,
  • contains most of the word "data" to emphasize its intended use,
  • abbreviates to the relatively unclaimed file extension .dt.

Speaking of brand optics, we also need a logo so here we go:

The Dattle logo is a stretched out brown letter D

This logo was chosen because:

  • It's easy to draw.
  • It's simple to reflect the simplicity of the language.
  • The logo looks elongated to express how far Dattle can be taken because it is compositional.
  • The logo is one color so it's easy to modify to fit the situation.

And with that, Dattle has arrived.