ArticlesEngineering

Database interface evolution

A history of how my homemade database access interface has evolved.

At the start of 2018, I began a journey of building my own database interface. It has grown a lot over the years and now felt like a good time to look back at how it has evolved.

Raw SQL

I had used so many database wrappers, object relational mappers (ORMs), and query builder interfaces prior to 2018 that I didn't have a good understanding of the fundamentals. So I wanted to use an interface as close to the "metal" as possible: raw SQL strings.

I started using sql-template-strings on top of node-pg to write queries like this:

const { Pool } = require('pg')
const sql = require('sql-template-strings')

const pg = new Pool({ connectionString: uri })

function findHuntById(huntId) {
  const query = sql`
    select
      id,
      name,
      code,
      type,
      description,
      begin_message as "beginMessage",
      end_message as "endMessage",
      clue_list as "clueList"
    from hunts
    where id = ${huntId}
  `
  return pg.query(query).then(({ rows: [hunt] }) => hunt)
}

Things to call out:

  • I wanted to use a bunch of snake_case_name as "camelCaseName" so the columns would look idiomatic in the resulting JavaScript. Snake case is the superior casing: easier to read, break into words, and doesn't suffer from the acronym problem (e.g. "HttpServer" vs "HTTPServer"). Otherwise I could have made the table columns also use camel case to skip this annoying conversion.

  • I wanted to avoid select * and be explicit about the columns I needed so there wouldn't be any waste. There are trade-offs with choosing to be explicit each time: no waste, but also higher maintenance as new columns are added or removed.

    Most applications only need a subset of the data, however for the intermediate steps between that use case and where the data is loaded from the database, mirroring that usage is difficult. If you really do care about performance, you can eat the cost of a proliferation of specific queries and less reuse to support that. Being explicit also means if you run a migration to delete an unused column, if you don't update the queries first you will get "column not found" errors instead of getting some data gracefully.

  • This query could be wasteful/inefficient. We really only want a single hunt but I forgot to add the LIMIT 1 to say that in the query. The ID is unique so we'd hopefully stop searching after the first result. This is to say our use of results can not match our query, queries may rely on implicit assumptions, and human error can creep in easily.

  • This can easily be debugged in isolation. Simply fire up psql and run this same query to see what happens. There are not layers and layers of indirection that make understanding what is actually happening hard.

At this point I wasn't thinking about evolving beyond these strings. However as months went on, I made more and more of the same queries over and over with only slight variations. This repetition and boilerplate along with all the brittleness seemed like it needed some abstraction.

Raw query strings could be optimized really well and were the most comprehensive interface, however it wasn't a good programming interface.

SQL as data

Many people over the years have expounded on the problematic legacy SQL has left by only exposing a human string interface. The same can be said of terminal shells and other things that speak over text instead of structured data. They are painful to program against.

So the next phase I went through was building a data-oriented interface for constructing SQL queries. This project I wrote and published as querie, allows writing queries without string templates:

import { createQuery } from 'querie'

const huntTable = {
  name: 'hunts',
  aliases: {
    beginMessage: 'begin_message',
    endMessage: 'end_message',
    clueList: 'clue_list',
  },
}

function findHuntById(huntId) {
  const query = createQuery({
    kind: 'select',
    table: huntTable,
    columns: [
      'id',
      'name',
      'code',
      'type',
      'description',
      'beginMessage',
      'endMessage',
      'clueList',
    ],
    where: {
      id: ['=', huntId],
    },
  })

  return pg.query(query).then(({ rows: [hunt] }) => hunt)
}

Things to note:

  • Dang, now we need to learn this too. We no longer have something we can directly copy and paste into psql. We do have a separation between query building and query running here so we could set a breakpoint to grab query, but we've got to get into a place to execute this code.

  • Already we are losing many features provided by raw SQL. For example, there is ambiguity with selecting "column names". These column names will be properly escaped to avoid SQL injection attacks, but that means functions like "count(*)" will also escape as columns and not work.

    Our where clause is also limited since it is modeled as a hash of {[column]: [operation, value]}. If we wanted to have a query like column > 3 AND column < 8 we need a more powerful syntax, which querie does:

    where: [
      ['>', 'column', 3],
      ['<', 'column', 8],
    ]
    
  • We have moved the column casing issue: we now have an alias map so we can think in JS terms universally in the code. This takes some additional boilerplate but that boilerplate only lives in one place and can be reused.

SQL has so many features that I never had time to create a data interface to replicate. Definitely querie serves a purpose that plain SQL could not: it let me start building cooler abstractions on top. However, I also think querie has a sweet spot: it is stateless, side-effect free, and only builds query strings. Having this serve as a library made it really easy to put to work in many places.

Models

On top of querie I started building something like an ORM. It isn't an ORM because models and results were not really object-oriented objects. They were collections of functions that would accept and return plain data.

Models were where I could start defining common methods like findById(id) because I could start to assume things you couldn't with plain SQL (e.g. that every row has an ID). These models did the query building and ran the query to give the result:

const huntModel = createModel({
  tableName: 'hunts',
  columns: [...]
})

const hunt = await huntModel.findById(huntId)

Things to note:

  • We aren't really talking about queries anymore, we are calling functions and hoping they return the right data. It is yet another layer of indirection on top of querie.

  • We aren't really talking about SQL anymore, we could be loading these from a non-SQL data store.

While common utilities are nice, we also need the flexibility to query things only on specific models. To satisfy this need, models included a model.escapeHatchQuery which directly exposed querie's more flexible interface.

Models and executor

Models had become well established in the code for about a year before I made a somewhat large refactor to fix one thing. Testing the model interface as it was designed was super painful. The coupling between query and performing the side-effect leads to gross hacks to be able to test. I had set in the rest of the codebase a high bar for testing: everything needed to be testable without mocking.

So I aligned models more closely with querie, they were rewritten to return only query data. A new side-effect performer would take that data and finally do the work:

const query = huntModel.findById(huntId)
const hunt = await db.exec(query)

These enable some really cool things. For example, say we wanted to use a totally different database per customer (e.g. sharding by customer). We don't need to plumb the customer down to calls on models, only the db needs to route to the right database. These benefits apply more generally to connection pooling and management as well.

More modeling

The remaining work I've done to evolve the interface has really come down to:

  1. Take on various problems and try to solve them perfectly (best performance, best problem modeling) using escapeHatchQuery.
  2. Examine common themes and create utilities to reuse across them.
  3. Pilot the new utilities on a current problem.
  4. Go back and rewrite the old escapeHatchQueries to use the well defined interface.

One cool place where this organic growth and evolution happened was with pagination. Pagination seems so simple, but I found myself struggling with it each time I came back to write it at a higher level of abstraction. The interface pagination finally arrived at:

// pagination that orders by the created_at column
const pagination = huntModel.makeCreatedPagination()
const resultPage = await loadModelPage({
  db,
  pagination,
  paging: {
    startingAfter: 'hunt_1234',
    limit: 10,
  },
})

The pagination itself is a collection of queries that loadModelPage leverages to get a page of results.

Eventually the well-defined interfaces became comprehensive enough we could fully remove escapeHatchQuery.

Reflection

This turned out to be quite the journey despite being able to summarize it in only a few paragraphs.

  • I learned a lot. Working on my own interface really gives me a good perspective on other ODMs and such. Now some things I can appreciate more and others I can look at in greater horror.

  • There's always a lesson that optimization is a trade-off. When I started I wanted to do everything perfectly, no waste and very performant. This took a lot of mental overhead, tons of finely crafted boilerplate, and a lack of reusability.

    These days, I am more interested in moving quickly relying on abstractions my past self built and accept being non-optimal. It's tricky though: I feel comfortable leaning on the abstractions because I knew a perfectionist spent way too much time on them. I've definitely felt like there are abstractions I couldn't appreciate without knowing their history.

  • A limited feature set is a feature. This came along as more and more features which used escapeHatchQuery became well-defined. Every time this happened it was a relief to not have to think about the specific hard wiring anymore. I could be more confident in binary answers: if it doesn't have a well-defined interface, it is not supported and needs due diligence to add.

    Restricting the interface also allows for something that I see as more valuable than performance: predictable performance. If someone is able to write a super inefficient query it can have so many downstream impacts on other unrelated things. Well defined interfaces limit the blast radius of any change, and that's more a comfort than knowing something is the most performant it could be.

    There is a spectrum of trust we feel we can lend ourselves in the way we design interfaces. This is pretty cool.

I definitely spent more time on this than I anticipated and it never feels like the work is quite done. For so many years I had only ever gotten to use and see the by-products and artifacts of others taking on the same problem. I wanted to share these iterations so others might be able to make their own journey better or appreciate the work involved a bit more.