DRY truths about data models

How data models influence architecture

Posted by David Haley on September 19, 2020 · 20 mins read

#architecture  ·  #software

If software is computing with data to get new data, then software architecture is the way we shape said data and the computations over it. This architecture is abstract but its consequences are not: it influences and even determines what we can do, how fast we can do it, and importantly: how quickly we can adapt to doing new things.

Let’s talk about data modeling: how we represent and store our data. This means creating database schemas – whether structurally enforced in SQL, or more loosely defined in NoSQL. Data modeling is hard, and decisions become entrenched with time. Data permeates the entire application. Without precautions, shared models create deep coupling across software systems. Changing the data model might entail changing every line of code that manipulates it, as well as disrupt tooling – like admin interfaces…

One of the most common problems in hierarchical data modeling is where to put shared data. Do I repeat the data on each item, or do I create a grouping layer for the shared data and reference it from each item?

This piece explores the trade-offs behind this decision. Knowing which questions to ask helps me make decisions quickly, without backing myself into too many corners later.

If you’d like to skip the context and motivation for this problem, jump to the case study.

Data modeling and duplication

Don’t Repeat Yourself – DRY – may as well be a mantra for software architecture. Everybody knows to define constants, create abstractions, and represent data in normal(-ish) form.

For example, we learn that if sibling database records all share a field value, we should move that value up to the shared parent record. Don’t do this:

graph LR
  book[a book]
  book --> chapter_1[a chapter<br>chapter_name: The Beginning<br>book_name: The Book]
  book --> chapter_2[a chapter<br>chapter_name: The Middle<br>book_name: The Book]
  book --> chapter_3[a chapter<br>chapter_name: The End<br>book_name: The Book]

"don't": shared attribute repeated across objects

and instead do this:

graph LR
  book[a book<br>book_name: The Book]
  book --> chapter_1[a chapter<br>chapter_name: The Beginning]
  book --> chapter_2[a chapter<br>chapter_name: The Middle]
  book --> chapter_3[a chapter<br>chapter_name: The End]

"do": shared attribute "dried" into the parent

We know this is correct, because the picture is simpler. 😐

There are good reasons why this is a basic principle. After all, a book’s chapter can only ever appear as a chapter in its book. Right?

Well, everybody also knows that all good principles have exceptions. Is it possible that repeating data might actually be good sometimes?

spoiler alert
source: knowyourmeme.com

 

Data duplication trade-offs

In most NoSQL databases, duplication is explicitly preferred. NoSQL typically optimizes for retrieving records from a single set, whereas SQL is geared toward handling deeply linked record sets (a.k.a. tables).

By discouraging multiply-indexed queries, NoSQL forces us to consider the core properties of a given data point. Which attributes are relevant to (almost) every usage of the data? At minimum, those need to go into the row so that the software isn’t performing multiple database queries. If you can’t issue an eager-join statement to the database, you are forced into the N+1 query problem.
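To make the N+1 shape concrete, here’s a minimal Python sketch of the problem (the db.query helper is hypothetical, not any particular library’s API):

# db is a hypothetical query helper, shown only to illustrate the shape
chapters = db.query("SELECT * FROM chapters")        # 1 query
for chapter in chapters:                             # + N queries, one per chapter
    book = db.query("SELECT * FROM books WHERE id = ?", chapter["book_id"])
    print(book["name"], chapter["name"])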

So duplication in NoSQL is often a performance question: which related columns do I need in sufficiently common queries that it makes sense to just inline them into the row? This happens so much that it has a name: NoSQL encourages “wide rows” because all the related columns get pulled in. Very much not normal form.
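For our book example, a “wide row” might inline the parent’s name into every chapter document; a sketch in Python dict form (field names are just illustrative):

chapter_doc = {
    "chapter_name": "The Beginning",
    "book_name": "The Book",    # deliberately duplicated onto each chapter
    "book_id": "book-123",
}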

SQL databases are very good at relational computation. They’re so fast that it’s tempting to DRY things up because, after all, we can easily join everything back together in queries. This is especially tempting when we learn about relational databases in theory, where the objective is deriving mathematical and algorithmic properties of a formal system.

Software systems evolve over time, representing the evolving needs of the human systems they serve. In engineering it’s usually more important to optimize for system evolution than for formal proofs. So the mathematical elegance of DRYness isn’t reason enough. Sorry, normal form. 😇

Join performance isn’t usually a huge problem in SQL. Most of the time adding an index solves the problem. But good indexing is a topic unto itself…
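As a sketch of “just add an index”, using Python’s built-in sqlite3 and the book/chapter schema from earlier (table and index names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE chapters (
    id INTEGER PRIMARY KEY,
    book_id INTEGER REFERENCES books(id),
    name TEXT)""")
# without this index, joining chapters to books scans the whole chapters
# table; with it, the join becomes an index lookup
conn.execute("CREATE INDEX ix_chapters_book_id ON chapters (book_id)")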

So whether it’s SQL or not, what are the trade-offs other than performance? To understand that, let’s look at how the data model shapes software architecture.

Architecting the data model

The data model is how we describe “facts”: posts on a blog, people on a team, money in a bank account. It’s whatever things our application stores to do its job. The model shapes what we can express: for instance, whether a book can appear in one collection or in several. Because data flows through the entire system, it even shapes core functional units (side-effect-free functions). Therefore, the data model shapes even what could be true… what we are capable of discussing, whether or not we save it away.

Being able to reason about hypothetical situations is part of what makes humans intelligent; software is more intelligent insofar as it can perform its functions “just for pretend”. With a device that lets us pretend the world isn’t how it is, we can reason about things that might happen in the future. This is a fundamental tool in how we humans think about the world.

DRYing our data constrains how our code manipulates it and, much more importantly, how that code evolves over time. This shape is generally called “the abstraction” or “software architecture”. Architecture is a fancy term for how software decomposes its problem into neat computational units, and how quickly those units may be extended and rearranged for new purposes. For example, my utensil drawer is poorly architected if I have to sift through napkins to get to spoons.

Architectural problems are very expensive, multiplicatively so as the abstraction proliferates throughout the software system. If everybody has to use the same system for different needs, then the system needs to understand every need. Inevitably, systems grow new usage patterns, and it’s easier to let a component accrete usages than to separate the emerging patterns. And indeed we should not pursue elegance for its own sake, shipping software only once it reaches its architectural zenith.

Sandi Metz argues in The Wrong Abstraction that when abstractions grow by accretion,

the code no longer represents a single, common abstraction, but has instead become a condition-laden procedure which interleaves a number of vaguely associated ideas. It is hard to understand and easy to break.

She terms the cost of changing such code “brutal”, and her suggestion is to remove dryness by introducing duplication. Is that like adding moisture? olive oil? I digress. One proposal is Mat Ryer’s don’t abstract methods prematurely:

So duplication is good, especially at first, but eventually it’s no longer premature to remove it. Let’s restate the problem: what are the trade-offs of repeating versus DRYing our domain facts? If we know the answer, we can recognize when that moment has arrived.

Case study in DRYness

Let’s explore this with payroll software; the first usage of “payroll” in 1740 meant:

pay-roll: the roll of parchment listing persons owed payment (Merriam-Webster)

Today it means basically the same thing: the process by which businesses pay workers, in compliance with applicable labor & tax codes. In the USA, basic payroll means handling time worked (including overtime), withholding & paying taxes, and timely filing of reports. Some businesses handle payroll internally; as businesses grow many outsource to an accountant. Accountants in turn often rely on payroll software such as Gusto, ADP, etc.

When a business pays its workers, that’s called “running payroll”. Most payrolls are run on some cadence (weekly, twice a month, etc.). Each payment is recorded with several line items, such as the gross pay, applicable taxes, associated benefits… It’s like line items on a receipt or invoice. Each worker has a gross pay, a net pay, and a breakdown of the numbers in between.

Let’s say we’re paying workers in Oakland and New York. For simplicity’s sake, let’s say the only relevant payroll factors are how much you made, and where you live. A payroll might look something like:

graph LR
  payroll --> payroll_item_1_1[earning 1<br>$100, worker: A, location: Oakland]
  payroll --> payroll_item_1_2[tax 1<br>$10, worker: A, location: Oakland]
  payroll --> payroll_item_2_1[earning 2<br>$200, worker: B, location: New York]
  payroll --> payroll_item_2_2[tax 2<br>$30, worker: B, location: New York]

representing a basic payroll

or, if we defined the facts imperatively:

earning1 = Earning($100, workerA, Oakland)
tax1 = Tax($10, workerA, Oakland)

earning2 = Earning($200, workerB, New York)
tax2 = Tax($30, workerB, New York)

p = Payroll(earning1, tax1, earning2, tax2)
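In runnable form, those flat facts might look like the following Python sketch (class and field names are illustrative, with amounts in cents):

from dataclasses import dataclass

@dataclass
class Earning:
    amount: int      # cents
    worker: str
    location: str

@dataclass
class Tax:
    amount: int
    worker: str
    location: str

payroll = [
    Earning(100_00, "A", "Oakland"),
    Tax(10_00, "A", "Oakland"),
    Earning(200_00, "B", "New York"),
    Tax(30_00, "B", "New York"),
]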

Around now the “DRY alarm” goes off. ⏰ Each earning/tax combo repeats data: the worker and the location.

DRY, round 1

Grouping by worker is pretty intuitive: we typically administer payrolls by worker. The company payroll report breaks it out by worker; workers only access their own data. So let’s create a line item for each worker, with the earnings and taxes as sub-items:

graph LR
  payroll --> payroll_item_1[line item 1<br>worker: A]
  payroll_item_1 --> payroll_item_1_1[earning 1<br>$100, location: Oakland]
  payroll_item_1 --> payroll_item_1_2[tax 1<br>$10, location: Oakland]
  payroll --> payroll_item_2[line item 2<br>worker: B]
  payroll_item_2 --> payroll_item_2_1[earning 2<br>$200, location: New York]
  payroll_item_2 --> payroll_item_2_2[tax 2<br>$30, location: New York]

employee line item abstraction

Now our facts look like this:

earning1 = Earning($100, Oakland)
tax1 = Tax($10, Oakland)
line_item_1 = LineItem(workerA, earning1, tax1)

earning2 = Earning($200, New York)
tax2 = Tax($30, New York)
line_item_2 = LineItem(workerB, earning2, tax2)

p = Payroll(line_item_1, line_item_2)
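And the grouped model in the same runnable sketch form (again, names are illustrative):

from dataclasses import dataclass

@dataclass
class Earning:
    amount: int
    location: str

@dataclass
class Tax:
    amount: int
    location: str

@dataclass
class LineItem:
    worker: str
    earning: Earning
    tax: Tax

payroll = [
    LineItem("A", Earning(100_00, "Oakland"), Tax(10_00, "Oakland")),
    LineItem("B", Earning(200_00, "New York"), Tax(30_00, "New York")),
]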

At this point we usually feel pretty good: we have a data model that doesn’t repeat data (✅ data normalization), our data model is pretty intuitive with respect to reality and expected usage (✅ software agility), and we can use database indexes to keep queries fast (✅ fast UX). Databases are really fast.

What happens when we introduce a grouping concept into the data model?

Consequences of grouping by worker

Although it’s natural to group these items, the worker grouping doesn’t actually “exist”. Filtering by worker is how we happen to look at the data, but that view isn’t what it “is”. The grouping itself doesn’t carry semantic meaning. Our problem is to represent who was paid how much, when, where, and for what. That we happen to often sort or organize these line items by worker isn’t part of their structural truth.

Combined with the DRY mantra, grouping by worker really is the “everybody knows” way to spec out a payroll of workers getting paid. But good data modeling isn’t about replicating what’s intuitive; it’s about enabling our software to do useful things accurately, quickly, and with agility in the face of things to come.

Being skeptical of what “everybody knows” is a way to hone beginner’s mind 初心 “shoshin”. Because “everybody knows” usually just means what’s intuitive to me, myself, and I.


So: now that we’ve grouped by worker, any component accessing the worker on earnings and taxes needs to understand the “line item” concept. That’s because given an earning or tax, we can’t know who it’s for without issuing an additional DB query. In pseudo-ORM syntax:

oakland_earnings = all Earnings where location:Oakland
earning = oakland_earnings.first

// no can do: an Earning doesn't have a Worker
print "<earning.worker> earned <earning.money>"

// can do: an Earning has a LineItem...
line_item = earning.line_item
// ...which has a Worker
print "<line_item.worker> earned <earning.money>"

Accessing the line_item field means an extra query. By the way, hopefully we included the line_item in each earning as part of a finely tuned pre-fetch… otherwise we have the N+1 query problem again.
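In SQLAlchemy, for instance, that pre-fetch could be spelled as an eager-load option; a sketch, assuming an Earning model with a line_item relationship and an open session:

from sqlalchemy.orm import selectinload

# one extra query total for all the line items,
# instead of one query per earning
earnings = (
    session.query(Earning)
    .filter_by(location="Oakland")
    .options(selectinload(Earning.line_item))
    .all()
)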

It’s almost a given that new “serious” projects will use some kind of ORM: GORM in Golang, ActiveRecord in Ruby/Rails, SQLAlchemy in Python, Hibernate in Java…

… so, we probably configure our ORM with some syntactic sugar via the association between earnings and line items:

Line Item:
  has_many Earnings
  
// now this "just works"
print "<earning.line_item.worker> earned <earning.money>"

Eventually we see all the repetition of line_item, so we decide to DRY that up using the delegation pattern:

function Earning::worker {
  return self.line_item.worker
}

// now it's just
print "<earning.worker> earned <earning.money>"

 

n.b.: by DRYing up the data model, we introduced repetition in the code. DRYing up the code introduced the association abstraction. The resulting code has undone the data DRYness so as to appear as if it’s accessing the original data model. This is a hint, though not proof, that the “dried up” layer has no structural, semantic meaning.

But if our code really just wants the thing we removed from our data, we should be sure that the additional structure won’t become a problem.

the abstraction eats itself
source: wikipedia ouroboros

 

ORMs and data-code coupling

The pervasiveness of the ORM is how data models become entrenched. It’s just so easy and obvious to pass around the rich ORM data objects. The more the code uses the ORM, the more the code structure imitates the data’s. For example, the code tends to loop in the same layers as the data’s hierarchy.
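For instance, report code ends up shaped like the hierarchy itself; a sketch against our payroll model:

# the nesting of the loops mirrors the nesting of the data model
for line_item in payroll.line_items:
    for earning in line_item.earnings:
        print(line_item.worker, earning.amount)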

A grouping layer is like fixing a multi-dimensional space along one axis. For example, earning(worker, location, amount) is a three-dimensional fact: there is an earning for this worker, at this location, for this amount.

DRYing the worker into a line item removes that dimension from the fact. To obtain the full fact, you need to compose line_item(worker, earning) with earning(location, amount).
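Concretely, recomposing the full fact is a join; in plain Python it might look like this (illustrative dicts, not any particular API):

line_items = {1: "A", 2: "B"}  # line_item id -> worker
earnings = [
    {"line_item_id": 1, "location": "Oakland", "amount": 100},
    {"line_item_id": 2, "location": "New York", "amount": 200},
]

# earning(worker, location, amount), reassembled from two smaller facts
full_facts = [
    (line_items[e["line_item_id"]], e["location"], e["amount"])
    for e in earnings
]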

Abstractly speaking, it’s all the same. That’s the whole point of normal form: pulling one bigger thing is “the same” as composing two smaller things. So why does it matter? What are the actual consequences of this decision?

First, to achieve the abstract equivalence, all code needs to understand and (correctly) implement the composition.

The difficulty of implementing it correctly increases with the complexity of the business domain. Payroll is a very complex domain. Did you know that Pennsylvania alone has 2,492 local income tax jurisdictions? Yeah, good luck, have fun.

Code <-> ORM: coupled to implicit behavior

The ORM enables and encourages hiding this complexity via delegations, helper scopes, etc. But making something invisible means being coupled to whatever assumptions were made along the way. Let’s go back to how we turned earning.line_item.worker into earning.worker: an ORM association:

Line Item:
  has_many Earnings
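In SQLAlchemy terms, that association might look like the sketch below; note that the default lazy loading is exactly the kind of implicit behavior the code ends up coupled to (model and table names assumed from our example):

from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class LineItem(Base):
    __tablename__ = "line_items"
    id = Column(Integer, primary_key=True)
    worker = Column(String)
    # "has_many Earnings"; lazy="select" (the default) issues a hidden
    # query the first time .earnings is touched
    earnings = relationship("Earning", back_populates="line_item")

class Earning(Base):
    __tablename__ = "earnings"
    id = Column(Integer, primary_key=True)
    amount = Column(Integer)
    line_item_id = Column(Integer, ForeignKey("line_items.id"))
    line_item = relationship("LineItem", back_populates="earnings")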

… WIP themes …

  • to prefetch or not to prefetch
  • forcing data colocation (grouping has to be same place as grouped, for joins)
  • forced join queries
  • harder data science (the code thought it was convenient to un-DRY the data, why not data science?)

… WIP …

Conclusion

… WIP …

DRY data models are tempting, for good reason. DRY is simple, and as Einstein said,

everything should be made as simple as possible, but no simpler

A data model is too simple when it prevents you from fully expressing your problem.

Each layer of DRYness constrains the dimensions on the objects below it.

What is the actual cost of supporting multiple dimensions, and when is DRYing them away worth that cost?


Thanks to … for reviewing early versions.