Who knows about the overloaded data?

Reducing complexity to scale software

Posted by David Haley on April 09, 2021 · 12 mins read

#architecture  ·  #software

In evolutionary architecture, data models get overloaded – inevitably. Some new requirement comes in and we introduce some property that changes how the data is interpreted. Everybody using the data needs to agree on the implications. In software quality terms, this is a connascence of meaning: downstream usage of the data depends on understanding this property’s meaning. More connascence means more complexity; more complexity, more problems.
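
To make that concrete, here’s a minimal hypothetical sketch (the status field and the value 2 are invented for illustration): both sides must share one interpretation of the data, and neither can change it alone.

```python
# Hypothetical sketch of connascence of meaning: two otherwise
# unrelated functions must agree that status == 2 means "refunded".
REFUNDED = 2  # the shared meaning both sides depend on

def record_refund(txn):
    # Writer: encodes the business event as a magic number.
    txn["status"] = REFUNDED

def is_refunded(txn):
    # Reader: must interpret that number the exact same way.
    return txn.get("status") == REFUNDED

txn = {"id": 1, "status": 0}
record_refund(txn)
print(is_refunded(txn))  # True, because both sides share the meaning
```

If the writer ever repurposes the value without telling the reader, nothing breaks at compile time; only the meaning breaks.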

The data evolves with a boolean flag here, a type enum there. Next thing we know, as Sandi Metz writes in The Wrong Abstraction,

the code no longer represents a single, common abstraction, but has instead become a condition-laden procedure which interleaves a number of vaguely associated ideas.

The software becomes “hard to understand and easy to break”, making engineering costs “brutal”.

So what’s the “right” way to evolve software architecture? Well… it depends. Let’s say we’re about to overload the data. Who needs to know about this newly overloaded data? The answer suggests seams in the abstraction. If these entire modules over here only care about one usage, and those modules over there only care about the other usage, something about those modules is different. These differences in turn guide naming, software/data design, and software packaging.

Software packaging guides how code over here does or doesn’t access code over there. Scalable software is easy to understand and hard to break.


Let’s see how this might play out with a startup transaction system. A transaction could be anything really; a financial transaction, a payroll, a purchase order…

The “v-zero product” creates transactions – ahem – transactionally: when we make a transaction, it either succeeds or fails right there, aka synchronously. The saved database record represents the real-world result of the transaction. I’ll say txn for short.

class Transaction
	integer id
	decimal amount

function make_a_transaction(amount)
	txn = database.create_transaction()
	txn.amount = amount
	database.save(txn)

When some caller has a transaction, printing its amount is straightforward:

function print_amount(txn)
	print(txn.amount)

So what does print_amount need to know?

  • concept mapping: a transaction’s amount is read off the amount field.
  • data mapping: the amount field came from somewhere (like the database).

graph LR
	subgraph print_amount
		get_amount(txn amount = ?)
		txn_amount[/txn.amount/]
		get_amount --- txn_amount
	end
	subgraph data[data model]
		txn_table[(txn model)]
	end
	txn_amount --- txn_table

who knows what?

In some schools of thought, instance variables would never be accessed directly; instead all accesses go through some abstraction (like methods).

function print_amount(txn)
	print(txn.get_amount())

graph LR
	subgraph data[data model]
		txn_table[(txn model)]
	end
	subgraph txn.get_amount
		abstract_amount("txn amount = ?")
		txn_amount[/txn.amount/]
		abstract_amount --- txn_amount
		txn_amount --- txn_table
	end
	subgraph print_amount
		get_amount(txn amount = ?)
		get_amount --- abstract_amount
		get_amount -. maybe .- txn_table
	end

who knows what, with a getter

Why is this useful? get_amount separates print_amount from how amount is actually obtained. One example is that it enables zero-downtime database migration. But that’s another story…
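
For a taste of why, here’s a hypothetical sketch: suppose amounts are migrating from a legacy decimal column to a new amount_cents integer column (both names invented). The getter hides which column a caller actually reads.

```python
class Transaction:
    """Sketch: get_amount hides whether the data lives in the old
    or the new column during a zero-downtime migration."""

    def __init__(self, amount=None, amount_cents=None):
        self.amount = amount              # legacy decimal column
        self.amount_cents = amount_cents  # new integer column (hypothetical)

    def get_amount(self):
        # Prefer the new column once it's backfilled; otherwise
        # fall back to the legacy one. Callers never notice.
        if self.amount_cents is not None:
            return self.amount_cents / 100
        return self.amount

print(Transaction(amount=12.5).get_amount())        # 12.5 (legacy column)
print(Transaction(amount_cents=1250).get_amount())  # 12.5 (new column)
```

Callers like print_amount keep working untouched while the storage moves underneath them.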

Some frameworks encourage you to write these getters explicitly via a record or entity object. Other frameworks like ActiveRecord set up the abstraction implicitly using Rails magic. ✨

So how much does print_amount actually need to know about the database? It depends on how implicit your mapping into the data is. A friendly database object (like an ActiveRecord) might read from the database when you didn’t mean to… like in a loop…
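
A toy sketch of that hidden read; the Db class and its read counter are invented to make the surprise visible:

```python
class Db:
    reads = 0  # counts trips to "the database"

    @classmethod
    def fetch_amount(cls, txn_id):
        cls.reads += 1
        return 10.0  # pretend every transaction is $10

class FriendlyTxn:
    # ActiveRecord-ish object: reading .amount quietly issues a query.
    def __init__(self, txn_id):
        self.txn_id = txn_id

    @property
    def amount(self):
        return Db.fetch_amount(self.txn_id)  # hidden database read

txns = [FriendlyTxn(i) for i in range(100)]
total = sum(t.amount for t in txns)  # looks like cheap field reads...
print(Db.reads)  # ...but we just made 100 round trips
```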

Alternatively the transaction might be a passive data structure, the amount field is either there or it is not. But we need to know something about the model, after all we are doing something with the modeled data.
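
Sketching the passive version with a Python dataclass (again hypothetical): nothing happens behind your back, but the caller still carries the concept mapping.

```python
from dataclasses import dataclass

@dataclass
class PassiveTxn:
    # Plain data: no hidden queries. The field is there or it isn't.
    id: int
    amount: float

txn = PassiveTxn(id=1, amount=42.0)
print(txn.amount)  # a plain field read, never a database trip
```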


Soon enough we want asynchronous transaction processing. Maybe saving transactions takes a while and holds up whoever’s entering transactions. So we introduce the pending status: a transaction that’s about to be real … probably? Until processing succeeds, pending transactions are not yet real.

class Transaction
	integer id
	boolean is_pending
	decimal amount

function save_transaction(amount)
	txn = database.create_transaction()
	txn.amount = amount
	txn.is_pending = true
	database.save(txn)
	start_processing(txn)

// some asynchronous time later...

function processing_finished(txn)
	txn.is_pending = false
	database.save(txn)

All amount accesses must correctly determine which transactions are “real” from their perspective to maintain system behavior.

  • the transaction processor cares very much about pending status.
  • a system operator needs to see if a transaction is stuck in the pending status.
  • but the reporting & bookkeeping modules only care about real aka non-pending transactions.

Of all the places we read amount, let’s work with print_amount. Our nifty little function has been reused in a few places, say:

// used in the operator’s panel
function print_user_txns(user_id)
	txns = fetch_txns_for_user(user_id)
	for txn in txns:
		print_amount(txn)

// used by accounting to keep the books
function print_daily_txns(date)
	txns = fetch_txns_on_date(date)
	for txn in txns:
		print_amount(txn)

Here are two classic options for handling the pending condition:

1- keep the print_amount abstraction for printing amounts, and parameterize it
2- create another abstraction aka API for the behavior

Option 1: parameterize it!

Parameterizing the API means callers need to know about this data overloading.

function print_amount(txn, include_pending)
	if !txn.is_pending || include_pending:
		print(txn.amount)

Unifying the abstractions with a parameter means each caller knows about the abstraction as well as its parameter value.

graph LR
	subgraph data[data model]
		txn_table[(txn model)]
	end
	subgraph print_amount
		get_amount(txn amount = ?)
		txn_amount[/txn.amount/]
		include_pending[/include_pending/]
		txn_processed[/txn.is_pending/]
		get_amount --- txn_amount
		get_amount --- txn_processed
		txn_amount --- txn_table
		txn_processed --- txn_table
		txn_processed --- txn_amount
		include_pending --- txn_processed
	end
	print_user_txns
	print_daily_txns
	print_user_txns ---|yes| include_pending
	print_daily_txns ---|no| include_pending

who knows what, with an API parameter

It’s tempting to default to some value. This gives an appearance of continuity and reduces code changes required. Since our system previously assumed any transaction was real, we’d probably default to include_pending=false. Importantly, this guarantees that any missed case won’t treat pending transactions as real!

The default is tempting because it apparently cuts the dependency between the pending concept and most usages. But it does this by just making the dependency implicit. ✨
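
In Python-ish terms the defaulted parameter might look like this (a sketch, with a return value added so the skip is observable):

```python
def print_amount(txn, include_pending=False):
    # The default means call sites that never heard of "pending"
    # keep treating only non-pending transactions as real.
    if not txn["is_pending"] or include_pending:
        print(txn["amount"])
        return txn["amount"]
    return None  # silently skipped; the caller can't tell why

# An old call site, unchanged, now implicitly depends on the default:
print_amount({"amount": 5.0, "is_pending": False})  # prints 5.0
print_amount({"amount": 9.0, "is_pending": True})   # prints nothing
```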

Option 2: duplicate it!

Duplicating the “print the transaction” concept means creating at least two functions:

  • one that just prints the amount
  • one that only prints the amount for real transactions

Maybe we write something like:

function print_amount(txn)
	print(txn.amount)

function print_real_amount(txn)
	if !txn.is_pending:
		print(txn.amount)

Who knows what now?

graph LR
	subgraph data[data model]
		txn_table[(txn model)]
	end
	subgraph print_amount
		get_amount(txn amount = ?)
		txn_amount[/txn.amount/]
		get_amount --- txn_amount
		txn_amount --- txn_table
	end
	subgraph print_real_amount
		real_get_amount(txn amount = ?)
		real_txn_amount[/txn.amount/]
		real_txn_processed[/txn.is_pending/]
		real_get_amount --- real_txn_amount
		real_get_amount --- real_txn_processed
		real_txn_amount --- txn_table
		real_txn_processed --- txn_table
		real_txn_processed --- real_txn_amount
	end
	caller
	caller ---|this one?| real_get_amount
	caller ---|this one?| get_amount

who knows what, with API duplication

The implementations are simpler than the parameterized version. Generally speaking, too many parameters and flags create confusion. Despite this simplification, the caller must still understand which function to call (if aware of both).

The naming choice shapes future engineers’ assumptions:

  • print_amount vs print_real_amount suggests the plain amount field is the standard.
  • print_raw_amount vs print_amount suggests real amounts are the standard.

Is it too easy to conclude that the standard value is the real one?


Remember the method approach, where we wrote txn.get_amount() instead of txn.amount? It has the same problem as the above, but for all usages of amount, not just print_amount.

Let’s say we made get_amount aware of pending transactions:

function Transaction.get_amount()
	if self.is_pending:
		return 0.0
	else:
		return self.amount

This is a vital dependency on is_pending: as with duplication, naming matters!

graph LR
	subgraph data[data model]
		txn_table[(txn model)]
	end
	subgraph transaction.get_amount["transaction.get_amount()"]
		abstract_amount("txn amount = ?")
		txn_processed[/txn.is_pending/]
		txn_amount[/txn.amount/]
		abstract_amount --- txn_processed
		abstract_amount --- txn_amount
		txn_processed --- txn_amount
		txn_amount --- txn_table
		txn_processed --- txn_table
	end
	subgraph caller
		get_amount(txn amount = ?) -. maybe .- txn_table
		get_amount --- abstract_amount
		txn_processed -. implicit dependency .- get_amount
	end

who knows what, with an opinionated method

Our ability to write high-quality software at speed and scale is directly impacted by how much we need to know. Do we still write assembly? Of course not – not unless we really need to know the machine details. The more we need to know, the more things we can get wrong.

In the above scenario, there are at least two groups of system users: those who might need to deal with unprocessed transactions, and those who never, ever need to. It turns out that most people only deal in real, not pending data. This reveals a crucial seam in the abstraction, and we can remove the implicit dependency by simply never exposing that data to those consumers in the first place.
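
One way to build that seam, sketched with a hypothetical fetch_real_txns query helper: the pending concept lives at the seam, and reporting code downstream never sees it.

```python
# Sketch: the reporting side of the seam only ever receives real
# transactions, so it cannot mishandle pending ones.
ALL_TXNS = [
    {"id": 1, "amount": 10.0, "is_pending": False},
    {"id": 2, "amount": 99.0, "is_pending": True},   # still processing
    {"id": 3, "amount": 25.0, "is_pending": False},
]

def fetch_real_txns():
    # The pending concept lives here, at the seam, and nowhere
    # downstream of it.
    return [t for t in ALL_TXNS if not t["is_pending"]]

def report_total():
    # Reporting code: no idea that "pending" even exists.
    return sum(t["amount"] for t in fetch_real_txns())

print(report_total())  # 35.0; the pending $99 never reached us
```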

Each condition added to an abstraction is an opportunity to ask: who actually needs to know about that overload? In the real world this applies not only to single fields but also to functions, modules, and entire system components. Sometimes I like not knowing how something works; sometimes it very much matters to me. The larger the component being abstracted, the larger the impact when I don’t get what I expected.

I don’t need to understand that which I don’t need to know.

With thanks to Lynn Langit, Upeka Bee, Ishmael King, and Stephan Hagemann for review & feedback.