(This post is part of a series on working with data from start to finish.)
Every data infrastructure starts with the same thing: raw data. To process and analyze data, first you must capture it, and this begins in the data sources layer.
Historically, database administrators were largely uninvolved in the collection of data. Instead, they ingested data into the infrastructure “as-is”. Questions from business users about a source system’s data model, semantics or accuracy were tactfully redirected to the appropriate system owner: Salesforce questions went to sales, HubSpot questions to marketing, QuickBooks questions to finance.
Like a revolving door, questions about data routinely went in one side and out the other. DBAs believed their job was to manage everything within the data warehouse. Business users, on the other hand, believed the data platform was responsible for managing all data assets at the firm.
As the mandate of traditional DBAs progressively shrank, the mandate of the broader data team correspondingly grew. The data platform expanded its purview into procuring data, logging data, cataloging data and developing well-designed data models.
Data engineers were to apply data collection best practices upstream and ensure that data going into the data warehouse was of high quality. Within the sources layer, data acquisition fell into a handful of common functions.
These functions further divided into two broad categories depending on the type of data collected: “event-based data” and “entity-based data”. With event-based data, your foremost concern typically related to write capacity. With entity-based data, you focused on high-fidelity domain modeling.
This distinction was popularized by Martin Kleppmann, who in Designing Data-Intensive Applications contrasted two foundational data management designs:
In a schema-on-write architecture, “entities”, described and encoded in a schema, are the principal focus. For schema-on-read, it is “events” (sometimes referred to as “documents”).
From a data warehousing perspective, entities generally become our dimensions and events become our facts. For example, if we are building a mosaic of users by tying together click activity, marketing engagement and transaction history, then the user (entity) is our dimension to be subsequently joined onto click, marketing and transaction facts (events).
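To make the dimension/fact distinction concrete, here is a minimal sketch in Python (the table contents and field names are illustrative, not from any real system): a users dimension is joined onto a stream of click facts, so each event is enriched with the attributes of the entity that produced it.

```python
# A "users" dimension: one row per entity, keyed by user_id.
users = {
    "d41d8cd9": {"name": "Ada", "plan": "pro"},
    "a1b2c3d4": {"name": "Grace", "plan": "free"},
}

# Click "facts": many rows per entity, each referencing the dimension key.
clicks = [
    {"user_id": "d41d8cd9", "page": "/pricing"},
    {"user_id": "d41d8cd9", "page": "/docs"},
    {"user_id": "a1b2c3d4", "page": "/home"},
]

# Join facts onto the dimension (the warehouse equivalent of a LEFT JOIN).
enriched = [{**event, **users.get(event["user_id"], {})} for event in clicks]

for row in enriched:
    print(row["page"], row["plan"])
```

The same shape holds for marketing and transaction facts: each one carries the `user_id` key, and the dimension supplies everything we know about the user.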
While your run-of-the-mill relational database such as PostgreSQL or MySQL can accommodate both approaches perfectly well, different databases are commonly used for each architecture.
Relational databases (RDBMS) excel at modeling and normalizing entities. Conversely, non-relational databases (i.e. NoSQL) and logging systems (e.g. Kafka, Logstash and DataDog) excel at high write capacity and are able to absorb hundreds of thousands of writes per second.
In a schema-on-write architecture, the data model (or “schema”) reigns supreme. New data destined for the database is blocked unless it conforms to the existing schema; no data at all can be written unless a schema exists a priori.
The schema defines what things are (entities), what characteristics individual entities possess (attributes), and how entities relate to one another (relationships). As a result, schema definition forms the semantic backbone of an organization.
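As a minimal illustration of schema-on-write (using SQLite purely as a stand-in; the table and columns are hypothetical), the schema must exist before any write, and a row that violates it is blocked outright:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema must exist a priori: entities, attributes, constraints.
conn.execute("""
    CREATE TABLE users (
        user_id TEXT PRIMARY KEY,
        email   TEXT NOT NULL
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO users VALUES ('d41d8cd9', 'ada@example.com')")

# A row that violates the schema (NULL in a NOT NULL column) is rejected.
try:
    conn.execute("INSERT INTO users VALUES ('a1b2c3d4', NULL)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The database, not the application, is the gatekeeper: every writer is forced to agree on what a “user” is before a single byte lands on disk.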
When it comes to data management, it is prudent to follow the practices advocated in the relatively mature field of master data management. Some in particular include:
Master data management applies whenever we are dealing with entity-based data.
It doesn’t matter whether the entities live in an enterprise system (such as Salesforce or HubSpot), a vendor API (such as Stripe or Twilio), a self-managed database (such as RDS, Azure SQL or MongoDB), an Excel file, or a text file. If you are working with logical entities, you should practice MDM.
In a schema-on-read architecture (also sometimes called “event sourcing”), we are mostly concerned about getting data safely and quickly into the system. If you are capturing hundreds of orders per second, you want to ensure you don’t lose any of them; losses can occur from time to time due to service unavailability, network failures and queue overcapacity (among other reasons).
At such a high write velocity, you generally don’t have time to enforce referential integrity for each event. You simply want to store the event quickly and durably. An example event looks like this:
client_timestamp_ms=1677549412381 server_timestamp_ms=1677549817539 user_id=d41d8cd9 order_id=0010316 server=nj-us-east-1 ip=18.104.22.168 amount=40 currency=USD
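In a schema-on-read world, structure is imposed only when the event is consumed. A minimal sketch in Python (the fields come from the example event above; the parsing and casting rules are an assumption, since the raw line carries no schema of its own):

```python
raw = ("client_timestamp_ms=1677549412381 server_timestamp_ms=1677549817539 "
       "user_id=d41d8cd9 order_id=0010316 server=nj-us-east-1 "
       "ip=18.104.22.168 amount=40 currency=USD")

# Apply a "schema" at read time: split key=value pairs, then cast the
# fields we happen to care about. Unknown fields pass through untouched.
event = dict(pair.split("=", 1) for pair in raw.split())
for field in ("client_timestamp_ms", "server_timestamp_ms", "amount"):
    event[field] = int(event[field])

print(event["user_id"], event["amount"], event["currency"])
```

Note that nothing stopped this event from being written with a missing or misspelled field; the reader, not the writer, pays the cost of interpretation.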
Logging systems expose client APIs which support tremendous write-throughput (often backed by a cluster of “write nodes”), quickly persist data to disk, copy it for redundancy and index it for rapid retrieval. Logging systems and implementations generally feature:
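The “persist quickly and durably” part can be sketched as an append-only log: each event is written as one line and flushed to disk before the write is acknowledged. This is a toy illustration of the idea, not how Kafka, CloudWatch or any real logging system is implemented:

```python
import json
import os
import tempfile

def append_event(log, event):
    """Append one event as a JSON line and force it to disk."""
    log.write(json.dumps(event) + "\n")
    log.flush()
    os.fsync(log.fileno())  # durable on disk before we acknowledge

path = os.path.join(tempfile.mkdtemp(), "events.log")
with open(path, "a") as log:
    append_event(log, {"user_id": "d41d8cd9", "amount": 40})
    append_event(log, {"user_id": "a1b2c3d4", "amount": 12})

# Events can be replayed later by reading the log back in order.
with open(path) as log:
    events = [json.loads(line) for line in log]
print(len(events), "events persisted")
```

Real systems layer the rest on top: a cluster of write nodes to absorb throughput, replication for redundancy, and indexes for rapid retrieval.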
On the frontend, examples of logging systems include Mixpanel, Segment, Hotjar and Heap Analytics. Backend logging systems include AWS CloudWatch, Splunk, DataDog, Logstash and Kafka, as well as NoSQL stores such as MongoDB and DynamoDB.