(This post is part of a series on working with data from start to finish.)
Every data infrastructure starts with the same thing: raw data. To process and analyze data, first you must capture it, and this begins in the data sources layer.
Historically, database administrators were largely uninvolved in the collection of data. Instead, they ingested data into the infrastructure “as-is”. Questions from business users about a source system’s data model, semantics or accuracy were tactfully redirected to the appropriate system owner: Salesforce questions went to sales, HubSpot questions to marketing, QuickBooks questions to finance.
Like a revolving door, questions about data routinely went in one side and out the other. DBAs believed their job was to manage everything within the data warehouse. Business users, on the other hand, believed the data platform was responsible for managing all data assets at the firm.
As the mandate of traditional DBAs progressively shrank, the mandate of the broader data team correspondingly grew. The data platform expanded its purview into procuring data, logging data, cataloging data and developing well-designed data models.
Data engineers were to apply data collection best practices upstream and ensure that data going into the data warehouse was of high quality. Within the sources layer, data acquisition fell into a handful of common functions.
These functions further divided into two broad categories depending on the type of data collected: “event-based data” and “entity-based data”. With event-based data, your foremost concern typically related to write capacity. With entity-based data, you focused on high-fidelity domain modeling.
This distinction was popularized by Martin Kleppmann, who in Designing Data-Intensive Applications contrasted two foundational data management designs:
In a schema-on-write architecture, “entities”, described and encoded in a schema, are the principal focus. For schema-on-read, it is “events” (sometimes referred to as “documents”).
From a data warehousing perspective, entities generally become our dimensions and events become our facts. For example, if we are building a mosaic of users by tying together click activity, marketing engagement and transaction history, then the user (entity) is our dimension to be subsequently joined onto click, marketing and transaction facts (events).
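To make the dimension/fact distinction concrete, here is a minimal sketch in Python (the table contents and field names are illustrative, not from any real system): a users dimension is joined onto a stream of click facts, so each event is enriched with the attributes of the entity that produced it.

```python
# A "users" dimension: one row per entity, keyed by user_id.
users = {
    "d41d8cd9": {"name": "Ada", "plan": "pro"},
    "a1b2c3d4": {"name": "Grace", "plan": "free"},
}

# Click "facts": many rows per entity, each referencing the dimension key.
clicks = [
    {"user_id": "d41d8cd9", "page": "/pricing"},
    {"user_id": "d41d8cd9", "page": "/docs"},
    {"user_id": "a1b2c3d4", "page": "/home"},
]

# Join facts onto the dimension (the warehouse equivalent of a LEFT JOIN).
enriched = [{**event, **users.get(event["user_id"], {})} for event in clicks]

for row in enriched:
    print(row["page"], row["plan"])
```

The same shape holds for marketing and transaction facts: each one carries the `user_id` key, and the dimension supplies everything we know about the user.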
While your run-of-the-mill relational database such as PostgreSQL or MySQL can accommodate both approaches perfectly well, different databases are commonly used for each architecture.
Relational databases (RDBMS) excel at modeling and normalizing entities. Conversely, non-relational databases (i.e. NoSQL) and logging systems (e.g. Kafka, Logstash and DataDog) excel at high write capacity and are able to absorb hundreds of thousands of writes per second.
In a schema-on-write architecture, the data model (or “schema”) reigns supreme. New data destined for the database is blocked unless it conforms to the existing schema; no data at all can be written unless a schema exists a priori.
The schema defines what things are (entities), what characteristics individual entities possess (attributes), and how entities relate to one another (relationships). As a result, schema definition forms the semantic backbone of an organization.
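As a minimal illustration of schema-on-write (using SQLite purely as a stand-in; the table and columns are hypothetical), the schema must exist before any write, and a row that violates it is blocked outright:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema must exist a priori: entities, attributes, constraints.
conn.execute("""
    CREATE TABLE users (
        user_id TEXT PRIMARY KEY,
        email   TEXT NOT NULL
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO users VALUES ('d41d8cd9', 'ada@example.com')")

# A row that violates the schema (NULL in a NOT NULL column) is rejected.
try:
    conn.execute("INSERT INTO users VALUES ('a1b2c3d4', NULL)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The database, not the application, is the gatekeeper: every writer is forced to agree on what a “user” is before a single byte lands on disk.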
When it comes to data management, it is prudent to follow the practices advocated in the relatively mature field of master data management. Some in particular include:
Master data management applies whenever we are dealing with entity-based data.
It doesn’t matter whether the entities live in an enterprise system (such as Salesforce or HubSpot), a vendor API (such as Stripe or Twilio), a self-managed database (such as RDS, Azure SQL or MongoDB), an Excel file, or a text file. If you are working with logical entities, you should practice MDM.
In a schema-on-read architecture (also sometimes called “event sourcing”), we are mostly concerned about getting data safely and quickly into the system. If you are capturing hundreds of orders per second, you want to ensure you don’t lose any of them; losses can occur from time to time due to service unavailability, network failures and queue overcapacity (among other reasons).
At such a high write velocity, you generally don’t have time to enforce referential integrity for each event. You simply want to store the event quickly and durably. An example event looks like this:
client_timestamp_ms=1677549412381 server_timestamp_ms=1677549817539 user_id=d41d8cd9 order_id=0010316 server=nj-us-east-1 ip=18.104.22.168 amount=40 currency=USD
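In a schema-on-read world, structure is imposed only when the event is consumed. A minimal sketch in Python (the fields come from the example event above; the parsing and casting rules are an assumption, since the raw line carries no schema of its own):

```python
raw = ("client_timestamp_ms=1677549412381 server_timestamp_ms=1677549817539 "
       "user_id=d41d8cd9 order_id=0010316 server=nj-us-east-1 "
       "ip=18.104.22.168 amount=40 currency=USD")

# Apply a "schema" at read time: split key=value pairs, then cast the
# fields we happen to care about. Unknown fields pass through untouched.
event = dict(pair.split("=", 1) for pair in raw.split())
for field in ("client_timestamp_ms", "server_timestamp_ms", "amount"):
    event[field] = int(event[field])

print(event["user_id"], event["amount"], event["currency"])
```

Note that nothing stopped this event from being written with a missing or misspelled field; the reader, not the writer, pays the cost of interpretation.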
Logging systems expose client APIs which support tremendous write-throughput (often backed by a cluster of “write nodes”), quickly persist data to disk, copy it for redundancy and index it for rapid retrieval. Logging systems and implementations generally feature:
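The “persist quickly and durably” part can be sketched as an append-only log: each event is written as one line and flushed to disk before the write is acknowledged. This is a toy illustration of the idea, not how Kafka, CloudWatch or any real logging system is implemented:

```python
import json
import os
import tempfile

def append_event(log, event):
    """Append one event as a JSON line and force it to disk."""
    log.write(json.dumps(event) + "\n")
    log.flush()
    os.fsync(log.fileno())  # durable on disk before we acknowledge

path = os.path.join(tempfile.mkdtemp(), "events.log")
with open(path, "a") as log:
    append_event(log, {"user_id": "d41d8cd9", "amount": 40})
    append_event(log, {"user_id": "a1b2c3d4", "amount": 12})

# Events can be replayed later by reading the log back in order.
with open(path) as log:
    events = [json.loads(line) for line in log]
print(len(events), "events persisted")
```

Real systems layer the rest on top: a cluster of write nodes to absorb throughput, replication for redundancy, and indexes for rapid retrieval.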
On the frontend, examples of logging systems include Mixpanel, Segment, Hotjar and Heap Analytics. Backend logging systems include AWS CloudWatch, Splunk, DataDog, Logstash and Kafka, as well as NoSQL stores such as MongoDB and DynamoDB.