2023-03-14 data

The architecture of a data infrastructure (Part 2)

(This post is part of a series on working with data from start to finish.)

When thinking about systems, it’s often useful to analyze the architecture of a system alongside its infrastructure. An architecture conveys a high-level understanding of how the system should operate in theory. An infrastructure specifies a low-level implementation of how the system will operate in fact. Where an architecture is conceptual, an infrastructure is physical.

A data infrastructure, too, typically follows a certain architectural pattern, the most common of which can be decomposed into a series of “layers”. These are:

  1. The Sources Layer: The locations and methodologies under which source data is captured across the firm
  2. The Transit Layer: Connectivity between data sources and their destination, the data warehouse
  3. The Warehouse Layer: The single source of truth for all firm-wide enterprise data
  4. The Semantic Layer: A unified documentation portal on all source data, derived data and metadata
  5. The Presentation Layer: A user-friendly business intelligence tool enabling rapid data analysis
  6. The Research Layer: A remote, high-capacity computing environment for large-scale, programmatic data analysis

Each architectural layer is undergirded by a single, infrastructural primitive. For transit, it is the data integrations tool; for presentation, the business intelligence tool; for research, the interactive notebook. Throughout this series, we’ll dive into each layer in considerably more depth.
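To make the layering concrete, here is a minimal sketch in Python that records the six layers and the primitives named so far. The `Layer` class and the `primitive_for` helper are hypothetical names chosen for illustration; primitives this post has not yet named are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    """One architectural layer (hypothetical structure, for illustration only)."""
    name: str
    purpose: str
    # The infrastructural primitive; None where the post does not name one yet.
    primitive: Optional[str] = None


# Ordered from where data originates to where it is analyzed.
LAYERS = (
    Layer("Sources", "where and how source data is captured across the firm"),
    Layer("Transit", "connectivity between data sources and the data warehouse",
          "data integrations tool"),
    Layer("Warehouse", "single source of truth for firm-wide enterprise data"),
    Layer("Semantic", "documentation of source data, derived data and metadata"),
    Layer("Presentation", "user-friendly tooling for rapid data analysis",
          "business intelligence tool"),
    Layer("Research", "remote, high-capacity environment for programmatic analysis",
          "interactive notebook"),
)


def primitive_for(layer_name: str) -> Optional[str]:
    """Return the primitive for a layer, matching the name case-insensitively."""
    for layer in LAYERS:
        if layer.name.lower() == layer_name.lower():
            return layer.primitive
    raise KeyError(f"unknown layer: {layer_name}")
```

Calling `primitive_for("transit")` returns the data integrations tool, while `primitive_for("warehouse")` returns `None` until a later post fills it in.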

In many ways, modern data engineering is more akin to DevOps than it is to software engineering. In DevOps, there is a fairly standard suite of services you need to implement no matter the organization: cloud hosting, CI/CD, IAM, environment isolation, virtual networking, containers, logging, monitoring, alerting.

Due to the uniformity of these needs, vendors specializing in DevOps have emerged over time to comprehensively address each. For cloud hosting, it is Google and Amazon and Microsoft. For CI/CD, it is GitLab, GitHub and Bitbucket. For logging, Datadog, Splunk and Elastic Stack.

And so it is with data engineering. You always need data connectivity between systems, which led to Stitch and Fivetran and Airbyte. You always need a distributed data warehouse, which led to Redshift and BigQuery and Snowflake. And you always need to visualize data, which led to Tableau and Power BI and Looker.

While software engineering proper always reserves the right to develop “custom components” - a state machine here, a compression algorithm there - little about the modern data stack is custom. For the most part, these are solved problems addressed with standard, off-the-shelf components. The job of the data engineer, then, is to provision, configure, connect and maintain such components.

This of course does not mean the job is easy! Even if you know how a magic trick is done, it’s another thing to do it well. And within a data infrastructure, with several disparate components, enormous data sets and long feedback loops, there are many, many ways for things to go wrong.

Build-versus-buy

Given such a mature vendor landscape in data engineering, one of the core skills of the data engineer becomes the ability to evaluate which vendor to choose, if any at all. In other words, the build-versus-buy decision.

Regarding “build”, only in rare cases should data engineers develop novel components, as doing so would reflect fairly bespoke data requirements that few organizations have. While most companies fantasize about vast machine learning deployments, real-time data systems and elaborate A/B testing infrastructures, few need these. Instead, they need basic infrastructure: data tracking, data centralization and data analysis.

Data engineers should almost never be writing their own extractors, orchestrators, data transformation frameworks or analytics frontends as, once again, these are all solved problems. It is far cheaper and faster to use hosted, off-the-shelf components (or free, open-source ones) than to pay engineers to create and maintain such systems.
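The cost side of this argument can be sketched as simple arithmetic. The Python below is a rough, assumption-laden model, not a real pricing study: every figure (engineer cost, build time, maintenance share, subscription fee) is a made-up placeholder, and the function and parameter names are hypothetical. Plug in your own numbers before drawing any conclusion.

```python
def build_cost(engineers: float, months_to_build: float,
               monthly_cost_per_engineer: float,
               maintenance_fraction: float, horizon_months: float) -> float:
    """Cost of building in-house, then maintaining it for the rest of the horizon.

    maintenance_fraction is the share of one engineer's time spent on upkeep
    after launch (an assumption of this sketch).
    """
    initial = engineers * months_to_build * monthly_cost_per_engineer
    months_maintained = max(horizon_months - months_to_build, 0)
    upkeep = months_maintained * maintenance_fraction * monthly_cost_per_engineer
    return initial + upkeep


def buy_cost(monthly_subscription: float, setup_months: float,
             monthly_cost_per_engineer: float, horizon_months: float) -> float:
    """Cost of a hosted component: one engineer for setup, plus the subscription."""
    setup = setup_months * monthly_cost_per_engineer
    return setup + monthly_subscription * horizon_months


# Placeholder inputs, purely illustrative.
HORIZON_MONTHS = 24
build_total = build_cost(engineers=2, months_to_build=6,
                         monthly_cost_per_engineer=15_000,
                         maintenance_fraction=0.5,
                         horizon_months=HORIZON_MONTHS)
buy_total = buy_cost(monthly_subscription=2_000, setup_months=1,
                     monthly_cost_per_engineer=15_000,
                     horizon_months=HORIZON_MONTHS)
```

With these placeholder inputs, `build_total` comes to 315,000 and `buy_total` to 63,000 over two years. The gap is driven almost entirely by engineer time, which is why the maintenance share matters as much as the initial build.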

Regarding “buy”, the question is which vendor to choose, as there are generally a handful for any given problem. Every vendor will claim to have a solution for all of your data problems. None of them will do so completely. Each offering presents its own tradeoffs. What feature set, user experience, pricing, performance, security and technical support does your particular organization require? It is increasingly the job of the data engineer to provide such an analysis.
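One common way to structure such an analysis is a weighted scoring matrix, sketched below in Python. This is one possible approach rather than a prescribed method: the criteria come from the questions above, but the weights, the 1-5 ratings and the vendor names ("Vendor A", "Vendor B") are hypothetical placeholders, as are the function names.

```python
from typing import Dict, List, Tuple

# Criteria taken from the questions above.
CRITERIA = (
    "feature_set", "user_experience", "pricing",
    "performance", "security", "technical_support",
)


def weighted_score(weights: Dict[str, float], ratings: Dict[str, float]) -> float:
    """Weighted average of a vendor's ratings across all criteria."""
    missing = [c for c in CRITERIA if c not in weights or c not in ratings]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * ratings[c] for c in CRITERIA) / total_weight


def rank_vendors(weights: Dict[str, float],
                 vendors: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (vendor, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(weights, r)) for name, r in vendors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical organization that cares most about features and pricing.
weights = {"feature_set": 3, "user_experience": 2, "pricing": 3,
           "performance": 2, "security": 1, "technical_support": 1}

# Hypothetical ratings on a 1-5 scale.
vendors = {
    "Vendor A": {"feature_set": 4, "user_experience": 3, "pricing": 2,
                 "performance": 5, "security": 4, "technical_support": 3},
    "Vendor B": {"feature_set": 3, "user_experience": 4, "pricing": 5,
                 "performance": 3, "security": 4, "technical_support": 4},
}

ranking = rank_vendors(weights, vendors)
```

Here "Vendor B" ranks first because pricing carries a heavy weight. Shifting weight toward performance would favor "Vendor A" instead, which is the point: the ranking reflects your organization's priorities, not the vendors' marketing.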

Although the role of the data engineer has rapidly evolved over the past two decades - from managing on-prem databases to coding up map-reduce jobs to pipefitting between various data vendors - one thing is clear: time-to-market and financial costs have dramatically fallen due to recent improvements in data technologies.

In the past, it would generally take quarters or even years and a handful of engineers to build a petabyte-scale data infrastructure. Now, enabled by tools from the modern data stack, it takes an engineer or two just a couple of months to do the same.