2023-03-17 data

How to build a reliable, useful and performant data infrastructure (Part 3)

(This post is part of a series on working with data from start to finish.)

How do we evaluate a data infrastructure? What aspects do we care most about? What are its goals, its guarantees, its service-level agreements (SLAs)?

Most data infrastructures will generally aim to deliver the same things:

  1. Coverage: Data across all enterprise systems is ingested into and available within the firm-wide data warehouse
  2. Availability: Data infrastructure components (e.g. data integrations, data warehouse, BI tool) are consistently operational
  3. Integrity: Data is complete, useful and free of errors (e.g. duplicates, omissions, miscalculations)
  4. Currency: Data is updated on a frequent basis
  5. Durability: Data is not inadvertently lost, deleted or rendered inaccessible (e.g. missing encryption keys)
  6. Comprehensibility: Data is well-defined, agreed upon by consensus and publicly documented
  7. Performance: Data is ingested, transformed and queried with relatively fast runtimes
  8. Security: Data is encrypted and permissioned using role-based access control (RBAC)

In short, we want as much data as possible to be correct, current, secure, understandable and fast.
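Several of these goals can be checked mechanically. As a minimal sketch of what that might look like for integrity and currency (assuming a Postgres-compatible warehouse reachable through SQLAlchemy and a hypothetical `orders` table; every name below is illustrative, not a prescription):

```python
# A minimal sketch of automated checks for two of the goals above: integrity
# and currency. Assumes a Postgres-compatible warehouse reachable through
# SQLAlchemy and a hypothetical `orders` table; all names are illustrative.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@warehouse/analytics")  # placeholder DSN


def check_integrity(conn) -> None:
    """Integrity: no duplicate primary keys, no missing amounts."""
    dupes = conn.execute(sa.text(
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1"
    )).fetchall()
    null_amounts = conn.execute(sa.text(
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL"
    )).scalar_one()
    assert not dupes, f"duplicate order_id values found: {dupes[:5]}"
    assert null_amounts == 0, f"{null_amounts} orders are missing an amount"


def check_currency(conn, max_lag_hours: int = 24) -> None:
    """Currency: the newest record is no older than the agreed SLA."""
    lag_seconds = conn.execute(sa.text(
        "SELECT EXTRACT(EPOCH FROM (NOW() - MAX(loaded_at))) FROM orders"
    )).scalar_one()
    assert lag_seconds is not None and lag_seconds <= max_lag_hours * 3600, (
        f"orders is stale (last load was {lag_seconds} seconds ago)"
    )


with engine.connect() as conn:
    check_integrity(conn)
    check_currency(conn)
```

Checks of this kind can run on a schedule and alert an engineer when a goal is violated, turning the list above from an aspiration into an enforced SLA.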

Software development best practices for data infrastructure #

When writing software, we generally follow certain software development practices. These help ensure that the software we produce is reliable, stable and functionally correct. For data infrastructure, these include:

Collectively, these guidelines minimize the frequency of errors, accelerate development velocity and improve understanding of the data infrastructure and its outputs.

The modern data mindset #

However, even if you adhere to every software development best practice above, the data infrastructure will still be missing something essential: velocity.

Each item above is technical in nature, which means each must be implemented and managed by an engineer. It is the engineer who is responsible for extracting schemaless fields into tabular form, who uniquely has visibility into the data model, and who produces documentation around where fields come from and what they mean. Little can be done without the engineer. And for a data infrastructure, whose ultimate goal is consolidating and conveying knowledge, this can materially hinder organizational learning velocity.

Traditionally, this engineer was referred to as the database administrator, or DBA, who served as fiduciary gatekeeper of the database. DBAs were responsible for provisioning the database, populating it, optimizing it, scaling it, securing it and more.

If you wanted data, you needed to go through the DBA. And because everyone around the firm constantly wanted more data, the DBA had little time to focus on anything else that makes for a better data infrastructure: user experience, client service, knowledge bases, project management and so on.

Over the past decade, improvements in data warehousing technology have largely obviated the need for traditional DBAs. Data warehouses are increasingly cloud-managed (as opposed to on-premises), auto-scaling, and vastly more performant for analytical workloads than ordinary, non-distributed databases. The demand for DBA skills waned, and the demand for more modern data engineering skills accelerated.

This refreshed skill set evolved to address the core drawback of traditional data infrastructures: centralized management and administration which, by extension, led to technical gatekeeping and slow development velocity.

The new infrastructure would be centralized on data best practices but decentralized on data operations. It would give users access to more data, faster, with cleaner and more polished interfaces. And it would be organized around what are now called “data mesh architecture” and “data-as-product” principles.

The data mesh architecture #

Although data mesh architecture goes by various names - DataOps, domain-driven design (DDD), service-oriented architecture (SOA) - the ideas are largely the same.

Business users want access to more data, faster. They want end-to-end visibility into where data came from and what kind of processing was applied. Increasingly, they know SQL, which empowers them to clean, join and enrich data themselves. Less and less, they need a data engineer to perform “English-to-SQL” translation, and more and more, they are able to encode business rules directly into code.
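To make that concrete, here is a sketch of what "encoding a business rule directly into code" can look like. The rule itself (a hypothetical "active customer" definition), the view name and the tables are all assumptions for illustration, and the SQL is wrapped in a small Python script only so it can be version-controlled and re-run:

```python
# A sketch of a business rule an analyst might encode directly in SQL. The
# rule, view name and tables are illustrative; Python is used only so the SQL
# can be versioned and re-applied.
import sqlalchemy as sa

ACTIVE_CUSTOMERS_SQL = """
CREATE OR REPLACE VIEW analytics.active_customers AS
SELECT
    c.customer_id,
    c.customer_name,
    SUM(o.amount) AS trailing_90_day_revenue
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
WHERE o.ordered_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY c.customer_id, c.customer_name
HAVING SUM(o.amount) > 0  -- the business rule: net positive spend in the last 90 days
"""

engine = sa.create_engine("postgresql://user:password@warehouse/analytics")  # placeholder DSN
with engine.begin() as conn:
    conn.execute(sa.text(ACTIVE_CUSTOMERS_SQL))
```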

Thanks to self-service tooling, there are fewer intermediaries than ever in the data stack. Anyone - from analysts to engineers to C-suite executives - can ingest data using Fivetran, process it using basic SQL, and visualize it in Tableau. All code and data are public by default. Through transparency and accessibility, the data mesh architecture democratizes access to the data warehouse.

Maxime Beauchemin, the creator of Apache Superset and Apache Airflow, astutely observes: “The modern data warehouse is a more public institution than it was historically, welcoming data scientists, analysts, and software engineers to partake in its construction and operation. Data is simply too centric to the company’s activity to have limitations around what roles can manage its flow.”

In a data mesh, the job of the engineer is not to perform every data operation imaginable: ingesting data, processing data, validating it and so on. Rather, it is to define and enforce best practices which enable business users to do their job more effectively.

This means setting engineering standards, performing code reviews, administering access control, training on tool and SQL usage, certifying data sets and managing metadata. The data engineer makes it easy for anyone in the company to build within the infrastructure, not around it.
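As one small illustration of "administering access control" in that spirit, grants can live in code rather than in a DBA's head. This is a sketch only: the role and schema names are assumptions, and the right policy will differ by firm.

```python
# A minimal sketch of role-based access control administered in code: analysts
# get read access to the certified schema and a sandbox to build in. Role and
# schema names are assumptions for illustration.
import sqlalchemy as sa

GRANTS = [
    "GRANT USAGE ON SCHEMA analytics TO analyst",
    "GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO analyst",
    "GRANT USAGE, CREATE ON SCHEMA sandbox TO analyst",
]

engine = sa.create_engine("postgresql://user:password@warehouse/analytics")  # placeholder DSN
with engine.begin() as conn:
    for statement in GRANTS:
        conn.execute(sa.text(statement))
```

Because the grants live in code, they can be code-reviewed, versioned and audited like any other change to the infrastructure.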

Data-as-product thinking #

When business users worked with traditional DBAs, they often grew accustomed to slow response times, cryptic answers littered with code fragments, and primitive interfaces. DBAs saw their jobs as technical in nature, and in turn, delivered technical solutions. Business users wanted business solutions.

Perhaps the most prevalent form a solution takes is that of a “product.” With a product, you expect reliable, QA-tested, factory-grade quality. You get customer service and thoughtful design and thorough documentation. You may hear about product roadmaps or even obsolescence plans.

A good product - rather, a good solution - eliminates friction around doing what you want to do. A good product just works. For a data infrastructure, this means that accessing and learning from data should be frictionless.

Embracing a product mindset goes one step further. When you deliver a product, you don’t just consider operations - the sequence of steps necessary to ship a product. You also consider sales and marketing and finance. A comprehensive data infrastructure will do this as well.

Data-as-product thinking is not just about getting things to work. It is about getting them to work well, and ultimately, to solve people’s problems as effectively as possible.
