2023-03-07 series

What is a data infrastructure and why do you need one? (Part 1)

(This post is part of a series on working with data from start to finish.)

On a long enough timeline, everyone realizes they need a data infrastructure.

Marketing wants to know who their customers are, sales wants to know if they’re on track to hit monthly quotas, and product wants to know if people are using the new features they launched. Everyone wants to know how their business is doing, and for that, they need visibility.

Absent a data infrastructure, these teams are forced to wrangle data themselves. They export data from their email marketing systems, customer relationship management systems and product tracking systems, then dump it all into a single, ponderous Excel workbook. Here, they “fuzzy join” on quasi-identifiers such as email address or full name, ultimately producing a lengthy but mostly comprehensive data set upon which they calculate aggregate statistics.

It is at this point that data errors routinely become apparent (“why is this figure so high?”), as do data ambiguities (“what does this field mean?”). Adjacent data is solicited (“can we also include this data?”), as is more recent data (“can we get this weekly?”).

Each step of this process is manual, time-consuming and error-prone.
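To make the fragility of this workflow concrete, here is a minimal sketch of the kind of join on a quasi-identifier described above. The field names (`email`, `opened_campaign`, `account_owner`) and the sample rows are hypothetical, and normalizing case and whitespace is only the simplest form of "fuzzy" matching; it will still miss aliases, typos and people with multiple addresses.

```python
def normalize_email(email):
    """Lowercase and trim an email so near-identical values compare equal."""
    return email.strip().lower()


def join_on_email(marketing_rows, crm_rows):
    """Left-join marketing rows to CRM rows on the normalized email address."""
    crm_by_email = {normalize_email(row["email"]): row for row in crm_rows}
    joined = []
    for row in marketing_rows:
        key = normalize_email(row["email"])
        match = crm_by_email.get(key, {})  # empty dict when the CRM has no match
        joined.append({**match, **row, "email": key})
    return joined


# Hypothetical exports from two separate systems.
marketing = [
    {"email": " Ada@Example.com ", "opened_campaign": True},
    {"email": "grace@example.com", "opened_campaign": False},
]
crm = [
    {"email": "ada@example.com", "account_owner": "Sam"},
]

combined = join_on_email(marketing, crm)
```

Rows without a CRM match pass through silently with missing fields, which is one way the errors and ambiguities mentioned above creep into the final workbook.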

It is this essential workflow that a data infrastructure aims to streamline. A data infrastructure offers:

Data centralization: Data across all enterprise systems is transferred to and stored within the firm-wide data warehouse
Self-service analytics: Data can be rapidly analyzed in a user-friendly business intelligence (BI) tool
Automated reporting: Data can be published (e.g. PDF, Excel, CSV) and delivered (e.g. SFTP, email) on a regular schedule
Automated testing: Data is routinely tested for integrity (e.g. duplicates, omissions, miscalculations, mismappings)
Data semantics: Data is clearly documented at the entity, field and value levels (including their lineages and relationships)
Data discovery: Data sets and dashboards can be rapidly surfaced across the organization
Data exposition: Data from the data warehouse is available to downstream consumers via API or ODBC (e.g. Jupyter notebooks, reverse ETL tools, software applications)
Research environments: Statistical and machine learning models can be developed in programming languages such as Python, R and Julia and in high-performance computing (HPC) environments
Access control: Data visibility is restricted based on user role (i.e. RBAC)
Monitoring and alerting: Health metrics (e.g. job completions, query latencies, data errors) are available in dashboards and trigger automated alerts upon exceeding certain thresholds
Usage statistics: Usage of data assets (e.g. data sets, dashboards) is available in administrative dashboards
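
As an illustration of the automated testing item in the list above, the following is a minimal sketch of two integrity checks, one for duplicates and one for omissions. The table, field names and helper functions are hypothetical; in practice such checks would typically run inside a dedicated testing framework on a schedule rather than as standalone functions.

```python
from collections import Counter


def find_duplicates(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def find_omissions(rows, required_fields):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    return [
        (index, field)
        for index, row in enumerate(rows)
        for field in required_fields
        if row.get(field) in (None, "")
    ]


# Hypothetical rows with one duplicated ID and one empty email.
orders = [
    {"order_id": 1, "customer_email": "ada@example.com"},
    {"order_id": 2, "customer_email": ""},
    {"order_id": 2, "customer_email": "grace@example.com"},
]

duplicate_ids = find_duplicates(orders, "order_id")
omissions = find_omissions(orders, ["order_id", "customer_email"])
```

A scheduled job could fail, or raise an alert, whenever either result is non-empty.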

Defining “data infrastructure”

So, in short, what is a data infrastructure?

It is an integrated collection of systems which capture, consolidate and transform data from across the organization for the purposes of human and machine learning.

Let’s unpack this definition. First we have “capture”: the data infrastructure is generally assumed to have all firm-wide data, and if it does not, it is responsible for acquiring it. This ranges from cataloging data assets at the firm to procuring third-party data sets to installing application-level instrumentation for data tracking purposes.

Next we have “consolidate”: all data must make its way to the firm’s “single source of truth”, the data warehouse. Here, we can apply firm-wide rules and definitions all in a single place without having to reapply them throughout a jumble of enterprise systems. In addition, we need not concern ourselves with data inconsistencies between various enterprise systems, such as a backend system and a frontend system reporting different numbers of active users, because only the data warehouse contains the authoritative calculation of such a metric.

Then we have “transform”: raw data must be validated, corrected, enriched and joined before it is considered usable by business stakeholders. After processing, this data is partitioned into clean, concise, domain-specific “data marts” where business users must do little more than filter and aggregate their data in order to generate insights.
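
The transform step can be sketched as follows, under heavy simplification. The `region` and `amount` fields, the validation rule and both function names are assumptions made for illustration; a real pipeline would more likely express these steps in SQL inside the warehouse. The point is the division of labor: the pipeline validates and corrects, so the business user only has to aggregate.

```python
from collections import defaultdict


def build_sales_mart(raw_orders):
    """Validate and correct raw orders, keeping only usable rows."""
    mart = []
    for order in raw_orders:
        amount = order.get("amount")
        if amount is None or amount < 0:
            continue  # validation: drop rows without a usable amount
        mart.append({
            "region": order["region"].strip().upper(),  # correction: normalize labels
            "amount": float(amount),
        })
    return mart


def revenue_by_region(mart):
    """The kind of simple aggregation a business user might run on the mart."""
    totals = defaultdict(float)
    for row in mart:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# Hypothetical raw data with inconsistent labels and a missing amount.
raw = [
    {"region": " emea", "amount": 100},
    {"region": "EMEA ", "amount": 50},
    {"region": "apac", "amount": None},
    {"region": "amer", "amount": 75},
]

totals = revenue_by_region(build_sales_mart(raw))
```

Because the inconsistent labels are corrected once, upstream, every consumer of the mart sees a single EMEA figure rather than two.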

Finally, we have the goal of a data infrastructure: to effect human and machine learning.

Traditionally, learning from the data is conducted by humans who perform exploratory data analysis (EDA) and statistical modeling to elicit patterns in the data. However, increasingly, machines are able to learn from data, detect their own patterns and make testable predictions. Both human and machine learning enable people to understand how systems work and develop concrete, actionable interventions to improve their performance.
