(This post is part of a series on working with data from start to finish.)
On a long enough timeline, everyone realizes they need a data infrastructure.
Marketing wants to know who their customers are, sales wants to know if they’re on track to hit monthly quotas, and product wants to know if people are using the new features they launched. Everyone wants to know how their business is doing, and for that, they need visibility.
Absent a data infrastructure, these teams are forced to wrangle data themselves. They export data from their email marketing systems, customer relationship management systems and product tracking systems, then dump it all into a single, ponderous Excel workbook. Here, they “fuzzy join” on quasi-identifiers such as email address or full name, ultimately producing a lengthy but mostly comprehensive data set upon which they calculate aggregate statistics.
It is at this point that data errors routinely become apparent (“why is this figure so high?”), as do data ambiguities (“what does this field mean?”). Adjacent data is solicited (“can we also include this data?”), as is more recent data (“can we get this weekly?”).
Each step of this process is manual, time-consuming and error-prone.
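To make the “fuzzy join” step concrete, here is a minimal sketch of what teams end up doing by hand in a spreadsheet. The record layouts, the `fuzzy_join` helper and the 0.85 threshold are all hypothetical, chosen only for illustration; string similarity comes from Python’s standard `difflib` module.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity score between 0.0 and 1.0 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_join(left, right, key, threshold=0.85):
    """Pair each left row with its best-scoring right row, if the score clears the threshold.

    The threshold is an assumed value; a real workflow would tune it by inspecting matches.
    """
    joined = []
    for l_row in left:
        best_row, best_score = None, 0.0
        for r_row in right:
            score = similarity(l_row[key], r_row[key])
            if score > best_score:
                best_row, best_score = r_row, score
        if best_row is not None and best_score >= threshold:
            # Left-hand values win on conflicting keys, so the left spelling of the name is kept.
            joined.append({**best_row, **l_row, "match_score": round(best_score, 2)})
    return joined


# Hypothetical exports from two systems that share no common identifier.
crm_export = [
    {"full_name": "Jane  Doe", "plan": "pro"},
    {"full_name": "John Smith", "plan": "free"},
]
email_export = [
    {"full_name": "jane doe", "opens": 12},
    {"full_name": "Jon Smith", "opens": 3},
    {"full_name": "Alice Wong", "opens": 7},
]

matches = fuzzy_join(crm_export, email_export, key="full_name")
for row in matches:
    print(row)
```

Note that “John Smith” and “Jon Smith” are joined even though they may be different people, which is exactly the kind of silent error this manual workflow invites.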
It is this essential workflow that a data infrastructure aims to streamline. A data infrastructure offers:
So, in short, what is a data infrastructure?
It is an integrated collection of systems which capture, consolidate and transform data from across the organization for the purposes of human and machine learning.
Let’s unpack this definition. First we have “capture”: the data infrastructure is generally assumed to have all firm-wide data, and if it does not, it is responsible for acquiring it. This ranges from cataloging data assets at the firm to procuring third-party data sets to installing application-level instrumentation for data tracking purposes.
Next we have “consolidate”: all data must make its way to the firm’s “single source of truth”, the data warehouse. Here, we can apply firm-wide rules and definitions in a single place without having to reapply them throughout a jumble of enterprise systems. In addition, we need not concern ourselves with inconsistencies between enterprise systems, such as a backend system and a frontend system reporting different numbers of active users, because only the data warehouse contains the authoritative calculation of such a metric.
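As a rough illustration of a single authoritative definition, the sketch below uses an in-memory SQLite database as a stand-in for a data warehouse. The `events` table, its columns and the `daily_active_users` view are hypothetical names, not part of any particular product.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse (an assumption for this sketch).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (source TEXT, user_id TEXT, event_date TEXT)")

# Events from two enterprise systems land in the same table.
warehouse.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("backend", "u1", "2024-01-01"),
        ("backend", "u2", "2024-01-01"),
        ("frontend", "u2", "2024-01-01"),
        ("frontend", "u3", "2024-01-01"),
    ],
)

# The metric is defined once, in one place, over the consolidated data.
warehouse.execute(
    """
    CREATE VIEW daily_active_users AS
    SELECT event_date, COUNT(DISTINCT user_id) AS active_users
    FROM events
    GROUP BY event_date
    """
)

rows = warehouse.execute(
    "SELECT event_date, active_users FROM daily_active_users"
).fetchall()
print(rows)
```

Counted separately, the backend and the frontend would each report two active users for the day. The consolidated view reports three, and it is the only figure anyone needs to quote.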
Then we have “transform”: raw data must be validated, corrected, enriched and joined before it is considered usable by business stakeholders. After processing, this data is partitioned into clean, concise, domain-specific “data marts” where business users need do little more than filter and aggregate their data in order to generate insights.
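The transformation steps above can be sketched in plain Python. Everything here is a toy assumption: the `raw_orders` records, the `customers` lookup, the validation rules and the `sales_mart` name are invented for illustration, and a production pipeline would typically run as SQL or a dedicated transformation tool inside the warehouse.

```python
from collections import defaultdict

# Hypothetical raw data as it might arrive from a source system.
raw_orders = [
    {"order_id": 1, "email": " Jane@Example.com ", "amount": "120.50"},
    {"order_id": 2, "email": "john@example.com", "amount": "80"},
    {"order_id": 3, "email": "", "amount": "15"},                    # missing email
    {"order_id": 4, "email": "jane@example.com", "amount": "n/a"},   # unparseable amount
]

# Hypothetical reference data used for enrichment.
customers = {
    "jane@example.com": {"segment": "enterprise"},
    "john@example.com": {"segment": "smb"},
}


def is_valid(row):
    """Validate: the row needs a non-blank email and a numeric amount."""
    try:
        float(row["amount"])
    except ValueError:
        return False
    return bool(row["email"].strip())


def correct(row):
    """Correct: normalize the email so it can serve as a join key."""
    return {**row, "email": row["email"].strip().lower()}


def enrich_and_join(row):
    """Enrich and join: cast the amount and attach the customer segment."""
    customer = customers.get(row["email"], {"segment": "unknown"})
    return {**row, "amount": float(row["amount"]), "segment": customer["segment"]}


# The "data mart": clean, typed rows for one business domain.
sales_mart = [enrich_and_join(correct(row)) for row in raw_orders if is_valid(row)]

# Business users then only need to filter and aggregate.
revenue_by_segment = defaultdict(float)
for row in sales_mart:
    revenue_by_segment[row["segment"]] += row["amount"]

print(dict(revenue_by_segment))
```

The two invalid rows are dropped here for brevity; in practice they would more likely be routed somewhere for review than discarded silently.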
Finally, we have the goal of a data infrastructure: to effect human and machine learning.
Traditionally, learning from the data is conducted by humans who perform exploratory data analysis (EDA) and statistical modeling to elicit patterns in the data. However, increasingly, machines are able to learn from data, detect their own patterns and make testable predictions. Both human and machine learning enable people to understand how systems work and develop concrete, actionable interventions to improve their performance.