alex petralia

ChatGPT and the newfound accessibility of unstructured data

2024-03-12T00:00:00+00:00

When we think of data, we typically think of structured data. Data which fits cleanly into a data table, neatly organized into rows and columns. Data which is readily amenable to calculating totals, averages and distinct counts. Data which is easy to summarize and, in turn, understand.

We don’t typically think of unstructured data: text, images, videos, PDFs and other proprietary formats, which collectively comprise the bulk of information in the world. In fact, the universe of unstructured data must necessarily be larger than that of structured data: all structured data ultimately derives from unstructured sources.

Real life is unstructured. When customers walk into and out of a retail store, video cameras do not store data points like {“customer_id”: “fcfbd2e1da00573f”, “direction”: “ENTRY”, “timestamp”: “2023-08-04T14:03:27.215”}.

Instead, they record raw sensor data which must be converted into records and fields, rows and columns. To produce the data point above, we must classify a customer, match it to existing customers (or not), classify the direction, and finally record the timestamp. Classifying a customer is itself no trivial task: we need a machine learning model trained on a large volume of historical data that associates image data with “people”, and further, people with historical “customers”.

Even a simpler case of unstructured data - the business email - can be instructive:

Hi Jane,

Our team met last week to discuss the Jan. 24 email campaign, and although the overall performance was good, we had a few lingering questions. Could you provide the underlying campaign data from MailChimp when you get a chance?

Thanks,
Sam

From this unstructured data, we might extract the total number of characters in the email (LEN(email_body)), the number of lines (LEN(STRING_TO_ARRAY(email_body, ‘\n’))), and whether or not the email contains the word “data” (CASE WHEN CONTAINS(‘data’, email_body) THEN TRUE ELSE FALSE END).

These are simple extractions. What if we wanted to know whether the email is_marketing_related or not?

It’s a useful property to know, but not one that is easily parsed from the email text alone. Here, we would again need a machine learning model trained on a large corpus of text to calculate if this email is “similar” enough to a set of pre-classified, marketing-related emails.

Such a machine learning model does not come cheaply. With structured data, we are off to the races, but with unstructured data, each classifier requires investment into a new model.

Absent a clear business case to invest in such a model, most unstructured data lives on the periphery of data analysis, and by extension, comprehension.

LLMs as no-code machine learning classifiers

Since the launch of ChatGPT in 2022, the cost of encoding unstructured data has dramatically fallen.

No longer must we build a bespoke machine learning model to classify whether an email is_marketing_related or not, we simply put the email into the LLM and ask. No longer must we ask if an image contains a dog or not, we simply put it into the LLM and ask. If we want to change the encoding to classify cats instead of dogs, it is as simple as updating one’s prompt.

LLMs have ushered in the era of no-code classifiers. Once the data is encoded, it is immediately consumable in data analysis, statistical analysis and even further machine learning.

The mad dash for unstructured data

As the cost of encoding unstructured data falls, the demand for unstructured data will rise. From where will companies source this unstructured data?

Data, whether structured or unstructured, is typically sourced from three main channels:

It is collected
It is scraped
It is bought (or bartered)

Facebook, for example, collects vast amounts of behavioral data from the users of its platforms. Surveillance systems, such as camera and video tracking systems, collect footage of people who enter within certain premises. Websites collect profile information from users who seek to register to the platform.

In virtually every case of collection, there is a quid pro quo: a service must be provided in return for user consent. If companies want to collect their own unstructured data, they must then provide a valuable service to get it.

Scraping data is what occurs when consent is not expressly obtained. We can, for example, build a crawler to scrape LinkedIn public profiles or, if you’re Google, the entire Internet. However, precisely because the data is necessarily open to the general public, it is often not as valuable as data which is collected privately.

The final route is to buy data or barter for it, for example by offering a bi-directional data integration. Data sets for sale frequently suffer from the same drawback as scraped data: end users typically do not consent to having their data used by anyone except the service providers they directly interface with, and so any remaining data they do consent to share is of low quality. Nevertheless, for data sets which do not contain user information (such as economic or financial data), purchasing data is a common approach for acquiring data.

A data strategy for unstructured data

In the future, companies will put more thought into how they expand their data footprint. Unstructured data will no longer be considered prima facie inaccessible: with an LLM encoder, unstructured data can be cheaply encoded. The challenge will be to procure it.

Companies will be increasingly creative and nimble in how they acquire data. They will build free apps to collect data from users, crawlers to scrape repositories of public text (such as website DOMs), and relationships with data brokers who amass large stores of unstructured data. They will be just as thoughtful in how they acquire unstructured data tomorrow as how they acquire structured data today.

Why doesn’t data engineering feel like software engineering?

2024-02-21T00:00:00+00:00

Summary

The role of data infrastructure is to record history and, eventually, interpret it. The role of software infrastructure is to transform and produce history (data)

Data engineers are fundamentally concerned with how state changes over time. Software engineers are concerned with change at a point in time (input/output)

It is perhaps telling that one of the fastest growing frameworks in data engineering - dbt - counts among its many features the ability to perform data tests, but curiously, not unit tests. Software engineers write unit tests, so why don’t data engineers?

Unit tests evaluate the correctness of our functions. If I pass in a test string to an MD5 hashing function, do I get the correct MD5 hash output? If I pass in a zero to a function which uses an average, do I correctly catch a DivideByZero error, or instead throw an unhandled exception that crashes the program? Unit tests take our inputs and outputs as “given” and assert the correctness of our logic in between.

Data tests, on the other hand, evaluate the correctness of our data. Are we ingesting data with duplicate primary keys? Do we have null data on critical fields? Is today’s sales volume unusually high or low? Is our data stale? Data tests take our logic as “given” and assert the correctness of the input or output data.

When you write code, you are essentially writing functions in a way that resembles pure mathematics. Take some input, apply some logic, produce some output. Take input x, apply function f, and produce output f(x). All software can be thought of as f.

Data, on the other hand, is x and f(x). It is what goes into f and what comes out of f. Data is “state” - the state of the world - that is transformed by some function. When you enter 4 + 3 into an interpreter, you are inputting two pieces of state (the operands 4 and 3) through an addition operator to produce the output 7.

Notice how the addition operator f is ignorant of the world external to its scope. The state of the world - the set of numbers which can be added - lives beyond f. Data lives in some persistent data store, such as a database, file system or physical medium - but not in f.

When you are the engineer writing f, life in some sense is easy. f, after all, is known.

Sure, you might wonder about the algorithmic complexity of f - how it performs in space and time over large amounts of data. You might wonder, given f and other fs, what else can be proved. You might wonder, because f is pure and stateless, how you might parallelize f across many machines. But at the end of the day, f is known, and given some input, we know what the output will be.

In data land, f is not known. x and f(x) are. Consider a file system in the wild. What produced these files? What programs wrote these binaries? The processes which generated this data - or in statistical parlance these “data generating processes” - are not known. We may speculate on what they are, but we cannot know for sure.

A substantial portion of applied statistics revolves around estimating the data generating process f based exclusively on observable data f(x). This contrasts against mathematics where f is defined, and given f, other theorems can be deduced. Statistics is inductive while mathematics is deductive. Data and code are no different.

When business people ask data engineers why a certain number went up or down, it is not the same as asking a software engineer why a certain number went up or down. Inferring the “why” from data alone is fundamentally an inductive process. Data engineers may explain what records changed, but the explanation for why they changed lives beyond the data. It must be hypothesized and inferred, often by way of statistics.

Software, on the other hand, is deductive. If a number outputted by a function is higher than expected, we must merely inspect the inputs and follow the logic of the function through. In software, f is known.

This difference between data and code permeates throughout all aspects of data and software engineering. Data engineers think of the data as “real” and the code as unknown. Data is history, and history can never be deleted, despite how unmanageable it can become over time. Software engineers think of the code as “real” - stateless applications which can be spun up or down at will - with comparatively less regard for the “state” that will be recorded for all of history.

Of course there is overlap between the two. Data engineers transform data, and software engineers manage state, but the longevity and granularity of state managed is typically what differentiates them.

For data engineers, everything becomes history, and history never goes away. For software engineers, state may be externalized to a config.yml, which itself may be versioned over time, but only for as long as prior versions are supported - generally not all time.

This explains why data practitioners and software engineers are concerned at a technical level about fundamentally different things.

Data engineers think about data volume, data durability, relational modeling, query access patterns, metadata, schema changes, bitemporal modeling, aggregation, and full refreshing data. Data analysts and statisticians use this data to infer patterns; namely, the data generating processes that produced such data. It is in this way that we learn from history.

Software engineers think less about the output data which lives forever, and instead about the input data which goes into a particular function or application. They think about code organization (e.g. OOP, public interfaces), static typing, serialization, containerization, parallel processing, algorithm development and algorithmic complexity. For software engineers, outputs over time are not as important as outputs at any given time.

The dividing line between software engineers and data engineers can then be defined, however lightly, by how the two view history. Software engineers are concerned with how history is made, and data engineers with how it is recorded and understood.

The Presentation Layer: Extracting insights from the data (Part 10)

2023-05-22T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

After we have stored, modeled and governed our data, we must finally make sense of it. This occurs in what is traditionally called the “data presentation layer”. Here, we present data to our users, and they in turn modify its representation in order to understand it.

The presentation layer serves three essential functions, each of which is enabled by the business intelligence infrastructure:

Reporting: We provide a bird’s-eye view of our operations by charting key performance metrics
Analysis: We decompose those metrics across various dimensions to identify areas of over- and underperformance
Inspection: We inspect individual data points to understand what general characteristics drive the behavior we observe in aggregate

Baseline reporting comes in the form of simple, zero- or one-dimensional metrics, such as “total year-to-date sales” or “total year-to-date sales by territory”. It is not designed to provide deep insight into the drivers of particular metrics, but rather a general intuition around levels (“are total sales in the millions or billions?”), ranges (“most sales are in the tens of thousands range”), and relative shares (“Northwest sales are only about 10% of our total sales”).

// zero-dimensional
total_sales

// one-dimensional
month (PK) | total_sales
2023-01-01 | 1,000
2023-02-01 | 2,000
2023-03-01 | 3,000

Analysis, on the other hand, pivots operational metrics along two or more dimensions, such as “regional sales per month” or “regional sales per month per salesperson”. If, for example, sales grew substantially in recent months, we can evaluate whether this growth came predominantly from a single region or whether it was distributed evenly across multiple regions. Further, when adding a third dimension, we can determine who the top-selling salespeople were within the top-selling regions.

// two-dimensional
month, region | total_sales
2023-01-01, EAST | 1,000
2023-02-01, WEST | 2,000
2023-03-01, NORTHWEST | 3,000

// three-dimensional
month, region, salesperson | total_sales
2023-01-01, EAST, A | 1,000
2023-02-01, WEST, B | 2,000
2023-03-01, NORTHWEST, C | 3,000

Unlike the reporting of key metrics across a single dimension, the analysis of key metrics across two or more dimensions can rapidly complicate one’s ability to extract insights from the data.

First, data insights by their very nature are high-level, universal and parsimonious. Multi-dimensional analysis, on the other hand, is verbose and nuanced. It is one thing to say “sales have grown in recent months.” It is quite another to say “sales have grown most in recent months from clients A and C in our Northeast region, and from clients B and F in our Midwest region.” In the former, there is a clear pattern; in the latter, there is not.

Second, there are typically many ways of pivoting a metric, and no single view will accommodate all possible ways. For example, consider a metric of total_sales which can be analyzed across the six dimensions of region, client, salesperson, date_year, date_quarter, date_month. One-dimensional pivots of the data - “sales by region”, “sales by client”, “sales by salesperson” and so on - yield us six views.

Now consider the business user wants to perform a more granular, two-dimensional analysis. “Total sales by client per year” is one way of viewing the data, but so is “total sales by client per salesperson”, “total sales by salesperson by client”, “total sales by client per quarter”, and so on. In total, there are 30 ways of permutating six choices along two dimensions of analysis. Not all of these can be shown at once.

For these reasons, analytical dashboards are typically separated from reporting dashboards. They contain the same data, but their understanding is more nuanced, their operation more hands-on, and their audience more research-oriented.

Finally, we have inspection. After users drill-down into the drivers of particular trends - “most sales this quarter came from our Northeast region” - they will want to see individual data points. They do this to (a) validate results and (b) build intuition around the data.

It is not at all uncommon for metrics reported in aggregate to be based on incorrect calculations[0]. Calculations can use the wrong data definitions, wrong data sources or wrong data transformations. Data correctness is hard to come by.

Business users, who observe day-to-day operations and develop an intuition for how the data ought to look, are often the first to say “that number seems wrong.” They will want to inspect the actual data points composing any particular calculation.

Upon seeing individual data points, these users will routinely say “these data points should not be in our data set” or “these data points are misclassified” or “we are missing some data here.” In other words, they will validate high-level reporting metrics using the low-level data points.

Sometimes, they will say: “That’s interesting, I didn’t expect to see that.” Or: “Most of these data points came from our one big client. I didn’t expect them to have such a large effect.” Here, business users will use high-level trends to shape their low-level intuition about the data.

Most teams want observability before they want analysis or inspection. While data analysts are often eager to shoehorn as many metrics and dimensions onto a dashboard as will fit, this is not in fact what business users want. They want simple, high-level takeaways about their operations from the data. Only after grasping this macroscopic view will they delve into the details via analysis and inspection.

The presentation layer is conventionally understood to be the end state of a data infrastructure because it is what is tangibly delivered from a data team to clients. In reality however, clients do not ultimately want dashboards. They want outcomes. How data analysis can be used to discover, deliver and measure such outcomes will be the subject of the next chapter.

For this chapter however, we’ll continue to focus on the infrastructural aspect of the presentation layer. In particular, how to choose a business intelligence tool and how to manage it to facilitate reporting, analysis and inspection.

Choosing a business intelligence vendor

There is a vast array of business intelligence providers - Tableau, PowerBI, Qlik, Looker, Sisense, Metabase, Preset (among others) - and all of them are sufficient to perform basic data analysis and visualization. As in the data warehouse and data transit layers, there exists feature parity in the business intelligence space: when one vendor adds a feature, the others follow.

The choice of business intelligence vendor largely depends on who is making the decision, what tools they have historical experience with, what the organization is willing to pay, and finally what features they need.

For example, if an organization needs several users to work on the same dashboard at the same time, they may opt not to use Tableau. If they need users to analyze CSV files from their local desktops, they may rule out Looker. And if the CTO has considerable experience with PowerBI, then the organization will likely use PowerBI. As all providers offer similar basic capabilities and user interfaces, the essential workflows can be quickly learned in each.

Nevertheless, organizations sometimes have more bespoke requirements, and business intelligence vendors vary in their provision or usability of corresponding features. It is therefore useful to compare vendors along a checklist of potential requirements so that an organization does not find itself one day needing a feature that the vendor cannot provide. Such features include:

Development

Concurrent dashboard editing
Code-based calculations and visualizations (such as LookML), which can be version controlled
Composability, where “modules” can be reused across various dashboards
Integrations to various databases, Excel and CSV files (although most data should live within the data warehouse)
Embedded analytics, where dashboards can be embedded into other applications (often using elements)

Design

Out-of-the-box visualizations (e.g. bar graphs, geospatial maps, funnel charts, Sankey diagrams)
Advanced formatting of text and visualizations (e.g. font styles, conditional formatting, secondary axes)
Code-based custom formatting (e.g. using CSS and JavaScript)

Performance

Data caching via extracts to improve dashboard performance

Automation

Scheduled email reporting and real-time alerting (e.g. into Slack or email)
API access (e.g. to export a series of reports or manage backups)

Governance

Advanced security management, including RBAC, SSO integration, SCIM, and row-level security
Usage and performance analytics (e.g. on dashboards or data sources)

Structuring a business intelligence deployment

After deciding on your business intelligence vendor, you will build reporting, analytical and inspection dashboards.

Reporting dashboards present a suite of metrics to business stakeholders under a particular theme, such as sales performance or marketing performance. They are passively consumed, informational, and understood in isolation. For example, a marketing team will use a reporting dashboard to display open rates, click rates and conversion rates of their various marketing campaigns over time.

To the extent that reporting dashboards reveal unexpected increases or decreases in the data (i.e. variability), users will typically demand analysis. They will want metrics analyzed along various dimensions to see what factors in particular drove high-level trends. This process occurs within analytical dashboards.

Finally, users will want to see individual data points. These inspection views should exist separate to reporting and analytical views. Underlying data is displayed in tabular format with as many columns as possible to add context, as well as filters to isolate relevant sets of data.

Reporting dashboards

Reporting dashboards succeed in their objective if they quickly and concisely answer the questions of their users. If a marketer wants to know whether subscriptions are up year-over-year, then they only want a single number: “Subscriptions are up 42%”.

If they further inquire into whether this was a transient or persistent trend, we would provide data from multiple periods: “Subscriptions are up 42% this year, compared to only 18% last year and 16% the year prior.” These statistics could be displayed using either a table or a visualization.

The reporting dashboard is designed around the goals of a business team, such as increasing sales or growing marketing engagement. These goals are, for the most part, well-defined and static. The attendant questions, such as “how are sales recently” or “how is marketing engagement recently”, are too. Reporting dashboards should provide these answers front and center.

Anything which obscures these answers - such as missing titles, confusing data labeling, haphazard visual organization or slow dashboard performance - rapidly diminishes the utility of these dashboards.

Good reporting dashboards make takeaways from the data obvious, with little to no work required by the user to understand what they need. If you cannot copy-and-paste a chart from a reporting dashboard into a PowerPoint deck, then it does not belong in the dashboard.

General principles for building reporting dashboards relate to[1]:

Visual hierarchy: Placing the most high-level and commonly asked statistics at the top of the dashboard
Concision: The visualization includes only the pattern to be observed (e.g. a line trending up), and nothing more
Economy: Details which do not immediately improve understanding are removed (i.e. Edward Tufte’s “data-ink ratio”)
Labeling: All titles and labels are written in plain English (and eschew technical or business jargon)
Performance: Loading times are kept under 5 seconds

Finally, after the above requirements are satisfied, other features can be added to reporting dashboards, such as print-to-PDF, daily scheduled emails, and threshold-based alerting.

Analytical dashboards

Analytical dashboards succeed in their objective if they produce novel insights from the data. Unlike reporting dashboards, there is no view of the data users routinely want to see, as such views would necessarily not be novel. Instead, business questions are not well-defined and investigations are more exploratory in nature. Analytical dashboards are therefore designed for exploratory data analysis (EDA).

Insights are novel to the extent they surface unexpected variability within the data. For example, email conversion rates may be higher among drip campaign emails compared to promotional emails, and it may be further the case that conversion rates are highest among drip campaigns placing the call-to-action at the top of the email rather than the bottom.

If email conversion rates were flat between drip campaigns and promotional emails, and as well as between emails with calls-to-action at the top versus at the bottom, then there would be no data insights of interest. As though placing weights on a balance, variability tells us which side of the data we should focus our efforts on.

It is not obvious in advance where to necessarily look for patterns. If you pivot the data by has_image or has_emoji instead of call_to_action_location, you might find that conversion rates are flat across those dimensions. It is only with respect to call_to_action_location that conversion rates differ.

As a result, analytical dashboards should make it easy to pivot by various dimensions. This can be achieved in the Editor view of the business intelligence tool, or by exposing parameters which allow the user to choose what dimensions to pivot the measure by. Additionally, analytical dashboards should generously include filters (often as many as the number of dimensions available) and enable data inspections of individual data points.

You’ll notice in the examples above that dimensions are always discrete: has_image, has_emoji, call_to_action_location. In fact, in data modeling more broadly, dimensions are discrete and facts (conversion_rate) are continuous. Traditionally, we perform analysis by pivoting a continuous measure, such as the conversion rate, by discrete dimensions.

If we have two continuous variables, such as number_of_images versus conversion_rate, then we typically discretize one of them by way of bins. The continuous variable number_of_images becomes the discrete variable number_of_images_bin with levels of [0, 1), [1, 3) and [3, 100). Then, as usual, we pivot the continuous conversion_rate by these discrete categories to determine whether it materially differs between them.

In actuality, conversion_rate is also discretized: after all, we must apply some aggregate function to collapse the individual data points, such as AVG(), to arrive at an average conversion rate in the number_of_images_bin = [1, 3). This is indeed the conventional analytical methodology in dashboard-driven exploratory data analysis: group-by a dimension, aggregate a measure.

While useful, this is only one particular style of analysis - one that I like to call “dispersionless”. When we partition the number_of_images variable into bins, we lose information on how many emails have number_of_images = 1 versus how many have number_of_images = 2. Conversion rates may very well differ between these two values, but we would never be able to see this difference using the bin of number_of_images_bin = [1, 3).

Additionally, the AVG(conversion_rate) also collapses the underlying distribution of conversion rates. If we sent just two email campaigns in the number_of_images_bin = [3, 100), one with a conversion rate of 90% and the other with a conversion rate of 10%, then our average conversion rate is 50%. However, we do not actually expect an average conversion rate of 50%, but rather a rate of “too little data to know”. Dispersion therefore qualifies our point estimates with corresponding uncertainty.

Charting continuous variables against continuous variables (or alternatively, facts against facts) typically sits within the realm of scatter plots and regression models. While business intelligence tools can produce these, they typically fall short of the outputs generated by full-fledged programming languages such as Python and R. Business intelligence tools are not statistical tools, and as such they do not produce statistical summaries or statistical graphs (e.g. CDFs and residual plots), nor do they work well with large volumes of unaggregated data.

Analytical dashboards are therefore useful for the conventional, “guess-and-check” analysis of pivoting measures by various dimensions in the hope that some of these dimensions exhibit unexpected variability. Oftentimes however, they can leave the analyst overwhelmed by the number of dimensions to choose from and underwhelmed by the amount of variability observed. Data-driven patterns can be hard to come by.

There is indeed a fast-track to identifying patterns which exhibit the most variability - that is, the most promising data insights - though this cannot be achieved using business intelligence tools. For that, we will need to jump into our Jupyter notebooks and perform some last-mile analytics. That is the subject of our next chapter.

What about metric stores?

Business intelligence tools today couple two separate but related analytical functions: the definition of metrics, as well as the visualization of those metrics. You might, for example, create an average revenue per active user metric in Tableau by filtering for all active users, then taking the average across users. You could further group by year if you wanted to compare how this metric changed over time, and visualize these changes as a bar chart.

However, once defined within Tableau, this average revenue per active user metric is not easily replicated to other applications such as Jupyter notebooks, Excel workbooks, webapps, other business intelligence tools, or enterprise data systems like Hubspot and Salesforce.

If a Jupyter user wanted to use the average revenue per active user metric, they would have to recreate the underlying SQL, then reconcile its results to those in Tableau. They could not simply reference the figures produced by Tableau directly. Metrics, when created within the presentation layer, essentially become “locked in” to the tool.

Most commonly, when replicating a metric to another system, users imitate the SQL but do not perform the reconciliation. Without the reconciliation however, inconsistencies invariably arise and consume hours of analyst time to resolve. The desire to sidestep this issue entirely - to reference metrics instead of duplicating them - led to the emergence of “metric stores”.

Metric stores decouple the definition of metrics from their visualization. Metrics are defined exclusively in the metric store and referenced by the business intelligence tool. Under this model, business intelligence becomes just another consumer of metrics, like Jupyter or Excel or an internal webapp.

Metric stores are sometimes called “headless BI” because they encode the logic of metrics without their attendant, end-user presentation (“the head”). They provide CLIs and APIs, but not GUIs. Because metric stores extricate the code logic from the visual display, metrics can be more easily stored in a version control system, documented, code reviewed, tested and deployed.

The semantics of modern-day metric stores originated in 2012 with Looker’s LookML, which pioneered the definition of metrics in simple, human-readable configuration files:

view: encounter {
  sql_table_name: lookerdata.healthcare_demo_live.encounter ;;
  dimension: status {
    label: Status"
    type: string
    sql: ${TABLE}.status ;;
  }

  dimension: code_name {
    type: string
    sql: case when ${code} = 'IMP' then 'Inpatient'
    when ${code} = 'AMB' then 'Ambulatory'
    when ${code} = 'EMER' then 'Emergency Department' end;;
  }

  dimension_group: period__end {
    label: "Discharge"
    type: time
    timeframes: [
      date, week, month, year, day_of_week, time, time_of_day, hour_of_day, raw
    ]
    sql:  ${period}.end ;;
  }

  measure: count_patients {
    label: "Number of Patients"
    type: count_distinct
    sql: ${patient_id} ;;
    drill_fields: [patient_id, patient.gender, patient.age, patient.los]
  }

  measure: med_los {
    group_label: "LOS Statistics"
    label: "Median Length of Stay"
    type: median
    sql: ${length_of_stay} ;;
    value_format_name: decimal_2
  }

  measure: repeat_patients {
    label: "Percent of Repeat Patients"
    type: number
    value_format_name: percent_2
    sql: 1.0*(${count}-${count_patients})/nullif(${count},0) ;;
  }

Looker’s insight was to equip otherwise nondescript SQL tables (or dbt data models) with the analytics-friendly handlebars of dimensions and measures. These handlebars could be variously combined to form “metrics” - a particular, summary view of the data - such as::

  - title: 'Error Type 2: Data Entry Error'
    name: 'Error Type 2: Data Entry Error'
    model: healthcare
    explore: observation_vitals
    type: looker_bar
    fields: [observation_vitals.type, observation_vitals.count_anomalies]
    pivots: [observation_vitals.type]
    filters:
      observation_vitals.issued_hour: 48 hours
      observation_vitals.absolute_standard_deviation: "<15"
    limit: 500

Above, Error Type 2: Data Entry Error represents a final, polished metric ready to be consumed by business stakeholders. In English, it translates to: “the number of Type 2 data anomalies per observational vital type in the past 48 hours, excluding outliers”. If another application wanted to reference this metric, it could simply query this Look (a reference to a Looker visualization) via API.

Although Looker invented the decoupled business intelligence layer, it had two drawbacks that prevented mass adoption. The first was the price: a starter package will run in the several tens of thousands of dollars, putting it out of reach for most individuals and early-stage startups. The second was that LookML was not in fact a universal protocol: it only worked with Looker. It was not designed to work with Jupyter, internal webapps, or other BI clients like Tableau (despite what 2-year old press releases may claim).

Over the past few years, open-source alternatives have emerged to make metric stores more broadly accessible. The most prominent of these are Cube.js, Transform’s MetricFlow (recently acquired by dbt Labs) and Google’s Malloy (still experimental). Each prescribes similar semantics to Looker, such as Malloy below:

source: airports is table('malloy-data.faa.airports') {
  measure: airport_count is count()
  measure: avg_elevation is elevation.avg()
  query: top_5_states is {
    group_by: state
    aggregate: airport_count
    limit: 5
  }

  query: by_facility_type is {
    group_by: fac_type
    aggregate: airport_count
  }
}

To request the number of airports by facility type, a client would run query: airports -> by_facility_type.

Is a metric store necessary?

One might wonder why dbt models, which also structure data using dimensions and measures, do not suffice for defining metrics. If metrics should not be defined within the presentation layer, perhaps they can live within the warehouse layer instead?

This is in fact how many data teams operate today. Whether it is to create reusable data marts or to craft bespoke team metrics, as much business logic as possible is encoded within dbt. The goal is to make the warehouse layer as intelligent as possible, and the presentation layer as naive as possible.

The disadvantage of this approach is that metrics are situated alongside data marts, living within the same codebase and the same database. Metrics, however, are not data marts.

Data marts are modular, reusable, materialized data sets (or “OLAP cubes”) which can be used to craft many metrics. Metrics, on the other hand, are particular slices and aggregations of this underlying OLAP cube. A single dashboard can produce various metrics all sourcing from the same OLAP cube.

If we define metrics using dbt, we would need a great many dbt models: one for num_anomalies_by_observational_vital_type_past_48h_excl_outliers, as well as one for num_anomalies_by_practictioner_name_past_48h_excl_outliers, and so on. Clearly, this approach is untenable at scale, as it would crowd out the reusable, modular OLAP cubes with narrowly defined, bespoke metrics.

Metric stores are downstream of OLAP cubes but upstream of BI clients. They allow business logic to be pulled out of the presentation layer, yet not be haphazardly stuffed into the warehouse layer. They facilitate the broad reusability of metrics while at the same time preserving their separation from dbt’s reusable, modular data sets.

[0] This is known as Twyman’s law.

[1] Metabase also provides useful documentation on how to design business intelligence dashboards.

(previous)(next)

The Semantic Layer: Know Your Data, or KYD (Part 9)

2023-05-07T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

As covered in the introduction, a data platform conveys information about the business which can be subsequently used to understand and improve operating performance. Importantly, a successful data platform does not merely produce data. It produces knowledge.

While business users frequently have no issue extracting data from the data platform, knowledge is harder to come by:

They do not know where to find particular data, metrics or dashboards
They do not understand how particular metrics are defined
They do not agree on how particular metrics are defined
They do not have confidence in the correctness of particular metrics

These pain points can be grouped into two themes:

Data discovery; and
Data correctness

Business users want to know where to look for answers to their questions, and when they find them, they want them to be correct.

In absence of a self-service tool, business users will ask the core data platform team directly to answer their questions (“Where can I find information on user activity? Is this figure correct?”). Any self-service tool which exists in the semantic layer must as a result satisfy these needs at least as well as the core data platform team itself.

Data discovery

When business users have questions, they need to know where to find answers. As the last two decades have shown, the definitive solution to this problem has been a search engine: “Just Google it.”

Google’s famously spartan home page illustrates what users today demand in search. They do not want filters or categories or sort options to choose from. Instead, they want a simple “linguistic interface” - the search box - to understand their question and return the most relevant results. How exactly Google does that is completely opaque to the user.

Behind the scenes, of course, Google injects considerable structure into the vast sea of content floating within the Internet (a process called “indexing”). Like pairing a wine with a cheese, this is the mechanism by which Google associates any set of search results with a given search query.

In the data landscape today, it is “data catalog” or “data portal” tools which offer such search capabilities. Like Google, they expose a search box and produce results such as:

Data sources
Transformation logic
Metrics
Dashboards (or “metric collections”)

It is important to note that, for the most part, business users do not ultimately want any of these search results. Instead, they want answers. If they could harness a ChatGPT-like linguistic interface to answer their questions, then they would never need to view individual metrics or data sources. They only examine these when immediate answers to their questions are unavailable.

And how does Google choose which results to return, and in what order? In an ideal world, it apprehends your search query perfectly. then returns results which are (1) most relevant and (2) most correct. Results which are irrelevant or incorrect do not constitute high-quality answers to queries.

While relevance depends on successful semantic parsing, “correctness” constitutes an entirely different problem which fundamentally depends on consensus.

Data correctness

If the calculations returned by the data catalog are incorrect, then they are effectively useless. As an example, if a user searches for “monthly profit and loss (P&L) on our new widgets product line” and the data returned is an order of magnitude off the correct figures, then the answer is as good, or worse, than no answer at all.

How would we know this data is correct?

First, the metric must be defined correctly. If the metric defines P&L as “net revenues less operating expenses”, but we define it as “net revenues less operating expenses and capital expenditures”, then the metric will be wrong for our needs, even if it is calculated correctly.

Second, the metric must be calculated correctly. If the metric defines P&L as “net revenues less operating expenses”, but the actual calculation uses gross revenues instead of net revenues, then the metric will be incorrect, even if it is defined correctly.

Data definitions

Before business users are able to achieve visibility into their operations, they must first specify what they want to see. They must define metrics which are relevant to their operations, such as “average revenue per customer” or “customer lifetime value” or “90th percentile latencies”, and then subsequently calculate them.

These definitions typically live in a “business glossary.” Collectively, they minimize ambiguity over what business terms can possibly mean. A “daily active user” may be defined for example as “any non-personnel, non-test user who logs into the application on a given day based on UTC timestamps”.

At first glance, this definition seems fairly palatable, although even it does not completely escape scrutiny. Business users headquartered in New York may question why “days” are calculated using the more systems-relevant UTC timestamps instead of the more business-relevant EST timestamps.

Indeed, when teams cannot agree on how particular metrics should be defined, they must inevitably create new definitions which satisfy their own needs (such as daily active user, UTC and daily active user, EST). Teams which continue to use shared business terms (daily active user) whose definitions do not satisfy their own individual needs will continually question the correctness of the data.

Because metrics are defined and tailored to the needs of individual users and teams (that is, people), it is important to annotate for whom a given metric is relevant. If one dashboard illustrating system availability metrics is used by a junior systems engineer while another is used by the CTO, then the latter will appear to be more “authoritative” and “credible” than the former.

As we learned in the philosophy of data, data correctness reflects the degree to which independent opinions converge on the same answer. If there is firm-wide consensus on which dashboards and which metrics are “correct”, then any metrics created by individual teams must first and foremost reconcile with the firm-wide metrics in order to be correct.

Due to the primacy of consensus in assuring data quality, data catalog tools must always enrich metric data with stakeholder metadata. Airbnb’s Dataportal provides an illustrative example of how stakeholder information, such as (a) usage popularity, (b) discussion boards, and (c) upvotes or approvals on metrics can all be used to fortify consensus.

Data calculations

After business users specify and define the metrics which are relevant to their operations, they must calculate them. Here, the technical implementation of the calculation should correspond exactly to the definition found in the business glossary.

As these calculations are often performed using SQL, this effectively represents an exercise in “English-to-SQL” translation. The specifications and constraints of the business definition must be precisely encoded into SQL.

Taking the example above of calculating daily active users, EST, one might use the following SQL:

WITH system_logs AS (
    SELECT DISTINCT 
        DATETRUNC(DAY, server_timestamp::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM api_logs

    UNION ALL

    SELECT DISTINCT 
        DATETRUNC(DAY, created_at_est::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM click_logs

    UNION ALL

    SELECT DISTINCT
        DATETRUNC(DAY, connector_synced_at::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM mobile_logs
)

SELECT 
    s.log_timestamp_est AS login_date_est, 
    COUNTD(s.user_id) AS num_users
FROM system_logs AS s
    LEFT JOIN users AS u ON s.user_id = u.id
WHERE TRUE
    AND NOT u.is_test
    AND u.employee_id IS NULL
GROUP BY 1

Despite its apparent simplicity, there are many places this calculation can diverge from its lexical definition, including:

The sources chosen (api_logs, click_logs, mobile_logs)
The fields chosen (mobile_logs.connector_synced_at vs. mobile_logs.insertion_timestamp)
The transformations applied (e.g. filters, group-by keys and aggregate functions)
The underlying data itself (e.g. mobile_logs failed to populate yesterday’s data)

Due to these potential sources of difference, business users typically want end-to-end visibility into how a given metric was calculated - that is, its data lineage. If and only if all of these inputs are “correct” according to the business user, then the calculation as a whole is said to be correct.

Automated testing can be used to validate both the integrity of the logic and the integrity of the data. Unit tests, for example, test that the logic applied to the data is correct. More specifically, this necessitates writing tests to ensure the sources, fields and transformations do not undergo a “regression” in the form of some future code change (e.g. a developer inadvertently swapping which fields are used).

Data tests, on the other hand, assume the logic is correct and instead test the data itself. This can include validating the recency of the data, detecting outliers, identifying missing or unexpected values, and reconciling row counts between tables. Many data tests (unlike unit tests) come built-in to dbt, and more are available from packages such as dbt-utils and dbt-expectations.

The semantic layer today

Self-service tools in the semantic layer today have not yet matured to the same degree as those in the data integrations, warehouse or presentation layers; however, they represent an area of active investment and growth.

Popular data governance providers today include Collibra, OvalEdge, Atlan, Acryl, Sifflet and Select Star. All of these provide the basic capabilities of enabling data discovery and data validation.

Many companies continue to use homegrown solutions to administer their data governance (or avoid data governance entirely), although with the growing complexity of data infrastructures, this is increasingly inadvisable. Documentation, metadata and manual testing should move out of knowledge bases and shared documents and into dedicated data governance tools.

(previous)(next)

The Warehouse Layer: Database administration in a modern data warehouse (Part 8)

2023-04-26T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

We’ve now covered the history of data warehouses, as well as how they are architected in dbt to refine raw data materials into finished data products.

Of course, there’s considerable work in managing a data warehouse beyond what strictly lives in the codebase. This work, typically within the ambit of traditional database administration (DBA) and DevOps, spans four major focus areas:

Managing costs
Administering role-based access control (RBAC)
Experimenting with new features
Improving the developer experience (DX)

Managing costs

Unless you are running your data platform on a single, on-prem database or a distributed data lake infrastructure, you are most likely relying on a cloud service provider (CSP) to manage your data platform for you. As of 2023, your options are:

Google’s BigQuery
Amazon’s Redshift
Databricks’ Data Lakehouse
Snowflake

Whether you manage your own infrastructure on-prem or have a CSP do it for you, there will be costs. These costs can be (1) explicit, such as the direct financial costs incurred for using the service; and (2) implicit, such as the engineering effort required to master a tool, vendor lock-in, and an impoverished feature set.

Because implicit costs are more difficult to measure, although no less important, than explicit costs, they are out of scope for this analysis. Here we’ll review the explicit financial costs assessed by the service provider, as well as how to manage them.

First, most vendors break out costs between the storage of data, which is negligible, and the compute upon data, which is more expensive. Compute is charged on a volume basis (e.g. per terabyte) or temporal basis (e.g. per second).

Second, comparing costs between vendors is not straightforward. Google charges by the terabyte, Snowflake by the “compute credit”, Databricks by the Databricks Unit (DBU) and Redshift by the Redshift Pricing Unit (RPU).

Units of compute are generally measured in multiples of the smallest compute cluster. For example, an “extra-small” (XS) compute node on Snowflake (which, behind the scenes, is an AWS c5d.2xlarge VPS or equivalent) costs $2.00 to run for an entire hour[0]. By extension, a “small” (S) compute cluster has twice the resources (two c5d.2xlarge instances), twice the speed, and twice the cost ($4.00 for the hour).

A unit of compute on Snowflake, however, is not necessarily equal to that on BigQuery or Redshift or Databricks. Each provider implements its query engine differently. A unit of compute on Redshift may exhibit relatively slow performance for one query (compared to other providers), but relatively fast performance for another query.

Naturally, every provider when advertising its query engine chooses the queries which highlight its engine’s strengths and downplay its weaknesses. Most data practitioners generally agree however that BigQuery and Snowflake offer the best performance per unit cost (though this can vary depending on your organization’s needs).

Regardless of the CSP, the ability to estimate and manage costs has squarely fallen under the purview of modern data engineers. Data engineers are uniquely capable of understanding the strengths and weaknesses of various query engines, forecasting an organization’s analytical and performance needs, and finally marrying these with various pricing schemes to develop pro forma cost estimates along monthly and annual time frames. As more conventional data engineering work is pushed to the cloud, data engineers are increasingly tasked with the less conventional work of managing those CSPs.

Cloud offerings, by their very nature, aim to provide “infinite scaling”, meaning any workload and any data volume can be handled by the infrastructure. With infinite scaling, however, comes infinite costs. In practice then, most organizations do not want infinite scaling. They want reasonable scaling and reasonable performance at a reasonable cost. It is the job of the data engineer to provide this.

In an ideal world, the cost of computation is tightly coupled to its associated benefit. If a dashboard costs $400 to update on a monthly basis, does it provide at least $400 of value? In reality, dashboards and data sets are requested in the short-term with little visibility into costs in the long-term, leading to the frequent condition of “runaway costs” when developing infrastructure in the cloud.

Data engineers should therefore exercise fiscal discipline when building out their infrastructure: that is, the intuition and tooling to know whether certain queries are justified by their associated costs.

The first step is to make analytical costs and benefits legible by:

Tagging table refreshes, dashboard loads and ad hoc analytical queries (e.g. by using query tags)
Building dashboards which visualize overall costs, along with which queries incur the greatest cost
Setting alerts when certain cost thresholds are met
Enriching cost dashboards with end-user engagement of analytical dashboards to quantify benefit

The second step is to optimize and reduce those costs by:

Specifying account-wide spending budgets
Reducing job frequencies (if prompt data is not required)
Profiling and refactoring slow queries
Indexing the data using sort keys, cluster keys and/or partition keys

Administering RBAC

Invariably, as the data warehouse collects more data from across the entire firm, it becomes the case that not everyone (including engineers) should be able to view all parts of it. Financial data, customer data or personnel data often require some form of access control.

The most common methodology for managing permissions is called role-based access control (RBAC). Users are granted one or more roles, and roles are associated with a set of privileges. Roles can be nested under other roles, where the superordinate role inherits all privileges of the subordinate role, thereby forming a role hierarchy.

For example, to manage access of sensitive HR data, an engineer may set up three roles: HR_VIEWER, which permits viewing of HR data, HR_EDITOR, which permits updating of data, and HR_ADMIN, which grants overall resource management over anything related to HR data. HR_ADMIN can inherit from HR_EDITOR, which in turn inherits from HR_VIEWER. A user requiring access to HR data will be granted the appropriate role for their given job functions.

Ideally, it is not the data engineer who continually manages which users should have which roles. This can lead to issues where, for example, an employee switching out of the customer service team never has their CUSTOMER_VIEWER role revoked, and therefore is still able to view customer data despite not being permitted to.

Instead, user and group information should be managed by the information security team in a centralized identity management tool (IDM), such as Okta or Britive. When an employee leaves the company or switches teams, the change is encoded exclusively within the IDM tool, which thereafter propagates changes in access control to all other systems via SCIM.

Experimenting with new features

Modern cloud offerings continually expand their feature sets, and it is the responsibility of the data engineer to investigate whether new features can improve developer workflows. This entails attending feature demos by vendors, reading white papers and marketing collateral, testing out the new features, and internally training on feature usage across the team.

Because the data warehouse market is competitive, there is generally feature parity across the vendors: if Snowflake implements a new and widely used feature (such as Data Sharing), then soon enough the others will too. Examples of these features include:

Time Travel: the ability to resurface results or tables up to several days (or weeks) prior
Data Sharing: the ability to share data sets between organizations without any ETL required
Query History: the ability to recover SQL queries related to ad hoc analyses performed months or years ago
User Defined Functions: the ability to run custom JavaScript or Python functions on SQL data
Hybrid Tables: the ability to use the same data store for both OLTP and OLAP data

Improving the developer experience (DX)

Finally, data engineers are tasked with continually investing in the infrastructure in order to improve (their own) developer experience. Anything which reduces the amount of developer time spent on debugging errors, repeating routine actions, or researching how to do things means more time spent delivering data products. This includes:

Automatically provisioning developer environments (clones) to enable isolating testing
Implementing a linting tool (e.g. sqfluff) to standardize SQL code
Using merge request checklists to ensure code review methodologies are followed
Improving semantics and data discovery to quickly answer questions or track down errors (e.g. using dbt docs)
Integrating with various systems to reduce manual work (e.g. SCIM to avoid manually performing RBAC changes)
Documenting standards (e.g. release guides, support guides, style guides, testing guides)

These practices represent “guardrails” to development which enforce high standards upon code quality and reduce the likelihood of errors (e.g. a former employee who still has access to the data warehouse). They enable the core data infrastructure team to serve as, in the words of Maxime Beauchemin, “centers of excellence”, setting the standard for anyone who contributes to the codebase.

[0] For reference, the actual rental cost of the VPS from Snowflake ranges from $0.24 to $0.38, thereby yielding them a compute markup of 6x to 10x.

(previous)(next)

The Warehouse Layer: Transforming data with dbt (Part 7)

2023-04-17T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

Over the past decade, the advent of the distributed data warehouse dramatically simplified how data is stored, managed and analyzed. No longer did engineers need to manage physical servers or configure Hadoop clusters; instead, they could operate exclusively within the data warehouse using “just SQL.”

This SQL constituted the “code” behind the data warehouse and, like all code, it needed to be organized. In addition, the data itself it needed to be organized. These were respectively called “software architecture” and “data architecture.”

It is worth clarifying why software and data are treated differently. Software lives in a codebase and is applied to data[0]. Data, on the other hand, lives in a database. Software consists of “stateless” operations (functions or algorithms) which, like moving pieces on a chessboard, transition data from one “state” to the next.

Because software concerns the logic upon which data transitions between states, it is organized “imperatively” as a sequence of steps, or a pipeline. Conversely, data itself is laid out along a flat chessboard where everything can be viewed, related and operated upon from above.

For the purposes of this essay, we’ll review both the software and data architecture of a codebase using dbt, the foremost orchestration tool for SQL-based data pipelines.

Software and data architecture

Software architecture refers to how code is organized within the codebase while data architecture refers to how data is organized within the database. While software architecture concerns the hierarchical and modular nature of code, data architecture concerns the relational model of data.

For example, defining the behavior of a payments system, such as making a transaction, performing a refund or querying an account balance, would constitute software architecture. Modeling each discrete component - a transaction, a refund, a balance - and their respective relationships would instead comprise data architecture. This is conventionally taught as the difference between “verbs” (actions) and “nouns” (things).

In a dbt-managed data pipeline, software architecture is the sequence of SQL code responsible for transititioning data from one state to the next. Data architecture, on the other hand, describes the structure of that data at any particular state in time.

As data moves through the pipeline, it is progressively transformed from normalized to denormalized, from system-designed to human-designed and, most importantly, from raw to business-ready.

Although dbt recommends three stages of the data pipeline (staging, intermediate, marts), we’ll explore a slightly different version in this essay:

Sources
Staging
Core (facts and dimensions)
Data marts

Sources

Like a shipping port welcoming the arrival of new freight, the sources stage collects all data from across the firm into a single location for subsequent processing. In the vision of Bill Inmon, this represents the “enterprise data warehouse” (EDW), which retains the maximally normalized relational model of transactional databases.

In the sources stage, we do not perform any data processing. Data is transmitted as-is from data sources (e.g. Salesforce, Hubspot, Postgres, Mixpanel) to the enterprise data warehouse. Data integration tools, such as Airbyte or Fivetran, perform this row-for-row replication automatically.

Data integration tools will specify a “destination” in the data warehouse, typically an isolated database such as DW_PROD_SOURCES. Each data source is assigned its own schema: Salesforce data in SALESFORCE, Hubspot in HUBSPOT and so on.

In the dbt codebase, one could organize the code using folders for each source:

sources/
    hubspot/
        sources.yml
    salesforce/
        sources.yml
    mixpanel/
        sources.yml
    jira/
        sources.yml

Generally speaking, prefixing file names is equivalent to nesting files within folders (as in AWS S3). Therefore, the following structure would also work:

sources/
    hubspot_sources.yml
    salesforce_sources.yml
    mixpanel_sources.yml
    jira_sources.yml

For the intrepid, one could also consolidate all data source information into a single, warehouse-wide sources.yml file, although this is not recommended.

Staging

After ingesting our sources, we perform initial processing of the data. Here, we use “staging tables” to:

Specify or cast data types (e.g. quantity::INTEGER)
Rename fields (e.g. initialBid to initial_bid)
Map original values to more commonly used values (e.g. “N/A” to NULL)
Perform basic calculations (e.g. first_name + “ “ + last_name)
Format or localize timestamps (e.g. TO_TIMESTAMP(unix_created_at))
Flatten nested fields (e.g. JSON)

Each staging table should correspond with an entity in the data source. For example, DW_PROD_SOURCES.SALESFORCE.CONTACT might have an associated staging table of DW_PROD.STAGING.STG_SALESFORCE_CONTACT to process raw Salesforce contacts.

Because staging tables correspond to source data and not to business-specific use cases, we should rarely if ever perform joins (JOIN), aggregations (GROUP BY) or filters (WHERE) in staging.

As shown in the prior example, we typically use a single STAGING schema within a DW_PROD data warehouse (as separate from our sources database, DW_PROD_SOURCES) to consolidate all processed data. Staging tables follow the nomenclature of stg__, such as stg_salesforce_contact and stg_salesforce_account.

One might wonder why we do not use separate schemas for each source, as we had done in DW_PROD_SOURCES, especially since staging tables have one-to-one correspondence with source tables. In addition, why do we put staging tables in DW_PROD and not DW_PROD_SOURCES?

If we were to keep our staging tables adjacent to our source tables in DW_PROD_SOURCES, for example by placing STG_SALESFORCE_CONTACT next to CONTACT in the DW_PROD_SOURCES.SALESFORCE schema, then we have coupled our processing of data with our ingestion of it. Should a developer want to test both staging and “data mart” code in the same session, he or she must constantly switch between databases.

If, on the other hand, we created a schema for each source in DW_PROD, then we would have the undesirable layout of source schemas (SALESFORCE, MIXPANEL, JIRA) next to pipeline schemas (STAGING, INTERMEDIATE, CORE, MARTS).

As a result, we typically place all staging tables in the DW_PROD.STAGING schema. While this in theory runs the risk of having too many processed source tables in a single schema, in practice only a limited set of source tables are ever processed and table prefixing by source name (e.g. stg_salesforce_contact) prevents clutter.

In the dbt codebase, we place each staging table into folders segregated by data source:

staging/
	salesforce/
		stg_salesforce_contact.sql
		stg_salesforce_account.sql
	hubspot/
		stg_hubspot_firm.sql
        stg_hubspot_email.sql
        stg_hubspot_campaign.sql

Core and intermediate

Even after the initial processing of data in our staging tables, our data retains its highly normalized structure. It is only in our (1) “intermediate” or (2) “core” tables that we begin to denormalize data by joining it together. Here, in the heart of the data warehouse, we apply core transformations which render raw data comprehensible to the business.

There are two approaches to structuring data in this stage. dbt recommends the use of intermediate tables: modular components which can be variously assembled to produce business-ready data marts. An alternative approach uses Kimball-style dimensional modeling to construct “fact” and “dimension” tables.

In practice, these two approaches are very similar. Both encode concepts, such as “the number of outreach activities per sales territory per day”, into tables at a given grain. Both produce modular components which can be joined to other components to form a more synoptic view of the business.

They only differ with respect to the recommended amount of denormalization: an intermediate approach will pre-join more data in advance (i.e. our fact and dimension tables), while a dimensional approach will leave fact and dimension tables separate until joined at query time.

As is typical in data warehousing, there is no one right answer, and multiple approaches can be used to achieve the same outcome. Here, we’ll explore the structure of a dimensionally modeled data warehouse using fact, dimension and even intermediate tables.

Fact tables

Fact tables capture facts about the world, such as “the volume of transactions processed per month” or “the number of users logging in per day.” Generally speaking, they represent “events” (actions, verbs) which occurred over time, such as a history of user logins, clicks or transactions.

Facts exist at a particular grain, such as “per day” in the “number of users logging in per day.” The grain corresponds to a fact’s analytical resolution, meaning you can analyze everything above the “line of representation” but nothing below it. If you need more granular visibility into a given fact, you must choose a higher-resolution grain and create a new fact table.

The most granular fact is, in Kimball’s terminology, the “atomic grain”: the maximum amount of detail that a given business process captures. In the terminology of resolution, this corresponds to a business process’ instrument resolution, the maximum resolution at which we can record data.

Fact tables typically include at least one quantitative measure, such as the “number of distinct users” per month or the “maximum latency” per thousand network requests. These measures frequently correspond to SQL’s most widely used aggregate functions: COUNTD, SUM, MIN, MAX.

In SQL, a fact table (prefixed using fct_) may look like:

fct_user_logins_daily.sql

SELECT
	event_date,
	COUNTD(user_id) AS num_users
FROM i_platform_activity
GROUP BY event_date
ORDER BY event_date DESC

event_date	num_users
2022-04-05	10
2022-04-04	14
2022-04-02	12
2022-04-01	16
2022-03-31	10

When possible, measures should be “additive”, meaning they can be summed over any dimension. For example, the SUM of sales over each month is additive because individual monthly totals can be added together to produce an aggregate total.

On the other hand, if each row in the fact table represented the AVG sales per month, then these averages could not be added together: the data would be “non-additive”. Finally, there is “semi-additive” data, such as bank balances, which can be summed across some dimensions (e.g. different bank accounts) but not across time (e.g. last month’s balance plus this month’s balance).

Dimension tables

If fact tables correspond to “verbs” which occur over time, then dimensions represent the “nouns” and “adjectives” which embroider those events with additional detail.

For example, we might know the “number of website visitors arriving per day from each marketing channel”, but we do not necessarily know which marketing channel is paid versus organic, in-person versus online, or web-based versus mobile. To determine these, we must join our fact table to a dimension table (prefixed using dim_) containing this information.

dim_marketing_channel.sql

marketing_channel_id name is_paid is_in_person is_mobile
google ads TRUE FALSE YES
conference TRUE TRUE FALSE
hubspot_marketing FALSE FALSE FALSE

Why don’t we include all this information in the fact table upfront, thereby obviating the need to perform any joins at all? Indeed, a maximally denormalized table could include all possible fields from related tables. However, such a wide table would routinely include hundreds or thousands of irrelevant fields for any particular business query and would therefore not constitute a reusable component.

The grain of a fact table is defined by its primary key, which itself is composed of a set of attributes (dimensions). Thus the “number of website visitors arriving per day from each marketing channel” would have a primary key of (date, marketing_channel), both of which could optionally be joined to dim_date and dim_marketing_channel should we need additional dimensional data.

To visualize which fact tables are associated with which dimension tables, a two-by-two “bus matrix” is typically used to map the relationships. This should be updated whenever new fact tables or dimension tables are added to the data warehouse.

Intermediate tables

You’ll notice in the fct_user_logins_daily example above that we sourced data from a table called i_platform_activity. Here, i_ signifies “intermediate”.

Intermediate tables can be useful as intermediate stages of processing between source data and business data. Recall that in staging, our tables should have one-to-one correspondence with sources. But what if we need a data set that combines data from many sources?

For example, if we want to consolidate all user platform activity from our backend API logs, our frontend JavaScript logs and our mobile application logs (which all reside in different systems), then we must UNION ALL these data sets together to get a holistic view of our users.

This cannot be done in our staging tables, and if it is done in our fact tables, the logic must be repeated at each grain of fact: fct_user_logins_daily, fct_user_logins_weekly, fct_user_logins_monthly and so on. This would produce considerable redundant code.

Instead, we can build a reusable component, i_platform_activity, which is referenced in each downstream fact table. No longer must each fact table duplicate the UNION ALL logic. It can be stored in an upstream component, leaving the fact tables only to group by various temporal dimensions (day, week, month).

Generally speaking, you should not need intermediate tables until you identify redundant code in the fact or dimension tables. Intermediate tables should live in the INTERMEDIATE schema and be saved as views, as they should not be queried by end users directly.

Late-arriving facts (LAF) and slowly changing dimensions (SCD)

Sometimes, event data is written to the database “late”, meaning old event data is added to the data warehouse even after newer data has already arrived. For example, imagine a daily ETL job which failed to ingest data yesterday, succeeded today, and then, upon someone noticing the error, was manually rerun today to fetch yesterday’s data. In this case, yesterday’s data arriving today would constitute “late arriving facts” (LAF).

Late-arriving facts can be problematic when we perform “incremental” transformations. Typically, we do not want to process the entire history of data every single day (e.g. all click history), but rather only the last few days’ history. This is called an “incremental run”.

If facts arrive late however, how exactly do we specify data which has already been processed? If we only filter for today’s data using WHERE event_timestamp >= TODAY(), having made the assumption that yesterday’s data has already been processed, then we will fail to process yesterday’s data which arrives today.

The solution to late arriving facts is “bitemporal modeling.” Here, we maintain two timestamps within the data: (1) the original timestamp as recorded by the source system (event_timestamp), and (2) the timestamp at which the data was ingested into to the database (ingested_timestamp).

Now, instead of filtering for today’s data as recorded by the original timestamp, we filter for today’s data based on when it was ingested into the database WHERE ingested_timestamp >= TODAY(). Late arriving facts ingested today will be duly processed along with all other new data[1].

Bitemporal modeling is also used for dimension tables, although for slightly different reasons. Dimensions commonly do not have any temporal component at all: you have a list of all users or all products or all customers, but unlike events, you do not necessarily have them over time. They are ingested at a point in time, after which they change “slowly.”

For example, imagine you are calculating the monthly sales per product for various products listed on Amazon, and specifically you are comparing the sales between products which had free_shipping against those which did not. Your SQL query would look something like:

SELECT
	p.free_shipping,
	p.name AS product_name,
	DATETRUNC(‘month’, o.order_date) AS order_month
	SUM(o.amount) AS total_amount
FROM fct_orders AS o
LEFT JOIN dim_products AS p ON o.product_id = p.product_id
GROUP BY 1, 2, 3
ORDER BY 1, 2 ,3

Although this query is superficially correct, what happens if products have changed their free_shipping status over time? If a product had free_shipping last month but not this month, then the query above would erroneously classify last month’s sales as free_shipping: false (using the current snapshot of product data) when in reality free_shipping: true (had we used a historical snapshot of product data). This is the problem of “slowly changing dimensions” (SCD).

As with late arriving facts, we must inject a temporal component into our dimensions to specify when exactly a given product had free_shipping. In doing so, we transform these tables into what are often called “history tables”, “snapshot tables” or “audit tables”, as they record the history of all changes made to any given dimension.

Our dim_products dimension table would now look something like this:

dim_products_history.sql

products_history_id product_id product_name free_shipping valid_from valid_to
1 stapler TRUE 2023-01-01 2023-01-31
1 stapler TRUE 2023-02-01 2023-02-28
1 stapler FALSE 2023-03-01 NULL

Our updated SQL query would look as follows:

SELECT
	p.free_shipping,
	p.name AS product_name,
	DATETRUNC(‘month’, o.order_date) AS order_month
	SUM(o.amount) AS total_amount
	FROM fct_orders AS o
LEFT JOIN dim_products_history AS p
	ON o.product_id = p.product_id
	AND o.order_date BETWEEN p.valid_from AND COALESCE(valid_to, TODAY())
GROUP BY 1, 2, 3
ORDER BY 1, 2 ,3

Structuring the core and intermediate schemas

In the database, all fact and dimension tables should live in the CORE schema of PROD_DW. Any intermediate tables, to the extent they are necessary, should live as views in the INTERMEDIATE schema.

In the codebase, we can similarly use intermediate and core folders to delineate the separate schemas:

intermediate/
	i_platform_logins.sql
core/
	fct_platform_logins_daily.sql
	fct_platform_logins_monthly.sql
	fct_email_to_login_conversions.sql
	dim_marketing_campaigns.sql
	dim_users.sql
	dim_firms.sql

Whether they are intermediate tables or core fact and dimension tables, every component should aim to be as synoptic as possible. For example, a dimension table dim_hubspot_marketing_campaigns containing only data from Hubspot would be less synoptic than a dim_marketing_campaigns which creates a UNION ALL of all marketing data sources.

As a general rule, fact and dimension tables should UNION ALL as much data as possible and filter out (WHERE) as little data as possible. Flags can used in the dimension tables to specify their source (e.g. is_hubspot or hubspot_campaign_id).

Data marts

At last, we have reached the final stage of data processing: clean, joined and enriched data ready for direct consumption by the business teams. When using data marts, business users should do little more than filter, group and aggregate their data.

Data marts represent the most denormalized version of our data. Unlike fact and dimension tables, they are not reusable components and they are not building blocks. Instead, they should be used exclusively and narrowly by the team which requested them.

Data marts are created by joining together fact and dimension tables. For example, if our Finance team needs to analyze “sales by territory over time, excluding holidays and only within the Americas geographic region”, we simply need to join our sales facts to our calendar and region dimensions, then perform some filtering:

mart_finance_sales_by_territory_americas.sql

SELECT
	DATETRUNC(‘month’, o.order_date) AS order_month,
	t.territory_name,
	SUM(o.amount) AS total_amount
	FROM fct_orders_daily AS o
LEFT JOIN dim_calendar_table AS c ON o.order_date = c.date
LEFT JOIN dim_territories AS t
	ON o.territory_id = t.territory_id AND o.order_date BETWEEN t.valid_from AND COALESCE(t.valid_to, TODAY())
WHERE TRUE
	AND c.date NOT c.is_holiday
	AND t.region = ‘Americas’
GROUP BY 1, 2

Notice how the moment we apply filtering in the WHERE clause, our data becomes less reusable for other analytical questions. This is why we seldom use WHERE clauses in our fact and dimension tables but often use them in our data marts.

In the codebase, each business team receives its own folder where all relevant data marts are stored:

marts/
	finance/
		mart_finance_revenue_by_territory_americas.sql
		mart_finance_revenue_by_territory_emea.sql
		mart_finance_revenue_total.sql
		mart_finance_pnl_by_product.sql
	marketing/
		mart_marketing_campaign_conversions.sql
		mart_marketing_webinar_prospects.sql
		mart_marketing_user_journeys.sql
		mart_marketing_email_segments.sql
	product/
		mart_product_feature_usage_by_cohort.sql
		mart_product_usage_stats_by_feature.sql
		mart_product_churned_users_by_feature.sql
		mart_product_journey_completion_funnel.sql

In the database, tables follow the nomenclature of mart__, such as mart_finance_sales_by_territory_americas, and live within the MARTS schema.

Data marts are frequently created in the business intelligence tool, such as Tableau or Looker, rather than in the data warehouse. Tableau, for example, uses the concept of “Data Sources”, wherein fact and dimension tables are joined together using a drag-and-drop interface.

These joins however often occur at query time and can expose substantial latency to the end user. To improve query performance, they can be “pushed down” to the data warehouse where tables are pre-computed in advance. This means that while most data marts will not exist in SQL under marts, those which need to be materialized for convenience or performance reasons will.

[0] Of course, software is also a form of data, and can be transitioned through a series of “code states” by way of a version control system, such as `git`.

[1] In practice, we often want to use `WHERE ingested_timestamp >= DATEADD(‘day’, TODAY(), -2)` to include a “lookback period” in our incremental runs. This allows us to write not just today’s data, but additionally overwrite the last two days of data in case any other joined dimensions were late arriving.

(previous)(next)

The Warehouse Layer: A short history of data warehousing (Part 6)

2023-04-07T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

At last, we’ve reached the nucleus of the data infrastructure: the data warehouse. Like freight arriving at its final destination, raw data from around the company is shipped here to the data warehouse for central processing.

The data warehouse is the “single source of truth” (SSoT) at the firm. While various enterprise systems may be “systems of record” (SoRs), serving as exclusive entry points for new data and therefore sources of truth in their own right, only the data warehouse is responsible for providing accuracy guarantees across all data at the firm.

For example, a company may maintain information about its customers in both a customer relationship management (CRM) like Salesforce, as well as an enterprise marketing software such as HubSpot. It may subsequently ingest both sets of customer data into its data warehouse, for example AWS Redshift.

Here, the company would designate only one system, such as Salesforce, as their system of record for customer information. If an employee needed to update a certain customer’s name or email, they would do so in Salesforce. Any updates to customer information within HubSpot would either be automatically denied, or understood to be secondary relative to Salesforce data.

When customer information is queried within AWS Redshift, it is Salesforce data which is first and foremost exposed. A data warehouse therefore always makes explicit or implicit decisions about which source data systems are systems of record. Data within the data warehouse should always reconcile exactly with data in the system of record. If it does not, the data warehouse is in error. The data warehouse may freely fail to reconcile with non-systems of record, as they do not contain authoritative source data.

Because the data warehouse makes explicit guarantees about the quality of its data - either collected as-is from source systems or derived from a collection of source systems - it serves as the semantic backbone of an organization. It determines which systems are systems of record, adjudicates definitions between various systems, and singularly applies validation and transformation to raw data that need not be replicated across all source systems. Only the data warehouse can provide a unified, internally consistent and synoptic view of all data at the firm.

The objective of a data warehouse, like a well-run factory, is to transmute raw materials into processed goods - that is, raw data must become well-defined, functional data assets. This is achieved through three broad mandates:

Evaluating and selecting data infrastructure tools
Modeling and transforming data
Administering the data warehouse

We’ll begin data infrastructure tooling, and how in particular the field has evolved over the past 50 years.

The 1970s: The emergence of the relational database

In 1970, computer scientist E. F. Codd published a seminal, 10-page paper titled “A Relational Model of Data for Large Shared Data Banks”, wherein he proposed a relational model of data and an associated “universal data sublanguage”.

At the time, data was most commonly arranged using a hierarchical model, which can still be seen today in the layout of file systems, in file formats like XML and JSON, and in HTML’s Document Object Model (DOM). Although hierarchical models reflected an intuitive way of conceptualizing data - a “person” sits within a “team” within a larger “organization” - it came with certain drawbacks.

The first was that it repeated data in multiple places and, over time, this produced data inconsistencies. We can examine this by way of the popular JSON format. Imagine we have the following data:

data = [
   {
      "organization":{
         "name":"Acme Corporation",
         "teams":[
            {
               "name":"Finance",
               "people":[
                  "Steven Smith",
                  "Jane Doe",
                  "Sarah Connors",
                  "Evan Middleton"
               ]
            },
            {
               "name":"Marketing",
               "people":[
                  "Steven Smith",
                  "Kaitlyn Wood",
                  "Jack Feinwood",
                  "Steven Lanyard"
               ]
            },
            {
               "name":"Engineering",
               "people":[
                  "John Finch",
                  "Angela Vickers",
                  "Sally Beckhert",
                  "Robert Samueslon"
               ]
            },
            {
               "name":"Human Resources",
               "people":[
                  "Steven Vale",
                  "Zachary Seaward",
                  "Sam Slate",
                  "Walter Iverson"
               ]
            }
         ]
      }
   }
]

Notice how Steven Smith appears under both the Finance and Marketing teams. If Steven leaves the firm, now we have to ensure he is removed in two locations. Should we erroneously remove him from only one team, the data will be internally inconsistent and by extension inaccurate.

The second issue relates to access paths. Imagine we want to count how many employees we have at the firm. We might write JavaScript code that looks like this:

[... new Set(data.filter(obj => obj.organization.name === 'Acme Corporation')[0].organization.teams.flatMap(obj => obj.people))].length

This says: give me the organization whose name is “Acme Corporation”, look within their associated teams for people and count the distinct number of names (here, 15 in total).

Notice how, to answer this question, we must follow a particular path, typically called a “query access pattern”. This path represents the logical hierarchy of entities. To find what you are looking for, you must always start at the top of the hierarchy (or root of the tree) and navigate to the bottom.

Despite us only needing a list of people at the firm, we must unnecessarily route through unrelated “nodes” of the hierarchy - such as the teams people sit within - to get what we’re looking for. When scaling up to large data sizes, scanning billions or trillions of records per query, such inefficiencies invariably become slow and expensive.

Codd, in his 1970 paper, invented a system for organizing data which was effectively “access pattern free”. It would rely on free-floating “relations” (more commonly known as database tables) which specified relationships to one another. There would be no hierarchy and no tree, only a flat landscape of tables and their associations. It would become known as the relational model of data.

Codd’s formulation was based entirely on relational algebra, which meant all data operations could be represented mathematically using predicate logic. If you passed a collection of input relations (tables) through a handful of well-known set functions, such as set union or set difference, then you could guarantee the output relations (result sets), no matter how they were computationally arrived at.

In other words, if you started with two data sets and wanted the intersection of values between them, you knew what the result would be, even if you didn’t exactly know how you’d compute it. One such algorithm would be iterating through each value in the first set, checking if it existed in the second set, and only if so placing it into a third “intersection” result set. Other algorithms include binary search, merging search and fast set intersection.

Some implementations would be faster or slower, but thanks to relational algebra, we were always assured of what the output would be. This effectively cleaved two layers into the design of database systems: the first, which specified mathematically what we wanted, and the second, which actually implemented it.

The former became the “data sublanguage” that Codd had originally envisioned: SEQUEL (later called SQL), invented by Codd’s IBM colleagues Donald Chamberlin and Raymond Boyce in 1973.

The latter became the “query engine”, the workhorse at the center of the database, which parsed incoming SQL statements and assigned the most efficient “query plan” to execute them. Whether the query engine ordered data using a merge sort or quick sort or heap sort was entirely obscured to the end user. Propelled from beneath by the query engine, database users could write SQL joins, filters and aggregations - all enabled by the guarantees of relational algebra - without ever having to worry about how exactly to implement them.

The 1980s: Relational databases take over, but not over everything

By the late 1970s, Codd’s ideas had permeated beyond the walls of IBM and began to see wider acceptance in the broader technology community. A then-unknown software engineer recognized the potential of Codd’s work and released the first commercial version of the relational database management system (RDBMS) in 1979.

That software engineer was Larry Ellison, who went on to found the technology titan Oracle, a $100B+ enterprise which today houses the data of many of the largest firms and governments in the world. This would not be the first time that lucrative technology had escaped the clutches of technology incumbent IBM. Just years prior, IBM outsourced development of the PC operating system to a fledgling company named Microsoft, which would soon vault into becoming one of the world’s most valuable technology companies.

By the late 1980s, relational databases had thoroughly embedded themselves into the circulatory system of every major enterprise and government. Airlines, banks, hospitals and municipalities all eagerly threw their data into relational databases. The moment you had software automating business processes, you now also had a relational database behind the scenes storing and manipulating the data.

As more and more organizations piled their data into relational databases, a new problem emerged. Early databases were designed to rapidly store and retrieve individual records, but they were not necessarily designed for large-scale analytics. If an airline wanted to know which flight routes featured the highest profit margins over the past five years, they would likely run into at least three issues.

The first was that they would have to join many, many tables together. Early databases were designed for “online transactional processing” (OLTP), which prioritized inserting, updating and deleting data over and above reading data. To do so quickly, the structure of the data, or “schema”, needed to be as federated as possible.

Every logical entity - a passenger, a ticket, a flight, a booking - needed to be its own table. This sprawling data model, the pinnacle of which was Codd’s “third normal form” (3NF), ensured data consistency when calculating results, but also made analytical work more complex.

The second issue was that there was often no singular database which contained all the information required to perform certain analytical queries. An airline might keep flight information in its flight management system, booking information in its ticket management system, and financial information in its accounting system. Each of these were backed by separate relational databases.

Finally, large analytical queries were computationally expensive, and burdening operational databases with such queries routinely caused the databases to crash. These crashes would subsequently cascade into a failure of the entire production system.

To address these issues, database designers opted to create a secondary, read-only “data warehouse” whose exclusive use would be analytical in nature. It would be a relational database, just like the OLTP databases, but would be isolated from production systems and would consolidate all information between disparate databases. It would replicate data byte-for-byte from operational systems via ETL processes and feature a “denormalized” schema to reduce the amount of joins required to query data.

This database, designed first and foremost for “online analytical processing” (OLAP), would herald a new era in large-scale analytics: that of data warehousing.

The 1990s: From databases to data warehouses

By the early 1990s, most large enterprises utilized a data warehouse to support complex analytical work. At the same time, they found their business users increasingly using spreadsheets to perform small-scale data analysis. The first popular spreadsheet program was Lotus 1-2-3, released in 1983, which was soon eclipsed by Microsoft Excel, which launched in 1985.

With the aid of spreadsheets, business users could analyze data via a simple, convenient, graphical user interface (GUI) instead of the more abstruse command-line interface (CLI) provided by relational databases. They could perform the hallmark of data analysis - filtering, pivoting and aggregating - all by point-and-click rather than SQL commands.

However, they could only do it for “small data” - that is, the amount of data which fit into the spreadsheet. For Excel, that was 1,048,576 rows. Data analysis therefore forked into two paths: the one where you had small enough data which fit into Excel, and the other where you didn’t and needed to perform analysis directly within the data warehouse.

When it came to the data warehouse, there was no clear consensus on how exactly to structure the data. We knew it should be less federated (or “normalized”) than the OLTP database, but how much less?

In 1992, Bill Inmon published Building the Data Warehouse, where he argued that the data warehouse should be as normalized as the OLTP database, only differing insofar as it contained all firm-wide data and not just the data for a given application. In Ralph Kimball’s 1996 The Data Warehouse Toolkit, Kimball instead suggested the data warehouse should follow a more denormalized structure to simplify analysis for business users.

Kimball’s methodology, which he called “dimensional modeling”, revolved around central “fact tables” surrounded by supporting “dimension tables.” Fact tables aggregated raw data into quantifiable measures, such as the number of users visiting a website per day, whereas dimension tables allowed users to pivot those facts by various dimensions, such as “by region” or “by marketing channel”.

Using two simple techniques - aggregations and filters applied to dimensions, and calculations to facts - business users had everything they needed to analyze data at scale rapidly and comprehensively.

Despite their differences, both Inmon’s and Kimball’s architectures remain widely used today, and in fact are often used in parallel. It is Inmon’s approach which is used at the initial stages of data processing (called “base” or “staging” areas), while it is Kimball’s which is used for the final stages (as fact and dimension tables are molded into business-consumable “data marts”).

By the late 1990s, there remained one final frontier which lay beyond the reach of conventional OLAP data warehouses: “big data.”

The 2000s: Big data and data lake architectures

Despite being able to facilitate large-scale analytics, OLAP data warehouses could not service truly massive data sets. To store and process big data, they needed to be “vertically scaled”, which required an upgrade of the entire physical machine, instead of “horizontally scaled”, where machines could be conjoined infinitely to form a virtual data warehouse.

A single, physical machine serving as the firm-wide data warehouse created three problems. First, it constituted a single point of failure for analytics in the event of a server crash. Second, vertical scaling after a certain point was challenging, if not impossible, as there are only so many physical slots onto which you can attach additional hard drives or RAM. Finally, calculations could not be performed in parallel across multiple processors, instead being performed sequentially within a single one.

Due to these limitations, there remained an upper limit on just how much data a single-machine data warehouse could process. This all changed with the advent of the Hadoop Distributed File System (HDFS), invented by Doug Cutting and Mike Cafarella in 2002.

Unlike traditional databases, HDFS was “distributed-first.” Rather than storing all data on a single machine, it instead parceled out data to a network of connected machines. Data would be replicated redundantly across the network to minimize the risk of permanent data loss and improve the data’s “durability”. New machines could be added to the network “infinitely”, thereby expanding the network’s collective disk space and compute power.

The hardware for these underlying machines was secondary. They could be old servers, new servers or storage servers. What mattered was the distributed software orchestrating from above: HDFS.

By 2004, HDFS had proved its mettle and began to see wider adoption. That year, Jeff Dean and Sanjay Ghemawat at Google published a paper titled “MapReduce: Simplified Data Processing on Large Clusters”, where they introduced a new software for processing data across a distributed file system such as HDFS.

MapReduce worked by “mapping” a data operation to all machines on the network, gathering the results (“shuffle and sort”), and finally “reducing” those results using some form of aggregation.

For example, if an airline wanted to calculate the number of passengers traveling on every flight route this year, it would “map” an extraction operation for passenger information to all nodes, shuffle and sort the results by flight route, and finally “reduce” the results by calculating the number of passengers for each flight route.

Although powerful, the original MapReduce program was written in Java, slow to develop in, and not particularly user-friendly. Programming interfaces soon emerged to simplify working with MapReduce, such as Spark (written in Scala), PySpark (written in Python), Hive (offering SQL semantics) and Pig (a command-line interface).

Spark and PySpark in particular would later blossom into an expansive ecosystem of libraries which, broadly speaking, applied batch ETL, data streaming and machine learning pipelines to “big data” for the very first time. In addition, the tools were entirely open-source, meaning you could assemble the entire infrastructure yourself without having to rely on any external vendors (and the attendant risk of future vendor lock-in).

Collectively, HDFS and Spark formed the foundation of what became known as “data lake” architectures. Unlike a relational database, HDFS enforced no schema and no structure upon the data launched into the file system. Further, Spark could process any type of data: unstructured, semi-structured, image, audio and so on.

As a result, the guardrails to data collection came off: if we could capture and store anything in the “infinite file system”, then why wouldn’t we? It is at this point, around 2005, that the era of “big data” began in earnest. Data lake architectures proliferated, and firms began collecting more data than ever before. HDFS and Spark were in; the OLAP database was out.

Even in the beginning however, there were objections to the new distributed data paradigm.

First, if you could throw anything into HDFS, without any structure or form, you often ended up with a “data swamp” rather than a data lake. Second, setting up the Hadoop ecosystem was no small task and often took a team of engineers to configure and manage. Third and finally, Spark was not intuitive, meaning that analysis, although directed by business stakeholders, ultimately had to be implemented by engineers.

For the latter half of the 2000s, it appeared as though this was the end state of data infrastructure. Companies were finally able to analyze big data, but only if they invested into a large team of engineers to help them do so.

The 2010s: Distributed data warehouses arrive

Beginning in the 2010s, the data landscape would fundamentally shift once again with the renaissance of the data warehouse. In 2011, Google launched BigQuery, the first vendor-managed, distributed, relational data warehouse.

Behind the scenes, BigQuery utilized Google’s own distributed file system called Colossus, and its own map-reduce system called Dremel. What was exposed to the end user, however, was a simple interface data practitioners had long been familiar with: SQL. In releasing BigQuery, Google revived the OLAP data warehouse. This time however, it would be distributed.

Amazon followed suit in 2013 with its offering called Redshift, and Snowflake publicly launched its distributed data warehouse in 2014. Now, anyone could use SQL to process and analyze big data, not just the data engineers who were able to grapple map-reduce.

Distributed data warehouses of course were not quite like traditional data warehouses. The architecture on the backend was completely different; it was only the frontend which gave the appearance of a data warehouse.

The most salient difference and raison d’être for distributed data warehouses was that they could “infinitely scale”: you could grow your data footprint and compute requirements without ever having to vertically or horizontally scale your infrastructure. Everything could be “auto-scaled”, managed entirely by the vendor in the cloud.

Because the distributed file system and the map-reduce “compute cluster” were two separate systems, you could also resize one without changing the other. Unlike vertically scaled data warehouses, where increasing your RAM was attended by increased disk space and processor power (i.e. an upgrade of the overall server), a distributed data warehouse allowed you to marry small compute with enormous data for efficiency, or large compute with small data for speed.

As a result of this decoupling, large volumes of data no longer needed to be stored in powerful but expensive, vertically-scaled servers. The data itself could be stored on cheap, commodity servers, while the more expensive map-reduce jobs could live on big servers with powerful processors. While the cost of compute continued its perennial decline (thanks to Moore’s law), the cost of storing data plummeted.

The 2020s: Managing big data pipelines using “just SQL”

By the 2020s, distributed data warehouses like Google BigQuery, AWS Redshift and Snowflake had firmly taken hold in the data infrastructure landscape. Data lake architectures, such as the cloud-managed Databricks or self-managed Hadoop, persisted, but waned in popularity. SQL had returned to usurp Spark.

Despite its simplicity however, SQL had one critical deficit that Spark did not: SQL was not a data pipeline. While Spark could sequentially and incrementally process raw data into enriched, business-consumable data, SQL could only perform one operation at a time.

In the early days of single-machine data warehouses, SQL statements were stitched together using a patchwork of “stored procedures” and “triggers”: after this table is updated, update the next, and so on. Together, these SQL statements implicitly congealed into “dependency graph” or, more technically, a “directed acyclic graph” (DAG).

By the 1980s, this tangled web of SQL code was plucked from the database and dropped into “ETL tools”, such as Microsoft’s SSIS or Talend’s Open Studio. These tools enabled data engineers to visualize the DAG, monitor pipeline health, alert on errors, automatically retry on failures, provision hardware resources, and manage the ETL schedule.

If you didn’t use a vendor to orchestrate your data pipelines, then you typically wrote your own. The most primitive of these was a simple cron job running a bash script (“run this SQL code at 6pm daily”), while the more sophisticated ones built custom frameworks in Python or Java to construct the DAG.

In 2014, batch data processing experienced a breakthrough with the public release Airflow, a data pipeline orchestration tool, created by Maxime Beauchemin at Airbnb. Airflow had all the features of historical, vendor-based ETL tools, but also contained two important differences.

First, DAGs were not developed by point-and-click within the GUI, but instead were specified in the underlying Python code. Thanks to Airflow, DAGs for the first time became first-class concepts, meaning entire pipelines could be copied, re-arranged and nested in order to improve architectural design.

Second, Airflow was free and open source, meaning once again that companies could build their own tooling in-house without having to worry about vendor management, costs and lock-in. Over the coming years, Airflow took the data industry by storm and rapidly became the de facto solution for managing data pipelines.

Although Airflow was simpler than writing your own DAG framework or wrestling the archaic Microsoft SSIS, it was not that simple. You needed to code up the DAG in Python, understand various “operators” to execute pipeline tasks, and - like a flight controller monitoring dozens of flight paths each day - decipher the kaleidoscopic dashboard to see which jobs succeeded and which jobs failed. The data engineer was therefore still integral to constructing, managing and monitoring the data pipeline.

This changed with the launch of dbt, or “data build tool”, in 2016. Developed by the team at Fishtown Analytics, dbt offered a simple data pipeline wrapper around SQL statements, no Python required. Data analysts and engineers could write their plain SQL in an editor, test it in the data warehouse, and more or less copy-and-paste it into dbt to weave it into the data pipeline. With dbt, the entire data pipeline was just SQL.

But what did this data pipeline look like? How exactly did it transmute raw data into business-ready data? And what does it mean to produce business-ready data anyway?

That brings us to the next major topic in data warehousing: data transformation, in particular, using SQL and dbt.

(previous)(next)

The Transit Layer: Data integrations & data connectors (Part 5)

2023-04-05T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

In the data transit layer, we answer the question: given data over there, how do we get it over here?

For example, data we want to analyze might live on a website, such as basketball statistics from ESPN or economic statistics from the Federal Reserve’s FRED system. Or, data might live in an enterprise resource planning system (ERP), such as SAP or Netsuite, or a customer relationship management system (CRM), such as Salesforce or HubSpot. Finally, data might be exposed by a vendor over API or FTP, or sent by a team member as a CSV or Excel file.

In each case, the goal of data transit is to get data over there, here, into the data warehouse.

This is conventionally called “extract-transform-load”, or ETL, which is the process of extracting the data from a data source, transforming it into a schema-compliant structure, and finally loading it into the data warehouse.

Data integrations providers such as Stitch, Segment, Fivetran, Singer and Airbyte have emerged over time to provide a comprehensive solution to this particular problem. In just a few clicks of a button, these vendors make it easy to select a data source, choose which data we want, and sync it into the data warehouse. While these tools are non-trivial behind the scenes, to end users simply work: data over there magically appears over here.

When evaluating these tools, the most important characteristics tend to be availability (the service is consistently operational), coverage (the service supports many data source integrations) and cost (most vendors charge per million rows transferred). Because of how straightforward vendors make the ETL process, only in rare cases should we build instead of buy.

ETL providers however typically do not have integrations for every data set we could possibly want. For example, they would not scrape website from the front page of the New York Times, nor would they integrate with the APIs niche service providers (such as Buttondown). They may not readily sync FTP data, nor would they download and parse attachments sent over email.

In these cases, to get data “over there” here into the data warehouse, we would have to write our own data extractors, crawlers, transformers and loaders. This is in fact an instructive exercise to understand what exactly ETL providers must generally do behind the scenes, including:

Integrating with various vendor APIs (which may or may not have helpful documentation)
Adapting to service unreliability using retry logic, exponential backoffs and request throttling
Validating and reconciling data equality post-transfer between source and destination
Orchestrating concurrent workers and scheduling jobs
Performing schema validation on both the source as well as the destination (in case any changes were made)
Automating schema correction and implementing soft deletes
Supporting full resyncs and partial data resyncs (i.e. incremental changes only)
Masking sensitive data to particular users when queried
Permitting history mode (change-data-capture) in addition to the more typical upsert mode

What about reverse ETL?

The notion of data transit sometimes implies the reverse of the process described above: given data over here, how do we get it back over there? How do we transfer clean, merged, enriched data from the data warehouse back into our enterprise systems, be it SAP or Jira or Salesforce? This is typically referred to as “reverse ETL”.

One might think that we can simply use the same connectors we used for ETL - after all, connectivity should work both ways, shouldn’t it?

In fact, bi-directional connectivity doesn’t come for free. The nature of ETL is about performing reads against the data source and writes against the data warehouse. This means implementing the GET APIs of data sources and issuing DDL/DML (such as CREATE and INSERT statements) within the data warehouse.

On the other hand, reverse ETL is about performing writes against the data destination after having read data from the data warehouse. This means implementing the POST APIs of data destinations and issuing DQL (SELECT statements) within the data warehouse.

While the SQL standard is fairly trivial when it comes to DDL, DML and DQL, implementing the POST APIs of various data destinations is not. When writing data to data destinations, you must manage write throughput (so as not to overload the system), bundle data into batches, retry failed requests, implement idempotence (should you issue the same successful request twice), and more generally safeguard against incorrect logic (lest you inadvertently overwrite all data in an enterprise system).

None of this is trivial, and as a result, few traditional ETL providers have as of yet expanded into reverse ETL. Instead, it is a nascent landscape with startups like Hightouch and Census currently at the forefront.

As the space matures over time, we can one day expect reverse ETL will become as push-button as ETL is today. Until then however, most data engineers will be left to develop these integrations themselves.

(previous)(next)

The Sources Layer: Schema-first versus schema-last paradigms (Part 4)

2023-03-23T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

Every data infrastructure starts with the same thing: raw data. To process and analyze data, first you must capture it, and this begins in the data sources layer.

Historically, database administrators were too not involved in the collection of data. Instead, they ingested data into the infrastructure “as-is”. Questions from business users about a source system’s data model, semantics or accuracy were tactfully redirected to the appropriate system owner: Salesforce questions went to sales, HubSpot questions to marketing, QuickBooks questions to finance.

Like a spinning carousel, questions about data routinely went in one end and out the other. DBAs believed their job was to manage everything within the data warehouse. Business users, on the other hand, believed the data platform was to manage all data assets at the firm.

As the mandate of traditional DBAs progressively shrunk, the mandate of the broader data team correspondingly grew. The data platform expanded its purview into procuring data, logging data, cataloging data and developing well-designed data models.

Data engineers were to apply data collection best practices upstream and ensure that data going into the data warehouse was of high quality. Within the sources layer, data acquisition fell into the following common functions:

Ensuring logging systems were high-throughput, high-availability and high-resolution
Ensuring enterprise systems followed “master data management” (MDM) best practices
Procuring third-party data sets and managing vendor relationships

These functions further divided into two broad categories depending on the type of data collected: “event-based data” and “entity-based data”. With event-based data, typically your foremost concern related to write capacity. With entity-based data, you focused on high-fidelity domain modeling.

This distinction was popularized by Martin Kleppmann, who in Designing Data-Intensive Systems contrasted two foundational data management designs (a summary of which is here):

Schema-on-write, where schema conformity and referential integrity are validated before data is written
Schema-on-read, where write throughput is prioritized over schema conformity

In a schema-on-write architecture, “entities”, described and encoded in a schema, are the principal focus. For schema-on-read, it is “events” (sometimes referred to as “documents”).

From a data warehousing perspective, entities generally become our dimensions and events become our facts. For example, if we are building a mosaic of users by tying together click activity, marketing engagement and transaction history, then the user (entity) is our dimension to be subsequently joined onto click, marketing and transaction facts (events).

While your run-of-the-mill, relational database such as PostgreSQL or MySQL can very well accommodate both approaches, different databases are commonly used for each architecture.

Relational databases (RDBMS) excel at modeling and normalizing entities. Conversely, non-relational databases (i.e. NoSQL) and logging systems (e.g. Kafka, Logstash and DataDog) excel at high write capacity and are able to absorb hundreds of thousands of writes per second.

Schema-on-write architecture

In a schema-on-write architecture, the data model (or “schema”) reigns supreme. New data destined for the database is blocked unless it conforms to the existing schema; no data at all can be written unless a schema exists a priori.

The schema defines what things are (entities), what characteristics individual entities possess (attributes), and how entities relate to one another (relationships). As a result, schema definition forms the semantic backbone of an organization.

When it comes to data management, it is prudent to follow the practices advocated in the relatively mature field of master data management. Some in particular include:

Data modeling

Accurate modeling of the domain and producing entity relationship diagrams (ERD)
Explicitly defining entity uniqueness via primary keys
Explicitly specifying data types and enumerated types
Including metadata around record management (e.g. created_at_utc, last_updated_at_utc, created_by_user, last_updated_by_user)

Data governance

Specifying data owners and data stewards for various data sets
Producing documentation on all entities, fields and values
Annotating which fields are original to a given system versus derivative of another system
Avoiding, or carefully managing, the storage of personal identifiable information (PII)
Improving data discovery to identify already existing concepts, entities and logic elsewhere in the firm

Data automation

Flagging system-to-system reconciliation failures and remediating manually or automatically
Using automated input (via API) over error-prone human entry (free-form text)

Master data management applies whenever we are dealing with entity-based data.

It doesn’t matter whether the entities live in an enterprise system (such as Salesforce or HubSpot), a vendor API (such as Stripe or Twilio), a self-managed database (such as RDS, Azure SQL or MongoDB), an Excel file, or a text file. If you are working with logical entities, you should practice MDM.

Schema-on-read architecture

In a schema-on-read architecture (also sometimes called “event sourcing”), we are mostly concerned about getting data safely and quickly into the system. If you are capturing hundreds of orders per second, you want to ensure you don’t lose any of them, which can occur from time to time due to service unavailability, network failures and queue overcapacity (among other reasons).

At such a high write velocity, you generally don’t have time to enforce referential integrity for each event. You simply want to store the event quickly and durably. An example event looks like this:

client_timestamp_ms=1677549412381 server_timestamp_ms=1677549817539 user_id=d41d8cd9 order_id=0010316 server=nj-us-east-1 ip=233.188.23.10 amount=40 currency=USD

Logging systems expose client APIs which support tremendous write-throughput (often backed by a cluster of “write nodes”), quickly persist data to disk, copy it for redundancy and index it for rapid retrieval. Logging systems and implementations generally feature:

High write-availability: the service is always operational and accepting writes
High read-availability: when we query for recent logs, we always get results
High-capacity buffer: we can track substantial data per event, enabling high spatial resolution (e.g. Splunk accepts 10,000 characters, or about 2,000 words, per event by default)
Durability: writes, once acknowledged, are never lost
Idempotence: concurrent writes to multiple write nodes do not result in duplicate data
Distributed tracing: “unique requests” can be followed throughout various services in a microservice architecture

On the frontend, examples of logging systems include Mixpanel, Segment, Hotjar and Heap Analytics. Backend logging systems include AWS CloudWatch, Splunk, DataDog, Logstash and Kafka, as well as NoSQL stores such as MongoDB and DynamoDB.

(previous)(next)

How to build a reliable, useful and performant data infrastructure (Part 3)

2023-03-17T00:00:00+00:00

(This post is part of a series on working with data from start to finish.)

How do we evaluate a data infrastructure? What aspects do we care most about? What are its goals, its guarantees, its service-level agreements (SLAs)?

Most data infrastructures will generally aim to deliver the same things:

Coverage: Data across all enterprise systems is ingested into and available within the firm-wide data warehouse
Availability: Data infrastructure components (e.g. data integrations, data warehouse, BI tool) are consistently operational
Integrity: Data is complete, useful and free of errors (e.g. duplicates, omissions, miscalculations)
Currency: Data is updated on a frequent basis
Durability: Data is not inadvertently lost, deleted or rendered inaccessible (e.g. missing encryption keys)
Comprehensibility: Data is well-defined, consensual and publicly documented
Performance: Data is ingested, transformed and queried with relatively fast runtimes
Security: Data is encrypted and permissioned using role-based access control (RBAC)

In short, we want as much data as possible to be correct, current, secure, understandable and fast.

Software development best practices for data infrastructure

When writing software, we generally follow certain software development practices. These help ensure that the software we produce is reliable, stable and functionally correct. For data infrastructure, these include:

Deploying via CI/CD pipelines: version control, feature branches, code reviews, linting, static code analysis, test runners, test environments and weekly releases
Creating developer sandboxes: developers should receive and develop within isolated clones of the data warehouse
Performing automated testing: write breaking unit tests for logic and warning data tests for data quality
Writing “configuration-as-code” for vendor tools instead of managing configurations by point-and-click
Employing idempotence: plan for duplicated insertions and discard accordingly
Using append-only, immutable and bitemporal logs to retain all changes made to source data
Ingesting “schemaless” source data to avert upstream, breaking changes to the schema
Practicing fault-tolerance: components regularly fail, so develop automatic retries, monitoring and alerting to remediate quickly
Learning the data model: if you get the entities and relationships right, everything else gets much easier
Considering your naming conventions, as these “public interfaces” tend to resist change most as an organization scales
Documenting generously: create business glossaries, data dictionaries, data lineages and data inventories
Applying security best practices of data minimization, need-to-know, least privilege, audit trails and masked PII

Collectively, these guidelines minimize the frequency of errors, accelerate development velocity and improve understanding of the data infrastructure and its outputs.

The modern data mindset

However, even if you adhere to every software development best practice above, the data infrastructure will still be missing something essential. That is, velocity.

Each item above is technical in nature, which means each must be implemented and managed by an engineer. It is the engineer who is responsible for extracting schemaless fields into tabular form, who uniquely has visibility into the data model, and who produces documentation around where fields come from and what they mean. Little can be done without the engineer. And for a data infrastructure, whose ultimate goal is consolidating and conveying knowledge, this can materially hinder organizational learning velocity.

Traditionally, this engineer was referred to as the database administrator, or DBA, who served as fiduciary gatekeeper of the database. DBAs were responsible for provisioning the database, populating it, optimizing it, scaling it, securing it and more.

If you wanted data, you needed to go through the DBA. And because everyone around the firm constantly wanted more data, the DBA had little time to focus on anything else which made for a better data infrastructure: user experience, client service, knowledge bases, project management and so on.

Over the past decade, improvements in data warehousing technology have largely obviated the need for traditional DBAs. Data warehouses are increasingly cloud-managed (as opposed to on-premises), auto-scaling, and vastly more performant for analytical workloads than ordinary, non-distributed databases. The demand for DBA skills waned, and the demand for more modern data engineering skills accelerated.

This refreshed skill set evolved to address the core drawback of traditional data infrastructures: centralized management and administration which, by extension, led to technical gatekeeping and slow development velocity.

The new infrastructure would be centralized on data best practices but decentralized on data operations. It would give users access to more data, faster, with cleaner and more polished interfaces. And it would be organized around what are now called “data mesh architecture” and “data-as-product” principles.

The data mesh architecture

Although data mesh architecture goes by various names - DataOps, domain-driven design (DDD), service-oriented architecture (SOA) - the ideas are largely the same.

Business users want access to more data, faster. They want end-to-end visibility into where data came from and what kind of processing was applied. Increasingly, they know SQL, which empowers them to clean, join and enrich data themselves. Less and less, they need a data engineer to perform “English-to-SQL” translation, and more and more, they are able to encode business rules directly into code.

Thanks to self-service tooling, there are fewer intermediaries than ever in the data stack. Anyone - from analysts to engineers to C-Suite executives - can ingest data using Fivetran, process it using basic SQL, and visualize it in Tableau. All code and data is public by default. Through transparency and accessibility, the data mesh architecture democratizes access to the data warehouse.

Maxime Beauchemin, the creator of Apache Superset and Apache Airflow, astutely observes: “The modern data warehouse is a more public institution than it was historically, welcoming data scientists, analysts, and software engineers to partake in its construction and operation. Data is simply too centric to the company’s activity to have limitations around what roles can manage its flow.”

In a data mesh, the job of the engineer is not to perform every data operation imaginable: ingesting data, processing data, validating it and so on. Rather, it is to define and enforce best practices which enable business users to do their job more effectively.

This means setting engineering standards, performing code reviews, administering access control, training on tool and SQL usage, certifying data sets and managing metadata. The data engineer makes it easy for anyone in the company to build within the infrastructure, not around it.

Data-as-product thinking

When business users worked with traditional DBAs, they often grew accustomed to slow response times, cryptic answers littered with code fragments, and primitive interfaces. DBAs saw their jobs as technical in nature, and in turn, delivered technical solutions. Business users wanted business solutions.

Perhaps the most prevalent format of a solution comes by way of a “product.” With a product, you expect a reliable, QA-tested, factory-like quality. You get customer service and thoughtful design and thorough documentation. You may hear about product roadmaps or even obsoletion plans.

A good product - rather, a good solution - eliminates friction around doing what you want to do. A good product just works. For a data infrastructure, this means that accessing and learning from data should be frictionless.

Embracing a product mindset goes one step further. When you deliver a product, you don’t just consider operations - the sequence of steps necessary to ship a product. You also consider sales and marketing and finance. A comprehensive data infrastructure will do this as well.

Data-as-product thinking is not just about getting things to work. It is about getting them to work well, and ultimately, to solve people’s problems as effectively as possible.

(previous)(next)