2023-05-07 data

The Semantic Layer: Know Your Data, or KYD (Part 9)

(This post is part of a series on working with data from start to finish.)

As covered in the introduction, a data platform conveys information about the business which can be subsequently used to understand and improve operating performance. Importantly, a successful data platform does not merely produce data. It produces knowledge.

While business users frequently have no issue extracting data from the data platform, knowledge is harder to come by:

These pain points can be grouped into two themes:

  1. Data discovery; and
  2. Data correctness

Business users want to know where to look for answers to their questions, and when they find them, they want them to be correct. 

In absence of a self-service tool, business users will ask the core data platform team directly to answer their questions (“Where can I find information on user activity? Is this figure correct?”). Any self-service tool which exists in the semantic layer must as a result satisfy these needs at least as well as the core data platform team itself.

Data discovery #

When business users have questions, they need to know where to find answers. As the last two decades have shown, the definitive solution to this problem has been a search engine: “Just Google it.”

Google’s famously spartan home page illustrates what users today demand in search. They do not want filters or categories or sort options to choose from. Instead, they want a simple “linguistic interface” - the search box - to understand their question and return the most relevant results. How exactly Google does that is completely opaque to the user.

Behind the scenes, of course, Google injects considerable structure into the vast sea of content floating within the Internet (a process called “indexing”). Like pairing a wine with a cheese, this is the mechanism by which Google associates any set of search results with a given search query. 

In the data landscape today, it is “data catalog” or “data portal” tools which offer such search capabilities. Like Google, they expose a search box and produce results such as:

It is important to note that, for the most part, business users do not ultimately want any of these search results. Instead, they want answers. If they could harness a ChatGPT-like linguistic interface to answer their questions, then they would never need to view individual metrics or data sources. They only examine these when immediate answers to their questions are unavailable.

And how does Google choose which results to return, and in what order? In an ideal world, it apprehends your search query perfectly. then returns results which are (1) most relevant and (2) most correct. Results which are irrelevant or incorrect do not constitute high-quality answers to queries. 

While relevance depends on successful semantic parsing, “correctness” constitutes an entirely different problem which fundamentally depends on consensus.

Data correctness #

If the calculations returned by the data catalog are incorrect, then they are effectively useless. As an example, if a user searches for “monthly profit and loss (P&L) on our new widgets product line” and the data returned is an order of magnitude off the correct figures, then the answer is as good, or worse, than no answer at all.

How would we know this data is correct?

First, the metric must be defined correctly. If the metric defines P&L as “net revenues less operating expenses”, but we define it as “net revenues less operating expenses and capital expenditures”, then the metric will be wrong for our needs, even if it is calculated correctly.

Second, the metric must be calculated correctly. If the metric defines P&L as “net revenues less operating expenses”, but the actual calculation uses gross revenues instead of net revenues, then the metric will be incorrect, even if it is defined correctly. 

Data definitions #

Before business users are able to achieve visibility into their operations, they must first specify what they want to see. They must define metrics which are relevant to their operations, such as “average revenue per customer” or “customer lifetime value” or “90th percentile latencies”, and then subsequently calculate them.

These definitions typically live in a “business glossary.” Collectively, they minimize ambiguity over what business terms can possibly mean. A “daily active user” may be defined for example as “any non-personnel, non-test user who logs into the application on a given day based on UTC timestamps”.

At first glance, this definition seems fairly palatable, although even it does not completely escape scrutiny. Business users headquartered in New York may question why “days” are calculated using the more systems-relevant UTC timestamps instead of the more business-relevant EST timestamps.

Indeed, when teams cannot agree on how particular metrics should be defined, they must inevitably create new definitions which satisfy their own needs (such as daily active user, UTC and daily active user, EST). Teams which continue to use shared business terms (daily active user) whose definitions do not satisfy their own individual needs will continually question the correctness of the data.

Because metrics are defined and tailored to the needs of individual users and teams (that is, people), it is important to annotate for whom a given metric is relevant. If one dashboard illustrating system availability metrics is used by a junior systems engineer while another is used by the CTO, then the latter will appear to be more “authoritative” and “credible” than the former. 

As we learned in the philosophy of data, data correctness reflects the degree to which independent opinions converge on the same answer. If there is firm-wide consensus on which dashboards and which metrics are “correct”, then any metrics created by individual teams must first and foremost reconcile with the firm-wide metrics in order to be correct.

Due to the primacy of consensus in assuring data quality, data catalog tools must always enrich metric data with stakeholder metadata. Airbnb’s Dataportal provides an illustrative example of how stakeholder information, such as (a) usage popularity, (b) discussion boards, and (c) upvotes or approvals on metrics can all be used to fortify consensus.

Data calculations #

After business users specify and define the metrics which are relevant to their operations, they must calculate them. Here, the technical implementation of the calculation should correspond exactly to the definition found in the business glossary. 

As these calculations are often performed using SQL, this effectively represents an exercise in “English-to-SQL” translation. The specifications and constraints of the business definition must be precisely encoded into SQL.

Taking the example above of calculating daily active users, EST, one might use the following SQL:

WITH system_logs AS (
    SELECT DISTINCT 
        DATETRUNC(DAY, server_timestamp::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM api_logs

    UNION ALL

    SELECT DISTINCT 
        DATETRUNC(DAY, created_at_est::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM click_logs

    UNION ALL

    SELECT DISTINCT
        DATETRUNC(DAY, connector_synced_at::TIMESTAMP_LTZ) AS log_timestamp_est, 
        user_id 
    FROM mobile_logs
)

SELECT 
    s.log_timestamp_est AS login_date_est, 
    COUNTD(s.user_id) AS num_users
FROM system_logs AS s
    LEFT JOIN users AS u ON s.user_id = u.id
WHERE TRUE
    AND NOT u.is_test
    AND u.employee_id IS NULL
GROUP BY 1

Despite its apparent simplicity, there are many places this calculation can diverge from its lexical definition, including:

Due to these potential sources of difference, business users typically want end-to-end visibility into how a given metric was calculated - that is, its data lineage. If and only if all of these inputs are “correct” according to the business user, then the calculation as a whole is said to be correct.

Automated testing can be used to validate both the integrity of the logic and the integrity of the data. Unit tests, for example, test that the logic applied to the data is correct. More specifically, this necessitates writing tests to ensure the sources, fields and transformations do not undergo a “regression” in the form of some future code change (e.g. a developer inadvertently swapping which fields are used).

Data tests, on the other hand, assume the logic is correct and instead test the data itself. This can include validating the recency of the data, detecting outliers, identifying missing or unexpected values, and reconciling row counts between tables. Many data tests (unlike unit tests) come built-in to dbt, and more are available from packages such as dbt-utils and dbt-expectations.

The semantic layer today #

Self-service tools in the semantic layer today have not yet matured to the same degree as those in the data integrations, warehouse or presentation layers; however, they represent an area of active investment and growth. 

Popular data governance providers today include Collibra, OvalEdge, Atlan, Acryl, Sifflet and Select Star. All of these provide the basic capabilities of enabling data discovery and data validation.

Many companies continue to use homegrown solutions to administer their data governance (or avoid data governance entirely), although with the growing complexity of data infrastructures, this is increasingly inadvisable. Documentation, metadata and manual testing should move out of knowledge bases and shared documents and into dedicated data governance tools.

(previous)(next)