2023-04-26 series

The Warehouse Layer: Database administration in a modern data warehouse (Part 8)

(This post is part of a series on working with data from start to finish.)

We’ve now covered the history of data warehouses, as well as how they are architected in dbt to refine raw data materials into finished data products.

Of course, there’s considerable work in managing a data warehouse beyond what strictly lives in the codebase. This work, typically within the ambit of traditional database administration (DBA) and DevOps, spans four major focus areas:

Managing costs
Administering role-based access control (RBAC)
Experimenting with new features
Improving the developer experience (DX)

Managing costs #

Unless you are running your data platform on a single, on-prem database or a distributed data lake infrastructure, you are most likely relying on a cloud service provider (CSP) to manage your data platform for you. As of 2023, your options are:

Google’s BigQuery
Amazon’s Redshift
Databricks’ Data Lakehouse
Snowflake

Whether you manage your own infrastructure on-prem or have a CSP do it for you, there will be costs. These costs can be (1) explicit, such as the direct financial costs incurred for using the service; and (2) implicit, such as the engineering effort required to master a tool, vendor lock-in, and an impoverished feature set.

Because implicit costs are more difficult to measure, although no less important, than explicit costs, they are out of scope for this analysis. Here we’ll review the explicit financial costs assessed by the service provider, as well as how to manage them.

First, most vendors break out costs between the storage of data, which is negligible, and the compute upon data, which is more expensive. Compute is charged on a volume basis (e.g. per terabyte) or temporal basis (e.g. per second).

Second, comparing costs between vendors is not straightforward. Google charges by the terabyte, Snowflake by the “compute credit”, Databricks by the Databricks Unit (DBU) and Redshift by the Redshift Pricing Unit (RPU).

Units of compute are generally measured in multiples of the smallest compute cluster. For example, an “extra-small” (XS) compute node on Snowflake (which, behind the scenes, is an AWS c5d.2xlarge VPS or equivalent) costs $2.00 to run for an entire hour[0]. By extension, a “small” (S) compute cluster has twice the resources (two c5d.2xlarge instances), twice the speed, and twice the cost ($4.00 for the hour).

A unit of compute on Snowflake, however, is not necessarily equal to that on BigQuery or Redshift or Databricks. Each provider implements its query engine differently. A unit of compute on Redshift may exhibit relatively slow performance for one query (compared to other providers), but relatively fast performance for another query.

Naturally, every provider when advertising its query engine chooses the queries which highlight its engine’s strengths and downplay its weaknesses. Most data practitioners generally agree however that BigQuery and Snowflake offer the best performance per unit cost (though this can vary depending on your organization’s needs).

Regardless of the CSP, the ability to estimate and manage costs has squarely fallen under the purview of modern data engineers. Data engineers are uniquely capable of understanding the strengths and weaknesses of various query engines, forecasting an organization’s analytical and performance needs, and finally marrying these with various pricing schemes to develop pro forma cost estimates along monthly and annual time frames. As more conventional data engineering work is pushed to the cloud, data engineers are increasingly tasked with the less conventional work of managing those CSPs.

Cloud offerings, by their very nature, aim to provide “infinite scaling”, meaning any workload and any data volume can be handled by the infrastructure. With infinite scaling, however, comes infinite costs. In practice then, most organizations do not want infinite scaling. They want reasonable scaling and reasonable performance at a reasonable cost. It is the job of the data engineer to provide this.

In an ideal world, the cost of computation is tightly coupled to its associated benefit. If a dashboard costs $400 to update on a monthly basis, does it provide at least $400 of value? In reality, dashboards and data sets are requested in the short-term with little visibility into costs in the long-term, leading to the frequent condition of “runaway costs” when developing infrastructure in the cloud.

Data engineers should therefore exercise fiscal discipline when building out their infrastructure: that is, the intuition and tooling to know whether certain queries are justified by their associated costs.

The first step is to make analytical costs and benefits legible by:

Tagging table refreshes, dashboard loads and ad hoc analytical queries (e.g. by using query tags)
Building dashboards which visualize overall costs, along with which queries incur the greatest cost
Setting alerts when certain cost thresholds are met
Enriching cost dashboards with end-user engagement of analytical dashboards to quantify benefit

The second step is to optimize and reduce those costs by:

Specifying account-wide spending budgets
Reducing job frequencies (if prompt data is not required)
Profiling and refactoring slow queries
Indexing the data using sort keys, cluster keys and/or partition keys

Administering RBAC #

Invariably, as the data warehouse collects more data from across the entire firm, it becomes the case that not everyone (including engineers) should be able to view all parts of it. Financial data, customer data or personnel data often require some form of access control.

The most common methodology for managing permissions is called role-based access control (RBAC). Users are granted one or more roles, and roles are associated with a set of privileges. Roles can be nested under other roles, where the superordinate role inherits all privileges of the subordinate role, thereby forming a role hierarchy.

For example, to manage access of sensitive HR data, an engineer may set up three roles: HR_VIEWER, which permits viewing of HR data, HR_EDITOR, which permits updating of data, and HR_ADMIN, which grants overall resource management over anything related to HR data. HR_ADMIN can inherit from HR_EDITOR, which in turn inherits from HR_VIEWER. A user requiring access to HR data will be granted the appropriate role for their given job functions.

Ideally, it is not the data engineer who continually manages which users should have which roles. This can lead to issues where, for example, an employee switching out of the customer service team never has their CUSTOMER_VIEWER role revoked, and therefore is still able to view customer data despite not being permitted to.

Instead, user and group information should be managed by the information security team in a centralized identity management tool (IDM), such as Okta or Britive. When an employee leaves the company or switches teams, the change is encoded exclusively within the IDM tool, which thereafter propagates changes in access control to all other systems via SCIM.

Experimenting with new features #

Modern cloud offerings continually expand their feature sets, and it is the responsibility of the data engineer to investigate whether new features can improve developer workflows. This entails attending feature demos by vendors, reading white papers and marketing collateral, testing out the new features, and internally training on feature usage across the team.

Because the data warehouse market is competitive, there is generally feature parity across the vendors: if Snowflake implements a new and widely used feature (such as Data Sharing), then soon enough the others will too. Examples of these features include:

Time Travel: the ability to resurface results or tables up to several days (or weeks) prior
Data Sharing: the ability to share data sets between organizations without any ETL required
Query History: the ability to recover SQL queries related to ad hoc analyses performed months or years ago
User Defined Functions: the ability to run custom JavaScript or Python functions on SQL data
Hybrid Tables: the ability to use the same data store for both OLTP and OLAP data

Improving the developer experience (DX) #

Finally, data engineers are tasked with continually investing in the infrastructure in order to improve (their own) developer experience. Anything which reduces the amount of developer time spent on debugging errors, repeating routine actions, or researching how to do things means more time spent delivering data products. This includes:

Automatically provisioning developer environments (clones) to enable isolating testing
Implementing a linting tool (e.g. sqfluff) to standardize SQL code
Using merge request checklists to ensure code review methodologies are followed
Improving semantics and data discovery to quickly answer questions or track down errors (e.g. using dbt docs)
Integrating with various systems to reduce manual work (e.g. SCIM to avoid manually performing RBAC changes)
Documenting standards (e.g. release guides, support guides, style guides, testing guides)

These practices represent “guardrails” to development which enforce high standards upon code quality and reduce the likelihood of errors (e.g. a former employee who still has access to the data warehouse). They enable the core data infrastructure team to serve as, in the words of Maxime Beauchemin, “centers of excellence”, setting the standard for anyone who contributes to the codebase.

[0] For reference, the actual rental cost of the VPS from Snowflake ranges from $0.24 to $0.38, thereby yielding them a compute markup of 6x to 10x.

(previous)(next)