2023-04-05 series

The Transit Layer: Data integrations & data connectors (Part 5)

(This post is part of a series on working with data from start to finish.)

In the data transit layer, we answer the question: given data over there, how do we get it over here?

For example, data we want to analyze might live on a website, such as basketball statistics from ESPN or economic statistics from the Federal Reserve’s FRED system. Or, data might live in an enterprise resource planning system (ERP), such as SAP or NetSuite, or a customer relationship management system (CRM), such as Salesforce or HubSpot. Finally, data might be exposed by a vendor over an API or FTP, or sent by a team member as a CSV or Excel file.

In each case, the goal of data transit is to get data over there, here, into the data warehouse.

This is conventionally called “extract-transform-load”, or ETL, which is the process of extracting the data from a data source, transforming it into a schema-compliant structure, and finally loading it into the data warehouse.
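To make the three steps concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for the data warehouse and a small made-up CSV as the data source; the table name `player_points`, the column names and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Made-up source data; the second row has a value that violates the schema.
RAW_CSV = """player,team,points
Ada,NYK,31
Grace,BOS,not_available
Linus,LAL,18
"""

def extract(raw_text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: coerce rows to the target schema, skipping rows that cannot comply."""
    clean = []
    for row in rows:
        try:
            clean.append((row["player"], row["team"], int(row["points"])))
        except (KeyError, ValueError):
            continue  # a real pipeline would log or quarantine the bad row
    return clean

def load(conn, records):
    """Load: create the destination table if needed, then upsert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS player_points "
        "(player TEXT PRIMARY KEY, team TEXT, points INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO player_points VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT player, team, points FROM player_points ORDER BY player"
).fetchall()
print(loaded)
```

Only the two rows that satisfy the schema reach the table. Silently dropping the third is a simplification made for the sketch, not a recommendation.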

Data integration providers such as Stitch, Segment, Fivetran, Singer and Airbyte have emerged over time to provide a comprehensive solution to this particular problem. In just a few clicks, these vendors make it easy to select a data source, choose which data we want, and sync it into the data warehouse. While these tools are non-trivial behind the scenes, to end users they simply work: data over there magically appears over here.

When evaluating these tools, the most important characteristics tend to be availability (the service is consistently operational), coverage (the service supports many data source integrations) and cost (most vendors charge per million rows transferred). Because of how straightforward vendors make the ETL process, only in rare cases should we build instead of buy.

ETL providers, however, typically do not have integrations for every data set we could possibly want. For example, they would not scrape content from the front page of the New York Times, nor would they integrate with the APIs of niche service providers (such as Buttondown). They may not readily sync FTP data, nor would they download and parse attachments sent over email.

In these cases, to get data “over there” here into the data warehouse, we would have to write our own data extractors, crawlers, transformers and loaders. This is in fact an instructive exercise to understand what exactly ETL providers must generally do behind the scenes, including:

Integrating with various vendor APIs (which may or may not have helpful documentation)
Adapting to service unreliability using retry logic, exponential backoffs and request throttling
Validating and reconciling data equality post-transfer between source and destination
Orchestrating concurrent workers and scheduling jobs
Performing schema validation on both the source as well as the destination (in case any changes were made)
Automating schema correction and implementing soft deletes
Supporting full resyncs and partial data resyncs (i.e. incremental changes only)
Masking sensitive data to particular users when queried
Permitting history mode (change-data-capture) in addition to the more typical upsert mode
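
As an illustration of one item on this list, adapting to service unreliability, here is a minimal Python sketch of retry logic with exponential backoff. The `TransientError` class and the `flaky_fetch` function are hypothetical stand-ins for a real vendor API call and its failure modes, and the random jitter added to each delay is an assumption of this sketch rather than something every implementation does.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical error standing in for a timeout or a rate-limit response."""

def call_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on TransientError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles on each attempt: 0.5s, 1s, 2s, ... capped.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Sleep a random amount up to the ceiling so workers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))

# Hypothetical source that fails twice before succeeding.
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("simulated outage")
    return {"rows": 42}

# Pass a no-op sleep so the demonstration finishes instantly.
result = call_with_backoff(flaky_fetch, sleep=lambda seconds: None)
print(result, attempts["count"])
```

Injecting `sleep` as a parameter keeps the function easy to test. In production the default `time.sleep` would be used.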

What about reverse ETL?

The notion of data transit sometimes implies the reverse of the process described above: given data over here, how do we get it back over there? How do we transfer clean, merged, enriched data from the data warehouse back into our enterprise systems, be it SAP or Jira or Salesforce? This is typically referred to as “reverse ETL”.

One might think that we can simply use the same connectors we used for ETL - after all, connectivity should work both ways, shouldn’t it?

In fact, bi-directional connectivity doesn’t come for free. The nature of ETL is about performing reads against the data source and writes against the data warehouse. This means implementing the GET APIs of data sources and issuing DDL/DML (such as CREATE and INSERT statements) within the data warehouse.

On the other hand, reverse ETL is about performing writes against the data destination after having read data from the data warehouse. This means implementing the POST APIs of data destinations and issuing DQL (SELECT statements) within the data warehouse.

While the SQL involved in DDL, DML and DQL is fairly simple, implementing the POST APIs of various data destinations is not. When writing data to data destinations, you must manage write throughput (so as not to overload the system), bundle data into batches, retry failed requests, implement idempotence (so that issuing the same successful request twice has no further effect), and more generally safeguard against incorrect logic (lest you inadvertently overwrite all data in an enterprise system).
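
Two of these concerns, batching and idempotence, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `FakeDestination` is an in-memory stand-in for an enterprise system's write API, and deriving an idempotency key from a hash of the batch contents is one possible approach, not how any particular vendor does it. Throttling and retries are left out to keep the sketch short.

```python
import hashlib
import json
from itertools import islice

def batches(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def idempotency_key(batch):
    """Derive a stable key from the batch contents (an assumption of this sketch)."""
    payload = json.dumps(batch, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class FakeDestination:
    """Hypothetical in-memory stand-in for a destination's POST API."""

    def __init__(self):
        self.records = {}
        self.seen_keys = set()
        self.applied_writes = 0

    def post(self, batch, key):
        if key in self.seen_keys:
            return "duplicate-ignored"  # same request seen before: do nothing
        for record in batch:
            self.records[record["id"]] = record
        self.seen_keys.add(key)
        self.applied_writes += 1
        return "applied"

def sync(records, destination, batch_size=2):
    """Write records to the destination in batches, one idempotency key per batch."""
    return [destination.post(batch, idempotency_key(batch))
            for batch in batches(records, batch_size)]

rows = [{"id": i, "score": i * 10} for i in range(1, 6)]  # made-up warehouse rows
crm = FakeDestination()
first = sync(rows, crm)   # the initial sync applies three batches
second = sync(rows, crm)  # an accidental re-run changes nothing
print(first, second, crm.applied_writes)
```

Running the same sync twice leaves the destination unchanged the second time, which is the property we want when a job is retried after a timeout whose original request had in fact succeeded.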

None of this is trivial, and as a result, few traditional ETL providers have yet expanded into reverse ETL. Instead, it is a nascent landscape with startups like Hightouch and Census currently at the forefront.

As the space matures, we can expect reverse ETL to one day become as push-button as ETL is today. Until then, however, most data engineers will be left to develop these integrations themselves.
