2023-01-19 databest

Working with data from start to finish (Intro)

(Jump to the series index below!)

When I first entered the workforce, I was assigned to a project that required processing several gigabytes of financial trade data. As this was too much data for Excel, I began the work in Python (and more specifically, pandas).

I remember how little things that should have been easy, like dropping erroneous rows of data, could quickly grow monstrously difficult. I pored over answers on Stack Overflow, trekked through pages of Google results, and fumbled my way through chapters of programming textbooks. I poked and prodded and tinkered with my code endlessly until it finally capitulated and ran.

I remember that now-familiar rollercoaster of emotion when diving headlong into something new: perplexity and frustration when things didn’t work, followed by insight and relief when they did. I knew what I needed to do; I just didn’t know how.

Over the subsequent years, I developed a pretty good handle on how to code. I could formulate what I wanted, write the code, and rapidly debug once-cryptic error messages. The code was no longer the hard part.

Instead, new questions emerged: What was version control? How did databases work? Should I use static typing? What made good code? What was software testing or software deployment? And why did I need any of these things, given I was fully capable of writing code? I descended further into the alphabet soup of software engineering: VCS, RDBMS, NoSQL, CI/CD, OSI, HTTPS and more.

Several years later, with considerable hands-on, professional experience under my belt, I developed a much better handle on software engineering. I wrote webapps and application code, configured and populated databases, and tested and deployed microservices. The software engineering was no longer the hard part.

By this point, I managed a team and regularly built “data products”. That meant data sets, dashboards and insights. We were very good at building things, but we were less good at building things that were useful. People said they wanted data, then didn’t end up using it. Data sets, dashboards and insights were built, then forgotten. Our data products lived on the periphery of operational workflows.

And so I wondered: What was the ultimate goal of our data products? Why were we building? I took off the engineer’s hat and put on the business person’s hat. I reflected on what exactly the point of all this data was, and realized that it was to effect change. In other words, the litmus test of data was action. Data should not live on the periphery of execution.

It became clear to me that we needed to expand our mandate of data products into taking action and measuring change. If we built things that did not lead to any measurable change, then we had failed. We needed to orient ourselves around action and results. This meant embedding ourselves into business teams, understanding their operational workflows, integrating data into those workflows, and finally measuring the results of our changes. This was the last mile of analytics.

After adopting this dual mandate, we worked with business teams more closely than we ever had before. Instead of operating reactively via “request-response”, we worked proactively with teams to define their quantifiable goals (KPIs), explore historical data for potential drivers (statistical correlates), convert these into actual, testable hypotheses, and finally measure the results of these experiments.

Using this framework, we helped our marketing team slice segmented audiences and grow email conversion rates by several percentage points over a six-month period, our sales team double their top-of-funnel prospecting in a quarter, and our product team highlight “sticky features” to increase user engagement by 20% in about a month. Data became more applied than ever, and as an added benefit, that made it fun.

I wanted to document some of the knowledge I had built up over the years, in particular about working with data from start to finish: what it is, how you get it, how you process it, how you make sense of it, how you turn it into action, and how you know if it’s working. This spanned a fairly cross-functional skill set: data engineering, data analytics, data governance, statistics and domain expertise.

Those ideas turned into the series of essays below.

I hope you enjoy reading it as much as I enjoyed writing it!
