2023-01-19 series

Working with data from start to finish (Intro)

Back in 2014, just after I had graduated from university and started my career as a budding financial analyst, I was quickly thrown into my first work project: calculating damages in an alleged stock market manipulation case. My job was to parse millions of trades (several GB of data) and aggregate them into a financial model.

“Seems simple enough”, I thought, until I realized that the data was so large it would not fit into Excel. I needed a different tool. I had heard about Python pandas before but never used it; this seemed as good an opportunity as any. I pip-installed the library, pulled up the tutorial and grabbed a coffee. I was in for a long ride.

Pandas was powerful - it crunched through millions of records like it was nothing - but it was also confusing. I remember how little things which should have been easy, like dropping particular rows of data, grew monstrously difficult. I pored over answers on StackOverflow, trekked through pages of Google, and fumbled my way through chapters of programming textbooks. I pulled and poked and prodded my code endlessly until it capitulated and ran.

I remember that rollercoaster of emotion when diving headlong into something new: perplexity and frustration when things didn’t work, followed by insight and relief when they did. I knew what I needed to do; I just didn’t know how.

This process went on for years. Just as I gained a handle on one thing, it seemed there was always another to learn. After I could debug cryptic error messages, I discovered I really needed to learn SQL. Then it was NoSQL. Then VCS. Then CI/CD, OSI, REST and so on. The alphabet soup of software education demanded a never-ending list of ingredients. Peel back the onion. Start again.

It wasn’t until after about five years of continuous, effortful studying and hands-on experience that I finally felt “comfortable” with programming. I could write and deploy CRUD apps in Django, set up and manage my own Postgres databases, and navigate the labyrinthine plumbing of Git and CI/CD. The world of software, which once seemed opaque and impenetrable, finally made sense.

By this point in my career, I was managing a small team of data engineers and analysts and regularly building “data products”. As engineers first, we were good at building things. We were not as good at building useful things. People said they wanted data, then didn’t end up using it. Data sets, dashboards and workflow tools were created, then forgotten. Each quarter felt like a miniature Pompeii: build, vaporize, forget.

What was the ultimate goal of our data products? Why were we building at all? I took off the engineer’s hat and put on the business person’s hat. I reflected on what exactly the point of all this data was, and realized that it was to effect change.

In other words, the litmus test of data was action. Data should not live on the periphery of execution. It should not be “spooky action at a distance”. It needed to be hands-on.

It became clear to me that we needed to expand our mandate of building data products into taking action and measuring change. If we built things that did not lead to any measurable change, then we had failed. We needed to orient ourselves around action and results. This meant embedding ourselves into business teams, understanding their workflows, integrating data into those workflows, and finally measuring the results of our changes.

I called this a dual mandate, meaning we still retained our original mandate of processing and understanding data, but now had an additional mandate of converting that data into tangible action. Instead of operating reactively via “request-response”, we worked proactively with teams to define their quantifiable goals (KPIs), explore historical data for potential drivers (statistical correlates), convert these into testable hypotheses, and finally measure the results of these experiments.
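To make the last step, measuring the results of an experiment, a little more concrete, here is a minimal sketch of one common way to compare conversion rates between a control group and a test group: a two-proportion z-test. This is an illustration, not necessarily what we ran; the function name `two_proportion_z_test` and the counts passed to it are hypothetical and invented for the example.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rate of group B against group A.

    Returns (relative lift of B over A, z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b / p_a - 1, z, p_value


# Hypothetical counts, made up for illustration only.
lift, z, p_value = two_proportion_z_test(conv_a=50, n_a=1000, conv_b=80, n_b=1000)
print(f"lift={lift:.0%}, z={z:.2f}, p={p_value:.4f}")
```

A small p-value suggests the difference between the two groups is unlikely to be noise alone; whether the lift is large enough to matter is still a business judgment.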

The change in mindset, from builders to operators, was an important one. We sliced granular audiences for our marketing team to grow email conversion rates by 2x, surfaced qualified leads to our sales team and tripled their conversion rate to over 10%, and highlighted “sticky features” to our product team to increase user engagement by 20%. Data became more hands-on than ever, and as an added benefit, that made it fun.

We could not have done this work without being builders. Our work is technical. We needed to know everything about the data stack: where you get data, how you define it, how you process it, how you make sense of it, how you turn it into action, and how you know it’s working.

At the same time, our work is practical. Data is only useful to the extent it materializes into concrete results, so that is also what I wanted to write about. Those ideas culminated in the series of essays below: how to work with data from start to finish.

Introducing data, from start to finish (this post!)

Data philosophy (Chapter 1)

Data infrastructure (Chapter 2)

Data analytics (Chapter 3)

Know the business and know your customer (Part 1)
Data intuition: understanding how data maps to the business (Part 2)
Defining and calculating “metrics of interest” (Part 3)
Data visualization: What does the data look like? (Part 4)
Generating hypotheses using exploratory data analysis (Part 5)
Measuring results using statistics and A/B testing (Part 6)
Telling your story: how to evangelize the impact of data (Part 7)

Data organizations (Chapter 4)

Crafting a mission, securing buy-in and being accountable (Part 1)
Org structures and seating charts (Part 2)
Product roadmaps and project planning (Part 3)
Recruiting, onboarding and training (Part 4)
On the primacy of data governance (Part 5)
Showcasing work products and being data evangelists (Part 6)
