(This post is part of a series on working with data from start to finish.)
We’ve now reached the end of what I call “the philosophy of data”.
We covered what data is and where it comes from, how it is processed and reduced, how it is integrated into models of reality, and finally how we can move from statistical correlation to experimental causation.
We conclude with causation because it is only by way of causation that we can truly understand how a system operates - that is, its causes and effects - and intervene on it in the ways we desire.
One of the prevailing themes throughout this series is the transformation of uncertainty as data moves through the data pipeline. When data is first encoded and discretized, it is rendered less ambiguous and therefore less uncertain. Uncertainty shrinks further when the data is reduced to a particular analytical resolution, and models reduce it once more when they make predictions about the future.
Models are rarely perfect, so we must continually intervene on a system in order to validate our models and then improve them. We improve a model by making hypotheses about how the system operates, taking actions to change it, and observing whether the results align with our expectations. We don’t merely want to act on the data; we want to act in measurably correct ways.
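To make that loop concrete, here is a minimal sketch of hypothesizing an effect, intervening, and checking whether the measured result supports the hypothesis before acting on it. The scenario, metric, and effect sizes are invented purely for illustration, and the simulated data stands in for real measurements.

```python
# A minimal sketch of the hypothesize -> intervene -> measure loop described above.
# The scenario, metric, and effect sizes are hypothetical, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothesis: a change to the system (say, a new recommendation model) raises a
# metric we care about by a meaningful amount.
control = rng.normal(loc=10.0, scale=2.0, size=500)    # measured under the current system
treatment = rng.normal(loc=10.4, scale=2.0, size=500)  # measured after intervening

# Observe whether the result aligns with the expectation, rather than acting on
# the raw difference alone.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
observed_lift = treatment.mean() - control.mean()

print(f"observed lift: {observed_lift:.2f}, p-value: {p_value:.4f}")
if p_value < 0.05 and observed_lift > 0:
    print("The intervention behaved as hypothesized; keep it and update the model.")
else:
    print("The result does not support the hypothesis; revise the model instead.")
```

The point of the sketch is not the particular test, but the discipline: every action on the system doubles as a measurement that either validates or corrects the model.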
Thus far, our philosophical inquiry into data has been abstract, distant, and perhaps even impractical. How do data systems work in practice?
How do we collect data, process it, analyze it, and turn it into action? What kind of data infrastructure do we need? What principles of data analysis should we follow? And finally, how do we achieve the “last mile” of data analytics: operationalizing our insights and measuring our actions?
In the following chapters, we’ll dive into:
As a reminder, the index for the series can be found here in the introduction.