(This post is part of a series on working with data from start to finish.)
Although I’ve been working with data for over a decade, I rarely stopped to think about what exactly data was. What was this thing that was so potentially valuable, yet also ubiquitous? Why was it so worthless upon collection yet so desirable once refined, processed and enriched? Where did all this data come from, and what was its ultimate purpose? To find out, I wanted to start at the beginning - that is, with the philosophy of data.
As in any philosophical enterprise, we begin by defining our terms: what is data?
Data, as it is commonly understood, is the set of one or more (1) measurements encoded into (2) discrete values, typically (3) recorded on a physical medium, such as a piece of paper or computer hard disk.
Data does not exist “naturally” - after all, you cannot see or hear data. Data is instead produced through the process of measurement. Measurement necessitates instrumentation. Your eyes, for example, are “sensing instruments” which observe your surrounding environment. As you count the number of chairs in a room, the process of measurement transmutes visual information into discrete data.
I impose discreteness on the definition of data because, while there clearly exists ambiguity in the real world, you will not witness such ambiguity in recorded data. A measurement always classifies, and therefore discretizes, observations into “this” or “that”, each of which has clear conceptual bounds. For example, imagine you’re tasked with counting the number of seats in a large auditorium. You may record 186 or 187 seats, but you will never record “not sure” - even if you are! The final, recorded measurement admits no uncertainty. The data collapses any ambiguity inherent in the initial observation.
This process of measurement in fact underlies the fundamental distinction between information and data: information can be ambiguous, while data is not. You may have information that there are many chairs in the auditorium, but you do not have data until you assign a discrete value to that information. Information retains its intrinsic ambiguity and pliability. Data, on the other hand, is always coerced into clearly defined units.
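As a minimal sketch of this collapse, consider the auditorium example in Python. The `Observation` class and the `record` function are hypothetical names invented for illustration, and rounding down to the midpoint is just one assumed way of settling on a value; the point is only that whatever gets recorded is a single discrete number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical ambiguous impression, e.g. 'either 186 or 187 seats'."""
    low: int
    high: int


def record(observation: Observation) -> int:
    """Collapse an ambiguous observation into one discrete value.

    Assumption: we settle on the midpoint, rounded down. Any rule would do,
    as long as it returns exactly one value.
    """
    return (observation.low + observation.high) // 2


# The information is ambiguous; the recorded data is not.
seats = record(Observation(low=186, high=187))
```

Whatever uncertainty the `Observation` carried is gone by the time `seats` is written down, which is the distinction between information and data drawn above.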
Image credit: DALL-E
As evidenced by its relative absence, you don’t need data to operate in the world. You can get up in the morning, go to work, make dinner and get ready for bed all without the use of data. You operate the vast majority of your life without data, so why collect data at all?
We might define the types of tasks you perform in your everyday life as local. You really only need your local, observable environment - and your physical senses for comprehending such an environment - to accomplish whatever you need to do. On the other hand, many of the problems we care about manifest over longer periods of time and larger physical scales.
For example, you may be able to move a few hundred boxes through a warehouse over the course of a day, but you will be hard-pressed to move a few hundred thousand over the course of a year. The information we need to solve such a problem is simply not observable to our local, physical senses. To render this problem tractable, we require observability at a greater scale, and for that, we need data. Data therefore lends us observability at scale, allowing us to tackle newer and larger problems.
All observation begins with an instrument. In order to see, you need eyes. Instruments sample information from the world around us, rendering observable what was previously unobservable.
Imagine, for example, you are tasked with counting the amount of foot traffic passing daily through a busy retail storefront. There are many practical considerations which make this task challenging: you cannot count large groups of people quickly enough when they enter at the same time, you may get tired after several hours, or you may simply lose count. Put simply, you are not a suitable measurement instrument for the task at hand.
What then constitutes an ideal measurement instrument? Generally it is one that maximizes three distinct properties: resolution, correctness and precision.
All instruments sample reality at some degree of resolution - such as a fine grain or a coarse grain - and this resolution can be temporal or spatial in nature. A security camera which takes photos every second has a higher resolution than one which takes photos every minute; similarly, one which takes 8-megapixel photos has a higher resolution than one which takes 4-megapixel photos. An instrument’s measurements can be variously correct, corresponding to what we believe the “true” state of affairs to be, as well as variously precise, corresponding to the stability with which the instrument produces measurements.
An incorrect camera would be one whose lens is covered by a fraudulent image of the room, while an imprecise camera would be one which takes snapshots at an irregular cadence or with unpredictable image artifacts. The ideal security camera is one which correctly conveys the state of the room it observes at the highest degree of detail with the greatest regularity.
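To make the difference between correctness and precision concrete, here is a rough Python sketch. It assumes we can take repeated measurements of a quantity whose “true” value we believe we know; the function names and the sample readings are made up for illustration, and using the mean offset and the standard deviation is one common way to quantify these ideas, not the only one.

```python
import statistics


def correctness_error(measurements, true_value):
    """Distance between the average measurement and the believed-true value.

    Smaller means more correct.
    """
    return abs(statistics.fmean(measurements) - true_value)


def imprecision(measurements):
    """Spread (population standard deviation) of repeated measurements.

    Smaller means more precise, i.e. more stable.
    """
    return statistics.pstdev(measurements)


# Hypothetical readings from two foot-traffic counters; assume the true count is 100.
true_count = 100
stable_but_wrong = [90, 90, 90, 90]    # precise, yet consistently off
right_but_noisy = [80, 120, 95, 105]   # correct on average, yet unstable
```

Under these assumptions, the first counter is perfectly precise but incorrect, while the second is correct on average but imprecise. An ideal instrument would score near zero on both functions.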
Image credit: DALL-E