When we think of data, we typically think of structured data. Data which fits cleanly into a data table, neatly organized into rows and columns. Data which is readily amenable to calculating totals, averages and distinct counts. Data which is easy to summarize and, in turn, understand.
We don’t typically think of unstructured data: text, images, videos, PDFs and other proprietary formats, which collectively comprise the bulk of information in the world. In fact, the universe of unstructured data is necessarily larger than that of structured data: all structured data ultimately derives from unstructured sources.
Real life is unstructured. When customers walk into and out of a retail store, video cameras do not store data points like `{"customer_id": "fcfbd2e1da00573f", "direction": "ENTRY", "timestamp": "2023-08-04T14:03:27.215"}`.
Instead, they record raw sensor data which must be converted into records and fields, rows and columns. To produce the data point above, we must classify a customer, match them to existing customers (or not), classify the direction, and finally record the timestamp. Classifying a customer is itself no trivial task: we need a machine learning model trained on a large volume of historical data that associates image data with “people”, and further, people with historical “customers”.
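To make that pipeline concrete, here is a rough Python sketch; every function in it is a stub standing in for a trained model, and none of the names come from a real library:

```python
import json
from datetime import datetime

# Hypothetical models: in practice each of these is a trained ML system,
# not a few lines of Python. They are stubbed here so the pipeline shape runs.
def detect_person(frame):
    return {"embedding": [0.1, 0.2, 0.3]} if frame is not None else None

def match_customer(person):
    return "fcfbd2e1da00573f"  # e.g. nearest-neighbor match against known customers

def classify_direction(person):
    return "ENTRY"  # e.g. a binary classifier over a short track of frames

def encode_entry_event(frame):
    """Convert one raw video frame into the structured record shown above."""
    person = detect_person(frame)
    if person is None:
        return None
    return json.dumps({
        "customer_id": match_customer(person),
        "direction": classify_direction(person),
        "timestamp": datetime.now().isoformat(timespec="milliseconds"),
    })

print(encode_entry_event(frame=object()))
```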
Even a simpler case of unstructured data - the business email - can be instructive:
Hi Jane,
Our team met last week to discuss the Jan. 24 email campaign, and although the overall performance was good, we had a few lingering questions. Could you provide the underlying campaign data from MailChimp when you get a chance?
Thanks,
Sam
From this unstructured data, we might extract the total number of characters in the email (`LEN(email_body)`), the number of lines (`ARRAY_SIZE(STRING_TO_ARRAY(email_body, '\n'))`), and whether or not the email contains the word “data” (`CASE WHEN CONTAINS(email_body, 'data') THEN TRUE ELSE FALSE END`).
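These extractions are equally simple outside the database; for illustration, the same three features in Python, using the email above:

```python
email_body = (
    "Hi Jane,\n"
    "Our team met last week to discuss the Jan. 24 email campaign, and although "
    "the overall performance was good, we had a few lingering questions. Could you "
    "provide the underlying campaign data from MailChimp when you get a chance?\n"
    "Thanks,\n"
    "Sam"
)

num_chars = len(email_body)                # total number of characters
num_lines = len(email_body.split("\n"))    # number of lines
contains_data = "data" in email_body       # does the word "data" appear?
```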
These are simple extractions. What if we wanted to know whether the email `is_marketing_related` or not?
It’s a useful property to know, but not one that is easily parsed from the email text alone. Here, we would again need a machine learning model trained on a large corpus of text to determine whether this email is “similar” enough to a set of pre-classified, marketing-related emails.
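Concretely, the pre-LLM approach looks something like the sketch below: a TF-IDF text model fit on labeled examples. The four training emails here are stand-ins; a real corpus would run to thousands of hand-labeled messages:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in corpus: in practice, thousands of emails hand-labeled as
# marketing-related (1) or not (0).
train_emails = [
    "Huge spring sale! 20% off all items this week only.",
    "Your campaign click-through rates are attached.",
    "Reminder: dentist appointment tomorrow at 3pm.",
    "Can you review the Q2 budget spreadsheet?",
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_emails, train_labels)

print(model.predict(["Could you provide the underlying campaign data from MailChimp?"]))
```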
Such a machine learning model does not come cheaply. With structured data, we are off to the races, but with unstructured data, each classifier requires investment in a new model.
Absent a clear business case to invest in such a model, most unstructured data lives on the periphery of data analysis, and by extension, comprehension.
Since the launch of ChatGPT in 2022, the cost of encoding unstructured data has dramatically fallen.
No longer must we build a bespoke machine learning model to classify whether an email `is_marketing_related` or not; we simply put the email into the LLM and ask. No longer must we ask if an image contains a dog or not; we simply put it into the LLM and ask. If we want to change the encoding to classify cats instead of dogs, it is as simple as updating one’s prompt.
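As a minimal sketch of this pattern, here is what “put the email into the LLM and ask” might look like with the OpenAI Python client; the model name is only an example, and the prompt wording is an assumption, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_marketing_related(email_body: str) -> bool:
    # Ask the model for a strict YES/NO answer so the reply parses cleanly.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; any capable model will do
        messages=[{
            "role": "user",
            "content": "Answer with exactly YES or NO. "
                       "Is the following email marketing-related?\n\n" + email_body,
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

print(is_marketing_related("Could you provide the underlying campaign data from MailChimp?"))
```

Swapping dogs for cats, or marketing emails for legal ones, is now a one-line prompt change rather than a new training run.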
LLMs have ushered in the era of no-code classifiers. Once the data is encoded, it is immediately consumable in data analysis, statistical analysis and even further machine learning.
As the cost of encoding unstructured data falls, the demand for unstructured data will rise. From where will companies source this unstructured data?
Data, whether structured or unstructured, is typically sourced from three main channels:
It is collected
It is scraped
It is bought (or bartered)
Facebook, for example, collects vast amounts of behavioral data from the users of its platforms. Surveillance systems, such as camera and video tracking systems, collect footage of people who enter certain premises. Websites collect profile information from users who register for the platform.
In virtually every case of collection, there is a quid pro quo: a service must be provided in return for user consent. If companies want to collect their own unstructured data, they must then provide a valuable service to get it.
Scraping data is what occurs when consent is not expressly obtained. We can, for example, build a crawler to scrape LinkedIn public profiles or, if we’re Google, the entire Internet. However, precisely because the data is necessarily open to the general public, it is often not as valuable as data which is collected privately.
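A toy crawler takes only a few lines; the sketch below uses a placeholder URL, and a production scraper would also need the politeness machinery (robots.txt checks, rate limiting) omitted here:

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> dict:
    """Fetch a public page and pull a few fields out of its DOM."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.title.string if soup.title else None,
        "text": soup.get_text(separator=" ", strip=True),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

print(scrape_page("https://example.com"))  # placeholder URL
```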
The final route is to buy data or barter for it, for example by offering a bi-directional data integration. Data sets for sale frequently suffer from the same drawback as scraped data: end users typically do not consent to having their data used by anyone except the service providers they directly interface with, and so the data they do consent to share is of low quality. Nevertheless, for data sets which do not contain user information (such as economic or financial data), purchasing is a common acquisition route.
In the future, companies will put more thought into how they expand their data footprint. Unstructured data will no longer be considered prima facie inaccessible: with an LLM encoder, unstructured data can be cheaply encoded. The challenge will be to procure it.
Companies will be increasingly creative and nimble in how they acquire data. They will build free apps to collect data from users, crawlers to scrape repositories of public text (such as website DOMs), and relationships with data brokers who amass large stores of unstructured data. They will be just as thoughtful in how they acquire unstructured data tomorrow as they are in how they acquire structured data today.