More dangerous subtleties of JOINs in SQL

Baron Schwartz gives two great examples of just how common, yet profoundly consequential, semantic errors in SQL can be. Seemingly trivial differences in syntax can produce results that look correct - that is, what the user intended to produce - but in actuality are not. These bugs do not show up at compile time or run time, but rather in meetings, when colleagues or clients scratch their heads over numbers that "don't look right." Semantic errors are especially insidious because they can remain undetected for months until someone happens to look in the right place, at which point the damage may already be done. Could you imagine trying to manage your cash flow based on a completely misreported balance sheet?

Today I will cover two more major semantic errors, both of which occur during SQL joins:

  1. Join duplications
  2. Join misses

Finally, I'll cover a few steps you can (and should) always take to mitigate the risk of committing these errors.
 

Join duplications

A common misconception about SQL joins is that they give you as many rows as the join specifies - for example, that a left join will give you the number of rows in the left table, or that an inner join will give you the number of rows common to both tables.

This is not true: SQL joins will give you at least as many rows as the join specifies. For example, if you have a left table with 10 rows, you are guaranteed to have at least 10 rows after the join, but you may also have 20 or 100 depending on what you are joining to.

This is best demonstrated by example - consider the two following tables:
 

sales
date product customer quantity
2017-07-01 Stapler Alice 1
2017-07-01 Tissues James 3
2017-07-05 Soap Kevin 2
2017-07-06 Tissues Kelly 2
2017-07-06 Pack of pens Richard 3
products
product price creation_date_utc
Stapler 4.50 2017-07-01
Tissues 2.75 2017-07-01
Soap 3.50 2017-07-01
Pack of pens 2.50 2017-07-01
Tissues 3.00 2017-08-01

Let's say I want to join the `products` table to our log of sales data (the `sales` table) in order to calculate total revenue, being careful not to drop any sales. In this case I would use a LEFT JOIN because, no matter what, I want to make sure I have the same number of sales transactions before and after the join. The SQL looks like:

 
SELECT *, (s.quantity * p.price) AS revenue 
FROM sales s
LEFT JOIN products p ON s.product = p.product

I expect to get a result set like this:

date product customer quantity product price creation_date_utc revenue
2017-07-01 Stapler Alice 1 Stapler 4.50 2017-07-01 4.50
2017-07-01 Tissues James 3 Tissues 2.75 2017-07-01 8.25
2017-07-05 Soap Kevin 2 Soap 3.50 2017-07-01 7.00
2017-07-06 Tissues Kelly 2 Tissues 2.75 2017-07-01 5.50
2017-07-06 Pack of pens Richard 3 Pack of pens 2.50 2017-07-01 7.50

Total revenue: 4.50 + 8.25 + 7.00 + 5.50 + 7.50 = $32.75

Instead however I get this:
 

date product customer quantity product price creation_date_utc revenue
2017-07-01 Stapler Alice 1 Stapler 4.50 2017-07-01 4.50
2017-07-01 Tissues James 3 Tissues 2.75 2017-07-01 8.25
2017-07-01 Tissues James 3 Tissues 3.00 2017-08-01 9.00
2017-07-05 Soap Kevin 2 Soap 3.50 2017-07-01 7.00
2017-07-06 Tissues Kelly 2 Tissues 2.75 2017-07-01 5.50
2017-07-06 Tissues Kelly 2 Tissues 3.00 2017-08-01 6.00
2017-07-06 Pack of pens Richard 3 Pack of pens 2.50 2017-07-01 7.50

Total revenue: 4.50 + 8.25 + 9.00 + 7.00 + 5.50 + 6.00 + 7.50 = $47.75

What happened?

We can see that the two rows for "Tissues" in the left table got duplicated - we now have four of them! This is because, when joining on the `product` column, the join condition (or "join-predicate") is true for multiple rows. In other words, "Tissues" in the left table equals both "Tissues" rows in the right table, so we get two rows for which the condition is true. This happens twice, once for each "Tissues" row in the left table, yielding two duplicated rows.

One can imagine how problematic this is: we have magically grown our sales without ever completing another transaction! Join duplications cause us to overstate our results.

A join duplication like the one shown above seems like an easy mistake to catch: you can clearly see the duplication in the data. But is it as easy to catch if this is your query?

 
SELECT SUM(s.quantity * p.price) AS revenue 
FROM sales s
LEFT JOIN products p ON s.product = p.product

Suddenly your result set looks like this:

revenue
47.75
 

Not so easy to spot anymore!

Without getting too deep into the theory, this shouldn't happen if you have a correctly specified data model with proper normalization. You should be performing joins through foreign keys, which link to primary keys, which must be unique, thereby eliminating the risk of join duplications. Alas if only real-world data were so simple! The reality is that data models are rarely ideal, meaning you have to prepare for unexpected results when working with them.

Before we jump to how to fix this, let's cover the other common semantic error with joins.

Note: join duplications occur with every type of join, including inner joins.
 

Join misses

I call these join misses because I think of them intuitively like cache misses: you'd like your result to be pulled from the cache, but it isn't there, so you get nothing instead. With join misses, we expect to join on a certain key, but because the join condition is not true (for various reasons), we get nothing instead.

Take this example:

sales
date product product_cost_code quantity
2017-07-01 Stapler J 1
2017-07-01 Tissues T 3
2017-07-05 Soap R 2
2017-07-06 Tissues T 2
2017-07-06 Pack of pens X 3
product_costs
product_cost_code product_cost
J 1.80
T .60
R 2.10

Say we want to find the total costs associated with each of our sales - to do this we join the `product_costs` table to the `sales` table. Again, we never want to drop sales data, so we use a LEFT JOIN. Our query looks like:

 
SELECT *, (quantity * product_cost) AS total_cost
FROM sales s
LEFT JOIN product_costs c ON s.product_cost_code = c.product_cost_code

Our result set looks like:
 

date product product_cost_code quantity product_cost_code product_cost total_cost
2017-07-01 Stapler J 1 J 1.80 1.80
2017-07-01 Tissues T 3 T .60 1.80
2017-07-05 Soap R 2 R 2.10 4.20
2017-07-06 Tissues T 2 T .60 1.20
2017-07-06 Pack of pens X 3 NULL NULL NULL

Our total cost across all sales is: 1.80 + 1.80 + 4.20 + 1.20 = $9.00.

But is it really? Surely the pack of pens costs something? Just because we don't have a record in the `product_costs` table for it doesn't mean there's really no cost, it simply means it's not in the database. This join miss merits more investigation. Perhaps the pack of pens is an invalid transaction; perhaps it should have a record in the `product_costs` table. It's hard to tell, but we can be fairly sure that our total cost isn't $9.00.

Again, this seems like it should be pretty obvious to spot - after all, there's an entire row of NULLs from the right table!

How about now?

 
SELECT SUM(quantity * product_cost) AS total_cost 
FROM sales s
LEFT JOIN product_costs c ON s.product_cost_code = c.product_cost_code

Results:

total_cost
9.00
 

Suddenly it's not so clear if the $9.00 is correct or not.

Unlike join duplications, which overstate our results, join misses understate our results. 

 

The solution: be a data skeptic

Perhaps the most valuable skill in data analysis is skepticism: what are all the ways my analysis could be wrong? It is all too easy to fit a hypothesis post hoc to explain why your numbers make sense - it is far harder to sanity check along the way and see whether each data transformation confirms your a priori hypothesis. Preventing join duplications and join misses requires a high degree of skepticism.
 

Join duplications

Join duplications have a simple solution: always count the number of rows before and after a join. If you start with 100 transactions and after joining a table you have 140, you need to investigate.

In SQL, this means running a COUNT(*) after each join. If you join several tables (eg. 5+) and don't take any COUNT(*)s along the way, you are taking a large risk. Better to join, check for join duplications, check for join misses, and then incrementally continue joining.

In Python pandas, this is easy to do with a decorator. I use a decorator around pandas.merge() called @guarantee, which guarantees that the number of rows going into the join equals the number of rows coming out - otherwise it throws an error.
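Here is a minimal sketch of what such a decorator might look like (the @guarantee name comes from the text above; the left_merge helper and the choice to raise a ValueError are my own assumptions, not the author's actual implementation):

import functools

import pandas as pd

def guarantee(merge_func):
    """Fail loudly if a merge changes the number of rows."""
    @functools.wraps(merge_func)
    def wrapper(left, right, *args, **kwargs):
        result = merge_func(left, right, *args, **kwargs)
        # For a left join, more rows out than in means the right table's
        # join key was not unique - a join duplication.
        if len(result) != len(left):
            raise ValueError(
                "Join changed row count: {} in, {} out".format(len(left), len(result))
            )
        return result
    return wrapper

@guarantee
def left_merge(left, right, **kwargs):
    # Hypothetical wrapper around pandas.merge() that enforces the guarantee.
    return pd.merge(left, right, how="left", **kwargs)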

If this guarantee fails, you may need to take some kind of sample of the right table (eg. the record with the most recent date, or the lowest/highest ID number), but whatever you do, the values in the join key must be unique. If you don't perform this sampling, you'll end up with extra rows and inflated results.
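For instance, here is a sketch of a "keep the most recent record" sampling rule in pandas, using the products table from the join-duplication example above (the rule itself is an assumption - use whatever logic fits your business case):

import pandas as pd

products = pd.DataFrame({
    "product": ["Stapler", "Tissues", "Soap", "Pack of pens", "Tissues"],
    "price": [4.50, 2.75, 3.50, 2.50, 3.00],
    "creation_date_utc": ["2017-07-01", "2017-07-01", "2017-07-01",
                          "2017-07-01", "2017-08-01"],
})

# Keep only the most recent record per product so the join key is unique.
latest_products = (
    products.sort_values("creation_date_utc")
            .drop_duplicates(subset="product", keep="last")
)
print(latest_products)  # one row per product -> no join duplications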
 

Join misses

Join misses are a little trickier to debug because some NULL values rightly belong in your data, while others are simply introduced by join misses.

You need to investigate these manually. My strategy is to pick a column you expect to be highly populated in the right table and count the number of nulls in that column before and after the join. If they don't match, you have produced join misses and need to investigate if they are legitimate.

In pandas I typically use Series.value_counts(dropna=False) to check the distribution of data in a column before and after joins. If this looks like exploratory data analysis (EDA), that's because it is. You should always be doing EDA even after you initially vet your data set - this is part of the sanity checking process.
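As a concrete sketch of that before-and-after check, using the sales and product_costs tables from above (the DataFrame construction is just a stand-in for however you actually load the data):

import pandas as pd

sales = pd.DataFrame({
    "date": ["2017-07-01", "2017-07-01", "2017-07-05", "2017-07-06", "2017-07-06"],
    "product": ["Stapler", "Tissues", "Soap", "Tissues", "Pack of pens"],
    "product_cost_code": ["J", "T", "R", "T", "X"],
    "quantity": [1, 3, 2, 2, 3],
})
product_costs = pd.DataFrame({
    "product_cost_code": ["J", "T", "R"],
    "product_cost": [1.80, 0.60, 2.10],
})

merged = sales.merge(product_costs, on="product_cost_code", how="left")

# Compare the distribution of a column you expect to be fully populated
# in the right table, before and after the join. New NaNs are join misses.
print(product_costs["product_cost"].value_counts(dropna=False))
print(merged["product_cost"].value_counts(dropna=False))
print(merged["product_cost"].isna().sum())  # 1 -> the "Pack of pens" row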

 

Conclusion

A prudent analyst must always be aware of the potential for join duplications and misses. You may get join misses (say, in pandas) if you join two columns with different data types - their values may look the same, but the data types themselves are not comparable. You may get join duplicates (in SQL) if your join key relies on case to be unique but the comparison is case-insensitive: an "A" will be treated the same way as an "a", violating the uniqueness we require on the join key.
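To illustrate the data-type gotcha, here is a hypothetical pandas example (the tables and column names are invented for illustration):

import pandas as pd

# Hypothetical tables whose keys look identical but have different dtypes:
# one stores the ID as an integer, the other as a string.
orders = pd.DataFrame({"customer_id": [1, 2, 3], "total": [10.0, 20.0, 30.0]})
customers = pd.DataFrame({"customer_id": ["1", "2", "3"],
                          "name": ["Alice", "Bob", "Carol"]})

print(orders.dtypes["customer_id"])     # int64
print(customers.dtypes["customer_id"])  # object

# Depending on the pandas version, merging on mismatched key types either
# raises an error or silently produces join misses - cast to a common type first.
customers["customer_id"] = customers["customer_id"].astype(int)
merged = orders.merge(customers, on="customer_id", how="left")
print(merged["name"].isna().sum())  # 0 -> no join misses after the cast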

One thing I am always surprised to see omitted in introductory SQL courses is an admission of just how tricky joins are. Unless you can safely assume that your data model is perfect (ie. never), joins are extremely prone to introducing errors. If you join data but join it incorrectly (and produce inaccurate results), you are better off not having joined the data at all. Joins should not be used unless you know the risks that come with them. 

SQL joins are not easy, nor are they meant to be intuitive. They require critical thought and foresight. If there's one thing I constantly remind myself of when analyzing data, it's to take my time and think about all the ways I could be wrong.

 

Always learn the shortcuts

When learning a new piece of software, there's a common tendency (to which I can personally attest) to focus on core functionality instead of memorizing keyboard shortcuts. This prioritization is sensible: learn what the software's features are - and, more importantly, whether they're useful - before optimizing how to perform them faster. Why memorize the shortcut for Microsoft Excel's "Trace Dependents" when you've never used the feature before?

But this mentality breeds an insidious inertia: once you've developed momentum pointing and clicking to get things done, it no longer seems necessary to learn the keyboard shortcuts. Motivated reasoning quickly follows:

  • The upfront investment of learning keyboard shortcuts outweighs the perceived benefits of faster execution
  • In a world of scarce brain real estate, learning keyboard shortcuts comes at the cost of learning more valuable information
     

These arguments obscure the true value of shortcuts: it's not about faster execution, it's about reducing cognitive load.

We can first acknowledge that muscle memory exists. A classic example of this is touch typing: you don't think about how you are going to type (eg. where the individual keys are), but rather what you are going to type. In other words, by relying on muscle memory instead of cognitive memory, you offload the mental burden of thinking about procedural steps to your fingers.

 
 

But muscle memory is not exclusive to keyboards. I'm sure many professional gamers would argue that, for certain tasks, pointing-and-clicking on the screen is much faster than keyboard shortcuts and is a veritable form of muscle memory. I am equally sure they would say that keyboard shortcuts are faster for other tasks. It would seem the keyboard-mouse trade-off depends on the task.

Which leads us to the second point: memorized keyboard shortcuts carry a lower cognitive burden than pointing-and-clicking through multi-menu tasks. The science seems to be far from conclusive on the issue, but I agree with Jeff Atwood that the superiority of keyboard shortcuts for multi-menu tasks appears self-evident. It also makes sense that chunking keyboard shortcuts is easier than chunking sequential mouse clicks (eg. Alt + A + T is chunked as one shortcut, not three keypresses).

So where do we find a lot of multi-menu tasks? Some candidates:

  • Your OS (eg. switching windows, opening applications, jumping between words/paragraphs/pages)
  • Your browser (eg. new tabs, switching tabs, re-opening tabs, search)
  • Your most commonly used applications (eg. Microsoft Excel, PowerPoint, Tableau)
  • Your editor or IDE of choice (eg. Spyder, PyCharm, vim, emacs, RStudio)
     

Let's take an example. Say you want to find the unique values of a filtered dataset in Excel. This requires several subtasks: selecting your data, opening filters, applying a filter, re-selecting your subset, copying it, pasting it elsewhere, re-selecting it and finally removing duplicates. This totals about 15-18 clicks; by contrast, it takes roughly half as many keystrokes with keyboard shortcuts. Again, ignoring the upfront investment of learning the shortcuts, the advantage is unambiguous. There is a reason investment bankers and creative professionals invest so much time in learning shortcuts.

But what about that upfront investment?

Using a basic capital budgeting framework, we need to perform the task enough times to justify the initial investment. For those who spend the vast majority of their working lives in just a few applications (namely their browser, Adobe CS, Microsoft Excel, etc.), the conclusion is obvious: over tens of thousands of repetitions, virtually any initial investment is justified. In practice, the investment in these shortcuts requires just a few days (sometimes only hours) of slightly heightened focus - the benefit extends for years.

However the ultimate question still remains: using shortcuts definitely frees up mental bandwidth - so what?

Here I rely more on intuition: a lower cognitive burden allows for more creative and analytical thinking. When you go from searching for the next click to executing with your fingers, you free up your mind to consider more strategic questions. How do I sanity check my work? Are there other ways which would be informative to filter the data? How else can I pivot the data? I would argue that you have a higher likelihood of remembering these questions and avoiding trivial errors by focusing on high-level strategy, not implementation.

All this to say: always learn the shortcuts.


A corollary: in a world increasingly dependent on web apps, these applications should always be equipped with keyboard shortcuts. Thankfully, this is now fairly trivial to implement on the web.

Learn just enough Linux to get things done

Different operating systems have long catered to different audiences: Windows for the business professional, Mac for the creative professional and Linux for the software developer. For OS providers, this sort of market segmentation greatly simplified product vision, technical requirements, user experience and marketing direction. However, it also reinforced workplace norms which bucket individuals into narrow, non-overlapping domains: business people can offer no insight into the creative process, and developers no insight into business problems.

In reality, knowledge and skill are fluid, spanning multiple disciplines and fields. The notion that "you can only be good at one thing" is not a roadmap to mastery but rather a prescription for premature optimization. You can only know what you're good at once you've sampled a lot of things - and you may just find that you're good at a lot of them.

For modern business analysts, bridging the gap between business and software development is especially important. Business analysts must be "dual platform," able to leverage command-line tools available only on Linux (or OS X) yet still benefit from the power of Microsoft Office on Windows. Understandably, the world of Linux is intimidating for those with a business degree. Fortunately, as with most things, you only need to learn 20% of the information to accomplish 80% of the work. Here is my 20%.
 

Why modern business analysts should know Linux

Due to its open source roots, Linux benefited from the contributions of thousands of developers over time. They built programs and utilities not only to make their jobs easier, but also the jobs of programmers who followed them. As a result, open source development created a network effect: the more developers built utilities on the platform, the more other developers could leverage those utilities to write their programs right away.

What resulted was an expansive suite of programs and utilities (collectively, software) written on Linux, for Linux - much of which was never ported to Windows. One example is the popular version control system (VCS) called git. Developers could have written this software to work on Windows, but they didn't. They wrote it to work on the Linux command line because that was the ecosystem which already had all the tools they needed.

Concretely, development on Windows runs into two main problems:

  1. Basic tasks, like file parsing, job scheduling, and text search are more involved than running a command-line utility
  2. Programming languages (eg. Python, C++) and their associated code libraries will throw errors because they are expecting certain Linux parameters or file system locations

Together, this means more time spent rewriting basic tools already available in Linux and troubleshooting OS compatibility errors. This is not a surprise - the Windows ecosystem simply wasn't designed with software development in mind.

With the case made for Linux development, let's begin with the basics.
 

The fundamental unit of Linux: the "shell"

The shell (also known as the terminal, console or command line) is a text-based user interface through which commands are sent to the machine. On Linux, the shell's default language is called bash. Unlike Windows users who primarily point-and-click inside of windows, Linux developers stick to their keyboard and type commands into the shell. While this transition is at first unnatural for those without a programming background, the benefits of developing in Linux easily outweigh the initial learning investment.

 
 


Learning the few important concepts

Compared to a full-fledged programming language, bash only has a few major concepts that need to be learned. Once these are covered, the rest of bash is just memorization. I'll restate for clarity: being good at bash is simply memorizing about 20-30 commands and their most common arguments.

Linux seems impenetrable to non-developers because of the way that developers seem to effortlessly regurgitate esoteric terminal commands at will. The reality is that they committed only a few dozen commands to memory - for anything more complicated, they too (like all mere mortals) consult Google.

With that out of the way, here are the main concepts in bash.

Command syntax

Commands are case-sensitive and follow the syntax of: {command} {arguments..}

For example, in 'grep -inr', grep is the command (to search for a string of text) and -inr are flags/arguments which change what grep does by default. The only way to learn what these mean is to look them up through Google or by typing 'man grep'. I recommend learning the commands and their most common arguments together; it's too burdensome otherwise to remember what each and every flag does.

Directory aliases

  • The present directory (ie. where am I?): .
  • The parent directory of the present directory: ..
  • The user's home directory: ~
  • The file system root (or the parent of all parents): /
     

For example, to change from the current directory to the parent directory, one would type: "cd .."

Similarly, to copy a file located at "/path/to/file.txt" into the present directory, one would enter "cp /path/to/file.txt ." (note the period at the end of the command). Since these are no more than aliases, the actual path name could be used in their place instead.

STDIN / STDOUT

Anything you type into the window and submit (via ENTER) is called standard input (STDIN).

Anything that a program prints back out to the terminal (eg. text from within a file) is called standard output (STDOUT).

Piping

  1. | 

    A pipe takes the STDOUT of the command to the left of the pipe and makes it the STDIN to the command on the right of the pipe.
    example: echo 'test text' | wc -l

  2. > 

A greater-than sign takes the STDOUT of the command on the left and writes it to (or overwrites) the file on the right.
    example: ls > tmp.txt

  3. >> 

Two greater-than signs take the STDOUT of the command on the left and append it to a new or existing file on the right.
    example: date >> tmp.txt

Wildcards

You can think of this like SQL's % symbol - for example, you might write "WHERE first_name LIKE 'John%'" to catch any first name starting with John.

In bash, you would write "John*". If you want to list all of the files ending with ".json" in a folder, you would write: "ls *.json"

Tab completion

Bash will often finish off commands intelligently for you if you start typing a command and hit your TAB key.

That being said, you should really use something like zsh or fish for autocomplete since it is hard to remember the commands and all their parameters - rather, these tools will autocomplete your commands based on your command history!

Quitting

Sometimes you'll get stuck in a program and can't get out. This is a very frequent occurrence for beginners in Linux and it is extremely demotivating. Often, quitting has something to do with q. It's good to memorize the following and try them all when you're trapped.

  • Bash
    • CTRL+c
    • q
    • exit
  • Python
    • quit()
    • CTRL+d
  • Nano: CTRL+x
  • Vim: <Esc> :q!

 

My memorized list of bash commands

Here are the commands I use most frequently in Linux (sorted from most to least frequently used). As I mentioned before, knowing just a handful of commands will accomplish the vast majority of programmable tasks you need to perform.

  • cd {directory}
    • change directory
  • ls -lha
    • list directory (verbose)
  • vim or nano
    • command line editor
  • touch {file}
    • create a new empty file
  • cp -R {original_name} {new_name}
    • copy a file or directory (and all of its contents)
  • mv {original_name} {new_name}
    • move or rename a file
  • rm {file}
    • delete a file
  • rm -rf {file/folder}
    • permanently delete a file or folder [use with caution!]
  • pwd
    • print the present working directory
  • cat or less or tail or head -n10 {file}
    • STDOUT contents of a file
  • mkdir {directory}
    • make an empty directory
  • grep -inr {string}
    • find a string in any files in this directory or child directories
  • column -s, -t <delimited_file>
    • display a comma-delimited file in columnar format
  • ssh {username}@{hostname}
    • connect to a remote machine
  • tree -LhaC 3
    • show directory structure 3 levels down (with file sizes and including hidden directories)
  • htop (or top)
    • task manager
  • pip install --user {pip_package}
    • Python package manager to install packages to ~/.local/bin
  • pushd . ; popd ; dirs ; cd -
    • push/pop/view directories onto the stack + change back to last directory
  • sed -i "s/{find}/{replace}/g" {file}
    • replace a string in a file
  • find . -type f -name '*.txt' -exec sed -i "s/{find}/{replace}/g" {} \;
    • replace a string for each file in this and child folders with a name like *.txt
  • tmux new -s session, tmux attach -t session
    • create another terminal session without creating a new window [advanced]
  • wget {link}
    • download a webpage or web resource
  • curl -X POST -d "{key: value}" http://www.google.com
    • send an HTTP request to a web server
  • find <directory>
    • list all directory contents and their children, recursively

     

Advanced and infrequently used commands

I find it's good to keep a list of commands that are useful in certain situations (eg. finding which process is blocking a certain network port), even though those situations don't happen very often. These are some uncommon commands I keep nearby:

  • lsof -i :8080
    • list open file descriptors (-i flag for network interfaces)
  • netstat | head -n20
    • list currently open Internet/UNIX sockets and related information
  • dstat -a
    • stream current disk, network, CPU activity & more
  • nslookup <IP address>
    • find the hostname for a remote IP address
  • strace -f -e <syscall> <cmd>
    • trace system calls of a program (-e flag to filter for certain system calls)
  • ps aux | head -n20
    • print currently active processes
  • file <file>
    • check what a file's type is (eg. executable, binary, ASCII-text file)
  • uname -a
    • kernel information
  • lsb_release -a
    • OS information
  • hostname
    • check the hostname of your machine (ie. the name other computers can use to reach you)
  • pstree
    • visualize process forks
  • time <cmd>
    • execute a command and report statistics about how long it took
  • CTRL+z ; bg ; jobs ; fg
    • send a process in the current tty to the background and bring it back to the foreground
  • cat file.txt | xargs -n1 | sort | uniq -c
    • count unique words in a file
  • wc -l <file>
    • line count in a file
  • du -ha
    • show size on disk for directories and their contents
  • zcat <file.gz>
    • display contents of a zipped text file
  • scp {user}@{remote_host}:{remote_path} {local_path}
    • copy a file from a remote server to your local machine, or vice versa
  • man {command}
    • show manual (ie. documentation) for a command, but you're probably better off using Google

                Hard skills for the modern business analyst

                Within the past decade, advances in data-driven technologies have fundamentally transformed the corporate landscape. In the past, most business problems could be tackled in Microsoft Excel - today this is no longer the case. There is simply too much data which is only accessible via programming, be it querying SQL databases or key-value stores, fetching data from APIs, scraping data from websites or downloading files from web servers. Transforming the data increasingly requires programming as well, such as parsing text files or applying complex analytical functions. Sometimes the size of data alone is the limiting factor - Excel for instance supports no more than 1,048,576 rows. As time goes on, programming will be the only way to adequately manage and analyze all of this data.

                Despite this structural shift in how businesses approach problems, many firms - especially large, established ones - have been slow to extricate themselves from their dependence on Excel. Barring rare exceptions, there is little motivation to train analysts with a 21st-century skill set.

                But while there is no direct cost of increasingly obsolete analytical skills, there is an enormous opportunity cost. Generalist analysts (such as investment banking analysts, management consulting analysts, business analysts, etc.) spend hundreds of hours manually traversing and copying data from flat files or PDFs, updating Excel workbooks and aggregating data found across remote corners of the Internet. In short, these are highly manual, highly automatable tasks. Business insights are overlooked when analysts allocate more time to manual work than analytical work. Even worse, analysts are prone to committing costly mistakes due to the sheer volume of manual labor.

                The ability to program therefore offers a sizable multiplier on analyst productivity. After all, these are the same savvy analysts now technically equipped to automate otherwise tedious and redundant work. Employers need not (and should not) hire programmers to fill this role - rather, they must train their business-minded graduates to program. They must combine the business analyst skill set with the data analyst skill set. They must teach their analysts to code.

                 

                From point-and-click to type-and-enter

                Before diving into the exact programming skills required for today's data analyst, it's important to understand conceptually why programming is so important. Using programming and data languages offers the following benefits:

                • Adaptability: should the underlying dataset or schema change, the analysis can be quickly reproduced
                • Replicability: there is a clear trail of what operations were performed on the data (by contrast, Excel's row or column deletion are "destructive," irreversible operations)
                • Modularity: frequently used data operations can be applied to new projects (eg. timezone conversion, report generation, FX conversions)
                • Scalability: data can be distributed across file systems (eg. HDFS) and databases (eg. Amazon Redshift), as well as analyzed across multiple cores. This not only means larger data sets can be accessed, but also analyzed at a faster speed
• An open source community: the vast majority of useful libraries for data analysis are open-source and actively maintained (eg. pandas, scikit-learn, dplyr, ggplot2), meaning they generally get new features and bug fixes faster than closed-source counterparts (eg. Tableau, Excel)
                • Systematic validation: unit tests can be used to validate the data at various stages along the pipeline (eg. ingestion, transformation)
                • Advanced statistical modeling: most complicated and novel statistical models are not offered by closed-source software
                • Complex data transformations: unlike software with a UI, whose data operations are constrained to what the core developers expose, programming languages have no limitation on what you can do with the data (eg. up/downsampling, string manipulations)
                • Rapid visualization: easily facet the data across different dimensions and produce graphs without transforming the underlying data


                While these benefits primarily concern data analysis, there is an added bonus: programming facilitates an understanding of non-analytical fields in data science, such as data engineering or webapp development. This "vertical integration" along the data science stack is non-trivial - it means you are able to manage your data from start to finish.

                 

                Hard technical skills for the modern business analyst

Data analysts should be able to tackle any business problem which involves, perhaps unsurprisingly, both (1) data and (2) analysis. For example, if data is stored in a database, then a programming language alone is insufficient to solve the business problem - a data language is required as well. Therefore, proficiency in these technologies (ordered from most to least commonly used) should cover the vast majority of business problems faced by analysts:

                • Microsoft Excel for basic data operations, formatting and organization
                • SQL for large-scale data storage and querying
                • A programming language for data-driven workflows
                  • Python
                  • R
                  • Matlab
                • A version control system (VCS) for saving periodic work revisions
                  • Git
                  • Subversion
                • Bash scripting for file system operations and navigation
                • NoSQL for schemaless data storage and fetching
                • A business intelligence (BI) tool for analyzing and visualizing data 
                  • Tableau
                  • Looker
                  • QlikView
                • Web development for sharing data analyses and visualizations (optional)
                • Big data familiarity (this is a broad topic which I explore in a separate post)

                 

                Wrapping up

                I hope this post clarified the necessity of programming in solving business problems. Business analysts will always be on the front lines when it comes to working with data, and as a result they should be equipped with the right tools to handle the data.

                These data analysis skills need not replace their business acumen - on the contrary, learning to code will only enhance it. 

                Why data analysts should exercise caution when taking averages (Part 2)

This is Part 2 in a 2-part series on miscalculating averages and its implications.

                In Part 1 of this series, I explained the fundamental problem with calculating multidimensional averages: you can never be sure what a single observation represents in the database (given by the primary key), and as a result, you need to "guarantee" what an observation is by specifying a collapsing key. While Part 1 emphasized how to structurally think about the problem of multidimensional averages, this post covers some of its implications.

                 

                1. When it comes to averages, what makes a good question?

                While every analysis begins with a good question, parsing that question linguistically can sometimes be challenging.

                As a recap of my first post, every question with an average should contain at least two elements and sometimes three:

                1. Measure
                2. Averaging key (specifies the denominator)
                3. Grouping key (optional)

                For example, "what is the average daily quantity of products sold?" has a measure of "quantity" and an averaging key of "day". Similarly, "How many texts does the average person send per day?" has a measure of "number of texts", a grouping key of "day" and an averaging key of "person". I think it often helps to rephrase the question like this: "the average [measure] across [averaging key] by [grouping key]."

                In contrast, "what is the daily average?" is an incomplete question because there is no measure. Similarly, "what is the average quantity purchased" is incomplete because there is no averaging key (that is, average quantity purchased across people, across cohorts, over time, etc.?).

                Thinking in terms of grouping and averaging keys can help parse linguistically similar questions:

                • "What is the average daily share price over the last 20 days?"
                  • Output: a single value, the average, calculated by summing each day's closing share price over the last 20 days divided by 20.
                • "What is the average share price by day for the last 20 days?
                  • Output: a list of 20 days (ie. "for each day"), each with its average share price over the day. For example, if your data comes at the minute-level (ie. averaging key is "across minutes"), each average would be sum of all prices that trading day divided by the count of minutes.
                • "What is the average daily share price by security in a portfolio over the last 20 days?
                  • Output: a list of securities, (ie. "for each security"), each with a daily average over the last 20 days.
                • "What is the average share price by security by day?"
                  • Output: [ambiguous - grouping key and averaging key are indiscernible]

                To reiterate, the key here is knowing that the averaging key defines how many records will be in the denominator. If you are unsure what your denominator is, or conversely, how to interpret your averaging key, you will have trouble answering your initial question.

                 

                2. What aggregation functions besides the mean are susceptible to the pitfall?

Unfortunately, while the difficulty of working with multidimensional data may be most salient with the arithmetic mean, it is not confined to it. In fact, many other aggregation functions, such as the standard deviation (which also uses n in the denominator), median, mode, maximum, minimum, first, last and count, may all be similarly affected.

                We can see this with an example.

                  Example 1
                transactions
                date product state purchaser quantity
                2016-12-23 vacuum NY Brian King 1
                2016-12-23 stapler NY Brian King 3
                2016-12-23 printer ink NY Brian King 2
                2016-12-23 vacuum MA Lauren Mills 1
                2016-12-23 printer ink MA John Smith 5
                2016-12-24 vacuum MA Lauren Mills 1
                2016-12-24 keyboard NY Brian King 2
                2016-12-24 stapler NY Trevor Campbell 1
                2016-12-25 keyboard MA Tom Lewis 4
                 

                "For each day, what is the max quantity purchased by any individual?"

                Since the primary key is more granular than the key on which we are analyzing - that is, (purchaser) - we again will have to collapse the data. In SQL:

                 
                SELECT date, MAX(quantity) FROM (
                    SELECT date, purchaser, SUM(quantity)
                    FROM transactions
                    GROUP BY date, purchaser // COLLAPSING KEY
                ) AS t
                GROUP BY date // GROUPING KEY
                
                // MAX KEY (IMPLIED): [date, purchaser] - [date] = [purchaser]

                From this example, it's clear that while there is no denominator to be affected, aggregating the actual values themselves changes what the answer will be. We can imagine how median, mode and the other calculations are all similarly affected when the underlying data is aggregated into new values.

                 

                3. Can you interpret an averaging key with more than one dimension?

Yes, the averaging key can be a composite key (ie. composed of two or more dimensions). However, the composite key has to make sense.

                  Example 2
                apples
                first_name last_name num_apples
                Katie Smith 4
                Alan Johnson 8
                Alan Howard 10
                Tess O'Neal 8
                Katie Powers 5
                 

For example, the composite key of (first_name, last_name) makes sense because it has a natural interpretation: full name. A full name corresponds to a single individual, and it makes sense to take averages at the level of individuals - as a result, (first_name, last_name) would be a legitimate averaging key. Many times, however, an averaging key with two or more dimensions will not make sense - this is an indication that more aggregation (ie. further collapsing) should have been performed.

                 

                4. How far off can the calculated average be with the wrong averaging key?

                By way of example, we can evaluate how the calculated average directionally changes as we change the length of the averaging key.

Looking at Example 2 above, what is the average when the averaging key is (first_name, last_name)? It is SUM(4 + 8 + 10 + 8 + 5) / 5 = 7. What is the average when the averaging key is (first_name)? It is SUM(4 + 8 + 10 + 8 + 5) / 3 = 11.66.

As you can tell, the more granular the averaging key, the more records will be in the denominator (ie. a higher count) and the smaller the average will be. Conversely, the more collapsed the averaging key, the fewer records will be in the denominator (ie. a lower count) and the larger the average will be. For highly dimensional datasets (more granularity), this means averages can be significantly lower than what makes sense due to inadequate aggregation.

                 

                Why data analysts should exercise caution when taking averages (Part 1)

                This is Part 1 in a 2-part series on miscalculating averages and its implications.

                Few statistics rival the simplicity and explanatory power of those which are the most basic: the percentage, sum and average, among others. As a result they crop up everywhere, from exploratory data analysis to data dashboards and management reports.

                But one of these, the arithmetic mean, is unusually problematic: while it is conventionally taught with one-dimensional data in mind, calculating it for multidimensional data requires a fundamentally different process. In fact, calculating the arithmetic mean as if your data is one-dimensional will produce grossly incorrect figures, sometimes orders of magnitude off from what was intended. For me, this was a humbling experience: even the arithmetic mean is no less deserving of double- and triple-checking than any other calculation.

                Back to basics

                The arithmetic mean is defined as:

                SUM(all observations) / COUNT(number of observations)

                We can see this through a simple example:

                apples
                first_name num_apples
                Katie 4
                Alan 8
                John 10
                Tess 8
                Jessica 5
                 

                In this example, there is one categorical dimension (first_name), one numerical measure (num_apples) and five observations, also called records, rows, measurements or facts. The grain of fact is what makes each individual record unique: in this case, the grain is composed only of the (first_name) dimension. If Katie appeared twice in the name column, we would know that (first_name) could not be the grain because it is no longer composed of unique observations.

                Using the formula above, we can calculate the arithmetic mean:

                 
                SUM(4 + 8 + 10 + 8 + 5) / 5 = 7

                In SQL, we would write:

                 
                SELECT AVG(num_apples) FROM apples

                Adding complexity: two-dimensional data

                apples
                first_name last_name num_apples
                Katie Smith 4
                Alan Johnson 8
                Alan Howard 10
                Tess O'Neal 8
                Katie Powers 5
                 

Now there are two dimensions (first_name, last_name), one measure (num_apples) and again five observations. Notably, the grain is no longer just (first_name) because Katie and Alan each appear twice - the grain is now defined as the combination of two columns: (first_name, last_name).

                We can also say the grain is (effectively) equal to the primary key of the table. The primary key specifies the set of one or more columns whose data must be unique for a given table; in addition, every table must have a primary key. You can think of the grain of fact as the semantic description of what makes a record unique, while the primary key is the database implementation of that description. I say "effectively" equal because while the grain and primary key may not actually contain the same set of columns, they will almost always be one-to-one (ie. each grain of fact corresponds to only one unique record as defined by the primary key, and vice versa).

                We can now calculate a few different means. "What is the average number of apples across people?":
                 
                SELECT AVG(num_apples) FROM apples

                This evaluates to 7, just as before.

                How about: "What's the average number of apples across people with unique first names?" We can see that there are 3 unique first names (Katie, Alan, Tess) and the same total as before (35), so the answer is 35 / 3 = 11.66. In SQL we would write:

                 
                SELECT AVG(num_apples) FROM (
                    SELECT first_name, SUM(num_apples)
                    FROM apples
                    GROUP BY first_name
                ) AS t

                Notice how by changing the business question, we decreased the denominator from 5 to 3. We did this by reducing the dimensionality of the data (also known as collapsing or rolling up). Our grain was composed of two dimensions (first_name, last_name) but our analysis concerned only one dimension (first_name), so we summed up all the data beneath that one dimension (first_name) and calculated the average across it. The point here is that we specified the count of items in the denominator by collapsing our data onto a single dimension.

                Summarizing where we are so far

                We know that the combination of one or more dimensions composes the grain of fact (and effectively primary key) of the table. The business question requires an average on one dimension which is less than the number of dimensions composing the grain of fact - therefore we have to collapse the data onto that one dimension before taking the average. By collapsing the data, we guarantee the number of records in the denominator so that it doesn't include all the records in our table (given by the primary key), but rather only those records we care about (given by, soon to be discussed, the "collapsing key").

                When calculating the arithmetic mean, there are two questions you need to answer:

                1. What exactly is the business question asking? On which dimension are we taking the average?
                2. How do we translate this business question into SQL?
                For example, "the average number of items sold across days" implies that, despite how many records are in our table, we need to collapse the data on the dimension "day" alone and take the average of the measure "number of items," which ultimately yields a single number.

                Why did I write "across days" instead of "by days"? Because grammar can make things confusing...

                GROUP BYs and averages

                apples
                date_picked grade first_name last_name num_apples
                2017-01-01 freshman Katie Smith 4
                2017-01-04 junior Alan Johnson 8
                2017-01-04 junior Alan Howard 10
                2017-01-04 freshman Tess O'Neal 8
                2017-01-01 freshman Katie Powers 5
                 

                Here we have four dimensions (date_picked, grade, first_name, last_name), one measure (num_apples) and as usual five observations. At this point it's not entirely clear what the grain is: it could be (first_name, last_name) but also could be (date_picked, grade, first_name, last_name). In other words, if (Katie, Smith) appeared twice with any grade or date_picked, should those two records actually be combined into one? Or, are grade and date_picked part of what makes a record unique? Fortunately, as we will see, knowing the primary key is not relevant to our calculation of the arithmetic mean (but you could always check the actual implementation in the database).

                We can calculate some more averages - for example, "what's the average number of apples picked across days?" There are two unique days (2017-01-01, 2017-01-04) so we know that's our denominator and our total, as before, is 35. The answer is 35 / 2 = 17.5 apples picked per day, on average. In SQL:
                 
                SELECT AVG(num_apples) FROM (
                    SELECT date_picked, SUM(num_apples)
                    FROM apples
                    GROUP BY date_picked
                ) AS t
                Now another average, "what's the average number of apples picked across days by grade?"

                So we'll look at each grade one-by-one: first there is "freshman". Within "freshman", we have two unique dates (2017-01-01 and 2017-01-04), so 2 is our denominator. The numerator is the sum of those three records (4 + 8 + 5) = 17. The average apples picked per day for freshmen then is 17 / 2 = 8.5. Next is "junior". Here we have only one unique date (2017-01-04) and two records, the sum of which is (8 + 10) = 18. The average apples picked per day for juniors is 18 / 1 = 18. In SQL:

                 
                SELECT grade, AVG(num_apples) FROM (
                    SELECT date_picked, grade, SUM(num_apples)
                    FROM apples
                    GROUP BY date_picked, grade
                ) AS t
                GROUP BY grade

                In the results grid, your SQL output would look like:

                grade AVG(num_apples)
                freshman 8.5
                junior 18
                 
                What would happen if I took the same question as above, but reworded it as "what's the average number of apples picked by day by grade?" Suddenly it's not so clear what exactly I'm grouping on (grade) and what I'm averaging across (days) because I use "by" to mean both. Confusing!

                If the business question isn't clear, it's your job as the data analyst to figure out how the words translate into SQL.

                Formalizing the pattern

                Understanding how to average multidimensional data is best tackled with a bit of terminology. There are three essential elements:

                • Collapsing key: the key whose dimensions we explicitly specify to be used for either grouping or averaging the data. This key is specified by the inner query's GROUP BY clause.
                • Grouping key: the key whose dimensions specify the groups for which we will calculate averages. This (optional) key is specified by the outer query's GROUP BY clause.
                • Averaging key: the key across which we ultimately want to evaluate the metric.
                  • This key is not explicitly specified: rather, it equals the collapsing key minus the grouping key. Any dimensions which are in the collapsing key but not in the grouping key will yield the averaging key.
                  • Most business questions are best understood with an averaging key of one dimension.

                Let's put these terms into an actual SQL query, recycling the example from above:

                 
SELECT grade, AVG(num_apples) FROM (
   SELECT date_picked, grade, SUM(num_apples)
   FROM apples
   GROUP BY date_picked, grade // COLLAPSING KEY
) AS t
GROUP BY grade // GROUPING KEY

// AVERAGING KEY (IMPLIED): [date_picked, grade] - [grade] = [date_picked]

                For example, if the collapsing key has 4 dimensions and the grouping key has 2, the averaging key will have 2 (difficult to interpret). If the collapsing key has 1 dimension and there is no grouping key, the collapsing key equals the averaging key. 

                Why is the primary key left out of all of this? Because you can never know how many records you are unwittingly including in the denominator of your calculation (which, at most, is given by the primary key of the table). As a result, you must explicitly define a collapsing key. You can think of the primary key as what defines an individual fact for your table, while the collapsing key is what defines an individual fact for your business question.

                Finally, notice what gives rise to the common mistake: we never explicitly declare the most important piece of our business question, the averaging key! Here we see "the pitfall of multidimensional averages": the averaging key is implicit, not explicit.

                The takeaway from this section: collapsing key - grouping key = averaging key

                A final example: tying it all together

                transactions
                date product state purchaser quantity
                2016-12-23 vacuum NY Brian King 1
                2016-12-23 stapler NY Brian King 3
                2016-12-23 printer ink NY Brian King 2
                2016-12-23 vacuum MA Lauren Mills 1
                2016-12-23 printer ink MA John Smith 5
                2016-12-24 vacuum MA Lauren Mills 1
                2016-12-24 keyboard NY Brian King 2
                2016-12-24 stapler NY Trevor Campbell 1
                2016-12-25 keyboard MA Tom Lewis 4
                 

                Knowing what we know now, we won't concern ourselves with what the primary key is. We can jump right into some business questions.

                "What is the daily average number of items sold by state and purchaser?"
                 
                SELECT state, purchaser, AVG(quantity) FROM (
                   SELECT state, purchaser, date, SUM(quantity)
                   FROM transactions
                   GROUP BY state, purchaser, date
                ) AS t
                GROUP BY state, purchaser
                
                // AVERAGING KEY: [state, purchaser, date] - [state, purchaser] = [date]
                "What is the daily average quantity sold by product in NY?"
                 
                SELECT product, AVG(quantity) FROM (
                   SELECT product, date, SUM(quantity)
                   FROM transactions
                   WHERE state = "NY"
                   GROUP BY product, date
                ) AS t
                GROUP BY product
                
                // AVERAGING KEY: [product, date] - [product] = [date]

                I will be the first to say that answering these questions requires a double-take - what exactly is the question asking? Then, how can I translate it into SQL? But that's precisely the point - calculating multidimensional averages requires considerable thought.

                Wrapping up

                Business questions don't come in the form of database logic or program code. Rather, they are formulated using natural language and must be translated into data language. Again, "calculate the average number of travelers by day by city" is not entirely clear. As a data analyst, you must clarify: "what exactly are we taking the average of?" Here it's helpful to think in terms of collapsing, grouping and averaging keys, especially when conceptualizing which records are going into your denominator.

                This problem isn't confined to SQL, but rather any store of relational data, such as pandas DataFrames or R data tables. If you're anything like me, you'll go poring over your old code in search of averages and wonder, "just what exactly am I averaging here?"
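In pandas, for example, the same collapse-then-average pattern looks roughly like this (a sketch using the apples table from the GROUP BY example above):

import pandas as pd

apples = pd.DataFrame({
    "date_picked": ["2017-01-01", "2017-01-04", "2017-01-04", "2017-01-04", "2017-01-01"],
    "grade": ["freshman", "junior", "junior", "freshman", "freshman"],
    "first_name": ["Katie", "Alan", "Alan", "Tess", "Katie"],
    "last_name": ["Smith", "Johnson", "Howard", "O'Neal", "Powers"],
    "num_apples": [4, 8, 10, 8, 5],
})

# Naive mean: the denominator is every row in the table (5), which may not
# be what the business question asked for.
print(apples["num_apples"].mean())  # 7.0

# "Average number of apples picked across days by grade":
# collapse onto (date_picked, grade), then average across date_picked per grade.
collapsed = apples.groupby(["date_picked", "grade"], as_index=False)["num_apples"].sum()
print(collapsed.groupby("grade")["num_apples"].mean())
# freshman     8.5
# junior      18.0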

                Working with data from start to finish (Part 3)

                This is Part 3 of a 3-part series on data science in name, theory and practice.

                In Part 2 of this series, I discussed a systematic process for making sense out of data, without which it is easy to fall into the trap of drawing unactionable - or worse - unreliable conclusions. With the framework established, the goal of this post is to then walk through some of the code (in Python) required to implement it. I'll use the IGM Economic Experts Panel dataset, which catalogs the responses of dozens of economists on public policy issues - you can see the results of my analysis here.


                EXTRACTION

                Since there's no option to download this data, the first step is figuring out which data to scrape. Here's what the website looks like:

                Below this initial landing survey are links to all the other surveys:

                It's within these surveys (ie. each link) that I find the actual data: tables listing each economist, their vote, their confidence and any comments they have.

                So the roadmap is: scrape the homepage for the links to each survey, then scrape the response tables from each survey.

                Now, how do I get the links to each survey? Ideally, all the links follow the same pattern. To check this, I inspect the DOM by right clicking on the link in Chrome and selecting "Inspect" (you can also view source with CTRL+U or F12). A frame will pop up that looks like this:

                Here I see a pretty discernible pattern: each link is denoted by the <a> tag, nested within a <h2> header tag, wrapped in a <div> with an attribute of class equal to poll-listing (I end up using a slightly different pattern in the code). It's good practice to isolate with a unique pattern - after all, there may be links within header tags whose wrapping <div>s do not have the poll-listing class. 

                Now that I know what to look for, I can write the code. Logically, I want to traverse the DOM and store each link in a collection (eg. list, tuple, dictionary).

I use requests to pull the DOM of the webpage, http://www.igmchicago.org/igm-economic-experts-panel. Next, I parse the DOM using BeautifulSoup - essentially this gives me a shortcut to isolating certain elements (eg. links, tables, etc.) that I would otherwise have to structure myself. Using my BeautifulSoup object (in this case, the soup variable), I isolate all the <div> elements whose class attribute equals poll-listing. However, so far .findAll() has only given me an iterable ResultSet object, not the actual links I'm looking for. Therefore, I iterate over each Tag object (the poll variable) and use poll.find('h3', {'class': 'surveyQuestion'}).a['href'] to get the hyperlink as a string. I then store the link in the topics dictionary. Naturally, when using a new and unfamiliar package, I check the documentation religiously to understand all the methods which can make my job easier.
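
The original code isn't embedded here, so below is a rough sketch of what it might look like based on the description above (keying each survey by its title is my assumption):

import requests
from bs4 import BeautifulSoup

url = 'http://www.igmchicago.org/igm-economic-experts-panel'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

topics = {}
for poll in soup.findAll('div', {'class': 'poll-listing'}):
    link_tag = poll.find('h3', {'class': 'surveyQuestion'}).a
    # store the hyperlink as a string, keyed by the survey title (an assumption)
    topics[link_tag.get_text(strip=True)] = link_tag['href']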

So after all that, I have the links in a dictionary. Now I can iterate over each link and, for each page, scrape its tables. The resulting code is similar:

                First, I declare a data list object because pandas (the package for data tables) accepts data most easily as a list of dictionaries. Each element in the list is a row, and for each element, the dictionary's keys represent the columns. As I scrape each row from the tables, I put each cell into a dictionary, and each dictionary is appended to the data list. 

                So given the list of links for each particular survey, I again use requests to pull their webpages, then BeautifulSoup to isolate the parts I need - in this case the <table> elements. (The line with list('ABCD') basically says: assign a letter to each individual survey because the poll page can include one or multiple surveys).

                Now for the next part of the code - what do I do with these tables?

                The first line essentially iterates over each table on the webpage, with a slight adjustment of tying the assigned letter to its respective table. Now, for each table, I need to pull all the rows (<tr> tags). For each row, pull (from left-to-right) the data in each cell (<td> tags). Retrieve and clean the actual text data from each cell (.get_text()), then put it in a dictionary with some sensible column names. Finally, append the dictionary (row) to the data list. Again, for some perspective, the data list will contain rows from every table on every poll webpage, so it's important to make each row unique by including columns for the survey question (eg. A, B, etc.) and poll URL.
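
Again, the code itself isn't shown here, so the following is an approximate reconstruction of that loop. The cell positions and column names are my assumptions, based on the description of the response tables (economist, vote, confidence, comment):

import requests
from bs4 import BeautifulSoup

data = []   # list of dicts - one dict (row) per economist response

for topic, url in topics.items():
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    tables = soup.findAll('table')
    # assign a letter to each survey question on the page (A, B, ...)
    for question, table in zip(list('ABCD'), tables):
        for tr in table.findAll('tr'):
            cells = tr.findAll('td')
            if len(cells) < 4:
                continue   # skip header or malformed rows
            data.append({
                'survey_question': question,
                'poll_url': url,
                'economist': cells[0].get_text(strip=True),
                'vote': cells[1].get_text(strip=True),
                'confidence': cells[2].get_text(strip=True),
                'comment': cells[3].get_text(strip=True),
            })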

                At last, I've scraped the data from all of the surveys! The last scraping step is to temporarily store the data in a more amenable format:
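
That "more amenable format" is presumably a pandas DataFrame built from the list of dictionaries - a guess that is at least consistent with the df used just below:

import pandas as pd

df = pd.DataFrame(data)   # one row per response, columns taken from the dictionary keys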

                For most datasets, you can persist this data (ie. a permanent format on disk, not just RAM) in a Python-specific pickle file or .csv:

df.to_pickle(os.path.join(os.path.dirname(__file__), 'igmchicago.pkl'))
                

However, for some datasets, you'll want to use a relational database.

                STORAGE

pandas makes it easy to dump a dataframe into SQL; without it, you'd have to build the schema and insert the data with raw SQL - a much more painful process.
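
As a rough sketch of what that looks like (the table and file names here are mine), df.to_sql pushes the scraped DataFrame into a sqlite3 database:

import sqlite3

conn = sqlite3.connect('igmchicago.db')   # hypothetical database file
df.to_sql('igmchicago', conn, if_exists='replace', index=False)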

                Once your data is in a database, you can run plain-old SQL on it. For example, which economists have average confidence levels higher than the average for their institution?
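
The query below is one way to answer that question. It assumes the table has economist, institution and (numeric) confidence columns, which is my guess at the schema:

query = """
SELECT econ.economist, econ.institution, econ.avg_confidence
FROM (
    SELECT economist, institution, AVG(confidence) AS avg_confidence
    FROM igmchicago
    GROUP BY economist, institution
) econ
JOIN (
    SELECT institution, AVG(confidence) AS institution_avg
    FROM igmchicago
    GROUP BY institution
) inst ON econ.institution = inst.institution
WHERE econ.avg_confidence > inst.institution_avg
"""
pd.read_sql_query(query, conn)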

                While I have a strong preference for fast, simple data manipulation in pandas, SQL is an absolute must for any data/BI analyst.

                EXPLORATORY DATA ANALYSIS

After persisting the data, I need to understand what's in it. What are the ranges of the categorical variables and what do they indicate? Do some categories mean the same thing? Are there null values? How many rows are there and how many would I expect? I virtually always use the following commands to quickly get a sense of the data:
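
The exact commands aren't shown here, but they are the usual pandas staples, roughly:

df.shape                    # number of rows and columns - do the counts make sense?
df.info()                   # column dtypes and non-null counts
df.head()                   # eyeball the first few rows
df.describe()               # summary statistics for numeric columns
df.isnull().sum()           # nulls per column
df['vote'].value_counts()   # category counts (column name from the scraping sketch above)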

                EDA will accomplish two things: (1) you will learn the correct interpretation of your data and (2) you will identify some (presumably not all) errors in the data which need to be corrected. In this case, I noticed a couple of issues with my original scraped data.

                DATA CLEANING

                First, I saw that virtually all the recently added economists had very low survey response rates. I checked into the original data and noticed that although these economists were given the opportunity to retroactively respond to surveys, these responses were included as a second row, not the initial row which I scraped. 

                So I had to go back through my scraping code and rescrape the second row if it had data in it. After doing this, I confirmed that the discrepancy in survey response rates between newly added economists and all other economists was removed. Success!

                Second, all comments were stored as a string, even if they were blank (ie. an empty string: ""). The summary statistics therefore would lead one to believe that every economist always left a comment, which was obviously untrue. As a result these empty strings needed to be recoded as null values.

                Finally, I decided that adding the sex of economists would be useful and coding the response categories as numeric values would facilitate aggregation.
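
A minimal sketch of those cleaning steps, reusing the column names from the scraping sketch above; the exact response labels and their numeric coding are my assumptions:

import numpy as np
import pandas as pd

# blank comments were scraped as empty strings - recode them as missing values
df['comment'] = df['comment'].replace('', np.nan)

# confidence was scraped as text - make it numeric for aggregation
df['confidence'] = pd.to_numeric(df['confidence'], errors='coerce')

# code the vote categories as numbers (labels and scale are assumptions)
vote_map = {'Strongly Disagree': -2, 'Disagree': -1, 'Uncertain': 0,
            'Agree': 1, 'Strongly Agree': 2}
df['vote_numeric'] = df['vote'].map(vote_map)

# the sex column would come from a small manual lookup table merged on economist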

                DATA ANALYSIS

                Data analysis is often the meat of any data-driven project: you've gathered, stored, understood and validated the data, now how can you use it to answer business or research questions?

                I did not thoroughly analyze this dataset, though there is a lot I could do. I could run a regression with dummy variables and see if sex was associated with more or less confidence in responses. I could find the average confidence by institution, or tabulate an economist's confidence level over time. In many cases, I don't need more than intermediate statistics. A lot of insight can be gained by simply looking at any metric by a dimension (eg. region, sales channel, etc.), over time, or as a proportion of the total. 

                Here are some quick summary statistics I ran with pandas
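
The original output isn't reproduced here, but the cuts were along these lines, again assuming the columns from the sketches above:

# average confidence by economist, highest first
df.groupby('economist')['confidence'].mean().sort_values(ascending=False).head(10)

# share of responses in each vote category
df['vote'].value_counts(normalize=True)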

                DATA VISUALIZATION

                I am a strong advocate of visualizing everything. Your audience will rarely be statisticians or analysts - often, they are decision-makers who simply need to know the next step and want a quick way of seeing the data-driven evidence. Not only that but visualization will quickly show you if your results seem counter-intuitive, allowing you the chance to review your data and analyses before presenting them.

Unlike R, which has the powerful and intuitive ggplot2, Python does not feature plotting libraries that are quite as robust. The most popular packages I have seen are seaborn, ggplot (ported from R), bokeh (displays in the browser), and the low-level matplotlib. In my opinion, none of these are as mature as ggplot2 due to either bugs or lack of features. While working on this project, I toyed with each library but eventually settled on matplotlib to visualize exactly what I needed - of course, at the cost of concise code and readability.

                For the sake of brevity, I decided not to include the code to visualize the data here, but you can view the graphs here and the code on GitHub.


                That wraps up a high-level overview of implementing the data pipeline from start to finish. I strongly believe that once you understand the main steps in the process, most analytical projects become feasible. If you need new data, or have to clean data, or need to store and query large amounts of data, or visualize data, all you need is an understanding of these basic principles.

                Hope you enjoyed reading! This post concludes the 3-part series on data science in name, theory and practice.

                Full code for this project can be found on GitHub.

                The data pipeline (Part 2)

                This is Part 2 of a 3-part series on data science in name, theory and practice.

With all of today's emphasis on big data and machine learning, it's easy to overlook an important fact: data does not just appear out of nowhere - clean, structured and ready to be analyzed. In reality, data must first be gathered, interpreted, cleaned and transformed. In this post, I will outline these steps within a framework I call "the data pipeline," based on the flow of data from its extraction to its presentation.

                The intended audience for this post are beginners entering the field of data science as an analyst or engineer. Terminology and relevant software are in bold. Before expanding on the steps, note that most of the software alluded to in this series will be related to Python because that is the language with which I am most familiar. Regardless, the takeaway here is the flow of data in theory; in Part 3, I will demonstrate this flow in practice. 

                1. Data extraction

                The first step to any analysis is getting the relevant data. Sometimes, you are provided the data by the client or can download it online in clean .csv files. More often, you have to get the data yourself. This can mean retrieving and consolidating it from diverse sources (eg. ERP, accounting, Google Analytics), scraping it from websites, interfacing with APIs or even streaming it from hardware. Once you are familiar with scraping data, a new realm of opportunities emerges: any data you see, you can scrape. (Disclaimer: you should never scrape resources prohibited by a website's Terms of Service).

When it comes to pulling external data, two generalized software solutions I've found are import.io, a free SaaS designed to scrape structured data (eg. listings on eBay), and Google Sheets' IMPORTHTML() function (eg. for a table on ESPN). If these don't fit the bill, you will have to employ a programmatic solution.

                The primary Python library for scraping your data is requests. When you visit a webpage, your browser sends an HTTP request to the web server with certain headers (eg. the page it wants, which browser it is), and sometimes cookies or a payload as well; the web server then replies with an HTTP response, most important of which is the webpage's Document Object Model (DOM). The DOM is the code (HTML tags and all) that your browser uses to render the webpage you see. Requests mimics some of your browser's basic functionality: you create an HTTP request, send it off to the web server and receive the webpage's HTTP response. But because you used requests and not a browser, you get that response in code/text format (ie. not a rendered webpage), which is just the format you need it in to parse it.

                Since you are scraping the data you first saw from your browser, it makes sense to copy the exact request your browser sent to the webpage. Some web servers won't even let you access their pages unless it looks like the request was made from a browser. So head over to Chrome Developer Tools (F12 in Chrome) and click the Network tab to start "recording". Go to any website and you'll see all the requests your browser makes (eg. the page's images, stylesheets, the DOM itself). It can sometimes be a game of guess-and-check to find the right request in a sea of other requests (if the webpage uses AJAX, you can look just at the XHR menu), but once you find it, you can copy the exact information your browser sent (under Headers) and paste it into the parameters requests will use.

Requests sends the HTTP request and receives the response, but that's as far as it goes. So while you get the web page you were looking for (with the data within it), you get a bunch of HTML tags as well. This string of text represents unstructured data - there is no delineation between the information of interest and everything else, at least as far as your computer knows. To make sense of it, you need to structure it. When it comes to parsing the DOM, you should use BeautifulSoup. BeautifulSoup is an HTML parser that structures the DOM, allowing you to quickly isolate any set of elements and extract the data you need. For example, you can iterate over the rows in a table and store certain values in a list or dictionary.
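
Putting the last few paragraphs together, a generic sketch looks something like this (the URL and the table layout are placeholders):

import requests
from bs4 import BeautifulSoup

# headers copied from the browser's Network tab; User-Agent is usually the important one
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://example.com/some-table-page', headers=headers)

# structure the raw DOM so individual elements can be isolated
soup = BeautifulSoup(response.text, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)   # one list of cell values per table row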

Less commonly, your HTTP response won't be the DOM but instead a JSON response. You should be familiar with data serialization formats (mainly JSON and XML), though it's not really necessary to get too deep here. A parsed JSON response behaves exactly like a Python dictionary, so you just need to specify keys to get the values you're interested in. Finally, if you need to scrape webpages that rely on JavaScript, you may need to simulate a browser even more closely by using something like selenium.
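
For the JSON case, the parsed response is just nested dictionaries and lists (the endpoint and keys below are placeholders):

import requests

payload = requests.get('https://example.com/api/endpoint').json()   # placeholder endpoint
names = [item['name'] for item in payload['results']]               # hypothetical keys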

                2. Data storage and data modeling

                Once you've scraped the data, you need to find a place to store it. If you primarily need to write a small amount of data and store it for later, you can stick to flat files, eg. comma- or tab-delimited text files. While convenient for write operations, flat files are unsuitable for read operations. Without iterating over each line in the file, how can you extract the 200th row? Flat files lack indexes, meaning you need to traverse each row/line until you find what you're looking for.

                For this reason, most large projects and enterprise-level applications are backed by databases. Databases come in many flavors, though the most common are relational (eg. Postgres, sqlite3, Oracle, SQL Server) and non-relational (eg. MongoDB, Redis, Cassandra). I won't attempt to completely explain their differences, but I like to think that relational databases are used when you have neatly structured data. You specify the relationships between tables (eg. foreign keys), the constraints within tables (eg. column values must be unique or not null) and the type of data for each column. All of these must be known and specified in the database schema in advance. On the other hand, non-relational databases shine when it comes to unstructured data. You have data and you want to throw it into a database - you don't want the database worrying about their relationships or constraints or data types, but rather how to quickly get you the data you're looking for.

                In my opinion, you should work with relational databases (ie. learning SQL) especially if you're starting out. Analyzing unstructured data from a non-relational database is much harder precisely because your data lacks structure.

Finally, it's worth knowing some commonly used database terminology (which is very hard to grasp unless you understand the big picture). In most enterprise settings, your data comes in as you conduct operations: you sell products, send emails, receive invoices, etc. This operational data makes up transactions or events, which are recorded from various software sources (eg. ERP, website, payments platform, etc.). The data is consolidated into a highly normalized online transaction processing (OLTP) schema with a focus on write performance (INSERT, UPDATE, DELETE). However, while this stores your data, it does not make it convenient to query (SELECT): for example, what were the top 10 products we sold in each city over the last 12 months? As a result, online analytical processing (OLAP) schemas evolved, which take this same operational data but optimize it for answering business questions. Going from OLTP to OLAP requires a process called extract-transform-load (ETL), with key business data ending up in a denormalized data warehouse. Data warehouses typically use a star schema or a snowflake schema, predicated on fact tables and dimension tables. Once your transactional data is in a data warehouse (again, just a periodic, transformed copy of your operational data), business intelligence (BI) tools, such as Looker, Tableau or Pentaho, connect to this data and simplify the process of answering business questions (eg. via query builders or visualization).

                That was a mouthful! To summarize: operational data comes in and is written to an OLTP schema, then copied and transformed into an OLAP schema via a process called ETL, and finally analyzed using BI tools. 

                3. Exploratory data analysis

                Once your data is persistent (ie. saved to disk and not memory), you need to know what's in it. Exploratory data analysis (EDA) is the systematic process of learning about your data: its contents, quality, relationships and assumptions. When people refer to "mastering the data," thorough EDA is an implicit assumption. Here are some important things to check for:

                • Search for and utilize any associated documentation
                • View all the columns (ie. even those which don't immediately fit on your screen)
                • Check data dimensions (ie. number of columns and rows; do the amounts make sense?)
                • Know the timezone of timestamps
                • Identify calculated fields (eg. a "conversion" column may be calculated from the columns "sessions" and "orders")
                • Determine units of measurement (eg. thousands or millions?)
                • Check for null values and how they are encoded (eg. -1, 999)
                • Check for duplicate entries
                • Verify character encoding set (Python's chardet is useful here)
                • Look for typos within manually entered data
                • Check column data types (eg. are they consistent?)
                • Create a bar chart for categorical variables
                • Use histograms to visualize data distributions instead of simple averages
                • Compute descriptive statistics for numerical variables (eg. count, min, max, mean, quartiles)
                • Create a line chart for time series (useful for identifying gaps, truncation or other anomalies in the time series)

                I often use Python pandas for EDA because it simplifies many common tasks: 
df.head(); df.tail(); df.shape; df.info(); df.describe(); df.nunique(); df.duplicated(); df.value_counts(); df.hist(); df.plot()

                4. Data cleaning and validation

                Data cleaning uses the insights gleaned from EDA to remediate errors in your data. For example, once you determine that null values are recorded as "None", you may convert these to the NULL data type. Or, you find outliers in your data which represent erroneous inputs - you may decide to drop these.
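
A minimal pandas sketch of those two fixes (the frame and its values are made up):

import numpy as np
import pandas as pd

raw = pd.DataFrame({'city': ['NYC', 'NYC', 'None'], 'orders': [12, 12, 9999]})

raw = raw.replace('None', np.nan)    # sentinel string -> real missing value
raw = raw[raw['orders'] < 9999]      # drop an erroneous outlier input
raw = raw.drop_duplicates()          # remove exact duplicate rows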

                Here are some common data integrity issues:

                • Fixing typos and standardizing categorical variables
                • Adjusting for outliers
                • Dropping duplicates
                • Parsing strings for timestamp information
                • Imputing/dropping null data

                Whether you're cleaning your data in Excel or a programming language, always remember to document your changes. This will considerably simplify retracing your steps when you or a colleague need to revisit the raw data.

                5. Formulating the business/research questions

                Coming up with the right questions marks the starting point for data analyses. Unlike the previous steps, data analysis is closely modeled after the scientific method: ideate, test hypotheses and analyze results - then repeat. Coming up with the right questions heavily depends on a contextual understanding of the data - that is, domain knowledge.

                This is where research comes in. Finding experts to speak with in person is likely the quickest and most thorough solution. Nevertheless, Google searches often suffice, if only to get a general understanding of important concepts. For example, what are the relevant key performance indicators (KPIs)? Or why does a certain variable exhibit weekly spikes? Or is the data obviously lacking certain information you'd expect it to have?

                This step is a testament to the importance of EDA: the more of your data you understand, the more you can focus on the parts you don't.

                6. Data transformation

                Before you can answer any business question, you will likely need to transform your data. This is distinct from data cleaning, which is concerned with the integrity of the raw data. Here, rather, you need to transform your data into a format more amenable to analysis. Since analyses will vary depending on the question asked, the required data transformations will vary as well. As a result, it's a good idea to keep a copy of your original, validated dataset, which you can go back to upon testing new hypotheses.

                Some common data transformations are (a short pandas sketch follows the list):

                • Split-apply-combine operations (theory and pandas implementation), a.k.a. group-by and aggregations
                • Calculating new columns as a function of other columns
                • Changing the frequency of a time series (ie. upsampling and downsampling)
                • SQL-like joins
                • Wide-to-long or long-to-wide operations (ie. pivoting)
                • Parsing strings with regular expressions
                • Discretization (or binning) of numerical variables
                • Subsetting/truncating portions of the data
                • Normalization (eg. z-score, Box-Cox power transformations, percent changes)
                • Creating dummy variables
                • De-trending a time series and adjusting for seasonality to make it stationary
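
A short pandas sketch of a few of the transformations above, using a made-up sales frame:

import pandas as pd

sales = pd.DataFrame({
    'date': pd.to_datetime(['2017-07-01', '2017-07-01', '2017-07-08']),
    'region': ['East', 'West', 'East'],
    'quantity': [3, 5, 2],
    'price': [2.50, 4.00, 2.50],
})

sales['revenue'] = sales['quantity'] * sales['price']                   # new column from other columns
by_region = sales.groupby('region', as_index=False)['revenue'].sum()    # split-apply-combine
weekly = sales.set_index('date')['revenue'].resample('W').sum()         # change time series frequency
wide = sales.pivot_table(index='date', columns='region',
                         values='revenue', aggfunc='sum')               # long-to-wide (pivot)
dummies = pd.get_dummies(sales['region'], prefix='region')              # dummy variables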
                   

                7. Data analysis

                Data analysis requires some degree of familiarity with statistics. This can range from topics like measures of central tendency and dispersion, to the conventionally taught frequentist statistics, all the way to the likes of machine learning and Bayesian statistics. For the vast majority of problems, intermediate statistics will provide satisfactory results. This is not to say that machine learning is not useful (quite the contrary), but rather, such sophisticated statistical methods are only worth investing in after intermediate statistics have been thoroughly exhausted. Differences in means, standard deviations and correlations, for example, can go a long way.

                Excel is a great solution for data analysis on a small scale, even if you're just using a sample of bigger data. For data too large for Excel but small enough to fit on your machine (eg. 2-4 GB), I would recommend analyzing your data using pandas. For large datasets, SQL or Business Intelligence (BI) tools are often used for analysis. Finally, for truly big data (eg. several TB), it makes sense to go with a distributed computing environment like Hadoop.

                Data analyses for business problems commonly fall under:

                • Descriptive statistics: information about the sample (eg. counts, totals, means, interquartile ranges, correlations)
                • Inferential statistics: does the sample represent the population?
                  • Retrospective: regression analysis and classification with clearly specified models
                  • Prospective: machine learning (ie. predictive analytics) with training data, test data, and often black box models
                     

                8. Data visualization

                The final step within data analysis is visualization. Data visualization is so important that I feel comfortable using the blanket statement: "no data analysis is complete without a visualization." No matter how comprehensive your tables may be, the fact remains that nobody likes expending extra effort to identify a trend from numbers. This is true regardless of your audience - even in a room full of rocket scientists, you will have fewer people internalizing your results if you use tables than if you use a visualization. Making a graph in Excel costs an extra 5-10 minutes; not making a graph can cost your audience's entire focus.

                Remember, tables and charts are not mutually exclusive! In all likelihood, you should use a table, as long as you keep it in the appendix.

                The following are some chart types which may be useful to visualize your data:

                • Histogram
                • Bar chart
                • Line chart
                • Scatter plot
                • Area charts
                • Heat map
                • Geographic map
                • Hexbin plot
                • Correlation tree
                • Box plot
                   

                Sanity checking every step of the way

                Sanity checking doesn't really follow a sequential order; rather, you should always be sanity checking your work. Without expressly including this in the data pipeline however, it's easy to forget. For example, scraping the wrong data, or dropping useful data, or leaving in duplicate data, are all errors that can seriously compromise your results. Sanity checks are unique to each step: what you check for during analysis and during extraction are completely different. 

                Here are a few common sanity checks for some of the steps. The list is by no means exhaustive, and I hope to add more over time as I run into them.

                Transformation

                • No null values gained or lost after grouping/join operations
                • Dataframe dimensions don't deviate significantly from expectations
                • No duplicates included in aggregations (eg. 2 identical rows counted in a sum operation)
                • Sum of some numeric value (eg. quantity, revenue) from original data matches that of transformed data
                   

                Analysis

                • Check your residual plots
                • Data is normalized
                • Time series is stationary
                   

                Visualization (more superficial)

                • Axes have names and units of measurement
                • Labels are discernible

                This wraps up some major topics in the field of data, from data extraction to analysis to presentation. In the next post, I'll demonstrate the data pipeline from start to finish using Python.

                Reclaiming the term "data science" (Part 1)

                This post comprises Part 1 of a 3-part series on data science in name, theory and practice.

When I first began delving into data science, I had trouble figuring out where to start. NoSQL databases, Hadoop, machine learning - these are buzzwords thrown around the community, yet none of them serve as useful entry points for beginners. Today's post is the one I wish I had read when taking my first steps into data science.

One of the most important steps when studying a new field is to understand the jargon - the discipline-specific terms that abstract away otherwise complex concepts. But the jargon does not come easy: starting at the wrong end of the complexity spectrum can have you, for example, tackling number theory before you've finished algebra. As a result, determining the order in which to learn the terminology must necessarily precede a comprehension of that terminology.

                Though "data science" has easily been the most popular term emerging from the 21st century data-related fields, it is by far the least accessible. Instead, I believe the data field is more intuitively segmented into four closely related, first-order terms, which I will attempt to define:

                • Data science: the implementation of advanced statistical concepts (eg. machine learning, Bayesian statistics, probability theory) primarily using modern software and (often big) data. As the engineer Josh Wills tweeted, data scientists are better statisticians than most programmers and better programmers than most statisticians.
                • Data analysis: the analysis of business and research questions using database querying (eg. SQL), data wrangling, intermediate statistics, problem-solving skills, domain experience and a knack for seeing the big picture. Data analysts are generalists (like business analysts and management consultants) while data scientists are typically specialists (like mathematicians and statisticians).
                • Data engineering: the extraction, validation and transformation of data, as well as the design of database architecture.
                • Big data (related, though orthogonal): the use of distributed computing to analyze massive datasets which cannot be processed on a single workstation or server.

You'll notice there's currently no single term, no field of study, no academic department, which unifies these terms. And yet, they should be studied in unison: any project which gathers data and tests hypotheses necessitates them. Absent any overarching field name, a data practitioner must traverse nominally discrete fields, each of which implies a considerable degree of specialty.

                "Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this roadmap.

                A general "data science" field has two properties which distinguishes it from other fields. The first is that its subfields all revolve around databases, or the storage of data (structured or unstructured). This means that data science is not so broad as to be the equivalent of information science, but rather a subset. Second, the subfields all offer methodologies and tools for applying the scientific method. For example, while something like data security is highly related to databases, it offers little use when analyzing business questions using data.

                What happens to all of today's "data scientists"? Surely, not everyone who works in the field of data science should suddenly carry a title reserved for statistical whizzes. Rather, today's data scientists should be renamed; "computational statisticians" or "statistical scientists" are two reasonable candidates. Frankly, "data" is far too broad a term to be restricted to advanced statistics.

                Most importantly, "data science" - the field - becomes accessible. Studying data science no longer means starting with algorithms and probability, but rather more introductory topics like like gathering, storing, curating, validating, transforming, analyzing and visualizing data. "Data scientist" as a job title loses meaning, just as "computer scientist" or "information scientist" are too general to describe a job function. "Data scientist" no longer signifies a senior, elevated "data analyst," but rather a fundamentally different career path. As a result, data analysts are no longer encouraged to pursue advanced statistics as the next step on their career path, but can instead branch out to a variety of specialties similar to their generalist counterparts in management consulting.

                Despite the popularity of data science's current usage, a redefinition of the term will clarify the field's direction and establish the foundation for a coherent, organized taxonomy for all data-related disciplines.

                  You should be using DBeaver

                  If you often work with PostgreSQL or sqlite3, you should be using DBeaver.

                  When I first started with Postgres, I was introduced to pgAdmin 3. pgAdmin is Postgres' de facto database management software (but practically speaking, IDE) that comes bundled with the installation of Postgres. The thing is: pgAdmin is terrible. It performs better than it looks, but not by much.

                   
                  A UI straight out of Windows 95...


                   


When I first started using pgAdmin, I believed this was how the SQL world was. A shoddy development environment was a simple fact of life. I would always have to live with two open windows (one for the database manager and one for the SQL query manager), typing out "SELECT ..." just to see the first few rows of my table, a lack of ER diagrams, and non-existent auto-formatting or auto-completion. Does pgAdmin work? Yes. Does pgAdmin work well? Not a chance. Using pgAdmin for writing SQL queries is like using gedit for your scripts - you can do it, but you're not going to like it.

                  You can imagine my delight when I learned about DBeaver. When all you've ever known was pgAdmin, DBeaver is an oasis in the desert. Not convinced? Here are some reasons why:

                  1) DBeaver is free.

                  Yes, there are other Postgres IDEs on the market (too many, in fact, which is why out of convenience everyone uses pgAdmin), but most aren't free. Why pay for a product whose quality is uncertain when you have great free products already on the market? This points to the success of editors/IDEs such as Atom, Sublime Text and Anaconda.

                  2) DBeaver is open-source.

                  I don't think I need to explain this one, save for the fact that many of its competitors are not.

                  3) DBeaver features one of the few post-2000 UIs.

DBeaver doesn't just look better than pgAdmin (which, honestly, isn't saying much) and many competitors - DBeaver looks good. The fonts are well-chosen, tab corners are rounded, the color scheme is unobtrusive, alternating rows are shaded, and the icons are intuitive. Querying one database versus another - often, your development or production environment - is a breeze: just a simple dropdown up top. Most importantly, if you don't like a window, you can get rid of it. Your setup can be as busy or minimalist as you please. Personally, the UI of my IDE makes a major difference in my productivity.

                   
                   

                  4) DBeaver features a data view.

                  I released an audible sigh of relief when I double clicked a table for the first time and, voilà, my data was just there. No "SELECT * FROM ...", no opening new windows.

                   
                   

                  Seeing my data alone is awesome. But DBeaver offers more... much more. I can click on column headers to sort my data. I can use the filter box above my results to specify which columns I want to filter and how. If I'm feeling downright lazy, I can right click on a cell and DBeaver will suggest a dozen ways to filter the data - this is especially useful if I don't feel like typing out long column names (admittedly, autocomplete doesn't work in the filter box). All of this is done without requerying the database.

For DBeaver, viewing data is not enough of course! DBeaver also lets you UPDATE, DELETE and INSERT data into your tables, all from the data view. Simply click on a cell, change the value (just like Excel), and CTRL+S sends an UPDATE command to the database. DELETEs work the same way. If you want to INSERT a row, there's an icon at the bottom of the table to do just that. Naturally, this is no substitute for learning your CRUD commands, but these features make those operations all too convenient.

                  5) DBeaver offers ER diagrams

                   
                   

                  Sufficient, concise and informative.

                  6) DBeaver uses just one window! Your SQL scripts lie in multiple tabs

                  Frankly, it's not that this is an impressive feat - it's just disappointing that pgAdmin doesn't offer this. You can also use CTRL+<PgUp>/<PgDown> to cycle through your tabs.

                  7) DBeaver's SQL editor includes auto-formatting, auto-completion and hyperlinks

                  Reading poorly formatted SQL is just as bad as reading poorly formatted code - sometimes worse because an unintended join can call into question the results of any data analysis. DBeaver has a setting which capitalizes your code as you type, and DBeaver can restructure your code to conform to commonly accepted syntax standards. 

                  DBeaver also has autocomplete, which is a boon for those of you who are used to pgAdmin - ie. no auto-complete!

Finally, DBeaver turns SQL objects you type in the editor into hyperlinks. A CTRL+click leads to the object and its associated metadata. I haven't really used this feature yet, but it seems like it could be a cool trick to show people at parties.

                  8) DBeaver shows metadata for database objects

                  Finally, DBeaver offers an abundance of metadata. Column names and their references, a table's DDL, estimated row counts, table constraints and more. This is information I use daily and it can sometimes be a pain in the ass to get with a more parsimonious IDE.

                   
                   

                  9) DBeaver has customizable... well, everything

                  Changing the hotkeys for autocomplete and new SQL script were priorities for me, but then I got a little carried away. Font faces, font sizes and tab colors all got some form of modification. I think my DBeaver layout is actually uglier now, but hey, at least I made it ugly.

                  10) DBeaver's creator and maintainer is extremely active

                  Serge, the creator and maintainer, does an amazing job at responding to issues, bugs and questions. Serge monitors the forums and his GitHub page. In the two times that I've had issues, both were attended to within hours. Did I mention he does this all for free? He's just a good guy.

                  11) DBeaver is cross-platform.

Windows? Linux? OS X? No problem. This doesn't seem like a major point until you decide to switch from Windows to Linux (as I did) and want to take your software with you. Some of it goes (sorry Microsoft Excel), some of it stays, and I'm glad to say that DBeaver made it over.


I would be remiss if I did not point out two flaws of DBeaver. The first: there is no "Server Status" view like the one in pgAdmin. This is huge because I still have to wearily boot up pgAdmin to check for slow or hung queries and database locks. The second: you can't see Postgres jobs. Personally, this is not a necessity, but a nice-to-have. Unfortunately, these are likely never to be implemented (for reasons that Serge has explained), so pgAdmin still lives on my machine at the very end of my task panel.

                  If you skipped to the very end of this article, I'll save you the trouble: just use DBeaver.

                  Iterative design

                  Mike Bostock, the creator of d3.js, gave an insightful lecture on design a couple years ago. Perhaps the most enduring takeaway from the entire talk was the title: "Design is a search problem." Bostock concisely defines the design methodology: try many designs and optimize the way in which you find the best one.

                  I'm no graphic designer, but I tried different designs for a logo I worked on to see how it would turn out.

                  Iteration 1

                  Iteration 2

                  Iteration 3

                  Iteration 4

                  The problem with OpenRefine: clean vs. messy data

                  I stumbled upon OpenRefine over half a year ago while researching tools that should be part of every data scientist’s toolkit. Originally developed at Google, OpenRefine is designed to transform messy datasets into organized, clean ones.

                  After several weeks of using the software, I feel the software’s raison d’etre must be emphasized. OpenRefine is designed for messy data. Said differently, if you have clean data that simply needs to be reorganized, you’re better off using Microsoft Excel, R, SAS, Python pandas or virtually any other database software.

                  First of all, what is “clean” data?

                  Clean datasets require very few (often one) transformations per column. For example, say you have a table of employees that includes a ‘date_of_birth’ column. You may need to convert this ‘date_of_birth’ string into a datetime type. To do this, you apply one operation (a datetime function) to the entire column. Or, let’s say you have an ID column whose values are in the string form ‘employee_id_1212459’ - split the string by underscores and take the last element. Even simple data replacement is still in the realm of a clean dataset: replace all values encoded as ‘999’ (ie. a filtered subset) with type Null. All of these clean datasets share one commonality: relatively few operations and subsets are required to reorganize the data. Most can be transformed using a few lines of code.
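
To make that concrete, here is roughly what those "few lines of code" look like in pandas (the column names and values are made up):

import numpy as np
import pandas as pd

employees = pd.DataFrame({
    'date_of_birth': ['1980-05-12', '1975-11-03'],
    'employee_id': ['employee_id_1212459', 'employee_id_1212460'],
    'salary': [55000, 999],
})

employees['date_of_birth'] = pd.to_datetime(employees['date_of_birth'])   # one operation per column
employees['id'] = employees['employee_id'].str.split('_').str[-1]         # split and take the last element
employees['salary'] = employees['salary'].replace(999, np.nan)            # sentinel 999 -> missing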

Messy data, by contrast, is when dozens of operations and subsets are required to reorganize the data. For example, consider a visitor who each day signs his name in the guest log slightly differently: “john smith”, “J Smith”, “John Smifh”, “Jon Smith”. Several operations are needed to transform this column – perhaps even a transformation for each entry. The more ad hoc the programming rules become, the messier the dataset.

                  In fact, what differentiates clean data from messy data is not organizational structure but data consistency. While clean datasets may not be organized as you’d like (eg. datetimes are stored as strings), they are at least consistent. Messy datasets, on the other hand, are defined by inconsistency: typos abound and there is no standardization on how the data should be input. You will only ever find messy datasets, practically by definition, when humans input the data themselves.

                   
                   

                  In today’s age of data abundance, the vast majority of data is not produced by human entry. Though sometimes carelessly organized, the data is clean. This is not where OpenRefine excels. Why?

                  Because while OpenRefine can work with clean data, so can a whole host of other software. Need to replace all values greater than n? Filter by value and drag down your replacement value in Excel. Need to group together similar names? Remove duplicates and sort, then manually edit the group names together and VLOOKUP old names to new group names (ie. a crosswalk). Have a complicated regular expression you need to implement? It’s probably easier to use a programming language of your choice than try to learn the Google Refine Expression Language (GREL).

Not only can these tools work with clean data, they are often better at it. OpenRefine hiccups over larger datasets and has buggy loading when it comes to nonstandard file formats (like .xlsx). It’s also not a relational database so you can’t (easily) join together multiple tables – everything has to be on one giant table. If you’re already proficient in Excel, there’s an Excel formula for virtually everything you’d find in the GREL (except a few convenience functions like .reverse()). Any advanced functionality in OpenRefine requires learning the GREL, at which point you’re better off using a more widely-used programming language.

                  Ok, so we’ve determined that OpenRefine really doesn’t add much when it comes to clean data, which brings us back to the original point: OpenRefine is meant for messy data. Here, in its limited applicability, OpenRefine really shines. Clustering can save a lot of manual work for grouping together inconsistently typed strings (even, for example, identifying “John Smith” and “Smith, John” as a group, which can be difficult to keep track of when going through a sorted list by hand). Additionally, OpenRefine displays an easy-to-access histogram (via its numeric facet) for columns housing inconsistent data types. In one view, it shows you numeric outliers, as well as counts of strings and blanks in the column. Of course, this can all be done in Excel or a programming language, but not as conveniently.

                  All in all, OpenRefine does one thing really well: traverse messy data. If you’re not sure that you’re dealing with messy data (hint: you probably aren’t), OpenRefine can quickly show you the breakdown of each of your columns. Especially when it comes to similar strings, OpenRefine can help enormously in clustering these. Beyond that, however, you’re better off sticking with your more familiar tools in lieu of learning a new one.