Visualizing the opinions of the world's leading economists

Several times a year, Chicago Booth's IGM Economic Experts Panel surveys dozens of leading economists on major public policy issues, ranging from the effects of a $15 minimum wage to those of Brexit. The panel publishes summary statistics for each survey, such as how many economists agreed or disagreed with the survey question and their confidence in their responses. However, I was interested in the other ways the data could be explored: for example, do economists from different research institutions vote differently? Which economists respond to the survey most and least often? Which economists write the most verbose comments?

This project presents the results of a series on data science in name, theory and practice. I chose this dataset because analyzing it required every step of the data pipeline, from scraping to cleaning to analyzing and visualizing data. Though I could have included a couple more graphs (they are in the code), I decided to show only the most interesting ones. First, some summary statistics about the dataset - there are:

  • 132 survey topics between Sept. 2011 and today
  • 195 survey questions (i.e., some topics include more than one question)
  • 51 surveyed economists (40 male, 11 female), between 6 and 9 from each university

And now for some charts...

Which economists responded to the survey most and least often?

We can see that the majority of economists responded over 80% of the time. Quite a few economists responded to every survey question. There seems to be no pattern linking economists' response rates to their university affiliation.

Which economists commented most and least frequently?

I expected an inverse relationship between an economist's number of comments and their comment length - as they comment more often (bar height), the length of their comments (bar color) would diminish. However, there seems to be no such pattern - for example, Anil Kashyap comments the most often yet is also in the highest quintile of average comment length. Additionally, there is greater variability in comment frequency than in survey response rate, likely because including a comment takes more time and effort.

What was the relationship between an economist's average confidence in their responses and the variability of their confidence?

I wasn't sure what I would see here, and I'm still not sure too much can be interpreted from this graph. Excluding (arbitrarily) the outliers on the left and the right, there appears to be an inverse relationship between mean confidence and variability of confidence. In other words, as an economist becomes more confident in her responses, she generally uses a narrower range to report her confidence.

And finally, is there a difference between responses of economists at freshwater (Chicago) and saltwater (Harvard) universities?

Here, I sampled only the 30 most recent survey topics to avoid overcrowding the visualization. From this graph (and others, such as plotting all universities), it looks like the survey questions are fairly uncontroversial. Most often, economists from different universities agree with each other, and on topics where they disagreed, it was often due to small sample size (e.g., one economist disagreeing with another). At least from visual inspection, there doesn't appear to be much disagreement between the economic schools of thought on major public policy issues.


If you found these graphs interesting, feel free to explore the data on your own! The Python pickle file is included along with the code for the graphs. In Part 3 of the series on data science, I walk through the data pipeline by example, from extraction to visualization.
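(Loading the pickle takes only a couple of lines; the filename below is a placeholder, not necessarily the name of the file in the repository:)

import pickle

# Load the scraped survey data - replace the path with the pickle file from the repository
with open('igm_survey_data.pkl', 'rb') as f:
    data = pickle.load(f)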

The accompanying code and data are on GitHub.

Building a personalized recommendation engine for events in Boston

Searching for new, interesting things to do in Boston is a great way to get out of the house, but it can also be time-consuming. For most local aggregators, volume of events is the objective, not quality. I set aside an hour each Sunday to browse these events. It's not an activity I really look forward to - it's a manual and tedious process. If an event fits certain criteria (price, relevance, location), it's an event I'd like to attend. If not, skip.

When I thought about the problem formulaically as above, I realized it would be pretty simple to set up as a program: build a basic front-end that interfaces with a database of events, and a back-end that houses those events and makes predictions from my preferences.

I ended up scraping data from The Boston Calendar because it is the best local events aggregator I could find. Once a week, I pull new events for the upcoming week and receive an e-mail asking me to mark events I like. The more events I like, the better the algorithm learns my preferences and the better its recommendations become.
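Under the hood, the prediction step can be quite simple. Here is a minimal sketch of one way to learn those preferences - TF-IDF over event descriptions plus a logistic regression - where the file and column names ('events.csv', 'description', 'liked', 'title') are hypothetical and the actual app's approach may differ:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One row per event, with the scraped 'description' and my 1/0 'liked' label
events = pd.read_csv('events.csv')
labeled = events.dropna(subset=['liked'])

# Learn which words show up in events I tend to like
vectorizer = TfidfVectorizer(stop_words='english')
model = LogisticRegression().fit(vectorizer.fit_transform(labeled['description']), labeled['liked'])

# Score this week's unlabeled events and surface the most promising ones first
new_events = events[events['liked'].isna()].copy()
new_events['score'] = model.predict_proba(vectorizer.transform(new_events['description']))[:, 1]
print(new_events.sort_values('score', ascending=False)[['title', 'score']].head(10))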

The live webapp is located here and the code is located on GitHub.

How positive are 2016's candidates during presidential debates?

Update: this article was featured in U.S. News & World Report.

While watching some of 2016's presidential debates, I remember having the distinct feeling that some candidates appeared to view the United States as a country replete with problems to solve rather than opportunities to grow. Indeed, it seemed a far cry from President Obama's rallying call of hope, the ideological platform that propelled him to the White House in 2008. While diction may be no more than how one frames the same situation, a candidate's level of positivity can play an important role in how they are viewed by the public.

Using the full debate transcripts from UC Santa Barbara's "The American Presidency Project", I evaluated just how positive the candidates were during debates. For each debate, I combined each candidate's speech, then passed it through a natural language processor located here (see footnote 1) to calculate the sentiment.
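The processor linked above uses nltk under the hood (see footnote 1). As a rough local approximation, here is a minimal sketch using nltk's VADER analyzer - a different off-the-shelf model, so its scores won't exactly match those in the charts below:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the lexicon VADER needs
analyzer = SentimentIntensityAnalyzer()

# In the actual analysis, this would be all of one candidate's combined speech from a single debate
text = "We will bring jobs back, and our best days are ahead of us."
print(analyzer.polarity_scores(text))  # returns 'neg', 'neu', 'pos' and 'compound' scores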

Both Hillary Clinton and Bernie Sanders maintained a fairly high level of positive language throughout their debates, only dropping to 50% in the February 11th debate.

Jeb Bush, Ted Cruz and Marco Rubio tilted similarly toward positive language, on par with the Democratic candidates. Chris Christie spiked lower at times (sometimes significantly), as did John Kasich.

However, there is one candidate who, as usual, found a way to differentiate himself:

The data speaks for itself: Donald Trump is extremely negative.

All in all, there is more positive language than I expected. In a later project, it could be worthwhile to compare these results to President Obama's debate transcripts from 2008 and 2012.

1. The sentiment scores come from Python's Natural Language Toolkit package ("nltk"). The underlying model was trained on Twitter messages and movie reviews - as a result, it may not be as accurate as a model trained on actual political speeches.

Code, data and graphs for this project can be found on GitHub.

How many MBTA vehicles are running in Boston right now?

Whenever I waited dejectedly for the T, I always wondered: just how many vehicles are running? Am I waiting so long because there are too few vehicles on the route? How many are typically running - and is it more or less than that right now? While I'm at it, is it particularly busy now? A trip's duration is a proxy for how busy the route is (because the T spends longer at each stop picking up and dropping off passengers), so can I estimate how busy it is throughout the day?

To answer these questions, both visually and dynamically, I decided to make a webapp...

Ok, that's not really true. In reality, I wanted to develop my first webapp while learning new technologies along the way. No PHP, no apache, no upstart daemons.

By that measure, I consider this project a massive success. I developed a solid understanding of a web framework (django), a templating language (Jinja2), a webserver (nginx), virtual environments (virtualenv), a distributed task queue (celery), a task manager (supervisor), a caching system (memcached), a CSS framework (Bootstrap 3), responsive web design, AJAX (using jQuery), an emergent plotting framework (plotly.js), load testing (loader.io) and finally Google Analytics. I also continued to refine my knowledge of pandas, SQL, git, bash and Unix permissions. I'm extremely satisfied with the technical functionality of the app and, to my surprise, I'm not disappointed with the design either.

Nevertheless, one thing about the app bugs me: it's kind of useless. Sure, it's neat to see how many vehicles are on the track, but more vehicles than usual doesn't necessarily mean the wait time is any less. Similarly, trip duration from start to finish isn't what people actually care about - they want to know how long their commute takes throughout the day. Only after getting fairly deep in the project did I realize that, from a user experience standpoint, the app is a failure.

I'm hopeful, though. Now liberated from my obsessive need to understand the technical details of a webapp above all else, I learned a valuable lesson: user experience comes first. I'm optimistic my future projects will improve on this.

The website is embedded below and is accessible in full here.


UPDATE [11-27-2016]: This app unfortunately used a lot of resources on my lightweight t2.small EC2 instance, so I decided to retire it. I've included some images below for posterity. As always, the code can be found on my GitHub.

2013's U.S. fatal automobile accidents: a statistical analysis (Part 2)

After validating, cleaning and crosswalking the Fatality Analysis Reporting System (FARS) dataset, I was curious whether certain variables had a larger effect on driver fatalities than others. For example, how did vehicle size or previous DWI convictions affect the odds of dying? Which has a stronger effect: drinking or drug use?

Since dying (or not dying) is a binary outcome, I used a binary logistic regression model with multiple predictors. Many independent variables of interest were categorical (e.g., restraint use, weather, sex, surface conditions) while others were continuous. As a result, it was important to convert the categorical values into dummy variables before throwing them into the regression. Fortunately, pandas makes this really simple:

pandas.get_dummies(dataframe)

For example, a "SEX" column that accepts "Male" or "Female" will be split into two columns, SEX_Male and SEX_Female, each with their respective 1s and 0s. However, to avoid the dummy variable trap, one dummy needs to be dropped for each category - it becomes the reference group to which all other categories are compared.
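(pandas can drop the reference level for you; a quick illustration with made-up values:)

import pandas as pd

# drop_first=True drops one dummy per categorical column, avoiding the dummy variable trap
df = pd.DataFrame({'SEX': ['Male', 'Female', 'Female'], 'WEATHER': ['Clear', 'Snow', 'Clear']})
print(pd.get_dummies(df, drop_first=True).columns.tolist())
> ['SEX_Male', 'WEATHER_Snow']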

Some additional housekeeping: I add a constant to the model and impute the mean for missing observations among the continuous variables (a simplifying assumption, but probably better than dropping all of those observations).
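A minimal sketch of this fitting step, using statsmodels - the dataframe and column names here are placeholders, not necessarily the ones in my code:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# df holds the cleaned FARS data; 'DIED' is the 1/0 outcome for the driver (placeholder names)
X = pd.get_dummies(df.drop(columns=['DIED']), drop_first=True)
X = X.fillna(X.mean())          # impute the column mean for missing continuous values
X = sm.add_constant(X)          # add the constant term

result = sm.Logit(df['DIED'], X).fit()
print(result.summary())         # coefficients are reported as log-odds
print(np.exp(result.params))    # exponentiating gives odds ratios

Below are the results of the model: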

The coefficients lack direct interpretation, however, because they're not odds - they're log-odds. We can exponentiate each coefficient to convert it to an odds ratio. Here's the table below:

This table yields more intuitive interpretations. For example, holding other variables constant, "DEFORMED_Disabling Damage" has an extremely strong effect - your odds of dying are over 7 times higher if your vehicle is totaled than if it isn't (which is unsurprising). Here's the same table, visually:

Some other findings from the model include:

  • "DR_DRINK_Drinking" is lower than "DRUG_DUMMY" - those using drugs are more likely to die than those who drink while driving.
  • The "BODY_SIZE" variable follows the trend we would expect: from "Medium" up to "Very large", the odds of dying decrease the larger the vehicle you are in (when compared to small vehicles like sedans). For example, the odds of dying in a very large vehicle are 19% as high as those in a small vehicle. 
  • "SEX" does not seem to have a large effect on the odds of dying, nor do previous suspensions ("PREV_SUS").
  • The odds ratio for "AGE" is 1.027 per year of age, meaning that for each additional decade of age, the odds of dying are about 1.3x higher (1.027^10 ≈ 1.31).
  • "WEATHER_Snow" appears to have significantly lower odds of dying than in clear weather, probably because most accidents happen during normal, light conditions.
  • Negligent driving ("DR_SF"), drunk driving ("DR_DRINK_Drinking"), using drugs while driving ("DRUG_DUMMY") and speeding ("SPEED_DUMMY") all increase the odds of dying in a motor vehicle (as expected), with negligent driving the least and driving while using drugs the most.
  • For each additional occupant in the vehicle, the odds of the driver dying almost halve. This is likely because the more occupants there are, the lower the odds that the recorded fatality was the driver (since it could instead be one of the passengers). This is a data artifact and shouldn't be interpreted causally, i.e., as though carrying more people in the vehicle decreases the driver's odds of dying.
  • For restraint use ("REST_USE"), the odds of dying when wearing a seatbelt were 16% as high as those without. 

If the 2014 FARS dataset, to be released this coming December, follows the same pattern as the 2013 data, I expect these variables to have some predictive value for which drivers survive and which die. We can use the scikit-learn package to implement a really simple logistic regression model:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.linear_model import LogisticRegression

# Hold out 40% of the data for testing
X, y = X.values, y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=35)

# Fit on the training set and score accuracy on the held-out test set
LR = LogisticRegression()
LR = LR.fit(X_train, y_train)
LR.score(X_test, y_test)
> 0.80207515423443632

The model should correctly predict who died and who survived about 80% of the time. To check that we're not overfitting to that specific slice of training data, we can use cross-validation to fit the model on 5 different slices of the data:

from sklearn.model_selection import cross_val_score

# Mean accuracy across 5 cross-validation folds
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=5)
print(scores.mean())
> 0.800823907131

About the same, so overfitting to the training data doesn't seem to be an issue here. 

All in all, this analysis confirms some well-known priors: driving a larger vehicle, being younger, using a seatbelt and driving safely are all significant factors in surviving a fatal car accident.

Code and supporting files can be found on GitHub.

Where are food inspection violations in Boston located?

Jerry Genser (a colleague) and I thought it would be interesting to analyze one of Boston's many public datasets, located here. We were curious how food inspection violations changed over time and across areas - and, most importantly, how this could best be visualized. Below is a looping video of our results. The live webapp can be viewed here and the code here.

A dynamic approach to security analysis

The motivation for this project was to calculate and visualize important financial metrics for any publicly traded security, as well as illustrate how significantly results can change by altering input assumptions. Traditionally, these calculations are performed in Excel. This process is time-consuming, error-prone and produces static output. To my knowledge, there is no freely available online application that allows various input assumptions to be flexibly changed. Leveraging R to automatically run these calculations yields useful financial information extremely quickly.

Although some important financial statistics are readily available online, such as betas and realized returns, they provide an incomplete picture of stock returns. For example, despite being widely interpreted as a measure of risk, standard deviation provides a narrow, one-dimensional view of a stock's variability. In reality, stock returns are rarely normally distributed, so moments like skewness and kurtosis cannot be assumed away.

Similarly, the CAPM is widely used as the de facto model for calculating required return, largely due to its simplicity. But while the CAPM empirically explains about 70% of the variability in market returns, the Fama-French three-factor model explains over 90%. Despite this, CAPM calculators are abundant online while Fama-French ones are virtually nonexistent.
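To illustrate how little code the Fama-French calculation itself requires, here is a rough sketch - in Python rather than the app's R, and assuming a dataframe already merged with the monthly factor data; the column names are placeholders:

import statsmodels.formula.api as smf

# df (placeholder): 'exret' is the stock's return minus the risk-free rate;
# 'mktrf', 'smb' and 'hml' are the three Fama-French factors for the same periods
ff_model = smf.ols('exret ~ mktrf + smb + hml', data=df).fit()
print(ff_model.params)      # the factor loadings (betas)
print(ff_model.rsquared)    # share of return variability explained by the three factors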

Notably, these calculations are not difficult to perform. Heavily tested and widely used programming libraries perform them automatically. For these reasons, this webapp offers a more comprehensive view of stock returns (click on the image below to view the live webapp).

All code and .csv files can be found on GitHub.

When should you go to the RMV in Massachusetts?

Update: this article was featured in The Boston Globe.

As I approached my 24th birthday, I realized that my driver's license was about to expire. I'd have to go to the RMV. These agencies are notorious for long wait times so naturally I wondered when the best time to go would be. Certainly not lunch hours, probably not close of business either - perhaps right before then?

Fortunately, MassDOT posts its wait times for licensing and registration; for example, the Boston RMV location's wait times are posted here. But the page only gives you a point in time - it doesn't tell you how the wait time has changed over time.

I decided this would be a pretty simple task, starting with a Python script: query all RMV location websites every minute and store the results in a database (e.g., a .csv file), host that database on Amazon EC2, run an R Shiny server on it and have an R Shiny webapp auto-update from the database.
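The scraping piece is conceptually just a polling loop. A stripped-down sketch of the idea - the URL and the parsing below are placeholders, and the real script handles every branch's MassDOT page:

import csv
import re
import time
from datetime import datetime

import requests

URL = 'https://example.com/rmv/boston'  # placeholder for a branch's wait-time page

def parse_wait_time(html):
    # Hypothetical parser; the real pages need their own branch-specific parsing
    match = re.search(r'Licensing Wait Time:\s*([\w: ]+)', html)
    return match.group(1).strip() if match else None

while True:
    wait = parse_wait_time(requests.get(URL).text)
    with open('wait_times.csv', 'a', newline='') as f:
        csv.writer(f).writerow([datetime.now().isoformat(), 'Boston', wait])
    time.sleep(60)  # query every minute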

For simplicity, I wondered what a "typical" week looked like - the following R Shiny app takes the entire database and averages each minute for each weekday, Monday through Friday. Here are the results (click on the image to see the live webapp - about 30s load time):

Somewhat as I expected (though it varies by location), the best times to go are right before close of business and between 10-11am. What I didn't expect is how much it varies by day (e.g., Friday is a terrible day to go; Wednesday is pretty good). As this database keeps querying the RMV websites and grows, it will paint an increasingly accurate picture of when you should go to the RMV.

Although the practical application of this webapp is apparent, I mostly did it because it seemed like a multi-disciplinary project. And it was. I was able to get my hands dirty with R (backend code), R Shiny (webapp interface), R ggplot2 (plotting), Amazon AWS (virtual server), apache (webserver), git (code version control), Python scraping and scheduled live data refreshes - all in a single project.

If anyone ever wants to build their own app, or simply access the data, it's available online. The data, which is live and scraping the RMV websites each minute, can be found here. The code (Python and R) can be found here.

I also thought it'd be useful to map the RMV locations - if another location is not too far away and has persistently lower average wait times, it probably makes sense to drive there instead:

Charity efficiency: is bigger better?

The Better Business Bureau (BBB) is an institution in the United States that collects self-reported data on thousands of American businesses and charities - data which can be found here. For each BBB-accredited charity (and many non-accredited ones that choose to report their data), you can find important information such as their assets, liabilities and income, as well as how they allocate their funds between programs, fundraising and administration.

We can define efficiency as the amount of funds spent on programs divided by the charity's total income. Then we can plot efficiency against charity size (measured by net assets). Are larger charities more efficient?
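The calculation itself is a one-liner; a small pandas sketch with hypothetical file and column names:

import pandas as pd

charities = pd.read_csv('bbb_charities.csv')  # hypothetical export of the BBB data
charities['efficiency'] = charities['program_spending'] / charities['total_income']
charities['net_assets'] = charities['assets'] - charities['liabilities']

# Efficiency against size; a log scale helps because net assets span several orders of magnitude
charities.plot.scatter(x='net_assets', y='efficiency', logx=True)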

The answer is a clear no: there is no correlation between charity efficiency and size. More specifically, if we look at just the largest charities (i.e., those with net assets between $100M and $1B), we again see no correlation.

Perhaps our initial definition wasn't the best measure of efficiency - after all, money spent on further fundraising may still be useful. So we can redefine "efficiency" as non-administrative spending, or: 1 - administrative spending / total income. We now plot non-administrative spending against size:

Again, we see no correlation between this alternative measure of efficiency and size. But we also see another important result: virtually every large charity spends less than 20% on administrative costs. Said otherwise, for every dollar you give to one of these charities, at least $0.80 is going to your cause (in some way). That's comforting.

Regressing efficiency (whether measured as funds spent on programs or as non-administrative spending) on other readily available variables like state, tax-exempt status or staff size yields no correlations. In other words, none of these variables explains the efficiency of charities. To say more, we would need more detailed data: variables that are significantly correlated with some measure of efficiency.

Perhaps in a follow-up analysis - here, at least, a negative result is still a result.

Code and .csv's can be found at bit.ly/1qdlRRB

NBA: How do PER and win percentage correlate?

The player efficiency rating, or PER, is a widely used metric in professional basketball to quantitatively determine how "efficiently" players play. For example, shooting a low percentage, being on the court without collecting any stats, or a high turnover rate all make a player less efficient. More information can be found here.

Notably, PER is an individual metric, not a team metric. Adding up the PER for every player on a team may say something about the team, but it also misses everything about the effects of teamwork, coaching, team defense and so on. Despite this, does team PER explain anything about a team's success?

First, we can graph each team's median and mean PER. As of February 1st, 2014, the teams on the left have the highest win percentage and the teams on the right the lowest. If team PER largely explained a team's win percentage, we'd expect a cleanly decreasing line: as team PER declines, so does win percentage. That's not what we see, however:

Some interesting observations: the Jazz have a higher median PER than the Thunder and Pacers. Dallas leads the league in mean PER. For most teams, player PERs are fairly normally distributed. And of course, Kevin Durant's PER (the only observation above 30) is absurd relative to a league average of 15.

There doesn't appear to be a strong relationship between team PER and win percentage, but we can get more precise. We can regress win percentage (data found here) on team PER and get:

So there's clearly some sort of relationship, but how strong is it?

lm(formula = team.data$WIN_PCT ~ team.data$PER)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.73201    0.24308  -3.011  0.00546 **
team.data$PER   0.08203    0.01612   5.087 2.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1171 on 28 degrees of freedom
Multiple R-squared: 0.4803, Adjusted R-squared: 0.4618
F-statistic: 25.88 on 1 and 28 DF, p-value: 2.185e-05

The relationship between PER and win percentage is significant (i.e., "something is definitely going on here"), but our model only explains about 48% of the variation in win percentage, so it's not great for predictions. How does team PER correlate with a team's offensive efficiency or defensive efficiency? We can graph that as well:

The relationship is significant and, here, PER explains about 79% of the variation in a team's offensive efficiency. For defensive efficiency: 

This time, the relationship is insignificant, with a p-value of .806 and PER explaining only 0.2% of the variation in defensive efficiency. In other words, the model is extremely poor: PER says nothing about how a team performs defensively.

Intuitively, this makes sense. PER not only captures more offensive stats - offensive stats are also easier to come by. How tightly Andre Iguodala guards an opposing player isn't recorded, but his field goal percentage is. PER would naturally do a better job of explaining a team's offensive success, and that offensive success leads to a higher win percentage (so PER is correlated with a higher win percentage). However, the reason it doesn't do a better job of explaining win percentage (only 48%, versus 79% for offensive efficiency) is that defensive efficiency is, as shown above, almost completely unrepresented. So if defensive efficiency is important to a team's success, PER is clearly missing something in explaining win percentage. And, as you can guess, defensive efficiency is very important to a team's win percentage:

For its high win percentage, Indiana has a very mediocre offensive efficiency, but as apparent in the second graph, its defensive efficiency is anomalously good (a lower number is better). Milwaukee is exactly where we would expect it: very poor offensive and defensive efficiency, and therefore the lowest win percentage. Miami is the opposite of Indiana: very high offensive efficiency (along with Portland), but mediocre defensive efficiency. One final interesting fact: Portland and Miami have virtually the same offensive efficiency and win percentage, yet Portland is somehow significantly worse on the defensive end. I'm not sure how to explain this observation other than that these are imperfect metrics.

To summarize, team PER is definitely correlated with win percentage and a team's offensive efficiency, but does nothing to explain what happens on a team's defensive end.

Code and .csv's can be found at http://bit.ly/1fMYJQY.