After validating, cleaning and crosswalking the Fatality Analysis Reporting System (FARS) dataset, I was curious if certain variables had a larger effect on driver fatalities than others. For example, how did vehicle size affect the odds of dying, or previous DWI convictions? Which has a stronger effect: drinking or drug use?
Since dying (or not dying) is a binary outcome, I used a multiple binary logistic regression model. Many independent variables of interest were categorical (e.g. restraint use, weather, sex, surface conditions), while others were continuous. As a result, it was important to convert the categorical values into dummy variables before feeding them into the regression. Fortunately, pandas makes this really simple:
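A minimal sketch of that conversion with `pd.get_dummies` (the tiny DataFrame here is an illustrative stand-in for the cleaned FARS data, not the real file):

```python
import pandas as pd

# Illustrative stand-in for the cleaned FARS data
df = pd.DataFrame({"SEX": ["Male", "Female", "Male"]})

# One-hot encode the categorical column into 0/1 dummy columns
dummies = pd.get_dummies(df, columns=["SEX"])
print(dummies.columns.tolist())  # ['SEX_Female', 'SEX_Male']
```

Passing `drop_first=True` to `get_dummies` drops one dummy per category automatically, which handles the reference-group step described next.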
For example, a "SEX" column that accepts "Male" or "Female" will be split into two columns, SEX_Male and SEX_Female, each filled with their respective 1s and 0s. However, in order to avoid the dummy variable trap, one dummy needs to be dropped for each category - it becomes the reference group to which all other categories are compared.
Some additional housekeeping: I add a constant to the model and impute the mean for missing observations among the continuous variables (a simplistic assumption, but probably better than dropping all those observations). Below are the results of the model:
The coefficients lack direct interpretation, however, because they're log-odds, not odds. We can exponentiate each coefficient to calculate odds ratios. Here's the resulting table:
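The exponentiation itself is a one-liner; the coefficients below are illustrative (in statsmodels they would come from `model.params`, and the constant here is made up), but the AGE and REST_USE values roughly match the figures discussed in this post:

```python
import numpy as np
import pandas as pd

# Illustrative log-odds coefficients; the constant is invented for the example
log_odds = pd.Series({"const": -1.20, "AGE": 0.027, "REST_USE": -1.83})

# Exponentiating a log-odds coefficient gives an odds ratio
odds = np.exp(log_odds)
print(odds.round(3))
```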
This table yields more intuitive explanations. For example, holding other variables constant, "DEFORMED_Disabling Damage" has an extremely strong effect - your odds of dying are over 7 times higher if your vehicle is totaled than if it isn't (which is obvious). Here's the same table, visually:
Some other findings from the model include:
- "DR_DRINK_Drinking" is lower than "DRUG_DUMMY" - those using drugs are more likely to die than those who drink while driving.
- The "BODY_SIZE" variable follows the trend we would expect: from "Medium" up to "Very large", the odds of dying decrease the larger the vehicle you are in (when compared to small vehicles like sedans). For example, the odds of dying in a very large vehicle are 19% as high as those in a small vehicle.
- "SEX" does not seem to have a large effect on the odds of dying, nor do previous suspensions ("PREV_SUS").
- "AGE" is 1.027, meaning that for each additional decade of age, the odds of dying 1.3x higher (1.027^10).
- "WEATHER_Snow" appears to have significantly lower odds of dying than in clear weather, probably because most accidents happen during normal, light conditions.
- Negligent driving ("DR_SF"), drunk driving ("DR_DRINK_Drinking"), using drugs while driving ("DRUG_DUMMY") and speeding ("SPEED_DUMMY") all increase the odds of dying in a motor vehicle (as expected), with negligent driving the least and driving while using drugs the most.
- For each additional occupant in the vehicle, the odds of the driver dying almost halve. This is likely because, with more occupants, the fatality in a fatal crash is less likely to be the driver (it could be one of the passengers instead). This is a data artifact and shouldn't be interpreted causally - having more people in the vehicle does not itself protect the driver.
- For restraint use ("REST_USE"), the odds of dying when wearing a seatbelt were 16% as high as those without.
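The per-decade figure quoted for "AGE" above is just the per-year odds ratio compounded over ten years:

```python
# Per-year odds ratio for AGE, from the model table
per_year = 1.027

# Compounded over ten years of age: roughly a 1.3x increase per decade
per_decade = per_year ** 10
print(round(per_decade, 3))  # 1.305
```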
If the 2014 FARS dataset, to be released this coming December, follows the same pattern as the 2013 data, I expect these variables to have some predictive value for which drivers survive and which die. We can use the scikit-learn package to implement a really simple logistic regression model:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = X.values, y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=35)
LR = LogisticRegression()
LR = LR.fit(X_train, y_train)
LR.score(X_test, y_test)
> 0.80207515423443632
```
The model should correctly predict who died and who survived about 80% of the time. To prevent overfitting to that specific slice of training data, we can use cross validation to fit a model to 5 different slices of the data:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=5)
print(scores.mean())
> 0.800823907131
```
About the same, so overfitting to the training data doesn't seem to be an issue here.
All in all, this analysis confirms some well-known priors: driving a larger vehicle, being younger, using a seatbelt and driving safely are all significant factors in surviving a fatal car accident.
Code and supporting files can be found on GitHub.