While organizing a data science workshop this summer, I realized that I hadn't ever written a blog post about the data science project I worked on last year. So, this post will be a summary of what I learned from working on the project.
The problem I worked on was Kaggle's Fire Peril Loss Cost. Given over 300 features, including weather and crime, for over one million insurance policies, we wanted to predict how much the insurance company would lose on a policy due to a fire. This is tricky as fires are rare events, and thus almost all policies have no loss.
We first implemented one-hot encoding to turn the categorical variables into boolean arrays indicating whether a certain category applied to a policy or not. This is better than assigning a number in succession to each category (e.g. Category 1=0, Category 2=1, Category 3=2,...) as doing this will assign a hierarchy/metric on the data, which could create spurious relations. One this encoding was done, all of our data was in a numerical form that was amenable for use in machine learning algorithms.
With so many features on so many policies, the entire dataset would not fit in memory on my laptop. Also, we found that not all 300 features were good predictors of the target value. Thus, we spent time selecting the most important features to make predictions.
There were a manageable amount of categorical variables, so we kept all of these, though we did try removing some of them to see if there was any performance benefit from doing so. For each continuous variable, we tried a model where the prediction was simply the value of the variable. Since the evaluation metric (a weighted gini index) only depended on the ordering of the predictions and not the magnitudes, this analysis method was amenable for all of the continuous variables. Notably, we found that one of the variables (var13) was already a reasonably good predictor of the target. We kept the 30 continuous features that scored best under this metric. We chose this selection method over other more common feature selection methods (such as PCA) to avoid some of the stability issues associated with them, but it may have been interesting, given more time, to see how various feature selection methods fared with one another.
Since only ~0.3% of policies had any loss, we considered using a classifier to first identify the policies with loss. Ideally, the rate of policies with loss identified by this classifier as not having loss would be sufficiently low. Then a regressor could be used on the policies identified by the classifier as having loss, and then the training set would be less singular. We tried a few classifiers, but did not have much success with this approach.
We then tried to use regressors directly. We tried many of the machine learning regressors available in scikit-learn. We found good results from ridge (Tikhonov) regression and gradient boosted decision trees. In the end, we ended up combining the predictions of the two methods, which will be discussed a bit later.
Ridge regression is similar to standard linear regression, but instead of just minimizing the 2-norm of the vector of residuals, there is a penalty term proportional to the two-norm of the vector of coefficients multiplying the features. This penalty for feature coefficients ensures that no feature coefficients become too large. Large feature coefficients could be a sign of a singular prediction matrix and thus could fluctuate wildly. We found through testing that the optimal constant in front of the 2-norm of the vector of feature coefficients was quite small, so the ridge regression was acting quite similarly to linear regression. Note that with ridge regression it is important that all the features are normalized, as not doing this affects the size of the 2-norm of the vector of feature coefficients.
A decision tree is a set of rules (decisions) used to group policies into different classes. These rules are simple ones such as "is var13>0.5?" These rules are chosen at each step so that they best split the set of items (the subsets should mostly all be of the same values). There are different notions for what best means here, but using the gini impurity or information gain (entropy) are common choices. With enough rules, given a grouping, each policy in the group will all have similar target values. A new policy with an unknown target can then be compared against these rules and the target can be predicted to take on the value of the target in the group of policies it ends up with. Note that the depth of the tree (the number of rules) needs to be limited so that the method does not overfit. One could come up with enough rules such that each policy is its own group, but then the method loses predictive power for a new policy with an unknown target value.
On its own, the decision tree is not great at regression. The power of the method comes from the "gradient boosting" part of the name. After a decision tree is created, there will invariably be some policies that are misclassified. In analogy to gradient descent, the decision tree is then then trained to fit the residuals as the new target variable (or more generally the negative gradient of the loss function). This corrects for errors made at each iteration, and after many iterations, makes for quite a robust regressor and classifier.
We got good performance from each of these methods, and the two methods arrive at the prediction in very different ways. Thus, we combine the results from these two machine learning methods to arrive at our final prediction. We considered a standard mean as well as a geometric mean for the final prediction. We found that the geometric combination was more useful. This seems reasonable in predicting rare events as then both methods have to agree that the prediction value is large to net a large prediction, whereas only one method has to have a small prediction value to net a small prediction.
We probably could have dealt with missing entries better. It turned out that many of the features were strongly correlated with other features (some even perfectly) so we could have used this information to try to fill in the missing features. Instead, we filled all missing entries with a value of 0. In general, it's probably best to treat missing features as a systematic error and the effect could be quantified through cross-validating by considering various scenarios of filling in missing entries.
It also turned out that some of the features that were labeled as continuous were not actually continuous and were discrete (there were only a few values that the continuous variable took). There may have been some performance benefit from implementing one-hot encoding on these as well.
For the ridge regression, we could have applied standard model selection methods such as AIC and BIC to choose the key features. For the gradient-boosted decision trees, using these methods is a bit trickier as the complexity of the fit is not easy to determine.
One lesson I learned was the importance of cross-validation. -fold cross-validation randomly splits up the training set into subsets. Then, each subset is used as a test set with the complement used as the training set. This gives an idea of how well the model is expected to perform and also how much the model may be expected to overfit. The cross-validation estimate of the error will be an overestimate of the true prediction error, since only a subset of the data is used for prediction, whereas the whole dataset would be used for a true prediction . Ideally, one is in a regime where this difference is not crucial. By adjusting the value of a good regime where this is true can be found.
While we did split up our data set into the halves and test by training on one half and predicting on the other, we could have been more careful about the process. In particular, the feature selection process could have been carefully verified. In addition, we trusted our position on the Kaggle leaderboards more than our cross-validation scores, which led our final predictions to overfit to the leaderboard more than we would have liked.
1. Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning.