Kaggle’s Titanic - Machine Learning from Disaster is a classic introductory problem for getting familiar with the fundamentals of machine learning. Here’s a quick run-through of how I tuned some features to improve a model.
Alexis provides a brief introduction to making a submission to Kaggle, along with some sample code for this challenge. Her example uses a random forest classifier as the model.
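For context, her pipeline looks roughly like the sketch below. This is a reconstruction of the public tutorial, not necessarily her exact code; the feature list and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Standard Kaggle competition layout for the Titanic data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({"PassengerId": test_data.PassengerId, "Survived": predictions})
output.to_csv("submission.csv", index=False)
```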
Results
Submitting this model resulted in a score of 0.77511.
Contribution
We can start by looking for entries in the dataset to drop or modify to improve the performance of the model. The PassengerId column contains unique, non-null values, which means there are no duplicate rows to drop. There are missing values in other columns that we can explore.
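A quick way to check this (a minimal sketch, continuing with the train_data frame loaded above):

```python
# Count the missing values in each column
print(train_data.isnull().sum())

# Confirm PassengerId is unique, so there are no duplicate rows to drop
print(train_data["PassengerId"].is_unique)
```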
Output
Since the number of missing values in the Fare and Embarked columns is negligible, it is simplest to drop those entries from the dataset. However, there is a considerable number of missing values in the Age and Cabin columns. A naive approach would be to fill them in with the mean, median, or mode of the column. A better approach is to look at the relationships between Age and the other columns, then use those relationships to decide how to replace the missing values.
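Dropping the few affected rows is a one-liner (a sketch; note that in the standard split, the missing Embarked values are in the training data while the lone missing Fare is in the test data):

```python
# The rows missing Fare or Embarked are few enough to drop outright
train_data = train_data.dropna(subset=["Fare", "Embarked"])
```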
We can apply the same principle to the Cabin column, though its format needs to be changed first. The cabin data is given as <Cabin Type><Room Number>. The room number can be discarded, but since the cabin type likely has some influence on whether a passenger survives, it should be kept. To check this, we’ll look at the relationship between the cabin type and passenger survival.
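One way to do the extraction and the check (a sketch; the CabinType column name is my own, not from the original code):

```python
# Keep only the leading cabin letter (the cabin type); discard the room number
train_data["CabinType"] = train_data["Cabin"].str[0]

# Survival rate for each cabin type
print(train_data.groupby("CabinType")["Survived"].mean())
```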
Output
Pclass appears to have a strong influence on the Age and Cabin columns. This information can be used to derive finer approximations for the missing entries, as opposed to a “one size fits all” approximation.
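For Age, one implementation of that idea is to fill each missing value with the median age of the passenger’s class (using the median here is an assumption; the same pattern works with the mean):

```python
# Median age within each passenger class, rather than a single global median
train_data["Age"] = train_data["Age"].fillna(
    train_data.groupby("Pclass")["Age"].transform("median")
)
```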
Output
Looking at the % survived chart and how Cabin relates to Pclass, we can reasonably assume that cabin types 2, 4, and 5 are roughly the same. We’ll bin these cabin types together, and the remaining types will go in a separate bin. For passengers with no cabin data, a Pclass of 1 maps to the first bin and anything else maps to the second bin.
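A sketch of that binning is below. The mapping from cabin letters to the numbers 2, 4, and 5 isn’t shown above, so an alphabetical label encoding (A → 1, B → 2, …) is assumed here:

```python
import pandas as pd

# Assumed alphabetical label encoding of the cabin letters
letter_to_code = {letter: i + 1 for i, letter in enumerate("ABCDEFGT")}
codes = train_data["CabinType"].map(letter_to_code)

def bin_cabin(code, pclass):
    # Missing cabins fall back on Pclass: first class goes to the first bin
    if pd.isnull(code):
        return 0 if pclass == 1 else 1
    # Cabin types 2, 4, and 5 share the first bin; everything else is second
    return 0 if code in (2, 4, 5) else 1

train_data["Cabin"] = [bin_cabin(c, p) for c, p in zip(codes, train_data["Pclass"])]
```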
Output
Now that there are no more missing values, we can add the Age and Cabin columns to the features for the original model.
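Refitting is then just a matter of extending the feature list (same sketch-level caveats as the baseline above):

```python
# Same classifier as the baseline, with the engineered columns added
features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Cabin"]
X = pd.get_dummies(train_data[features])
y = train_data["Survived"]

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
```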
Without tuning any of the model’s hyperparameters, I was able to slightly improve the score by including some of the features that were initially incompatible with the random forest classifier due to missing values. Surprisingly, adding just Fare and Embarked resulted in a lower score, while removing Cabin and including the rest of the features produced the best score. Further data analysis and feature engineering may yield meaningful changes to the scores, but tuning the hyperparameters of the random forest classifier or experimenting with other types of models is likely the better path to improving the score at this point.