Titanic - Machine Learning from Disaster
Introduction
Kaggle’s Titanic - Machine Learning from Disaster is a classic introductory problem for getting familiar with the fundamentals of machine learning. Here’s a quick run-through of how I tuned some features to improve a model.
Where to start?
Alexis Cook provides a brief introduction to making a submission to Kaggle, with some sample code for this challenge. She uses a random forest classifier for the model in her example.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import os

PATH = "../input/titanic/"  # file path to the datasets
# List the available input files
for dirname, _, filenames in os.walk(PATH):
    for filename in filenames:
        print(os.path.join(dirname, filename))
### Load Datasets
train_data = pd.read_csv(PATH + "train.csv")
test_data = pd.read_csv(PATH + "test.csv")
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
output = pd.DataFrame(
    {'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
Results
Submitting this model's predictions resulted in a score of 0.77511.
Contribution
We can start by looking for entries in the dataset to drop or modify to improve the model's performance. The PassengerId column contains unique, non-null values, which means there are no duplicate rows to drop. There are missing values in other columns that we can explore.
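As a quick sanity check (a small sketch of my own; these assertions are not part of the original walkthrough), we can confirm this before moving on.
# Sanity check: PassengerId is unique and non-null in both splits
for name, df in [("train", train_data), ("test", test_data)]:
    assert df["PassengerId"].is_unique
    assert df["PassengerId"].notna().all()
    print(f"{name}: {len(df)} rows, PassengerId unique and non-null")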
# Encode `Sex` numerically so it appears in the correlation matrix later
data_list = [train_data, test_data]
for i, dl in enumerate(data_list):
    data_list[i].Sex = dl.Sex.apply(lambda sex: 0 if sex == "male" else 1)
all_data = pd.concat(data_list, ignore_index=True)
missing_vals = [
    all_data[col].isnull().sum() for col in all_data.columns.to_list()]
labels = all_data.columns.to_list()
ser = pd.Series(data=missing_vals, index=labels, name="by amount")
# `Survived` is only missing for the test rows, so exclude it
ser_missing = ser[ser > 0].drop("Survived", axis=0)
percentages = ser_missing.apply(
    lambda x: "%.2f" % (x * 100 / all_data.shape[0]))
percentages.name = "by percent"
print(f"Total number of rows: {all_data.shape[0]}\n\n{ser_missing}"
      f"\n\n{percentages}")
Output
Total number of rows: 1309
Age 263
Fare 1
Cabin 1014
Embarked 2
Name: by amount, dtype: int64
Age 20.09
Fare 0.08
Cabin 77.46
Embarked 0.15
Name: by percent, dtype: object
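For reference, the same counts can be obtained more directly with the built-in isnull().sum() (an alternative to the list comprehension above, not used in the rest of the post).
# More direct alternative for the same missing-value counts
missing = all_data.isnull().sum()
print(missing[missing > 0].drop("Survived"))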
Since the number of missing values in the Fare and Embarked columns is negligible, it is best to simply drop these entries from the dataset. However, there are a considerable number of missing values in the Age and Cabin columns. A naive approach would be to fill them in with the mean, median, or mode of the column. A better approach is to look at the relationships between Age and the other columns, then determine how to replace the missing values.
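For contrast, the naive version is a one-liner; it is shown here for illustration only and is not used below.
# Naive fill: one global median for every missing age (illustration only)
naive_age = train_data["Age"].fillna(train_data["Age"].median())
print(naive_age.isnull().sum())  # prints 0; no missing values remain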
We can apply the same principle to the Cabin column. The format of the cabin data will need to be changed: it is given as <Cabin Type><Room Number>. The room number can be discarded, but since the cabin type likely has some influence on whether a passenger survives, it should be kept. To check this, we’ll look at the relationship between the cabin type and passenger survival.
pd.options.mode.chained_assignment = None
### Note: You would normally drop the `Fare` and `Embarked` null values
# dl = dl[dl["Fare"].notna() & dl["Embarked"].notna()]
# but those rows are required for the competition, so instead I'll
# fill the missing `Fare` with the median and bin `Fare` into quartiles.
for i, dl in enumerate(data_list):
    dl["Fare"] = dl["Fare"].fillna(dl["Fare"].median())
    dl["Fare"] = pd.qcut(dl["Fare"], q=4, labels=['A', 'B', 'C', 'D'])
    # Replace `Embarked` NA values with "S", the most common port
    dl["Embarked"] = dl["Embarked"].fillna("S")
    # Keep only the deck letter (e.g. "C85" -> "C"), then encode it;
    # missing cabins become -1 under cat.codes
    dl["Cabin"] = dl["Cabin"].apply(lambda x: x[0] if pd.notna(x) else x)
    dl["Cabin"] = dl["Cabin"].astype('category').cat.codes
    data_list[i] = dl
# Rebuild the combined frame with a fresh index so positional and
# label-based access agree below
all_data = pd.concat(data_list, ignore_index=True)
corr = all_data.corr()[["Age", "Cabin"]].drop("PassengerId", axis=0)
corr
Output
Age Cabin
Survived -0.077221 0.287944
Pclass -0.408106 -0.563667
Sex -0.063645 0.133479
Age 1.000000 0.205097
SibSp -0.243699 -0.009317
Parch -0.150917 0.034465
Cabin 0.205097 1.000000
Pclass appears to have a strong influence on both the Age and Cabin columns. This information can be used to make finer approximations for the missing entries, as opposed to a one-size-fits-all approximation.
train_data_copy = train_data.copy()
train_data_copy["Cabin"] = train_data_copy["Cabin"].astype(
    'category').cat.codes
group_survive = train_data_copy.groupby("Cabin")["Survived"].sum()
group_count = train_data_copy.groupby("Cabin")["Survived"].count()
# Survival rate per cabin type
percentages = []
for (u, v) in zip(group_survive, group_count):
    percentages.append(u / v * 100)
fig = plt.figure(figsize=(10, 7))
plt.bar(group_survive.index, percentages)
plt.xlabel("Cabin type")
plt.ylabel("% Survived")
plt.axis([-1, 8, 0, 100])
plt.show()
# Mean age per passenger class, and the modal class per cabin type
age_group = all_data.groupby(["Pclass"])["Age"].mean().astype(int)
cabin_group = all_data.groupby(["Cabin"])["Pclass"].agg(pd.Series.mode)
# Fill each missing age with the mean age of that passenger's class
for i in all_data[all_data["Age"].isna()].index:
    Pclass = all_data.loc[i, "Pclass"]
    all_data.loc[i, "Age"] = age_group[Pclass]
cabin_group
Output
Cabin
-1 3
0 1
1 1
2 1
3 1
4 1
5 2
6 3
7 1
Name: Pclass, dtype: int64
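As an aside, the row-by-row Age fill above can also be written as a single vectorized operation. This is a sketch that produces the same truncated per-class means as the loop; it assumes the same all_data frame.
# Vectorized alternative to the Age-filling loop (sketch)
class_mean_age = all_data.groupby("Pclass")["Age"].transform("mean").astype(int)
all_data["Age"] = all_data["Age"].fillna(class_mean_age)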
Looking at the % survived chart and at how Cabin relates to Pclass, we can reasonably assume that cabin types 2, 4, and 5 are roughly the same. We’ll bin these cabins together, and the remaining cabins will go in a separate bin. For the missing cabins, a Pclass of 1 will map to the first bin and anything else to the second bin.
# Bin the cabin types: 2, 4, 5 -> 0; the rest -> 1. A single dict-based
# replace keeps the mapping simultaneous, so the newly assigned 0s are
# not re-replaced by the second rule.
all_data['Cabin'] = all_data['Cabin'].replace(
    {2: 0, 4: 0, 5: 0, 0: 1, 1: 1, 3: 1, 6: 1, 7: 1})
# Fill the missing cabins (-1) based on passenger class
for i in all_data[all_data["Cabin"] == -1].index:
    Pclass = all_data.loc[i, "Pclass"]
    all_data.loc[i, "Cabin"] = 0 if Pclass == 1 else 1
missing_vals = [
    all_data[col].isnull().sum() for col in all_data.columns.to_list()]
labels = all_data.columns.to_list()
ser = pd.Series(
    data=missing_vals, index=labels, name="by amount").drop(
    "Survived", axis=0)
ser
Output
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
Name: by amount, dtype: int64
Now that there are no more missing values, we can add the Age and Cabin columns to the features for the original model.
def train(features):
    train_data = all_data[all_data["Survived"].notna()]
    test_data = all_data[all_data["Survived"].isna()]
    y = train_data["Survived"]
    X = pd.get_dummies(train_data[features])
    X_test = pd.get_dummies(test_data[features])
    model = RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=1)
    model.fit(X, y)
    predictions = model.predict(X_test)
    output = pd.DataFrame(
        {
            'PassengerId': test_data.PassengerId,
            'Survived': predictions.astype(int)
        }
    )
    output.to_csv('new_submission.csv', index=False)
Scores
Original Score: 0.77511
train(["Pclass", "Sex", "SibSp", "Parch", "Fare" "Embarked"])
Score: 0.76555
train(["Pclass", "Sex", "SibSp", "Parch", "Cabin", "Age"])
Score: 0.78468
train(["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Embarked"])
Score: 0.78708
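Since each of these scores requires a Kaggle submission, a cross-validated local estimate can help compare feature sets beforehand. This is a sketch; local_score is a helper I'm introducing here, and its numbers are not leaderboard scores.
from sklearn.model_selection import cross_val_score

def local_score(features):
    # 5-fold accuracy on the training split, using the same model setup
    train_df = all_data[all_data["Survived"].notna()]
    X = pd.get_dummies(train_df[features])
    y = train_df["Survived"].astype(int)
    model = RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=1)
    return cross_val_score(model, X, y, cv=5).mean()

print(local_score(["Pclass", "Sex", "SibSp", "Parch", "Cabin", "Age"]))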
Conclusion
Without tuning any of the hyperparameters of the given model, I was able to slightly improve the score by including some of the features that were initially incompatible with the RandomForestClassifier. Surprisingly, adding just Fare and Embarked resulted in a lower score, while removing Cabin and including the rest of the features produced the best score. Further data analysis and feature engineering may yield meaningful changes to the scores, but tuning the hyperparameters of the random forest or experimenting with other types of models is likely the better approach to improving the scores at this point.
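As a possible next step, hyperparameter tuning could look something like the sketch below. The grid values are illustrative assumptions, not settings I have validated for this dataset.
from sklearn.model_selection import GridSearchCV

train_df = all_data[all_data["Survived"].notna()]
features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Embarked"]
X = pd.get_dummies(train_df[features])
y = train_df["Survived"].astype(int)

# Cross-validated grid search over a small, illustrative grid
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={
        "n_estimators": [100, 300, 500],
        "max_depth": [3, 5, 7, None],
        "min_samples_leaf": [1, 2, 4],
    },
    cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)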