What is SHAP?
“SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see papers for details and citations).” — SHAP
In other words, SHAP is a relatively easy way to explain the outputs of your model. If you’ve ever delivered a model to a client, you’ll know how critical this step can be.
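To make the "credit allocation" idea concrete, here is a minimal sketch that computes exact Shapley values for a tiny, made-up pricing function. The function `toy_price`, its feature names and its dollar amounts are all hypothetical, and the SHAP library itself uses much faster model-specific algorithms rather than this brute-force loop.

```python
from itertools import permutations

def shapley_values(value_fn, players):
    """Exact Shapley values: average each player's marginal
    contribution over every possible ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical toy "model": house price in $K given which features are known
def toy_price(present):
    price = 100.0  # base value when no feature is known
    if "garage" in present:
        price += 20.0
    if "fireplace" in present:
        price += 10.0
    if "garage" in present and "fireplace" in present:
        price += 6.0  # interaction, shared between the two features
    return price

features = ["garage", "fireplace", "pool"]
phi = shapley_values(toy_price, features)
base = toy_price(frozenset())
full = toy_price(frozenset(features))

print(phi)                             # contribution per feature
print(base + sum(phi.values()), full)  # the two numbers match
```

The contributions add up to the difference between the base value and the full prediction. That additive property is what a SHAP force plot visualises.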
What is PyCaret?
“PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment.” — PyCaret
PyCaret is great for rapid model development across a lot of machine learning problems. In an earlier article I showed how to predict house prices using PyCaret and submit the results to Kaggle. It was a basic intro, so the performance was not expected to be great, but it still achieved an RMSE of around $40K.
So how can we use SHAP with PyCaret?
The first step is to train a model using PyCaret. Here I will retrain a House Price Estimator model. If you’re following along and haven’t installed PyCaret, please see my previous article for instructions. You will also need to install SHAP, which you can do like this.
conda install -c conda-forge shap
or
pip install shap
You’ll also need a copy of the data from Kaggle.
In my previous article I used the raw data, here I will preprocess the data before loading.
First let’s import our libraries.
import pandas as pd
from pycaret.regression import *
import shap
from sklearn.preprocessing import OneHotEncoder
from itertools import chain
from collections import defaultdict
pd.set_option('display.max_columns', None)
shap.initjs()
Now let’s read in our data.
train = pd.read_csv('data/train.csv')
This data has a few ordinal features. We need to specify them for preprocessing and passing to PyCaret.
ordinal_features = {
'KitchenQual': ['Fa', 'TA', 'Gd', 'Ex'],
'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
'ExterQual': ['Fa', 'TA', 'Gd', 'Ex'],
'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
'BsmtQual': ['None', 'Fa', 'TA', 'Gd', 'Ex'],
'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd'],
'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
'CentralAir': ['N', 'Y'],
'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ','GLQ'],
'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ','GLQ'],
'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],
'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],
'LandContour': ['Low', 'HLS', 'Bnk', 'Lvl'],
'Electrical': ['None', 'Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
'Functional': ['Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],
'PavedDrive': ['N', 'P', 'Y'],
'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv'],
'LandSlope': ['Sev', 'Mod', 'Gtl']
}
Nan_ordinals = []
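The order of each list in the dictionary above matters: categories run from worst to best. As a rough illustration of what an ordinal encoding does with that order, here is a minimal sketch. The helper `encode_ordinal` and the sample values are hypothetical, and PyCaret's internal encoder may assign codes differently.

```python
# Ordered levels copied from the 'KitchenQual' entry of ordinal_features
kitchen_levels = ['Fa', 'TA', 'Gd', 'Ex']

def encode_ordinal(values, levels):
    """Map each category to its position in the ordered list of levels."""
    rank = {level: i for i, level in enumerate(levels)}
    return [rank[v] for v in values]

sample = ['TA', 'Ex', 'Fa', 'Gd']  # hypothetical column values
encoded = encode_ordinal(sample, kitchen_levels)
print(encoded)  # [1, 3, 0, 2]
```

Because the codes preserve the ranking, a model can learn that 'Ex' is better than 'Gd', which a one-hot encoding would not express.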
Now let’s build a function to preprocess the data.
The data has a few fields that can be merged to create new features that work better in the modelling process.
def new_features(df, test_df=False, test_df_path='data/test.csv'):
    # Merge features
    df['BsmtBath'] = df['BsmtFullBath'] + 0.5*df['BsmtHalfBath']
    df['Bath'] = df['FullBath'] + 0.5*df['HalfBath']
    df['Age'] = df['YrSold'] - df['YearBuilt']
    df['RemodAge'] = df['YrSold'] - df['YearRemodAdd']
    df['GarageAge'] = df['YrSold'] - df['GarageYrBlt']
    for col in [
        'BsmtFullBath',
        'BsmtHalfBath',
        'FullBath',
        'HalfBath',
        'YrSold',
        'YearBuilt',
        'YearRemodAdd',
        'GarageYrBlt'
    ]:
        try:
            df.drop(col, axis = 1, inplace=True)
        except KeyError as e:
            pass
    #Fill NaN
    df.fillna({'LotFrontage': 0}, inplace=True)
A few features would also work better as one-hot-encoded features.
    # One-hot
    one_hot_features = ['SaleCondition', 'SaleType', 'GarageType', 'Foundation', 'MasVnrType', 'Exterior2nd', 'Exterior1st',
                        'RoofStyle', 'HouseStyle', 'BldgType', 'Condition1', 'Neighborhood', 'LotConfig', 'MSZoning']
    enc = OneHotEncoder(handle_unknown='ignore')
    # Fill on a new frame rather than in place on a slice of df
    one_hot_df = df[one_hot_features].fillna('None')
    enc.fit(one_hot_df)
    list_of_lists_names = [[one_hot_features[x] + '_' + s for s in enc.categories_[x].tolist()] for x in range(len(enc.categories_))]
    one_hot_list_names = list(chain.from_iterable(list_of_lists_names))
    one_hot_df = pd.DataFrame(enc.transform(one_hot_df).toarray(), columns=one_hot_list_names)
    # drop() returns a new frame, so assign the result back
    df = df.drop(one_hot_features, axis=1)
    df = pd.concat([df, one_hot_df], axis=1)
We also need to fill in nulls on the ordinal features.
    # NaN fill for ordinals
    ordinal_nans = [i for i in list(ordinal_features) if df[i].isnull().sum() > 0]
    df[ordinal_nans] = df[ordinal_nans].fillna('None')
We also need to drop some features.
    # Drop features
    drop_features = ['Alley', 'Street', 'Utilities', 'Condition2', 'RoofMatl', 'Heating', 'PoolQC', 'MiscFeature']
    df.drop(drop_features, axis=1, inplace=True)
The function also handles the test data, so we can preprocess it for submission.
    if not test_df:
        return df, list_of_lists_names, one_hot_features
    else:
        test_df = pd.read_csv(test_df_path)
        test_df['BsmtBath'] = test_df['BsmtFullBath'] + 0.5*test_df['BsmtHalfBath']
        test_df['Bath'] = test_df['FullBath'] + 0.5*test_df['HalfBath']
        test_df['Age'] = test_df['YrSold'] - test_df['YearBuilt']
        test_df['RemodAge'] = test_df['YrSold'] - test_df['YearRemodAdd']
        test_df['GarageAge'] = test_df['YrSold'] - test_df['GarageYrBlt']
        for col in [
            'BsmtFullBath',
            'BsmtHalfBath',
            'FullBath',
            'HalfBath',
            'YrSold',
            'YearBuilt',
            'YearRemodAdd',
            'GarageYrBlt'
        ]:
            try:
                # Drop from the test frame, not the training frame
                test_df.drop(col, axis = 1, inplace=True)
            except KeyError as e:
                pass
        test_df.fillna({'LotFrontage': 0}, inplace=True)
        one_hot_test_df = test_df[one_hot_features].fillna('None')
        one_hot_test_df = pd.DataFrame(enc.transform(one_hot_test_df).toarray(), columns=one_hot_list_names)
        test_df = test_df.drop(one_hot_features, axis=1)
        test_df = pd.concat([test_df, one_hot_test_df], axis=1)
        test_df[ordinal_nans] = test_df[ordinal_nans].fillna('None')
        test_df.drop(drop_features, axis=1, inplace=True)
        return test_df
Now let’s use our function to get our preprocessed training data.
train_fin, group_features_list, group_names = new_features(train)
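The second return value, group_features_list, holds one list of generated column names per one-hot feature. Here is a small sketch of how those names are built. The categories below are hypothetical stand-ins for what the fitted encoder exposes in enc.categories_.

```python
from itertools import chain

one_hot_features = ['LotConfig', 'BldgType']
# Hypothetical categories, standing in for enc.categories_
categories = [['Corner', 'Inside'], ['1Fam', 'Duplex']]

# One list of "<feature>_<category>" names per original feature
list_of_lists_names = [
    [feature + '_' + cat for cat in cats]
    for feature, cats in zip(one_hot_features, categories)
]
# Flattened version, used as the column names of the one-hot frame
one_hot_list_names = list(chain.from_iterable(list_of_lists_names))

print(list_of_lists_names)
print(one_hot_list_names)
```

Keeping the nested version around makes it possible to map each one-hot column back to the feature it came from.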
Awesome, now we are finally ready to start using PyCaret. The first step is to set up the model training process.
hseprc_reg = setup(
data=train_fin,
target='SalePrice',
train_size=0.8,
fold=5,
categorical_features=['MSSubClass'],
numeric_features=['GarageCars', 'Fireplaces'],
ordinal_features=ordinal_features,
n_jobs=None
)
Now let’s train a bunch’o’models and find the best one.
top5 = compare_models(n_select=5,sort='RMSE')
tuned_top5 = [tune_model(i, optimize='RMSE') for i in top5]
ensem_top5 = [ensemble_model(i, n_estimators = 10, optimize='RMSE') for i in tuned_top5]
blend = blend_models(tuned_top5, optimize='RMSE')
blend_ensem = blend_models(ensem_top5, optimize='RMSE')
model = automl(optimize='RMSE')
After a while (shouldn’t take too long, it only took 5 min on my 5 year old basic HP) you should have your results. We can find the model performance against the holdout set using
predict_model(model)
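Since every step above optimises for RMSE, it may help to see what the metric measures. This is a minimal sketch with made-up prices, not PyCaret's own implementation.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average squared difference."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices and predictions, in dollars
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 350_000]

print(round(rmse(actual, predicted), 2))
```

Because the errors are squared before averaging, one badly mispriced house affects the score more than several small misses.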
Here I got an RMSE of around $24K using a CatBoost Regressor, which is not bad. Let’s finalize and save our model so we can use it later.
final_model = finalize_model(model)
save_model(final_model, 'house_price_model')
Now we can move onto using SHAP
So as I explained earlier, SHAP is great for explaining model outputs. Let’s dive into getting it working with our model from earlier. First let’s load in our saved model.
saved_model = load_model('house_price_model')
So you might try running the model straight through SHAP.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(train_fin)
This will likely fail, as mine did with this error.
CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=2]="RL": Cannot convert 'b'RL'' to float
What is going on here? Why is it failing?
The answer is that the model PyCaret saves is not just an instance of the CatBoost model but a whole sklearn pipeline. We can check this by looking at the type of the saved model.
type(saved_model)
> sklearn.pipeline.Pipeline
The data being passed to SHAP is not the same as the data being passed to the model; it has to be transformed first. Fortunately, every stage of the pipeline except the last can be used as a preprocessing step that prepares the data for the model sitting at the end of the pipeline.
train_pipe = saved_model[:-1].transform(train_fin)
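The slicing trick is plain scikit-learn behaviour, so it can be tried on a small pipeline without PyCaret. This is a sketch under assumptions: the scaler, the linear model and the data are stand-ins rather than the steps PyCaret actually builds, and only the step name "trained_model" mirrors the one used in this article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features, four rows
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline([
    ("scale", StandardScaler()),            # stand-in for the preprocessing steps
    ("trained_model", LinearRegression()),  # stand-in for the final estimator
])
pipe.fit(X, y)

# Everything except the last step, reusing the already fitted transformers
preprocess = pipe[:-1]
X_ready = preprocess.transform(X)

# The final estimator on its own expects the transformed data
estimator = pipe.named_steps["trained_model"]
same = np.allclose(estimator.predict(X_ready), pipe.predict(X))
print(same)
```

Feeding the transformed data to the bare estimator reproduces the full pipeline's predictions, which is why the same pairing works for the SHAP explainer.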
Great, now the data is ready to be used in SHAP with the model as the explainer.
explainer = shap.TreeExplainer(saved_model.named_steps["trained_model"])
shap_values = explainer.shap_values(train_pipe)
Let’s explore the SHAP values for the first house in the training data.
house_idx = 0
shap.force_plot(explainer.expected_value, shap_values[house_idx,:], train_pipe.iloc[house_idx,:])
Look at that, amazing. We can see the SHAP values and how the features influence the regression output. Another common SHAP view shows all of the explanations at the same time.
shap.force_plot(explainer.expected_value, shap_values, train_pipe)
I hope this tutorial proved useful for you. Learning how to use SHAP with PyCaret was hugely useful for me and added a lot of value to my models when explaining outputs to clients. If you are new to SHAP, I’d recommend checking out their website for tutorials and info that should help make your models explainable.
If you’d like to get in touch with me to explain any projects or just have a chat about data science, you can contact me on LinkedIn or by email.
UPDATE: You can also use SHAP to interpret your trained model through PyCaret’s “interpret_model” function. This works great during training but fails on loaded models for the same reason SHAP does.