A simple worked example of Shap#

This notebook shows a very simple example of Shap. We examine scores in a pub quiz, where the score depends on which players are present (Tim, Mark, and Carrie). The team can field any number and combination of players, including none of the team turning up!

Data has been generated according to a known algorithm:

  1. Add 3 marks if Tim is present

  2. Add 6 marks if Mark is present

  3. Add 9 marks if Carrie is present

  4. Add a random 0-20% to the total (rounded to an integer)
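
A minimal sketch of how such data could be generated (an illustration, not the original generating script: the uniform noise distribution, the random seed, and the number of repeated sets of combinations are all assumptions):

import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)              # seed is arbitrary
marks = {'Tim': 3, 'Mark': 6, 'Carrie': 9}   # marks added when each player is present

rows = []
for _ in range(10):                          # number of repeated sets is an assumption
    for combo in itertools.product([0, 1], repeat=3):
        # Base score from the players present
        base = sum(mark * present for mark, present in zip(marks.values(), combo))
        # Add a random 0-20% and round to an integer
        rows.append([*combo, round(base * (1 + rng.uniform(0, 0.2)))])

scores = pd.DataFrame(rows, columns=[*marks, 'Score'])
scores.to_csv('shap_example.csv', index=False)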

We then fit an XGBoost regressor to predict the score from any number and combination of players, and fit a Shap explainer to explore that model.

Note: for classification models, Shap values are usually reported as log odds shifts in model predictions; for the regression model used here, Shap values are in the same units as the score. For a description of the relationships between probability, odds, and Shap values (log odds shifts) see here.
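
For classification models, a quick illustration of how a log odds shift maps back to a probability (an aside only; it is not needed for the regressor below):

import numpy as np

# Illustration: a Shap value of +1 (log odds) moves a prediction from
# even odds (probability 0.5) to a probability of about 0.73
log_odds_shift = 1.0
probability = 1 / (1 + np.exp(-log_odds_shift))  # sigmoid
print(f'Probability: {probability:.3f}')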

# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

Load data#

scores = pd.read_csv('shap_example.csv')

Show the first eight scores (a single set of all eight player combinations).

scores.head(8)
   Tim  Mark  Carrie  Score
0    0     0       0      0
1    1     0       0      4
2    0     1       0      6
3    0     0       1     11
4    1     1       0     11
5    1     0       1     14
6    0     1       1     15
7    1     1       1     19

Calculate average score of all games#

Shap values show change from the global average of all scores.

global_average = scores['Score'].mean()
print(f'Global average: {global_average:0.1f}')
Global average: 9.8

Show averages by whether each player is present or not#

For each player, show the average score when they are absent and when they are present, along with each difference from the global average score. Since each player is present in half of all games, their difference from the global average is roughly half their individual contribution to the score.

players = ['Tim', 'Mark', 'Carrie']
for player in players:
    print(f'Average scores wrt {player}')
    average_scores = scores.groupby(player).mean()['Score']
    print(average_scores.round(1))
    print('\nDifference from average:')
    difference = average_scores - global_average
    print(difference.round(1))
    print('\n')
Average scores wrt Tim
Tim
0     8.1
1    11.6
Name: Score, dtype: float64

Difference from average:
Tim
0   -1.8
1    1.8
Name: Score, dtype: float64


Average scores wrt Mark
Mark
0     6.8
1    12.8
Name: Score, dtype: float64

Difference from average:
Mark
0   -3.0
1    3.0
Name: Score, dtype: float64


Average scores wrt Carrie
Carrie
0     5.2
1    14.4
Name: Score, dtype: float64

Difference from average:
Carrie
0   -4.6
1    4.6
Name: Score, dtype: float64

Split into X and y and fit XGBoost regressor model#

X = scores.drop('Score', axis=1)
y = scores['Score']
# Define model
model = XGBRegressor()

# Fit model
model.fit(X, y)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=36,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

Get predictions and plot observed vs. predicted#

(The scores include a random component, so no model can predict them exactly.)

# Get predictions from the fitted model
y_pred = model.predict(X)

# Fit a linear regression of predicted on observed scores
y_array = np.array(y).reshape(-1, 1)
y_pred_array = np.array(y_pred).reshape(-1, 1)
slr = LinearRegression()
slr.fit(y_array, y_pred_array)
y_pred_best_fit = slr.predict(y_array)
r_square = metrics.r2_score(y_array, y_pred_best_fit)

# Plot observed vs. predicted scores with the best-fit line
fig = plt.figure(figsize=(5, 5))
ax = fig.add_subplot()
ax.scatter(y, y_pred)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
ax.plot(y, y_pred_best_fit, color='red')
text = f'R squared: {r_square:.3f}'
ax.text(16, 12, text,
        bbox=dict(facecolor='white', edgecolor='black'))
ax.grid()
plt.show()
[Figure: scatter plot of observed vs. predicted scores with best-fit line and R-squared annotation]

Train Shap model#

# Train explainer on the full data set (no train/test split in this simple example)
explainer = shap.TreeExplainer(model, X)

# Get Shap values along with base values and feature data
shap_values_extended = explainer(X)
shap_values = shap_values_extended.values
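
As a quick check, the explainer's base value (the baseline that Shap values shift predictions from) should sit close to the global average score calculated earlier:

# The Shap base value should be close to the global average score
print(f'Shap base value: {np.mean(shap_values_extended.base_values):.1f}')
print(f'Global average:  {global_average:.1f}')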

Show beeswarm plot of Shap values#

The beeswarm plot shows the Shap values for all predicted instances. Each player has a Shap value in each game, showing the effect of their presence or absence on the prediction relative to the global average of all scores.

features = list(X)

shap.summary_plot(shap_values=shap_values, 
                  features=X,
                  feature_names=features,
                  show=False)
plt.show()
[Figure: beeswarm plot of Shap values for Tim, Mark, and Carrie]
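
Recent versions of shap also provide a beeswarm function that takes the Explanation object directly; an equivalent sketch (assuming a shap version with the shap.plots API):

# Equivalent plot using the newer plotting API
shap.plots.beeswarm(shap_values_extended, show=False)
plt.show()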

Calculate Shap for each player when present or not#

Here we calculate the average Shap value for the absence and presence of each player in the team. These averages closely match the simple differences from the global average calculated earlier.

shap_summary_by_player = dict()
for player in list(X):
    player_shap_values = shap_values_extended[:, player]
    df = pd.DataFrame()
    df['player_present'] = player_shap_values.data
    df['shap'] = player_shap_values.values
    shap_summary = df.groupby('player_present').mean()
    shap_summary_by_player[player] = shap_summary  
for player in list(X):
    print (f'Shap for {player}:')
    print(shap_summary_by_player[player].round(1))
    print()
Shap for Tim:
                shap
player_present      
0               -1.7
1                1.7

Shap for Mark:
                shap
player_present      
0               -3.0
1                3.0

Shap for Carrie:
                shap
player_present      
0               -4.6
1                4.6

Show waterfall plots for lowest and highest scores#

Waterfall plots show the influence of features on the predicted outcome starting from a baseline model prediction.

Here we show the games with the lowest and highest predicted scores.

# Get the locations of the examples with the lowest and highest
# predicted scores

location_low_score = np.where(y_pred == np.min(y_pred))[0][0]
location_high_score = np.where(y_pred == np.max(y_pred))[0][0]
fig = shap.plots.waterfall(shap_values_extended[location_low_score])
[Figure: waterfall plot for the game with the lowest predicted score]
fig = shap.plots.waterfall(shap_values_extended[location_high_score])
[Figure: waterfall plot for the game with the highest predicted score]
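
The waterfall bars are additive: for any instance, the base value plus the sum of its Shap values reproduces the model's prediction (to floating-point tolerance). A quick check on the lowest-scoring example:

# Check additivity of Shap values for the low-score example
instance = shap_values_extended[location_low_score]
reconstructed = instance.base_values + instance.values.sum()
print(f'Base + sum of Shap values: {reconstructed:.2f}')
print(f'Model prediction:          {y_pred[location_low_score]:.2f}')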

Pick a random example.

random_location = np.random.randint(0, len(scores))
fig = shap.plots.waterfall(shap_values_extended[random_location])
[Figure: waterfall plot for a randomly chosen game]
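
The random pick changes between runs; to make it reproducible, the NumPy random seed can be set first (the seed value is arbitrary):

# Seed NumPy's global random state for a reproducible pick
np.random.seed(42)
random_location = np.random.randint(0, len(scores))
fig = shap.plots.waterfall(shap_values_extended[random_location])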