# A simple worked example of Shap#

This notebook shows a very simple example of Shap. We examine scores in a pub quiz. Those scores depend on the players present (Tim, Mark, and Carrie). The pub quiz team can have any number and combination of players - including none of the team turning up!

Data has been generated according to a known algorithm:

Add 3 marks if Tim is present

Add 6 marks if Mark is present

Add 9 marks if Carrie is present

Add 0-20% (as an integer)

We then fit an XGBoost regressor model to predict the score given any number and combination of players, and then fit a Shap model to explore the XGBoost model.

Note: Shap values are usually reported as *log odds shifts* in model predictions. For a description of the relationships between probability, odds, and Shap values (log odds shifts) see here.

```
# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")
```

```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shap
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
```

## Load data#

```
scores = pd.read_csv('shap_example.csv')
```

Show first eight scores (a single set of all combinations).

```
scores.head(8)
```

Tim | Mark | Carrie | Score | |
---|---|---|---|---|

0 | 0 | 0 | 0 | 0 |

1 | 1 | 0 | 0 | 4 |

2 | 0 | 1 | 0 | 6 |

3 | 0 | 0 | 1 | 11 |

4 | 1 | 1 | 0 | 11 |

5 | 1 | 0 | 1 | 14 |

6 | 0 | 1 | 1 | 15 |

7 | 1 | 1 | 1 | 19 |

## Calculate average of all games#

Shap values show change from the global average of all scores.

```
global_average = scores['Score'].mean()
print(f'Global average: {global_average:0.1f}')
```

```
Global average: 9.8
```

## Show averages by whether player is present or not#

Show averages by whether player is present or not, and show difference from average score.

```
players = ['Tim', 'Mark', 'Carrie']
for player in players:
print(f'Average scores wrt {player}')
average_scores = scores.groupby(player).mean()['Score']
print(average_scores.round(1))
print('\nDifference from average:')
difference = average_scores - global_average
print(difference.round(1))
print('\n')
```

```
Average scores wrt Tim
Tim
0 8.1
1 11.6
Name: Score, dtype: float64
Difference from average:
Tim
0 -1.8
1 1.8
Name: Score, dtype: float64
Average scores wrt Mark
Mark
0 6.8
1 12.8
Name: Score, dtype: float64
Difference from average:
Mark
0 -3.0
1 3.0
Name: Score, dtype: float64
Average scores wrt Carrie
Carrie
0 5.2
1 14.4
Name: Score, dtype: float64
Difference from average:
Carrie
0 -4.6
1 4.6
Name: Score, dtype: float64
```

## Split into X and y and fit XGBoost regressor model#

```
X = scores.drop('Score', axis=1)
y = scores['Score']
```

```
# Define model
model = XGBRegressor()
# Fit model
model.fit(X, y)
```

```
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=36,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
```

## Get predictions and plot observed vs. predicted#

(Predicted includes a random component to score, so no model can be exact).

```
y_pred = model.predict(X)
```

```
y_array = np.array(y).reshape(-1,1)
y_pred_array = np.array(y_pred).reshape(-1,1)
slr = LinearRegression()
slr.fit(y_array, y_pred_array)
y_pred_best_fit = slr.predict(y_array)
r_square = metrics.r2_score(y_array, y_pred_best_fit)
```

```
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot()
ax.scatter(y, y_pred)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')
ax.plot (y, slr.predict(y_array), color = 'red')
text = f'R squared: {r_square:.3f}'
ax.text(16, 12, text,
bbox=dict(facecolor='white', edgecolor='black'))
ax.grid()
plt.show()
```

## Train Shap model#

```
# Train explainer on Training set
explainer = shap.TreeExplainer(model, X)
# Get Shapley values along with base and features
shap_values_extended = explainer(X)
shap_values = shap_values_extended.values
```

## Show beeswarm of Shap#

The Beeswarm plot shows the Shap values for each instance predicted. Each player has a Shap value for their presence or absence in a team, which shows the effect of their presence/absence compared with the global average of all scores.

```
features = list(X)
shap.summary_plot(shap_values=shap_values,
features=X,
feature_names=features,
show=False)
plt.show()
```

## Calculate Shap for each player when present or not#

Here we calculate the average Shap values for the absence and presence of a player in the team.

```
shap_summary_by_player = dict()
for player in list(X):
player_shap_values = shap_values_extended[:, player]
df = pd.DataFrame()
df['player_present'] = player_shap_values.data
df['shap'] = player_shap_values.values
shap_summary = df.groupby('player_present').mean()
shap_summary_by_player[player] = shap_summary
```

```
for player in list(X):
print (f'Shap for {player}:')
print(shap_summary_by_player[player].round(1))
print()
```

```
Shap for Tim:
shap
player_present
0 -1.7
1 1.7
Shap for Mark:
shap
player_present
0 -3.0
1 3.0
Shap for Carrie:
shap
player_present
0 -4.6
1 4.6
```

## Show waterfall plots for lowest and highest scores#

Waterfall plots show the influence of features on the predicted outcome starting from a baseline model prediction.

Here we show the lowest and highest score.

```
# Get the location of an example each where probability of giving thrombolysis
# is <0.1 or >0.9
location_low_score = np.where(y_pred == np.min(y_pred))[0][0]
location_high_score = np.where(y_pred == np.max(y_pred))[0][0]
```

```
fig = shap.plots.waterfall(shap_values_extended[location_low_score])
```

```
fig = shap.plots.waterfall(shap_values_extended[location_high_score])
```

Pick a random example.

```
random_location = np.random.randint(0, len(scores))
fig = shap.plots.waterfall(shap_values_extended[random_location])
```