Predicting differences between local and benchmark decisions#

This experiment focuses on hospitals that would give thrombolysis to at least 50% more patients if the majority vote of 30 benchmark hospitals were applied. Of the patients who would be thrombolysed by the majority of the benchmark hospitals, we build a model to predict which would also be thrombolysed at their local unit. The XGBoost model used to make predictions uses 8 features:

  • S2BrainImagingTime_min

  • S2StrokeType_Infarction

  • S2NihssArrival

  • S1OnsetTimeType_Precise

  • S2RankinBeforeStroke

  • StrokeTeam

  • AFAnticoagulent_Yes

  • S1OnsetToArrival_min

Aims:

  • Of all those patients thrombolysed by the benchmark decision, build an XGBoost model to predict which patients would be thrombolysed at a local unit.

  • Investigate model predictions using Shap
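
For context, a minimal sketch (hypothetical names: benchmark_models, X_patients; one fitted model per benchmark hospital is assumed) of how a benchmark majority vote of this kind could be formed:

import numpy as np

# Hypothetical sketch: each benchmark hospital model votes on whether it
# would give thrombolysis to each patient (0/1); the benchmark decision
# is the majority vote across the 30 models.
votes = np.stack([model.predict(X_patients) for model in benchmark_models])
majority_vote = (votes.mean(axis=0) >= 0.5).astype(int)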

# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import auc
from sklearn.metrics import roc_curve

from xgboost import XGBClassifier

import shap

Function to calculate accuracy measures#

def calculate_accuracy(observed, predicted):
    
    """
    Calculates a range of accuracy scores from observed and predicted classes.
    
    Takes two lists or NumPy arrays (observed class values and predicted
    class values), and returns a dictionary of results.
    
     1) observed positive rate: proportion of observed cases that are +ve
     2) predicted positive rate: proportion of predicted cases that are +ve
     3) observed negative rate: proportion of observed cases that are -ve
     4) predicted negative rate: proportion of predicted cases that are -ve
     5) accuracy: proportion of predicted results that are correct
     6) precision: proportion of predicted +ve that are correct
     7) recall: proportion of true +ve correctly identified
     8) f1: harmonic mean of precision and recall
     9) sensitivity: same as recall
    10) specificity: proportion of true -ve identified
    11) positive likelihood: increased probability of true +ve if test +ve
    12) negative likelihood: reduced probability of true +ve if test -ve
    13) false positive rate: proportion of false +ves in true -ve patients
    14) false negative rate: proportion of false -ves in true +ve patients
    15) true positive rate: same as recall
    16) true negative rate: same as specificity
    17) positive predictive value: chance of true +ve if test +ve
    18) negative predictive value: chance of true -ve if test -ve
    
    """
    
    # Convert lists to NumPy arrays
    if type(observed) == list:
        observed = np.array(observed)
    if type(predicted) == list:
        predicted = np.array(predicted)
    
    # Calculate accuracy scores
    observed_positives = observed == 1
    observed_negatives = observed == 0
    predicted_positives = predicted == 1
    predicted_negatives = predicted == 0
    
    true_positives = predicted_positives & observed_positives
    
    false_positives = predicted_positives & observed_negatives
    
    true_negatives = predicted_negatives & observed_negatives
    
    false_negatives = predicted_negatives & observed_positives
    
    accuracy = np.mean(predicted == observed)
    
    precision = (np.sum(true_positives) /
                 (np.sum(true_positives) + np.sum(false_positives)))
        
    recall = np.sum(true_positives) / np.sum(observed_positives)
    
    sensitivity = recall
    
    f1 = 2 * ((precision * recall) / (precision + recall))
    
    specificity = np.sum(true_negatives) / np.sum(observed_negatives)
    
    positive_likelihood = sensitivity / (1 - specificity)
    
    negative_likelihood = (1 - sensitivity) / specificity
    
    false_positive_rate = 1 - specificity
    
    false_negative_rate = 1 - sensitivity
    
    true_positive_rate = sensitivity
    
    true_negative_rate = specificity
    
    positive_predictive_value = (np.sum(true_positives) / 
                            (np.sum(true_positives) + np.sum(false_positives)))
    
    negative_predictive_value = (np.sum(true_negatives) / 
                            (np.sum(true_negatives) + np.sum(false_negatives)))
    
    # Create dictionary for results, and add results
    results = dict()
    
    results['observed_positive_rate'] = np.mean(observed_positives)
    results['observed_negative_rate'] = np.mean(observed_negatives)
    results['predicted_positive_rate'] = np.mean(predicted_positives)
    results['predicted_negative_rate'] = np.mean(predicted_negatives)
    results['accuracy'] = accuracy
    results['precision'] = precision
    results['recall'] = recall
    results['f1'] = f1
    results['sensitivity'] = sensitivity
    results['specificity'] = specificity
    results['positive_likelihood'] = positive_likelihood
    results['negative_likelihood'] = negative_likelihood
    results['false_positive_rate'] = false_positive_rate
    results['false_negative_rate'] = false_negative_rate
    results['true_positive_rate'] = true_positive_rate
    results['true_negative_rate'] = true_negative_rate
    results['positive_predictive_value'] = positive_predictive_value
    results['negative_predictive_value'] = negative_predictive_value
    
    return results
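
A quick check of the function on a toy example:

# Toy example to illustrate calculate_accuracy
observed = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
toy_results = calculate_accuracy(observed, predicted)
print(f"Accuracy: {toy_results['accuracy']:0.3f}")        # 0.667 (4/6 correct)
print(f"Sensitivity: {toy_results['sensitivity']:0.3f}")  # 0.667 (2/3 +ve found)
print(f"Specificity: {toy_results['specificity']:0.3f}")  # 0.667 (2/3 -ve found)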

Load data#

thrombolysis_decision_data = pd.read_csv(
    './predictions/benchmark_decisions_combined_xgb_key_features.csv')

Add a label where the benchmark decision = 1 but the observed decision = 0.

thrombolysis_decision_data['benchmark_yes_observed_no'] = (
    (thrombolysis_decision_data['majority_vote'] == 1) &
    (thrombolysis_decision_data['observed'] == 0))
thrombolysis_decision_data
unit observed predicted_thrombolysis predicted_proba majority_vote benchmark_yes_observed_no
0 TXHRP7672C 1 1.0 0.880155 1.0 False
1 SQGXB9559U 1 1.0 0.627783 1.0 False
2 LFPMM4706C 0 0.0 0.042199 0.0 False
3 MHMYL4920B 0 0.0 0.000084 0.0 False
4 EQZZZ5658G 1 1.0 0.916311 1.0 False
... ... ... ... ... ... ...
88787 OYASQ1316D 1 1.0 0.917107 1.0 False
88788 SMVTP6284P 0 0.0 0.023144 0.0 False
88789 RDVPJ0375X 0 0.0 0.089444 0.0 False
88790 FAJKD7118X 0 1.0 0.615767 1.0 True
88791 QWKRA8499D 0 0.0 0.160670 0.0 False

88792 rows × 6 columns

Combine with feature data for patients.

feature_data = pd.read_csv(
    './predictions/test_features_collated_key_features.csv')
feature_data['benchmark_yes_observed_no'] = \
    thrombolysis_decision_data['benchmark_yes_observed_no'] * 1
feature_data['majority_vote'] = \
    thrombolysis_decision_data['majority_vote'] * 1
feature_data['predicted_thrombolysis'] = \
    thrombolysis_decision_data['predicted_thrombolysis'] 
feature_data['observed'] = thrombolysis_decision_data['observed']
feature_data
S2BrainImagingTime_min S2StrokeType_Infarction S2NihssArrival S1OnsetTimeType_Precise S2RankinBeforeStroke StrokeTeam AFAnticoagulent_Yes S1OnsetToArrival_min S2Thrombolysis benchmark_yes_observed_no majority_vote predicted_thrombolysis observed
0 17.0 1 14.0 1 0 TXHRP7672C 0 186.0 1 0 1.0 1.0 1
1 25.0 1 6.0 1 0 SQGXB9559U 0 71.0 1 0 1.0 1.0 1
2 138.0 1 2.0 1 0 LFPMM4706C 0 67.0 0 0 0.0 0.0 0
3 21.0 0 11.0 1 0 MHMYL4920B 0 86.0 0 0 0.0 0.0 0
4 8.0 1 16.0 1 0 EQZZZ5658G 0 83.0 1 0 1.0 1.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
88787 3.0 1 11.0 1 1 OYASQ1316D 0 140.0 1 0 1.0 1.0 1
88788 69.0 1 3.0 0 4 SMVTP6284P 0 189.0 0 0 0.0 0.0 0
88789 63.0 1 3.0 1 1 RDVPJ0375X 0 154.0 0 0 0.0 0.0 0
88790 14.0 1 4.0 1 0 FAJKD7118X 0 77.0 0 1 1.0 1.0 0
88791 41.0 1 1.0 1 1 QWKRA8499D 0 77.0 0 0 0.0 0.0 0

88792 rows × 13 columns

Identify hospitals where the benchmark decision would lead to at least 50% more thrombolysis#

thrombolysis_counts = (thrombolysis_decision_data.groupby('unit').agg('sum').drop(
        'predicted_proba', axis=1))

thrombolysis_counts['benchmark_ratio'] = \
    thrombolysis_counts['majority_vote'] / thrombolysis_counts['observed']
mask = thrombolysis_counts['benchmark_ratio'] >= 1.5

low_thrombolysing_hospitals_df = thrombolysis_counts[mask]
low_thrombolysing_hospitals_df.sort_values(
    'benchmark_ratio', inplace=True, ascending=False)

low_thrombolysing_hospitals = list(low_thrombolysing_hospitals_df.index)


low_thrombolysing_hospitals_df.head()
observed predicted_thrombolysis majority_vote benchmark_yes_observed_no benchmark_ratio
unit
HZMLX7970T 34 28.0 80.0 49 2.352941
LFPMM4706C 21 7.0 46.0 29 2.190476
LZAYM7611L 40 23.0 86.0 55 2.150000
XPABC1435F 36 39.0 76.0 41 2.111111
LECHF1024T 114 110.0 230.0 128 2.017544

Build a model to distinguish between local and benchmark (top 30) thrombolysis decisions#

# Restrict data to low-thrombolysing hospitals

mask = feature_data['StrokeTeam'].isin(low_thrombolysing_hospitals)
restricted_data = feature_data[mask]

# Keep only patients who would be thrombolysed by the benchmark majority vote
mask = restricted_data['majority_vote'] == 1
restricted_data = restricted_data[mask]

cols_to_drop = ['benchmark_yes_observed_no', 'predicted_thrombolysis',
               'majority_vote','observed']
restricted_data.drop(cols_to_drop, axis=1, inplace=True)
restricted_data
S2BrainImagingTime_min S2StrokeType_Infarction S2NihssArrival S1OnsetTimeType_Precise S2RankinBeforeStroke StrokeTeam AFAnticoagulent_Yes S1OnsetToArrival_min S2Thrombolysis
1 25.0 1 6.0 1 0 SQGXB9559U 0 71.0 1
28 38.0 1 13.0 1 0 DQKFO8183I 0 149.0 1
91 18.0 1 15.0 1 0 OUXUZ1084Q 0 122.0 0
159 2.0 1 16.0 1 2 IATJE0497S 0 78.0 1
161 8.0 1 4.0 1 0 DANAH4615Q 0 160.0 1
... ... ... ... ... ... ... ... ... ...
88667 66.0 1 16.0 1 0 LGNPK4211W 0 46.0 0
88698 5.0 1 16.0 1 2 ISIZF6614O 0 109.0 1
88740 10.0 1 5.0 1 0 NFBUF0424E 0 67.0 0
88758 9.0 1 3.0 1 0 ISIZF6614O 0 89.0 0
88783 19.0 1 14.0 1 0 SQGXB9559U 0 119.0 1

4788 rows × 9 columns

Set up X and y, and the stratified k-fold train/test splits.

X = restricted_data.drop('S2Thrombolysis', axis=1)
y = restricted_data['S2Thrombolysis']
# Remove hospital ID
X.drop('StrokeTeam', axis=1, inplace=True)
skf = StratifiedKFold(n_splits = 5, random_state=42, shuffle=True)
skf.get_n_splits(X, y)

# Set up list to store models
model_kfold = []

# Set up lists for k-fold fits
observed_kfold = []
predicted_proba_kfold = []
predicted_kfold = []
X_train_kfold = []
X_test_kfold = []
y_train_kfold = []
y_test_kfold = []

# Set up list for feature importances
importances_kfold = []

# Loop through the k-fold splits
k_fold = 0
for train_index, test_index in skf.split(X, y):
    k_fold += 1
    
    # Get X and Y train/test
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    X_train_kfold.append(X_train)
    X_test_kfold.append(X_test)
    y_train_kfold.append(y_train)
    y_test_kfold.append(y_test)

    # Define model
    model = XGBClassifier(
        verbosity = 0, 
        scale_pos_weight=0.8, 
        random_state=42, 
        learning_rate=0.1)
    
    # Fit model
    model.fit(X_train, y_train)    
    model_kfold.append(model)
    
    # Get predicted probabilities
    y_probs = model.predict_proba(X_test)[:,1]
    predicted_proba_kfold.append(y_probs)
    observed_kfold.append(y_test)
    
    # Get feature importances
    importance = model.feature_importances_
    importances_kfold.append(importance)
    
    # Get predicted class using a threshold of 0.5
    y_class = y_probs >= 0.5
    y_class = np.array(y_class) * 1.0
    predicted_kfold.append(y_class)
    
    # Print accuracy
    accuracy = np.mean(y_class == y_test)
    print (
        f'Run {k_fold}, accuracy: {accuracy:0.3f}')
Run 1, accuracy: 0.658
Run 2, accuracy: 0.691
Run 3, accuracy: 0.696
Run 4, accuracy: 0.671
Run 5, accuracy: 0.662

Accuracy measures#

# Set up list for results
k_fold_results = []

# Loop through k fold predictions and get accuracy measures
for i in range(5):
    results = calculate_accuracy(observed_kfold[i], predicted_kfold[i])
    k_fold_results.append(results)
    
# Put results in DataFrame
accuracy_results = pd.DataFrame(k_fold_results).T
accuracy_results
0 1 2 3 4
observed_positive_rate 0.530271 0.530271 0.530271 0.530825 0.529781
observed_negative_rate 0.469729 0.469729 0.469729 0.469175 0.470219
predicted_positive_rate 0.492693 0.534447 0.518789 0.502612 0.493208
predicted_negative_rate 0.507307 0.465553 0.481211 0.497388 0.506792
accuracy 0.657620 0.691023 0.696242 0.670846 0.662487
precision 0.690678 0.707031 0.718310 0.700624 0.694915
recall 0.641732 0.712598 0.702756 0.663386 0.646943
f1 0.665306 0.709804 0.710448 0.681496 0.670072
sensitivity 0.641732 0.712598 0.702756 0.663386 0.646943
specificity 0.675556 0.666667 0.688889 0.679287 0.680000
positive_likelihood 1.977942 2.137795 2.258858 2.068474 2.021696
negative_likelihood 0.530331 0.431102 0.431483 0.495540 0.519202
false_positive_rate 0.324444 0.333333 0.311111 0.320713 0.320000
false_negative_rate 0.358268 0.287402 0.297244 0.336614 0.353057
true_positive_rate 0.641732 0.712598 0.702756 0.663386 0.646943
true_negative_rate 0.675556 0.666667 0.688889 0.679287 0.680000
positive_predictive_value 0.690678 0.707031 0.718310 0.700624 0.694915
negative_predictive_value 0.625514 0.672646 0.672451 0.640756 0.630928
accuracy_results.T.describe()
observed_positive_rate observed_negative_rate predicted_positive_rate predicted_negative_rate accuracy precision recall f1 sensitivity specificity positive_likelihood negative_likelihood false_positive_rate false_negative_rate true_positive_rate true_negative_rate positive_predictive_value negative_predictive_value
count 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000
mean 0.530284 0.469716 0.508350 0.491650 0.675644 0.702312 0.673483 0.687425 0.673483 0.678080 2.092953 0.481532 0.321920 0.326517 0.673483 0.678080 0.702312 0.648459
std 0.000370 0.000370 0.018009 0.018009 0.017189 0.010853 0.032409 0.021543 0.032409 0.008041 0.110045 0.047551 0.008041 0.032409 0.032409 0.008041 0.010853 0.022659
min 0.529781 0.469175 0.492693 0.465553 0.657620 0.690678 0.641732 0.665306 0.641732 0.666667 1.977942 0.431102 0.311111 0.287402 0.641732 0.666667 0.690678 0.625514
25% 0.530271 0.469729 0.493208 0.481211 0.662487 0.694915 0.646943 0.670072 0.646943 0.675556 2.021696 0.431483 0.320000 0.297244 0.646943 0.675556 0.694915 0.630928
50% 0.530271 0.469729 0.502612 0.497388 0.670846 0.700624 0.663386 0.681496 0.663386 0.679287 2.068474 0.495540 0.320713 0.336614 0.663386 0.679287 0.700624 0.640756
75% 0.530271 0.469729 0.518789 0.506792 0.691023 0.707031 0.702756 0.709804 0.702756 0.680000 2.137795 0.519202 0.324444 0.353057 0.702756 0.680000 0.707031 0.672451
max 0.530825 0.470219 0.534447 0.507307 0.696242 0.718310 0.712598 0.710448 0.712598 0.688889 2.258858 0.530331 0.333333 0.358268 0.712598 0.688889 0.718310 0.672646

Receiver Operator Characteristic and Sensitivity-Specificity Curves#

Receiver Operator Characteristic Curve:

# Set up lists for results
k_fold_fpr = [] # false positive rate
k_fold_tpr = [] # true positive rate
k_fold_thresholds = [] # threshold applied
k_fold_auc = [] # area under curve

# Loop through k fold predictions and get ROC results 
for i in range(5):
    fpr, tpr, thresholds = roc_curve(
        observed_kfold[i], predicted_proba_kfold[i])
    roc_auc = auc(fpr, tpr)
    k_fold_fpr.append(fpr)
    k_fold_tpr.append(tpr)
    k_fold_thresholds.append(thresholds)
    k_fold_auc.append(roc_auc)

# Show mean area under curve  
mean_auc = np.mean(k_fold_auc)
sd_auc = np.std(k_fold_auc)
print (f'\nMean AUC: {mean_auc:0.4f}')
print (f'SD AUC: {sd_auc:0.4f}')
Mean AUC: 0.7271
SD AUC: 0.0158

Sensitivity-specificity curve:

k_fold_sensitivity = []
k_fold_specificity = []

for i in range(5):
    # Get classification probabilities for k-fold replicate
    obs = observed_kfold[i]
    proba = predicted_proba_kfold[i]
    
    # Set up list for accuracy measures
    sensitivity = []
    specificity = []
    
    # Loop through increments in threshold probability of thrombolysis
    thresholds = np.arange(0.0, 1.01, 0.01)
    for cutoff in thresholds:  # loop 0 --> 1 in steps of 0.01
        # Get classification using cutoff
        predicted_class = proba >= cutoff
        predicted_class = predicted_class * 1.0
        # Call accuracy measures function
        accuracy = calculate_accuracy(obs, predicted_class)
        # Add accuracy scores to lists
        sensitivity.append(accuracy['sensitivity'])
        specificity.append(accuracy['specificity'])
    
    # Add replicate to lists
    k_fold_sensitivity.append(sensitivity)
    k_fold_specificity.append(specificity)

Combined plot:

fig = plt.figure(figsize=(10,5))

# Plot ROC
ax1 = fig.add_subplot(121)
for i in range(5):
    ax1.plot(k_fold_fpr[i], k_fold_tpr[i], color='orange')
ax1.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('Receiver Operator Characteristic Curve')
text = f'Mean AUC: {mean_auc:.3f}'
ax1.text(0.64,0.07, text, 
         bbox=dict(facecolor='white', edgecolor='black'))
plt.grid(True)

# Plot sensitivity-specificity
ax2 = fig.add_subplot(122)
for i in range(5):
    ax2.plot(k_fold_sensitivity[i], k_fold_specificity[i])
ax2.set_xlabel('Sensitivity')
ax2.set_ylabel('Specificity')
ax2.set_title('Sensitivity-Specificity Curve')
plt.grid(True)


plt.tight_layout(pad=2)
plt.savefig('./output/decision_comparison_roc_sens_spec_key_features.jpg', dpi=300)

plt.show()
[Figure: Receiver operator characteristic curve (left) and sensitivity-specificity curve (right) for the five k-fold splits]

Identify cross-over of sensitivity and specificity#

def get_intersect(a1, a2, b1, b2):
    """ 
    Returns the point of intersection of the lines passing through a2,a1 and b2,b1.
    a1: [x, y] a point on the first line
    a2: [x, y] another point on the first line
    b1: [x, y] a point on the second line
    b2: [x, y] another point on the second line
    """
    s = np.vstack([a1,a2,b1,b2])        # s for stacked
    h = np.hstack((s, np.ones((4, 1)))) # h for homogeneous
    l1 = np.cross(h[0], h[1])           # get first line
    l2 = np.cross(h[2], h[3])           # get second line
    x, y, z = np.cross(l1, l2)          # point of intersection
    if z == 0:                          # lines are parallel
        return (float('inf'), float('inf'))
    return (x/z, y/z)
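# Sanity check (illustrative): the diagonals of the unit square cross at
# the centre, e.g. get_intersect([0, 0], [1, 1], [0, 1], [1, 0]) gives
# (0.5, 0.5).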
intersections = []
for i in range(5):
    sens = np.array(k_fold_sensitivity[i])
    spec = np.array(k_fold_specificity[i])
    df = pd.DataFrame()
    df['sensitivity'] = sens
    df['specificity'] = spec
    df['spec greater sens'] = spec > sens

    # Find the last index where sensitivity is greater than specificity
    mask = df['spec greater sens'] == False
    last_id_sens_greater_spec = np.max(df[mask].index)
    locs = [last_id_sens_greater_spec, last_id_sens_greater_spec + 1]
    points = df.iloc[locs][['sensitivity', 'specificity']]

    # Get intersection with the line x = y
    a1 = list(points.iloc[0].values)
    a2 = list(points.iloc[1].values)
    b1 = [0, 0]
    b2 = [1, 1]

    intersections.append(get_intersect(a1, a2, b1, b2)[0])

mean_intersection = np.mean(intersections)
sd_intersection = np.std(intersections)
print (f'\nMean intersection: {mean_intersection:0.4f}')
print (f'SD intersection: {sd_intersection:0.4f}')
Mean intersection: 0.6772
SD intersection: 0.0133

Shap values#

We calculate Shap values for all five train/test splits, and will look in detail at the first split.

Get Shap values#

k_fold_shap_values_extended = []
k_fold_shap_values = []

for k in range(5):
    
    # Set up explainer using typical feature values from training set
    explainer = shap.TreeExplainer(model_kfold[k], X_train_kfold[k])

    # Get Shapley values along with base and features
    shap_values_extended = explainer(X_test_kfold[k])
    k_fold_shap_values_extended.append(shap_values_extended)
    # Shap values are returned for the positive class (1 = given thrombolysis)
    shap_values = shap_values_extended.values
    k_fold_shap_values.append(shap_values)       

    print (f'Completed {k+1} of 5')
Completed 1 of 5
Completed 2 of 5
Completed 3 of 5
Completed 4 of 5
Completed 5 of 5

Get average Shap values for each k-fold#

shap_values_mean_kfold = []
features = list(X_train_kfold[0])

for k in range(5):
    shap_values = k_fold_shap_values[k]
    # Get mean Shap values for each feature
    shap_values_mean = pd.DataFrame(index=features)
    shap_values_mean['mean_shap'] = np.mean(shap_values, axis=0)
    shap_values_mean['abs_mean_shap'] = np.abs(shap_values_mean['mean_shap'])
    shap_values_mean['mean_abs_shap'] = np.mean(np.abs(shap_values), axis=0)
    shap_values_mean['rank'] = shap_values_mean['mean_abs_shap'].rank(
        ascending=False).values
    shap_values_mean = shap_values_mean.sort_index()
    shap_values_mean_kfold.append(shap_values_mean)

Examine consistency across top Shap values (mean |Shap|)#

‘Raw’ Shap values from the XGBoost model are log odds ratios.
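
As a reminder of how the two scales relate, the log odds output (base value plus the sum of an instance's Shap values) converts to a probability via the logistic function. A minimal sketch for the first test instance of the first k-fold split:

# Sketch: convert base value + summed Shap values (log odds) to probability
base_value = k_fold_shap_values_extended[0].base_values[0]
log_odds = base_value + k_fold_shap_values[0][0].sum()
probability = 1 / (1 + np.exp(-log_odds))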

# Build df for k fold values
mean_abs_shap = pd.DataFrame()
for k in range(5):
    mean_abs_shap[f'{k}'] = shap_values_mean_kfold[k]['mean_abs_shap']
    
# Build df to show min, median, and max
mean_abs_shap_summary = pd.DataFrame()
mean_abs_shap_summary['min'] = mean_abs_shap.min(axis=1)
mean_abs_shap_summary['median'] = mean_abs_shap.median(axis=1) 
mean_abs_shap_summary['max'] = mean_abs_shap.max(axis=1)
mean_abs_shap_summary.sort_values('median', inplace=True, ascending=False)
top_10_shap = list(mean_abs_shap_summary.head(10).index)
fig = plt.figure(figsize=(6,6))
ax1 = fig.add_subplot(111)
ax1.violinplot(mean_abs_shap.loc[top_10_shap].T,
              showmedians=True,
              widths=1)
ax1.set_ylim(0)
labels = top_10_shap
ax1.set_xticks(np.arange(1, len(labels) + 1))
ax1.set_xticklabels(labels, rotation=45, ha='right')
ax1.grid(which='both')
ax1.set_ylabel('|Shap| (log odds ratio)')
plt.savefig('output/decision_comparison_shap_violin_key_features.jpg', dpi=300)
plt.show()
[Figure: Violin plot of mean |Shap| (log odds ratio) for the top features across the five k-fold splits]

Examine consistency of feature importances#

# Build df for k fold values
importances_df = pd.DataFrame()
for k in range(5):
    importances_df[f'{k}'] = importances_kfold[k]

# Build df to show min, median, and max
importances_summary = pd.DataFrame()
importances_summary['min'] = importances_df.min(axis=1)
importances_summary['median'] = importances_df.median(axis=1) 
importances_summary['max'] = importances_df.max(axis=1)
importances_summary.sort_values('median', inplace=True, ascending=False)
importance_features_index = list(importances_summary.index)
# Add feature names back in
importances_summary['feature'] = \
    [list(X_train)[feat] for feat in importance_features_index]
importances_summary.set_index('feature', inplace=True)
importances_summary
min median max
feature
S1OnsetTimeType_Precise 0.372726 0.410977 0.423405
AFAnticoagulent_Yes 0.223079 0.238576 0.276714
S2RankinBeforeStroke 0.163903 0.183379 0.196325
S2NihssArrival 0.077793 0.081011 0.090417
S2BrainImagingTime_min 0.043173 0.044716 0.047084
S1OnsetToArrival_min 0.040286 0.041661 0.044221
S2StrokeType_Infarction 0.000000 0.000000 0.000000
top_10_importances = list(importances_summary.head(10).index)
fig = plt.figure(figsize=(6,6))
ax1 = fig.add_subplot(111)
ax1.violinplot(importances_summary.loc[top_10_importances].T,
              showmedians=True,
              widths=1)
ax1.set_ylim(0)
labels = top_10_importances
ax1.set_xticks(np.arange(1, len(labels) + 1))
ax1.set_xticklabels(labels, rotation=45, ha='right')
ax1.grid(which='both')
ax1.set_ylabel('Importance')
plt.savefig('output/decision_comparison_importance_violin_key_features.jpg', dpi=300)
plt.show()
[Figure: Violin plot of feature importances across the five k-fold splits]

Compare top 10 Shap and importances#

compare_shap_importance = pd.DataFrame()
compare_shap_importance['Shap'] = mean_abs_shap_summary.head(10).index
compare_shap_importance['Importance'] = importances_summary.head(10).index
compare_shap_importance
Shap Importance
0 S2NihssArrival S1OnsetTimeType_Precise
1 S2RankinBeforeStroke AFAnticoagulent_Yes
2 S1OnsetTimeType_Precise S2RankinBeforeStroke
3 S2BrainImagingTime_min S2NihssArrival
4 S1OnsetToArrival_min S2BrainImagingTime_min
5 AFAnticoagulent_Yes S1OnsetToArrival_min
6 S2StrokeType_Infarction S2StrokeType_Infarction
shap_importance = pd.DataFrame()
shap_importance['Shap'] = mean_abs_shap_summary['median']
shap_importance = shap_importance.merge(
    importances_summary['median'], left_index=True, right_index=True)
shap_importance.rename(columns={'median':'Importance'}, inplace=True)
shap_importance.sort_values('Shap', inplace=True, ascending=False)
shap_importance.head(10)
Shap Importance
S2NihssArrival 0.503784 0.081011
S2RankinBeforeStroke 0.443493 0.183379
S1OnsetTimeType_Precise 0.369657 0.410977
S2BrainImagingTime_min 0.286047 0.044716
S1OnsetToArrival_min 0.239312 0.041661
AFAnticoagulent_Yes 0.079252 0.238576
S2StrokeType_Infarction 0.000000 0.000000
fig = plt.figure(figsize=(6,6))
ax1 = fig.add_subplot(111)
ax1.scatter(shap_importance['Shap'],
            shap_importance['Importance'])

ax1.set_xscale('log')
ax1.set_yscale('log')
ax1.set_xlabel('Shap')
ax1.set_ylabel('Importance')
ax1.grid()
plt.savefig(
    'output/decision_comparison_shap_importance_correlation_key_features.jpg',
    dpi=300)
plt.show()
[Figure: Median mean |Shap| versus median feature importance, log-log scatter plot]

Further analysis of one k-fold#

Having established that Shap values have good consistency across k-fold replicates, here we show more detail on Shap using the first k-fold replicate.

# Get all key values from first k fold
model = model_kfold[0]
shap_values = k_fold_shap_values[0]
shap_values_extended = k_fold_shap_values_extended[0]
importances = importances_kfold[0]
y_pred = predicted_kfold[0]
y_prob = predicted_proba_kfold[0]
X_train = X_train_kfold[0]
X_test = X_test_kfold[0]
y_train = y_train_kfold[0]
y_test = y_test_kfold[0]

Beeswarm plot#

A beeswarm plot shows all points: the colour of each point shows the feature value, and its position shows the Shap value for that instance.

fig = plt.figure(figsize=(6,6))

shap.summary_plot(shap_values=shap_values, 
                  features=X_test,
                  feature_names=features,
                  max_display=8,
                  cmap=plt.get_cmap('nipy_spectral'), show=False)
plt.savefig('output/xgb_decision_comparison_beeswarm_key_features.jpg', 
            dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()
[Figure: Beeswarm plot of Shap values for the first k-fold test set]

Plot waterfall and decision plots for instances with low or high probability of receiving thrombolysis#

Waterfall and decision plots are alternative ways of showing the influence of features for individual cases.
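
Below we show waterfall plots; a decision plot could be produced in a similar way (a minimal sketch, assuming the standard shap.decision_plot API):

# Sketch: decision plot tracing a sample of test instances from the base
# value to their final log odds predictions
shap.decision_plot(shap_values_extended.base_values[0],
                   shap_values[:20],
                   feature_names=features)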

# Get the location of an example each of low and high probability
location_low_probability = np.where(y_prob == np.min(y_prob))[0][0]
location_high_probability = np.where(y_prob == np.max(y_prob))[0][0]

An example with low probability of receiving thrombolysis.

fig = shap.plots.waterfall(shap_values_extended[location_low_probability],
                           show=False, max_display=8)
plt.savefig('output/xgb_decision_comparison_waterfall_low_key_features.jpg', 
            dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()
[Figure: Waterfall plot (log odds) for an instance with low probability of thrombolysis]

An example with high probability of receiving thrombolysis.

fig = shap.plots.waterfall(shap_values_extended[location_high_probability],
                           show=False, max_display=8)
plt.savefig('output/xgb_decision_comparison_waterfall_high_key_features.jpg',
            dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()
[Figure: Waterfall plot (log odds) for an instance with high probability of thrombolysis]

Show the relationship between feature value and Shap value for the top 5 influential features#

feat_to_show = top_10_shap[0:5]

for feat in feat_to_show:
    shap.plots.scatter(shap_values_extended[:, feat], x_jitter=0)
[Figure: Scatter plots of feature value versus Shap value for each of the top 5 features]

Showing waterfall plots using probability values#

Though Shap values for an XGBoost model most accurately describe the effect on the log odds of classification, it may be easier for people to understand the influence of features expressed as probabilities. Here we plot waterfall plots using probabilities.

A disadvantage of this method is that it distorts the influence of features somewhat: features pushing the probability down from a low level to an even lower level get ‘squashed’ in apparent importance. This distortion is avoided when plotting log odds ratios, but at the cost of an output that many people find harder to interpret.
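
A small numerical illustration of this squashing: with the logistic link between log odds and probability, the same one-unit drop in log odds shifts the probability far less when the starting probability is already low.

# Illustration: equal log odds changes give unequal probability changes
def logistic(log_odds):
    return 1 / (1 + np.exp(-log_odds))

print(logistic(0) - logistic(-1))   # ~0.231: large shift near p = 0.5
print(logistic(-3) - logistic(-4))  # ~0.029: 'squashed' shift at low p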

# Set up explainer using typical feature values from training set
explainer = shap.TreeExplainer(model, X_train, model_output='probability')

# Get Shapley values along with base and features
shap_values_extended = explainer(X_test)
shap_values = shap_values_extended.values
# Get the location of an example each of low and high probability

location_low_probability = np.where(y_prob==np.min(y_prob))[0][0]
location_high_probability = np.where(y_prob==np.max(y_prob))[0][0]

An example with low probability of receiving thrombolysis.

fig = shap.plots.waterfall(shap_values_extended[location_low_probability],
                           show=False, max_display=8)
plt.savefig(
    'output/xgb_decision_comparison_waterfall_low_probability_key_features.jpg', 
    dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()
[Figure: Waterfall plot (probability) for an instance with low probability of thrombolysis]

An example with high probability of receiving thrombolysis.

fig = shap.plots.waterfall(shap_values_extended[location_high_probability],
                           show=False, max_display=8)
plt.savefig(
    'output/xgb_decision_comparison_waterfall_high_probability_key_features.jpg', 
    dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()
[Figure: Waterfall plot (probability) for an instance with high probability of thrombolysis]

Observations#

  • We can predict which patients, out of all those who would be thrombolysed by the benchmark decision, will receive thrombolysis at a local unit with 68% accuracy (mean AUC 0.727).

  • The five most important distinguishing features are:

    • S2NihssArrival

    • S2RankinBeforeStroke

    • S1OnsetTimeType_Precise

    • S2BrainImagingTime_min

    • S1OnsetToArrival_min