Modular TensorFlow model with 1D embedding - analyse#

Embedding converts a categorical variable into a projection onto n-dimensional space [1]. It has been shown to be an effective way to train neural networks on categorical data, while also allowing a measure of similarity/distance between different values of the categorical variable. Here we use embedding for hospital ID. We also convert patient data and pathway data into embedded vectors (this may also be thought of as encoding each group of features in a vector with fewer dimensions than the original data set).

[1] Guo C & Berkhahn F. (2016) Entity Embeddings of Categorical Variables. arXiv:1604.06737 [cs] http://arxiv.org/abs/1604.06737
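The similarity measure mentioned above can be read directly from learned weights: when a one-hot input feeds a small dense layer, each category's embedding is one row of the weight matrix, and distances between rows compare category values. A minimal sketch, assuming illustrative sizes (n_categories and n_dim are not from this notebook):

import numpy as np
from tensorflow.keras import layers, Model

n_categories, n_dim = 7, 2                       # assumed sizes for illustration
cat_input = layers.Input(shape=(n_categories,))  # one-hot encoded category
embedding = layers.Dense(n_dim, use_bias=False)(cat_input)  # n-dim projection
output = layers.Dense(1, activation='sigmoid')(embedding)
demo_model = Model(inputs=cat_input, outputs=output)

# After training, each category's embedding is a row of the weight matrix
weights = demo_model.layers[1].get_weights()[0]  # shape (n_categories, n_dim)
dist_0_1 = np.linalg.norm(weights[0] - weights[1])  # distance between categories 0 and 1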

Pre-trained models come from the notebook 002b_1d_modular_fit.ipynb.

Models were fitted to previously split training and test data sets.

Aims#

Test performance of model using subnets for:

  1. Patient clinical data: Age, gender, ethnicity, disability before stroke, stroke scale data. Pass through one hidden layer (with twice as many neurons as input features) and then to a single neuron with sigmoid activation.

  2. Pathway process data: Times of arrival and scan, time of day, day of week. Pass through one hidden layer (with twice as many neurons as input features) and then to a single neuron with sigmoid activation.

  3. Hospital ID (one-hot encoded): Connect input directly to single neuron with sigmoid activation.

The three subnet outputs are then passed to a single neuron with sigmoid activation for the final output (a sketch of the full architecture follows the note below).

Note: In this experiment the output of each subnet is a vector with a single value (a 1D output), but the model is structured to allow more neurons (dimensions) for each subnet, e.g. to allow hospital encodings that capture more complex interactions with patient features (which may themselves have a >1 dimension output).
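A minimal sketch of how the three-subnet architecture above might be assembled with the Keras functional API (the feature counts and the ReLU hidden activation are illustrative assumptions; the actual pre-trained models are loaded below):

from tensorflow.keras import layers, Model

n_clinical, n_pathway, n_hospitals = 10, 8, 132  # assumed feature counts

# Subnet 1: patient clinical data (hidden layer with 2x input neurons)
input_clinical = layers.Input(shape=(n_clinical,))
clinical_hidden = layers.Dense(n_clinical * 2, activation='relu')(input_clinical)
clinical_encode = layers.Dense(1, activation='sigmoid')(clinical_hidden)

# Subnet 2: pathway process data (hidden layer with 2x input neurons)
input_pathway = layers.Input(shape=(n_pathway,))
pathway_hidden = layers.Dense(n_pathway * 2, activation='relu')(input_pathway)
pathway_encode = layers.Dense(1, activation='sigmoid')(pathway_hidden)

# Subnet 3: one-hot hospital ID connected directly to a single sigmoid neuron
input_hospital = layers.Input(shape=(n_hospitals,))
hospital_encode = layers.Dense(1, activation='sigmoid')(input_hospital)

# Combine the three 1D subnet outputs for the final sigmoid output
concat = layers.Concatenate()([clinical_encode, pathway_encode, hospital_encode])
final_output = layers.Dense(1, activation='sigmoid')(concat)

sketch_model = Model(inputs=[input_clinical, input_pathway, input_hospital],
                     outputs=final_output)
sketch_model.compile(loss='binary_crossentropy', optimizer='adam',
                     metrics=['accuracy'])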

Import libraries#

path = './saved_models/1d_modular/'
# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import numpy as np
import os
import pandas as pd

# sklearn for pre-processing
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import auc

# TensorFlow api model
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
from tensorflow.keras.losses import binary_crossentropy

Import data#

train_data, test_data = [], []
data_loc = '../data/kfold_5fold/'
for i in range(5):
    
    train_data.append(pd.read_csv(data_loc + 'train_{0}.csv'.format(i)))
    test_data.append(pd.read_csv(data_loc + 'test_{0}.csv'.format(i)))

Define function to scale data#

def scale_data(X_train, X_test):
    """Scale data 0-1 based on min and max in training set"""
    
    # Initialise a new scaling object for normalising input data
    sc = MinMaxScaler()

    # Set up the scaler just on the training set
    sc.fit(X_train)

    # Apply the scaler to the training and test sets
    train_sc = sc.transform(X_train)
    test_sc = sc.transform(X_test)
    
    return train_sc, test_sc
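A hypothetical usage note: because the scaler is fitted on the training set only, test values outside the training range map outside 0-1 (the toy arrays below are illustrative):

import numpy as np
X_train_toy = np.array([[0.0, 10.0],
                        [5.0, 20.0]])
X_test_toy = np.array([[2.5, 30.0]])
train_sc, test_sc = scale_data(X_train_toy, X_test_toy)
print(test_sc)  # [[0.5 2. ]] - second test value exceeds the training max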

Define function for accuracy#

def calculate_accuracy(observed, predicted):
    
    """
    Calculates a range of accuracy scores from observed and predicted classes.
    
    Takes two list or NumPy arrays (observed class values, and predicted class 
    values), and returns a dictionary of results.
    
     1) observed positive rate: proportion of observed cases that are +ve
     2) Predicted positive rate: proportion of predicted cases that are +ve
     3) observed negative rate: proportion of observed cases that are -ve
     4) Predicted negative rate: proportion of predicted cases that are -ve  
     5) accuracy: proportion of predicted results that are correct    
     6) precision: proportion of predicted +ve that are correct
     7) recall: proportion of true +ve correctly identified
     8) f1: harmonic mean of precision and recall
     9) sensitivity: Same as recall
    10) specificity: Proportion of true -ve identified:        
    11) positive likelihood: increased probability of true +ve if test +ve
    12) negative likelihood: reduced probability of true +ve if test -ve
    13) false positive rate: proportion of false +ves in true -ve patients
    14) false negative rate: proportion of false -ves in true +ve patients
    15) true positive rate: Same as recall
    16) true negative rate: Same as specificity
    17) positive predictive value: chance of true +ve if test +ve
    18) negative predictive value: chance of true -ve if test -ve
    
    """
    
    # Convert lists to NumPy arrays
    if type(observed) == list:
        observed = np.array(observed)
    if type(predicted) == list:
        predicted = np.array(predicted)
    
    # Calculate accuracy scores
    observed_positives = observed == 1
    observed_negatives = observed == 0
    predicted_positives = predicted == 1
    predicted_negatives = predicted == 0
    
    true_positives = predicted_positives & observed_positives
    
    false_positives = predicted_positives & ~observed_positives
    
    true_negatives = predicted_negatives & observed_negatives
    
    false_negatives = predicted_negatives & ~observed_negatives
    
    accuracy = np.mean(predicted == observed)
    
    precision = (np.sum(true_positives) /
                 (np.sum(true_positives) + np.sum(false_positives)))
        
    recall = np.sum(true_positives) / np.sum(observed_positives)
    
    sensitivity = recall
    
    f1 = 2 * ((precision * recall) / (precision + recall))
    
    specificity = np.sum(true_negatives) / np.sum(observed_negatives)
    
    positive_likelihood = sensitivity / (1 - specificity)
    
    negative_likelihood = (1 - sensitivity) / specificity
    
    false_positive_rate = 1 - specificity
    
    false_negative_rate = 1 - sensitivity
    
    true_positive_rate = sensitivity
    
    true_negative_rate = specificity
    
    positive_predictive_value = (np.sum(true_positives) / 
                            (np.sum(true_positives) + np.sum(false_positives)))
    
    negative_predictive_value = (np.sum(true_negatives) / 
                            (np.sum(true_negatives) + np.sum(false_negatives)))
    
    # Create dictionary for results, and add results
    results = dict()
    
    results['observed_positive_rate'] = np.mean(observed_positives)
    results['observed_negative_rate'] = np.mean(observed_negatives)
    results['predicted_positive_rate'] = np.mean(predicted_positives)
    results['predicted_negative_rate'] = np.mean(predicted_negatives)
    results['accuracy'] = accuracy
    results['precision'] = precision
    results['recall'] = recall
    results['f1'] = f1
    results['sensitivity'] = sensitivity
    results['specificity'] = specificity
    results['positive_likelihood'] = positive_likelihood
    results['negative_likelihood'] = negative_likelihood
    results['false_positive_rate'] = false_positive_rate
    results['false_negative_rate'] = false_negative_rate
    results['true_positive_rate'] = true_positive_rate
    results['true_negative_rate'] = true_negative_rate
    results['positive_predictive_value'] = positive_predictive_value
    results['negative_predictive_value'] = negative_predictive_value
    
    return results
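A quick hypothetical check of the function with a toy example (values chosen for illustration):

toy_observed = [1, 1, 0, 0]
toy_predicted = [1, 0, 0, 0]
toy_results = calculate_accuracy(toy_observed, toy_predicted)
print(toy_results['accuracy'])   # 0.75 (three of four cases classified correctly)
print(toy_results['recall'])     # 0.5 (one of two true positives identified)
print(toy_results['precision'])  # 1.0 (the single predicted positive is correct)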

Define function for line intersect#

Used to find point of sensitivity-specificity curve where sensitivity = specificity.

def get_intersect(a1, a2, b1, b2):
    """ 
    Returns the point of intersection of the lines passing through a2,a1 and b2,b1.
    a1: [x, y] a point on the first line
    a2: [x, y] another point on the first line
    b1: [x, y] a point on the second line
    b2: [x, y] another point on the second line
    """
    s = np.vstack([a1,a2,b1,b2])        # s for stacked
    h = np.hstack((s, np.ones((4, 1)))) # h for homogeneous
    l1 = np.cross(h[0], h[1])           # get first line
    l2 = np.cross(h[2], h[3])           # get second line
    x, y, z = np.cross(l1, l2)          # point of intersection
    if z == 0:                          # lines are parallel
        return (float('inf'), float('inf'))
    return (x/z, y/z)
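A quick illustrative check: the diagonal y = x should cross the line from (0, 1) to (1, 0) at (0.5, 0.5):

print(get_intersect([0, 1], [1, 0], [0, 0], [1, 1]))  # (0.5, 0.5)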

Use trained models to predict outcome of test data sets#

# Set up lists for accuracies and ROC data
accuracies = []
roc_fpr = []
roc_tpr = []

# Set up lists for observed and predicted
observed = []
predicted_proba = []
predicted = []

# Get data subgroups
subgroups = pd.read_csv('../data/subnet.csv', index_col='Item')
# Get list of clinical items
clinical_subgroup = subgroups.loc[subgroups['Subnet']=='clinical']
clinical_subgroup = list(clinical_subgroup.index)
# Get list of pathway items
pathway_subgroup = subgroups.loc[subgroups['Subnet']=='pathway']
pathway_subgroup = list(pathway_subgroup.index)
# Get list of hospital items
hospital_subgroup = subgroups.loc[subgroups['Subnet']=='hospital']
hospital_subgroup = list(hospital_subgroup.index)

# Loop through 5 k-folds
for k in range(5):
    
    # Load data
    train = pd.read_csv(f'../data/kfold_5fold/train_{k}.csv')
    test = pd.read_csv(f'../data/kfold_5fold/test_{k}.csv')
    
    # OneHot encode stroke team
    coded = pd.get_dummies(train['StrokeTeam'])
    train = pd.concat([train, coded], axis=1)
    train.drop('StrokeTeam', inplace=True, axis=1)
    coded = pd.get_dummies(test['StrokeTeam'])
    test = pd.concat([test, coded], axis=1)
    test.drop('StrokeTeam', inplace=True, axis=1)
    
    # Split into X, y
    X_train_df = train.drop('S2Thrombolysis',axis=1) 
    y_train_df = train['S2Thrombolysis']
    X_test_df = test.drop('S2Thrombolysis',axis=1) 
    y_test_df = test['S2Thrombolysis'] 
    
    # Split train and test data by subgroups
    X_train_patients = X_train_df[clinical_subgroup]
    X_test_patients = X_test_df[clinical_subgroup]
    X_train_pathway = X_train_df[pathway_subgroup]
    X_test_pathway = X_test_df[pathway_subgroup]
    X_train_hospitals = X_train_df[hospital_subgroup]
    X_test_hospitals = X_test_df[hospital_subgroup]
            
    # Convert to NumPy
    X_train = X_train_df.values
    X_test = X_test_df.values
    y_train = y_train_df.values
    y_test = y_test_df.values
    
    # Scale data
    X_train_patients_sc, X_test_patients_sc = \
        scale_data(X_train_patients, X_test_patients)

    X_train_pathway_sc, X_test_pathway_sc = \
        scale_data(X_train_pathway, X_test_pathway)

    X_train_hospitals_sc, X_test_hospitals_sc = \
        scale_data(X_train_hospitals, X_test_hospitals)
    
    # Load model
    filename = f'{path}model_{str(k)}.h5'
    model = keras.models.load_model(filename)
    
    # Get and store probability
    probability = model.predict(
        [X_test_patients_sc, X_test_pathway_sc, X_test_hospitals_sc])
    observed.append(y_test)
    predicted_proba.append(probability.flatten())
    
    # Get and store class
    y_pred_class = probability >= 0.5
    y_pred_class = y_pred_class.flatten()
    predicted.append(y_pred_class)
    
    # Get accuracy measurements
    accuracy_dict = calculate_accuracy(y_test, y_pred_class)
    accuracies.append(accuracy_dict)
    
    # ROC
    curve_fpr = [] # false positive rate
    curve_tpr = [] # true positive rate

    # Loop through increments in classification threshold
    thresholds = np.arange(0, 1.01, 0.01)
    for cutoff in thresholds: # loop 0 --> 1 in steps of 0.01
        # Get predicted class using cutoff
        predicted_class = probability >= cutoff
        predicted_class = predicted_class.flatten() * 1.0
        # Call accuracy measures function
        accuracy = calculate_accuracy(y_test, predicted_class)
        # Add accuracy scores to lists
        curve_fpr.append(accuracy['false_positive_rate'])
        curve_tpr.append(accuracy['true_positive_rate'])
    
    # Add roc to overall lists
    roc_fpr.append(curve_fpr)
    roc_tpr.append(curve_tpr)
results = pd.DataFrame(accuracies)
results.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| observed_positive_rate | 5.0 | 0.295261 | 0.000146 | 0.295080 | 0.295176 | 0.295232 | 0.295401 | 0.295417 |
| observed_negative_rate | 5.0 | 0.704739 | 0.000146 | 0.704583 | 0.704599 | 0.704768 | 0.704824 | 0.704920 |
| predicted_positive_rate | 5.0 | 0.301188 | 0.009075 | 0.291690 | 0.292309 | 0.301810 | 0.308968 | 0.311161 |
| predicted_negative_rate | 5.0 | 0.698812 | 0.009075 | 0.688839 | 0.691032 | 0.698190 | 0.707691 | 0.708310 |
| accuracy | 5.0 | 0.854500 | 0.000281 | 0.854091 | 0.854324 | 0.854661 | 0.854709 | 0.854717 |
| precision | 5.0 | 0.748799 | 0.007519 | 0.741236 | 0.741401 | 0.748696 | 0.755722 | 0.756939 |
| recall | 5.0 | 0.763643 | 0.015298 | 0.747858 | 0.748381 | 0.764941 | 0.776296 | 0.780738 |
| f1 | 5.0 | 0.756012 | 0.003723 | 0.752034 | 0.752371 | 0.756731 | 0.758447 | 0.760475 |
| sensitivity | 5.0 | 0.763643 | 0.015298 | 0.747858 | 0.748381 | 0.764941 | 0.776296 | 0.780738 |
| specificity | 5.0 | 0.892566 | 0.006443 | 0.885723 | 0.886655 | 0.892356 | 0.898692 | 0.899402 |
| positive_likelihood | 5.0 | 7.121689 | 0.285743 | 6.832007 | 6.848993 | 7.106183 | 7.387168 | 7.434095 |
| negative_likelihood | 5.0 | 0.264719 | 0.015230 | 0.247551 | 0.252301 | 0.263414 | 0.279984 | 0.280345 |
| false_positive_rate | 5.0 | 0.107434 | 0.006443 | 0.100598 | 0.101308 | 0.107644 | 0.113345 | 0.114277 |
| false_negative_rate | 5.0 | 0.236357 | 0.015298 | 0.219262 | 0.223704 | 0.235059 | 0.251619 | 0.252142 |
| true_positive_rate | 5.0 | 0.763643 | 0.015298 | 0.747858 | 0.748381 | 0.764941 | 0.776296 | 0.780738 |
| true_negative_rate | 5.0 | 0.892566 | 0.006443 | 0.885723 | 0.886655 | 0.892356 | 0.898692 | 0.899402 |
| positive_predictive_value | 5.0 | 0.748799 | 0.007519 | 0.741236 | 0.741401 | 0.748696 | 0.755722 | 0.756939 |
| negative_predictive_value | 5.0 | 0.900189 | 0.005153 | 0.894904 | 0.895050 | 0.900548 | 0.904475 | 0.905967 |

Receiver Operating Characteristic (ROC) Curve and Sensitivity-Specificity Curves#

Calculate areas of ROC curves.

k_fold_auc = []
for k in range(5):
    # Get AUC
    area = auc(roc_fpr[k], roc_tpr[k])
    print (f'ROC AUC: {area:0.3f}')
    k_fold_auc.append(area)

# Show mean area under curve  
mean_auc = np.mean(k_fold_auc)
sd_auc = np.std(k_fold_auc)
print (f'\nMean AUC: {mean_auc:0.4f}')
print (f'SD AUC: {sd_auc:0.4f}')
ROC AUC: 0.921
ROC AUC: 0.922
ROC AUC: 0.919
ROC AUC: 0.921
ROC AUC: 0.922

Mean AUC: 0.9211
SD AUC: 0.0011

Plot Receiver Operating Characteristic Curve

fig, ax = plt.subplots(figsize=(6,6))
for i in range(5):
    ax.plot(roc_fpr[i], roc_tpr[i], color='orange', linestyle='-')
ax.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
ax.set_xlabel('False Positive Rate (1 - Specificity)')
ax.set_ylabel('True Positive Rate (Sensitivity)')
ax.text(0.55, 0.1, f'Mean AUC: {mean_auc:0.3f}', fontsize=12, 
        bbox=dict(facecolor='w', alpha=1.0))
ax.grid()
filename = path + '1d_roc.jpg'
plt.savefig(filename, dpi=300)
plt.show()
[Figure: Receiver Operating Characteristic curves for the five k-fold models]

Plot Sensitivity-Specificity Curve alongside Receiver Operating Characteristic Curve

k_fold_sensitivity = []
k_fold_specificity = []


for i in range(5):
    # Get classification probabilities for k-fold replicate
    obs = observed[i]
    proba = predicted_proba[i]
    
    # Set up list for accuracy measures
    sensitivity = []
    specificity = []
    
    # Loop through increments in classification threshold
    thresholds = np.arange(0.0, 1.01, 0.01)
    for cutoff in thresholds: # loop 0 --> 1 in steps of 0.01
        # Get predicted class using cutoff
        predicted_class = proba >= cutoff
        predicted_class = predicted_class * 1.0
        # Call accuracy measures function
        accuracy = calculate_accuracy(obs, predicted_class)
        # Add accuracy scores to lists
        sensitivity.append(accuracy['sensitivity'])
        specificity.append(accuracy['specificity'])
    
    # Add replicate to lists
    k_fold_sensitivity.append(sensitivity)
    k_fold_specificity.append(specificity)
fig = plt.figure(figsize=(10,5))

# Plot ROC
ax1 = fig.add_subplot(121)
for i in range(5):
    ax1.plot(roc_fpr[i], roc_tpr[i], color='orange')
ax1.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('Receiver Operating Characteristic Curve')
text = f'Mean AUC: {mean_auc:.3f}'
ax1.text(0.64,0.07, text, 
         bbox=dict(facecolor='white', edgecolor='black'))
plt.grid(True)

# Plot sensitivity-specificity
ax2 = fig.add_subplot(122)
for i in range(5):
    ax2.plot(k_fold_sensitivity[i], k_fold_specificity[i])
ax2.set_xlabel('Sensitivity')
ax2.set_ylabel('Specificity')
ax2.set_title('Sensitivity-Specificity Curve')
plt.grid(True)


plt.tight_layout(pad=2)
plt.savefig('./output/nn_1d_roc_sens_spec.jpg', dpi=300)

plt.show()
[Figure: ROC curve (left) and sensitivity-specificity curve (right) for the five k-fold models]

Identify cross-over of sensitivity and specificity#

sens = np.array(k_fold_sensitivity).mean(axis=0)
spec = np.array(k_fold_specificity).mean(axis=0)
df = pd.DataFrame()
df['sensitivity'] = sens
df['specificity'] = spec
df['spec greater sens'] = spec > sens

# Find the last index where sensitivity is greater than specificity
mask = df['spec greater sens'] == False
last_id_sens_greater_spec = np.max(df[mask].index)
locs = [last_id_sens_greater_spec, last_id_sens_greater_spec + 1]
points = df.iloc[locs][['sensitivity', 'specificity']]

# Get intersection with the line x=y
a1 = list(points.iloc[0].values)
a2 = list(points.iloc[1].values)
b1 = [0, 0]
b2 = [1, 1]

intersect = get_intersect(a1, a2, b1, b2)[0]
print(f'\nIntersect: {intersect:0.3f}')
Intersect: 0.845

Collate and save results#

hospital_results = []
kfold_result = []
threshold_results = []
observed_results = []
prob_results = []
predicted_results = []

for i in range(5):
    hospital_results.extend(list(test_data[i]['StrokeTeam']))
    kfold_result.extend(list(np.repeat(i, len(test_data[i]))))
    # Record the 0.5 classification threshold used for predicted classes
    threshold_results.extend(list(np.repeat(0.5, len(test_data[i]))))
    observed_results.extend(list(observed[i]))
    prob_results.extend(list(predicted_proba[i]))
    predicted_results.extend(list(predicted[i]))    
    
model_results = pd.DataFrame()
model_results['hospital'] = hospital_results
model_results['observed'] = np.array(observed_results) * 1.0
model_results['prob'] = prob_results
model_results['predicted'] = predicted_results
model_results['k_fold'] = kfold_result
model_results['threshold'] = threshold_results
model_results['correct'] = model_results['observed'] == model_results['predicted']

# Save
filename = './predictions/nn_1d_k_fold.csv'
model_results.to_csv(filename, index=False)

Calibration and assessment of accuracy when model has high confidence#
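The binning below uses np.digitize, which returns, for each probability, the index of the first bin edge that exceeds it. A quick illustration with assumed example probabilities:

import numpy as np
step = 0.10
bins = np.arange(step, 1 + step, step)   # upper bin edges 0.1, 0.2, ..., 1.0
example_probs = np.array([0.05, 0.15, 0.95])
print(np.digitize(example_probs, bins))  # [0 1 9]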

# Collate results in Dataframe
reliability_collated = pd.DataFrame()

# Loop through k fold predictions
for i in range(5):
    
    # Get observed class and predicted probability
    obs = observed[i]
    prob = predicted_proba[i]
    
    # Bin data with numpy digitize (this will assign a bin to each case)
    step = 0.10
    bins = np.arange(step, 1+step, step)
    digitized = np.digitize(prob, bins)
        
    # Put single fold data in DataFrame
    reliability = pd.DataFrame()
    reliability['bin'] = digitized
    reliability['probability'] = prob
    reliability['observed'] = obs
    classification = 1 * (prob >= 0.5)  # same 0.5 cutoff as used above
    reliability['correct'] = obs == classification
    reliability['count'] = 1
    
    # Summarise data by bin in new dataframe
    reliability_summary = pd.DataFrame()

    # Add bins and k-fold to summary
    reliability_summary['bin'] = bins
    reliability_summary['k-fold'] = i

    # Calculate mean of predicted probability of thrombolysis in each bin
    reliability_summary['confidence'] = \
        reliability.groupby('bin').mean()['probability']

    # Calculate the proportion of patients who receive thrombolysis
    reliability_summary['fraction_positive'] = \
        reliability.groupby('bin').mean()['observed']
    
    # Calculate proportion correct in each bin
    reliability_summary['fraction_correct'] = \
        reliability.groupby('bin').mean()['correct']
    
    # Calculate fraction of results in each bin
    reliability_summary['fraction_results'] = \
        reliability.groupby('bin').sum()['count'] / reliability.shape[0]   
    
    # Add k-fold results to DataFrame collation
    # (pd.concat used as DataFrame.append is removed in newer pandas)
    reliability_collated = pd.concat([reliability_collated, reliability_summary])
    
# Get mean results
reliability_summary = reliability_collated.groupby('bin').mean()
reliability_summary.drop('k-fold', axis=1, inplace=True)
reliability_summary
| bin | confidence | fraction_positive | fraction_correct | fraction_results |
|---|---|---|---|---|
| 0.1 | 0.023090 | 0.018403 | 0.981597 | 0.462070 |
| 0.2 | 0.145258 | 0.139055 | 0.860945 | 0.082392 |
| 0.3 | 0.247580 | 0.232161 | 0.767839 | 0.056360 |
| 0.4 | 0.349330 | 0.324901 | 0.675099 | 0.050277 |
| 0.5 | 0.450219 | 0.427628 | 0.572372 | 0.047713 |
| 0.6 | 0.550560 | 0.558217 | 0.558217 | 0.053223 |
| 0.7 | 0.651588 | 0.650801 | 0.650801 | 0.063130 |
| 0.8 | 0.751653 | 0.786995 | 0.786995 | 0.077625 |
| 0.9 | 0.847036 | 0.862959 | 0.862959 | 0.085721 |
| 1.0 | 0.922530 | 0.912524 | 0.912524 | 0.021489 |

fig = plt.figure(figsize=(10,5))


# Plot predicted probability vs fraction positive
ax1 = fig.add_subplot(1,2,1)

# Loop through k-fold reliability results
for i in range(5):
    mask = reliability_collated['k-fold'] == i
    k_fold_result = reliability_collated[mask]
    x = k_fold_result['confidence']
    y = k_fold_result['fraction_positive']
    ax1.plot(x,y, color='orange')
# Add 1:1 line
ax1.plot([0,1],[0,1], color='k', linestyle ='--')
# Refine plot
ax1.set_xlabel('Model probability')
ax1.set_ylabel('Fraction positive')
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.grid()

# Plot accuracy vs probability
ax2 = fig.add_subplot(1,2,2)
# Loop through k-fold reliability results
for i in range(5):
    mask = reliability_collated['k-fold'] == i
    k_fold_result = reliability_collated[mask]
    x = k_fold_result['confidence']
    y = k_fold_result['fraction_correct']
    ax2.plot(x,y, color='orange')
# Refine plot
ax2.set_xlabel('Model probability')
ax2.set_ylabel('Fraction correct')
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.grid()

ax3 = ax2.twinx()  # instantiate a second axes that shares the same x-axis
for i in range(5):
    mask = reliability_collated['k-fold'] == i
    k_fold_result = reliability_collated[mask]
    x = k_fold_result['confidence']
    y = k_fold_result['fraction_results']
    ax3.plot(x,y, color='blue')
    
ax3.set_xlim(0, 1)
ax3.set_ylim(0, 0.5)
ax3.set_ylabel('Fraction of samples')

custom_lines = [Line2D([0], [0], color='orange', alpha=0.6, lw=2),
                Line2D([0], [0], color='blue', alpha = 0.6,lw=2)]

plt.legend(custom_lines, ['Fraction correct', 'Fraction of samples'],
          loc='upper center')

plt.tight_layout(pad=2)
plt.savefig('./output/nn_1d_reliability.jpg', dpi=300)
plt.show()
[Figure: calibration (fraction positive vs model probability, left) and fraction correct / fraction of samples vs model probability (right) for the five k-fold models]
bins = [0.1, 0.2, 0.9, 1.0]
acc = reliability_summary.loc[bins].mean()['fraction_correct']
frac = reliability_summary.loc[bins].sum()['fraction_results']

print ('For samples with at least 80% confidence:')
print (f'Proportion of all samples: {frac:0.3f}')
print (f'Accuracy: {acc:0.3f}')
For samples with at least 80% confidence:
Proportion of all samples: 0.652
Accuracy: 0.905

Observations#

  • Overall accuracy = 85.5% (90.5% for the 65% of samples where the model had at least 80% confidence)

  • Using the nominal classification threshold (50% probability), specificity (89%) is greater than sensitivity (76%)

  • The model can achieve 84.5% sensitivity and specificity simultaneously

  • ROC AUC = 0.921