Find similar patients who are treated differently within the same hospital#

Aims#

  • Investigate the number of misclassifications within a hospital

  • Define a similarity measure using the decision tree structure of a random forest classifier

  • For patients with a predicted outcome different to their true outcome, find similar patients that were treated differently

Code#

Import libraries#

import os
import json
import pickle as pkl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline

import sklearn as sk
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")

Load pre-trained hospital models into dictionary hospital2model#

keys = hospitals

values = trained_classifier, threshold, patients, outcomes

Note: patients is a numpy array.

with open('./models/trained_hospital_models_for _cohort.pkl', 'rb') as f:
    hospital2model = pkl.load(f)

Select a hospital for investigation#

hospital = list(hospital2model.keys())[1]
hospital
'TPFFP4410O'

Load test cohort and extract hospital patients#

cohort = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')

test_patients_df = cohort.loc[cohort['StrokeTeam']==hospital]
test_patients_df.head(5)
StrokeTeam S1AgeOnArrival S1OnsetToArrival_min S2RankinBeforeStroke Loc LocQuestions LocCommands BestGaze Visual FacialPalsy ... S2NewAFDiagnosis_Yes S2NewAFDiagnosis_missing S2StrokeType_Infarction S2StrokeType_Primary Intracerebral Haemorrhage S2StrokeType_missing S2TIAInLastMonth_No S2TIAInLastMonth_No but S2TIAInLastMonth_Yes S2TIAInLastMonth_missing S2Thrombolysis
105 TPFFP4410O 72.5 68.0 2 2 2.0 2.0 0.0 2.0 2.0 ... 0 1 0 1 0 0 0 0 1 0
720 TPFFP4410O 87.5 108.0 0 3 2.0 2.0 2.0 3.0 3.0 ... 0 1 0 1 0 0 0 0 1 0
759 TPFFP4410O 82.5 131.0 4 1 2.0 2.0 0.0 0.0 1.0 ... 0 0 1 0 0 0 0 0 1 0
779 TPFFP4410O 57.5 145.0 0 0 0.0 0.0 0.0 0.0 2.0 ... 0 1 1 0 0 0 0 0 1 1
898 TPFFP4410O 67.5 162.0 1 0 0.0 0.0 1.0 2.0 3.0 ... 0 0 1 0 0 0 0 0 1 1

5 rows × 101 columns

Misclassifications#

If a patient is misclassified by the model for the hospital that they attended, this suggests that the data set used to train the model contains a similar patient who was treated differently.

Find misclassified patients in test cohort#

forest, threshold, _, _ = hospital2model[hospital]
y_test = test_patients_df['S2Thrombolysis'].values
X_test = test_patients_df.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).values

y_prob = forest.predict_proba(X_test)[:,1]
y_pred = np.array([1 if p >= threshold else 0 for p in y_prob])
misclassified = np.where(y_pred!=y_test)[0]
misclassified
array([ 8, 16, 17, 20, 21, 22, 33, 39])

Posterior probability distribution#

probs = y_prob[misclassified]

plt.hist(probs, 5, width=0.07)

plt.show()
../_images/within_hospital_variability_19_0.png

Similarity Metric#

When a decision tree is fit to the training data, each internal node in the tree will be split into two further nodes based on the feature that maximises the information gain (or minimises the entropy). Once a split results in two nodes that are pure (only containing samples from a single class) no further information gain is possible: the two nodes are leaves, representing the ends of two paths through the tree. As each patient can take only one path through the decision tree, and all patients start in the same node (the root node), we can use these paths to find the similarity between any pair of patients:
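The root-to-leaf path of each sample can be inspected directly with sklearn's `decision_path`. A minimal sketch on toy data (the data here is illustrative, not from the stroke cohort):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: a single feature that cleanly separates the two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Sparse (n_samples, n_nodes) indicator: entry is 1 where a sample
# visits a node on its root-to-leaf path
paths = tree.decision_path(X).toarray()

# Every sample passes through the root (node 0)
print(paths[:, 0])
```

Each row of `paths` marks exactly one root-to-leaf path, which is what the similarity measure below compares between pairs of patients.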

\(S(i,j) = \frac{P(i,j)}{\sqrt{P(i)P(j)}}\).

Here, \(S(i,j)\) represents the similarity between patient \(i\) and patient \(j\) as a function of each patient’s path, \(P\), through the decision tree, where \(P(i)\) and \(P(j)\) correspond to the path lengths of patients \(i\) and \(j\), and \(P(i,j)\) to the length of their shared path.

Path length is measured using the information gain (change in entropy) at each split. Consider patient \(z\), who takes a path through the decision tree which is comprised of a set of \(N\) nodes, each passed sequentially and with index \(n \in [1,N]\), where \(n=1\) represents the root node and \(n=N\) the leaf node. If each node with index \(n\) has \(X_n\) patients passing through it, the length of the path of patient \(z\), \(P(z)\), is given by:

\(P(z)= \sum_{n=1}^{N-1} X_{n+1} (H_n - H_{n+1} \frac{X_{n+1}}{X_n})\),

where \(H_n\) is the entropy at node \(n\) and the information gain at each split has been weighted by \(X_{n+1}\), the total number of patients that move from node \(n\) to node \(n+1\).
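As a sanity check, the path length can be evaluated by hand for a short hypothetical path (the entropies and patient counts below are assumed toy values, not taken from the data):

```python
import numpy as np

# Hypothetical 3-node path for one patient: entropy H_n and patient
# count X_n at the root (n=1), an internal node (n=2) and the leaf (n=3)
H = np.array([1.0, 0.8, 0.0])
X = np.array([100, 60, 40])

# P(z) = sum_{n=1}^{N-1} X_{n+1} * (H_n - H_{n+1} * X_{n+1} / X_n)
P_z = sum(X[n + 1] * (H[n] - H[n + 1] * X[n + 1] / X[n])
          for n in range(len(H) - 1))

print(round(P_z, 2))  # 60*(1.0 - 0.48) + 40*(0.8 - 0.0) → 63.2
```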

From the second equation it is clear that if patients \(i\) and \(j\) take exactly the same path through the decision tree their path lengths are equal: \(P(i) = P(j) = P(i,j)\). Substituting this into the first equation it is clear that if two patients take identical paths the similarity between them is one. Conversely, if patients \(i\) and \(j\) diverge at the root node (\(n=1\)), their shared path contains only this node: \(P(i,j) = 0\) and hence the similarity between them is equal to zero.

As a random forest is composed of many decision trees, for each pair of patients we take the similarity between them as the average over all trees in the forest. Using this measure of similarity, for each patient in each hospital we found the most similar patients that were treated differently.
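The two limiting cases, and the averaging over trees, can be sketched with assumed toy path lengths:

```python
import numpy as np

def similarity(p_i, p_j, p_ij):
    """S(i, j) = P(i, j) / sqrt(P(i) * P(j))."""
    return p_ij / np.sqrt(p_i * p_j)

# Identical paths: P(i) = P(j) = P(i, j), so S = 1
print(similarity(4.0, 4.0, 4.0))  # → 1.0

# Divergence at the root: shared path length is 0, so S = 0
print(similarity(4.0, 9.0, 0.0))  # → 0.0

# For a forest, average the per-tree similarities (3 hypothetical trees)
per_tree = np.array([0.9, 0.7, 0.8])
print(per_tree.mean())
```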

Functions#

Node weights#

def find_information_gain(X_train, y_train, estimator):
    
    """
    Function to find the information gain for each branch in a tree
    
    Input
    
    X_train: (numpy array) features used to train the DecisionTree classifier
    y_train: (numpy array) target used to train the DecisionTree classifier
    estimator: (sklearn model) trained DecisionTree classifier
    
    Output
    
    info_gain: (numpy array) information gain associated with move to child 
                             node (node i) from parent node.
    """
    
    
    # calculate the entropy of each node
    
    y_true = np.where(y_train==1)
    y_false = np.where(y_train==0)
    
    X_true = X_train[y_true]
    X_false = X_train[y_false]
    
    node_indicator_false = estimator.decision_path(X_false).toarray()
    node_indicator_true = estimator.decision_path(X_true).toarray()
    
    total_passing_true = node_indicator_true.sum(axis=0)
    total_passing_false = node_indicator_false.sum(axis=0)
    
    n_nodes = len(total_passing_true)
    
    entropies = np.zeros(n_nodes)
    
    left_child = estimator.tree_.children_left
    right_child = estimator.tree_.children_right
    
    for i in range(n_nodes):
        
        true = total_passing_true[i]
        false = total_passing_false[i]
        
        if true==0 and false==0:
            
            print('Node Not Passed')
            
            continue
            
        if true==0 or false==0:
            
            entropies[i] = 0
            
        else:
            
            entropies[i] = -(true/(true+false)) * np.log2(true/(true+false)) \
                            - (false/(true+false))*np.log2(false/(true+false))
        

            
    # use the entropy to calculate the information gain for each node
            
    info_gain = np.zeros(n_nodes)
    
    
    total_passing = estimator.decision_path(X_train).toarray().sum(axis=0)
    
    for i in range(1,n_nodes):
        

        # find the parent of node i: node i appears exactly once in
        # either the left-children or the right-children array
        loc_left = np.where(left_child==i)[0]
        loc_right = np.where(right_child==i)[0]
        
        if len(loc_left) == 1:
            
            parent = loc_left[0]
            
        elif len(loc_right) == 1:
            
            parent = loc_right[0]
            
        else:
            
            raise Exception("Can't find parent")
            
            
        info_gain[i] = entropies[parent] - \
                        total_passing[i]*entropies[i]/total_passing[parent]

    
        
    return info_gain
def find_node_weight(X_train, y_train, estimator):
    
    """
    Function to find the weight of a node. Weight is defined as the product
    of the number of data points passing the node (flow) and the information
    gain at the node.
    
    Input
    
    X_train: (numpy array) features used to train the DecisionTree classifier
    y_train: (numpy array) target used to train the DecisionTree classifier
    estimator: (sklearn model) trained DecisionTree classifier
    
    Output
    
    weight: (numpy array) weight (flows * info_gain) of each node
    
    """
    
    node_indicator = estimator.decision_path(X_train).toarray()
    
    flows = node_indicator.sum(axis=0)

    info_gain = find_information_gain(X_train, y_train, estimator)
    
    if len(flows) != len(info_gain):
        
        raise Exception("Length mismatch")
        
    return flows*info_gain
def find_forest_weight(X_train, y_train, forest):
    
    trees = forest.estimators_
    
    results = []
    
    for i,estimator in enumerate(trees):
        
        weight = find_node_weight(X_train, y_train, estimator)
        
        results.append(weight)
        
    return np.array(results)

Calculate Similarity#

def find_similar_patients_tree(patient, test_set, estimator, weights):
    
    """
    Function to calculate the similarity between a single patient and all other
    patients in a data set using a single DecisionTree classifier
    
    Input
    
    patient:    index of comparison patient in test_set
    test_set:   sub-set of n data points unseen by estimator. Must include 
                patient
    estimator:  trained sklearn DecisionTreeClassifier
    weights:    array of length n-nodes. 
    
    Output
    
    similarity: (n-dim array) index of patients and similarity
    
    """
    
    similarity = np.zeros(test_set.shape[0])
    
    # convert the sparse path indicator to a dense array once, outside the loop
    paths = estimator.decision_path(test_set).toarray()
    
    for n in range(test_set.shape[0]):
        
        sample_ids = [patient, n]
        
        patient_paths = paths[sample_ids]
    
        common_nodes = (patient_paths.sum(axis=0) ==
                    len(sample_ids))
        
        common_node_weights = weights[common_nodes]
        
        path_length = sum(common_node_weights)
        
        p0_path_length = sum(patient_paths[0]*weights)
        p1_path_length = sum(patient_paths[1]*weights)
        
        similar = path_length / np.sqrt(p0_path_length*p1_path_length)
        
        similarity[n] = similar
        
    return similarity
def find_similar_patients(patient, test_set, forest, weights):
    
    """
    Function to calculate the similarity between a single patient and all other
    patients in a data set using a RandomForest classifier
    
    Input
    
    patient:    index of comparison patient in test_set
    test_set:   sub-set of n data points unseen by estimator. Must include 
                patient
    forest:     trained sklearn RandomForestClassifier
    weights:    array of length n-nodes. 
    
    Output
    
    results: (n-dim array) similarity between patient and test_set for each tree
                           in the RandomForest classifier  
    """
    
    trees = forest.estimators_
    
    results = np.zeros((len(trees), test_set.shape[0]))
    
    for i,estimator in enumerate(trees):
        
        similarity = find_similar_patients_tree(patient, test_set, estimator, weights[i])
        
        results[i] = similarity
        
    return results

Example hospital#

forest, threshold, train_patients, train_outcomes = hospital2model[hospital]
weights = find_forest_weight(train_patients, train_outcomes, forest)
weights.shape
(100,)

Pairwise similarity#

test_patients = test_patients_df.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).values
test_outcomes = test_patients_df['S2Thrombolysis'].values
similarity = find_similar_patients(0, test_patients, forest, weights)
similarity.shape
(100, 42)

similarity contains one row for each decision tree in the random forest classifier. Each row contains the pairwise similarity between the chosen patient (index 0 above) and every patient in the set.

Note that the similarity between a patient and itself (in this case, element 0 of each row) is always equal to 1.

# average similarity across all trees
results = similarity.mean(axis=0)
results
array([1.        , 0.41878349, 0.41385524, 0.30315102, 0.28188906,
       0.31645625, 0.44306756, 0.17333223, 0.35602158, 0.46386653,
       0.2490115 , 0.33010815, 0.50437351, 0.12329451, 0.17538473,
       0.19784455, 0.35274581, 0.29956862, 0.29864365, 0.2882998 ,
       0.18215119, 0.39094488, 0.35923549, 0.39327842, 0.41762123,
       0.48577199, 0.11948557, 0.21997094, 0.40880923, 0.50791922,
       0.38785235, 0.30787927, 0.53550742, 0.42006776, 0.23841425,
       0.3628097 , 0.20345056, 0.33015945, 0.16756539, 0.24487966,
       0.16356513, 0.29071159])

Most similar patients with a different outcome#

Identify patients with a predicted outcome different to their true outcome#

Note that predict applies the forest’s default probability threshold of 0.5, rather than the hospital-specific threshold used for the misclassifications above, so the two sets of patients differ.

predicted_outcome = forest.predict(test_patients)

patient_index = np.where(predicted_outcome!=test_outcomes)[0]
patient_index
array([ 8, 17, 20, 29, 39])

Show actual outcomes for patients:

actual = [test_outcomes[i] for i in patient_index]

print ('Actual:', actual)
Actual: [0, 1, 1, 1, 1]

For each patient, find 5 most similar patients with a different outcome in training set#

%%time

most_similar = []

for i in patient_index:
    
    patient = test_patients[i]
    
    outcome = test_outcomes[i]
    
    opposite = np.where(train_outcomes != outcome)[0]
    

    X = np.vstack((patient, train_patients[opposite]))
    y = np.concatenate(([outcome], train_outcomes[opposite]))

    results = find_similar_patients(0, X, forest, weights)
            
    summed_results = results.sum(axis=0)[1:] #remove similarity between patient and itself
            
    five_indices = np.argpartition(summed_results, -5)[-5:]

    similar_patients = train_patients[opposite][five_indices]

    S = summed_results[five_indices]

    most_similar.append([S, summed_results, outcome, i, patient, similar_patients])
        
CPU times: user 7.32 s, sys: 0 ns, total: 7.32 s
Wall time: 7.32 s
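The top-5 selection above relies on np.argpartition, which finds the k largest elements without fully sorting the array. A small illustration on toy similarity scores:

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.1, 0.7, 0.5, 0.8])

# Indices of the 3 largest scores (not sorted among themselves)
top3 = np.argpartition(scores, -3)[-3:]

print(np.sort(scores[top3]))  # → [0.7 0.8 0.9]
```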

Get feature importances for hospital#

features = cohort.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).columns.values
importances = forest.feature_importances_

Example where thrombolysis was not given when expected#

patient_of_interest = 0

Create dataframe for patient#

S, _, _, _, patient, similar_patients = most_similar[patient_of_interest]
S_index = np.argsort(S)
S_index
array([0, 1, 3, 4, 2])
patient_df = pd.DataFrame(index = features)
patient_df['Importance'] = importances
patient_df['Patient'] = patient

for i,s in enumerate(S_index):
    
    patient_df[str(i+1)] = similar_patients[s]

Sort dataframe according to feature importance#

patient_df = patient_df.sort_values(by='Importance', ascending=False)
patient_df.drop(['Importance'], axis=1, inplace=True)
patient_df.to_csv('./output/similar_patients_treated_differently_1.csv')

patient_df
Patient 1 2 3 4 5
S2BrainImagingTime_min 23.0 19.0 11.0 33.0 30.0 12.0
S2NihssArrival 4.0 5.0 6.0 9.0 6.0 8.0
S1OnsetToArrival_min 83.0 149.0 134.0 121.0 79.0 82.0
S2StrokeType_Primary Intracerebral Haemorrhage 0.0 0.0 0.0 0.0 0.0 0.0
S2RankinBeforeStroke 0.0 0.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ...
AFAnticoagulentHeparin_Yes 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetInHospital_Yes 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetInHospital_No 1.0 1.0 1.0 1.0 1.0 1.0
S1Ethnicity_Mixed 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetTimeType_Not known 0.0 0.0 0.0 0.0 0.0 0.0

99 rows × 6 columns

Rescale#

As some features, such as onset-to-arrival time and age, have a much larger range than other features, we rescale them so that all features can be visualised on one axis.

def rescale(row, df, time=False, age=False):
    
    """
    Rescale a row of the dataframe: convert minutes to hours (time=True)
    or years to decades (age=True); otherwise return the values unchanged.
    """
    
    vals = df.iloc[row].values
    
    if time:
        
        rescaled = vals/60
        
    elif age:
        
        rescaled = vals/10
        
    else:
        
        rescaled = vals
    
    return rescaled
rescaled_results = patient_df.copy()

indx_list = rescaled_results.index.tolist()

for i in range(len(rescaled_results)):
    
    indx = rescaled_results.index[i]
    
    if indx in ['S1OnsetToArrival_min']:#, 'S2BrainImagingTime_min']:
        
        rescaled = rescale(i,rescaled_results, time=True)
        indx_list[i] = indx.split('_')[0] + '_hours'
        
    elif indx in ['S1AgeOnArrival']:
        
        rescaled = rescale(i,rescaled_results, age=True)
        indx_list[i] = indx + '_decades'
        
    else:
        rescaled = rescale(i,rescaled_results)
    
    rescaled_results.loc[indx]=rescaled
    
rescaled_results.index = indx_list

Plot#

rescaled_results.plot(kind='bar',  stacked=False, alpha=1, figsize=(8,8),
                     xlabel='Feature', ylabel='Feature value')  

plt.xlim(None,10.5)

plt.tight_layout()

plt.savefig('./figures/within_hospital_variability/5_similar_bar_patient_not_treated.jpg', dpi=300)
plt.show()
../_images/within_hospital_variability_64_0.png
fig,ax = plt.subplots(figsize=(8,8))

xvals = rescaled_results.index.values
pvals = rescaled_results['Patient']
mvals = [np.mean(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
lvals = [min(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
hvals = [max(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]

plt.plot(xvals, mvals, 'r*', ms=10, label='Mean')
plt.plot(xvals, lvals, 'b*', ms=10, label='Low')
plt.plot(xvals, hvals, 'y*', ms=10, label='High')
plt.plot(xvals, pvals, 'k*', ms=10, label='Patient')
plt.fill_between(xvals, lvals, hvals, color='g', alpha=0.5)
plt.xlim(-0.5,10.5)
plt.xticks(rotation=90)
plt.legend(loc='best')
ax.set_xlabel('Feature')
ax.set_ylabel('Feature value')
plt.tight_layout()

plt.savefig('./figures/within_hospital_variability/5_similar_range_patient_not_treated.jpg', dpi=300)

plt.show()
../_images/within_hospital_variability_65_0.png

Example where thrombolysis was given when not expected#

patient_of_interest = 1

S, _, _, _, patient, similar_patients = most_similar[patient_of_interest]
S_index = np.argsort(S)
S_index
array([0, 1, 2, 3, 4])
patient_df = pd.DataFrame(index = features)
patient_df['Importance'] = importances
patient_df['Patient'] = patient

for i,s in enumerate(S_index):
    patient_df[str(i+1)] = similar_patients[s]
    
patient_df = patient_df.sort_values(by='Importance', ascending=False)
patient_df.drop(['Importance'], axis=1, inplace=True)
patient_df.to_csv('./output/similar_patients_treated_differently_2.csv')

patient_df
Patient 1 2 3 4 5
S2BrainImagingTime_min 40.0 88.0 40.0 27.0 51.0 35.0
S2NihssArrival 9.0 11.0 10.0 6.0 6.0 6.0
S1OnsetToArrival_min 105.0 67.0 122.0 83.0 75.0 160.0
S2StrokeType_Primary Intracerebral Haemorrhage 0.0 0.0 0.0 0.0 0.0 0.0
S2RankinBeforeStroke 0.0 0.0 4.0 0.0 3.0 0.0
... ... ... ... ... ... ...
AFAnticoagulentHeparin_Yes 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetInHospital_Yes 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetInHospital_No 1.0 1.0 1.0 1.0 1.0 1.0
S1Ethnicity_Mixed 0.0 0.0 0.0 0.0 0.0 0.0
S1OnsetTimeType_Not known 0.0 0.0 0.0 0.0 0.0 0.0

99 rows × 6 columns

Rescale:

rescaled_results = patient_df.copy()

indx_list = rescaled_results.index.tolist()

for i in range(len(rescaled_results)):
    
    indx = rescaled_results.index[i]
    
    if indx in ['S1OnsetToArrival_min']:#, 'S2BrainImagingTime_min']:
        
        rescaled = rescale(i,rescaled_results, time=True)
        indx_list[i] = indx.split('_')[0] + '_hours'
        
    elif indx in ['S1AgeOnArrival']:
        
        rescaled = rescale(i,rescaled_results, age=True)
        indx_list[i] = indx + '_decades'
        
    else:
        rescaled = rescale(i,rescaled_results)
    
    rescaled_results.loc[indx]=rescaled
    
rescaled_results.index = indx_list

Plot

rescaled_results.plot(kind='bar',  stacked=False, alpha=1, figsize=(8,8),
                      xlabel='Feature', ylabel='Feature value')

plt.xlim(None,10.5)

plt.tight_layout()

plt.savefig('./figures/within_hospital_variability/5_similar_bar_patient_treated.jpg', dpi=300)
plt.show()
../_images/within_hospital_variability_72_0.png
fig,ax = plt.subplots(figsize=(8,8))

xvals = rescaled_results.index.values
pvals = rescaled_results['Patient']
mvals = [np.mean(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
lvals = [min(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
hvals = [max(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]

plt.plot(xvals, mvals, 'r*', ms=10, label='Mean')
plt.plot(xvals, lvals, 'b*', ms=10, label='Low')
plt.plot(xvals, hvals, 'y*', ms=10, label='High')
plt.plot(xvals, pvals, 'k*', ms=10, label='Patient')
plt.fill_between(xvals, lvals, hvals, color='g', alpha=0.5)
plt.xlim(-0.5,10.5)
plt.xticks(rotation=90)
plt.legend(loc='best')
ax.set_xlabel('Feature')
ax.set_ylabel('Feature value')

plt.tight_layout()

plt.savefig('./figures/within_hospital_variability/5_similar_range_patient_treated.jpg', dpi=300)

plt.show()
../_images/within_hospital_variability_73_0.png

Observations#

  • Misclassifications indicate that there is variability in clinical decision making within a hospital

  • We can use the structure of the Random Forest classifier to define a similarity metric

  • For patients that are incorrectly classified, we can find similar patients that were treated differently, further demonstrating variation in decision making within a hospital