# Find similar patients who are treated differently within the same hospital

## Aims

- Investigate the number of misclassifications within a hospital.
- Define a similarity measure using the decision tree structure of a random forest classifier.
- For patients with a predicted outcome different to their true outcome, find similar patients that were treated differently.

## Code

### Import libraries

```
import os
import json
import pickle as pkl
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
%matplotlib inline
import sklearn as sk
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings("ignore")
```

### Load pre-trained hospital models into dictionary `hospital2model`

- keys: hospitals
- values: `trained_classifier, threshold, patients, outcomes`

Note: `patients` is a numpy array.

```
with open('./models/trained_hospital_models_for_cohort.pkl', 'rb') as f:
    hospital2model = pkl.load(f)
```
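As a sketch of the expected structure, each entry of the dictionary can be unpacked into its four parts. The values below are placeholders, not a real fitted model, and the names carry an `example_`/short prefix so nothing in the notebook is overwritten:

```python
import numpy as np

# Illustrative layout only: each hospital key maps to a 4-tuple of
# (trained classifier, decision threshold, training patients, outcomes).
# The classifier and threshold here are made-up placeholders.
example_hospital2model = {
    'HOSP_A': ('classifier-placeholder', 0.4,
               np.zeros((3, 5)), np.array([0, 1, 0])),
}

clf, thr, pats, outs = example_hospital2model['HOSP_A']
print(thr, pats.shape)  # 0.4 (3, 5)
```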

### Select a hospital for investigation

```
hospital = list(hospital2model.keys())[1]
hospital
```

```
'TPFFP4410O'
```

### Load test cohort and extract hospital patients

```
cohort = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
test_patients_df = cohort.loc[cohort['StrokeTeam']==hospital]
```

```
test_patients_df.head(5)
```

| | StrokeTeam | S1AgeOnArrival | S1OnsetToArrival_min | S2RankinBeforeStroke | Loc | LocQuestions | LocCommands | BestGaze | Visual | FacialPalsy | ... | S2NewAFDiagnosis_Yes | S2NewAFDiagnosis_missing | S2StrokeType_Infarction | S2StrokeType_Primary Intracerebral Haemorrhage | S2StrokeType_missing | S2TIAInLastMonth_No | S2TIAInLastMonth_No but | S2TIAInLastMonth_Yes | S2TIAInLastMonth_missing | S2Thrombolysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | TPFFP4410O | 72.5 | 68.0 | 2 | 2 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 720 | TPFFP4410O | 87.5 | 108.0 | 0 | 3 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 759 | TPFFP4410O | 82.5 | 131.0 | 4 | 1 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 779 | TPFFP4410O | 57.5 | 145.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 898 | TPFFP4410O | 67.5 | 162.0 | 1 | 0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

5 rows × 101 columns

## Misclassifications

If a patient is misclassified by the model for the hospital that they attended, this suggests that the data set used to train the model contains a similar patient who was treated differently.

### Find misclassified patients in test cohort

```
forest, threshold, _, _ = hospital2model[hospital]
```

```
y_test = test_patients_df['S2Thrombolysis'].values
X_test = test_patients_df.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).values
y_prob = forest.predict_proba(X_test)[:,1]
y_pred = np.array([1 if p >= threshold else 0 for p in y_prob])
```
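The list comprehension above applies the hospital-specific threshold; the same step can be written in vectorised form. A sketch with made-up probabilities (`toy_` names to avoid clobbering the notebook's variables):

```python
import numpy as np

toy_prob = np.array([0.10, 0.35, 0.50, 0.80])  # made-up predicted probabilities
toy_threshold = 0.35                           # hypothetical decision threshold

# Vectorised equivalent of the list comprehension above
toy_pred = (toy_prob >= toy_threshold).astype(int)
print(toy_pred)  # [0 1 1 1]
```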

```
misclassified = np.where(y_pred!=y_test)[0]
misclassified
```

```
array([ 8, 16, 17, 20, 21, 22, 33, 39])
```

### Posterior probability distribution

```
probs = y_prob[misclassified]
plt.hist(probs, bins=5, width=0.07)
plt.xlabel('Predicted probability')
plt.ylabel('Count')
plt.show()
```

## Similarity Metric

When a decision tree is fit to the training data, each internal node in the tree will be split into two further nodes based on the feature that maximises the information gain (or minimises the entropy). Once a split results in two nodes that are pure (only containing samples from a single class) no further information gain is possible: the two nodes are leaves, representing the ends of two paths through the tree. As each patient can take only one path through the decision tree, and all patients start in the same node (the root node), we can use these paths to find the similarity between any pair of patients:

\(S(i,j) = \frac{P(i,j)}{\sqrt{P(i)P(j)}}\).

Here, \(S(i,j)\) represents the similarity between patient \(i\) and patient \(j\) as a function of each patient’s path, \(P\), through the decision tree, where \(P(i)\) corresponds to the path length of patient \(i\), \(P(j)\) to that of patient \(j\), and \(P(i,j)\) to the length of the path shared by \(i\) and \(j\).

Path length is measured using the information gain (change in entropy) at each split. Consider patient \(z\), who takes a path through the decision tree which is comprised of a set of \(N\) nodes, each passed sequentially and with index \(n \in [1,N]\), where \(n=1\) represents the root node and \(n=N\) the leaf node. If each node with index \(n\) has \(X_n\) patients passing through it, the length of the path of patient \(z\), \(P(z)\), is given by:

\(P(z)= \sum_{n=1}^{N-1} X_{n+1} (H_n - H_{n+1} \frac{X_{n+1}}{X_n})\),

where \(H_n\) is the entropy at node \(n\) and the information gain at each split has been weighted by \(X_{n+1}\), the total number of patients that move from node \(n\) to node \(n+1\).

From the second equation it is clear that if patients \(i\) and \(j\) take exactly the same path through the decision tree their path lengths are equal: \(P(i) = P(j) = P(i,j)\). Substituting this into the first equation it is clear that if two patients take identical paths the similarity between them is one. Conversely, if patients \(i\) and \(j\) diverge at the root node (\(n=1\)), their shared path contains only this node: \(P(i,j) = 0\) and hence the similarity between them is equal to zero.
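These limiting cases can be checked numerically on a toy tree. All counts and entropies below are made up for illustration: a root node with 8 patients and entropy 1.0 splits into node A (4 patients, entropy 0.5) and a pure leaf B (4 patients, entropy 0); node A then splits into pure leaves A1 and A2 (2 patients each):

```python
import numpy as np

def step_weight(X_parent, H_parent, X_child, H_child):
    """Weighted information gain for the move from a parent node to a child."""
    return X_child * (H_parent - H_child * X_child / X_parent)

# Hypothetical tree: root (8 patients, H=1.0) -> A (4, H=0.5) and leaf B (4, H=0);
# A -> pure leaves A1 and A2 (2 patients each, H=0)
w_root_A = step_weight(8, 1.0, 4, 0.5)  # 3.0
w_A_A1 = step_weight(4, 0.5, 2, 0.0)    # 1.0
w_A_A2 = step_weight(4, 0.5, 2, 0.0)    # 1.0
w_root_B = step_weight(8, 1.0, 4, 0.0)  # 4.0

P_i = w_root_A + w_A_A1           # patient i: root -> A -> A1
P_j = w_root_A + w_A_A2           # patient j: root -> A -> A2
P_k = w_root_B                    # patient k: root -> B
P_ij = w_root_A                   # shared path of i and j: root -> A

S_ij = P_ij / np.sqrt(P_i * P_j)  # 0.75: similar but not identical paths
S_ik = 0.0 / np.sqrt(P_i * P_k)   # 0.0: i and k diverge at the root
S_ii = P_i / np.sqrt(P_i * P_i)   # 1.0: identical paths
print(S_ij, S_ik, S_ii)
```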

As a random forest is composed of many decision trees, for each pair of patients we take the similarity between them as the average over all trees in the forest. Using this measure of similarity, for each patient in each hospital we find the most similar patients that were treated differently.

### Functions

#### Node weights

```
def find_information_gain(X_train, y_train, estimator):
    """
    Function to find the information gain for each branch in a tree.

    Input
        X_train: (numpy array) features used to train the DecisionTree classifier
        y_train: (numpy array) target used to train the DecisionTree classifier
        estimator: (sklearn model) trained DecisionTree classifier

    Output
        info_gain: (numpy array) information gain associated with the move to
            child node (node i) from its parent node
    """
    # calculate the entropy of each node
    y_true = np.where(y_train == 1)
    y_false = np.where(y_train == 0)
    X_true = X_train[y_true]
    X_false = X_train[y_false]
    node_indicator_false = estimator.decision_path(X_false).toarray()
    node_indicator_true = estimator.decision_path(X_true).toarray()
    total_passing_true = node_indicator_true.sum(axis=0)
    total_passing_false = node_indicator_false.sum(axis=0)
    n_nodes = len(total_passing_true)
    entropies = np.zeros(n_nodes)
    left_child = estimator.tree_.children_left
    right_child = estimator.tree_.children_right
    for i in range(n_nodes):
        true = total_passing_true[i]
        false = total_passing_false[i]
        if true == 0 and false == 0:
            print('Node Not Passed')
            continue
        if true == 0 or false == 0:
            entropies[i] = 0
        else:
            entropies[i] = -(true/(true+false)) * np.log2(true/(true+false)) \
                - (false/(true+false)) * np.log2(false/(true+false))

    # use the entropy to calculate the information gain for each node
    info_gain = np.zeros(n_nodes)
    total_passing = estimator.decision_path(X_train).toarray().sum(axis=0)
    for i in range(1, n_nodes):
        loc_left = np.where(left_child == i)[0]
        loc_right = np.where(right_child == i)[0]
        if len(loc_left) == 1:
            parent = loc_left[0]
        elif len(loc_right) == 1:
            parent = loc_right[0]
        else:
            raise Exception("Can't find parent")
        info_gain[i] = entropies[parent] - \
            total_passing[i] * entropies[i] / total_passing[parent]
    return info_gain
```
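`find_information_gain` relies on sklearn's `decision_path` method and the `tree_.children_left` / `tree_.children_right` arrays. A minimal sketch of what those structures look like, on a toy one-split tree built from made-up data (`toy_` names are not part of the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy, perfectly separable data: a single split at x <= 1.5 gives a 3-node tree
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0, 0, 1, 1])
toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# decision_path gives an (n_samples, n_nodes) indicator matrix:
# entry [s, n] is 1 if sample s passes through node n
toy_paths = toy_tree.decision_path(toy_X).toarray()
print(toy_paths.shape)  # (4, 3): the root plus two leaves

# children_left[n] / children_right[n] hold the child node ids of node n;
# leaves carry -1 in both arrays
print(toy_tree.tree_.children_left)   # [ 1 -1 -1]
print(toy_tree.tree_.children_right)  # [ 2 -1 -1]
```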

```
def find_node_weight(X_train, y_train, estimator):
    """
    Function to find the weight of each node. Weight is defined as the product
    of the number of data points passing through the node (flow) and the
    information gain at the node.

    Input
        X_train: (numpy array) features used to train the DecisionTree classifier
        y_train: (numpy array) target used to train the DecisionTree classifier
        estimator: (sklearn model) trained DecisionTree classifier

    Output
        weight: (numpy array) weight (flow * info_gain) of each node
    """
    node_indicator = estimator.decision_path(X_train).toarray()
    flows = node_indicator.sum(axis=0)
    info_gain = find_information_gain(X_train, y_train, estimator)
    if len(flows) != len(info_gain):
        raise Exception("Length mismatch")
    return flows * info_gain
```

```
def find_forest_weight(X_train, y_train, forest):
    """Find the node weights for every tree in a RandomForest classifier."""
    results = []
    for estimator in forest.estimators_:
        weight = find_node_weight(X_train, y_train, estimator)
        results.append(weight)
    # trees generally differ in node count, so the result is a ragged
    # (object) array with one weight vector per tree
    return np.array(results, dtype=object)
```
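Each tree in a forest generally has its own number of nodes, so the per-tree weight arrays form a ragged collection; on recent numpy versions this must be kept as a list or an object array rather than a regular 2-d array. A sketch with a small toy forest on made-up data (`toy_` names are not part of the notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up, noisy binary classification data
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 4))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)
toy_forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(toy_X, toy_y)

# One weight array per tree, each sized by that tree's own node count
per_tree = [np.zeros(t.tree_.node_count) for t in toy_forest.estimators_]
print(len(per_tree))               # 5
print([len(w) for w in per_tree])  # node counts typically differ between trees
```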

#### Calculate Similarity

```
def find_similar_patients_tree(patient, test_set, estimator, weights):
    """
    Function to calculate the similarity between a single patient and all other
    patients in a data set using a single DecisionTree classifier.

    Input
        patient: index of comparison patient in test_set
        test_set: sub-set of n data points unseen by estimator. Must include
            patient
        estimator: trained sklearn DecisionTreeClassifier
        weights: array of length n-nodes

    Output
        similarity: (n-dim array) similarity between patient and each member
            of test_set
    """
    similarity = np.zeros(test_set.shape[0])
    paths = estimator.decision_path(test_set).toarray()
    for n in range(test_set.shape[0]):
        sample_ids = [patient, n]
        patient_paths = paths[sample_ids]
        # nodes visited by both patients lie on the shared path
        common_nodes = patient_paths.sum(axis=0) == len(sample_ids)
        path_length = sum(weights[common_nodes])
        p0_path_length = sum(patient_paths[0] * weights)
        p1_path_length = sum(patient_paths[1] * weights)
        similarity[n] = path_length / np.sqrt(p0_path_length * p1_path_length)
    return similarity
```

```
def find_similar_patients(patient, test_set, forest, weights):
    """
    Function to calculate the similarity between a single patient and all other
    patients in a data set using a RandomForest classifier.

    Input
        patient: index of comparison patient in test_set
        test_set: sub-set of n data points unseen by estimator. Must include
            patient
        forest: trained sklearn RandomForestClassifier
        weights: per-tree node weight arrays (one entry per tree)

    Output
        results: (n_trees x n) array of similarities between patient and
            test_set, one row per tree in the RandomForest classifier
    """
    trees = forest.estimators_
    results = np.zeros((len(trees), test_set.shape[0]))
    for i, estimator in enumerate(trees):
        results[i] = find_similar_patients_tree(patient, test_set,
                                                estimator, weights[i])
    return results
```
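The core of the per-tree computation is the shared-node test: a node lies on the common path only if both indicator rows contain it. A minimal sketch with a hand-built indicator matrix and node weights (all values made up, `toy_` names not part of the notebook):

```python
import numpy as np

# Decision-path indicators for two patients over a 5-node tree (1 = visited)
toy_paths = np.array([[1, 1, 0, 1, 0],
                      [1, 1, 1, 0, 0]])
toy_weights = np.array([0.0, 3.0, 1.0, 1.0, 4.0])  # made-up node weights

common = toy_paths.sum(axis=0) == 2          # nodes visited by both patients
shared = toy_weights[common].sum()           # P(i, j) = 3.0
p0 = (toy_paths[0] * toy_weights).sum()      # P(i) = 4.0
p1 = (toy_paths[1] * toy_weights).sum()      # P(j) = 4.0
similarity = shared / np.sqrt(p0 * p1)       # 0.75
print(similarity)
```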

### Example hospital

```
forest, threshold, train_patients, train_outcomes = hospital2model[hospital]
```

```
weights = find_forest_weight(train_patients, train_outcomes, forest)
```

```
weights.shape
```

```
(100,)
```

### Pairwise similarity

```
test_patients = test_patients_df.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).values
test_outcomes = test_patients_df['S2Thrombolysis'].values
```

```
similarity = find_similar_patients(0, test_patients, forest, weights)
```

```
similarity.shape
```

```
(100, 42)
```

`similarity` contains one row for each decision tree in the random forest classifier. Each row contains the pairwise similarity between `patient` and all other patients. Note that the similarity between `patient` and itself (in this case, element `0` of each row) is always equal to 1.

```
# average similarity across all trees
results = similarity.mean(axis=0)
```

```
results
```

```
array([1. , 0.41878349, 0.41385524, 0.30315102, 0.28188906,
0.31645625, 0.44306756, 0.17333223, 0.35602158, 0.46386653,
0.2490115 , 0.33010815, 0.50437351, 0.12329451, 0.17538473,
0.19784455, 0.35274581, 0.29956862, 0.29864365, 0.2882998 ,
0.18215119, 0.39094488, 0.35923549, 0.39327842, 0.41762123,
0.48577199, 0.11948557, 0.21997094, 0.40880923, 0.50791922,
0.38785235, 0.30787927, 0.53550742, 0.42006776, 0.23841425,
0.3628097 , 0.20345056, 0.33015945, 0.16756539, 0.24487966,
0.16356513, 0.29071159])
```

## Most similar patients with a different outcome

### Identify patients with a predicted outcome different to their true outcome

```
predicted_outcome = forest.predict(test_patients)
patient_index = np.where(predicted_outcome!=test_outcomes)[0]
```

```
patient_index
```

```
array([ 8, 17, 20, 29, 39])
```

Note that `forest.predict` applies the default 0.5 probability threshold rather than the hospital-specific threshold, so this set of patients differs from the thresholded misclassifications found earlier.

Show the actual outcomes for these patients:

```
actual = [test_outcomes[i] for i in patient_index]
print('Actual:', actual)
```

```
Actual: [0, 1, 1, 1, 1]
```

### For each patient, find the 5 most similar patients with a different outcome in the training set

```
%%time
most_similar = []
for i in patient_index:
    patient = test_patients[i]
    outcome = test_outcomes[i]
    opposite = np.where(train_outcomes != outcome)[0]
    X = np.vstack((patient, train_patients[opposite]))
    y = np.concatenate(([outcome], train_outcomes[opposite]))
    results = find_similar_patients(0, X, forest, weights)
    # remove similarity between patient and itself
    summed_results = results.sum(axis=0)[1:]
    five_indices = np.argpartition(summed_results, -5)[-5:]
    similar_patients = train_patients[opposite][five_indices]
    S = summed_results[five_indices]
    most_similar.append([S, summed_results, outcome, i, patient, similar_patients])
```
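`np.argpartition` is used above to pick the five largest similarities without the cost of a full sort; note that the returned indices are themselves in arbitrary order. A sketch with made-up scores:

```python
import numpy as np

toy_scores = np.array([0.2, 0.9, 0.1, 0.7, 0.4, 0.8, 0.3])  # made-up similarities

# Indices of the 3 largest values; cheaper than a full sort, but the
# selected indices come back in arbitrary order
top3 = np.argpartition(toy_scores, -3)[-3:]
print(sorted(toy_scores[top3]))  # [0.7, 0.8, 0.9]
```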

```
CPU times: user 7.32 s, sys: 0 ns, total: 7.32 s
Wall time: 7.32 s
```

### Get feature importances for hospital

```
features = cohort.drop(['StrokeTeam', 'S2Thrombolysis'], axis=1).columns.values
importances = forest.feature_importances_
```

## Example where thrombolysis was not given when expected

```
patient_of_interest = 0
```

### Create dataframe for patient

```
S, _, _, _, patient, similar_patients = most_similar[patient_of_interest]
```

```
S_index = np.argsort(S)
S_index
```

```
array([0, 1, 3, 4, 2])
```

```
patient_df = pd.DataFrame(index=features)
patient_df['Importance'] = importances
patient_df['Patient'] = patient
for i, s in enumerate(S_index):
    patient_df[str(i+1)] = similar_patients[s]
```

### Sort dataframe according to feature importance

```
patient_df = patient_df.sort_values(by='Importance', ascending=False)
patient_df.drop(['Importance'], axis=1, inplace=True)
patient_df.to_csv('./output/similar_patients_treated_differently_1.csv')
patient_df
```

| | Patient | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| S2BrainImagingTime_min | 23.0 | 19.0 | 11.0 | 33.0 | 30.0 | 12.0 |
| S2NihssArrival | 4.0 | 5.0 | 6.0 | 9.0 | 6.0 | 8.0 |
| S1OnsetToArrival_min | 83.0 | 149.0 | 134.0 | 121.0 | 79.0 | 82.0 |
| S2StrokeType_Primary Intracerebral Haemorrhage | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S2RankinBeforeStroke | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AFAnticoagulentHeparin_Yes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetInHospital_Yes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetInHospital_No | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| S1Ethnicity_Mixed | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetTimeType_Not known | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

99 rows × 6 columns

### Rescale

As some features, such as onset-to-arrival time and age, have a much larger range than the others, we rescale them so that all features can be visualised on a single axis.

```
def rescale(row, df, time=False, age=False):
    """Rescale a row of the dataframe so all features fit on one axis."""
    vals = df.iloc[row].values
    if time:
        # convert minutes to hours
        return vals / 60
    elif age:
        # convert years to decades
        return vals / 10
    return vals
```

```
rescaled_results = patient_df.copy()
indx_list = rescaled_results.index.tolist()
for i in range(len(rescaled_results)):
    indx = rescaled_results.index[i]
    if indx in ['S1OnsetToArrival_min']:
        rescaled = rescale(i, rescaled_results, time=True)
        indx_list[i] = indx.split('_')[0] + '_hours'
    elif indx in ['S1AgeOnArrival']:
        rescaled = rescale(i, rescaled_results, age=True)
        indx_list[i] = indx + '_decades'
    else:
        rescaled = rescale(i, rescaled_results)
    rescaled_results.loc[indx] = rescaled
rescaled_results.index = indx_list
```

### Plot

```
rescaled_results.plot(kind='bar', stacked=False, alpha=1, figsize=(8,8),
xlabel='Feature', ylabel='Feature value')
plt.xlim(None,10.5)
plt.tight_layout()
plt.savefig('./figures/within_hospital_variability/5_similar_bar_patient_not_treated.jpg', dpi=300)
plt.show()
```

```
fig,ax = plt.subplots(figsize=(8,8))
xvals = rescaled_results.index.values
pvals = rescaled_results['Patient']
mvals = [np.mean(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
lvals = [min(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
hvals = [max(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
plt.plot(xvals, mvals, 'r*', ms=10, label='Mean')
plt.plot(xvals, lvals, 'b*', ms=10, label='Low')
plt.plot(xvals, hvals, 'y*', ms=10, label='High')
plt.plot(xvals, pvals, 'k*', ms=10, label='Patient')
plt.fill_between(xvals, lvals, hvals, color='g', alpha=0.5)
plt.xlim(-0.5,10.5)
plt.xticks(rotation=90)
plt.legend(loc='best')
ax.set_xlabel('Feature')
ax.set_ylabel('Feature value')
plt.tight_layout()
plt.savefig('./figures/within_hospital_variability/5_similar_range_patient_not_treated.jpg', dpi=300)
plt.show()
```

## Example where thrombolysis was given when not expected

```
patient_of_interest = 1
S, _, _, _, patient, similar_patients = most_similar[patient_of_interest]
S_index = np.argsort(S)
S_index
```

```
array([0, 1, 2, 3, 4])
```

```
patient_df = pd.DataFrame(index=features)
patient_df['Importance'] = importances
patient_df['Patient'] = patient
for i, s in enumerate(S_index):
    patient_df[str(i+1)] = similar_patients[s]
patient_df = patient_df.sort_values(by='Importance', ascending=False)
patient_df.drop(['Importance'], axis=1, inplace=True)
patient_df.to_csv('./output/similar_patients_treated_differently_2.csv')
patient_df
```

| | Patient | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| S2BrainImagingTime_min | 40.0 | 88.0 | 40.0 | 27.0 | 51.0 | 35.0 |
| S2NihssArrival | 9.0 | 11.0 | 10.0 | 6.0 | 6.0 | 6.0 |
| S1OnsetToArrival_min | 105.0 | 67.0 | 122.0 | 83.0 | 75.0 | 160.0 |
| S2StrokeType_Primary Intracerebral Haemorrhage | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S2RankinBeforeStroke | 0.0 | 0.0 | 4.0 | 0.0 | 3.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| AFAnticoagulentHeparin_Yes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetInHospital_Yes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetInHospital_No | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| S1Ethnicity_Mixed | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S1OnsetTimeType_Not known | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

99 rows × 6 columns

Rescale:

```
rescaled_results = patient_df.copy()
indx_list = rescaled_results.index.tolist()
for i in range(len(rescaled_results)):
    indx = rescaled_results.index[i]
    if indx in ['S1OnsetToArrival_min']:
        rescaled = rescale(i, rescaled_results, time=True)
        indx_list[i] = indx.split('_')[0] + '_hours'
    elif indx in ['S1AgeOnArrival']:
        rescaled = rescale(i, rescaled_results, age=True)
        indx_list[i] = indx + '_decades'
    else:
        rescaled = rescale(i, rescaled_results)
    rescaled_results.loc[indx] = rescaled
rescaled_results.index = indx_list
```

Plot:

```
rescaled_results.plot(kind='bar', stacked=False, alpha=1, figsize=(8,8),
xlabel='Feature', ylabel='Feature value')
plt.xlim(None,10.5)
plt.tight_layout()
plt.savefig('./figures/within_hospital_variability/5_similar_bar_patient_treated.jpg', dpi=300)
plt.show()
```

```
fig,ax = plt.subplots(figsize=(8,8))
xvals = rescaled_results.index.values
pvals = rescaled_results['Patient']
mvals = [np.mean(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
lvals = [min(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
hvals = [max(rescaled_results.loc[feature].values[1:]) for feature in rescaled_results.index]
plt.plot(xvals, mvals, 'r*', ms=10, label='Mean')
plt.plot(xvals, lvals, 'b*', ms=10, label='Low')
plt.plot(xvals, hvals, 'y*', ms=10, label='High')
plt.plot(xvals, pvals, 'k*', ms=10, label='Patient')
plt.fill_between(xvals, lvals, hvals, color='g', alpha=0.5)
plt.xlim(-0.5,10.5)
plt.xticks(rotation=90)
plt.legend(loc='best')
ax.set_xlabel('Feature')
ax.set_ylabel('Feature value')
plt.tight_layout()
plt.savefig('./figures/within_hospital_variability/5_similar_range_patient_treated.jpg', dpi=300)
plt.show()
```

## Observations

- Misclassifications indicate that there is variability in clinical decision making within a hospital.
- We can use the structure of the random forest classifier to define a similarity metric.
- For patients that are incorrectly classified, we can find similar patients that were treated differently, further demonstrating variation in decision making within a hospital.