Patient level decisions at own hospital, and consensus decisions at other other hospitals
Contents
Patient level decisions at own hospital, and consensus decisions at other other hospitals#
This notebook creates two outputs:
Table of misclassified patients: patients with a predicted probability of being given thrombolysis (using a model of its own hospital) of <0.1 or >0.9, but where the hospital decided on the opposite.
Vignette data: data for each patient including:
Patient feature data
Decision at own hospital
Predicted decision and probability at own hospital
Predicted consensus decision at the 30 benchmark hospitals
Consensus of whether 90% of other hospitals would thrombolyse
Consensus of whether 80% of other hospitals would thrombolyse
Average probability of receiving thrombolysis at other hospitals
Import libraries#
import os
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# Load arrivals within 4 hours (load train and test cohort sets and combine)
train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)
Fit models and find confident misclassifications#
Models are first fitted using k-fold cross-validation. ‘Confident misclassifications’ are patients with a predicted probability of being given thrombolysis (using a model of its own hospital) of <0.1 or >0.9, but where the hospital decided on the opposite. These are taken from test sets only.
The k-fold splits are also used to predict and record the probability of being given thrombolysis using the test set patients only (which is all patients across the 5 splits). Whether a patient is given thrombolysis or not is predicted from the threshold that matches thrombolysis rate in the test set to the observed rate for that hospital.
After the k-fold validation prediction, a model is fitted to all data for a hospital. This is used for counter-factual predictions later (hat treatment patients from other hospitals would be predicted to receive at that hospital).
hospitals = list(set(data['StrokeTeam'].values))
def find_threshold(probabilities, true_rate):
"""
Find classification threshold to calibrate model
"""
index = (1-true_rate)*len(probabilities)
threshold = sorted(probabilities)[int(index)]
return threshold
%%time
hospital2model={}
misclassifications = pd.DataFrame(columns = data.columns.values)
for hospital in hospitals:
y_probs, test_idx = [],[]
patients = data.loc[data['StrokeTeam'] == hospital]
y = patients['S2Thrombolysis']
X = patients.drop(['StrokeTeam','S2Thrombolysis'], axis=1)
skf = StratifiedKFold(n_splits = 5)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1,
class_weight='balanced', random_state=0)
forest.fit(X_train, y_train)
y_prob = forest.predict_proba(X_test)[:,1]
y_probs.extend(y_prob)
test_idx.extend(test_index)
for i,p in enumerate(y_prob):
if p>=0.9 and y_test.values[i]==0:
misclassifications = misclassifications.append(
patients.iloc[test_index[i]], ignore_index=True)
if p<=0.1 and y_test.values[i]==1:
misclassifications = misclassifications.append(
patients.iloc[test_index[i]], ignore_index=True)
true_rate = sum(y)/len(y)
threshold= find_threshold(y_probs, true_rate)
y_preds = [1 if p >= threshold else 0 for p in y_probs]
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1,
class_weight='balanced', random_state=0)
forest.fit(X.values, y.values)
hospital2model[hospital]= [forest, threshold, y_probs, y_preds, test_idx]
CPU times: user 8min 48s, sys: 2min 55s, total: 11min 43s
Wall time: 5min 34s
misclassifications.head()
StrokeTeam | S1AgeOnArrival | S1OnsetToArrival_min | S2RankinBeforeStroke | Loc | LocQuestions | LocCommands | BestGaze | Visual | FacialPalsy | ... | S2NewAFDiagnosis_Yes | S2NewAFDiagnosis_missing | S2StrokeType_Infarction | S2StrokeType_Primary Intracerebral Haemorrhage | S2StrokeType_missing | S2TIAInLastMonth_No | S2TIAInLastMonth_No but | S2TIAInLastMonth_Yes | S2TIAInLastMonth_missing | S2Thrombolysis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HREGJ0143U | 97.5 | 3.0 | 3 | 0 | 2.0 | 0.0 | 0.0 | 3.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
1 | HREGJ0143U | 87.5 | 109.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
2 | HONZP0443O | 87.5 | 165.0 | 3 | 0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
3 | HONZP0443O | 52.5 | 124.0 | 0 | 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
4 | KECZG3964M | 87.5 | 131.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 rows × 101 columns
misclassifications.to_csv('./output/misclassifications.csv')
Combine consensus, benchmark and predicted outcomes for all patients#
Each patient has the predicted probability of receiving thrombolysis at all hospitals other than the one they actually attended.
Benchmark: The consensus decision of the 30 benchmark hospitals (thrombolysis predicted to be given if at least 15 of the 30 benchmark hospitals have predicted thrombolysis use for that hospital).
Majority_90: True if at least 90% of other hospitals are predicted to give thrombolysis.
Majority_80: True if at least 80% of other hospitals are predicted to give thrombolysis.
Predicted: Whether the patient’s own hospital is predicted to give thrombolysis (as recorded in the k-fold predictions above). Depends on both probability of receiving thrombolysis and the threshold of prediction in that k-fold split that would give the same net thrombolysis use as the hospital overall.
Probability: Whether the patient’s own hospital is predicted to give thrombolysis (as recorded in the k-fold predictions above).
AvgProbability: The average probability of other hospitals giving thrombolysis to a patient.
benchmark_hospitals = np.load(
'./models/benchmark_hospitals.npy', allow_pickle=True)
%%time
#create df for results
results = pd.DataFrame(columns = np.concatenate((data.columns.values,\
['Benchmark','Majority_90','Majority_80','Predicted'])))
#loop through patients from each hospital
for hospital in hospitals:
# df for each hospital's results
hospital_results = pd.DataFrame(columns = hospitals)
hospital_results_prob = pd.DataFrame(columns = hospitals)
patients = data.loc[data['StrokeTeam'] == hospital]
_, _, _, _, patient_order = hospital2model[hospital]
patients = patients.iloc[patient_order]
y = patients['S2Thrombolysis']
X = patients.drop(['StrokeTeam','S2Thrombolysis'], axis=1)
#put patients from hospital through all other hospital models
for hospital_train in hospitals:
forest, threshold, _, _, _ = hospital2model[hospital_train]
if hospital==hospital_train:
continue
y_prob = forest.predict_proba(X)[:,1]
y_pred = [1 if p >= threshold else 0 for p in y_prob]
#add results from hospital_train to hospital df
hospital_results[hospital_train] = y_pred
hospital_results_prob[hospital_train] = y_prob
patients_copy = patients.copy()
#add outcome from benchmark hospitals
benchmark_results = hospital_results[benchmark_hospitals].copy()
benchmark_results['sum'] = benchmark_results.sum(axis=1)
benchmark_results['percent'] = benchmark_results['sum']*100/len(benchmark_hospitals)
benchmark_outcome = [1 if p>=50 else 0 for p in benchmark_results['percent']]
patients_copy['Benchmark'] = benchmark_outcome
#majority outcomes
hospital_results['sum'] = hospital_results.sum(axis=1)
hospital_results['percent'] = hospital_results['sum']*100/len(hospitals)
#90% majority
majority_90 = [1 if p>=90 else 0 for p in hospital_results['percent']]
patients_copy['Majority_90'] = majority_90
#80% majority
majority_80 = [1 if p>=80 else 0 for p in hospital_results['percent']]
patients_copy['Majority_80'] = majority_80
#predicted outcomes
_, _, probability, predicted, _ = hospital2model[hospital]
patients_copy['Predicted'] = predicted
patients_copy['Probability'] = probability
#average probability
hospital_results_prob[hospital] = probability
patients_copy['AvgProbability'] = hospital_results_prob.mean(axis=1).round(2).values
#append to results df
results = results.append(patients_copy, ignore_index=True)
CPU times: user 24min 47s, sys: 6min 49s, total: 31min 37s
Wall time: 19min 50s
results
StrokeTeam | S1AgeOnArrival | S1OnsetToArrival_min | S2RankinBeforeStroke | Loc | LocQuestions | LocCommands | BestGaze | Visual | FacialPalsy | ... | S2TIAInLastMonth_No but | S2TIAInLastMonth_Yes | S2TIAInLastMonth_missing | S2Thrombolysis | Benchmark | Majority_90 | Majority_80 | Predicted | Probability | AvgProbability | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HREGJ0143U | 77.5 | 77.0 | 0 | 0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0.78 | 0.78 |
1 | HREGJ0143U | 72.5 | 85.0 | 0 | 0 | 1.0 | 2.0 | 0.0 | 2.0 | 3.0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0.91 | 0.72 |
2 | HREGJ0143U | 62.5 | 91.0 | 3 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.13 | 0.28 |
3 | HREGJ0143U | 62.5 | 85.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0.18 | 0.38 |
4 | HREGJ0143U | 52.5 | 59.0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.10 | 0.16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
88923 | NZECY6641V | 72.5 | 73.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.11 | 0.14 |
88924 | NZECY6641V | 77.5 | 121.0 | 4 | 0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.07 | 0.09 |
88925 | NZECY6641V | 87.5 | 123.0 | 3 | 0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | ... | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0.35 | 0.49 |
88926 | NZECY6641V | 87.5 | 134.0 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.03 |
88927 | NZECY6641V | 62.5 | 82.0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.32 | 0.22 |
88928 rows × 107 columns
results.to_csv('./output/vignette_data.csv')