Check correlation between features selected for the model#
Plain English summary#
We have simplified our model by selecting the 10 features that give the best model performance.
We found that a model with eight features provides 99% of the accuracy obtained when all 84 features are used. However, we extended our feature inclusion to the top 10 features after observing that clinicians, when discussing whether a particular patient was suitable to receive thrombolysis, would often discuss the patient's age. Patient age was the 10th feature to be selected by this process, so we extended the selected feature list to include the 9th and 10th selected features: onset during sleep and patient age. This model provided >99% of the accuracy obtained when all 84 features are used.
We are using Shapley values to help explain how the model arrives at each prediction. Shapley values require the features to be independent of each other; any dependency makes it extremely tricky, and messy, to unpick the contribution allocated to each feature. Here we test for independence by calculating the correlation between the 9 selected features (after removing stroke team).
These nine features are largely independent of each other. There are only very weak correlations between the selected features, with no feature explaining more than 15% of the variability of another feature, and all but two feature pairings (age with prior disability, and onset during sleep with precise onset time) explaining less than 5% of the variability of another feature.
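To make these R-squared thresholds concrete, here is a small illustration with made-up data: a Pearson correlation of about 0.39 corresponds to one feature explaining roughly 15% of the variance of another (R-squared = r²).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# Construct y so that x explains ~15% of its variance (r = sqrt(0.15))
r = np.sqrt(0.15)
y = r * x + np.sqrt(1 - r**2) * rng.normal(size=10_000)

r_observed = np.corrcoef(x, y)[0, 1]
print(round(r_observed**2, 2))  # ≈ 0.15
```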
Data and analysis#
Using the full dataset, calculate the correlation between each pair of these 9 features:
Arrival-to-scan time: Time from arrival at hospital to scan (mins)
Infarction: Stroke type (1 = infarction, 0 = haemorrhage)
Stroke severity: Stroke severity (NIHSS) on arrival
Precise onset time: Onset time type (1 = precise, 0 = best estimate)
Prior disability level: Disability level (modified Rankin Scale) before stroke
Use of AF anticoagulants: Use of atrial fibrillation anticoagulant (1 = Yes, 0 = No)
Onset-to-arrival time: Time from onset of stroke to arrival at hospital (mins)
Onset during sleep: Did stroke occur in sleep?
Age: Age (as middle of 5 year age bands)
The 9 features were chosen sequentially, each being the feature giving the single best improvement in XGBoost model performance (measured by ROC AUC), excluding the stroke team feature.
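The greedy forward-selection loop described above can be sketched as follows. This is an illustrative sketch only: it uses synthetic data and a logistic regression stand-in for the XGBoost model, but the selection logic (repeatedly adding the feature that most improves ROC AUC) is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (the real notebook uses the stroke cohort and XGBoost)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

selected = []                    # indices of chosen features, in selection order
remaining = list(range(X.shape[1]))

for _ in range(4):               # pick the top 4 features for this sketch
    best_auc, best_feature = -np.inf, None
    for f in remaining:
        trial = selected + [f]
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, trial], y_train)
        y_prob = model.predict_proba(X_test[:, trial])[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        if auc > best_auc:
            best_auc, best_feature = auc, f
    selected.append(best_feature)
    remaining.remove(best_feature)

print(selected)                  # feature indices in order of selection
```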
Aim#
Calculate the correlation between the 9 features selected by feature selection (after removing stroke team).
Observations#
There are only very weak correlations between the selected features, with no R-squared greater than 0.15, and all but two lower than 0.05.
Load packages#
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn.preprocessing import StandardScaler
import json
Set filenames#
# Set up strings (describing the model) to use in filenames
number_of_features_to_use = 10
model_text = f'{number_of_features_to_use}_features'
notebook = '07'
Read in JSON file#
Contains a dictionary for plain English feature names for the 10 features selected in the model. Use these as the column titles in the DataFrame.
with open("./output/01_feature_name_dict.json") as json_file:
    feature_name_dict = json.load(json_file)
Load data#
Combine all of the data (create a dataframe that includes every instance)
train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)
Load features to use (drop stroke team if present)#
# Read in the names of the selected features for the model
key_features = pd.read_csv('./output/01_feature_selection.csv')
key_features = list(key_features['feature'])[:number_of_features_to_use]
# Drop stroke team if present
if 'StrokeTeam' in key_features:
    key_features.remove('StrokeTeam')
# Restrict data to chosen features
data = data[key_features]
# Rename columns to plain English titles
data.rename(columns=feature_name_dict, inplace=True)
Standardise data#
After standardising the data (so each feature has zero mean and unit variance), the covariance between features equals their correlation.
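As a quick sanity check of this claim (with made-up data), the covariance matrix of standardised columns matches the Pearson correlation matrix of the raw columns (to within the small ddof difference between `StandardScaler` and `DataFrame.cov`):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = pd.DataFrame({'a': rng.normal(size=200)})
raw['b'] = raw['a'] * 0.5 + rng.normal(size=200)

# Standardise to zero mean, unit variance
std = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# Covariance of standardised data ≈ Pearson correlation of raw data
print(np.allclose(std.cov(), raw.corr(), atol=1e-2))
```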
sc = StandardScaler()
sc.fit(data)
data_std = sc.transform(data)
data_std = pd.DataFrame(data_std, columns=list(data))
Calculate correlation between features#
# Get covariance
cov = data_std.cov()
# Convert from wide to tall
cov = cov.melt(ignore_index=False)
# Remove self-correlation
mask = cov.index != cov['variable']
cov = cov[mask]
# Add absolute value
cov['abs_value'] = np.abs(cov['value'])
# Add R-squared
cov['r-squared'] = cov['value'] ** 2
# Sort by absolute covariance
cov.sort_values('abs_value', inplace=True, ascending=False)
# Round to four decimal places
cov = cov.round(4)
# Remove duplicate pairs of features
result = []
for index, values in cov.iterrows():
    combination = [index, values['variable']]
    combination.sort()
    string = combination[0] + "-" + combination[1]
    result.append(string)
cov['pair'] = result
cov.sort_values('pair', inplace=True)
cov.drop_duplicates(subset=['pair'], inplace=True)
cov.drop('pair', axis=1, inplace=True)
# Sort by r-squared
cov.sort_values('r-squared', ascending=False, inplace=True)
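An equivalent, more compact route (shown here with toy data) uses `DataFrame.corr()` directly and an upper-triangle mask, which removes self-pairs and duplicate pairs in one step rather than by building pair strings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['x', 'y', 'z'])

corr = df.corr()
# Keep only the strictly upper triangle: drops self- and duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = (upper.stack()
              .rename('value')
              .reset_index()
              .assign(r_squared=lambda d: d['value'] ** 2)
              .sort_values('r_squared', ascending=False))
print(pairs)
```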
Display R-squared for each feature pair, sorted from highest to lowest:
cov[['variable', 'r-squared']]
feature | variable | r-squared |
---|---|---|
Age | Prior disability level | 0.1462 |
Onset during sleep | Precise onset time | 0.0784 |
Stroke severity | Prior disability level | 0.0454 |
Stroke severity | Infarction | 0.0386 |
Precise onset time | Onset-to-arrival time | 0.0344 |
Stroke severity | Age | 0.0268 |
Age | Use of AF anticoagulants | 0.0207 |
Stroke severity | Onset-to-arrival time | 0.0186 |
Precise onset time | Prior disability level | 0.0131 |
Age | Precise onset time | 0.0090 |
Prior disability level | Use of AF anticoagulants | 0.0070 |
Onset during sleep | Onset-to-arrival time | 0.0043 |
Onset-to-arrival time | Age | 0.0038 |
Use of AF anticoagulants | Infarction | 0.0033 |
Prior disability level | Onset-to-arrival time | 0.0022 |
Precise onset time | Arrival-to-scan time | 0.0021 |
Use of AF anticoagulants | Stroke severity | 0.0019 |
Arrival-to-scan time | Stroke severity | 0.0019 |
Precise onset time | Use of AF anticoagulants | 0.0016 |
Stroke severity | Onset during sleep | 0.0011 |
Infarction | Onset-to-arrival time | 0.0007 |
Infarction | Onset during sleep | 0.0007 |
Infarction | Precise onset time | 0.0006 |
Onset-to-arrival time | Arrival-to-scan time | 0.0004 |
Arrival-to-scan time | Prior disability level | 0.0001 |
Onset-to-arrival time | Use of AF anticoagulants | 0.0001 |
Stroke severity | Precise onset time | 0.0000 |
Arrival-to-scan time | Age | 0.0000 |
Use of AF anticoagulants | Onset during sleep | 0.0000 |
Prior disability level | Onset during sleep | 0.0000 |
Infarction | Age | 0.0000 |
Use of AF anticoagulants | Arrival-to-scan time | 0.0000 |
Onset during sleep | Arrival-to-scan time | 0.0000 |
Arrival-to-scan time | Infarction | 0.0000 |
Age | Onset during sleep | 0.0000 |
Prior disability level | Infarction | 0.0000 |
Observations#
There are only very weak correlations between the selected features, with no R-squared greater than 0.15, and all but two lower than 0.05.