Check correlation between features selected for the model#

Plain English summary#

We have simplified our model by selecting the 8 features that give the best model performance. We are using Shapley values to help explain how the model arrives at each prediction. Shapley values require the features to be independent of one another; any dependency makes it extremely tricky, and messy, to unpick the allocation to each feature. We test for independence by calculating the correlation between the 7 selected features (after removing stroke team). There are only very weak correlations between the selected features, with a feature explaining a maximum of 5% of the variability of another feature.
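The "5% of the variability" figure is the R-squared between a pair of features, which is the square of their Pearson correlation coefficient. A minimal sketch with made-up data, purely to illustrate that relationship:

import numpy as np

rng = np.random.default_rng(42)

# Two weakly correlated made-up features
x = rng.normal(size=10_000)
y = 0.22 * x + rng.normal(size=10_000)

# Pearson correlation coefficient, and its square (R-squared)
r = np.corrcoef(x, y)[0, 1]
print(f'r = {r:.3f}, r-squared = {r**2:.3f}')  # r-squared is close to 0.05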

Model and data#

Using the full dataset, calculate the correlation between each pair of these 7 features:

  • Arrival-to-scan time: Time from arrival at hospital to scan (mins)

  • Infarction: Stroke type (1 = infarction, 0 = haemorrhage)

  • Stroke severity: Stroke severity (NIHSS) on arrival

  • Precise onset time: Onset time type (1 = precise, 0 = best estimate)

  • Prior disability level: Disability level (modified Rankin Scale) before stroke

  • Use of AF anticoagulants: Use of atrial fibrillation anticoagulants (1 = Yes, 0 = No)

  • Onset-to-arrival time: Time from onset of stroke to arrival at hospital (mins)

The features were chosen sequentially, at each step adding the feature that gave the single best improvement in XGBoost model performance (measured by ROC AUC); the stroke team feature is excluded here, leaving 7 features.
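For reference, a simplified sketch of this kind of greedy forward selection (the actual selection is performed in a separate notebook; the model settings and the synthetic data below are illustrative assumptions, not the project's own):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Made-up stand-in data (the real notebook uses the stroke dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(10)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

selected, remaining = [], list(X.columns)
for _ in range(8):  # number of features to select
    best_auc, best_feature = 0, None
    # Try adding each remaining feature; keep the one giving best ROC AUC
    for feature in remaining:
        model = XGBClassifier()
        model.fit(X_train[selected + [feature]], y_train)
        y_prob = model.predict_proba(X_test[selected + [feature]])[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        if auc > best_auc:
            best_auc, best_feature = auc, feature
    selected.append(best_feature)
    remaining.remove(best_feature)
    print(f'Selected {best_feature}: ROC AUC = {best_auc:.3f}')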

Aim#

  • Check the correlation between the 7 features chosen by feature selection (after removing stroke team).

Observations#

There are only very weak correlations between the selected features, with no R-squared greater than 0.05.

Import libraries#

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn.preprocessing import StandardScaler
import json

Read in JSON file#

Contains a dictionary of plain English feature names for the 8 features selected in the model. Use these as the column titles in the DataFrame.

with open("./output/feature_name_dict.json") as json_file:
    feature_name_dict = json.load(json_file)
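The dictionary simply maps raw dataset column names to the plain English names used below. A purely hypothetical illustration of its shape (the real keys are the raw column names in the dataset):

# Hypothetical illustration only - the real keys are the raw column names
feature_name_dict_example = {
    'arrival_to_scan_mins': 'Arrival-to-scan time',
    'nihss_on_arrival': 'Stroke severity',
    'onset_time_precise': 'Precise onset time',
}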

Load data#

Combine all of the data (create a DataFrame that includes every instance from the train and test sets)

train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)

Load features to use (drop stroke team if present)#

# Read in the names of the selected features for the model
number_of_features_to_use = 8
key_features = pd.read_csv('./output/feature_selection.csv')
key_features = list(key_features['feature'])[:number_of_features_to_use]

# Drop stroke team if present
if 'StrokeTeam' in key_features:
    key_features.remove('StrokeTeam')

# Restrict data to chosen features
data = data[key_features]

Rename columns to plain English titles#

data.rename(columns=feature_name_dict, inplace=True)

Standardise data#

After standardising the data (scaling each feature to zero mean and unit variance), the covariance matrix of the scaled data is the correlation matrix of the original features.

sc = StandardScaler() 
sc.fit(data)
data_std = sc.transform(data)
data_std = pd.DataFrame(data_std, columns=list(data))
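As an optional sanity check (an extra step, not in the original analysis), the covariance of the standardised data should match pandas' own correlation calculation on the raw data:

# Optional check: covariance of standardised data equals the correlation
# matrix of the raw data (tiny differences arise because StandardScaler
# divides by N while pandas' cov uses N - 1)
assert np.allclose(data_std.cov(), data.corr(), atol=1e-3)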

Calculate correlation between features#

# Get covariance
cov = data_std.cov()

# Convert from wide to tall
cov = cov.melt(ignore_index=False)

# Remove self-correlation
mask = cov.index != cov['variable']
cov = cov[mask]

# Add absolute value
cov['abs_value'] = np.abs(cov['value'])

# Add R-squared
cov['r-squared'] = cov['value'] ** 2

# Sort by absolute covariance
cov.sort_values('abs_value', inplace=True, ascending=False)

# Round to four decimal places
cov = cov.round(4)

# Remove duplicate pairs of features
result = []
for index, values in cov.iterrows():
    combination = [index, values['variable']]
    combination.sort()
    string = combination[0] + "-" + combination[1]
    result.append(string)
cov['pair'] = result
cov.sort_values('pair', inplace=True)
cov.drop_duplicates(subset=['pair'], inplace=True)
cov.drop('pair', axis=1, inplace=True)

# Sort by r-squared
cov.sort_values('r-squared', ascending=False, inplace=True)
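For reference, the same set of unique feature pairs can be produced more compactly (an equivalent alternative, not the original code): keeping only the upper triangle of the covariance matrix removes the self-correlations and duplicate pairs in a single step.

# Equivalent alternative: keep only the upper triangle (above the
# diagonal) so each feature pair appears exactly once, then melt
corr = data_std.cov()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
alt = upper.melt(ignore_index=False).dropna(subset=['value'])
alt['r-squared'] = (alt['value'] ** 2).round(4)
alt = alt.sort_values('r-squared', ascending=False)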

Display the R-squared for each feature pair (sorted by R-squared):

cov[['variable', 'r-squared']]
Feature 1                  Feature 2                  R-squared
Stroke severity            Prior disability level        0.0454
Stroke severity            Infarction                    0.0386
Onset-to-arrival time      Precise onset time            0.0344
Stroke severity            Onset-to-arrival time         0.0186
Precise onset time         Prior disability level        0.0131
Prior disability level     Use of AF anticoagulants      0.0070
Use of AF anticoagulants   Infarction                    0.0033
Prior disability level     Onset-to-arrival time         0.0022
Arrival-to-scan time       Precise onset time            0.0021
Stroke severity            Use of AF anticoagulants      0.0019
Stroke severity            Arrival-to-scan time          0.0019
Precise onset time         Use of AF anticoagulants      0.0016
Infarction                 Onset-to-arrival time         0.0007
Precise onset time         Infarction                    0.0006
Arrival-to-scan time       Onset-to-arrival time         0.0004
Onset-to-arrival time      Use of AF anticoagulants      0.0001
Prior disability level     Arrival-to-scan time          0.0001
Infarction                 Prior disability level        0.0000
Use of AF anticoagulants   Arrival-to-scan time          0.0000
Stroke severity            Precise onset time            0.0000
Infarction                 Arrival-to-scan time          0.0000

Observations#

There are only very weak correlations between the selected features: the largest R-squared is 0.0454 (stroke severity with prior disability level), so no feature explains more than 5% of the variability of another.