Measuring the covariance/correlation between features#
In this notebook we measure the correlation between features.
Import libraries and data#
Data has been restricted to stroke teams with at least 300 admissions and at least 10 patients receiving thrombolysis over three years.
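For reference, here is a minimal sketch of how such a restriction could be applied with pandas, assuming a raw admissions DataFrame `raw` with `StrokeTeam` and `S2Thrombolysis` columns (this helper is illustrative only, not the project's actual preprocessing):

# Illustrative sketch only: keep teams meeting admission and
# thrombolysis thresholds (not the project's actual preprocessing)
def restrict_teams(raw, min_admissions=300, min_thrombolysis=10):
    admissions = raw.groupby('StrokeTeam').size()
    thrombolysed = raw.groupby('StrokeTeam')['S2Thrombolysis'].sum()
    keep = admissions[(admissions >= min_admissions) &
                      (thrombolysed >= min_thrombolysis)].index
    return raw[raw['StrokeTeam'].isin(keep)]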
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn.preprocessing import StandardScaler
# Import data (combine all data)
train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)
data.drop('StrokeTeam', axis=1, inplace=True)
data
S1AgeOnArrival | S1OnsetToArrival_min | S2RankinBeforeStroke | Loc | LocQuestions | LocCommands | BestGaze | Visual | FacialPalsy | MotorArmLeft | ... | S2NewAFDiagnosis_Yes | S2NewAFDiagnosis_missing | S2StrokeType_Infarction | S2StrokeType_Primary Intracerebral Haemorrhage | S2StrokeType_missing | S2TIAInLastMonth_No | S2TIAInLastMonth_No but | S2TIAInLastMonth_Yes | S2TIAInLastMonth_missing | S2Thrombolysis | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72.5 | 49.0 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 77.5 | 96.0 | 0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 77.5 | 77.0 | 0 | 0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
3 | 82.5 | 142.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
4 | 87.5 | 170.0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 4.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 57.5 | 99.0 | 0 | 1 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 0.0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
9996 | 87.5 | 159.0 | 3 | 0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
9997 | 67.5 | 142.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
9998 | 72.5 | 101.0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
9999 | 87.5 | 106.0 | 2 | 1 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
88928 rows × 100 columns
Scale data#
After standardising the data (scaling each feature to zero mean and unit variance), the covariance between features equals their Pearson correlation.
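Formally, if each feature is standardised as $z = (x - \bar{x})/s_x$, with $s_x$ the population standard deviation as used by `StandardScaler`, then the covariance of two standardised features reduces to the Pearson correlation of the originals:

$$\operatorname{cov}\left(z^{(x)}, z^{(y)}\right) = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y} = r_{xy}$$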
# Standardise data so each feature has zero mean and unit variance
sc = StandardScaler()
sc.fit(data)
data_std = sc.transform(data)
data_std = pd.DataFrame(data_std, columns=list(data))
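As a quick sanity check (a sketch added for illustration, not part of the original notebook): the covariance of the scaled data should match `data.corr()` up to a factor of n/(n-1), since pandas' `cov` uses the n-1 denominator while `StandardScaler` scales by the population standard deviation.

# Sketch: covariance of standardised data vs Pearson correlation of raw data.
# Agreement is up to a factor n/(n-1) from differing ddof conventions.
# Any constant column would give NaN correlations, so ignore NaNs here.
diff = (data_std.cov() - data.corr()).abs()
print(np.nanmax(diff.values))  # expect a value very close to zero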
Get covariance of scaled data (correlation)#
# Get covariance
cov = data_std.cov()
# Convert from wide to tall
cov = cov.melt(ignore_index=False)
# Remove self-correlation
mask = cov.index != cov['variable']
cov = cov[mask]
# Add absolute value
cov['abs_value'] = np.abs(cov['value'])
# Add R-squared
cov['r-squared'] = cov['value'] ** 2
# Sort by absolute covariance
cov.sort_values('abs_value', inplace=True, ascending=False)
# Round to four decimal places
cov = cov.round(4)
# Label rows where either feature in the pair is a 'missing' indicator
result = []
for index, values in cov.iterrows():
    if index[-7:] == 'missing' or values['variable'][-7:] == 'missing':
        result.append(True)
    else:
        result.append(False)
cov['missing'] = result
# Remove duplicate pairs of features
result = []
for index, values in cov.iterrows():
    combination = [index, values['variable']]
    combination.sort()
    string = combination[0] + "-" + combination[1]
    result.append(string)
cov['pair'] = result
cov.sort_values('pair', inplace=True)
cov.drop_duplicates(subset=['pair'], inplace=True)
# Sort by r-squared
cov.sort_values('r-squared', ascending=False, inplace=True)
cov
variable | value | abs_value | r-squared | missing | pair | |
---|---|---|---|---|---|---|
AFAnticoagulentHeparin_missing | AFAnticoagulentDOAC_missing | 1.0000 | 1.0000 | 1.0 | True | AFAnticoagulentDOAC_missing-AFAnticoagulentHep... |
Hypertension_Yes | Hypertension_No | -1.0000 | 1.0000 | 1.0 | False | Hypertension_No-Hypertension_Yes |
AFAnticoagulentHeparin_missing | AFAnticoagulentVitK_missing | 1.0000 | 1.0000 | 1.0 | True | AFAnticoagulentHeparin_missing-AFAnticoagulent... |
AFAnticoagulentDOAC_missing | AFAnticoagulentVitK_missing | 1.0000 | 1.0000 | 1.0 | True | AFAnticoagulentDOAC_missing-AFAnticoagulentVit... |
S1ArriveByAmbulance_Yes | S1ArriveByAmbulance_No | -1.0000 | 1.0000 | 1.0 | False | S1ArriveByAmbulance_No-S1ArriveByAmbulance_Yes |
... | ... | ... | ... | ... | ... | ... |
Hypertension_No | S1OnsetInHospital_Yes | 0.0000 | 0.0000 | 0.0 | False | Hypertension_No-S1OnsetInHospital_Yes |
S1OnsetTimeType_Not known | Hypertension_No | 0.0000 | 0.0000 | 0.0 | False | Hypertension_No-S1OnsetTimeType_Not known |
S2BrainImagingTime_min | Hypertension_No | 0.0066 | 0.0066 | 0.0 | False | Hypertension_No-S2BrainImagingTime_min |
Hypertension_No | S2NihssArrival | -0.0009 | 0.0009 | 0.0 | False | Hypertension_No-S2NihssArrival |
StrokeTIA_Yes | Visual | -0.0056 | 0.0056 | 0.0 | False | StrokeTIA_Yes-Visual |
4950 rows × 6 columns
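As an aside (a sketch, not part of the original notebook): the two `iterrows` loops above could be replaced by vectorized operations, which are considerably faster on the pair table.

# Sketch: vectorized equivalents of the two iterrows loops above.
# Flag pairs where either feature name ends in 'missing'
cov['missing'] = (cov.index.str.endswith('missing') |
                  cov['variable'].str.endswith('missing'))
# Build an order-independent pair key, then drop duplicate A-B / B-A rows
a = cov.index.to_numpy(dtype=str)
b = cov['variable'].to_numpy(dtype=str)
first = np.where(a < b, a, b)
second = np.where(a < b, b, a)
cov['pair'] = np.char.add(np.char.add(first, '-'), second)
cov = cov.drop_duplicates(subset=['pair'])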
# Save results
cov.to_csv('./output/feature_correlation.csv')
Show histogram and counts of correlations#
# Histogram of R-squared between feature pairs
fig = plt.figure(figsize=(6,5))
ax = fig.add_subplot(111)
bins = np.arange(0, 1.01, 0.01)
ax.hist(cov['r-squared'], bins=bins, rwidth=1)
ax.set_xlabel('R-squared')
ax.set_ylabel('Frequency')
plt.savefig('output/covariance.jpg', dpi=300)
plt.show()
Show the proportion of feature correlations (R-squared) in key bins.
bins = [0, 0.10, 0.25, 0.5, 0.75, 0.99, 1.1]
counts = np.histogram(cov['r-squared'], bins=bins)[0]
counts = counts / counts.sum()
labels = ['<0.10', '0.10 to 0.25', '0.25 to 0.50', '0.50 to 0.75', '0.75 to 0.99', '0.99 to 1']
counts_df = pd.DataFrame(index=labels)
counts_df['Proportion'] = counts
counts_df['Cumulative Proportion'] = counts.cumsum()
counts_df.index.name = 'R-squared'
counts_df
Proportion | Cumulative Proportion | |
---|---|---|
R-squared | ||
<0.10 | 0.960808 | 0.960808 |
0.10 to 0.25 | 0.015556 | 0.976364 |
0.25 to 0.50 | 0.010303 | 0.986667 |
0.50 to 0.75 | 0.006667 | 0.993333 |
0.75 to 0.99 | 0.003232 | 0.996566 |
0.99 to 1 | 0.003434 | 1.000000 |
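The same proportions could be produced more directly with `pd.cut` (a sketch, not in the original notebook). Note that `pd.cut` bins are right-inclusive by default while `np.histogram` bins are left-inclusive, so counts may differ for values lying exactly on a bin edge.

# Sketch: same binning via pd.cut (right-inclusive bins)
binned = pd.cut(cov['r-squared'], bins=bins, labels=labels, include_lowest=True)
binned.value_counts(normalize=True, sort=False)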
Observations#
Most feature pairs show weak correlation: 96% have an R-squared of less than 0.1.
Perfectly correlated feature pairs are present due to the dichotomised (one-hot) coding of some features.
Many highly correlated pairs are due to correlations between missing-data indicators and the values recorded when data is present. There are some ‘more interesting’ highly correlated pairs, such as:
Right leg and arm weakness are highly correlated, as are left leg and arm weakness.
Right leg weakness is highly correlated with problems in balance and language.