Measuring the covariance/correlation between features#

In this notebook we measure the correlation between features.

Import libraries and data#

Data has been restricted to stroke teams with at least 300 admissions, with at least 10 patients receiving thrombolysis, over three years.

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
from sklearn.preprocessing import StandardScaler

# Import data (combine all data)
train = pd.read_csv('../data/10k_training_test/cohort_10000_train.csv')
test = pd.read_csv('../data/10k_training_test/cohort_10000_test.csv')
data = pd.concat([train, test], axis=0)
data.drop('StrokeTeam', axis=1, inplace=True)
data
S1AgeOnArrival S1OnsetToArrival_min S2RankinBeforeStroke Loc LocQuestions LocCommands BestGaze Visual FacialPalsy MotorArmLeft ... S2NewAFDiagnosis_Yes S2NewAFDiagnosis_missing S2StrokeType_Infarction S2StrokeType_Primary Intracerebral Haemorrhage S2StrokeType_missing S2TIAInLastMonth_No S2TIAInLastMonth_No but S2TIAInLastMonth_Yes S2TIAInLastMonth_missing S2Thrombolysis
0 72.5 49.0 1 0 0.0 0.0 0.0 0.0 3.0 4.0 ... 0 1 0 1 0 0 0 0 1 0
1 77.5 96.0 0 0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 1 0 0 0 0 0 1 0
2 77.5 77.0 0 0 2.0 1.0 1.0 2.0 1.0 0.0 ... 0 1 1 0 0 0 0 0 1 1
3 82.5 142.0 0 0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0 1 1 0 0 0 0 0 1 1
4 87.5 170.0 0 0 0.0 0.0 1.0 1.0 2.0 4.0 ... 0 0 1 0 0 0 0 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 57.5 99.0 0 1 2.0 2.0 1.0 2.0 2.0 0.0 ... 0 1 0 1 0 0 0 0 1 0
9996 87.5 159.0 3 0 2.0 2.0 0.0 0.0 0.0 0.0 ... 0 1 1 0 0 0 0 0 1 1
9997 67.5 142.0 0 0 0.0 0.0 0.0 2.0 0.0 0.0 ... 0 0 1 0 0 0 0 0 1 0
9998 72.5 101.0 0 0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0 1 1 0 0 0 0 0 1 0
9999 87.5 106.0 2 1 1.0 1.0 0.0 0.0 1.0 0.0 ... 0 0 0 1 0 0 0 0 1 0

88928 rows × 100 columns

Scale data#

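After standardising each feature to zero mean and unit variance, the covariance between two features equals their Pearson correlation. StandardScaler divides by the population standard deviation, whereas DataFrame.cov() uses the sample estimate, so the reported values differ from the exact correlation by a factor of n/(n-1); this is negligible for the number of rows used here.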

sc = StandardScaler()
sc.fit(data)
data_std = sc.transform(data)
data_std = pd.DataFrame(data_std, columns=list(data))
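
The relationship between the covariance of scaled data and the correlation can be checked on a small synthetic example. The sketch below is illustrative only: the `demo` data frame and its columns are hypothetical, and the standardisation is done by hand with `ddof=0`, which mirrors what StandardScaler does.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical; not the stroke data set)
rng = np.random.default_rng(42)
x = rng.normal(size=500)
demo = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(size=500),
    'c': rng.normal(size=500)})

# Standardise with the population standard deviation (ddof=0),
# as StandardScaler does, then take the covariance
demo_std = (demo - demo.mean()) / demo.std(ddof=0)
cov_of_scaled = demo_std.cov()

# Pearson correlation computed directly on the unscaled data
corr = demo.corr()

# DataFrame.cov() uses the sample estimate (ddof=1), so the covariance
# of the scaled data is the correlation multiplied by n / (n - 1)
n = len(demo)
max_diff = np.abs(cov_of_scaled * (n - 1) / n - corr).to_numpy().max()
print(max_diff)
```

After correcting for the n/(n-1) factor, the largest difference between the two matrices should be at the level of floating-point rounding.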

Get covariance of scaled data (correlation)#

# Get covariance
cov = data_std.cov()

# Convert from wide to tall
cov = cov.melt(ignore_index=False)

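# Remove self-correlation (copy so that new columns are added to an
# independent DataFrame rather than a slice of the original)
mask = cov.index != cov['variable']
cov = cov[mask].copy()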

# Add absolute value
cov['abs_value'] = np.abs(cov['value'])

# Add R-squared
cov['r-squared'] = cov['value'] ** 2

# Sort by absolute covariance
cov.sort_values('abs_value', inplace=True, ascending=False)

# Round to four decimal places
cov = cov.round(4)

# Label rows where either feature in the pair is a 'missing' indicator
index_missing = cov.index.str.endswith('missing')
variable_missing = cov['variable'].str.endswith('missing').to_numpy()
cov['missing'] = index_missing | variable_missing

# Remove duplicate pairs of features
result = []
for index, values in cov.iterrows():
    combination = [index, values['variable']]
    combination.sort()
    string = combination[0] + "-" + combination[1]
    result.append(string)
cov['pair'] = result
cov.sort_values('pair', inplace=True)
cov.drop_duplicates(subset=['pair'], inplace=True)

# Sort by r-squared
cov.sort_values('r-squared', ascending=False, inplace=True)
cov
variable value abs_value r-squared missing pair
AFAnticoagulentHeparin_missing AFAnticoagulentDOAC_missing 1.0000 1.0000 1.0 True AFAnticoagulentDOAC_missing-AFAnticoagulentHep...
Hypertension_Yes Hypertension_No -1.0000 1.0000 1.0 False Hypertension_No-Hypertension_Yes
AFAnticoagulentHeparin_missing AFAnticoagulentVitK_missing 1.0000 1.0000 1.0 True AFAnticoagulentHeparin_missing-AFAnticoagulent...
AFAnticoagulentDOAC_missing AFAnticoagulentVitK_missing 1.0000 1.0000 1.0 True AFAnticoagulentDOAC_missing-AFAnticoagulentVit...
S1ArriveByAmbulance_Yes S1ArriveByAmbulance_No -1.0000 1.0000 1.0 False S1ArriveByAmbulance_No-S1ArriveByAmbulance_Yes
... ... ... ... ... ... ...
Hypertension_No S1OnsetInHospital_Yes 0.0000 0.0000 0.0 False Hypertension_No-S1OnsetInHospital_Yes
S1OnsetTimeType_Not known Hypertension_No 0.0000 0.0000 0.0 False Hypertension_No-S1OnsetTimeType_Not known
S2BrainImagingTime_min Hypertension_No 0.0066 0.0066 0.0 False Hypertension_No-S2BrainImagingTime_min
Hypertension_No S2NihssArrival -0.0009 0.0009 0.0 False Hypertension_No-S2NihssArrival
StrokeTIA_Yes Visual -0.0056 0.0056 0.0 False StrokeTIA_Yes-Visual

4950 rows × 6 columns
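
As an alternative to melting the full matrix and then dropping self-correlations and mirrored pairs, the unique pairs can be taken directly from the upper triangle of the correlation matrix. This is a minimal sketch on hypothetical random data, not the method used above; the column names `feature_1` and `feature_2` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical data set (not the stroke data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list('abcd'))
corr = demo.corr()

# Keep only entries strictly above the diagonal: this removes
# self-correlations and the mirrored duplicate of each pair
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().reset_index()
pairs.columns = ['feature_1', 'feature_2', 'value']

# Add the same summary columns as in the notebook
pairs['abs_value'] = pairs['value'].abs()
pairs['r-squared'] = pairs['value'] ** 2
pairs = pairs.sort_values('r-squared', ascending=False)

# 4 features give 4 * 3 / 2 unique pairs
print(len(pairs))
```

With n features this leaves n(n-1)/2 rows, which for 100 features is the 4950 rows reported above.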

# Save results
cov.to_csv('./output/feature_correlation.csv')

Show histogram and counts of correlations#

# Histogram of R-squared (squared correlation) across feature pairs
fig = plt.figure(figsize=(6,5))
ax = fig.add_subplot(111)
bins = np.arange(0, 1.01, 0.01)
ax.hist(cov['r-squared'], bins=bins, rwidth=1)
ax.set_xlabel('R-squared') 
ax.set_ylabel('Frequency')
plt.savefig('output/covariance.jpg', dpi=300)
plt.show()
[Figure: histogram of R-squared values across feature pairs (x-axis: R-squared, y-axis: Frequency)]

Show the proportion of feature pairs whose correlation (R-squared) falls in key bins.

bins = [0, 0.10, 0.25, 0.5, 0.75, 0.99, 1.1]
counts = np.histogram(cov['r-squared'], bins=bins)[0]
counts = counts / counts.sum()

# Labels match the bin edges above (the top bin starts at 0.99)
labels = ['<0.10', '0.10 to 0.25', '0.25 to 0.50', '0.50 to 0.75', '0.75 to 0.99', '0.99 to 1']
counts_df = pd.DataFrame(index=labels)
counts_df['Proportion'] = counts
counts_df['Cumulative Proportion'] = counts.cumsum()
counts_df.index.name = 'R-squared'
counts_df
              Proportion  Cumulative Proportion
R-squared
<0.10           0.960808               0.960808
0.10 to 0.25    0.015556               0.976364
0.25 to 0.50    0.010303               0.986667
0.50 to 0.75    0.006667               0.993333
0.75 to 0.99    0.003232               0.996566
0.99 to 1       0.003434               1.000000
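
When reading the binned proportions it helps to know how np.histogram treats values that sit on a bin edge: every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The sketch below uses a handful of hypothetical R-squared values with the same bin edges as above.

```python
import numpy as np

# Hypothetical R-squared values chosen to sit on or near bin edges
values = np.array([0.05, 0.10, 0.75, 0.9969, 1.0])
bins = [0, 0.10, 0.25, 0.5, 0.75, 0.99, 1.1]

counts, edges = np.histogram(values, bins=bins)

# 0.10 falls in the second bin and 0.75 in the fifth, because each bin
# includes its left edge; 0.9969 and 1.0 both fall in the final bin
print(counts)
```

With an upper edge of 0.99, pairs with an R-squared of 0.9969 are therefore counted in the final bin together with the perfectly correlated pairs.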

Show highly correlated features#

Perfectly correlated features#

# Get perfectly correlated data (R-squared > 0.999)
mask = cov['r-squared'] > 0.999
cov[mask]
variable value abs_value r-squared missing pair
AFAnticoagulentHeparin_missing AFAnticoagulentDOAC_missing 1.0 1.0 1.0 True AFAnticoagulentDOAC_missing-AFAnticoagulentHep...
Hypertension_Yes Hypertension_No -1.0 1.0 1.0 False Hypertension_No-Hypertension_Yes
AFAnticoagulentHeparin_missing AFAnticoagulentVitK_missing 1.0 1.0 1.0 True AFAnticoagulentHeparin_missing-AFAnticoagulent...
AFAnticoagulentDOAC_missing AFAnticoagulentVitK_missing 1.0 1.0 1.0 True AFAnticoagulentDOAC_missing-AFAnticoagulentVit...
S1ArriveByAmbulance_Yes S1ArriveByAmbulance_No -1.0 1.0 1.0 False S1ArriveByAmbulance_No-S1ArriveByAmbulance_Yes
MoreEqual80y_No MoreEqual80y_Yes -1.0 1.0 1.0 False MoreEqual80y_No-MoreEqual80y_Yes
Diabetes_Yes Diabetes_No -1.0 1.0 1.0 False Diabetes_No-Diabetes_Yes
S1Gender_Male S1Gender_Female -1.0 1.0 1.0 False S1Gender_Female-S1Gender_Male
CongestiveHeartFailure_Yes CongestiveHeartFailure_No -1.0 1.0 1.0 False CongestiveHeartFailure_No-CongestiveHeartFailu...
AtrialFibrillation_Yes AtrialFibrillation_No -1.0 1.0 1.0 False AtrialFibrillation_No-AtrialFibrillation_Yes
StrokeTIA_No StrokeTIA_Yes -1.0 1.0 1.0 False StrokeTIA_No-StrokeTIA_Yes
AtrialFibrillation_Yes AFAntiplatelet_missing -1.0 1.0 1.0 True AFAntiplatelet_missing-AtrialFibrillation_Yes
AFAntiplatelet_missing AtrialFibrillation_No 1.0 1.0 1.0 True AFAntiplatelet_missing-AtrialFibrillation_No
S1OnsetTimeType_Precise S1OnsetTimeType_Best estimate -1.0 1.0 1.0 False S1OnsetTimeType_Best estimate-S1OnsetTimeType_...

Highly correlated features#

Absolute correlation between 0.50 and 0.999 (equivalent to an R-squared between 0.25 and 0.998)

# Get highly correlated data (absolute correlation between 0.50 and 0.999)
pd.set_option('display.max_rows', None)
mask = (cov['abs_value'] <= 0.999) & (cov['abs_value'] >= 0.50)
cov[mask]
variable value abs_value r-squared missing pair
AFAnticoagulentHeparin_No AFAnticoagulentHeparin_missing -0.9984 0.9984 0.9969 True AFAnticoagulentHeparin_No-AFAnticoagulentHepar...
AFAnticoagulentDOAC_missing AFAnticoagulentHeparin_No -0.9984 0.9984 0.9969 True AFAnticoagulentDOAC_missing-AFAnticoagulentHep...
AFAnticoagulentVitK_missing AFAnticoagulentHeparin_No -0.9984 0.9984 0.9969 True AFAnticoagulentHeparin_No-AFAnticoagulentVitK_...
S2StrokeType_Infarction S2StrokeType_Primary Intracerebral Haemorrhage -0.9940 0.9940 0.9881 False S2StrokeType_Infarction-S2StrokeType_Primary I...
AFAnticoagulentVitK_missing AFAnticoagulentVitK_No -0.9590 0.9590 0.9198 True AFAnticoagulentVitK_No-AFAnticoagulentVitK_mis...
AFAnticoagulentVitK_No AFAnticoagulentHeparin_missing -0.9590 0.9590 0.9198 True AFAnticoagulentHeparin_missing-AFAnticoagulent...
AFAnticoagulentVitK_No AFAnticoagulentDOAC_missing -0.9590 0.9590 0.9198 True AFAnticoagulentDOAC_missing-AFAnticoagulentVit...
AFAnticoagulentHeparin_No AFAnticoagulentVitK_No 0.9575 0.9575 0.9168 False AFAnticoagulentHeparin_No-AFAnticoagulentVitK_No
S2NewAFDiagnosis_No S2NewAFDiagnosis_missing -0.9423 0.9423 0.8880 True S2NewAFDiagnosis_No-S2NewAFDiagnosis_missing
AFAnticoagulentDOAC_No AFAnticoagulentDOAC_missing -0.9339 0.9339 0.8721 True AFAnticoagulentDOAC_No-AFAnticoagulentDOAC_mis...
AFAnticoagulentDOAC_No AFAnticoagulentVitK_missing -0.9339 0.9339 0.8721 True AFAnticoagulentDOAC_No-AFAnticoagulentVitK_mis...
AFAnticoagulentDOAC_No AFAnticoagulentHeparin_missing -0.9339 0.9339 0.8721 True AFAnticoagulentDOAC_No-AFAnticoagulentHeparin_...
AFAnticoagulentDOAC_No AFAnticoagulentHeparin_No 0.9322 0.9322 0.8690 False AFAnticoagulentDOAC_No-AFAnticoagulentHeparin_No
S2TIAInLastMonth_No S2TIAInLastMonth_missing -0.9032 0.9032 0.8158 True S2TIAInLastMonth_No-S2TIAInLastMonth_missing
AFAnticoagulentDOAC_No AFAnticoagulentVitK_No 0.8889 0.8889 0.7902 False AFAnticoagulentDOAC_No-AFAnticoagulentVitK_No
AFAnticoagulentDOAC_missing S1AdmissionYear_2018 -0.8775 0.8775 0.7700 True AFAnticoagulentDOAC_missing-S1AdmissionYear_2018
S1AdmissionYear_2018 AFAnticoagulentHeparin_missing -0.8775 0.8775 0.7700 True AFAnticoagulentHeparin_missing-S1AdmissionYear...
S1AdmissionYear_2018 AFAnticoagulentVitK_missing -0.8775 0.8775 0.7700 True AFAnticoagulentVitK_missing-S1AdmissionYear_2018
S1AdmissionYear_2018 AFAnticoagulentHeparin_No 0.8758 0.8758 0.7671 False AFAnticoagulentHeparin_No-S1AdmissionYear_2018
MotorLegRight MotorArmRight 0.8435 0.8435 0.7115 False MotorArmRight-MotorLegRight
AFAnticoagulentVitK_No S1AdmissionYear_2018 0.8370 0.8370 0.7006 False AFAnticoagulentVitK_No-S1AdmissionYear_2018
AFAnticoagulentDOAC_No S2NewAFDiagnosis_missing -0.8342 0.8342 0.6959 True AFAnticoagulentDOAC_No-S2NewAFDiagnosis_missing
MotorArmLeft MotorLegLeft 0.8326 0.8326 0.6933 False MotorArmLeft-MotorLegLeft
AFAnticoagulentVitK_No S2NewAFDiagnosis_missing -0.8224 0.8224 0.6764 True AFAnticoagulentVitK_No-S2NewAFDiagnosis_missing
S1AdmissionYear_2018 AFAnticoagulentDOAC_No 0.8108 0.8108 0.6574 False AFAnticoagulentDOAC_No-S1AdmissionYear_2018
AFAnticoagulentVitK_missing S2NewAFDiagnosis_missing 0.8046 0.8046 0.6474 True AFAnticoagulentVitK_missing-S2NewAFDiagnosis_m...
AFAnticoagulentHeparin_missing S2NewAFDiagnosis_missing 0.8046 0.8046 0.6474 True AFAnticoagulentHeparin_missing-S2NewAFDiagnosi...
S2NewAFDiagnosis_missing AFAnticoagulentDOAC_missing 0.8046 0.8046 0.6474 True AFAnticoagulentDOAC_missing-S2NewAFDiagnosis_m...
S2NewAFDiagnosis_missing AFAnticoagulentHeparin_No -0.8035 0.8035 0.6456 True AFAnticoagulentHeparin_No-S2NewAFDiagnosis_mis...
S2NewAFDiagnosis_No AFAnticoagulentDOAC_No 0.7859 0.7859 0.6176 False AFAnticoagulentDOAC_No-S2NewAFDiagnosis_No
S1AdmissionYear_2018 S2NewAFDiagnosis_missing -0.7805 0.7805 0.6092 True S1AdmissionYear_2018-S2NewAFDiagnosis_missing
S2NewAFDiagnosis_No AFAnticoagulentVitK_No 0.7749 0.7749 0.6004 False AFAnticoagulentVitK_No-S2NewAFDiagnosis_No
AFAntiplatelet_No AFAntiplatelet_missing -0.7736 0.7736 0.5984 True AFAntiplatelet_No-AFAntiplatelet_missing
AtrialFibrillation_No AFAntiplatelet_No -0.7736 0.7736 0.5984 False AFAntiplatelet_No-AtrialFibrillation_No
AFAntiplatelet_No AtrialFibrillation_Yes 0.7736 0.7736 0.5984 False AFAntiplatelet_No-AtrialFibrillation_Yes
S1Ethnicity_White S1Ethnicity_Other -0.7607 0.7607 0.5787 False S1Ethnicity_Other-S1Ethnicity_White
S2NewAFDiagnosis_No AFAnticoagulentDOAC_missing -0.7582 0.7582 0.5749 True AFAnticoagulentDOAC_missing-S2NewAFDiagnosis_No
S2NewAFDiagnosis_No AFAnticoagulentHeparin_missing -0.7582 0.7582 0.5749 True AFAnticoagulentHeparin_missing-S2NewAFDiagnosi...
AFAnticoagulentVitK_missing S2NewAFDiagnosis_No -0.7582 0.7582 0.5749 True AFAnticoagulentVitK_missing-S2NewAFDiagnosis_No
S2NewAFDiagnosis_No AFAnticoagulentHeparin_No 0.7572 0.7572 0.5734 False AFAnticoagulentHeparin_No-S2NewAFDiagnosis_No
MoreEqual80y_Yes S1AgeOnArrival 0.7556 0.7556 0.5709 False MoreEqual80y_Yes-S1AgeOnArrival
MoreEqual80y_No S1AgeOnArrival -0.7556 0.7556 0.5709 False MoreEqual80y_No-S1AgeOnArrival
S2NewAFDiagnosis_missing AFAnticoagulent_No -0.7502 0.7502 0.5628 True AFAnticoagulent_No-S2NewAFDiagnosis_missing
S1AdmissionYear_2018 S2NewAFDiagnosis_No 0.7357 0.7357 0.5413 False S1AdmissionYear_2018-S2NewAFDiagnosis_No
S1OnsetDateType_Precise S1OnsetDateType_Best estimate -0.7267 0.7267 0.5281 False S1OnsetDateType_Best estimate-S1OnsetDateType_...
AFAnticoagulent_No AFAnticoagulentDOAC_No 0.7212 0.7212 0.5202 False AFAnticoagulentDOAC_No-AFAnticoagulent_No
S1AdmissionYear_2018 AFAnticoagulent_missing -0.7149 0.7149 0.5111 True AFAnticoagulent_missing-S1AdmissionYear_2018
AFAnticoagulentHeparin_missing AFAnticoagulent_missing 0.7093 0.7093 0.5030 True AFAnticoagulentHeparin_missing-AFAnticoagulent...
AFAnticoagulentDOAC_missing AFAnticoagulent_missing 0.7093 0.7093 0.5030 True AFAnticoagulentDOAC_missing-AFAnticoagulent_mi...
AFAnticoagulentVitK_missing AFAnticoagulent_missing 0.7093 0.7093 0.5030 True AFAnticoagulentVitK_missing-AFAnticoagulent_mi...
S2NewAFDiagnosis_No AFAnticoagulent_No 0.7092 0.7092 0.5030 False AFAnticoagulent_No-S2NewAFDiagnosis_No
AFAnticoagulentHeparin_No AFAnticoagulent_missing -0.7080 0.7080 0.5012 True AFAnticoagulentHeparin_No-AFAnticoagulent_missing
AFAnticoagulent_missing AFAnticoagulent_No -0.7035 0.7035 0.4949 True AFAnticoagulent_No-AFAnticoagulent_missing
AFAnticoagulent_No AFAnticoagulentVitK_No 0.6979 0.6979 0.4871 False AFAnticoagulentVitK_No-AFAnticoagulent_No
BestLanguage LocQuestions 0.6979 0.6979 0.4870 False BestLanguage-LocQuestions
S2NihssArrival ExtinctionInattention 0.6795 0.6795 0.4617 False ExtinctionInattention-S2NihssArrival
AFAnticoagulentVitK_No AFAnticoagulent_missing -0.6763 0.6763 0.4574 True AFAnticoagulentVitK_No-AFAnticoagulent_missing
BestGaze S2NihssArrival 0.6756 0.6756 0.4564 False BestGaze-S2NihssArrival
BestLanguage S2NihssArrival 0.6743 0.6743 0.4547 False BestLanguage-S2NihssArrival
AFAnticoagulent_No S1AdmissionYear_2018 0.6694 0.6694 0.4480 False AFAnticoagulent_No-S1AdmissionYear_2018
AFAnticoagulentHeparin_No AFAnticoagulent_No 0.6636 0.6636 0.4403 False AFAnticoagulentHeparin_No-AFAnticoagulent_No
AFAnticoagulent_No AFAnticoagulentDOAC_missing -0.6622 0.6622 0.4386 True AFAnticoagulentDOAC_missing-AFAnticoagulent_No
AFAnticoagulentVitK_missing AFAnticoagulent_No -0.6622 0.6622 0.4386 True AFAnticoagulentVitK_missing-AFAnticoagulent_No
AFAnticoagulentHeparin_missing AFAnticoagulent_No -0.6622 0.6622 0.4386 True AFAnticoagulentHeparin_missing-AFAnticoagulent_No
LocCommands LocQuestions 0.6576 0.6576 0.4325 False LocCommands-LocQuestions
S2NihssArrival LocCommands 0.6560 0.6560 0.4303 False LocCommands-S2NihssArrival
AFAnticoagulent_missing AFAnticoagulentDOAC_No -0.6560 0.6560 0.4303 True AFAnticoagulentDOAC_No-AFAnticoagulent_missing
AtrialFibrillation_Yes AFAnticoagulent_Yes 0.6516 0.6516 0.4245 False AFAnticoagulent_Yes-AtrialFibrillation_Yes
AFAnticoagulent_Yes AtrialFibrillation_No -0.6516 0.6516 0.4245 False AFAnticoagulent_Yes-AtrialFibrillation_No
AFAnticoagulent_Yes AFAntiplatelet_missing -0.6516 0.6516 0.4245 True AFAnticoagulent_Yes-AFAntiplatelet_missing
S2NihssArrival MotorLegRight 0.6493 0.6493 0.4216 False MotorLegRight-S2NihssArrival
S1OnsetDateType_Stroke during sleep S1OnsetDateType_Precise -0.6469 0.6469 0.4185 False S1OnsetDateType_Precise-S1OnsetDateType_Stroke...
LocQuestions S2NihssArrival 0.6357 0.6357 0.4041 False LocQuestions-S2NihssArrival
LocCommands BestLanguage 0.6345 0.6345 0.4026 False BestLanguage-LocCommands
AFAnticoagulent_missing S2NewAFDiagnosis_missing 0.6305 0.6305 0.3976 True AFAnticoagulent_missing-S2NewAFDiagnosis_missing
S2NihssArrival MotorArmRight 0.6212 0.6212 0.3858 False MotorArmRight-S2NihssArrival
AFAnticoagulent_Yes AFAntiplatelet_No 0.6013 0.6013 0.3615 False AFAnticoagulent_Yes-AFAntiplatelet_No
Visual S2NihssArrival 0.5986 0.5986 0.3583 False S2NihssArrival-Visual
AFAnticoagulent_missing S2NewAFDiagnosis_No -0.5942 0.5942 0.3530 True AFAnticoagulent_missing-S2NewAFDiagnosis_No
Dysarthria S2NihssArrival 0.5840 0.5840 0.3410 False Dysarthria-S2NihssArrival
BestGaze ExtinctionInattention 0.5732 0.5732 0.3286 False BestGaze-ExtinctionInattention
Visual ExtinctionInattention 0.5626 0.5626 0.3165 False ExtinctionInattention-Visual
S2NihssArrival Loc 0.5568 0.5568 0.3100 False Loc-S2NihssArrival
S2NihssArrival FacialPalsy 0.5548 0.5548 0.3078 False FacialPalsy-S2NihssArrival
S2NihssArrival MotorLegLeft 0.5507 0.5507 0.3033 False MotorLegLeft-S2NihssArrival
S2NihssArrival Sensory 0.5503 0.5503 0.3028 False S2NihssArrival-Sensory
Visual BestGaze 0.5442 0.5442 0.2961 False BestGaze-Visual
S1AdmissionYear_2016 AFAnticoagulentDOAC_missing 0.5441 0.5441 0.2960 True AFAnticoagulentDOAC_missing-S1AdmissionYear_2016
AFAnticoagulentVitK_missing S1AdmissionYear_2016 0.5441 0.5441 0.2960 True AFAnticoagulentVitK_missing-S1AdmissionYear_2016
S1AdmissionYear_2016 AFAnticoagulentHeparin_missing 0.5441 0.5441 0.2960 True AFAnticoagulentHeparin_missing-S1AdmissionYear...
S1AdmissionYear_2016 AFAnticoagulentHeparin_No -0.5432 0.5432 0.2951 False AFAnticoagulentHeparin_No-S1AdmissionYear_2016
MotorArmRight BestLanguage 0.5239 0.5239 0.2745 False BestLanguage-MotorArmRight
S1AdmissionYear_2016 AFAnticoagulentVitK_No -0.5217 0.5217 0.2722 False AFAnticoagulentVitK_No-S1AdmissionYear_2016
AtrialFibrillation_Yes AFAnticoagulent_missing -0.5171 0.5171 0.2674 True AFAnticoagulent_missing-AtrialFibrillation_Yes
AFAnticoagulent_missing AtrialFibrillation_No 0.5171 0.5171 0.2674 True AFAnticoagulent_missing-AtrialFibrillation_No
AFAnticoagulent_missing AFAntiplatelet_missing 0.5171 0.5171 0.2674 True AFAnticoagulent_missing-AFAntiplatelet_missing
S1AdmissionYear_2016 AFAnticoagulentDOAC_No -0.5080 0.5080 0.2581 False AFAnticoagulentDOAC_No-S1AdmissionYear_2016
ExtinctionInattention Sensory 0.5072 0.5072 0.2573 False ExtinctionInattention-Sensory
StrokeTIA_No S2TIAInLastMonth_missing 0.5056 0.5056 0.2556 True S2TIAInLastMonth_missing-StrokeTIA_No
S2TIAInLastMonth_missing StrokeTIA_Yes -0.5056 0.5056 0.2556 True S2TIAInLastMonth_missing-StrokeTIA_Yes
MotorLegRight BestLanguage 0.5052 0.5052 0.2553 False BestLanguage-MotorLegRight
S1AdmissionYear_2017 S1AdmissionYear_2018 -0.5026 0.5026 0.2526 False S1AdmissionYear_2017-S1AdmissionYear_2018
S1AdmissionYear_2016 S1AdmissionYear_2017 -0.5002 0.5002 0.2502 False S1AdmissionYear_2016-S1AdmissionYear_2017

Repeat (absolute correlation between 0.50 and 0.999), but exclude pairs where either feature is a ‘missing’ indicator.

# Get highly correlated data (absolute correlation between 0.50 and 0.999),
# excluding pairs that involve a 'missing' indicator
pd.set_option('display.max_rows', None)
mask = (cov['abs_value'] <= 0.999) & (cov['abs_value'] >= 0.50) & ~cov['missing']
cov[mask]
variable value abs_value r-squared missing pair
S2StrokeType_Infarction S2StrokeType_Primary Intracerebral Haemorrhage -0.9940 0.9940 0.9881 False S2StrokeType_Infarction-S2StrokeType_Primary I...
AFAnticoagulentHeparin_No AFAnticoagulentVitK_No 0.9575 0.9575 0.9168 False AFAnticoagulentHeparin_No-AFAnticoagulentVitK_No
AFAnticoagulentDOAC_No AFAnticoagulentHeparin_No 0.9322 0.9322 0.8690 False AFAnticoagulentDOAC_No-AFAnticoagulentHeparin_No
AFAnticoagulentDOAC_No AFAnticoagulentVitK_No 0.8889 0.8889 0.7902 False AFAnticoagulentDOAC_No-AFAnticoagulentVitK_No
S1AdmissionYear_2018 AFAnticoagulentHeparin_No 0.8758 0.8758 0.7671 False AFAnticoagulentHeparin_No-S1AdmissionYear_2018
MotorLegRight MotorArmRight 0.8435 0.8435 0.7115 False MotorArmRight-MotorLegRight
AFAnticoagulentVitK_No S1AdmissionYear_2018 0.8370 0.8370 0.7006 False AFAnticoagulentVitK_No-S1AdmissionYear_2018
MotorArmLeft MotorLegLeft 0.8326 0.8326 0.6933 False MotorArmLeft-MotorLegLeft
S1AdmissionYear_2018 AFAnticoagulentDOAC_No 0.8108 0.8108 0.6574 False AFAnticoagulentDOAC_No-S1AdmissionYear_2018
S2NewAFDiagnosis_No AFAnticoagulentDOAC_No 0.7859 0.7859 0.6176 False AFAnticoagulentDOAC_No-S2NewAFDiagnosis_No
S2NewAFDiagnosis_No AFAnticoagulentVitK_No 0.7749 0.7749 0.6004 False AFAnticoagulentVitK_No-S2NewAFDiagnosis_No
AtrialFibrillation_No AFAntiplatelet_No -0.7736 0.7736 0.5984 False AFAntiplatelet_No-AtrialFibrillation_No
AFAntiplatelet_No AtrialFibrillation_Yes 0.7736 0.7736 0.5984 False AFAntiplatelet_No-AtrialFibrillation_Yes
S1Ethnicity_White S1Ethnicity_Other -0.7607 0.7607 0.5787 False S1Ethnicity_Other-S1Ethnicity_White
S2NewAFDiagnosis_No AFAnticoagulentHeparin_No 0.7572 0.7572 0.5734 False AFAnticoagulentHeparin_No-S2NewAFDiagnosis_No
MoreEqual80y_Yes S1AgeOnArrival 0.7556 0.7556 0.5709 False MoreEqual80y_Yes-S1AgeOnArrival
MoreEqual80y_No S1AgeOnArrival -0.7556 0.7556 0.5709 False MoreEqual80y_No-S1AgeOnArrival
S1AdmissionYear_2018 S2NewAFDiagnosis_No 0.7357 0.7357 0.5413 False S1AdmissionYear_2018-S2NewAFDiagnosis_No
S1OnsetDateType_Precise S1OnsetDateType_Best estimate -0.7267 0.7267 0.5281 False S1OnsetDateType_Best estimate-S1OnsetDateType_...
AFAnticoagulent_No AFAnticoagulentDOAC_No 0.7212 0.7212 0.5202 False AFAnticoagulentDOAC_No-AFAnticoagulent_No
S2NewAFDiagnosis_No AFAnticoagulent_No 0.7092 0.7092 0.5030 False AFAnticoagulent_No-S2NewAFDiagnosis_No
AFAnticoagulent_No AFAnticoagulentVitK_No 0.6979 0.6979 0.4871 False AFAnticoagulentVitK_No-AFAnticoagulent_No
BestLanguage LocQuestions 0.6979 0.6979 0.4870 False BestLanguage-LocQuestions
S2NihssArrival ExtinctionInattention 0.6795 0.6795 0.4617 False ExtinctionInattention-S2NihssArrival
BestGaze S2NihssArrival 0.6756 0.6756 0.4564 False BestGaze-S2NihssArrival
BestLanguage S2NihssArrival 0.6743 0.6743 0.4547 False BestLanguage-S2NihssArrival
AFAnticoagulent_No S1AdmissionYear_2018 0.6694 0.6694 0.4480 False AFAnticoagulent_No-S1AdmissionYear_2018
AFAnticoagulentHeparin_No AFAnticoagulent_No 0.6636 0.6636 0.4403 False AFAnticoagulentHeparin_No-AFAnticoagulent_No
LocCommands LocQuestions 0.6576 0.6576 0.4325 False LocCommands-LocQuestions
S2NihssArrival LocCommands 0.6560 0.6560 0.4303 False LocCommands-S2NihssArrival
AtrialFibrillation_Yes AFAnticoagulent_Yes 0.6516 0.6516 0.4245 False AFAnticoagulent_Yes-AtrialFibrillation_Yes
AFAnticoagulent_Yes AtrialFibrillation_No -0.6516 0.6516 0.4245 False AFAnticoagulent_Yes-AtrialFibrillation_No
S2NihssArrival MotorLegRight 0.6493 0.6493 0.4216 False MotorLegRight-S2NihssArrival
S1OnsetDateType_Stroke during sleep S1OnsetDateType_Precise -0.6469 0.6469 0.4185 False S1OnsetDateType_Precise-S1OnsetDateType_Stroke...
LocQuestions S2NihssArrival 0.6357 0.6357 0.4041 False LocQuestions-S2NihssArrival
LocCommands BestLanguage 0.6345 0.6345 0.4026 False BestLanguage-LocCommands
S2NihssArrival MotorArmRight 0.6212 0.6212 0.3858 False MotorArmRight-S2NihssArrival
AFAnticoagulent_Yes AFAntiplatelet_No 0.6013 0.6013 0.3615 False AFAnticoagulent_Yes-AFAntiplatelet_No
Visual S2NihssArrival 0.5986 0.5986 0.3583 False S2NihssArrival-Visual
Dysarthria S2NihssArrival 0.5840 0.5840 0.3410 False Dysarthria-S2NihssArrival
BestGaze ExtinctionInattention 0.5732 0.5732 0.3286 False BestGaze-ExtinctionInattention
Visual ExtinctionInattention 0.5626 0.5626 0.3165 False ExtinctionInattention-Visual
S2NihssArrival Loc 0.5568 0.5568 0.3100 False Loc-S2NihssArrival
S2NihssArrival FacialPalsy 0.5548 0.5548 0.3078 False FacialPalsy-S2NihssArrival
S2NihssArrival MotorLegLeft 0.5507 0.5507 0.3033 False MotorLegLeft-S2NihssArrival
S2NihssArrival Sensory 0.5503 0.5503 0.3028 False S2NihssArrival-Sensory
Visual BestGaze 0.5442 0.5442 0.2961 False BestGaze-Visual
S1AdmissionYear_2016 AFAnticoagulentHeparin_No -0.5432 0.5432 0.2951 False AFAnticoagulentHeparin_No-S1AdmissionYear_2016
MotorArmRight BestLanguage 0.5239 0.5239 0.2745 False BestLanguage-MotorArmRight
S1AdmissionYear_2016 AFAnticoagulentVitK_No -0.5217 0.5217 0.2722 False AFAnticoagulentVitK_No-S1AdmissionYear_2016
S1AdmissionYear_2016 AFAnticoagulentDOAC_No -0.5080 0.5080 0.2581 False AFAnticoagulentDOAC_No-S1AdmissionYear_2016
ExtinctionInattention Sensory 0.5072 0.5072 0.2573 False ExtinctionInattention-Sensory
MotorLegRight BestLanguage 0.5052 0.5052 0.2553 False BestLanguage-MotorLegRight
S1AdmissionYear_2017 S1AdmissionYear_2018 -0.5026 0.5026 0.2526 False S1AdmissionYear_2017-S1AdmissionYear_2018
S1AdmissionYear_2016 S1AdmissionYear_2017 -0.5002 0.5002 0.2502 False S1AdmissionYear_2016-S1AdmissionYear_2017

Observations#

  • Most of the features show weak correlation (96% of feature pairs have an R-squared of less than 0.1).

  • Perfectly correlated feature pairs are present due to dichotomised coding of some features (for example, Hypertension_Yes and Hypertension_No are exact complements of each other).

  • Many of the high correlations are between a ‘missing’ indicator and the value recorded when data is present (for example, AFAnticoagulentVitK_missing and AFAnticoagulentVitK_No). There are some ‘more interesting’ highly correlated features, such as:

    • Right leg and arm weakness are highly correlated (r = 0.84), as are left leg and arm weakness (r = 0.83).

    • Right leg weakness is highly correlated with problems in balance and language.
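
The observation about dichotomised coding can be illustrated with a toy example. This is a sketch on hypothetical data, assuming the Yes/No columns were produced by one-hot encoding; it is not taken from the notebook's own preprocessing.

```python
import numpy as np
import pandas as pd

# Hypothetical binary feature (not drawn from the stroke data)
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    {'Hypertension': rng.choice(['Yes', 'No'], size=100)})

# Coding both levels gives two columns that always sum to one,
# so they are perfectly negatively correlated
both = pd.get_dummies(demo, dtype=float)
r = both['Hypertension_Yes'].corr(both['Hypertension_No'])
print(round(r, 4))

# Keeping a single column per binary feature removes the redundancy
single = pd.get_dummies(demo, drop_first=True, dtype=float)
print(list(single))
```

Whether the redundant column matters depends on the downstream model; the sketch only shows where the perfect correlations come from.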