A comparison of actual thrombolysis rates across hospitals: subgroup analysis

A comparison of actual thrombolysis rates across hospitals: subgroup analysis#

Plain English summary#

We have previously predicted the thrombolysis use in subgroups, if all hospitals saw the same 10 thousand patients. In this notebook we look at the same subgroup analysis, this time for observed thrombolysis use. One difference between this and the previous work, is that when we look at observed thrombolysis use, then the patients in the subgroups are different at each hospital, but we can still look for agreement in general patterns.

Informed by the SHAP values, we analysed the observed use of thrombolysis in subgroups of patients: one ‘ideally’ thrombolysable patient, nine ‘sub-optimal’ thrombolysable patient subgroups (one subgroup per feature), and subgroups with combinations of sub-optimal features. We based the ideally thrombolysable definition on observing the relationships between feature values and thrombolysis use, and for each ‘sub optimal’ subgroup we chose the feature value that is less desirable for choosing to use thrombolysis (this is either a feature value that corresponds with a SHAP value of zero, or the least favourable value for binary features).

The patient subgroups are defined as:

An ‘ideally’ thrombolysable patient:
- Mid-level stroke severity (NIHSS in range 10-25)
- Short arrival-to-scan time (less than 30 minutes)
- Stroke caused by infarction
- Precise stroke onset time known
- No pre-stroke disability (mRS 0)
- Not taking atrial fibrillation anticoagulants
- Short onset-to-arrival time (less than 90 minutes)
- Younger than 80 years old
- Onset not during sleep
Patients with a milder stroke severity (NIHSS less than 5)
Patients where stroke onset time known imprecisely
Patients with existing pre-stroke disability (mRS greater than 2)
Patients with a haemorrhagic stroke
Patients with 60-90 minutes arrival-to-scan time
Patients with use of AF anticoagulants
Patients with 150-180 minutes onset-to-arrival time
Patients with onset during sleep
Patients aged over 80 years old

As with the predicted thrombolysis use (if all hopsitals saw the same patients), we see the same patterns. All stroke units show high expected thrombolysis in a set of ‘ideal’ thrombolysis patients, but vary in expected use in subgroups with low stroke severity, no precise onset time, or existing pre-stroke disability. If a stroke unit showed lower thombolysis in one of these subgroups they also tended to show lower thrombolysis rates in the other subgroups - suggesting a shared caution in use of thrombolysis in ‘less ideal’ patients.

Data#

This analysis is for patients who arrive within 4 hours of known stroke onset. The data is the observed thrombolsyis use at each hopsital (not modelled/predicted thrombolysis use).

Aims#

Look at actual thrombolysis use in these subgroups of patients (in addiitonal to total thrombolysis use):
- An ideal thrombolysable patient:
  - Stroke severity NIHSS in range 10-25
  - Arrival-to-scan time < 30 minutes
  - Stroke type = infarction
  - Precise onset time = True
  - Prior diability level (mRS) = 0
  - No use of AF anticoagulants
  - Onset-to-arrival time < 90 minutes
  - Age < 80 years
  - Onset during sleep = False
- Mild stroke severity (NIHSS < 5)
- No precise onset time
- Existing pre-stroke disability (mRS > 2)
- Older than 80 years old
- A haemorrhagic stroke
- Arrival-to-scan time 60-90 minutes
- Onset-to-arrival time 150-180 minutes
- Use of AF anticoagulants
- Onset during sleep

Observations#

The three subgroups of NIHSS <5, no precise stroke onset time, and prestroke mRS > 2, all had reduced thrombolysis use (compared to all patients), and combining these non-ideal features reduced thrombolysis use further.
The three subgroups of NIHSS <5, no precise stroke onset time, and prestroke mRS > 2, tended to reduce in parallel, along with total thrombolysis use, suggesting a shared caution in use of thrombolysis in ‘less ideal’ patients.
The observed and predicted subgroup analysis show very similar general patterns (with r-sqaured=0.93). Some differences exist:
- The use of thrombolysis in ideal patients is a little low in the observed vs predicted results (mean hospital thrombolysis use = 89% vs 99%).
- The predicted results show a stronger effect of combining non-ideal features.
- The observed thrombolysis rate shows higher between-hospital variation than the predicted thrombolysis rate. This may be partly explained by the observed thrombolysis rate being on different patients at each hospital, but may also be partly explained by actual use of thrombolysis being slightly more variable than predicted thrombolysis use (which will follow general hospital patterns, and will not include, for example, between-clinician variation at each hospital).

Import packages#

import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

Set filenames#

# Set up strings (describing the model) to use in filenames
number_key_features = 10
model_text = f'xgb_{number_key_features}_features_10k_cohort'
notebook = '04c'

Import data#

# Load one k-fold, and join to get full data
data_loc = '../data/kfold_5fold/'
train = pd.read_csv(data_loc + 'train_0.csv')
test = pd.read_csv(data_loc + 'test_0.csv')
data = pd.concat([train, test])

# Restrict data fields and rename
with open("./output/01_feature_name_dict.json") as json_file:
    feature_name_dict = json.load(json_file)
number_of_features_to_use = 10
key_features = pd.read_csv('./output/01_feature_selection.csv')
key_features = list(key_features['feature'])[:number_of_features_to_use]
key_features.append('S2Thrombolysis')
data = data[key_features]
data.rename(columns=feature_name_dict, inplace=True)

data.head()

	Arrival-to-scan time	Infarction	Stroke severity	Precise onset time	Prior disability level	Stroke team	Use of AF anticoagulants	Onset-to-arrival time	Age	Thrombolysis
0	14.0	0	24.0	1	0	APXEE8191H	0	106.0	92.5	0
1	8.0	1	27.0	1	0	EQZZZ5658G	0	144.0	72.5	1
2	18.0	1	2.0	0	0	QOAPO4699N	0	83.0	72.5	1
3	19.0	1	12.0	1	4	JINXD0311F	0	78.0	57.5	1
4	22.0	1	27.0	1	3	HBFCN1575G	1	145.0	82.5	1

Get thrombolysis use in subgroups at each hospital#

Store in dictionary

# Initialise dictionary to store a dictionary per hospital 
#   (using hosptial name as key)
dict_results = dict()

# Group data by hospital
grouped_data = data.groupby('Stroke team')

# Loop through hospitals
for hospital, hosp_data in grouped_data:
    
    # Initialise dictionary for this hospital
    dict_hosp_results = dict()
    
    # Calculate thrombolysis rate for all patients attending hospital
    dict_hosp_results['All patients'] = hosp_data['Thrombolysis'].mean()

    # Calculate thrombolysis rate for ideal patients
    mask = ((hosp_data['Stroke severity'] <= 25) &
        (hosp_data['Stroke severity'] >= 10) & 
        (hosp_data['Arrival-to-scan time'] <= 30) &
        (hosp_data['Infarction'] == 1) &
        (hosp_data['Precise onset time'] == 1) &
        (hosp_data['Prior disability level'] == 0) &
        (hosp_data['Use of AF anticoagulants'] == 0) &
        (hosp_data['Onset-to-arrival time'] <= 90) &
        (hosp_data['Age'] < 80) &
        (hosp_data['Onset during sleep'] == 0)
        )
    dict_hosp_results['Ideal'] = hosp_data[mask]['Thrombolysis'].mean()

    # Calculate thrombolysis rate for mild stroke (NIHSS < 5)
    mask = hosp_data['Stroke severity'] < 5
    dict_hosp_results['NIHSS < 5'] = hosp_data[mask]['Thrombolysis'].mean()

    # Calculate thrombolysis rate for no precise onset time
    mask = hosp_data['Precise onset time'] == 0
    dict_hosp_results['Estimated onset time'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for patients with prior disability
    mask = hosp_data['Prior disability level'] > 2
    dict_hosp_results['mRS > 2'] = hosp_data[mask]['Thrombolysis'].mean()

    # Calculate thrombolysis rate for > 80 year olds
    mask = hosp_data['Age'] > 80 
    dict_hosp_results['Age > 80'] = hosp_data[mask]['Thrombolysis'].mean()
    
    # Calculate thrombolysis rate for arrival-to-scan time 60-90 mins
    mask = ((hosp_data['Arrival-to-scan time'] > 60) & 
            (hosp_data['Arrival-to-scan time'] <= 90))
    dict_hosp_results['Arrival-to-scan 60-90'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())
    
    # Calculate thrombolysis rate for haemorrhagic stroke
    mask = hosp_data['Infarction'] == 0
    dict_hosp_results['Haemorrhagic'] = hosp_data[mask]['Thrombolysis'].mean()

    # Calculate thrombolysis rate for patients who use AF anticoagulants
    mask = hosp_data['Use of AF anticoagulants'] == 1
    dict_hosp_results['Used AF anticoagulants'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for onset-to-arrival time 150-180 mins
    mask = ((hosp_data['Onset-to-arrival time'] > 150) & 
            (hosp_data['Onset-to-arrival time'] <= 180))
    dict_hosp_results['Onset-to-arrival time 150-180'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for onset during sleep
    mask = hosp_data['Onset during sleep'] == 1
    dict_hosp_results['Onset during sleep'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for mild stroke (NIHSS < 5) & no precise onset
    mask = ((hosp_data['Stroke severity'] < 5) & 
            (hosp_data['Precise onset time'] == 0))
    dict_hosp_results['NIHSS + Precise'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for mild stroke (NIHSS < 5) & prestroke 
    #   disability
    mask = ((hosp_data['Stroke severity'] < 5) & 
            (hosp_data['Prior disability level'] > 2))
    dict_hosp_results['NIHSS + Disability'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for no precise onset & prestroke disability
    mask = ((hosp_data['Precise onset time'] == 0) & 
            (hosp_data['Prior disability level'] > 2))
    dict_hosp_results['Precise + Disability'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Calculate thrombolysis rate for NIHSS < 5, no precise onset, prestroke 
    #   disability
    mask = ((hosp_data['Stroke severity'] < 5) &
            (hosp_data['Precise onset time'] == 0) & 
            (hosp_data['Prior disability level'] > 2))
    dict_hosp_results['NIHSS + Precise + Disability'] = (
                                        hosp_data[mask]['Thrombolysis'].mean())

    # Add dictionary of hospital results to main dictionary (using hospital
    #   name as key)
    dict_results[hospital] = dict_hosp_results

# Create dataframe of results (hospital per row, patient subgroup thrombolysis
#   rate per column)
df_results = pd.DataFrame(dict_results).T

# Convert to percent
df_results *= 100 
df_results

	All patients	Ideal	NIHSS < 5	Estimated onset time	mRS > 2	Age > 80	Arrival-to-scan 60-90	Haemorrhagic	Used AF anticoagulants	Onset-to-arrival time 150-180	Onset during sleep	NIHSS + Precise	NIHSS + Disability	Precise + Disability	NIHSS + Precise + Disability
AGNOF1041H	35.246843	82.051282	15.210356	18.623482	21.379310	29.189189	11.538462	0.0	8.849558	31.958763	2.777778	2.409639	0.000000	8.771930	0.000000
AKCGO9726K	36.974790	86.206897	23.049645	18.539326	33.887043	35.044248	16.981132	0.0	22.155689	33.333333	4.166667	8.275862	10.606061	17.177914	4.761905
AOBTM3098N	21.880342	80.000000	2.690583	5.813953	11.926606	18.309859	10.843373	0.0	5.128205	13.461538	2.040816	0.000000	0.000000	5.555556	0.000000
APXEE8191H	22.648084	100.000000	6.329114	12.356322	13.636364	20.270270	2.941176	0.0	3.061224	21.153846	3.333333	2.597403	0.000000	5.333333	0.000000
ATDID5461S	24.038462	100.000000	14.150943	2.727273	9.756098	16.265060	25.352113	0.0	0.000000	15.000000	0.000000	0.000000	7.142857	0.000000	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
YPKYH1768F	24.605678	87.500000	5.185185	2.409639	11.475410	21.568627	21.621622	0.0	6.250000	20.000000	0.000000	0.000000	5.555556	0.000000	0.000000
YQMZV4284N	23.617021	100.000000	7.389163	6.074766	20.754717	22.594142	3.389831	0.0	5.357143	12.500000	0.000000	0.000000	2.439024	5.681818	0.000000
ZBVSO0975W	25.000000	76.923077	12.435233	14.705882	20.512821	23.560209	0.000000	0.0	1.886792	28.205128	0.000000	22.222222	0.000000	4.166667	0.000000
ZHCLE1578P	22.363946	100.000000	6.067416	22.070145	18.181818	20.000000	1.587302	0.0	6.410256	16.363636	NaN	6.067416	2.325581	18.279570	2.325581
ZRRCV7012C	15.767045	100.000000	6.250000	2.380952	6.000000	12.121212	5.681818	0.0	1.886792	13.114754	0.854701	0.000000	0.000000	0.000000	0.000000

132 rows × 15 columns

Plot results#

Plot data for 6 patient subgroups (all patients, ideal candidate for thrombolysis, and four suboptimal individual features)

df_sorted_results = df_results.sort_values('All patients', ascending=False)
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax1.plot(df_sorted_results['All patients'], label='All', c='k')
ax1.plot(df_sorted_results['Ideal'], label='Ideal', c='r')
ax1.plot(df_sorted_results['NIHSS < 5'], label='NIHSS < 5', c='b')
ax1.plot(df_sorted_results['Estimated onset time'], 
         label='Estimated onset time', c='g')
ax1.plot(df_sorted_results['mRS > 2'], label='mRS > 2', c='orange')
ax1.plot(df_sorted_results['Age > 80'], label='Age > 80', c='m')
ax1.set_xticklabels([])
ax1.set_xlabel('Hospital\n(ordered by thrombolysis rate for all 10k patients)')
ax1.set_ylabel('Percent of 10k subgroup predicted to receive thrombolysis')
ax1.legend(loc='lower center', ncol=3, bbox_to_anchor=(1, -0.3))

ax2 = fig.add_subplot(122)
ax2.plot(df_sorted_results['NIHSS < 5'], label='NIHSS < 5', c='b')
ax2.plot(df_sorted_results['Estimated onset time'], 
         label='Estimated onset time', c='g')
ax2.plot(df_sorted_results['mRS > 2'], label='mRS > 2', c='orange')
ax2.plot(df_sorted_results['Age > 80'], label='Age > 80', c='m')
ax2.set_xticklabels([])
ax2.set_xlabel('Hospital\n(ordered by thrombolysis rate for all 10k patients)')
ax2.set_ylabel('Observed percent of patients receiving thrombolysis')

plt.savefig(f'./output/{notebook}_{model_text}_selected_subgroups.jpg', dpi=300, 
            bbox_inches='tight')

plt.show()

../_images/f44c4823918e9dde2c8f7e8b5f82ae21705b6fa9b7f53e527b4483ee5cfb1530.png

Plot data for all patient subgroups (all patients, ideal candidate for thrombolysis, and all nine individual features)

df_sorted_results = df_results.sort_values('All patients', ascending=False)
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(121)
ax1.plot(df_sorted_results['All patients'], label='All', c='k')
ax1.plot(df_sorted_results['Ideal'], label='Ideal', c='r')
ax1.plot(df_sorted_results['NIHSS < 5'], label='NIHSS < 5', c='b')
ax1.plot(df_sorted_results['Estimated onset time'], 
         label='Estimated onset time', c='g')
ax1.plot(df_sorted_results['mRS > 2'], label='mRS > 2', c='orange')
ax1.plot(df_sorted_results['Age > 80'], label='Age > 80', c='m')
ax1.plot(df_sorted_results['Arrival-to-scan 60-90'], 
         label='Arrival-to-scan 60-90', c='peru')
ax1.plot(df_sorted_results['Haemorrhagic'], label='Haemorrhagic', c='y')
ax1.plot(df_sorted_results['Used AF anticoagulants'], 
         label='Used AF anticoagulants', c='lime')
ax1.plot(df_sorted_results['Onset-to-arrival time 150-180'], 
         label='Onset-to-arrival time 150-180', c='navy')
ax1.plot(df_sorted_results['Onset during sleep'],
         label='Onset during sleep', c='aqua')
ax1.set_xticklabels([])
ax1.set_xlabel('Hospital\n(ordered by thrombolysis rate for all 10k patients)')
ax1.set_ylabel('Observed percent of patients receiving thrombolysis')
ax1.legend(loc='lower center', ncol=3, bbox_to_anchor=(1, -0.4))

ax2 = fig.add_subplot(122)
ax2.plot(df_sorted_results['NIHSS < 5'], label='NIHSS < 5', c='b')
ax2.plot(df_sorted_results['Estimated onset time'], 
         label='Estimated onset time', c='g')
ax2.plot(df_sorted_results['mRS > 2'], label='mRS > 2', c='orange')
ax2.plot(df_sorted_results['Age > 80'], label='Age > 80', c='m')
ax2.plot(df_sorted_results['Arrival-to-scan 60-90'], 
         label='Arrival-to-scan 60-90', c='peru')
ax2.plot(df_sorted_results['Haemorrhagic'], label='Haemorrhagic', c='y')
ax2.plot(df_sorted_results['Used AF anticoagulants'], 
         label='Used AF anticoagulants', c='lime')
ax2.plot(df_sorted_results['Onset-to-arrival time 150-180'], 
         label='Onset-to-arrival time 150-180', c='navy')
ax2.plot(df_sorted_results['Onset during sleep'], 
         label='Onset during sleep', c='aqua')
ax2.set_xticklabels([])
ax2.set_xlabel('Hospital\n(ordered by thrombolysis rate for all 10k patients)')
ax2.set_ylabel('Percent of 10k subgroup predicted to receive thrombolysis')

plt.savefig(f'./output/{notebook}_{model_text}_all_subgroups.jpg', 
            dpi=300, bbox_inches='tight')

plt.show()

../_images/bfb78fe86646af93bf5a5843f87bf666cbc50bedff122b921150f983d731f115.png

Show a summary of results as table (observed dataset)#

df_results.describe().T

	count	mean	std	min	25%	50%	75%	max
All patients	132.0	28.618779	7.645344	6.250000	23.513019	28.007450	33.659365	48.756219
Ideal	132.0	89.175667	9.197675	50.000000	83.982143	90.454545	96.038462	100.000000
NIHSS < 5	132.0	11.422620	6.965480	0.490196	6.420988	10.572275	14.489520	40.845070
Estimated onset time	132.0	11.576738	7.882359	0.000000	5.263158	10.027830	16.461749	33.949192
mRS > 2	132.0	17.225861	8.150170	0.000000	11.085627	17.004327	22.443219	39.166667
Age > 80	132.0	24.086885	8.144390	4.545455	18.956249	23.205515	29.754550	52.307692
Arrival-to-scan 60-90	132.0	10.242127	6.820131	0.000000	5.119243	10.741758	14.829545	27.272727
Haemorrhagic	132.0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
Used AF anticoagulants	132.0	8.686992	5.319644	0.000000	4.875579	8.333333	12.007519	25.373134
Onset-to-arrival time 150-180	132.0	21.231160	8.633019	0.000000	14.787582	20.521390	28.007164	45.454545
Onset during sleep	125.0	3.355204	7.368548	0.000000	0.000000	0.000000	2.777778	37.500000
NIHSS + Precise	132.0	4.185051	4.891655	0.000000	0.000000	2.667141	6.274700	25.925926
NIHSS + Disability	132.0	4.089944	5.230294	0.000000	0.000000	2.964427	6.250000	28.571429
Precise + Disability	132.0	7.266829	6.822990	0.000000	1.895749	5.555556	10.638639	30.000000
NIHSS + Precise + Disability	131.0	1.769021	4.384437	0.000000	0.000000	0.000000	0.000000	33.333333

Show a summary of results as violin plot (observed dataset)#

Each violin represents the range of predicted thrombolysis rate across the 132 hosptials for a patient subgroup (selected from the 10k patient cohort)

cols = ['All patients', 'Ideal', 'NIHSS < 5', 'Estimated onset time', 'mRS > 2', 
        'Age > 80', 'Arrival-to-scan 60-90', 'Haemorrhagic', 
        'Used AF anticoagulants', 'Onset-to-arrival time 150-180', 
        'Onset during sleep', 'NIHSS + Precise', 'NIHSS + Disability', 
        'Precise + Disability', 'NIHSS + Precise + Disability']


labels = ['', 'All', 'Ideal\npatients', 'NIHSS<5', 'No precise\nonset', 'mRS>2',
          'Age>80', 'Arrival-to-scan\n60-90', 'Haemorrhagic', 
          'Used AF\nanticoagulants', 'Onset-to-arrival\ntime 150-180', 
          'Onset during\nsleep','NIHSS+\nPrecise', 'NIHSS+\nDisability', 
          'Precise+\nDisability', 'NIHSS+\nPrecise+\nDisability']

fig = plt.figure(figsize=(15,5))
ax = fig.add_subplot()
ax.violinplot(df_results[cols].dropna(), showmedians=True)
ax.set_ylabel('Actual thrombolysis use (%)')
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=30)
ax.grid()
plt.savefig(f'./output/{notebook}_{model_text}_'
            f'actual_subgroup_violin_all_features.jpg', 
            dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()

../_images/71b218faf51761227126a39d93fc38aa4701043e9f77a878f302c8881f3596df.png

Compare observed results with model results (predicted hospital thrombolysis rate for each subgroup created from the 10k cohort)#

Display data as boxplots

Read in model results from notebook 15.

model_results = pd.read_csv(f'./output/04b_{model_text}_groups.csv', 
                            index_col='Unnamed: 0')

Define the feature order as order of feature importance (the same order as used in notebook 03a for the violin plots). Add on a combination subgroup as an example.

cols = ['All patients', 'Ideal', 'Haemorrhagic', 'Arrival-to-scan 60-90', 
        'NIHSS < 5', 'Estimated onset time', 'mRS > 2', 
        'Used AF anticoagulants', 'Onset-to-arrival time 150-180', 
        'Age > 80', 'Onset during sleep', 'NIHSS + Precise']
labels = ['', 'All', 'Ideal\npatients', 'Haemorrhagic', 
          'Arrival-to-scan\n60-90', 'NIHSS<5', 'Estimated\nonset', 'mRS>2', 
          'Used AF\nanticoagulants', 'Onset-to-arrival\ntime 150-180', 'Age>80', 
          'Onset during\nsleep', 'NIHSS<5 +\nEstimated onset']

Plot boxplots in subplots.

fig = plt.figure(figsize=(10,10))

ax1 = fig.add_subplot(211)
ax1.boxplot(df_results[cols].dropna(),whis=[0,100])
ax1.set_ylabel('Actual thrombolysis use (%)')
ax1.set_xticks(range(len(labels)))
ax1.set_xticklabels(labels, rotation=30)
ax1.grid()
ax1.set_title('Observed hospital thrombolysis by subgroups\n'
              'Note: Each hospital has their own patients')

ax2 = fig.add_subplot(212)
ax2.boxplot(model_results[cols].dropna(),whis=[0,100])
ax2.set_ylabel('Predicted 10k thrombolysis use (%)')
ax2.set_xticks(range(len(labels)))
ax2.set_xticklabels(labels, rotation=30)
ax2.grid()
ax2.set_title('Predicted 10k thrombolysis by subgroups\n'
              'Note: Each hospital has the same 10k patients')

plt.tight_layout(pad=1)
plt.savefig(f'./output/{notebook}_{model_text}_'
            f'actual_vs_modelled_subgroup_boxplot.jpg', 
            dpi=300, bbox_inches='tight', pad_inches=0.2)
plt.show()

../_images/fb7668498106a4f1261aa993b928851f749c1e01f36f7befc4ce2d926a7dddd4.png

Calculate standard deviation and coefficient of variation for each feature.#

Coefficient of variation represents the extent of variability wrt mean.

It is defined as std dev / mean (https://en.wikipedia.org/wiki/Coefficient_of_variation)

Haemorragic has mean = 0, so not possible to calculate coefficient of variation for this feature

cv_cols = ['All patients', 'Ideal', 'Arrival-to-scan 60-90', 
        'NIHSS < 5', 'Estimated onset time', 'mRS > 2', 
        'Used AF anticoagulants', 'Onset-to-arrival time 150-180', 
        'Age > 80', 'Onset during sleep', 'NIHSS + Precise']

# Calculate standard deviation of thrombolysis rate for each patient subgroup
temp_df1 = pd.DataFrame(data=df_results[cols].std(),
                        columns=["standard deviation"])

# Define function to calculate cv
cv = lambda x: np.std(x, ddof=1) / np.mean(x) * 100 
# Calculate cv of thrombolysis rate for each patient subgroup
temp_df2 = pd.DataFrame(data=df_results[cv_cols].apply(cv),
                        columns=["coefficient of variation"])

# Join both dataframes together
temp_df1 = temp_df1.join(temp_df2)

temp_df1

	standard deviation	coefficient of variation
All patients	7.645344	26.714431
Ideal	9.197675	10.314108
Haemorrhagic	0.000000	NaN
Arrival-to-scan 60-90	6.820131	66.589004
NIHSS < 5	6.965480	60.979706
Estimated onset time	7.882359	68.087912
mRS > 2	8.150170	47.313570
Used AF anticoagulants	5.319644	61.236898
Onset-to-arrival time 150-180	8.633019	40.662020
Age > 80	8.144390	33.812551
Onset during sleep	7.368548	219.615516
NIHSS + Precise	4.891655	116.884001

Check correlations between pairs of patient subgroups (from the observed dataset)#

features = ['All patients', 'Ideal', 'NIHSS < 5', 'Estimated onset time', 
            'mRS > 2', 'Age > 80', 'Arrival-to-scan 60-90', 'Haemorrhagic', 
            'Used AF anticoagulants', 'Onset-to-arrival time 150-180',
            'Onset during sleep']

print ('Correlations between subgroups:')
print ('-------------------------------')

# Loop through all features
for feat1 in features:
    # Loop through all features from the chosen feature
    for feat2 in features[features.index(feat1)+1:]:
        # If can, calculate correlation between feature pair
        try:
            slope, intercept, r_value, p_value, std_err = \
                stats.linregress(df_results[feat1],df_results[feat2])
            print (f'{feat1} - {feat2}: r-square = {r_value**2:0.3f}, '
                   f'p = {p_value:0.3f}')

        # Otherwise report not possible
        except:
            print (f'{feat1} - {feat2}: Correlation cannot be calulated')

Correlations between subgroups:
-------------------------------
All patients - Ideal: r-square = 0.006, p = 0.365
All patients - NIHSS < 5: r-square = 0.661, p = 0.000
All patients - Estimated onset time: r-square = 0.344, p = 0.000
All patients - mRS > 2: r-square = 0.528, p = 0.000
All patients - Age > 80: r-square = 0.801, p = 0.000
All patients - Arrival-to-scan 60-90: r-square = 0.158, p = 0.000
All patients - Haemorrhagic: r-square = 0.000, p = 1.000
All patients - Used AF anticoagulants: r-square = 0.321, p = 0.000
All patients - Onset-to-arrival time 150-180: r-square = 0.663, p = 0.000
All patients - Onset during sleep: r-square = nan, p = nan
Ideal - NIHSS < 5: r-square = 0.003, p = 0.520
Ideal - Estimated onset time: r-square = 0.000, p = 0.875
Ideal - mRS > 2: r-square = 0.010, p = 0.258
Ideal - Age > 80: r-square = 0.004, p = 0.491
Ideal - Arrival-to-scan 60-90: r-square = 0.002, p = 0.583
Ideal - Haemorrhagic: r-square = 0.000, p = 1.000
Ideal - Used AF anticoagulants: r-square = 0.000, p = 0.835
Ideal - Onset-to-arrival time 150-180: r-square = 0.001, p = 0.690
Ideal - Onset during sleep: r-square = nan, p = nan
NIHSS < 5 - Estimated onset time: r-square = 0.221, p = 0.000
NIHSS < 5 - mRS > 2: r-square = 0.308, p = 0.000
NIHSS < 5 - Age > 80: r-square = 0.473, p = 0.000
NIHSS < 5 - Arrival-to-scan 60-90: r-square = 0.068, p = 0.003
NIHSS < 5 - Haemorrhagic: r-square = 0.000, p = 1.000
NIHSS < 5 - Used AF anticoagulants: r-square = 0.186, p = 0.000
NIHSS < 5 - Onset-to-arrival time 150-180: r-square = 0.432, p = 0.000
NIHSS < 5 - Onset during sleep: r-square = nan, p = nan
Estimated onset time - mRS > 2: r-square = 0.289, p = 0.000
Estimated onset time - Age > 80: r-square = 0.342, p = 0.000
Estimated onset time - Arrival-to-scan 60-90: r-square = 0.031, p = 0.044
Estimated onset time - Haemorrhagic: r-square = 0.000, p = 1.000
Estimated onset time - Used AF anticoagulants: r-square = 0.206, p = 0.000
Estimated onset time - Onset-to-arrival time 150-180: r-square = 0.330, p = 0.000
Estimated onset time - Onset during sleep: r-square = nan, p = nan
mRS > 2 - Age > 80: r-square = 0.680, p = 0.000
mRS > 2 - Arrival-to-scan 60-90: r-square = 0.040, p = 0.021
mRS > 2 - Haemorrhagic: r-square = 0.000, p = 1.000
mRS > 2 - Used AF anticoagulants: r-square = 0.293, p = 0.000
mRS > 2 - Onset-to-arrival time 150-180: r-square = 0.408, p = 0.000
mRS > 2 - Onset during sleep: r-square = nan, p = nan
Age > 80 - Arrival-to-scan 60-90: r-square = 0.080, p = 0.001
Age > 80 - Haemorrhagic: r-square = 0.000, p = 1.000
Age > 80 - Used AF anticoagulants: r-square = 0.351, p = 0.000
Age > 80 - Onset-to-arrival time 150-180: r-square = 0.598, p = 0.000
Age > 80 - Onset during sleep: r-square = nan, p = nan
Arrival-to-scan 60-90 - Haemorrhagic: r-square = 0.000, p = 1.000
Arrival-to-scan 60-90 - Used AF anticoagulants: r-square = 0.057, p = 0.006
Arrival-to-scan 60-90 - Onset-to-arrival time 150-180: r-square = 0.059, p = 0.005
Arrival-to-scan 60-90 - Onset during sleep: r-square = nan, p = nan
Haemorrhagic - Used AF anticoagulants: r-square = 0.000, p = 1.000
Haemorrhagic - Onset-to-arrival time 150-180: r-square = 0.000, p = 1.000
Haemorrhagic - Onset during sleep: r-square = 0.000, p = 1.000
Used AF anticoagulants - Onset-to-arrival time 150-180: r-square = 0.208, p = 0.000
Used AF anticoagulants - Onset during sleep: r-square = nan, p = nan
Onset-to-arrival time 150-180 - Onset during sleep: r-square = nan, p = nan

/home/kerry/miniconda3/envs/samuel2/lib/python3.8/site-packages/scipy/stats/_stats_mstats_common.py:170: RuntimeWarning: invalid value encountered in double_scalars
  slope = ssxym / ssxm
/home/kerry/miniconda3/envs/samuel2/lib/python3.8/site-packages/scipy/stats/_stats_mstats_common.py:187: RuntimeWarning: divide by zero encountered in double_scalars
  slope_stderr = np.sqrt((1 - r**2) * ssym / ssxm / df)
/home/kerry/miniconda3/envs/samuel2/lib/python3.8/site-packages/scipy/stats/_stats_mstats_common.py:194: RuntimeWarning: invalid value encountered in double_scalars
  intercept_stderr = slope_stderr * np.sqrt(ssxm + xmean**2)

Analyse correlation between observed and predicted subgroup thrombolysis use#

Collate results

df_compare = pd.DataFrame()
df_compare['predicted'] = model_results.values.flatten()
df_compare['observed'] = df_results.values.flatten()
df_compare.dropna(inplace=True)

Create scatter plot

fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot()
ax.scatter(df_compare['predicted'], df_compare['observed'], 
    marker='x', s=10, c='k', alpha=0.5)

ax.set_xlabel('Predicted subgroup thrombolysis use (%)')
ax.set_ylabel('Observed subgroup thrombolysis use (%)')

# Calculate best fit
slope, intercept, r_value, p_value, std_err = \
        stats.linregress(df_compare['predicted'], df_compare['observed'])

# Plot best fit
ax.plot([0, 100], [intercept, intercept + (100 * slope)], 
    linestyle='--', c='k', linewidth=1)

text = (f'r-squared = {r_value**2:0.3f}\n'
    '(predicted and observed subgroup patients are different)')

ax.text(0, 96, text, bbox=dict(facecolor='white', edgecolor='white'))
ax.grid()
plt.savefig(f'./output/{notebook}_{model_text}_subgroup_correlation.jpg', 
            dpi=300)
plt.show()

../_images/2e9803515969a523083c7b10d7a1aa39b9507586dd32015c00aa97293d777518.png

The observed and predicted subgroup analysis show very similar general patterns (with r-squared=0.93).