Detecting at-risk students#
In this chapter, we attempt to replicate the student learning achievement model originally introduced by Al-Shabandar et al. [ASHLK19], which aims to identify students at risk of failure or withdrawal from an online course.
The approach of Al-Shabandar et al. consists of the following steps:

Data aggregation:
Student click interactions are aggregated by activity type. This aggregation computes both the total sum of clicks and the interaction frequency for each distinct activity type.

Data cleaning:
Highly correlated features (correlation > 0.8) and near-zero variance predictors are removed.

Data normalization:
Features are normalized with the Yeo-Johnson transformation. Additionally, to address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to augment the representation of the minority class.

Model training:
Four distinct machine-learning algorithms are trained and fine-tuned, optimized through a combination of random search and grid search on the `BBB_2013B` course dataset. Model performance is assessed through ten-fold cross-validation, with a 70/30 training/test split.

Model evaluation:
The models' predictive capabilities are evaluated on the subsequent `BBB_2013J` course dataset using several quality metrics, such as the F-measure, sensitivity, specificity, and AUC, to comprehensively assess their efficacy and generalizability.
[ASHLK19] Raghad Al-Shabandar, Abir Jaafar Hussain, Panos Liatsis, and Robert Keight. Detecting at-risk students with early interventions using machine learning techniques. IEEE Access, 7:149464–149478, 2019. doi:10.1109/ACCESS.2019.2943351.
from shutil import rmtree
from tempfile import mkdtemp
import numpy as np
import pandas as pd
import plotly.express as px
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from IPython.display import display
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
accuracy_score,
f1_score,
recall_score,
roc_auc_score,
roc_curve,
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import PowerTransformer
from oulad import get_oulad
%load_ext oulad.capture
%%capture oulad
oulad = get_oulad()
Data aggregation#
We construct the `feature_table` DataFrame, which aggregates student VLE interactions by activity type. Both the total sum of clicks and the interaction frequency are computed for each activity type.
%%capture -ns detecting_at_risk_students feature_table
feature_table = (
oulad.vle.query("code_module == 'BBB' and code_presentation in ['2013B', '2013J']")
.drop(columns=["code_module", "code_presentation"])
.merge(oulad.student_vle[["id_student", "id_site", "sum_click"]], on="id_site")
.groupby(["id_student", "activity_type"])
.agg({"sum_click": ["sum", "count"]})
.pivot_table(
values="sum_click",
index="id_student",
columns="activity_type",
fill_value=0,
)
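    # Flatten the (statistic, activity_type) column MultiIndex into names like count_forumng and sum_forumng.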
.pipe(lambda df: setattr(df, "columns", df.columns.map("_".join)) or df)
.join(
oulad.student_info.query(
"code_module == 'BBB' and code_presentation in ['2013B', '2013J']"
)
.loc[:, ["id_student", "final_result", "code_presentation"]]
.assign(final_result=lambda df: df.final_result.isin(["Pass", "Distinction"]))
.set_index("id_student")
)
)
display(feature_table)
id_student | count_forumng | count_glossary | count_homepage | count_oucollaborate | count_oucontent | count_ouelluminate | count_quiz | count_resource | count_sharedsubpage | count_subpage | ... | sum_oucollaborate | sum_oucontent | sum_ouelluminate | sum_quiz | sum_resource | sum_sharedsubpage | sum_subpage | sum_url | final_result | code_presentation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23629 | 24.0 | 0.0 | 16.0 | 0.0 | 0.0 | 0.0 | 15.0 | 2.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 31.0 | 2.0 | 0.0 | 5.0 | 0.0 | False | 2013B |
23798 | 76.0 | 1.0 | 77.0 | 3.0 | 6.0 | 0.0 | 48.0 | 16.0 | 0.0 | 33.0 | ... | 3.0 | 44.0 | 0.0 | 104.0 | 21.0 | 0.0 | 47.0 | 56.0 | True | 2013J |
25107 | 321.0 | 1.0 | 114.0 | 0.0 | 1.0 | 1.0 | 45.0 | 13.0 | 0.0 | 14.0 | ... | 0.0 | 1.0 | 1.0 | 85.0 | 23.0 | 0.0 | 21.0 | 14.0 | True | 2013B |
27759 | 47.0 | 2.0 | 46.0 | 0.0 | 6.0 | 0.0 | 32.0 | 17.0 | 0.0 | 24.0 | ... | 0.0 | 17.0 | 0.0 | 88.0 | 19.0 | 0.0 | 35.0 | 15.0 | False | 2013J |
27891 | 56.0 | 0.0 | 20.0 | 0.0 | 1.0 | 0.0 | 19.0 | 6.0 | 0.0 | 6.0 | ... | 0.0 | 1.0 | 0.0 | 38.0 | 6.0 | 0.0 | 11.0 | 6.0 | False | 2013B |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2685831 | 121.0 | 1.0 | 110.0 | 0.0 | 0.0 | 0.0 | 46.0 | 30.0 | 0.0 | 26.0 | ... | 0.0 | 0.0 | 0.0 | 148.0 | 35.0 | 0.0 | 49.0 | 9.0 | True | 2013B |
2691100 | 18.0 | 0.0 | 62.0 | 0.0 | 6.0 | 0.0 | 40.0 | 49.0 | 0.0 | 38.0 | ... | 0.0 | 22.0 | 0.0 | 102.0 | 70.0 | 0.0 | 104.0 | 56.0 | True | 2013J |
2691566 | 18.0 | 0.0 | 17.0 | 0.0 | 2.0 | 0.0 | 30.0 | 5.0 | 0.0 | 6.0 | ... | 0.0 | 33.0 | 0.0 | 81.0 | 6.0 | 0.0 | 13.0 | 1.0 | False | 2013J |
2692384 | 133.0 | 0.0 | 112.0 | 0.0 | 1.0 | 0.0 | 32.0 | 54.0 | 0.0 | 43.0 | ... | 0.0 | 3.0 | 0.0 | 73.0 | 81.0 | 0.0 | 144.0 | 14.0 | True | 2013B |
2693772 | 6.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 1.0 | 0.0 | False | 2013J |
3421 rows × 24 columns
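As a quick check on the aggregated table, we can count the number of students in each of the two selected course presentations; a small sketch using the `feature_table` defined above:

# Number of aggregated student rows per course presentation.
display(feature_table.groupby("code_presentation").size())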
Data cleaning#
Note
Near-zero variance predictors are removed in the subsequent section, as this filter is implemented directly in the model training pipeline.
Train/test/validation split#
To avoid leaking statistics from our validation data into the trained model, we split the dataset into a train/test set (the `BBB_2013B` course presentation) and a validation set (the `BBB_2013J` course presentation) prior to data cleaning.
first_course_mask = feature_table["code_presentation"] == "2013B"
train_test_set = feature_table[first_course_mask].drop(columns="code_presentation")
x_train_test = train_test_set.drop(columns="final_result")
y_train_test = train_test_set.final_result.values
validation_set = feature_table[~first_course_mask].drop(columns="code_presentation")
x_validation = validation_set.drop(columns="final_result")
y_validation = validation_set.final_result.values
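The first data-cleaning step of the original study, dropping one feature from every pair with a correlation above 0.8, is not part of the pipeline used below. The following is a minimal illustrative sketch of how it could be applied to the train/test features; the `drop_correlated_features` helper and `x_train_test_reduced` variable are ours for illustration, not from the paper.

def drop_correlated_features(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute Pearson correlation exceeds the threshold."""
    correlations = features.corr().abs()
    # Keep the upper triangle only, so that each feature pair is inspected once.
    upper = correlations.where(np.triu(np.ones(correlations.shape, dtype=bool), k=1))
    to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
    return features.drop(columns=to_drop)

x_train_test_reduced = drop_correlated_features(x_train_test)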
Data normalization, model training and evaluation#
In this section, we define the data normalization and model training pipeline. As in the work of Al-Shabandar et al., we remove near-zero variance predictors (`VarianceThreshold`), normalize the features with the Yeo-Johnson transformation (`PowerTransformer`), and apply the Synthetic Minority Over-sampling Technique (`SMOTE`).
We conduct a hyperparameter grid search on 70% of the `BBB_2013B` course dataset.
Note
We have commented out the hyperparameter ranges and replaced them with the selected values to reduce the execution time of this notebook.
Finally, we assess the model performance using the remaining 30% of the `BBB_2013B` course, which serves as the test set. In addition, the full `BBB_2013J` course dataset is employed as the validation set.
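Before running the grid search, it is worth checking the class balance that motivates the SMOTE step; a quick sketch using the labels defined above:

# Share of students passing (True) versus failing or withdrawing (False) in BBB_2013B.
print(pd.Series(y_train_test).value_counts(normalize=True))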
%%capture -ns detecting_at_risk_students scores_roc_curves
cachedir = mkdtemp()
grid = {
GradientBoostingClassifier: {
"classifier__learning_rate": [0.03], # [0.03, 0.1, 0.6, 1.1, 1.6],
"classifier__loss": ["exponential"], # ["log_loss", "exponential"],
"classifier__max_depth": [3], # [3, 4, None],
"classifier__min_samples_leaf": [40], # list(range(10, 100, 10)),
"classifier__min_samples_split": [20], # list(range(10, 100, 10)),
"classifier__n_estimators": [10], # [10, 50, 100],
"classifier__n_iter_no_change": [10], # [None, 10],
},
RandomForestClassifier: {
"classifier__criterion": ["gini"], # ["gini", "entropy", "log_loss"],
"classifier__max_depth": [3], # [None, *list(range(1, 20, 2))],
"classifier__min_samples_leaf": [10], # list(range(10, 100, 10)),
"classifier__min_samples_split": [50], # list(range(10, 100, 10)),
"classifier__n_estimators": [50], # [10, 50, 100],
},
MLPClassifier: {
"classifier__alpha": [0.1], # [1/10**x for x in range(1, 7)],
"classifier__early_stopping": [True], # [True, False],
"classifier__hidden_layer_sizes": [(135,)], # (x,) for x in range(10, 200, 5),
"classifier__learning_rate": ["constant"], # constant, adaptive, invscaling,
"classifier__learning_rate_init": [0.3], # [0.001, 0.1, 0.3],
"classifier__max_iter": [1200],
"classifier__momentum": [0.9], # [0.2, 0.5, 0.9],
"classifier__solver": ["sgd"],
},
LogisticRegression: {
"classifier__penalty": ["l1"], # ["l1", "l2", None],
"classifier__solver": ["saga"],
"classifier__tol": [1e-4], # [1e-03, 1e-04, 1e-05, 1e-06, 1e-07, 1e-08],
},
}
pipe = Pipeline(
[
("variance_threshold", VarianceThreshold(3)),
("smote", SMOTE()),
("power_transformer", PowerTransformer()),
("classifier", None),
],
memory=cachedir,
)
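# Common grid of false positive rates used to interpolate and average the validation ROC curves over the repeated runs.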
mean_fpr = np.linspace(0, 1, 100)
scores = {
"classifier": [],
"test_accuracy": [],
"test_f1": [],
"test_sensitivity": [],
"test_specificity": [],
"test_AUC": [],
"validation_accuracy": [],
"validation_f1": [],
"validation_sensitivity": [],
"validation_specificity": [],
"validation_AUC": [],
}
roc_curves = {}
def get_scores_and_roc_curves() -> tuple[dict, dict]:
    """Compute the prediction scores and ROC curves for each classifier."""
for classifier_class, hyper_parameters in grid.items():
tprs = []
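        # Repeat the split/tune/evaluate procedure five times; the scores are averaged per classifier when displayed below.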
for _ in range(5):
pipe.set_params(classifier=classifier_class())
classifier = GridSearchCV(
pipe,
hyper_parameters,
scoring="f1",
n_jobs=-1,
error_score="raise",
cv=StratifiedKFold(n_splits=10, shuffle=True),
refit=True,
)
x_train, x_test, y_train, y_test = train_test_split(
x_train_test, y_train_test, test_size=0.3
)
classifier.fit(x_train, y_train)
y_test_pred = classifier.predict(x_test)
y_validation_pred = classifier.predict(x_validation)
scores["classifier"].append(classifier_class.__name__)
for name in ["test", "validation"]:
y_true = y_test if name == "test" else y_validation
y_pred = y_test_pred if name == "test" else y_validation_pred
scores[f"{name}_accuracy"].append(accuracy_score(y_true, y_pred))
scores[f"{name}_f1"].append(f1_score(y_true, y_pred))
scores[f"{name}_sensitivity"].append(recall_score(y_true, y_pred))
scores[f"{name}_specificity"].append(
recall_score(y_true, y_pred, pos_label=0)
)
scores[f"{name}_AUC"].append(roc_auc_score(y_true, y_pred))
fpr, tpr, _ = roc_curve(y_validation, y_validation_pred)
tprs.append(np.interp(mean_fpr, fpr, tpr))
roc_curves[classifier_class.__name__] = np.mean(tprs, axis=0)
return scores, roc_curves
scores_roc_curves = get_scores_and_roc_curves()
display(pd.DataFrame(scores_roc_curves[0]).groupby("classifier").mean())
px.line(
pd.DataFrame(scores_roc_curves[1], index=mean_fpr)
.melt(var_name="Classifier", value_name="True Positive Rate", ignore_index=False)
.rename_axis(index="False Positive Rate")
.reset_index(),
x="False Positive Rate",
y="True Positive Rate",
color="Classifier",
title="ROC curve for the validation set",
width=800,
height=600,
).show()
rmtree(cachedir)
classifier | test_accuracy | test_f1 | test_sensitivity | test_specificity | test_AUC | validation_accuracy | validation_f1 | validation_sensitivity | validation_specificity | validation_AUC |
---|---|---|---|---|---|---|---|---|---|---|
GradientBoostingClassifier | 0.837069 | 0.859013 | 0.959950 | 0.704247 | 0.832098 | 0.858498 | 0.884687 | 0.950560 | 0.735901 | 0.843230 |
LogisticRegression | 0.836638 | 0.851489 | 0.911835 | 0.758487 | 0.835161 | 0.804262 | 0.823131 | 0.797948 | 0.812671 | 0.805309 |
MLPClassifier | 0.844397 | 0.859135 | 0.930380 | 0.755372 | 0.842876 | 0.744273 | 0.745048 | 0.677612 | 0.833043 | 0.755328 |
RandomForestClassifier | 0.831034 | 0.847872 | 0.919929 | 0.737445 | 0.828687 | 0.834843 | 0.858208 | 0.875933 | 0.780124 | 0.828029 |
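As a closing illustration (not part of the original study), a single tuned model can be refitted on the full `BBB_2013B` presentation and used to list which `BBB_2013J` students it would flag as at risk. A minimal sketch reusing the selected `GradientBoostingClassifier` hyperparameters from the grid above; the `at_risk_pipe` and `at_risk_ids` names are ours for illustration:

# Refit one tuned classifier on the full BBB_2013B presentation (without the
# grid search) and flag the BBB_2013J students predicted not to pass.
at_risk_pipe = Pipeline(
    [
        ("variance_threshold", VarianceThreshold(3)),
        ("smote", SMOTE()),
        ("power_transformer", PowerTransformer()),
        (
            "classifier",
            GradientBoostingClassifier(
                learning_rate=0.03,
                loss="exponential",
                max_depth=3,
                min_samples_leaf=40,
                min_samples_split=20,
                n_estimators=10,
                n_iter_no_change=10,
            ),
        ),
    ]
)
at_risk_pipe.fit(x_train_test, y_train_test)
at_risk_ids = x_validation.index[~at_risk_pipe.predict(x_validation)]
print(f"{len(at_risk_ids)} of {len(x_validation)} BBB_2013J students flagged as at risk")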