Detecting at-risk students#

In this chapter, we attempt to replicate the student learning achievement model originally introduced by Al-Shabandar et al. [ASHLK19], which aims to identify students at risk of failure or withdrawal from an online course.

The approach of Al-Shabandar et al. consists of the following steps:

  1. Data aggregation:

    Student click interactions are aggregated by activity type, computing both the total sum of clicks and the interaction frequency for each distinct activity type.

  2. Data cleaning:

    Features with a pairwise correlation above 0.8 and near-zero-variance predictors are removed.

  3. Data normalization:

    Following this, features are normalized with the Yeo-Johnson transformation. Additionally, to address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to augment the representation of the minority class.

  4. Model training:

    Four distinct machine-learning algorithms are trained and fine-tuned. Their hyperparameters are optimized through a combination of random search and grid search on the BBB_2013B course dataset. Model performance is assessed through ten-fold cross-validation on a 70/30 training/test split.

  5. Model evaluation:

    The model’s predictive capabilities are evaluated on the subsequent BBB_2013J course dataset using several quality metrics, such as the F-measure, sensitivity, specificity, and AUC, to comprehensively assess the model’s efficacy and generalizability (illustrated in the short sketch below).

[ASHLK19]

Raghad Al-Shabandar, Abir Jaafar Hussain, Panos Liatsis, and Robert Keight. Detecting at-risk students with early interventions using machine learning techniques. IEEE Access, 7:149464–149478, 2019. doi:10.1109/ACCESS.2019.2943351.
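As a quick illustration of the quality metrics from step 5, the following minimal sketch computes them with scikit-learn on made-up labels (not taken from OULAD): sensitivity is the recall of the positive class and specificity the recall of the negative class, mirroring how they are computed later in this chapter.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

# Hypothetical ground truth and predictions, for illustration only.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("F-measure:  ", f1_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))               # recall of the positive class
print("specificity:", recall_score(y_true, y_pred, pos_label=0))  # recall of the negative class
print("AUC:        ", roc_auc_score(y_true, y_pred))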

from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
import pandas as pd
import plotly.express as px
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from IPython.display import display
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import PowerTransformer

from oulad import get_oulad

%load_ext oulad.capture
%%capture oulad
oulad = get_oulad()

Data aggregation#

We construct the feature_table DataFrame, which aggregates student VLE interactions by activity type. Both the total sum of clicks and the interaction frequency are computed for each activity type.

%%capture -ns detecting_at_risk_students feature_table
feature_table = (
    # Select VLE activities from the BBB 2013B and 2013J presentations.
    oulad.vle.query("code_module == 'BBB' and code_presentation in ['2013B', '2013J']")
    .drop(columns=["code_module", "code_presentation"])
    # Attach the per-student click records and aggregate them by activity type.
    .merge(oulad.student_vle[["id_student", "id_site", "sum_click"]], on="id_site")
    .groupby(["id_student", "activity_type"])
    .agg({"sum_click": ["sum", "count"]})
    # Reshape to one row per student, with one column per (aggregation, activity type).
    .pivot_table(
        values="sum_click",
        index="id_student",
        columns="activity_type",
        fill_value=0,
    )
    # Flatten the MultiIndex columns, e.g. ("count", "forumng") -> "count_forumng".
    .pipe(lambda df: setattr(df, "columns", df.columns.map("_".join)) or df)
    # Add the binary target (Pass/Distinction vs. other) and the course presentation.
    .join(
        oulad.student_info.query(
            "code_module == 'BBB' and code_presentation in ['2013B', '2013J']"
        )
        .loc[:, ["id_student", "final_result", "code_presentation"]]
        .assign(final_result=lambda df: df.final_result.isin(["Pass", "Distinction"]))
        .set_index("id_student")
    )
)
display(feature_table)
[Output] feature_table — 3421 rows × 24 columns: one count_<activity_type> and one sum_<activity_type> column per activity type (e.g. count_forumng, sum_quiz), indexed by id_student, plus the final_result and code_presentation columns.

Data cleaning#

Note

Near-zero-variance predictors are removed later: that part of the data cleaning is implemented directly in the model training pipeline. The short sketch below illustrates how VarianceThreshold behaves.
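Here is a minimal sketch (on made-up data, not OULAD) of how scikit-learn's VarianceThreshold drops low-variance columns; the training pipeline below uses a threshold of 3.

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three illustrative features: a near-constant one, a low-variance one and a spread-out one.
X = np.array(
    [
        [0, 1, 2],
        [0, 2, 40],
        [0, 1, 35],
        [1, 2, 9],
    ]
)
selector = VarianceThreshold(threshold=3)
X_reduced = selector.fit_transform(X)
# Only the last column has a variance above 3 and is kept.
print(selector.variances_, X_reduced.shape)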

Train test validation split#

To avoid leaking statistics from our validation data into the trained model, we split the dataset into a train/test set (the BBB_2013B course presentation) and a validation set (the BBB_2013J course presentation) before performing data cleaning.

first_course_mask = feature_table["code_presentation"] == "2013B"
train_test_set = feature_table[first_course_mask].drop(columns="code_presentation")
x_train_test = train_test_set.drop(columns="final_result")
y_train_test = train_test_set.final_result.values
validation_set = feature_table[~first_course_mask].drop(columns="code_presentation")
x_validation = validation_set.drop(columns="final_result")
y_validation = validation_set.final_result.values
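As a side check (not part of the original study), the class imbalance that motivates the SMOTE step later on can be inspected directly on the splits defined above.

# Share of passing (True) and failing/withdrawing (False) students in each split.
display(
    pd.DataFrame(
        {
            "train/test (2013B)": pd.Series(y_train_test).value_counts(normalize=True),
            "validation (2013J)": pd.Series(y_validation).value_counts(normalize=True),
        }
    )
)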

Highly correlated features removal#

We identify pairs of highly correlated features (correlation > 0.8) in the train set and remove one feature from each pair in both the train and validation sets.

x_train_test_corr = x_train_test.corr()
px.imshow(
    x_train_test_corr.round(2),
    title="Feature Correlation Matrix before feature removal",
    text_auto=True,
    color_continuous_scale="greens",
    width=800,
    height=600,
).show()
highly_correlated_features = (
    # Keep only the cells of the correlation matrix above the 0.8 threshold.
    x_train_test_corr[(x_train_test_corr > 0.8)]
    .stack()
    .index.to_frame()
    .reset_index(drop=True)
    # Discard the diagonal (each feature is perfectly correlated with itself).
    .pipe(lambda df: df[df[0] != df[1]])
    # Build an order-independent key per pair to deduplicate (a, b) and (b, a).
    .apply(lambda col: "__".join(sorted([col[0], col[1]])), axis=1)
    .drop_duplicates(ignore_index=True)
    # Keep the first feature of each pair as the one to drop.
    .map(lambda x: x.split("__")[0])
    .unique()
)
x_train_test = x_train_test.drop(columns=highly_correlated_features)
x_validation = x_validation.drop(columns=highly_correlated_features)
display(highly_correlated_features)
array(['count_forumng', 'count_glossary', 'count_homepage',
       'count_oucollaborate', 'count_ouelluminate', 'count_quiz',
       'count_resource', 'count_sharedsubpage', 'count_subpage',
       'count_url', 'sum_forumng', 'sum_resource'], dtype=object)
px.imshow(
    x_train_test.corr().round(2),
    title="Feature Correlation Matrix after feature removal",
    text_auto=True,
    color_continuous_scale="greens",
    width=800,
    height=600,
).show()

Data normalization, model training and evaluation#

In this section, we define the data normalization and model training pipeline.

As in the work of Al-Shabandar et al., we remove near-zero variance predictors (VarianceThreshold), normalize the features with the Yeo-Johnson transformation (PowerTransformer), and apply the Synthetic Minority Over-sampling Technique (SMOTE).
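One detail worth highlighting: placing SMOTE inside an imbalanced-learn Pipeline means the oversampling is applied only while fitting, never when predicting on test or validation data. Here is a minimal sketch on synthetic data (the dataset parameters are arbitrary and purely illustrative):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy problem: roughly 90% negative / 10% positive samples.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
toy_pipe = Pipeline(
    [
        ("smote", SMOTE(random_state=0)),  # resamples the training data only
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
toy_pipe.fit(X, y)  # SMOTE balances the classes before the classifier is fit
print(toy_pipe.predict(X[:5]))  # prediction uses the fitted classifier only, no resampling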

We conduct a hyperparameter grid search on 70% of the BBB_2013B course dataset.

Note

We have commented out the hyperparameter ranges and replaced them with the selected values to reduce the execution time of this notebook.

Finally, we assess the model performance using the remaining 30% of the BBB_2013B course, which serves as the test set. In addition, the full BBB_2013J course dataset is employed as the validation set.

%%capture -ns detecting_at_risk_students scores_roc_curves
cachedir = mkdtemp()
grid = {
    GradientBoostingClassifier: {
        "classifier__learning_rate": [0.03],  # [0.03, 0.1, 0.6, 1.1, 1.6],
        "classifier__loss": ["exponential"],  # ["log_loss", "exponential"],
        "classifier__max_depth": [3],  # [3, 4, None],
        "classifier__min_samples_leaf": [40],  # list(range(10, 100, 10)),
        "classifier__min_samples_split": [20],  # list(range(10, 100, 10)),
        "classifier__n_estimators": [10],  # [10, 50, 100],
        "classifier__n_iter_no_change": [10],  # [None, 10],
    },
    RandomForestClassifier: {
        "classifier__criterion": ["gini"],  # ["gini", "entropy", "log_loss"],
        "classifier__max_depth": [3],  # [None, *list(range(1, 20, 2))],
        "classifier__min_samples_leaf": [10],  # list(range(10, 100, 10)),
        "classifier__min_samples_split": [50],  # list(range(10, 100, 10)),
        "classifier__n_estimators": [50],  # [10, 50, 100],
    },
    MLPClassifier: {
        "classifier__alpha": [0.1],  # [1/10**x for x in range(1, 7)],
        "classifier__early_stopping": [True],  # [True, False],
        "classifier__hidden_layer_sizes": [(135,)],  # (x,) for x in range(10, 200, 5),
        "classifier__learning_rate": ["constant"],  # constant, adaptive, invscaling,
        "classifier__learning_rate_init": [0.3],  # [0.001, 0.1, 0.3],
        "classifier__max_iter": [1200],
        "classifier__momentum": [0.9],  # [0.2, 0.5, 0.9],
        "classifier__solver": ["sgd"],
    },
    LogisticRegression: {
        "classifier__penalty": ["l1"],  # ["l1", "l2", None],
        "classifier__solver": ["saga"],
        "classifier__tol": [1e-4],  # [1e-03, 1e-04, 1e-05, 1e-06, 1e-07, 1e-08],
    },
}
pipe = Pipeline(
    [
        # Remove near-zero variance predictors (variance <= 3).
        ("variance_threshold", VarianceThreshold(3)),
        # Oversample the minority class (applied on the training folds only).
        ("smote", SMOTE()),
        # Yeo-Johnson normalization.
        ("power_transformer", PowerTransformer()),
        # Placeholder, replaced by each of the four classifiers during grid search.
        ("classifier", None),
    ],
    memory=cachedir,
)
# Common false-positive-rate grid on which the ROC curves are averaged.
mean_fpr = np.linspace(0, 1, 100)
scores = {
    "classifier": [],
    "test_accuracy": [],
    "test_f1": [],
    "test_sensitivity": [],
    "test_specificity": [],
    "test_AUC": [],
    "validation_accuracy": [],
    "validation_f1": [],
    "validation_sensitivity": [],
    "validation_specificity": [],
    "validation_AUC": [],
}
roc_curves = {}


def get_scores_and_roc_curves() -> tuple[dict, dict]:
    """Compute the prediction scores and ROC curves for each classifier."""

    for classifier_class, hyper_parameters in grid.items():
        tprs = []
        # Repeat the 70/30 split, grid search and evaluation five times to
        # average out the variability of a single random split.
        for _ in range(5):
            pipe.set_params(classifier=classifier_class())
            classifier = GridSearchCV(
                pipe,
                hyper_parameters,
                scoring="f1",
                n_jobs=-1,
                error_score="raise",
                cv=StratifiedKFold(n_splits=10, shuffle=True),
                refit=True,
            )
            x_train, x_test, y_train, y_test = train_test_split(
                x_train_test, y_train_test, test_size=0.3
            )
            classifier.fit(x_train, y_train)
            y_test_pred = classifier.predict(x_test)
            y_validation_pred = classifier.predict(x_validation)
            scores["classifier"].append(classifier_class.__name__)
            for name in ["test", "validation"]:
                y_true = y_test if name == "test" else y_validation
                y_pred = y_test_pred if name == "test" else y_validation_pred
                scores[f"{name}_accuracy"].append(accuracy_score(y_true, y_pred))
                scores[f"{name}_f1"].append(f1_score(y_true, y_pred))
                scores[f"{name}_sensitivity"].append(recall_score(y_true, y_pred))
                scores[f"{name}_specificity"].append(
                    recall_score(y_true, y_pred, pos_label=0)
                )
                scores[f"{name}_AUC"].append(roc_auc_score(y_true, y_pred))

            fpr, tpr, _ = roc_curve(y_validation, y_validation_pred)
            tprs.append(np.interp(mean_fpr, fpr, tpr))

        roc_curves[classifier_class.__name__] = np.mean(tprs, axis=0)

    return scores, roc_curves


scores_roc_curves = get_scores_and_roc_curves()
display(pd.DataFrame(scores_roc_curves[0]).groupby("classifier").mean())
px.line(
    pd.DataFrame(scores_roc_curves[1], index=mean_fpr)
    .melt(var_name="Classifier", value_name="True Positive Rate", ignore_index=False)
    .rename_axis(index="False Positive Rate")
    .reset_index(),
    x="False Positive Rate",
    y="True Positive Rate",
    color="Classifier",
    title="ROC curve for the validation set",
    width=800,
    height=600,
).show()
rmtree(cachedir)
[Output] Mean scores over the five repetitions, on the BBB_2013B test set (30% hold-out):

classifier                   accuracy        f1  sensitivity  specificity       AUC
GradientBoostingClassifier   0.837069  0.859013     0.959950     0.704247  0.832098
LogisticRegression           0.836638  0.851489     0.911835     0.758487  0.835161
MLPClassifier                0.844397  0.859135     0.930380     0.755372  0.842876
RandomForestClassifier       0.831034  0.847872     0.919929     0.737445  0.828687

and on the BBB_2013J validation set:

classifier                   accuracy        f1  sensitivity  specificity       AUC
GradientBoostingClassifier   0.858498  0.884687     0.950560     0.735901  0.843230
LogisticRegression           0.804262  0.823131     0.797948     0.812671  0.805309
MLPClassifier                0.744273  0.745048     0.677612     0.833043  0.755328
RandomForestClassifier       0.834843  0.858208     0.875933     0.780124  0.828029