{ "cells": [ { "cell_type": "markdown", "id": "6eb2cd7f", "metadata": {}, "source": [ "# Detecting at-risk students\n", "\n", "In this chapter, we attempt to replicate the student learning achievement model\n", "originally introduced by Al-Shabandar et al. {cite}`alshabandar_2019`, which aims to\n", "identify students at risk of failure or withdrawal from an online course.\n", "\n", "The approach of Al-Shabandar consists of the following steps:\n", "\n", "1. **Data aggregation:**\n", "\n", " Student click interactions are aggregated by activity type.\n", " This aggregation process computes both the total sum of clicks and interaction\n", " frequency for each distinct activity type.\n", "\n", "2. **Data cleaning:**\n", "\n", " Highly correlated features (>0.8) and near-zero variance predictors are removed.\n", "\n", "3. **Data normalization:**\n", "\n", " Following this, features are normalized with the Yeo-Johnson transformation.\n", " Additionally, to address the issue of class imbalance, the Synthetic Minority\n", " Over-Sampling (SMOTE) technique is applied to augment the representation of\n", " minority classes.\n", "\n", "4. **Model training:**\n", "\n", " Four distinct machine-learning algorithms are trained and fine-tuned.\n", " These algorithms undergo optimization through a combination of random search and\n", " grid search techniques conducted on the `BBB_2013B` course dataset.\n", " The assessment of model performance is achieved through ten-fold cross-validation,\n", " with a 70/30 training/test split.\n", "\n", "6. **Model evaluation:**\n", "\n", " The model's predictive capabilities are evaluated on the subsequent `BBB_2013J`\n", " course dataset involving several quality metrics, such as the\n", " F-measure, sensitivity, specificity, and AUC, to assess the model's efficacy and\n", " generalizability comprehensively.\n", "\n", "```{bibliography}\n", ":filter: docname in docnames\n", "```" ] }, { "cell_type": "code", "execution_count": 1, "id": "03c7d049", "metadata": {}, "outputs": [], "source": [ "from shutil import rmtree\n", "from tempfile import mkdtemp\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import plotly.express as px\n", "from imblearn.over_sampling import SMOTE\n", "from imblearn.pipeline import Pipeline\n", "from IPython.display import display\n", "from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier\n", "from sklearn.feature_selection import VarianceThreshold\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import (\n", " accuracy_score,\n", " f1_score,\n", " recall_score,\n", " roc_auc_score,\n", " roc_curve,\n", ")\n", "from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.preprocessing import PowerTransformer\n", "\n", "from oulad import get_oulad\n", "\n", "%load_ext oulad.capture" ] }, { "cell_type": "code", "execution_count": 2, "id": "771d3a95", "metadata": {}, "outputs": [], "source": [ "%%capture oulad\n", "oulad = get_oulad()" ] }, { "cell_type": "markdown", "id": "992812c2", "metadata": {}, "source": [ "## Data aggregation\n", "\n", "We construct the `feature_table` DataFrame that aggregates student VLE interactions\n", "by activity type.\n", "Both the total sum of clicks and interaction frequency are computed." ] }, { "cell_type": "code", "execution_count": 3, "id": "e6205f2f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | count_forumng | \n", "count_glossary | \n", "count_homepage | \n", "count_oucollaborate | \n", "count_oucontent | \n", "count_ouelluminate | \n", "count_quiz | \n", "count_resource | \n", "count_sharedsubpage | \n", "count_subpage | \n", "... | \n", "sum_oucollaborate | \n", "sum_oucontent | \n", "sum_ouelluminate | \n", "sum_quiz | \n", "sum_resource | \n", "sum_sharedsubpage | \n", "sum_subpage | \n", "sum_url | \n", "final_result | \n", "code_presentation | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id_student | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
23629 | \n", "24.0 | \n", "0.0 | \n", "16.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "15.0 | \n", "2.0 | \n", "0.0 | \n", "2.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "31.0 | \n", "2.0 | \n", "0.0 | \n", "5.0 | \n", "0.0 | \n", "False | \n", "2013B | \n", "
23798 | \n", "76.0 | \n", "1.0 | \n", "77.0 | \n", "3.0 | \n", "6.0 | \n", "0.0 | \n", "48.0 | \n", "16.0 | \n", "0.0 | \n", "33.0 | \n", "... | \n", "3.0 | \n", "44.0 | \n", "0.0 | \n", "104.0 | \n", "21.0 | \n", "0.0 | \n", "47.0 | \n", "56.0 | \n", "True | \n", "2013J | \n", "
25107 | \n", "321.0 | \n", "1.0 | \n", "114.0 | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "45.0 | \n", "13.0 | \n", "0.0 | \n", "14.0 | \n", "... | \n", "0.0 | \n", "1.0 | \n", "1.0 | \n", "85.0 | \n", "23.0 | \n", "0.0 | \n", "21.0 | \n", "14.0 | \n", "True | \n", "2013B | \n", "
27759 | \n", "47.0 | \n", "2.0 | \n", "46.0 | \n", "0.0 | \n", "6.0 | \n", "0.0 | \n", "32.0 | \n", "17.0 | \n", "0.0 | \n", "24.0 | \n", "... | \n", "0.0 | \n", "17.0 | \n", "0.0 | \n", "88.0 | \n", "19.0 | \n", "0.0 | \n", "35.0 | \n", "15.0 | \n", "False | \n", "2013J | \n", "
27891 | \n", "56.0 | \n", "0.0 | \n", "20.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "19.0 | \n", "6.0 | \n", "0.0 | \n", "6.0 | \n", "... | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "38.0 | \n", "6.0 | \n", "0.0 | \n", "11.0 | \n", "6.0 | \n", "False | \n", "2013B | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
2685831 | \n", "121.0 | \n", "1.0 | \n", "110.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "46.0 | \n", "30.0 | \n", "0.0 | \n", "26.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "148.0 | \n", "35.0 | \n", "0.0 | \n", "49.0 | \n", "9.0 | \n", "True | \n", "2013B | \n", "
2691100 | \n", "18.0 | \n", "0.0 | \n", "62.0 | \n", "0.0 | \n", "6.0 | \n", "0.0 | \n", "40.0 | \n", "49.0 | \n", "0.0 | \n", "38.0 | \n", "... | \n", "0.0 | \n", "22.0 | \n", "0.0 | \n", "102.0 | \n", "70.0 | \n", "0.0 | \n", "104.0 | \n", "56.0 | \n", "True | \n", "2013J | \n", "
2691566 | \n", "18.0 | \n", "0.0 | \n", "17.0 | \n", "0.0 | \n", "2.0 | \n", "0.0 | \n", "30.0 | \n", "5.0 | \n", "0.0 | \n", "6.0 | \n", "... | \n", "0.0 | \n", "33.0 | \n", "0.0 | \n", "81.0 | \n", "6.0 | \n", "0.0 | \n", "13.0 | \n", "1.0 | \n", "False | \n", "2013J | \n", "
2692384 | \n", "133.0 | \n", "0.0 | \n", "112.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "32.0 | \n", "54.0 | \n", "0.0 | \n", "43.0 | \n", "... | \n", "0.0 | \n", "3.0 | \n", "0.0 | \n", "73.0 | \n", "81.0 | \n", "0.0 | \n", "144.0 | \n", "14.0 | \n", "True | \n", "2013B | \n", "
2693772 | \n", "6.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "3.0 | \n", "0.0 | \n", "1.0 | \n", "... | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "0.0 | \n", "3.0 | \n", "0.0 | \n", "1.0 | \n", "0.0 | \n", "False | \n", "2013J | \n", "
3421 rows × 24 columns
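The remaining steps can be chained with the tools imported at the top of the
chapter. Below is a minimal sketch under a few assumptions: it reuses the
`feature_table` built above, splits the `BBB_2013B` presentation 70/30 for
training and testing while holding out `BBB_2013J` for validation, and tunes a
single classifier with an illustrative hyper-parameter grid rather than the
grids behind the scores reported below.

```python
# Tune on the 2013B presentation; keep 2013J for validation.
train = feature_table[feature_table.code_presentation == "2013B"]
validation = feature_table[feature_table.code_presentation == "2013J"]
x = train.drop(columns=["code_presentation", "final_result"])
y = train.final_result
x_val = validation.drop(columns=["code_presentation", "final_result"])
y_val = validation.final_result
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0  # 70/30 split
)

# Data cleaning: drop one feature from each pair correlated above 0.8.
corr = x_train.corr().abs()
correlated = [
    column
    for i, column in enumerate(corr.columns)
    if (corr.iloc[i + 1 :, i] > 0.8).any()
]
x_train, x_test, x_val = (
    frame.drop(columns=correlated) for frame in (x_train, x_test, x_val)
)

# Normalization, over-sampling and training in a single imblearn pipeline,
# so that SMOTE only ever sees the training folds during cross-validation.
pipeline = Pipeline(
    [
        ("variance", VarianceThreshold()),  # drops zero-variance predictors
        ("normalize", PowerTransformer()),  # Yeo-Johnson by default
        ("smote", SMOTE(random_state=0)),
        # One of the four classifiers; the same pipeline is repeated for each.
        ("classifier", RandomForestClassifier(random_state=0)),
    ]
)
search = GridSearchCV(
    pipeline,
    {"classifier__n_estimators": [100, 300]},  # illustrative grid only
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="roc_auc",
).fit(x_train, y_train)
```

With this setup, the `test_*` columns in the table below would correspond to
the 30% hold-out from `BBB_2013B` and the `validation_*` columns to the
subsequent `BBB_2013J` presentation: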
\n", "\n", " | test_accuracy | \n", "test_f1 | \n", "test_sensitivity | \n", "test_specificity | \n", "test_AUC | \n", "validation_accuracy | \n", "validation_f1 | \n", "validation_sensitivity | \n", "validation_specificity | \n", "validation_AUC | \n", "
---|---|---|---|---|---|---|---|---|---|---|
classifier | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
GradientBoostingClassifier | \n", "0.837069 | \n", "0.859013 | \n", "0.959950 | \n", "0.704247 | \n", "0.832098 | \n", "0.858498 | \n", "0.884687 | \n", "0.950560 | \n", "0.735901 | \n", "0.843230 | \n", "
LogisticRegression | \n", "0.836638 | \n", "0.851489 | \n", "0.911835 | \n", "0.758487 | \n", "0.835161 | \n", "0.804262 | \n", "0.823131 | \n", "0.797948 | \n", "0.812671 | \n", "0.805309 | \n", "
MLPClassifier | \n", "0.844397 | \n", "0.859135 | \n", "0.930380 | \n", "0.755372 | \n", "0.842876 | \n", "0.744273 | \n", "0.745048 | \n", "0.677612 | \n", "0.833043 | \n", "0.755328 | \n", "
RandomForestClassifier | \n", "0.831034 | \n", "0.847872 | \n", "0.919929 | \n", "0.737445 | \n", "0.828687 | \n", "0.834843 | \n", "0.858208 | \n", "0.875933 | \n", "0.780124 | \n", "0.828029 | \n", "
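For completeness, here is one way to compute the five reported metrics with
the helpers imported above; a sketch assuming the fitted `search` estimator
and the `x_val`/`y_val` frames from the previous sketch. Note that sensitivity
is the recall of the positive class, while specificity is the recall of the
negative class, which scikit-learn exposes through the `pos_label` argument.

```python
def evaluate(model, x, y):
    """Compute the five quality metrics for one fitted model."""
    prediction = model.predict(x)
    return {
        "accuracy": accuracy_score(y, prediction),
        "f1": f1_score(y, prediction),
        "sensitivity": recall_score(y, prediction),  # true positive rate
        "specificity": recall_score(y, prediction, pos_label=False),  # true negative rate
        "AUC": roc_auc_score(y, model.predict_proba(x)[:, 1]),
    }

# Example usage with the sketch estimator from the previous section.
display(pd.DataFrame([evaluate(search, x_val, y_val)], index=["validation"]))
```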