Withdrawal prediction on behavioral indicators

Withdrawal prediction on behavioral indicators#

In this section, we try to reproduce the withdrawal prediction approach based on behavioral indicators and Synthetic Minority Over-sampling (SMOTE) as presented in the work of Hlioui et al. [HAG21].

Given the recurring occurrence of elevated learner withdrawal rates in Massive Open Online Courses (MOOCs), the implementation of early withdrawal prediction models could facilitate pedagogical enhancements, enable tailored intervention strategies and empower learners to monitor and enhance their academic performance.

The approach of Hlioui et al. encloses four main phases:

Data preprocessing
Behavioral indicators extraction (feature engineering)
K-means-based Data Discretizing
Withdrawal prediction

[HAG21]

Fedia Hlioui, Nadia Aloui, and Faiez Gargouri. A withdrawal prediction model of at-risk learners based on behavioural indicators. International Journal of Web-Based Learning and Teaching Technologies, 16(2):32–53, 2021. doi:10.4018/ijwltt.2021030103.

from itertools import chain

import pandas as pd
from imblearn.over_sampling import SMOTE
from IPython.display import display
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from oulad import filter_by_module_presentation, get_oulad

%load_ext oulad.capture

%%capture oulad
oulad = get_oulad()

Data preprocessing#

As in the work of Hlioui et al., we use the data from the DDD course 2013B presentation. We start by extracting three tables containing data related to student demographics, student assessments and student course interactions.

MODULE = "DDD"
PRESENTATION = "2013B"

Student demographics table#

student_info = filter_by_module_presentation(
    oulad.student_info, MODULE, PRESENTATION
).drop(columns=["studied_credits", "imd_band"])
display(student_info)

	id_student	gender	region	highest_education	age_band	num_of_prev_attempts	disability	final_result
13091	24213	F	East Anglian Region	A Level or Equivalent	0-35	0	N	Withdrawn
13092	40419	M	East Midlands Region	Lower Than A Level	35-55	0	N	Withdrawn
13093	41060	M	Ireland	HE Qualification	0-35	0	N	Fail
13094	43284	M	East Midlands Region	Lower Than A Level	0-35	2	N	Withdrawn
13095	45664	M	Yorkshire Region	HE Qualification	0-35	1	N	Pass
...	...	...	...	...	...	...	...	...
14389	2693243	F	Yorkshire Region	Lower Than A Level	35-55	0	N	Distinction
14390	2694933	F	East Midlands Region	A Level or Equivalent	0-35	0	Y	Pass
14391	2697773	F	East Anglian Region	A Level or Equivalent	0-35	0	N	Fail
14392	2707979	F	East Midlands Region	Lower Than A Level	0-35	0	N	Fail
14393	2710343	M	North Western Region	Lower Than A Level	0-35	0	N	Fail

1303 rows × 8 columns

Student assessments table#

assessments = filter_by_module_presentation(oulad.assessments, MODULE, PRESENTATION)
student_assessments = assessments.merge(
    oulad.student_assessment, on="id_assessment"
).drop(columns=["is_banked", "assessment_type", "id_assessment"])
display(student_assessments)

	date	weight	id_student	date_submitted	score
0	23.0	2.0	907261	25	80.0
1	23.0	2.0	957467	25	77.0
2	23.0	2.0	1018685	25	40.0
3	23.0	2.0	1044992	25	73.0
4	23.0	2.0	1068316	25	67.0
...	...	...	...	...	...
10368	240.0	100.0	2677955	230	69.0
10369	240.0	100.0	2683836	230	71.0
10370	240.0	100.0	2689536	229	58.0
10371	240.0	100.0	2693243	230	82.0
10372	240.0	100.0	2694933	230	73.0

10373 rows × 5 columns

Student course interactions table#

%%capture -ns withdrawal_prediction_on_behavioral_indicators student_vle
student_vle = (
    filter_by_module_presentation(oulad.student_vle, MODULE, PRESENTATION)
    .merge(filter_by_module_presentation(oulad.vle, MODULE, PRESENTATION), on="id_site")
    .drop(columns=["id_site", "date", "week_from", "week_to"])
)
display(student_vle)

	id_student	sum_click	activity_type
0	430516	3	page
1	430516	6	homepage
2	430516	2	url
3	420388	1	url
4	420388	3	homepage
...	...	...	...
536832	536170	2	resource
536833	536926	1	homepage
536834	556295	3	homepage
536835	556780	1	homepage
536836	556918	2	homepage

536837 rows × 3 columns

Distribution of learners’ results in the course DDD2013B#

Next, we reproduce the pie chart from the paper of Hlioui et al. (Figure 3) which reveals class imbalance of student final outcome in the DDD2013B course presentation.

The class imbalance problem is a commonly recognized challenge in Machine Learning, known to hinder the development of effective classifiers. (Batista, et al., 2004)

When trained on imbalanced datasets, models often exhibit a significant bias toward the majority class. This bias occurs because traditional Machine Learning algorithms are primarily focused on maximizing overall prediction accuracy, which can lead them to overlook or neglect classes with fewer instances, typically referred to as the minority class. (Bekkar & Alitouche, 2013)

To deal with the class imbalance problem, the approach of Hlioui et al. applies the Synthetic Minority Over-sampling method (SMOTE) on the dataset, which generates new observations in the minority class by interpolating the existing ones.

Note

At this juncture, we deviate marginally from the initial approach as we apply the SMOTE method subsequent to the extraction of behavioral indicators.

(
    student_info.final_result.value_counts()
    .loc[["Withdrawn", "Fail", "Pass", "Distinction"]]
    .plot.pie(
        title="Distribution of learners' results in the DDD2013B course",
        ylabel="",
        wedgeprops={"width": 0.6, "edgecolor": "w"},
        autopct="%1.0f%%",
        pctdistance=0.72,
        colors=["firebrick", "steelblue", "silver", "darkkhaki"],
        startangle=-270,
        counterclock=False,
        labeldistance=None,
        radius=1.2,
        textprops={"color": "white", "weight": "bold", "fontsize": 12.5},
    )
)
plt.legend(loc="center")
plt.show()

../_images/5b98b98eacba621b9cfa74191764374066ce2ea4bb3b3d76830d836f88f5c7c9.png

Behavioral indicators extraction#

At this stage we extract behavioral indicators as described in the work of Hlioui et al.

Autonomy#

The autonomy indicator refers to the navigation frequency of learners within the virtual learning environment (VLE).

autonomy = (
    student_vle.drop(columns="activity_type")
    .groupby(["id_student"])
    .count()
    .rename(columns={"sum_click": "autonomy"})
)
display(autonomy)

	autonomy
id_student
40419	60
41060	214
43284	422
45664	405
52014	116
...	...
2691780	156
2692101	151
2693243	1249
2694933	398
2697773	86

1214 rows × 1 columns

Perseverance#

The perseverance indicator refers to the ratio of evaluations submitted on time by the learners.

perseverance = (
    student_assessments.query("date_submitted <= date")[["id_student", "score"]]
    .groupby("id_student")
    .count()
    .div(assessments.shape[0])
    .rename(columns={"score": "perseverance"})
)
display(perseverance)

	perseverance
id_student
40419	0.071429
41060	0.357143
43284	0.214286
45664	0.500000
52014	0.357143
...	...
2689536	0.285714
2691780	0.214286
2692101	0.071429
2693243	0.500000
2694933	0.428571

969 rows × 1 columns

Commitment indicators#

Commitment indicators aim to measure the level and type of involvement of learners. In the work of Hlioui et al. they refer to the total sum of clicks (interactions) made by learners on several related activity types (activity categories).

Below, a barplot is presented, illustrating the frequency distribution of interactions by activity type within the DDD2013B course.

student_vle.activity_type.value_counts().to_frame().plot.barh(
    title="Activity type frequency in the DDD2013B course", xlabel="Frequency"
)
plt.show()

../_images/6ec0c18da25634a0da1a3fd1f7d5dfb8b4c96e5032cfadb31521c548fc0388c0.png

Collaborative commitment#

The collaborative commitment indicator refers to the sum of clicks learners made on activities of type forumng, ouwiki, and ouelluminate.

collaborative_commitment = (
    student_vle.query("activity_type in ['forumng', 'ouwiki', 'ouelluminate']")
    .drop(columns="activity_type")
    .groupby("id_student")
    .sum()
    .rename(columns={"sum_click": "collaborative_commitment"})
)
display(collaborative_commitment)

	collaborative_commitment
id_student
40419	69
41060	46
43284	176
45664	181
52014	110
...	...
2691780	43
2692101	32
2693243	941
2694933	305
2697773	32

1106 rows × 1 columns

Course structure commitment#

The course structure commitment indicator refers to the sum of clicks learners made on activities of type homepage and glossary.

course_structure_commitment = (
    student_vle.query("activity_type in ['homepage', 'glossary']")
    .drop(columns="activity_type")
    .groupby("id_student")
    .sum()
    .rename(columns={"sum_click": "course_structure_commitment"})
)
display(course_structure_commitment)

	course_structure_commitment
id_student
40419	29
41060	80
43284	293
45664	287
52014	98
...	...
2691780	87
2692101	69
2693243	591
2694933	279
2697773	37

1211 rows × 1 columns

Course content commitment#

The course content commitment indicator refers to the sum of clicks learners made on activities of type resource, url, oucontent, page, and subpage.

course_content_commitment = (
    student_vle.query(
        "activity_type in ['resource', 'url', 'oucontent', 'page', 'subpage']"
    )
    .drop(columns="activity_type")
    .groupby("id_student")
    .sum()
    .rename(columns={"sum_click": "course_content_commitment"})
)
display(course_content_commitment)

	course_content_commitment
id_student
40419	45
41060	270
43284	348
45664	558
52014	132
...	...
2691780	154
2692101	215
2693243	1050
2694933	310
2697773	79

1206 rows × 1 columns

Evaluation activities commitment#

The evaluation activities commitment indicator refers to the sum of clicks learners made on activities of type extenalquiz.

evalutation_activities_commitment = (
    student_vle.query("activity_type == 'externalquiz'")
    .drop(columns="activity_type")
    .groupby("id_student")
    .sum()
    .rename(columns={"sum_click": "evalutation_activities_commitment"})
)
display(evalutation_activities_commitment)

	evalutation_activities_commitment
id_student
40419	1
41060	15
43284	30
45664	22
52014	9
...	...
2689536	8
2691780	24
2692101	2
2693243	61
2694933	41

1103 rows × 1 columns

Motivation#

The motivation indicator measures whether a learners’ sum of clicks on all activities is above average (motivated) or below (unmotivated).

motivation = (
    student_vle.drop(columns="activity_type")
    .groupby("id_student")
    .sum()
    .assign(motivation=lambda df: df.sum_click >= df.sum_click.mean())
    .drop(columns="sum_click")
    .astype(float)
)
display(motivation)

	motivation
id_student
40419	0.0
41060	0.0
43284	0.0
45664	0.0
52014	0.0
...	...
2691780	0.0
2692101	0.0
2693243	1.0
2694933	0.0
2697773	0.0

1214 rows × 1 columns

Performance#

The performance indicator refers to the sum of weighted assessment scores by learner.

performance = (
    student_assessments.assign(performance=lambda df: df.weight * df.score)
    .drop(columns=["date", "date_submitted", "weight", "score"])
    .groupby("id_student")
    .sum()
)
display(performance)

	performance
id_student
40419	704.0
41060	6359.5
43284	4020.0
45664	9528.0
52014	4424.0
...	...
2689536	10621.0
2691780	2703.0
2692101	1819.0
2693243	17058.0
2694933	14997.0

1065 rows × 1 columns

K-Means-based data discretizing#

At this stage, we discretize the generated indicators using the K-Means clustering method. We begin by estimating the ‘k’ parameter through the elbow method and then replace the indicators’ values with their corresponding clustering labels.

Elbow method#

%%capture -ns withdrawal_prediction_on_behavioral_indicators inertia
indicators = {
    "Perseverance indicator": perseverance,
    "Autonomy indicator": autonomy,
    "Structure content commitment indicator": course_structure_commitment,
    "Evaluation activities commitment indicator": evalutation_activities_commitment,
    "Learning content commitment indicator": course_content_commitment,
    "Collaborative commitment indicator": collaborative_commitment,
    "Performance indicator": performance,
}
k_range = list(range(2, 10))
fig = plt.figure(figsize=(20, 20))
# Inertia: Sum of squared distances of samples to their closest cluster center,
# weighted by the sample weights if provided.
inertia = [
    KMeans(n_clusters=k, n_init="auto").fit(indicator.values).inertia_
    for indicator in indicators.values()
    for k in k_range
]
for i, name in enumerate(indicators):
    index = i * len(k_range)
    pd.Series(inertia[index : index + len(k_range)], index=k_range, name="k").plot(
        title=f"{name} (Inertia/k)",
        xlabel="Number of clusters",
        ylabel="SSE",
        grid=True,
        marker="s",
        ax=plt.subplot2grid((5, 5), (int(i / 2), 2 * (i % 2) + int(i / 6)), colspan=2),
    )

plt.subplots_adjust(hspace=0.4, wspace=0.4)
plt.show()

../_images/8625d8ddfb2ef6f169fd5dcac527f85b227db139d4221180f7ad88e34d88179a.png

Discretization#

We obtain similar graphs as in the work of Hlioui et al. Thus, we proceed by setting the same k values.

selected_k = {
    "Perseverance indicator": 3,
    "Autonomy indicator": 3,
    "Structure content commitment indicator": 2,
    "Evaluation activities commitment indicator": 3,
    "Learning content commitment indicator": 2,
    "Collaborative commitment indicator": 3,
    "Performance indicator": 3,
}
for name, k in selected_k.items():
    kmeans = KMeans(n_clusters=k, n_init="auto")
    indicator = indicators[name]
    indicator[indicator.columns[0]] = kmeans.fit_predict(indicator)

Feature table#

We join the student demographics table with the discretized behavioral indicators into a single feature_table.

We also encode categorical columns (age_band, disability, gender, highest_education, region, and final_result) to numerical values and fill missing values with zeros.

region_encoder = OrdinalEncoder()
feature_table = (
    student_info.set_index("id_student")
    .join(chain([motivation], indicators.values()), how="outer")
    .fillna(0.0)
    .replace(
        {
            "age_band": {"0-35": "0.0", "35-55": "0.5", "55<=": "1.0"},
            "disability": {"N": "0.0", "Y": "1.0"},
            "gender": {"M": "0.0", "F": "1.0"},
            "highest_education": {
                "No Formal quals": "0.0",
                "Lower Than A Level": "0.25",
                "A Level or Equivalent": "0.5",
                "HE Qualification": "0.75",
                "Post Graduate Qualification": "1.0",
            },
            "final_result": {
                "Withdrawn": "1.0",
                "Fail": "0.0",
                "Pass": "0.0",
                "Distinction": "0.0",
            },
        }
    )
    .assign(region=lambda df: region_encoder.fit_transform(df[["region"]]))
    .astype(float)
)
display(feature_table)

	gender	region	highest_education	age_band	num_of_prev_attempts	disability	final_result	motivation	perseverance	autonomy	course_structure_commitment	evalutation_activities_commitment	course_content_commitment	collaborative_commitment	performance
id_student
24213	1.0	0.0	0.50	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
40419	0.0	1.0	0.25	0.5	0.0	0.0	1.0	0.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0
41060	0.0	2.0	0.75	0.0	0.0	0.0	0.0	0.0	2.0	0.0	0.0	0.0	1.0	0.0	1.0
43284	0.0	1.0	0.25	0.0	2.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0
45664	0.0	12.0	0.75	0.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	1.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2693243	1.0	12.0	0.25	0.5	0.0	0.0	0.0	1.0	0.0	2.0	0.0	2.0	0.0	1.0	2.0
2694933	1.0	1.0	0.50	0.0	0.0	1.0	0.0	0.0	2.0	1.0	0.0	1.0	1.0	0.0	2.0
2697773	1.0	0.0	0.50	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
2707979	1.0	1.0	0.25	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2710343	0.0	5.0	0.25	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

1303 rows × 15 columns

Withdrawal prediction#

Prior to training the classification models, we split the feature_table into a train (75%) and test (25%) set. Then we scale features to values between 0 and 1 and apply the SMOTE method to balance the occurences of the target class (Withdrawn/Not Withdrawn).

x_train, x_test, y_train, y_test = train_test_split(
    feature_table.drop(columns="final_result").values, feature_table.final_result.values
)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
smote = SMOTE()
x_resampled, y_resampled = smote.fit_resample(x_train, y_train)

Subsequently, we undertake a grid search across classification parameters for five classifiers, which have been chosen to align closely with those utilized in the work of Hlioui et al. These classifiers encompass Decision Trees, Random Forest, Support Vector Machines, Gaussian Naive Bayes (as a substitute for Tree Augmented Naive Bayes), and Multilayer Perceptron.

In line with the work of Hlioui et al., we adopt a 5-fold cross-validation approach with stratification. Evaluation of classifier performance is conducted utilizing the F-measure as the primary metric.