A first descriptive analysis


A good way to get started with the OULAD analysis is to explore the original paper that introduced the dataset [KHZ17].

In this section we try to reproduce and summarize the paper's findings. We also collect some notes at the end that might be useful later.

[KHZ17] Jakub Kuzilek, Martin Hlosta, and Zdenek Zdrahal. Open university learning analytics dataset. Scientific Data, 4:170171, Nov 2017. doi:10.1038/sdata.2017.171.

import matplotlib.pyplot as plt
import pandas as pd

from oulad import get_oulad

%reload_ext oulad.capture

%%capture oulad
oulad = get_oulad()

General statistics

module_count = oulad.courses.code_module.nunique()
print(
    "OULAD contains data about:\n"
    f"  - {oulad.courses.shape[0]} courses from {module_count} modules "
    "(4 STEM modules and 3 Social Sciences modules)\n"
    f"  - {oulad.student_info.shape[0]} students\n"
    f"  - {oulad.student_registration.shape[0]} student registrations\n"
    f"  - {oulad.student_vle.shape[0]} VLE interaction entries"
)

OULAD contains data about:
  - 22 courses from 7 modules (4 STEM modules and 3 Social Sciences modules)
  - 32593 students
  - 32593 student registrations
  - 10655280 VLE interaction entries
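Note that `shape[0]` counts rows. If, as we assume here, `student_info` holds one row per student and module-presentation, a student who takes several modules appears more than once, so the row count may exceed the number of distinct people. A minimal sketch on made-up data (the rows below are hypothetical; only the column names come from OULAD):

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the OULAD student_info table.
student_info = pd.DataFrame(
    {
        "code_module": ["AAA", "BBB", "BBB", "CCC"],
        "code_presentation": ["2013J", "2013J", "2014B", "2014J"],
        "id_student": [1, 1, 2, 3],
    }
)

row_count = student_info.shape[0]  # one row per (student, module-presentation)
distinct_students = student_info.id_student.nunique()  # distinct id_student values
print(f"{row_count} rows, {distinct_students} distinct students")
```

Running the same `nunique()` call on `oulad.student_info.id_student` would show whether the two figures differ in the real data.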

Student registration count by module with domain information

registration_count = (
    oulad.student_registration.groupby(
        ["code_module", "code_presentation"], as_index=False
    )
    .count()
    .groupby(["code_module"])
    .agg(
        presentations=pd.NamedAgg(column="code_presentation", aggfunc="count"),
        students=pd.NamedAgg(column="id_student", aggfunc="sum"),
    )
)
oulad.domains.join(registration_count, on="code_module")

	code_module	domain	presentations	students
0	AAA	Social Sciences	2	748
1	BBB	Social Sciences	4	7909
2	CCC	STEM	2	4434
3	DDD	STEM	4	6272
4	EEE	STEM	3	2934
5	FFF	STEM	4	7762
6	GGG	Social Sciences	3	2534
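The two-step groupby used for this table can probably be collapsed into a single aggregation. The sketch below compares both variants on a hypothetical toy table (only the column names come from OULAD). It assumes `id_student` has no missing values, since `count` skips them.

```python
import pandas as pd

# Hypothetical toy registrations; only the column names come from OULAD.
student_registration = pd.DataFrame(
    {
        "code_module": ["AAA", "AAA", "AAA", "BBB", "BBB"],
        "code_presentation": ["2013J", "2013J", "2014J", "2013B", "2013B"],
        "id_student": [1, 2, 3, 4, 5],
    }
)

# Two-step variant: count per module-presentation, then aggregate per module.
two_step = (
    student_registration.groupby(["code_module", "code_presentation"], as_index=False)
    .count()
    .groupby(["code_module"])
    .agg(
        presentations=pd.NamedAgg(column="code_presentation", aggfunc="count"),
        students=pd.NamedAgg(column="id_student", aggfunc="sum"),
    )
)

# Single-step variant: distinct presentations and registration rows per module.
one_step = student_registration.groupby("code_module").agg(
    presentations=("code_presentation", "nunique"),
    students=("id_student", "count"),
)

print(one_step)
```

On this toy table both variants give 2 presentations and 3 registrations for AAA, and 1 presentation and 2 registrations for BBB.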

Student registration count by module-presentation

registration_count = oulad.student_registration.groupby(
    ["code_module", "code_presentation"]
).size()
registration_count.reset_index(name="registrations")

	code_module	code_presentation	registrations
0	AAA	2013J	383
1	AAA	2014J	365
2	BBB	2013B	1767
3	BBB	2013J	2237
4	BBB	2014B	1613
5	BBB	2014J	2292
6	CCC	2014B	1936
7	CCC	2014J	2498
8	DDD	2013B	1303
9	DDD	2013J	1938
10	DDD	2014B	1228
11	DDD	2014J	1803
12	EEE	2013J	1052
13	EEE	2014B	694
14	EEE	2014J	1188
15	FFF	2013B	1614
16	FFF	2013J	2283
17	FFF	2014B	1500
18	FFF	2014J	2365
19	GGG	2013J	952
20	GGG	2014B	833
21	GGG	2014J	749

max_id = registration_count.idxmax()
min_id = registration_count.idxmin()
print(
    f"The largest module-presentation {max_id} contains "
    f"{registration_count[max_id]} student registrations.\n"
    f"The smallest module-presentation {min_id} contains "
    f"{registration_count[min_id]} student registrations.\n"
    "The average module-presentation registration count is "
    f"{registration_count.mean()}."
)

The largest module-presentation ('CCC', '2014J') contains 2498 student registrations.
The smallest module-presentation ('AAA', '2014J') contains 365 student registrations.
The average module-presentation registration count is 1481.5.

Student assessment count

exams = oulad.assessments[oulad.assessments.assessment_type == "Exam"]
print(
    f"The student_assessment table contains {oulad.student_assessment.shape[0]} rows."
    "\n"
    f"The assessments table contains {exams.shape[0]} Exams.\n"
    f"{pd.merge(oulad.student_assessment, exams, on='id_assessment').shape[0]} "
    "student_assessments are Exams."
)

The student_assessment table contains 173912 rows.
The assessments table contains 24 Exams.
4959 student_assessments are Exams.

Student info attributes distribution for CCC module

ccc_student_info = oulad.student_info[oulad.student_info.code_module == "CCC"].drop(
    "code_module", axis=1
)

attributes = [
    "age_band",
    "disability",
    "highest_education",
    "gender",
    "imd_band",
    "region",
]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 18), constrained_layout=True)

# One stacked bar chart per attribute, split by final_result.
# axes.flat walks the grid row by row, matching the order of the list above.
for attribute, ax in zip(attributes, axes.flat):
    ccc_student_info.groupby([attribute, "final_result"]).size().unstack().plot.bar(
        stacked=True, ax=ax, title=f"Student count by {attribute}"
    )
plt.show()

[Figure: six stacked bar charts of CCC student counts by age_band, disability, highest_education, gender, imd_band and region, each split by final_result.]

Notes

The initial total number of students in the selected modules was 38239.
Students in a module presentation are organized into study groups of about 20 people.
Module resources are available in the VLE system a few weeks before the presentation starts.
If the final exam date is missing in the assessments table, the exam takes place during the last week of the module presentation.
The structure of B and J presentations may differ.
In the student_registration table, a non-empty date_unregistration field means that the student has withdrawn.
If the student does not submit an assessment, no result is recorded.
The results of the final exam are usually missing.
An assessment score lower than 40 is interpreted as a failure.
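
Several of the notes above translate into simple rules on the tables. A minimal sketch on hypothetical toy data: the column names `date_unregistration`, `score`, `date` and `module_presentation_length` follow the public OULAD files as we understand them, while the derived columns (`withdrawn`, `failed`, `date_filled`) and the choice of the last day as a stand-in for "the last week" are assumptions made for illustration.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy rows; only the column names come from OULAD.
student_registration = pd.DataFrame(
    {"id_student": [1, 2, 3], "date_unregistration": [nan, 12.0, nan]}
)
student_assessment = pd.DataFrame(
    {"id_student": [1, 2, 3], "score": [78.0, 35.0, 40.0]}
)
assessments = pd.DataFrame(
    {
        "code_module": ["AAA", "BBB"],
        "code_presentation": ["2013J", "2013J"],
        "assessment_type": ["Exam", "Exam"],
        "date": [nan, 240.0],
    }
)
courses = pd.DataFrame(
    {
        "code_module": ["AAA", "BBB"],
        "code_presentation": ["2013J", "2013J"],
        "module_presentation_length": [260, 260],
    }
)

# A non-missing date_unregistration marks a withdrawal.
student_registration["withdrawn"] = student_registration.date_unregistration.notna()

# Scores strictly below 40 count as failures (a missing score compares as False).
student_assessment["failed"] = student_assessment.score < 40

# Assumption: approximate a missing exam date by the last day of the presentation.
exams = assessments.merge(courses, on=["code_module", "code_presentation"])
exams["date_filled"] = exams["date"].fillna(exams["module_presentation_length"])

print(student_registration, student_assessment, exams, sep="\n\n")
```

Because a missing score compares as `False`, unsubmitted or unrecorded results would need separate handling rather than being read as passes.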