Context: The students’ council mathematics and computer science in Paderborn (FSMI) provides a service called “Klausurausleihe”. Students may come to the FSMI and lend exams and protocols to prepare themselves for exams.
To manage the dataset behind this services, the so-called klausurtool is used. The FSMI an manage the set of exams available and their status. Exams are represented as tuples (examiner, date, content, lendingStatus, lecture) [0].
The problem: For every single exam, the examiner (and other data has to be entered). There were no rules on how to enter the examiner. Also, the klausurtool did not use UTF-8.
E.g. “Blömer” has been entered as”Blömer”, “Bloemer”, “Bl\”mer”. The last one is using LaTeX-notation (which has some funny effects on DB queries). Note, that this does not yet include typos like “Bleomer” or “Blmöer” which are also common.
Because of these problems (and some others not mentioned), a new klausurtool is in development. One of the originally planned features was to enter examiners and exams separately, i.e. when entering an exam, the examiner has to be picked from the existing ones.
Of course, the new klausurtool needs to be “feeded” with data. There are multiple possiblities – sortied ascending by my preference – to do this:
I am not a computer and therefore should not do tasks that can be automated. Especially when we’re talking about lots of stuff (there are ~300 uncleaned examiners, excluding the other data). Hence, I dismissed 1.
Possibility 2 is also to be dismissed: There’s a chance to do things right; why should we repeat the errors done before?
Number 3 sounds great! All the data, all clean and beautiful. Unfortunately this would be too much work: First, extract all uncleaned examiners from the set of exams. Then clean the examiners and create a map uncleanedExaminer_i -> cleanedExaminer_j. Finally, convert all exams to the new format, while replacing uncleanedExaminer_i on every exam with a relation to cleanedExaminer_j.
This may be possible but would get ugly soon enough. In fact, converting stuff into a new schema without cleaning is ugly enough (I already did this once).
Therefore, we are left with the final possibility: clean some data and add it to the new klausurtool.
Evaluating the data: In this post I will focus on examiners. Nonetheles the things mentioned here are applicable to different data (e.g. lecture names etc).
In this step, I asked the following questions, answered in later blog posts [1]:
The last step cannot be automated completely, without creating/typing a list of “correct” examiners. But it can be semi-automated: a program can present the clusters of Step 2 and the user can select the correct examiner from a cluster:
- Bleomer
- Bloemer
- Blömer
- Bl\”omer
Correct one: » 3
This solution should be faster than the “take out all exams and type the examiner into the system”…
[0] This is simplified.
[1] I’m lazy, therefore these links will be broken, until the corresponding blog posts are released. Could take some time…