"Proctor" began as a hobby project. I have been coding for more than twenty years and some articles on natural language processing caught my eye in winter 2019. I teach mostly high school history and the most common assignment I give my students on their reading tasks is to compose a summary. This is educationally more sound than having students answer questions on the text. However, grading hundreds of summaries is very burdensome for the teacher. I wondered whether I could code an artificial intelligence to assist me in comparing student summaries to my own.
Proctor does not replace the teacher. That is because it is, to be honest, more "artificial" than intelligent. Nonetheless, it has greatly increased the consistency of my scoring and sped up the process considerably. Proctor can help score informational or expository compositions where the range of correct responses to the writing prompt is fairly narrow. It works well for short answer tests, summaries, and short informational compositions.
How does it work? Proctor analyzes a student writing sample and compares it to a "corpus" of model texts. These represent full credit responses. I always write the first one, then I select several samples from the class which represent full credit variations and I add those to the corpus as well. Proctor compares text features such as word count, text sophistication / readability, n-grams, presence of proper nouns, number of sentences, etc. If the corpus has more than one model, then Proctor first selects the model that is most similar to the student's sample and scores based on that comparison. In the short answer test application, Proctor remembers partial credit awarded by the teacher and suggests that score when another student's response is sufficiently similar to the previously scored partial credit response.
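To make the model-selection step concrete, here is a minimal Python sketch. It is not Proctor's actual code: the similarity measure (Jaccard overlap of word bigrams) and the names `tokens`, `ngrams`, `similarity`, and `closest_model` are my assumptions for illustration, and it uses only one of the text features listed above.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(words, n=2):
    """Set of word n-grams found in a list of tokens."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def similarity(a, b, n=2):
    """Jaccard overlap of word n-grams (an assumed metric, not Proctor's own)."""
    ga, gb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def closest_model(sample, corpus):
    """Return (model_text, similarity) for the corpus model most like the sample."""
    return max(((m, similarity(sample, m)) for m in corpus), key=lambda pair: pair[1])

corpus = [
    "The Treaty of Versailles ended World War I in 1919.",
    "Napoleon was defeated at Waterloo in 1815.",
]
model, score = closest_model("World War I ended with the Treaty of Versailles.", corpus)
```

A real scorer would combine several features (word count, readability, proper nouns, and so on) before comparing; this sketch only shows the idea of picking the nearest model first and scoring against it.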
For longer writing tasks, Proctor grades by level (1-4) and then on a 100-point scale. Level 4 responses are the best; the possible scores in this range are 85, 94, and 100. Level 3 represents the average or middling-quality response that is just passing, scored at 65 or 76. Level 2 is not "passing" by most standards and represents about half credit (55). Anything below that, Proctor scores as zero.
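The level-to-score mapping can be written as a small lookup table. The sketch below is illustrative only: treating "anything below" Level 2 as Level 1, and the `strength` parameter used to choose a score within a level, are my assumptions, since the way a score is picked inside a level is not described here.

```python
# Scores available at each level, taken from the description above.
# Level 1 standing for "anything below Level 2" is an assumption.
LEVEL_SCORES = {4: (85, 94, 100), 3: (65, 76), 2: (55,), 1: (0,)}

def scale_score(level, strength=0.0):
    """Pick the 100-point score for a level.

    `strength` is a hypothetical within-level measure from 0.0 (barely
    in the level) to 1.0 (top of the level).
    """
    scores = LEVEL_SCORES[level]
    strength = min(max(strength, 0.0), 1.0)          # clamp to [0, 1]
    index = min(int(strength * len(scores)), len(scores) - 1)
    return scores[index]
```

For example, `scale_score(4, 0.5)` selects the middle Level 4 score, while any `strength` at Level 2 yields the single half-credit score.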
For the short answer assessment, the untrained Proctor will recognize full and half credit responses. As you grade, you can add student responses to the corpus when you recognize full credit variations that you did not foresee when you composed the test. This is called "training" the AI. I built in a feature so that you can download your trained AI corpus and share it with other teachers at InnovationAssessments.
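The training step, together with the partial credit memory described earlier, might be modeled along these lines. This is a sketch under stated assumptions rather than Proctor's implementation: the class `ShortAnswerCorpus`, the word-overlap metric, the 0.6 threshold, and the JSON export format are all hypothetical.

```python
import json

def word_overlap(a, b):
    """Jaccard overlap of lowercase word sets (an assumed metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa and wb else 0.0

class ShortAnswerCorpus:
    def __init__(self, models=None):
        self.models = list(models or [])  # full credit responses
        self.partial = []                 # (response, score) pairs set by the teacher

    def add_model(self, response):
        """'Train' by accepting a new full credit variation."""
        if response not in self.models:
            self.models.append(response)

    def remember_partial(self, response, score):
        """Record partial credit the teacher awarded to a response."""
        self.partial.append((response, score))

    def suggest_partial(self, response, threshold=0.6):
        """Suggest earlier partial credit if a remembered response is similar enough."""
        best = max(self.partial, key=lambda p: word_overlap(response, p[0]), default=None)
        if best and word_overlap(response, best[0]) >= threshold:
            return best[1]
        return None

    def to_json(self):
        """Serialize the trained corpus so it can be downloaded and shared."""
        return json.dumps({"models": self.models, "partial": self.partial})
```

Keeping full credit models and remembered partial credit in separate lists means a shared corpus carries both kinds of teacher judgment.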
Proctor was trained to grade papers like I do. I manually scored over 500 samples of student work in the spring of 2019 and compared my scores to those the AI assigned, adjusting the profile of each scoring range until Proctor came to score papers like I do. I fully expect that I will continue to adjust and refine the scoring algorithm. The machine learning functions built into Proctor will also help it improve its accuracy on its own.
I discovered that Proctor can help coach my students as well. If you turn on coaching mode for short answer testing, Proctor will give students hints drawn from the model corpus. It can offer noun phrases and proper nouns that students could consider including to build up their score.
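As a rough illustration of coaching mode, the sketch below lists proper nouns that appear in the model corpus but not in the student's response. It is an assumption-laden stand-in: the functions `proper_nouns` and `hints` are hypothetical, and the capitalization heuristic is far cruder than real noun phrase extraction (for instance, it would also pick up the pronoun "I").

```python
import re

WORD = r"[A-Za-z][A-Za-z'-]*"

def proper_nouns(text):
    """Crude heuristic: capitalized words that do not start a sentence."""
    found = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(WORD, sentence)
        found.update(w for w in words[1:] if w[0].isupper())
    return found

def hints(student_text, corpus):
    """Proper nouns used in the model texts but missing from the student's text."""
    model_terms = set().union(*(proper_nouns(m) for m in corpus))
    student_words = {w.lower() for w in re.findall(WORD, student_text)}
    return sorted(t for t in model_terms if t.lower() not in student_words)

corpus = ["In 1815 the Congress of Vienna redrew the map of Europe after Napoleon fell."]
suggestions = hints("Napoleon lost and leaders met to redraw the map.", corpus)
```

Here `suggestions` holds the terms a student might consider adding; terms the student already used, such as "Napoleon", are left out.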
A word is in order about the limitations of Proctor version 1.0. Proctor was devised to score the quality of the included content of a writing sample by comparing it to one or more "ideal" versions. Proctor does not handle persuasive samples well yet. It also does not really understand outlines yet. Proctor does not assess spelling or grammar conventions, since virtually all browsers have this functionality. The assessment may be affected sometimes by student spelling and punctuation errors, but not always and only minimally. Finally, Proctor is not (yet?) an "essay grading app", although it will be reliable in scoring portions of essays where a narrow, measurable range of factual content is expected.