Applying sequence analysis methods to response process data
This post describes three analysis methods we applied to response process data, and how they could be used in a validity argument.
Validity is an argument about the interpretation and use(s) of scores from a test. Like many arguments, a validity argument involves claims made about those interpretations and uses, along with evidence to support those claims. According to the Standards for Educational and Psychological Testing, there are (at least) 5 kinds of validity evidence.
In empirical validation studies, validity evidence often focuses on three things: the test’s internal structure (e.g. factor analysis), the test content, and the relationship of test scores with other measures. These are all great sources of validity evidence. Validity evidence from response processes, the cognitive and other processes that test takers engage in to answer a test item, is collected and analyzed less often.
This article describes three methods of analyzing response process data that we undertook for a paper at AERA 2021: proportional frequency, sequence dissimilarity, and sequential pattern mining. We applied these methods to understand how modifications to an item influenced the response processes that students engaged in when responding. We were not seeking validity evidence for a specific test, but instead wanted to explore aspects of test items that test designers could use to elicit certain performance. Despite our focus, these methods are well suited to provide evidence for a validity argument.
I’ll briefly describe the data we analyzed with these methods, then I’ll explain the methods in more detail. Our primary goal was to understand how changing specific features of computer science assessment items influenced the process students engaged in to respond. For example, one feature was “openness”: a closed item had a line-by-line description of code to write, while an open item just provided a more general specification. To understand how that might influence responses, we conducted think-aloud interviews, a research technique that asks participants to voice their thoughts as they solve each item. We made transcripts of those interviews, then segmented each transcript and applied a qualitative code to each segment based on the primary process in it. For example, one segment might be planning code to write, followed by typing syntax, followed by re-reading the specification. We treated those qualitative codes as a sequence, specifically a sequence indicating the process that a participant took to answer an item. We also categorized the codes more broadly: for example, noticing confusion was a kind of “monitoring” of thinking, while re-reading the specification was a kind of problem solving.
We had specific item features that we wanted to compare. For example, we wanted to compare the difference in process for open and closed items. Proportional frequency let us do this by combining all individuals’ sequences and ignoring the order within those sequences. For open items, we calculated the proportion of all codes made up by each process category. So, for example, we found that on open items, problem solving accounted for 30% of all codes. Then we compared that with the proportion for closed items, using a chi-square test with the null hypothesis that the difference in the proportions was zero. We repeated this for all features and process categories. This gave us really broad results: we could say that the proportion of problem solving was different, but not much more than that. We were also able to use the proportions to create visualizations that summarized differences between features, like the one below.
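To make the proportional frequency step concrete, here is a minimal sketch in plain Python. The sequences, category names, and function names are all made up for illustration, and the chi-square here is a simple 2x2 Pearson test without continuity correction, which may differ from the exact test we ran for the paper.

```python
import math
from collections import Counter

# Hypothetical coded sequences: one list of process categories per response.
open_sequences = [
    ["planning", "problem_solving", "coding", "monitoring"],
    ["problem_solving", "coding", "coding", "problem_solving"],
]
closed_sequences = [
    ["coding", "coding", "monitoring", "coding"],
    ["planning", "coding", "coding", "problem_solving"],
]


def category_counts(sequences):
    """Pool all sequences together and count codes, ignoring their order."""
    return Counter(code for seq in sequences for code in seq)


def proportions(counts):
    """Convert category counts into proportions of all codes."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], 1 degree of freedom."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0, 1.0  # a row or column is empty, so nothing to compare
    statistic = n * (a * d - b * c) ** 2 / margins
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def compare_category(category, counts_a, counts_b):
    """Test whether one category's share of codes differs between two features."""
    a = counts_a[category]
    c = counts_b[category]
    return chi_square_2x2(a, sum(counts_a.values()) - a,
                          c, sum(counts_b.values()) - c)


open_counts = category_counts(open_sequences)
closed_counts = category_counts(closed_sequences)
open_share = proportions(open_counts)["problem_solving"]
statistic, p_value = compare_category("problem_solving", open_counts, closed_counts)
print(open_share, round(statistic, 3), round(p_value, 3))
```

In this toy data, problem solving makes up 3 of the 8 codes on open items and 1 of 8 on closed items. Counts this small are far too low for a chi-square test to be trustworthy; they are only here to keep the example readable.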
Sequence dissimilarity gave us similarly broad results. However, unlike proportional frequency, this analysis took the order of sequences into account. It also preserved each individual’s sequence in the analysis. This means that, even though it also only provided results like “openness explained significant variation in sequences”, that result was based on the order of steps in individual sequences for that item feature. That means we could say that sequences from open items were more similar to each other than to sequences from closed items, which suggests that openness did have an influence on students’ processes.
There are a few steps to doing dissimilarity analysis. First, you have to choose a dissimilarity measure. Studer and Ritschard have a great discussion of the different options. We chose to use longest common subsequence because we wanted to highlight the difference in the order between two sequences. We can think of a dissimilarity measure as the “distance” between two sequences. Once we had calculated the dissimilarity between each pair of sequences, we assembled the pairwise distances into a matrix, like the one below.
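Here is a rough sketch of how a longest common subsequence (LCS) distance and the pairwise matrix can be computed. It assumes the common convention that the LCS distance between two sequences is the sum of their lengths minus twice the length of their longest common subsequence; the example sequences and function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Assumed convention: |a| + |b| - 2 * LCS(a, b). Identical sequences give 0."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise LCS distances."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = lcs_distance(sequences[i], sequences[j])
    return matrix


# Hypothetical coded sequences from three responses.
sequences = [
    ["plan", "type", "reread"],
    ["plan", "reread", "type"],
    ["type", "type", "monitor"],
]
matrix = distance_matrix(sequences)
for row in matrix:
    print(row)
```

The first two sequences contain the same actions in a different order, so they end up closer to each other (distance 2) than either is to the third (distance 4). That sensitivity to order is the reason we picked this measure.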
That matrix can be used to calculate “discrepancy”, which is like a variance but based on the pairwise distances between objects. We then calculated the sum of the squared discrepancies within an item feature (e.g. open items) and between item features (e.g. open and closed items). We used those within and between sums of squares to calculate an F-test, which told us whether the feature explains significant discrepancy (and for openness, we found that it did!).
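The within and between calculation can be sketched as follows. This follows one formulation of discrepancy analysis that I'm assuming here: the sum of squares for a set of n objects is the sum of their pairwise dissimilarities divided by n, and the significance of the resulting pseudo-F is judged by shuffling the feature labels. The matrix, labels, and function names are hypothetical, and the details may not match exactly what we ran for the paper.

```python
import random


def sum_of_squares(matrix, indices):
    """Assumed formulation: sum of pairwise dissimilarities divided by group size."""
    n = len(indices)
    if n == 0:
        return 0.0
    total = sum(matrix[i][j]
                for pos, i in enumerate(indices)
                for j in indices[pos + 1:])
    return total / n


def pseudo_f(matrix, labels):
    """Ratio of between-group to within-group discrepancy, scaled by degrees of freedom."""
    n = len(labels)
    groups = sorted(set(labels))
    m = len(groups)
    ss_total = sum_of_squares(matrix, list(range(n)))
    ss_within = sum(
        sum_of_squares(matrix, [i for i, g in enumerate(labels) if g == group])
        for group in groups
    )
    if ss_within == 0:
        return float("inf")
    ss_between = ss_total - ss_within
    return (ss_between / (m - 1)) / (ss_within / (n - m))


def permutation_test(matrix, labels, n_permutations=999, seed=0):
    """Share of label shufflings whose pseudo-F is at least the observed one."""
    rng = random.Random(seed)
    observed = pseudo_f(matrix, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pseudo_f(matrix, shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_permutations + 1)


# Hypothetical distances: 2 within a feature, 6 between features.
labels = ["open"] * 3 + ["closed"] * 3
matrix = [[0 if i == j else (2 if labels[i] == labels[j] else 6)
           for j in range(6)] for i in range(6)]
observed_f, p_value = permutation_test(matrix, labels)
print(observed_f, p_value)
```

With only six sequences there are very few distinct ways to split them into two groups of three, so the p-value from this toy example cannot get very small. A real analysis would need many more sequences per feature.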
Sequential Pattern Mining (SPM)
Where proportional frequency and sequence dissimilarity analysis told us about general differences in sequences, SPM gave us more detail about what distinguishes sequences of one feature from sequences of another feature. SPM is focused on subsequences: ordered sets of actions drawn from a sequence, no longer than the whole sequence itself. Briefly, SPM works by finding subsequences according to specific criteria (more below), then seeing how frequently those subsequences appear in sequences of each type, and determining their statistical significance with a chi-square test. Continuing with our example of comparing open and closed items, SPM compared how frequently each subsequence appeared in open items vs. closed items.
To use SPM, we had to set two values: the support and the maximum subsequence length. Support is the proportion of sequences that any given subsequence appears in. This can constrain the subsequences that go into the chi-square analysis. If support is set to 0, then any subsequence that appears in even a single sequence will be included; alternatively, if support is set to 80%, then a subsequence will only be analyzed if it appears in at least 80% of all sequences. We chose a value of 50% for our analysis, since we wanted to find subsequences that were more indicative of general aspects of students’ processes, rather than specific to a single student. The second value to set is the maximum subsequence length. This is the maximum number of actions that can be considered as a subsequence. If this value is very high, the analysis will consider subsequences that may be difficult to interpret without strong theory of how performance in the assessment domain works. We limited our lengths to 3 actions, to ensure that we looked at interpretable chunks of student processes.
The results of SPM are a list of subsequences, ordered by their discriminating power (i.e. the magnitude of the chi-square value), as well as the frequency of that subsequence in sequences of each type. These kinds of results let us understand how specific sets of processes that students engaged in were associated with item features, which provided much more detail about how item features influenced student responses.
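As a rough illustration of that workflow, here is a small sketch in plain Python. It makes a few assumptions that may not match the software we actually used: subsequences may skip over actions as long as order is preserved, support is computed over all sequences from both features combined, and the chi-square is a simple 2x2 Pearson test. All sequences and names are hypothetical.

```python
import math
from itertools import combinations


def subsequences(sequence, max_length):
    """Distinct ordered subsequences (gaps allowed) of up to max_length actions."""
    found = set()
    for length in range(1, max_length + 1):
        for positions in combinations(range(len(sequence)), length):
            found.add(tuple(sequence[p] for p in positions))
    return found


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], 1 degree of freedom."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0, 1.0
    statistic = n * (a * d - b * c) ** 2 / margins
    return statistic, math.erfc(math.sqrt(statistic / 2))


def mine_subsequences(group_a, group_b, min_support=0.5, max_length=3):
    """Rank subsequences by how strongly their presence differs between two groups."""
    sets_a = [subsequences(s, max_length) for s in group_a]
    sets_b = [subsequences(s, max_length) for s in group_b]
    n_a, n_b = len(sets_a), len(sets_b)
    results = []
    for sub in set().union(*sets_a, *sets_b):
        in_a = sum(sub in s for s in sets_a)
        in_b = sum(sub in s for s in sets_b)
        support = (in_a + in_b) / (n_a + n_b)
        if support < min_support:
            continue  # too rare to say anything general about students
        chi2, p = chi_square_2x2(in_a, n_a - in_a, in_b, n_b - in_b)
        results.append({"subsequence": sub, "support": support, "chi2": chi2,
                        "p": p, "freq_a": in_a / n_a, "freq_b": in_b / n_b})
    return sorted(results, key=lambda r: r["chi2"], reverse=True)


# Hypothetical coded sequences for open and closed items.
open_seqs = [["plan", "type", "reread"], ["plan", "reread", "type"],
             ["plan", "type", "monitor"], ["reread", "plan", "type"]]
closed_seqs = [["type", "type", "monitor"], ["type", "reread", "type"],
               ["type", "monitor", "type"], ["reread", "type", "type"]]
results = mine_subsequences(open_seqs, closed_seqs)
for r in results[:3]:
    print(r["subsequence"], r["chi2"], r["freq_a"], r["freq_b"])
```

In this toy data, planning followed by typing shows up in every open sequence and in no closed sequence, so it lands among the most discriminating subsequences. Several subsequences tie for the top chi-square value here, so the order among them is arbitrary.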
I mentioned above that the focus of our paper was to understand how item features influenced students’ responses. The analysis methods explained in this post certainly gave us some insight into that question, which you can read more about by visiting the iPresentation we prepared for AERA. I also want to point out how those same methods could be used to gather validity evidence for the interpretation and use of test scores.
- If we have really strong theory about the process that students are supposed to engage in when solving an item, not just what they should do but a specific set of actions they should use, we can use sequential pattern mining to see whether specific subsequences appear significantly more frequently in students’ sequences than in randomly generated sequences. This would show us whether the presence of those subsequences could discriminate between student sequences, which were a result of the item and student, and randomly generated sequences, which were random.
- If we have less strong theory, for example about the process and a specific order of how students should do it, we can compare all student sequences to an “ideal” sequence with sequence dissimilarity measures. This would provide evidence on the extent to which students were engaging in the intended process.
- If we had weaker theory, say just the kinds or amounts of process but not the order, we could apply proportional frequency analysis to see whether a significant proportion of the process was distributed as we anticipated. These kinds of analysis methods are well-suited to understanding what response process can tell us about student performance on a test.
If you’d like to read more detail about our paper at AERA, including the item features we used, and the full results we found, visit our iPresentation. Please reach out if you have any questions!
This summer, I’ll also be presenting at the International Test Commission Colloquium on combining multiple sources of response process data for validity arguments.