Unique validity arguments supported by process data

This post, based on an upcoming presentation at ITC 2021, explores how process data can be used to evaluate some unique validity arguments. I consider three validity arguments supported by two different kinds of process data: a sequence of problem solving actions, and a keystroke log.

As you will know if you have read my other blog posts, I think a lot about validity, and that validity is an argument about the interpretation and use of test scores. In this post, I talk about how process data presents some really interesting opportunities and challenges with respect to validity.

Process data, sometimes also called log data, is a broad category for the sorts of data that can be captured while students are interacting with test questions. It is typically a moment-to-moment record of what students did: where they clicked, how long they spent, whether they changed their answer, etc. Although the terms log data and process data are often used interchangeably, I think a distinction may be useful. We can use log data to refer to the raw log from the testing platform, and process data to refer to a refined version of that data that we argue actually represents some aspect of students’ process. Stay tuned for a future blog post on this topic.

I will briefly describe the process data I used to explore validity arguments; you can read more about our data here (under the “Data and Qualitative Coding” tab). We collected two types of data while our participants were answering test questions on introductory Java programming. The first is thinkalouds, a kind of interview where you ask the participant to say out loud what they are thinking. We transcribed those interviews and then categorized each segment according to what kinds of problem-solving and cognitive activities were going on. The second type is keystroke logs: a record of each key press and the time interval between presses.
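To make the keystroke-log idea concrete, here is a minimal sketch of deriving inter-key intervals from a raw log. The field names (`timestamp_ms`, `key`) are hypothetical, not the study’s actual schema:

```python
def inter_key_intervals(events):
    """Given keystroke events sorted by time, return the list of
    inter-key intervals (ms) between consecutive key presses."""
    times = [e["timestamp_ms"] for e in events]
    return [b - a for a, b in zip(times, times[1:])]

# A toy log: three key presses with their timestamps.
log = [
    {"timestamp_ms": 0,   "key": "i"},
    {"timestamp_ms": 180, "key": "n"},
    {"timestamp_ms": 310, "key": "t"},
]
print(inter_key_intervals(log))  # [180, 130]
```

This is the refinement step mentioned above: the raw platform log records events, and the derived intervals are the “process data” we actually analyze.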

Part of our study was to see how changing parts of the items might affect how participants responded. We developed a whole framework to think about those features of the items, but for this post I will just focus on the “openness” of an item. For the items where participants had to write Java code, the item could be either open or closed. The open items provided a specification, such as “write a program that asks users to input a number greater than 2, and then tells them whether that number is prime,” and asked participants to write the code. The closed version of that item provided line-by-line directions that participants had to translate into code. For example, “create a variable ‘x’ and set it to 2”, etc.

In this section, I set out three validity claims that could be evaluated by the evidence from our process data. For each claim, I briefly describe a result from the data, and then the validity claim(s) that data could be used for.

Based on the thinkalouds, which we treated as a sequence of problem-solving and cognitive actions that participants took, we can compare the frequency of certain actions across the open and closed items. This showed that on open items, participants did significantly more planning than on the closed items. Specifically, there was more planning for the code as a whole, which we called global planning, as well as for specific lines of code, which we called local planning.

Comparing the planning activities on the open and closed writing items. This bar chart shows that on open items, which only gave a specification for the completed program, participants did much more planning for the code as a whole (global planning) and for specific lines of code (local planning).

In terms of validity, this result lets us evaluate a claim about what knowledge and skills an item is likely to require. The open items clearly had more instances of planning, which means that they required a code planning skill in addition to other knowledge, like the use of for loops. This means that scores on open items reflect both participants’ knowledge of Java syntax and their skill at planning code. The closed items, which had less planning, seem to be measuring a kind of translation skill from natural language to Java code. So scores on those items reflect knowledge of Java as well as a translation skill.
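The frequency comparison behind this claim can be sketched in a few lines. The coded segments below are hypothetical illustrations, with action labels mirroring the post’s global/local planning codes:

```python
from collections import Counter

# Hypothetical coded thinkaloud segments: (item_openness, action) pairs.
segments = [
    ("open", "global_planning"), ("open", "local_planning"),
    ("open", "global_planning"), ("open", "writing"),
    ("closed", "local_planning"), ("closed", "writing"),
]

counts = Counter(segments)
planning = {"global_planning", "local_planning"}

# Tally planning actions separately for open and closed items.
for openness in ("open", "closed"):
    n = sum(v for (o, a), v in counts.items()
            if o == openness and a in planning)
    print(openness, n)  # open 3, then closed 1
```

In the real study the comparison would of course be done over all coded segments and tested for significance; this only shows the shape of the tally.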

The second validity claim is based on the keystroke log. The figure below shows the elapsed time of the response on the horizontal axis, and the inter-key interval, or time between keypresses, on the vertical axis for all the closed items. This shows that on closed items, all of the incorrect responses were submitted earlier, and generally have shorter inter-key intervals (i.e. faster typing) than correct responses.

This chart shows, for the closed items, the time between key presses throughout each participant’s response to the item. The red points and lines represent incorrect answers, while the turquoise ones are correct answers.

The faster typing and earlier submissions on the incorrect responses suggest that participants with incorrect answers may have been rushing through these items. In terms of validity, this result means that scores on these closed items may not be valid measurements of the underlying construct if the participants were not giving their full attention.
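The per-response summaries behind that pattern could be computed as below. The records here are fabricated for illustration (two toy responses, not the study’s data), just to show the shape of the summary:

```python
from statistics import median

# Hypothetical per-response records: submission time in seconds and the
# response's inter-key intervals in milliseconds, plus the scored outcome.
responses = [
    {"correct": False, "submit_s": 95,  "ikis": [120, 140, 110]},
    {"correct": True,  "submit_s": 260, "ikis": [210, 310, 180]},
]

# Summarize each response by its median inter-key interval.
for r in responses:
    r["median_iki_ms"] = median(r["ikis"])
    print(r["correct"], r["submit_s"], r["median_iki_ms"])
```

Comparing these summaries between correct and incorrect responses (e.g. earlier `submit_s` and smaller `median_iki_ms` on incorrect ones) is one way to operationalize “rushing.”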

The last validity claim is based on the thinkalouds and keystroke logs combined. Both data sources had timestamps, and we linked them based on when participants started typing their response. As a result, the thinkaloud provided more information about what was happening during any given interval in the keystroke log. The chart below shows that combined data for only two items: the top panel has a closed item, while the bottom has an open one. Elapsed time is on the horizontal axis, and the number of characters in the response is on the vertical. If the line goes up, characters are being added; if it goes down, they are being removed. Coloring on the line indicates the problem-solving and cognitive activities happening in the thinkaloud (more on those here), and vertical lines mark any time the participant was rereading the program specification for the item.

Comparing one open and one closed item by combining the keystroke and thinkaloud data. Vertical lines are placed when the participant rereads the program specification, a problem-solving action. Coloring is based on the other problem-solving and cognitive actions the participant was taking.

This figure shows three different ways that rereading the specification influenced responding: early in the response, after having typed some “givens” for the code; after production plateaus, about 150s into the response on the closed item; and at about the same time on the open item, where rereading the spec happens only once before checking and submitting the answer.
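The linking step behind this figure can be sketched as follows. All field names and timestamps here are hypothetical; the only assumption carried over from the post is that both streams share a clock starting when typing begins:

```python
def char_count_series(keystrokes):
    """Running response length: +1 for an insertion, -1 for a backspace."""
    count, series = 0, []
    for k in keystrokes:
        count += -1 if k["key"] == "Backspace" else 1
        series.append((k["t"], count))
    return series

def label_keystrokes(keystrokes, segments):
    """Attach the active thinkaloud code to each keystroke by timestamp."""
    labeled = []
    for k in keystrokes:
        code = next((s["code"] for s in segments
                     if s["start"] <= k["t"] < s["end"]), None)
        labeled.append({**k, "code": code})
    return labeled

# Toy data: three keystrokes and two coded thinkaloud segments.
keys = [{"t": 0.0, "key": "x"},
        {"t": 1.2, "key": "Backspace"},
        {"t": 2.5, "key": "y"}]
segs = [{"start": 0.0, "end": 2.0, "code": "local_planning"},
        {"start": 2.0, "end": 5.0, "code": "writing"}]

print(char_count_series(keys))                   # [(0.0, 1), (1.2, 0), (2.5, 1)]
print(label_keystrokes(keys, segs)[2]["code"])   # writing
```

The character-count series gives the line in each panel, and the attached codes give its coloring.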

Turning to validity, there are a lot of different claims this data could support about how the program specification is related to participants’ responses. For example, the top panel in the figure shows the participant rereading the specification, followed by a plateau in production and then increased production. This suggests that the specification may have had some information that helped this participant keep writing. If, after rereading the specification, the participant had removed some code instead, it would suggest the specification prompted some evaluation and revision of the code. These kinds of conclusions about a participant’s process should not inform the scoring of the item, but can provide useful information for interpreting what the score means.

My goal here was not to be exhaustive about validity arguments based on process data. Instead, these three examples are intended to give some sense of the many possibilities for using process data to evaluate unique validity claims that traditional validity methods, like factor analysis, cannot. In addition, I think these examples highlight the importance of research on the response processes that participants use when responding to an item. Evidence from response processes is a crucial type of validity evidence to examine, and process data can make that possible.

PhD Candidate in Educational Measurement and Statistics at UW Seattle | Improving CS assessments with psychometrics and process data