Unlocking the Data Behind Rosh Review to Improve Educational Outcomes: Pt. 2

August 2, 2017
In Part 1 of this series, we talked a bit about our analytic philosophy, the treasure trove of performance data accumulating inside the Rosh Review servers, and how we curate questions for different purposes, such as high-stakes exam prep, lifelong learning, learner diagnostics, and program assessments. Now, let’s dive into a bit more detail about how we think about question difficulty and learner performance.

The old way of thinking

Not too long ago, all we had as a measurement of performance on a multiple-choice exam was the percent correct. That’s still the way most of us were acculturated to think about “success” in school. In the mid-20th century, psychometricians (people who study the science of educational and psychological measurement) working at ETS (the people who brought us the SAT, AP, GRE, TOEFL, and others)* realized the many shortcomings of this method and created the mathematics and techniques to think about things differently. Their methods didn’t become practical for real-world use, however, until computers became more ubiquitous a few decades later.

The problem with percent correct, or classical test theory more broadly, is that it’s a bit information-poor. It discards a lot of what we could figure out by using combinations of questions and combinations of test-takers and simplifies everything down to one number. Consider this example:

  • Adam and I both answer the same 10 questions about cardiac rhythm disorders. I get 6 correct and Adam gets 8 (he probably wrote them). We get scores of 60% and 80%, respectively.**
  • Our scores have nothing to do with one another and were calculated completely in isolation, assuming that the other person does not exist. They are assumed to be independent observations. Is he actually 33% smarter than me?  
  • The statistics have no idea that the questions are all in the same topic area, and maybe my performance on one question would have predicted my performance on another.
  • The statistics also don’t take into account that if we randomly guessed on all 10 questions (which have 4 answer choices each), our scores would probably be in the neighborhood of 25%. So, does 80% correct mean that Adam really knows 80% of the material or 55%? Or some other number?
  • If I gave Adam an 11th question about the same topic, what would be his probability of getting it correct? Nobody knows…
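To put a number on that guessing baseline, here’s a quick sketch in Python. The expected-guessing arithmetic comes straight from the example above; the “correction for guessing” rescaling is one conventional answer to “or some other number?” (the variable and function names are mine):

```python
# Expected score from pure random guessing: 10 questions, 4 choices each.
n_questions, n_choices = 10, 4
guess_rate = 1 / n_choices                     # 0.25
expected_guessing = n_questions * guess_rate   # 2.5 correct, i.e. 25%

# Classic "correction for guessing": rescale the observed score so that
# chance-level performance maps to 0 and a perfect score maps to 1.
def corrected_score(observed_pct, guess_rate=0.25):
    return (observed_pct - guess_rate) / (1 - guess_rate)

print(round(corrected_score(0.80), 2))  # Adam's 80% -> 0.73
print(round(corrected_score(0.60), 2))  # my 60%     -> 0.47
```

Under this (admittedly crude) adjustment, Adam’s 80% suggests he “knows” roughly 73% of the material, and my 60% shrinks to about 47%.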

Now consider the perspective of a Rosh Review editor looking at our performance on those 10 questions. She would see something similarly simplistic. Remember, it’s easy for you to keep track of Adam and me because it’s just two of us, but if the editor is looking at aggregate data for 1000 users, she has no idea which test taker got each question correct, nor does she know how each individual performed overall.

  • Say that Adam and I had no overlap on the questions we got incorrect. Adam missed questions #1 and 2, and I missed #3, 4, 5, and 6. The editor would see that 50% of the learners answered correctly on questions 1-6 and 100% on 7-10. What should she conclude?
    • Are questions 7-10 too easy? Remember, I could have guessed randomly and had a 25% chance of getting each of those questions correct. Maybe question #10 is plenty hard, and we both just got lucky.
    • Are questions 1-6 equally difficult? No idea… Even if you imagined 1000 test-takers, you could envision a scenario where the editor would still face a conundrum interpreting her aggregate question performance data.

Remember, the statistics assume that everything is independent. (Alternative reality: what if Adam and I got the same questions wrong? Are those questions too hard? Are they appropriately hard but just exceed each of our abilities, and a third test-taker smarter than both of us would get it correct? Who knows…)
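The editor’s dilemma is easy to reproduce: two quite different answer patterns can produce identical per-question statistics. A toy illustration (pattern A is the scenario above; pattern B is an invented alternative):

```python
# Rows = test-takers, columns = 10 questions; 1 = correct, 0 = incorrect.
# Pattern A: Adam misses questions 1-2, I miss 3-6 (no overlap).
pattern_a = [
    [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],  # Adam: 8/10
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 1],  # me:   6/10
]
# Pattern B: the same misses redistributed (Adam misses 1-3, I miss 4-6).
pattern_b = [
    [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # Adam: 7/10
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],  # me:   7/10
]

def item_percent_correct(responses):
    """Per-question percent correct -- all the aggregate view shows."""
    return [sum(col) / len(col) for col in zip(*responses)]

# Identical aggregate view, different individual scores underneath.
print(item_percent_correct(pattern_a) == item_percent_correct(pattern_b))  # True
```

From the editor’s seat, both worlds look like “50% on questions 1-6, 100% on 7-10,” even though the individual score distributions differ.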

What is good about classical test theory? It’s really simple to calculate. Also, we have all been trained since elementary school to understand why it’s great to get 100%.

Using all of that information

I alluded to the loss of information that happens when you consider everything in isolation. What if a powerful computer could help that poor editor by looking at those 1000 test takers simultaneously and finding each of the questions they answered correctly and incorrectly? Where are the overlaps? Where are the surprises? Does performance on one question predict performance on another? Do people who usually perform very well overall keep getting a certain question wrong? Maybe there is a problem with that question. You can imagine how such insights could be very powerful.

From the perspective of the learner, considering everything together helps as well. For example, if the computer knows the overall performance of everybody and knows which people got a given question correct, but I am seeing that question for the first time, the algorithm can construct a probability curve that predicts my chance of getting that question correct (and it’s probably a sigmoidal function ranging from 0.25—guessing—to 1, depending on my ability and the question itself). That way, whether I get the question correct or incorrect, I can also understand the probability of that outcome, based on my prior performance. While the outcome is binary, my odds might not have been 50:50. 
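A sigmoid with a floor of 0.25 is exactly what the three-parameter logistic (3PL) model from IRT produces. A minimal sketch, with illustrative parameter values of my own choosing:

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.25):
    """3PL item characteristic curve.

    theta: learner ability (in SDs from the mean)
    a: discrimination (how steeply the curve rises)
    b: difficulty (ability level at the curve's inflection point)
    c: pseudo-guessing floor (0.25 for four answer choices)
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

print(round(p_correct(-4.0), 2))  # 0.26 -- near the guessing floor
print(round(p_correct(0.0), 2))   # 0.62 -- midway between 0.25 and 1
print(round(p_correct(4.0), 2))   # 0.99 -- near-certain mastery
```

Whatever my ability, the curve never dips below chance and never quite reaches certainty, which matches the intuition about four-choice questions.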

Take that one step further by performing the calculation after I answer each question. If I have a 50% chance of answering a given question correctly, taking into account my current ability level, and I answer the question correctly, an algorithm could select another question for me to see next, where I have a 40% chance of answering correctly. The algorithm could keep ratcheting up the difficulty until it converges on the limits of my actual ability. The computer can probe the frontiers of my understanding, feeding me questions just challenging enough to keep me learning but not frustrated (if I am studying), or feeding me questions until there is a tight confidence interval around my ability level (if it’s a high-stakes exam). That’s how Computer Adaptive Testing (CAT) works, and you can see why ETS might be interested in developing and refining such algorithms (and they have).
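A toy version of that selection loop: serve the unanswered item whose predicted success probability sits closest to a target, then nudge the ability estimate and repeat. This is only a sketch of the idea; real CAT engines use maximum-information item selection and proper ability estimation, and the item bank and update rule here are invented:

```python
import math

def p_correct(theta, b, a=1.0, c=0.25):
    """3PL probability that a learner at ability theta answers correctly."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def next_item(theta, item_bank, target=0.5):
    """Pick the item whose predicted probability is closest to target."""
    return min(item_bank, key=lambda b: abs(p_correct(theta, b) - target))

# Hypothetical item bank: difficulties from very easy to very hard.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
theta = 0.0                   # current ability estimate
b = next_item(theta, bank)    # serves a moderately hard item (b = 1.0)
bank.remove(b)
theta += 0.5                  # crude upward nudge after a correct answer
b = next_item(theta, bank)    # ratchets up to a harder item (b = 2.0)
```

Each correct answer pushes the estimate up and the next item gets harder; each miss does the reverse, so the loop converges on the learner’s frontier.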

The assumption behind IRT

What I just described is possible with Item Response Theory (IRT), the methods that the team at ETS developed a few decades ago. There is a lot of math and computing power required, as you might suspect, but the fundamental assumption is simple. IRT assumes that questions measure some underlying latent trait, like how much I really know about cardiac rhythm disorders. It does not assume that each question is equally difficult, nor does it assume that each person with a different level of knowledge of cardiac rhythm disorders would have the same probability of answering a given item correctly.

That makes possible a lot of pretty neat things. For example, we can now calculate question difficulty—real difficulty, based upon all of the data and the total inference we can make about each user’s underlying knowledge level. We can also score each learner’s ability level on the latent trait. In fact, we can place both measures on the same scale, so we can make quick comparisons between questions and learners and say things like, “knowing all we know about this user’s performance thus far, we think they are about 1.5 standard deviations above their peers, but the next question they see will be really tough—only 50% of people who are 3 standard deviations above their peers will get this question right.”
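That quote can be made concrete. Putting learners and items on the same scale means “only 50% of people 3 SD above their peers get this right” pins the item’s difficulty at b = 3. Here I drop the guessing floor (i.e., use a two-parameter logistic rather than the 3PL) just to keep the arithmetic transparent:

```python
import math

def p_2pl(theta, b, a=1.0):
    """2PL: probability that a learner at ability theta answers an
    item of difficulty b correctly (no guessing floor)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

print(round(p_2pl(3.0, 3.0), 2))  # 0.5  -- at theta == b, odds are even
print(round(p_2pl(1.5, 3.0), 2))  # 0.18 -- our +1.5 SD learner will
                                  #         probably miss this item
```

Because ability and difficulty live on one scale, a single subtraction (theta minus b) tells you how a given learner matches up against a given question.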

*All of these are trademarks of Educational Testing Service

**(Pure classical theorists would say that our scores are actually observed scores, which are made up of some true score plus some random statistical error. So on a randomly different attempt, I might get 5 or 7 correct, and Adam might get 7 or 9. I’m using small, round numbers to illustrate the point, but you get the idea.)

By Sean Michael
