Unlocking the Data Behind Rosh Review to Improve Educational Outcomes: Pt. 1

April 12, 2017
Many of our savvy customers realize intuitively that there is a treasure trove of performance data accumulating in the background with each new question answered by each of our users (40,933,037 total questions answered, as of today). We frequently get emails from Rosh Review users and program directors asking for even more data than we are currently able to display on the user performance pages or the PD Dash. Since the release of the PD Dash in late 2012, we have worked to lead the industry in analyzing, curating, and processing those data, in order to allow our users to glean actionable information.

As our user base grows, so does the value of Rosh Review for comparative analytics, both for individual subscribers and for program directors. While we continue to design, evaluate, and build new analytics features, we still have access to more interesting data than the current user interface can display. A recent conversation with the director of a PA program raised some interesting questions on this topic, which we thought we would share.

Mock Exams in Rosh Review

Educators currently have access to a selection of mock exams in the PD Dash, which can be assigned to trainees and tracked separately from main question bank performance. The Mock PANCE, Mock Emergency Medicine ITE, Mock ABFM for Family Medicine, Mock NCCPA EM CAQ Exam, and Mock Emergency Nurse Practitioner Certification Exam are curated by senior Rosh Review editors and assembled from questions not otherwise available to individual subscribers in the primary question banks. They are designed to be broadly representative of high-stakes exams, and many program directors have taken advantage of these offerings to help prepare their learners for real-world exams.

Unlike the main question banks in Rosh Review, however, the analytics currently available for mock exams are limited to performance comparisons among users from the same program. We are still working on building an interface to allow PDs to see worldwide comparison data for these exams. In the interim, we have partnered with a handful of “data-head” PDs to drill into the data from the entire Rosh Review user base.

What you do with the information matters

People use Rosh Review for different purposes. Some subscribers aren’t actively preparing for a high-stakes exam and simply use the question bank to stay current and expand their knowledge. Many are preparing for a particular exam. Some training programs have integrated the Rosh Review question bank into their curricula; although their learners will have to take a certification exam at some point, these programs use the questions to augment clinical and didactic teaching rather than for pure exam prep. Still others use Rosh Review to assess learners’ progress or to remediate struggling learners, and many users combine several of these strategies. Each of these objectives calls for slightly different question styles and content emphases, and each requires a different analytic philosophy.

Originally, we validated the Rosh Review scoring prediction and probability of passing algorithms against real-world exam performance. For example, we compared Rosh Review performance among emergency medicine residents against their ITE or ABEM Qualifying Exam scores. As our users’ needs diversify, however, we have broadened our thinking.

One interesting question raised by a PD recently was whether it would be statistically valid to use performance on the Rosh Review Mock PANCE for summative evaluation of their students.

Can Rosh Review be a program’s final exam?

If one were to evaluate Rosh Review’s statistical ability to pick out students who shouldn’t pass a course, what would be the “gold standard” used to make that decision? In other words, to which existing standard should Rosh Review be compared? Many of the mock exams are relatively new (or are frequently revised, so that the exam as a whole is a “new” version), which means we haven’t yet had enough time to see actual PANCE performance for the cohort of Rosh Review users who took our Mock PANCE.
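If a real-world outcome (for example, actual PANCE pass/fail results) were available as the gold standard, a mock-exam cut score could be judged by how well its “at risk” flags agree with that outcome. The Python sketch below computes sensitivity, specificity, and positive predictive value for such a comparison. The function names, the 60% cut score, and the eight students’ scores and outcomes are all hypothetical, chosen only to show the arithmetic; they are not Rosh Review data or Rosh Review’s method.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    sensitivity: float  # of students who failed the gold standard, share the cut score flagged
    specificity: float  # of students who passed the gold standard, share the cut score cleared
    ppv: float          # of flagged students, share who actually failed the gold standard


def screening_metrics(flagged: list[bool], failed_gold: list[bool]) -> ScreeningMetrics:
    """Compare hypothetical mock-exam flags against a gold-standard pass/fail outcome.

    flagged[i]     -- True if the mock-exam cut score marks student i as at risk
    failed_gold[i] -- True if student i failed the gold-standard exam
    """
    if len(flagged) != len(failed_gold):
        raise ValueError("inputs must be the same length")
    pairs = list(zip(flagged, failed_gold))
    tp = sum(1 for f, g in pairs if f and g)          # flagged and truly failed
    fp = sum(1 for f, g in pairs if f and not g)      # flagged but actually passed
    fn = sum(1 for f, g in pairs if not f and g)      # missed: cleared but failed
    tn = sum(1 for f, g in pairs if not f and not g)  # cleared and passed
    nan = float("nan")
    return ScreeningMetrics(
        sensitivity=tp / (tp + fn) if tp + fn else nan,
        specificity=tn / (tn + fp) if tn + fp else nan,
        ppv=tp / (tp + fp) if tp + fp else nan,
    )


# Hypothetical mock-exam scores (fraction correct) and gold-standard outcomes.
mock_scores = [0.52, 0.58, 0.61, 0.66, 0.71, 0.74, 0.80, 0.55]
failed_real_exam = [True, False, True, False, False, False, False, True]
cut_score = 0.60  # hypothetical: scores below this are flagged as "at risk"
flagged = [s < cut_score for s in mock_scores]

metrics = screening_metrics(flagged, failed_real_exam)
print(metrics)
```

With these toy numbers, the cut score flags two of the three students who failed the gold standard (sensitivity 2/3) and clears four of the five who passed (specificity 0.8). Without a gold-standard outcome for the Mock PANCE cohort, none of these quantities can be computed yet, which is exactly the difficulty described above.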

The mock exams are typically built around “bread-and-butter” questions aimed at students in the middle of the pack. Currently, the mock exams provide the most information about students whose ability lies within 0.75 standard deviations on either side of the group mean. A handful of questions have very high discriminatory power, but so far that power has been observed only in a small population. In other words, we didn’t design the Mock PANCE to identify students at the extremes. Instead, it provides mainstream, representative questions with a few tougher ones thrown in for the overachievers.
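In Item Response Theory, “information” has a specific meaning: each question contributes information about examinee ability, and that contribution peaks near the question’s difficulty. The Python sketch below assumes a two-parameter logistic (2PL) model and entirely hypothetical item parameters (not the actual Mock PANCE items) to show why an exam built mostly from mid-difficulty questions measures students near the mean precisely and students at the extremes less precisely.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL probability of a correct answer for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def exam_information(theta: float, items: list[tuple[float, float]]) -> float:
    """Exam information at theta: the sum of 2PL item information a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


# Hypothetical (a, b) pairs, with theta in SD units around the group mean (0):
# eight "bread-and-butter" items near the mean plus two harder items.
mock_items = [(1.2, b) for b in (-0.75, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 0.75)]
mock_items += [(1.0, 1.5), (1.0, 2.0)]

for theta in (-2.0, -1.0, -0.75, 0.0, 0.75, 1.0, 2.0):
    info = exam_information(theta, mock_items)
    se = 1.0 / math.sqrt(info)  # approximate standard error of an ability estimate at theta
    print(f"theta={theta:+.2f}  information={info:.2f}  SE={se:.2f}")
```

Under these made-up parameters, information is substantially higher at θ = 0 than at θ = ±2, so ability estimates for students at the extremes carry larger standard errors. Adding more high-difficulty or low-difficulty items is what would widen that distribution.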

Widening that distribution

In contrast to the mock exams, the entire question bank is designed to target a much wider range of ability levels and has questions of varying levels of difficulty. In part, that is due to the broad array of objectives and use cases that we discussed earlier. Perhaps overall performance on Rosh Review can be used to predict much more than we originally thought. 

Classically, the goal on an exam (medical board exams included) is framed as “the more correct answers, the better.” Examiners typically report percent correct, and examinees strive to score 100%. Passing thresholds, where applicable, are also usually calibrated against percent correct.

But that’s not the only way to think about exam performance. In fact, a much more modern and statistically robust set of methods (and mathematics) called Item Response Theory (IRT) underlies much of our thinking at Rosh Review and provides a different paradigm for analysis. In Part 2, we will talk in more detail about what IRT is, how it works, and how it will eventually unlock the data behind Rosh Review to improve educational outcomes.
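As a small preview of why IRT departs from percent correct, consider two students who each answer 7 of 10 questions correctly, but on different sets of questions. The Python sketch below uses the one-parameter (Rasch) model, the simplest IRT model, chosen here purely for illustration; it is an assumption, not a description of the models Rosh Review uses, and the item difficulties are hypothetical. It estimates each student’s ability by maximum likelihood using Newton-Raphson steps.

```python
import math


def rasch_p(theta: float, b: float) -> float:
    """Rasch (1PL) probability of a correct answer for ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(num_correct: int, difficulties: list[float],
                     tol: float = 1e-10, max_iter: int = 100) -> float:
    """Maximum-likelihood ability estimate under the Rasch model.

    The MLE solves num_correct == sum of rasch_p(theta, b_i); Newton-Raphson finds the root.
    """
    n = len(difficulties)
    if not 0 < num_correct < n:
        # The MLE diverges to +/- infinity for zero or perfect scores.
        raise ValueError("num_correct must be strictly between 0 and the number of items")
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        expected = sum(probs)                      # expected number correct at theta
        info = sum(p * (1.0 - p) for p in probs)   # information at theta
        step = (num_correct - expected) / info     # Newton-Raphson update
        step = max(-1.0, min(1.0, step))           # cap each step at 1 logit to avoid overshooting
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical item difficulties (logits); the harder set is the easier set shifted by +2.
easier_items = [-1.5, -1.2, -1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.5]
harder_items = [b + 2.0 for b in easier_items]

easy_theta = estimate_ability(7, easier_items)
hard_theta = estimate_ability(7, harder_items)
print(f"7/10 correct on easier items: theta = {easy_theta:.2f}")
print(f"7/10 correct on harder items: theta = {hard_theta:.2f}")
```

Both students score 70%, but the one who faced the harder set receives an ability estimate exactly 2 logits higher, because every item in that set is 2 logits harder. Percent correct alone cannot make that distinction when different users answer different questions of varying difficulty.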

By Sean Michael
