Always Learning

Educator’s Voice: The Testing Effect-Improving Long-Term Retention of Information by More Frequent Testing

by Gail E. Krovitz
Tuesday, February 7th, 2012

By Gail E. Krovitz, Ph.D., Director of Academic Training & Consulting

Testing, testing, 1, 2, 3… The idea of testing in education brings out mixed feelings in many of us. We think of excessive standardized testing in K-12 education, or instructors in higher education who assess a semester of content with only a Scantron final test. But what if the act of testing is misunderstood and can actually provide an opportunity for learning instead of just assessing learning gained elsewhere? Research on the testing effect shows just that. The testing effect is the “finding that retrieval of information from memory [i.e., as during a test] produces better retention than restudying the same information for an equivalent amount of time” (Roediger and Butler, 2011: 20) and is supported by a strong series of experiments in laboratory settings as well as classroom studies. “In education today, people tend to think of tests as dipstick devices… you stick it in to measure what people know. But every time you test someone, you change what they know” (H.L. Roediger III, as quoted in Glenn, 2007).

Many laboratory studies that examine the testing effect are set up as follows:

“One group of students studied some set of materials and then was given an initial test (or sometimes repeated tests). Retention of the material was assessed on a final criterial test, and the tested group’s performance was compared with that of one or two control groups. In one type of control, students studied the material and took the final test just as the tested group did, but were not given an initial test. In a second type of control (a restudy control), students studied the material just as the tested group did, but then studied the material a second time when the tested group received the initial test; in this case total exposure time to the material was equated for the tested and control groups. The typical finding throughout the literature is that the tested group outperforms both kinds of control groups… on the final test” (Roediger and Karpicke, 2006b: 182).

In summary, taking a test prepares students for future retention of material better than repeatedly studying or re-reading the same material does (see studies reviewed in Roediger and Karpicke, 2006b, and Roediger and Butler, 2011). This is particularly true when the final test is delayed rather than taken immediately after studying (otherwise known as cramming in educational circles!).

This result may seem counterintuitive: how can taking a test provide better preparation for retention of information on a delayed test (i.e., long-term retention), compared to continuing to study or re-read the same material? But this is actually not surprising if you think about it in terms of learning the skills needed for a task: practicing the exact skill during learning (i.e., taking a practice test) helps you perform better when being assessed on that skill later (i.e., taking a follow-up or final test) (Roediger and Karpicke, 2006a). If I want to learn to play tennis, I’m best served by actually practicing the skills needed to play tennis instead of reading about it. Also, research on how memory works shows that retrieving information during a test is not a “neutral event,” but actually impacts the ability to retrieve that information in the future. “People usually imagine memory as a storage space, as a space where we put things, as if they were books in a library. But the act of retrieval is not neutral. It affects the system” (J.D. Karpicke, as quoted in Glenn, 2007).

We might want to chalk these findings up to so-called “mediated” (or indirect) effects of testing. Mediated effects would include that frequent testing encourages students to study more throughout the class rather than cramming right before one or two large tests, or that tests give students feedback about what they do or don’t know so they can refine their future studying efforts (Roediger and Karpicke, 2006b: 182). With mediated effects of testing, “it is not the act of taking the test itself that influences learning, but rather the fact that testing promotes learning via some other process or processes” (Roediger and Karpicke, 2006b: 182). While these mediated effects are certainly valuable, and could lead to recommendations for more instructors to use low-stakes formative testing in their classes, this research focuses on direct or unmediated effects of the tests: something intrinsic to taking a test that helps future knowledge retrieval. “Testing not only measures knowledge, but also changes it, often greatly improving retention of the tested knowledge” (Roediger and Karpicke, 2006b: 181). Unfortunately, researchers don’t currently know why the testing effect works (see Roediger and Karpicke, 2006b for discussion of the theoretical studies investigating this), but the effect itself has been demonstrated in many studies.

Some other interesting findings from this research involve whether or not to give feedback (the correct answers) and what test format is most effective.

Feedback: It is important to give students the correct answers (or “feedback” as it’s called in these studies), as presenting students with the correct response after the test is more effective than simply telling them whether their answer to a particular question was correct (Butler et al., 2007). It is also best to give this feedback after the test as a whole (delayed feedback) rather than right after answering each question (immediate feedback) (Butler et al., 2007). A laboratory study by Butler and Roediger (2008) illustrates the testing effect, as well as the importance of feedback, and of delayed feedback in particular. The experiment yielded the following results for students who did not have a chance to read (study) the assigned passages they would eventually be tested on (2008: 609):

No study, no initial test: 10% correct on final test

No study, initial test, but no feedback given: 18% correct on final test

No study, initial test, immediate feedback given: 42% correct on final test

No study, initial test, delayed feedback given: 57% correct on final test

Similar patterns were observed in each experimental setup, illustrating the importance of the initial test and the use of delayed feedback.

Interestingly, Butler et al. (2007) mention other studies showing that immediate feedback might be more effective in actual classroom settings (rather than laboratory experiments), and they suggest that this might be due to students not going back after the test to process the delayed feedback (on both correct and incorrect questions), since they are not forced to do that as they would be in an experimental setup. Thus it might be helpful to recommend that students make an effort to view the feedback after the test and review both the questions they answered correctly and those they answered incorrectly.

Test format: Studies reviewed here suggest that if the initial test is short answer or essay format (a “free-recall” or “production” type test) it contributes to a larger testing effect than if the initial test is multiple-choice (a “recognition” type test); the follow-up test format (whether short answer or multiple-choice) does not matter as much (Roediger and Karpicke, 2006b). However, recognition type tests do still show a strong testing effect, so it’s probably still advantageous to use them in educational settings (for example, if the class size is too large to make manual grading of free-recall tests realistic).

Another potential issue of using multiple-choice or true/false questions on tests is that students are exposed to incorrect answers during the testing process. Therefore, “students may sometimes endorse false items as being true and thereby learn erroneous information,” or “even if they read a false item and know it is false, the mere act of reading the false statement may make it seem true at a later point of time” (Roediger and Karpicke, 2006b: 203). This is called the negative suggestion effect (Roediger and Karpicke, 2006b). To counterbalance this, providing feedback is extremely important on recognition type tests, and research shows that “if feedback is provided after a multiple-choice test, the negative effects are completely nullified” (Roediger and Butler, 2011: 23).

So far much of this discussion has focused on results of laboratory studies on the testing effect, but what about studies in actual classrooms? Classroom findings might differ because students are responsible for more information in the classroom than in a laboratory setting, the material is presented in a variety of ways, and “students also differ greatly in the amount of studying they do before tests, in how soon they begin studying (relative to when tests occur), in their interest in the course material, and in their motivation to learn” (Roediger and Karpicke, 2006b: 195). However, studies in classroom settings also demonstrate a testing effect. In a study of frequent quizzes given in a middle school science class, McDaniel and Agarwal found that frequent quizzing increased student performance on unit tests from 79% correct (for material not previously tested with a quiz) to more than 90% correct (2011: 403). The quizzing effect persisted through an end-of-semester test (79% on quizzed content vs. 72% on non-quizzed content) and an end-of-school-year test (68% for quizzed vs. 62% for non-quizzed content) (2011: 403). The quizzes were low stakes, counting for less than 10% of students’ grades. In another example, one section of a statistics for psychology course included a test of four short answer questions at the end of each lecture period (totaling around 8% of the final grade), while another section of the same class taught by the same professor did not use these end-of-class tests. Students in the section using the end-of-class tests scored significantly higher on the exams than students in the comparison section (mean score of 86% versus 78%), and fewer of them earned mean exam scores lower than 70% (5.4% of the class versus 27.1% of the class in the comparison section) (Lyle and Crawford, 2011).

All in all, research on the testing effect is compelling, and suggests that testing (or information retrieval practice) has a greater effect than studying on long-term retention of information, so more frequent “retrieval practice” (i.e., testing/quizzing) in the classroom should help increase long-term retention of information (Roediger and Butler, 2011). This research should hopefully allow us to see tests as opportunities for learning, instead of just instruments that assess learning acquired by other means, and maybe it will inspire some of us to include more frequent testing in our own classes.

Sources:

Butler, A.C., J.D. Karpicke, and H.L. Roediger, III (2007). The effect of type and timing of feedback on learning from multiple-choice tests. Journal of Experimental Psychology: Applied 13(4): 273-281.

Butler, A.C., and H.L. Roediger, III (2008). Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing. Memory & Cognition 36(3): 604-616.

Glenn, D. (2007). You will be tested on this. Chronicle of Higher Education 53(40): A14. Accessed online on January 10, 2012 at http://chronicle.com/article/You-Will-be-Tested-on-This/14732

Lyle, K.B. and N.A. Crawford (2011). Retrieving essential material at the end of lectures improves performance on statistics exams. Teaching of Psychology 38(2): 94-97.

McDaniel, M.A., and P.K. Agarwal (2011). Test-enhanced learning in a middle school science classroom: the effects of quiz frequency and placement. Journal of Educational Psychology 103(2): 399-414.

Roediger, H.L., III, and A.C. Butler (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences 15(1): 20-27.

Roediger, H.L., III, and J.D. Karpicke (2006a). Test-enhanced learning: taking memory tests improves long-term retention. Psychological Science 17(3): 249-255.

Roediger, H.L., III and J.D. Karpicke (2006b). The power of testing memory: basic research and implications for educational practice. Perspectives on Psychological Science 1(3): 181-210.

Instructor’s Tip: Using exam statistics to improve exam quality

by Gail E. Krovitz
Friday, January 6th, 2012

It’s the end of the term and I want to look over my classes to determine what to update before the next term begins. Exams are an important part of my assessment strategy and I want to be sure that my exams are the best that they can be. How do I know what questions are working well? Which questions are too hard or too easy? Which ones aren’t as clear as they could be? Exam Statistics can help answer these questions.

You can view the Exam Statistics either by locating the exam in the Gradebook or by clicking on the exam item in the left navigation menu while in author mode. It is best to view the Exam Statistics after the exam has been completed, as only submitted exams are included in the analysis. You can download and save your Exam Statistics report if you want to archive it, or to work with it in a program like Excel.

Exam Statistics provide both exam-level and question-level information. When entering the Exam Statistics area for a particular exam, you start in the exam-level area. Exam-level statistics include: highest and lowest scores, range, mean, median, mode, the average difficulty of the questions, standard deviation and frequency distribution of scores. This information shows how the class as a whole performed on the exam.
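If you download the report and want to double-check the exam-level numbers yourself, most of them can be reproduced with Python's standard library. The following is only a rough sketch: the function name and the sample scores are made up for illustration, and it assumes a population standard deviation, which may not be how the Exam Statistics tool calculates it.

```python
import statistics
from collections import Counter

def exam_level_stats(scores):
    """Summarize a list of exam scores (one score per submitted exam)."""
    return {
        "highest": max(scores),
        "lowest": min(scores),
        "range": max(scores) - min(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # multimode returns every most-common score, in case of ties
        "mode": statistics.multimode(scores),
        # Assumption: population standard deviation (pstdev), not sample (stdev)
        "stdev": statistics.pstdev(scores),
        # Frequency distribution: how many students earned each score
        "frequency": dict(sorted(Counter(scores).items())),
    }

# Hypothetical scores for eight submitted exams
scores = [72, 85, 85, 90, 64, 78, 85, 90]
summary = exam_level_stats(scores)
for name, value in summary.items():
    print(name, value)
```

The average question difficulty is left out here because it depends on question-level data rather than on the total scores alone.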

If you want more information about specific exam questions you can click the Question-Level Statistics tab. By default, analysis at the question-level displays in the order each question appears in the exam, but you can also sort exam questions by their level of difficulty. Question-level statistics include:

• The number of respondents for each question and the frequency that each response choice was selected. For multiple choice questions, this can be helpful for seeing not only how many students chose the correct answer, but which incorrect answers were also selected. From this you can investigate if the question distractors are working well or if there are any you want to change.

• The difficulty level of each question. This is the proportion of students who got the question right. For example, if 43 students attempted a question and 28 answered it correctly, then the difficulty is 28/43 or 0.65 (i.e., 65% answered correctly). Research shows that the difficulty range should be between 0.3 and 0.8, since if fewer than 30% of the students answered the question correctly then it could be too hard, and if more than 80% answered it correctly then the question may be too easy (see further information on the Definitions & Examples link in the Exam Statistics).

• The discrimination value for each question. This shows the degree to which each question separates the better students on the exam from the weaker students. When the value is high and positive, the students who did well on the question were the students who did well on the exam. When the value is high but negative, the students who did well on the question were the students who did poorly on the exam. When the value is low (positive or negative), there appears to be little correlation between how the students did on the question and how they did on the exam.
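To make the difficulty and discrimination values more concrete, here is a small Python sketch of how they could be computed from raw results. The difficulty calculation follows the definition above. The discrimination calculation is an assumption: it uses a plain correlation between getting the question right (1 or 0) and the total exam score, which matches the behavior described above but may not be the exact formula the Exam Statistics tool uses. The function names and sample data are hypothetical.

```python
from math import sqrt

def difficulty(item_correct):
    """Proportion of respondents who answered the question correctly.

    item_correct is a list of 1s (correct) and 0s (incorrect).
    """
    return sum(item_correct) / len(item_correct)

def discrimination(item_correct, exam_scores):
    """Correlation between question correctness (1/0) and total exam score.

    Assumption: a Pearson (point-biserial) correlation. Returns 0.0 when
    there is no variation, e.g. when everyone answered correctly.
    """
    n = len(item_correct)
    mean_x = sum(item_correct) / n
    mean_y = sum(exam_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_correct, exam_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_correct)
    var_y = sum((y - mean_y) ** 2 for y in exam_scores)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

# The example from the text: 43 attempts, 28 correct
attempts = [1] * 28 + [0] * 15
print(round(difficulty(attempts), 2))  # 0.65

# Hypothetical class of five: the top three scorers got the question right
item = [1, 1, 1, 0, 0]
totals = [90, 85, 80, 60, 55]
print(discrimination(item, totals) > 0)  # True
```

Note that a question everyone answers correctly has no variation, so this sketch reports a discrimination of 0 for it.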

Using the difficulty and discrimination values together can help you investigate how effective your exam questions are. For example, a question on my exam has a difficulty of 0.32 (meaning it’s reasonably difficult, since only 32% of students who attempted it got it correct), but it also has a discrimination of 0.65 (meaning that it does help discriminate between students who do well on the exam and those who don’t). So assuming the topic is well covered in class materials, this is a question I’ll probably keep. But another question has a difficulty of 1.0 (everyone who attempted it got it right) and a discrimination of 0. I’ll want to look at this further to decide if it’s something I want to keep.
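If you have many questions to review, the same reasoning can be sketched as a small Python helper that flags questions worth a second look. The 0.3 and 0.8 difficulty cutoffs come from the guideline mentioned earlier; the function name and the 0.2 cutoff for "low" discrimination are hypothetical choices for illustration, not values from the Exam Statistics tool.

```python
def review_question(difficulty, discrimination, low_disc=0.2):
    """Return a list of notes about a question based on its statistics.

    low_disc is an assumed cutoff for "low" discrimination; adjust to taste.
    """
    notes = []
    if difficulty < 0.3:
        notes.append("may be too hard")
    elif difficulty > 0.8:
        notes.append("may be too easy")
    if discrimination < 0:
        notes.append("negative discrimination")
    elif discrimination < low_disc:
        notes.append("low discrimination")
    return notes or ["looks reasonable"]

# The two questions discussed above
print(review_question(0.32, 0.65))  # ['looks reasonable']
print(review_question(1.0, 0.0))    # ['may be too easy', 'low discrimination']
```

A flag is only a prompt to look at the question again; whether to keep it still depends on how well the topic was covered in class.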

So take a look at your Exam Statistics and see how you can improve the exams in your own courses!

Gail E. Krovitz, Ph.D.

Director of Academic Training & Consulting