Saturday, 7 September 2013

Assessment Reporting: Feedback and Scoring

Feedback

Designing and selecting tasks is one thing; establishing quality feedback is quite another, and a very important one. Without proper feedback, the whole idea of assessment contributing to the learning process is endangered. The feedback possibilities clearly depend on the “format” that is chosen. In discourse, feedback can be immediate and highly differentiated: it can be direct (telling the student what is wrong and why, and suggesting a correction), but also, and probably quite often, indirect (simply asking whether the student is “sure” and can explain the answer, and comparing it with answers given by fellow students).


Feedback possibilities with the multiple-choice format are not abundant: Usually, the only feedback students get is whether something was correct or incorrect; in a best-case scenario, the teacher might spend some time in the classroom highlighting some of the most common incorrect answers. Within the common restricted-time written test, there are ample opportunities to give dedicated, individual feedback to the student. This is time-consuming and the quality of the feedback depends to a large extent on how the student’s answers are formulated. If the student fails to write down anything relevant, the question of quality feedback becomes an extremely difficult one. In such cases, oral feedback after additional questioning seems to be the only option.


Feedback can also have a very stimulating effect. Consider, for example, homework. Compare the experience of a student who is assigned homework but receives nothing beyond a “check” of whether he or she “did” it, with that of a student who gets quality feedback (as described in the Homework section). This was also demonstrated in a study of mathematics homework in Venezuela (Elawar & Corno, 1985). One group of students was given specific feedback on their errors and on the strategies they had used; another group followed the “normal” practice of homework without comments. Analysis of the results showed a large effect of the feedback treatment on subsequent student achievement.


A definition of feedback can be found in Ramaprasad (1983): “Feedback is information about the gap between the actual level and the reference level of a system parameter, which is used to alter the gap in some way. In order for feedback to exist, the information about the gap must be used in altering the gap.” This definition is a little too restricted for our purposes, because the “gap” need not be a gap in the strict sense. Students might solve a problem at very different levels of mathematization and formalization and still all be successful, so, strictly speaking, there is no gap. But we can still use the feedback mechanism to bridge the level-of-formality “gap”: to show students working at a less formal level what becomes possible with somewhat more formal mathematics. It can also be used the other way around: to show the more formal students how elegant, and maybe even superior, “common sense” solutions can be.

Kluger and DeNisi (1996) identified four different ways to close the gap. The first will come as no surprise: try to reach the standard or reference level, which requires clear goals and high commitment on the part of the learner. At the other end of the scale, one can abandon the standard completely. In between there is the option of lowering the standard. And finally, one can deny that the gap exists.


Kluger and DeNisi also identified three levels of linked processes involved in the regulation of task performance: meta-task processes involving the self, task-motivation processes involving the focal task, and task-learning processes involving the details of the task. Regarding the meta-task processes, it is interesting to note that feedback that directs attention to the self rather than to the task appears likely to have negative effects on performance (Siero & Van Oudenhoven, 1995; Good & Grouws, 1975; Butler, 1987). In contrast to interventions that cue attention to meta-task processes, feedback interventions that direct attention toward the task itself are generally more successful. In 1998, Black and Wiliam were surprised to see how little attention the research literature had paid to the relationship between task characteristics and the effectiveness of feedback. They concluded that feedback appears to be less successful in “heavily-cued” situations (e.g., those found in computer-based instruction and programmed learning sequences) and relatively more successful in situations that involve “higher-order” thinking (e.g., unstructured text comprehension exercises).


From our own research (de Lange, 1987), it became clear that the “two-stage task” format affords excellent opportunities for high-quality feedback, especially between the first and second stages of the task. This is in part due to the nature of the task format: After completion of the first stage, the students are given feedback that they can use immediately to complete the second stage. In other words, the students can “apply” the feedback immediately in a new but analogous situation, something they were able to do very successfully.


Scoring

Wiggins (1992) points out, quite correctly, that feedback is often confused with test scores. This perception is one of many indications that feedback is not properly understood. A score on a test is encoded information, whereas feedback is information that provides the performer with direct, usable insights into current performance and is based on tangible differences between current performance and hoped-for performance.


So we need quality feedback on one side, and “scores” to keep track of growth in a more quantitative way on the other. Quite often we have to accept that we cannot quantify in the traditional sense (e.g., on a scale from one to ten), but can only make short notes when, during a discourse or during homework, a student does something special, whether good or bad. Many of the formats described earlier are free-response formats. Analysis of students’ responses to free-response items can provide valuable insight into the nature of student knowledge and understanding, and in that sense helps us formulate quality feedback. With such formats we get information about the method a student uses in approaching a problem, as well as about the misconceptions or types of errors he or she may demonstrate.


But as the TIMSS designers observed (Martin & Kelly, 1996), student responses to free-response items scored only for correctness would yield no information on how the students approached the problems. TIMSS therefore developed a special coding system, which can also be used in classroom assessment to provide diagnostic information in addition to information about the correctness of student responses. A two-digit code is assigned to each free-response item: the first digit, ranging from 1 to 3, records the correctness score, while the second digit indicates the approach or strategy used by the student. Codes between 70 and 79 are assigned to categories of incorrect attempts, and 99 indicates that the student did not even try. This TIMSS coding system, which was later adapted successfully for the Longitudinal Study on Middle School Mathematics (Shafer & Romberg, 1999), is illustrated below with a generic example for an item worth one point.


Student responses coded as 10, 11, 12, 13, or 19 are correct and earn one point. The second digit denotes the type of response in terms of the approach used or the explanation provided. For items worth more than one point, rubrics were developed that allow partial credit while still describing the approach used or the explanation provided.


Student responses coded as 70, 71, 76, or 79 are incorrect and earn zero points. Here the second digit indicates the misconception displayed, the incorrect strategy used, or the incomplete explanation given. This gives the teacher a good overview of where the class as a whole stands, as well as of individual differences, and it can lead to adequate and effective feedback. Student responses coded 90 or 99 also earn zero points: 90 means the student attempted the item but failed completely, and 99 means there was no attempt at all.
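To make the mechanics of the two-digit codes more concrete, here is a minimal sketch in Python of how such codes could be recorded and summarized in a classroom gradebook. The meanings attached to the individual codes (for instance, 11 for a particular alternative strategy, 70 for a particular misconception) are our own illustrative assumptions, not the official TIMSS rubric for any real item.

    # Hypothetical two-digit codes for a one-point item; the diagnostic
    # labels are illustrative assumptions, not an official TIMSS rubric.
    EXAMPLE_CODES = {
        10: ("correct", "standard approach"),
        11: ("correct", "alternative strategy"),
        19: ("correct", "other correct response"),
        70: ("incorrect", "common misconception"),
        76: ("incorrect", "other incorrect attempt"),
        79: ("incorrect", "incomplete or unclear explanation"),
        90: ("incorrect", "attempted but failed completely"),
        99: ("no attempt", "item left blank"),
    }

    def score(code):
        # The first digit carries the correctness score: 1 to 3 points for
        # codes 1x to 3x, zero points for the 7x and 9x codes.
        first_digit = code // 10
        return first_digit if first_digit <= 3 else 0

    def diagnosis(code):
        # The full code carries the diagnostic information.
        correctness, detail = EXAMPLE_CODES[code]
        return correctness + ": " + detail

    # A class list of codes then yields both a score per student and an
    # overview of which approaches and misconceptions occurred.
    responses = {"Anna": 11, "Ben": 70, "Carla": 99}
    for name, code in responses.items():
        print(name, score(code), diagnosis(code))

The same idea carries over to items worth more than one point (codes 20 to 29, 30 to 39), and, as discussed below, one could add a further code for the level of mathematical competency.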


Another helpful addition to the scoring system is a code for the level of mathematical competency. Of course, when a teacher designs her classroom assessment system she will balance it in relation to the levels of mathematical competencies, but this will not necessarily yield information about the levels of individual students. A crucial and potentially weak point arises when we are dealing with partial credit, as will quite often be the case. This is a difficult point for students and teachers alike. Without preparation, guidelines, exemplary student responses, or a proper “assessment contract” between teacher and students, partial-credit scoring can be a frustrating experience, even though its necessity is obvious. We therefore discuss the issue of partial credit in a little more detail through the following examples.


First, we present an example of a very simple and straightforward method for partial-credit scoring, in the form of an (external) examination item about a cotton cloth for a round table; a possible worked solution is sketched after the item.
Nowadays you quite often see small round tables with an overhanging cloth. You can make such a cover yourself using:
• Cotton, 90 cm wide; 14.95 guilders per meter
• Cotton, 180 cm wide; 27.95 guilders per meter
• Ornamental strip, 2 cm wide; 1.65 guilders per meter
When buying the cotton or the strip, the length is rounded up to the next 10 cm. For instance, if you want 45 cm, you need to buy 50 cm.

1. Marja has a small, round table: height 60 cm; diameter 50 cm. On top of the table, she puts a round cloth with a diameter of 106 cm.
3 points—How high above the ground will the cloth reach?

2. Marja is going to buy cloth to make her own cover. She wants it to reach precisely to the ground. It will be made from one piece of cotton fabric and will be as economical as possible. There will be a hem of 1 cm.
6 points—Compute the amount of cotton Marja will have to buy and how much that will cost.

3. Marja wants an ornamental strip around the edge of the cloth.
4 points—Compute how much ornamental strip she will need and how much it will cost.
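For readers who want to check the arithmetic, here is one possible solution path, sketched in Python. It is an illustration only, not the official marking scheme, and it assumes that the 1 cm hem means 1 cm of extra fabric all around the circle.

    import math

    table_height, table_diameter = 60, 50                     # in cm

    # Part 1 (3 points): a cloth of diameter 106 cm on a 50 cm table.
    overhang = (106 - table_diameter) / 2                      # 28 cm hangs down all around
    height_above_ground = table_height - overhang              # 60 - 28 = 32 cm

    # Part 2 (6 points): a cloth that reaches precisely to the ground.
    finished_diameter = table_diameter + 2 * table_height      # 50 + 120 = 170 cm
    cut_diameter = finished_diameter + 2 * 1                   # 172 cm including the hem

    def round_up_to_10(cm):
        # Lengths are bought in whole multiples of 10 cm.
        return math.ceil(cm / 10) * 10

    # One piece of the 180 cm wide cotton suffices; the 90 cm wide cotton
    # would need two widths sewn together and ends up more expensive.
    length_to_buy = round_up_to_10(cut_diameter)               # 180 cm
    cost_cotton = length_to_buy / 100 * 27.95                  # 1.80 m x 27.95 = 50.31 guilders

    # Part 3 (4 points): ornamental strip around the edge of the finished cloth.
    circumference = math.pi * finished_diameter                # about 534 cm
    strip_to_buy = round_up_to_10(circumference)               # 540 cm
    cost_strip = strip_to_buy / 100 * 1.65                     # 5.40 m x 1.65 = 8.91 guilders

    print(height_above_ground, round(cost_cotton, 2), round(cost_strip, 2))

A grader might, for example, give credit for finding the 170 cm diameter, for handling the hem and the rounding rule, and for the final prices, but exactly how the listed points are split over such steps is a matter of agreement rather than something the item itself dictates.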


This might seem clear, but of course it is not. There are still many possible answers for which the subjective judgment of one teacher might differ from that of another. That is why it is advisable to use intersubjective scoring for external examinations. With intersubjective scoring, at least two teachers score the test independently and then have to come to an agreement. This is a must for high-stakes testing, but it can also be done on a somewhat regular basis in the classroom if teachers coordinate their classroom assessment practices.


Scores are usually on a 100-point scale and are deceptive in the sense that a score of 66 actually means a score from something like 62 to 69 and thus seems more precise than it actually is. But the advantage is that students can check the judgment of the teacher and start a discussion about a score based on clear points of departure.


If so-called “holistic” scoring is used, the clarity is less obvious because there is more room for subjective judgment. Under holistic scoring we group the scoring systems that are often very appropriate for formats such as essays, journals, and projects, but that are nowadays also used for formats that can be scored more easily. As an example we present two higher-grade descriptors for journals from Clarke, Stephens, and Waywood (1992):

A: Makes excellent use of her journal to explore and review the mathematics she is learning. She uses mathematical language appropriately and asks questions to focus and extend her learning. She can think through the difficulties she encounters.

B: Maintains regular entries and is able to record a sequence of ideas. She uses examples to illustrate and further her understanding and is able to use formal language to express ideas but is yet to develop mathematical explorations.

And here are two (abridged) descriptors by Stephens and Money (1993) for extended tasks:
A: Demonstrated high-level skills of organization, analysis, and evaluation in the conduct of the investigation. Used high levels of mathematics appropriate to the task with accuracy.
B: Demonstrated skills of organization, analysis, and evaluation in the conduct of the investigation. Used mathematics appropriate to the task with accuracy.


It is clear that with this latter set of descriptors, subjective judgments pose a greater risk than in the previous example. But for some formats we almost have to use this kind of scoring system. One can of course still use numbers, even on a 100-point scale, for very complex tasks. Exemplary student work, together with how the teacher judged it, can be very helpful. This, too, is part of the assessment contract between teacher and students: students need to know clearly what the teacher values, which may be not so much the correct answers as the reasoning, or the presentation and organization of the solution. But even without exemplary work, experienced teachers are very capable of sensible scoring of more complex tasks, if we are willing to accept the uncertainty behind every grade.


Our own research (de Lange, 1987) on how well teachers are able to score very open-ended tasks without any further help in the form of scoring rubrics showed that the disagreement among teachers grading the same task was acceptable: if we take the average of a series of grades as the “correct” one, 90% of the grades were within 5 points of that grade on a 100-point scale. Other research shows that the ordering of such complex tasks, in particular, can be done with very high reliability (Kitchen, 1993). One example is the scoring system of the Mathematics A-lympiad, a modeling contest for high school students that uses both a kind of holistic scoring (gold, silver, bronze, honorable mention) and an ordering system. Even though the commission that carried out the difficult task of scoring saw many personnel changes over time, agreement on the rank order was consistently high (De Haan & Wijers, 2000).

