Friday 16 August 2013

Assessing Student Performance Using Test Item Analysis and its Relevance

This article discusses classroom assessment and action research, two of the most crucial components of the teaching and learning process, and examines student performance using test item analysis and its relevance.
Classroom assessment and action research are also essential parts of the scholarship of teaching and learning. Action research is an important, recent development in classroom assessment techniques, defined as teacher-initiated classroom research which seeks to increase the teacher’s understanding of classroom teaching and learning and to bring about improvements in classroom practices.
Assessing student performance is very important when the learning goals involve the acquisition of skills that can be demonstrated through action. Many researchers have developed useful theories and taxonomies (for example, Bloom’s taxonomy) on the assessment of academic skills, intellectual development, and cognitive abilities of students, from both the analytical and the quantitative point of view. For details on Bloom’s cognitive taxonomy and its applications, see, for example, Bloom (1956), Ausubel (1968), Bloom et al. (1971), Simpson (1972), Krathwohl et al. (1973), Angelo & Cross (1993), and Mertler (2003), among others. Different kinds of assessments are appropriate in different settings. One of the most important and authentic techniques for assessing and estimating student performance across the full domain of learning outcomes targeted by the instructor is the classroom test. Each item on a test is intended to sample student performance on a particular learning outcome. Thus, creating valid and reliable classroom tests is very important to an instructor for assessing student performance, achievement, and success in the class. The same principle applies to state exit exams and classroom tests conducted by instructors, the state, and other agencies. Moreover, it is important to note that, most of the time, it is not well known whether the test items (e.g., multiple-choice) accompanying textbooks or test-generator software, or constructed by the instructors, have already been tested for their validity and reliability. One powerful technique available to instructors for the guidance and improvement of instruction is test item analysis. It appears from the literature that, in spite of the extensive work on item analysis and its applications, very little attention has been paid to this kind of quantitative study.
What Is Action Research?
The development of the general idea of “action research” began with the work of Kurt Lewin (1946) in his paper entitled “Action Research and Minority Problems,” where he describes action research as “a comparative research on the conditions and effects of various forms of social action and research leading to social action” that uses “a spiral of steps, each of which is composed of a circle of planning, action, and fact-finding about the result of the action”. Further development continued with contributions by many other authors; notable among them are Kemmis (1983), Ebbutt (1985), Hopkins (1985), Elliott (1991), Richards et al. (1992), Nunan (1992), Brown (1994), and Greenwood et al. (1998). For recent developments on the theory of action research and its applications, interested readers are referred to Brydon-Miller et al. (2003), Gustavsen (2003), Dick (2004), Elvin (2004), Barazangi (2006), Greenwood (2007), Taylor & Pettit (2007), and references therein. As cited in Gabel (1995), the following are some of the commonly used definitions of action research:

  • Action Research aims to contribute both to the practical concerns of people in an immediate problematic situation and to the goals of social science by joint collaboration within a mutually acceptable ethical framework. (Rapoport, 1970)

  • Action Research is a form of self-reflective enquiry undertaken by participants in social (including educational) situations in order to improve the rationality and justice of (a) their own social or educational practices, (b) their understanding of these practices, and (c) the situations in which the practices are carried out. It is most rationally empowering when undertaken by participants collaboratively… sometimes in cooperation with outsiders. (Kemmis, 1983)

  • Action Research is the systematic study of attempts to improve educational practice by groups of participants by means of their own practical actions and by means of their own reflection upon the effects of those actions. (Ebbutt, 1985)

In the field of education, the term action research refers to inquiry or research conducted in the context of focused efforts to improve the quality of an educational institution and its performance. Typically, action research is designed and conducted by instructors in their own classes, who analyze the data to improve their own teaching. It can be done by an individual instructor or by a team of instructors as a collaborative inquiry. Action research gives an instructor opportunities to reflect on and assess his/her teaching and its effectiveness by applying and testing new ideas, methods, and educational theory for the purpose of improving teaching, or to evaluate and implement an educational plan. According to Richards et al. (1992), action research is teacher-initiated classroom research which seeks to increase the teacher's understanding of classroom teaching and learning and to bring about improvements in classroom practices. Nunan (1992) defines it as a form of self-reflective inquiry carried out by practitioners, aimed at solving problems, improving practice, or enhancing understanding. According to Brown (1994), “Action research is any action undertaken by teachers to collect data and evaluate their own teaching. It differs from formal research, therefore, in that it is usually conducted by the teacher as a researcher, in a specific classroom situation, with the aim being to improve the situation or teacher rather than to spawn generalizeable knowledge. Action research usually entails observing, reflecting, planning and acting. In its simplest sense, it is a cycle of action and critical reflection, hence the name, action research.”
Item Analysis
Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. It is a valuable, powerful technique available to teaching professionals and instructors for the guidance and improvement of instruction. It enables instructors to increase their test construction skills, identify specific areas of course content which need greater emphasis or clarity, and improve other classroom practices. According to Thompson & Levitov (1985, p. 163), “Item analysis investigates the performance of items considered individually either in relation to some external criterion or in relation to the remaining items on the test.” For example, when norm-referenced tests (NRTs) are developed for instructional purposes, such as placement tests, to assess the effects of educational programs, or for educational research purposes, it can be very important to conduct item and test analyses. Similarly, criterion-referenced tests (CRTs) compare students’ performance to some pre-established criteria or objectives (such as classroom tests designed by the instructors). These analyses evaluate the quality of items and of the test as a whole, and can also be employed to revise and improve both the items and the test as a whole. Many researchers have contributed to the theory of test item analysis; notable among them are Galton, Pearson, Spearman, and Thorndike. For details on these pioneers of test item analysis and their contributions, see, for example, Gulliksen (1987), among others. For recent developments in test item analysis practices, see Crocker & Algina (1986), Sax (1989), Gronlund & Linn (1990), Pedhazur & Schmelkin (1991), Thorndike et al. (1991), Elvin (2003), and references therein.
Classical Test Theory (CTT)
An item analysis involves many statistics that can provide useful information for improving the quality and accuracy of multiple-choice or true/false items (questions). It describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also to regulate and standardize existing tests. Item analysis is an important phase in the development of an exam program. For example, a test or exam consisting of multiple-choice or true-false items is used to determine the proficiency (or ability) level of an examinee in a particular discipline or subject. Most of the time, the test or exam score obtained carries considerable weight in determining whether an examinee has passed or failed the subject. That is, the proficiency (or ability) level of an examinee is estimated using the total test score obtained from the number of correct responses to the test items. If the test score is equal to or greater than a cut-off score, then the examinee is considered to have passed the subject; otherwise, the examinee is considered to have failed.

This approach of using the test score as the proficiency (or ability) estimate is called the true score model (TSM) or classical test theory (CTT) approach. Classical item analysis, based on traditional classical theory models, forms the foundation for looking at the performance of each item in a test. The development of the CTT began with the work of Charles Spearman (1904) in his paper entitled “General intelligence: Objectively determined and measured”. Further development continued with contributions by many researchers; notable among them are Francis Galton (1822–1911), Karl Pearson (1857–1936), and Edward Thorndike (1874–1949) (for details, see, for example, Nunnally, 1967; Gulliksen, 1987; among others). For recent developments on the theory of CTT and its applications, interested readers are referred to Chase (1999), Haladyna (1999), Nitko (2001), Tanner (2001), Oosterhof (2001), Mertler (2003), and references therein. The TSM equation is given by
                                                         X = T + ε,
where X = observed score, T = true score, and ε = random error, with E(X) = T. Note that, in the above TSM equation, the true score reflects the exact value of the examinee's ability or proficiency. Also, the TSM assumes that abilities (or traits) are constant and that the variation in observed scores is caused by random errors, which may result from factors such as guessing, lack of preparation, or stress. Thus, in CTT, all test items and statistics are test-dependent. The trait (or ability) of an examinee is defined in terms of a test, whereas the difficulty of a test item is defined in terms of the group of examinees. According to Hambleton et al. (1991, p. 3), “Examinee characteristics and test item characteristics cannot be separated: each can be interpreted only in the context of the other.” Some important criteria employed in determining the validity of a multiple-choice exam are the following (a brief simulation sketch of the TSM is given after this list):

  •  Whether the test items were too difficult or too easy.
  •  Whether the test items discriminated between those examinees who really knew the material and those who did not.
  •  Whether the incorrect responses to a test item were distractors or non-distractors.
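
To make the TSM equation concrete, here is a minimal simulation sketch; it is not from the original article, and the function name, error spread, and cut-off value are illustrative assumptions. It draws a random error ε for each examinee and applies a cut-off score to the observed score X, as described above.

```python
import random

def simulate_observed_score(true_score, error_sd=3.0):
    """TSM: X = T + e, where e is a random error (guessing, stress, etc.)."""
    return true_score + random.gauss(0, error_sd)

# Hypothetical example: five examinees with known true scores on a
# 100-point test, and an assumed passing cut-off score of 60.
true_scores = [45, 58, 70, 82, 95]
cutoff = 60

random.seed(1)
for t in true_scores:
    x = simulate_observed_score(t)
    print(f"true = {t}, observed = {x:.1f} -> {'pass' if x >= cutoff else 'fail'}")
```

Because ε is random, an examinee whose true score lies near the cut-off may pass on one administration and fail on another, which is precisely why CTT statistics are test-dependent.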

Item Analysis Statistics
The statistics involved in an item analysis provide useful information for determining the validity of multiple-choice or true/false items and for improving their quality and accuracy. These statistics are used to measure the ability levels of examinees from their responses to each item.
(I) Item Difficulty: Item difficulty is a measure of the difficulty of an item. For items (that is, multiple-choice questions) with one correct alternative worth a single point, the item difficulty (also known as the item difficulty index, or the difficulty level index, or the difficulty factor, or the item facility index, or the item easiness index, or the p-value) is defined as the proportion of respondents (examinees) selecting the answer to the item correctly, and is given by
p = c / n,
where p = the difficulty factor, c = the number of respondents selecting the correct answer to an item, and n = the total number of respondents. Item difficulty is relevant for determining whether students have learned the concept being tested. It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not. Note that
(i) 0 ≤ p ≤ 1.
(ii) A higher value of p indicates a lower difficulty level, that is, the item is easy; a lower value of p indicates a higher difficulty level, that is, the item is difficult. In general, an ideal test should have an overall item difficulty of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from 0.2 to 0.8). In a criterion-referenced test (CRT), with its emphasis on mastery-testing of the topics covered, the optimal value of p for many items is expected to be 0.90 or above. On the other hand, in a norm-referenced test (NRT), with its emphasis on discriminating between different levels of achievement, p ≈ 0.50. For details, see, for example, Chase (1999), among others.
(iii) To maximize item discrimination, the ideal (or moderate, or desirable) item difficulty level, denoted pM, is defined as the point midway between the probability of success pS of answering the multiple-choice item correctly by chance (that is, 1.00 divided by the number of choices) and a perfect score (that is, 1.00) for the item, and is given by pM = pS + (1 − pS) / 2.
(iv) Thus, using the above formula in (iii), ideal (or moderate, or desirable) item difficulty levels for multiple-choice items can be easily calculated; these are provided in the following table (for details, see, for example, Lord, 1952; among others).

Number of choices    Chance success (pS)    Ideal difficulty (pM)
2 (true-false)       0.50                   0.75
3                    0.33                   0.67
4                    0.25                   0.625
5                    0.20                   0.60
(Ia) Mean Item Difficulty (or Mean Item Easiness): Mean item difficulty is the average difficulty (or easiness) of all test items. It is an overall measure of the test difficulty and, for classroom achievement tests, ideally ranges between 60% and 80% (that is, 0.60 ≤ p ≤ 0.80). Lower values indicate a difficult test, while higher values indicate an easy test.
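
As an illustration, the following minimal sketch (the response data and function names are hypothetical, not from the original article) computes the difficulty factor p = c/n for each item, the mean item difficulty, and the ideal difficulty level pM for a given number of choices.

```python
# 1 = correct response, 0 = incorrect; rows are examinees, columns are items
# (hypothetical response data for 6 examinees on 4 items)
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]

def item_difficulty(scores):
    """p = c / n: proportion of respondents answering the item correctly."""
    return sum(scores) / len(scores)

def ideal_difficulty(num_choices):
    """pM = pS + (1 - pS) / 2, where pS = 1 / number of choices."""
    p_s = 1.0 / num_choices
    return p_s + (1.0 - p_s) / 2.0

n_items = len(responses[0])
p_values = [item_difficulty([row[j] for row in responses]) for j in range(n_items)]
print("item difficulties:", [round(p, 2) for p in p_values])
print("mean item difficulty:", round(sum(p_values) / n_items, 2))
print("ideal difficulty for a 4-choice item:", ideal_difficulty(4))  # 0.625
```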
(II) Item Discrimination: The item discrimination (or the item discrimination index) is a basic measure of the validity of an item. It is defined as the discriminating power, or the degree of an item's ability to discriminate (or differentiate), between high achievers (that is, those who scored high on the total test) and low achievers (that is, those who scored low), where both groups are determined on the same criterion: (1) an internal criterion, for example, the test itself; or (2) an external criterion, for example, an intelligence test or another achievement test. Further, the computation of the item discrimination index assumes that the distribution of test scores is normal and that there is a normal distribution underlying the right-or-wrong dichotomy of a student’s performance on an item.
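
The article does not fix a single computational formula for the index; one common variant (an assumption here, not necessarily the author's choice) is the upper/lower-group index D = (U − L) / n_g, where U and L are the numbers of correct responses to the item in the top and bottom scoring groups (often 27% each) and n_g is the group size. A minimal sketch:

```python
def discrimination_index(item_scores, total_scores, group_frac=0.27):
    """D = (U - L) / n_g using upper/lower groups ranked by total test score.

    item_scores: 1/0 per examinee for one item
    total_scores: total test score per examinee (same order)
    """
    n = len(item_scores)
    n_g = max(1, int(round(group_frac * n)))
    # rank examinees by total score, highest first
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = sum(item_scores[i] for i in ranked[:n_g])
    lower = sum(item_scores[i] for i in ranked[-n_g:])
    return (upper - lower) / n_g

# hypothetical data: the item separates high and low scorers fairly well
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [95, 90, 88, 80, 76, 60, 55, 50, 45, 40]
print(round(discrimination_index(item, totals), 2))  # 0.67
```

D ranges from −1 to +1; values near +1 indicate that the item separates high and low achievers well, while values near zero (or negative) suggest the item needs revision.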
(III) Test Item Distractor Analysis: Distractor analysis is an important and useful component of test item analysis. A test item distractor is defined as an incorrect response option in a multiple-choice test item. Research suggests a relationship between the quality of the distractors in a test item and student performance on that item, which in turn affects the student's total test score. The performance of these incorrect response options can be determined through a distractor analysis frequency table, which contains the frequency (that is, the number of students) selecting each incorrect option.
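
Such a frequency table can be produced in a few lines; the following sketch (hypothetical option labels and response data) counts how often each incorrect option was selected.

```python
from collections import Counter

# hypothetical responses to one multiple-choice item; the keyed answer is "B"
responses = ["B", "A", "C", "B", "D", "A", "B", "A", "B", "C", "A", "B"]
key = "B"

distractor_counts = Counter(r for r in responses if r != key)
print("correct:", responses.count(key))
for option, count in sorted(distractor_counts.items()):
    print(f"distractor {option}: {count}")
# A distractor chosen by almost no one is a non-functioning option
# and is a candidate for revision.
```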
It is hoped that the present study will be helpful in recognizing the most critical pieces of the state exit test item data, and in evaluating whether or not a test item needs revision. The methods discussed in this project can be used to describe the relevance of test item analysis to classroom tests. These procedures can also be used or modified to measure, describe, and improve tests or surveys such as college mathematics placement exams (that is, the CPT), mathematics study skills inventories, attitude surveys, test anxiety measures, information literacy assessments, and other general education learning outcomes.
Further research, based on Bloom’s cognitive taxonomy of test items, the applicability of Beta-Binomial models and Bayesian analysis of test items, and item response theory (IRT) using the 1-parameter logistic model (also known as the Rasch model), 2- and 3-parameter logistic models, plots of the item characteristic curves (ICCs) of different test items, and other characteristics of IRT measurement instruments, is under investigation and will be reported at an appropriate time.
Finally, this action research project has given me new directions about the needs of my students and other mathematics classes. It has helped me learn about their learning styles, individual differences, and abilities. It has also given me insight into constructing valid and reliable tests and exams for greater student success and achievement in my math classes. This action research project has provided me with inputs to coordinate with my colleagues in mathematics and other disciplines to identify methods to improve classroom practices through test item analysis and action research, in order to enhance student success and achievement in the class and, later, in their lives, which are also the goals of the MDC QEP and General Education Learning Outcomes.
