Sunday 20 October 2013

The reliability and validity of the Cognitive Styles Analysis

Given that reliability is a necessary condition for any kind of validity, Riding’s claims about the validity of CSA were immediately open to doubt. Riding (1997, 1998) correctly contends that the construct validity of a test requires that the different dimensions of the test be independent of each other, correlated with similar constructs, and separate from other factors such as intelligence, personality, and gender. In those articles, Riding provides some evidence of the construct validity of CSA. Although Riding (1997) cites many research studies to support the validity of the CSA test, these reports appear to be mostly qualitative and fail to provide validity coefficients for CSA.


One criterion of the validity of a test is the lack of correlation between its scores and scores on unrelated constructs. Many of the research studies on CSA reported by Riding (1997) show this differential validity (a lack of relationship between the cognitive styles and other constructs such as intelligence, gender, and personality). Lack of correlation with these constructs, however, does not by itself establish the construct validity of CSA. Differential validity should always be followed up with evidence of concurrent, predictive, or criterion-related validity. Moreover, low correlation coefficients between the scores for separate constructs may be more a function of low reliabilities than of the independence of the constructs. Finally, Riding’s (1998) claim that CSA is independent of intelligence needs to be tested in future research. The second and third parts of CSA appear similar to some intelligence and aptitude tests, such as the Multidimensional Aptitude Battery and the Gottschaldt Figures Test.
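
To illustrate the point about low reliabilities, the classical correction for attenuation can be applied; the following minimal Python sketch uses hypothetical reliability and correlation values chosen only for illustration, not figures taken from the CSA literature.

import math

# Hypothetical values, for illustration only.
r_observed = 0.20   # observed correlation between two construct measures
r_xx = 0.40         # reliability of the first measure
r_yy = 0.50         # reliability of the second measure

# Classical correction for attenuation: the correlation that would be expected
# if both measures were perfectly reliable.
r_corrected = r_observed / math.sqrt(r_xx * r_yy)
print(round(r_corrected, 2))   # 0.45: an apparently "low" observed correlation can hide a sizeable true one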


Furthermore, correlations between the styles and physiological factors, or between the styles and performance, as reported in some studies, cannot be interpreted meaningfully if the test is not reliable (Boles & Pillary, 1999; Riding & Grimley, 1999). As mentioned earlier, no reports of the reliability of the test were found in the literature. This may be partly due to the fact that Riding (1998) recommends a 6-month to 1-year interval for a test–retest reliability measurement.


It is important to note that the structure and the theoretical framework of CSA and the validity of its predecessors (Paivio, 1971) are not being questioned here. The results of this study indicated that the current version of CSA is not sufficiently reliable to be used as a foundation for designing customized instruction. However, it is possible that under a more controlled environment, better scores of reliability might be obtained. Such a result is expected because the structure and theoretical framework of the test appear to be well founded. Accordingly, in the third experiment, it was observed that the reliability of the test improved in comparison with the first and the second experiments. However, since the groups were not randomly assigned and the sample sizes were relatively small, a test of significance could be misleading here. Since reliability is a necessary condition for the validity of any test, the authors strongly recommend that there be a re-investigation of the reliability and validity of a revised version of CSA based on the following recommendations.


In this article, I would like to discuss some limitations of CSA and provide some suggestions to improve the validity and reliability of the test. As mentioned earlier, CSA has three subtests that measure four scales (two dimensions) of cognitive style. The computer records the average latency of the responses and calculates the Wholist-Analytic and Verbalizer-Imager ratios; only the overall ratios are reported by the software at the end of the test. Based on the structure and administration guides of CSA, and on the authors’ observations in the present project, several issues are raised and suggestions presented below.

(1) In the administration guidelines of CSA, the administrators are warned against giving the users a clue that their scores depend on their differential speed. The theoretical model on which CSA is based requires that the test be conducted with the person processing information in a comfortable, relaxed state. Therefore, it is argued that if response speed were stressed, the way in which the individual processed information would probably change. Although this argument seems legitimate, based on the findings of the present study it may be better to inform the subjects that their overall speed (not their differential speed on the different subtests) is important. A true experimental design was not possible because the initial sample was not accessible in the follow-up investigations. Therefore, any test of significance between the three experiments could be misleading.

(2) Generally speaking, the ‘‘mean’’ (average) is considered a better index of central tendency than the ‘‘median’’. However, if there are a few extreme scores in the list, it is better to use the median rather than the mean. In CSA, in order to measure a student’s location on each dimension, the computer records the response time to each statement. The program then calculates the mean reaction time for each of the four scales and, finally, the Verbal-Imagery and Wholist-Analytic ratios. The user’s score on each dimension therefore depends on the user’s differential speed in answering each set of items; for example, a low ratio indicates a Verbalizer and a high ratio an Imager. In this study, some outliers were observed. Since the CSA ‘‘Result File’’ does not show the reaction time to individual items, it is not possible for the investigator to delete the extreme reaction times from the calculation of the ratios. As a result, it is impossible to tell whether an extreme ratio is the result of an extreme score on a particular item, or whether the person is really at the extreme on that dimension of CSA, as the sketch below illustrates. In future versions of CSA, it is recommended that the ‘‘Result File’’ provide the investigators with the response status (correct or incorrect) and the reaction time for every item of the test.
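
The following minimal Python sketch, using invented reaction times and a simplified stand-in for CSA’s actual scoring, shows how a single extreme latency can shift a mean-based ratio while a median-based ratio remains stable.

from statistics import mean, median

# Invented reaction times in seconds (not real CSA data).
verbal_rt  = [1.1, 1.3, 1.2, 1.4, 1.2, 1.3]
imagery_rt = [1.0, 1.2, 1.1, 9.5, 1.1, 1.2]   # one extreme latency, e.g. a lapse of attention

# Simplified Verbal-Imagery ratio: verbal latency divided by imagery latency
# (a low ratio suggesting a Verbalizer, a high ratio an Imager).
print(round(mean(verbal_rt) / mean(imagery_rt), 2))     # 0.5  -- dragged down by the single outlier
print(round(median(verbal_rt) / median(imagery_rt), 2)) # 1.09 -- robust to the outlier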

(3) One limitation of CSA may be related to the content validity of the two scales of Verbal and Imagery. Theoretically, it is accepted that Imagers will respond faster to the Color questions (i.e., questions that ask the user whether two items are the same color). Similarly, it is expected that Verbalizers will respond faster to the Type questions (i.e., questions that ask the user whether two items are the same type). However, it should be noted that users need to read the statement first, and all the statements are presented in sentence form. This would help Verbalizers spend less time comprehending the sentence and, therefore, reduce their reaction time to these items as well. Furthermore, the difference between Imagers and non-Imagers may not be limited to differences in their ability to judge the colors of two items. There are several other indices, not covered in CSA, that could be used to differentiate between Imagers and non-Imagers. For example, objects differ in size, shape, dimensions, and other attributes. Furthermore, the ability to see the difference between various perspectives of a particular item is another aspect of the Imagery scale. None of these visual characteristics are considered in the design of CSA. To increase the validity, and perhaps the reliability, of the test, it is suggested that some of the present questions be replaced with ones that compare items from other perspectives. For instance, questions from the mental imagery test developed by Shepard and Metzler (1971) might be useful in this regard. Other kinds of shape items or mental rotation items could also be investigated (item analysis), either separately or within the CSA structure.

(4) Another limitation of CSA is that the number of questions (statements) in each category is too limited. Generally, the length of a test is positively correlated with its reliability. This general rule is particularly applicable to CSA: since CSA scores depend on very short reaction times, the scores are highly sensitive to extraneous variables such as sitting position, keyboard position, and various physiological and psychological factors. In order to reduce the effects of the random errors due to these factors, it is recommended that the number of items in each section be increased. Since the test is administered on a computer, increasing the number of items requires adding only a few minutes to the test length; however, this change could significantly improve the reliability, and hence the potential validity, of the test, as the sketch below suggests. Although adding too many items could cause fatigue and boredom, observations in this research project suggest that the test is sufficiently interesting to keep the user’s attention for at least 20–30 min.
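
The expected gain from lengthening the test can be estimated with the Spearman-Brown prophecy formula; the starting reliability in this Python sketch is hypothetical, and the formula assumes that any added items are parallel to the existing ones.

def spearman_brown(reliability, length_factor):
    # Predicted reliability after lengthening a test by the given factor,
    # assuming the added items are parallel to the existing ones.
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

r_current = 0.50                 # hypothetical current reliability of a CSA subscale
for k in (1.5, 2, 3):            # subscale lengthened 1.5x, 2x, and 3x
    print(k, round(spearman_brown(r_current, k), 2))   # 0.6, 0.67, 0.75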

(5) In CSA, the user is required to push a red or a blue button in order to answer a question: red indicates a correct statement and blue indicates a wrong statement. The problem is that this mapping directly contradicts everyday conventions, in which red is usually associated with danger, stopping, or something incorrect. During the administration of the test, a number of subjects mentioned this problem. It is possible that associating the red button with wrong statements would improve the validity and reliability of the test. It is also possible that using same-colored or uncolored buttons would work even better.

(6) Some features of the test are influenced by cultural differences. In other words, there is no universal consensus on the right answer to some of the CSA questions. For example, in the test, fire engines are assumed to be red, while in some jurisdictions (e.g., in Canada) they are yellow. Furthermore, according to some subjects in the study, paired items such as Ice & Glass, Canary & Sun, Omelette & Waffle, Cream & Paper, or Leaf & Cucumber are not necessarily the same color. A greater consensus was observed among participants on the Type questions than on the Color questions. It is therefore suggested that a pilot study be performed to find less controversial pairs.

(7) As mentioned earlier, the software records the response time to each statement, but only the overall ratios are displayed on the screen at the end of the test. The user’s response times to individual items therefore cannot be investigated by the researcher, and CSA does not let investigators examine the internal consistency or parallel-form reliability of the test. Further investigation would require a revised version that provides the reaction time for each individual item. Only then can one examine the internal consistency of the test (as sketched below); it would also become possible to construct a parallel form and examine the parallel-form reliability of the test. Given the importance of cognitive style evaluation in educational and other professional settings, investment in such an investigation seems essential.
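
As a sketch of what becomes possible once item-level times are exported, the following Python fragment computes an odd-even split-half coefficient with the Spearman-Brown correction; the per-item reaction times are invented, since the current ‘‘Result File’’ does not provide them.

from statistics import mean, pstdev

def pearson(x, y):
    # Pearson correlation between two equal-length lists.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Invented per-item reaction times (rows = persons, columns = items).
rts = [
    [1.2, 1.4, 1.1, 1.5, 1.3, 1.6],
    [0.9, 1.0, 0.8, 1.1, 0.9, 1.2],
    [1.6, 1.8, 1.5, 1.9, 1.7, 2.0],
    [1.1, 1.2, 1.0, 1.3, 1.1, 1.4],
]

# Odd-even split-half: total latency on odd-numbered vs. even-numbered items.
odd  = [sum(row[0::2]) for row in rts]
even = [sum(row[1::2]) for row in rts]
r_half = pearson(odd, even)

# Spearman-Brown step-up to estimate the full-length reliability.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))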

(8) All the experiments showed that the reliability of the Wholist-Analytic dimension was higher than that of the Verbal-Imagery dimension. This might be due simply to the fact that the Wholist-Analytic questions are more difficult than the Verbal-Imagery items. Generally, in easy tests almost all scores are very high, so the range is limited. The correlation between two variables, however, depends on the variability (variance) of the scores on each variable: the lower the variance, the less chance there is of obtaining a high correlation coefficient. The lower reliability of the Verbal-Imagery dimension could therefore be attributed to the lower level of difficulty of the items on this scale. Indeed, the variance of scores on the Wholist-Analytic dimension was observed to be higher in all three experiments (0.14, 0.13, and 0.22, respectively) than on the Verbal-Imagery dimension (0.05, 0.03, and 0.12, respectively); the simulation below illustrates the effect of such restricted variability on a correlation. Replacing the current Verbal-Imagery items in CSA with the kinds of items suggested earlier (suggestion No. 3) may increase the variability of the scores on this scale and may increase the reliability of the test. In that case, the authors would also need to construct items with a similar level of difficulty for the Verbalizers.
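
A small Python simulation (all numbers artificial) illustrates the general point: the same underlying relationship produces a markedly weaker correlation when scores are confined to a narrow range.

import random

random.seed(1)

# Two parallel measurements of the same underlying trait, with noise.
trait = [random.gauss(0, 1) for _ in range(2000)]
test1 = [t + random.gauss(0, 0.7) for t in trait]
test2 = [t + random.gauss(0, 0.7) for t in trait]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Full-range correlation vs. the correlation among people whose first score
# falls in a narrow band (i.e., restricted variance).
band = [(a, b) for a, b in zip(test1, test2) if -0.5 < a < 0.5]
print(round(pearson(test1, test2), 2))                                # about 0.67
print(round(pearson([a for a, _ in band], [b for _, b in band]), 2))  # markedly lower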


Overall, CSA appears to have a strong theoretical basis, but the program itself lacked reliability in the test–retest trials reported here. The suggestions provided above may serve to improve the reliability and validity of CSA so that it can be employed meaningfully in educational research.
