Supplement to Critical Thinking

Assessment

How can one assess, for purposes of instruction or research, the degree to which a person possesses the dispositions, skills and knowledge of a critical thinker?

In psychometrics, assessment instruments are judged according to their validity and reliability.

Roughly speaking, an instrument is valid if it measures accurately what it purports to measure, given standard conditions. More precisely, the degree of validity is “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (American Educational Research Association 2014: 11). In other words, a test is not valid or invalid in itself. Rather, validity is a property of an interpretation of a given score on a given test for a specified use. Determining the degree of validity of such an interpretation requires collection and integration of the relevant evidence, which may be based on test content, test takers’ response processes, a test’s internal structure, relationship of test scores to other variables, and consequences of the interpretation (American Educational Research Association 2014: 13–21). Criterion-related evidence consists of correlations between scores on the test and performance on another test of the same construct; its weight depends on how well supported the assumption is that the other test can be used as a criterion. Content-related evidence is evidence that the test covers the full range of abilities that it claims to test. Construct-related evidence is evidence that a correct answer reflects good performance of the kind being measured and an incorrect answer reflects poor performance.
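
To make the idea of criterion-related evidence concrete, here is a minimal Python sketch using entirely synthetic data and hypothetical variable names (it implements no actual instrument): it correlates scores on a new test with scores on an established test of the same construct.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    latent = rng.normal(50, 10, size=200)            # the construct being measured
    new_test = latent + rng.normal(0, 5, size=200)   # scores on the new instrument
    criterion = latent + rng.normal(0, 5, size=200)  # scores on an established test

    # Criterion-related evidence: correlation between the two sets of scores
    r, p = stats.pearsonr(new_test, criterion)
    print(f"criterion correlation: r = {r:.2f}, p = {p:.1e}")

A high correlation counts as evidence for a proposed interpretation only to the extent that the assumption that the criterion test itself measures the construct is well supported.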

An instrument is reliable if it consistently produces the same result, whether across different forms of the same test (parallel-forms reliability), across different items (internal consistency), across different administrations to the same person (test-retest reliability), or across ratings of the same answer by different people (inter-rater reliability). Internal consistency should be expected only if the instrument purports to measure a single undifferentiated construct, and thus should not be expected of a test that measures a suite of critical thinking dispositions or critical thinking abilities, assuming that some people are better in some of the respects measured than in others (for example, very willing to inquire but rather closed-minded). Otherwise, reliability is a necessary but not a sufficient condition of validity; a standard example of a reliable instrument that is not valid is a bathroom scale that consistently under-reports a person’s weight.
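
The following sketch illustrates two of these reliability estimates on synthetic data: test-retest reliability as a correlation between two administrations to the same people, and internal consistency as Cronbach’s alpha. The alpha formula is standard; everything else (sample sizes, score distributions) is invented for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Test-retest reliability: the same 100 people take the test twice.
    first = rng.normal(70, 8, size=100)
    second = first + rng.normal(0, 3, size=100)  # scores drift only slightly on retest
    print("test-retest r =", round(stats.pearsonr(first, second)[0], 2))

    def cronbach_alpha(items):
        """Internal consistency for a respondents-by-items matrix of item scores."""
        k = items.shape[1]
        sum_of_item_variances = items.var(axis=0, ddof=1).sum()
        variance_of_totals = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - sum_of_item_variances / variance_of_totals)

    # 100 respondents answering 10 items that all reflect one underlying trait.
    trait = rng.normal(0, 1, size=(100, 1))
    items = trait + rng.normal(0, 1, size=(100, 10))
    print("Cronbach's alpha =", round(cronbach_alpha(items), 2))

As the paragraph notes, a low alpha on an inventory of distinct dispositions may signal genuine multidimensionality rather than unreliability.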

Assessing dispositions is difficult if one uses a multiple-choice format with known adverse consequences of a low score. It is easy enough to tell what answer to the question “How open-minded are you?” will get the highest score and to give that answer, even if one knows that the answer is incorrect. If an item probes less directly for a critical thinking disposition, for example by asking how often the test taker pays close attention to views with which the test taker disagrees, the answer may differ from reality because of self-deception or simple lack of awareness of one’s personal thinking style, and its interpretation is problematic, even if factor analysis enables one to identify a distinct factor measured by a group of questions that includes this one (Ennis 1996).

Nevertheless, Facione, Sánchez, and Facione (1994) used this approach to develop the California Critical Thinking Dispositions Inventory (CCTDI). They began with 225 statements expressive of a disposition towards or away from critical thinking (using the long list of dispositions in Facione 1990a), validated the statements with talk-aloud and conversational strategies in focus groups to determine whether people in the target population understood the items in the way intended, administered a pilot version of the test with 150 items, and eliminated items that failed to discriminate among test takers, were inversely correlated with overall results, or added little refinement to overall scores (Facione 2000). They used item analysis and factor analysis to group the measured dispositions into seven broad constructs: open-mindedness, analyticity, cognitive maturity, truth-seeking, systematicity, inquisitiveness, and self-confidence (Facione, Sánchez, and Facione 1994). The resulting test consists of 75 agree-disagree statements and takes 20 minutes to administer. A recurring and disturbing finding is that North American students taking the test tend to score low on the truth-seeking sub-scale, on which a low score results from agreeing with statements such as the following: “To get people to agree with me I would give any reason that worked”; “Everyone always argues from their own self-interest, including me”; “If there are four reasons in favor and one against, I’ll go with the four”.

Development of the CCTDI made it possible to test whether good critical thinking abilities and good critical thinking dispositions go together, in which case it might be enough to teach one without the other. Facione (2000) reports that administration of the CCTDI and the California Critical Thinking Skills Test (CCTST) to almost 8,000 post-secondary students in the United States revealed a statistically significant but weak correlation between total scores on the two tests, and also between paired sub-scores from the two tests. The implication is that both abilities and dispositions need to be taught, since one cannot expect improvement in one to bring with it improvement in the other.
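
The step from “statistically significant but weak” to “both need to be taught” turns on the distinction between statistical significance and effect size, which a rough calculation makes vivid. The correlation used below is an assumed illustration, not the figure Facione (2000) reports.

    import math
    from scipy import stats

    n = 8000  # roughly the number of students tested
    r = 0.20  # hypothetical weak correlation, not the reported value

    # Standard significance test for a Pearson correlation coefficient
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(t, df=n - 2)
    print(f"p = {p:.1e}; variance shared by the two tests = {r ** 2:.0%}")

With a sample this large, even a weak correlation is overwhelmingly significant; yet on the assumed figure, dispositions would share only about 4% of their variance with skills, so strength in one gives little assurance of strength in the other.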

A more direct way of assessing critical thinking dispositions would be to see what people do when put in a situation where the dispositions would reveal themselves. Ennis (1996) reports promising initial work with guided open-ended opportunities to give evidence of dispositions, but no standardized test seems to have emerged from this work. There are, however, standardized aspect-specific tests of critical thinking dispositions. The Critical Problem Solving Scale (Berman et al. 2001: 518) takes as a measure of the disposition to suspend judgment the number of distinct good aspects attributed to an option judged to be the worst among those generated by the test taker (a sketch of this scoring rule appears below). Stanovich, West and Toplak (2011: 800–810) list tests developed by cognitive psychologists of the following dispositions: resistance to miserly information processing, resistance to myside thinking, absence of irrelevant context effects in decision-making, actively open-minded thinking, valuing reason and truth, tendency to seek information, objective reasoning style, tendency to seek consistency, sense of self-efficacy, prudent discounting of the future, self-control skills, and emotional regulation.
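
The Critical Problem Solving Scale’s scoring rule, as described above, is simple enough to state in code. The data structure and example options below are invented for illustration; they are not taken from Berman et al. (2001).

    # Hypothetical sketch: the suspend-judgment score is the number of distinct
    # good aspects the test taker attributes to the option they rated worst.
    options = {
        "ban cars downtown": {"rating": 1, "good_aspects": {"cleaner air", "safer streets"}},
        "add bike lanes": {"rating": 4, "good_aspects": {"cheap to build"}},
        "expand bus service": {"rating": 3, "good_aspects": {"serves suburbs", "runs at night"}},
    }
    worst_option = min(options.values(), key=lambda option: option["rating"])
    score = len(worst_option["good_aspects"])
    print("suspend-judgment score:", score)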

It is easier to measure critical thinking skills or abilities than to measure dispositions. The following nine currently available standardized tests purport to measure them: the Watson-Glaser Critical Thinking Appraisal (Watson & Glaser 1980a, 1980b, 1994), the Cornell Critical Thinking Tests Level X and Level Z (Ennis & Millman 1971; Ennis, Millman, & Tomko 1985, 2005), the Ennis-Weir Critical Thinking Essay Test (Ennis & Weir 1985), the California Critical Thinking Skills Test (Facione 1990b, 1992), the Halpern Critical Thinking Assessment (Halpern 2016), the Critical Thinking Assessment Test (Center for Assessment & Improvement of Learning 2017), the Collegiate Learning Assessment (Council for Aid to Education 2017), the HEIghten Critical Thinking Assessment (https://territorium.com/heighten/), and a suite of critical thinking assessments for different groups and purposes offered by Insight Assessment (https://www.insightassessment.com/products). The Critical Thinking Assessment Test (CAT) is unique among them in being designed for use by college faculty to help them improve their development of students’ critical thinking skills (Haynes et al. 2015; Haynes & Stein 2021). Also, for some years the United Kingdom body OCR (Oxford Cambridge and RSA Examinations) awarded AS and A Level certificates in critical thinking on the basis of an examination (OCR 2011).

Many of these standardized tests have received scholarly evaluations from, among others, Ennis (1958), McPeck (1981), Norris and Ennis (1989), Fisher and Scriven (1997), Possin (2008, 2013a, 2013b, 2013c, 2014, 2020), and Hatcher and Possin (2021). Their evaluations provide a useful set of criteria that such tests ideally should meet, as does the description by Ennis (1984) of problems in testing for competence in critical thinking: the soundness of multiple-choice items; the clarity and soundness of instructions to test takers; the information and mental processing used in selecting an answer to a multiple-choice item; the role of background beliefs and ideological commitments in selecting an answer to a multiple-choice item; the tenability of a test’s underlying conception of critical thinking and its component abilities; the set of abilities that the test manual claims are covered by the test; the extent to which the test actually covers these abilities; the appropriateness of the weighting given to various abilities in the scoring system; the accuracy and intellectual honesty of the test manual; the interest of the test to the target population of test takers; the scope for guessing; the scope for choosing a keyed answer by being test-wise; precautions against cheating in the administration of the test; the clarity and soundness of materials for training essay graders; inter-rater reliability in grading essays; and the clarity and soundness of advance guidance to test takers on what is required in an essay.

Rear (2019) has challenged the use of standardized tests of critical thinking as a way to measure educational outcomes, on the grounds that they (1) fail to take into account disputes about conceptions of critical thinking, (2) are not completely valid or reliable, and (3) fail to evaluate skills used in real academic tasks. He proposes instead assessments based on discipline-specific content.

There are also aspect-specific standardized tests of critical thinking abilities. Stanovich, West and Toplak (2011: 800–810) list tests of probabilistic reasoning, insights into qualitative decision theory, knowledge of scientific reasoning, knowledge of rules of logical consistency and validity, and economic thinking. They also list instruments that probe for irrational thinking, such as superstitious thinking, belief in the superiority of intuition, over-reliance on folk wisdom and folk psychology, belief in “special” expertise, financial misconceptions, overestimation of one’s introspective powers, dysfunctional beliefs, and a notion of self that encourages egocentric processing. They regard these tests, along with the previously mentioned tests of critical thinking dispositions, as the building blocks for a comprehensive test of rationality, whose development (they write) may be logistically difficult and would require millions of dollars.

A superb example of assessment of an aspect of critical thinking ability is the Test on Appraising Observations (Norris & King 1983, 1985, 1990a, 1990b), which was designed for classroom administration to senior high school students. The test focuses entirely on the ability to appraise observation statements, and in particular on the ability to determine, in a specified context, which of two statements there is more reason to believe. According to the test manual (Norris & King 1985, 1990b), a person’s score on the multiple-choice version of the test, which is simply the number of items answered correctly, can justifiably be given either a criterion-referenced or a norm-referenced interpretation.

On a criterion-referenced interpretation, those who do well on the test have a firm grasp of the principles for appraising observation statements, and those who do poorly have a weak grasp of them. This interpretation can be justified by the content of the test and the way it was developed, which incorporated a method of controlling for background beliefs articulated and defended by Norris (1985). Norris and King synthesized from judicial practice, psychological research, and common-sense psychology 31 principles for appraising observation statements, in the form of empirical generalizations about tendencies, such as the principle that observation statements tend to be more believable than inferences based on them (Norris & King 1984). They constructed items in which exactly one of the 31 principles determined which of two statements was more believable. Using a carefully constructed protocol, they interviewed about 100 students who responded to these items in order to determine the thinking that led them to choose the answers they did (Norris & King 1984). In several iterations of the test, they adjusted items so that selection of the correct answer generally reflected good thinking and selection of an incorrect answer reflected poor thinking. They thus have good evidence that good performance on the test is due to good thinking about observation statements and that poor performance is due to poor thinking about observation statements.

Collectively, the 50 items on the final version of the test require application of 29 of the 31 principles for appraising observation statements: 13 principles are tested by one item each, 12 by two items, 3 by three items, and 1 by four items (13 + 24 + 9 + 4 = 50 items; 13 + 12 + 3 + 1 = 29 principles). There is thus comprehensive coverage of the principles for appraising observation statements. Fisher and Scriven (1997: 135–136) judge the items to be well worked and sound, with one exception. The test is written at a grade 6 reading level, so poor performance cannot be attributed to difficulties in reading comprehension on the part of the intended adolescent test takers. The stories that frame the items are realistic, and are engaging enough to stimulate test takers’ interest. The most plausible explanation of a given score on the test is therefore that it reflects roughly the degree to which the test taker can apply principles for appraising observations in real situations. In other words, there is good justification for the proposed interpretation that those who do well on the test have a firm grasp of the principles for appraising observation statements and those who do poorly have a weak grasp of them.

To get norms for performance on the test, Norris and King arranged for seven groups of high school students in different types of communities and with different levels of academic ability to take the test. The test manual includes percentiles, means, and standard deviations for each of these seven groups. These norms allow teachers to compare the performance of their class on the test to that of a similar group of students.
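
A norm-referenced reading of a raw score can be sketched as follows. The norm-group scores below are synthetic stand-ins, not Norris and King’s published norms.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    # Synthetic norm group: 500 students' scores out of 50 items.
    norm_group = np.clip(rng.normal(32, 6, size=500), 0, 50).round()

    raw_score = 38  # a student's number of items answered correctly
    percentile = stats.percentileofscore(norm_group, raw_score)
    z = (raw_score - norm_group.mean()) / norm_group.std(ddof=1)
    print(f"score {raw_score}: percentile {percentile:.0f}, z = {z:+.2f}")

In practice a teacher would compare a class against the norms for the most similar of the seven groups rather than against a single pooled distribution.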
