Teaching critical thinking
Edited by Samuel C. Silverstein, College of Physicians and Surgeons, New York, NY, and accepted by the Editorial Board July 14, 2015 (received for review March 17, 2015)
Significance
Understanding and thinking critically about scientific evidence is a crucial skill in the modern world. We present a simple learning framework that employs cycles of decisions about making and acting on quantitative comparisons between datasets or between data and models. With opportunities to improve the data or models, this structure is appropriate for use in any data-driven science-learning setting. This structure led to significant and sustained improvement in students’ critical thinking behaviors compared with a control group, with effects far larger than mere statistical significance would suggest.
Abstract
The ability to make decisions based on data, with its inherent uncertainties and variability, is a complex and vital skill in the modern world. The need for such quantitative critical thinking occurs in many different contexts, and although it is an important goal of education, that goal is seldom being achieved. We argue that the key element for developing this ability is repeated practice in making decisions based on data, with feedback on those decisions. We demonstrate a structure for providing suitable practice that can be applied in any instructional setting that involves the acquisition of data and relating that data to scientific models. This study reports the results of applying that structure in an introductory physics laboratory course. Students in an experimental condition were repeatedly instructed to make and act on quantitative comparisons between datasets, and between data and models, an approach that is common to all science disciplines. These instructions were slowly faded across the course. After the instructions had been removed, students in the experimental condition were 12 times more likely to spontaneously propose or make changes to improve their experimental methods than a control group, who performed traditional experimental activities. The students in the experimental condition were also four times more likely to identify and explain a limitation of a physical model using their data. Students in the experimental condition also showed much more sophisticated reasoning about their data. These differences between the groups were seen to persist into a subsequent course taken the following year.
A central goal of science education is to teach students to think critically about scientific data and models. It is crucial for scientists, engineers, and citizens in all walks of life to be able to critique data, to identify whether or not conclusions are supported by evidence, and to distinguish a significant effect from random noise and variability. There are many indications of how difficult it is for people to master this type of thinking, as evidenced by many societal debates. Although teaching quantitative critical thinking is a fundamental goal of science education, particularly the laboratory portion, the evidence indicates this is seldom, if ever, being achieved (1–6). To address this educational need, we have analyzed the explicit cognitive processes involved in such critical thinking and then developed an instructional design to incorporate these processes.
We argue that scientists engage in such critical thinking through a process of repeated comparisons and decisions: comparing new data to existing data and/or models and then deciding how to act on those comparisons based on analysis tools that embody appropriate statistical tests. Those actions typically lead to further iterations involving improving the data and/or modifying the experiment or model. In a research setting, common decisions are to improve the quality of measurements (in terms of accuracy or precision) to determine whether an effect is hidden by large variability; to embrace, adjust, or discard a model based on the scientific evidence; or to devise a new experiment to answer the question. In other settings, such as medical policy decisions, there may be fewer options, but corresponding decisions are made as to the consistency of the model and the data and what conclusions are justified by the data.
We hypothesize that much of the reason students do not engage in these behaviors is that the educational environment provides few opportunities for this process. Students ought to be explicitly exposed to how experts engage in critical thinking in each specific discipline, which should, in turn, expose them to the nature of knowledge in that discipline (7). Demonstrating the critical thinking process, of course, is insufficient for students to use it on their own. Students need practice engaging in the critical thinking process themselves, and this practice should be deliberate and repeated with targeted feedback (7–9). We do not expect first-year university students to engage in expert-level thinking processes, but we can train them to think more like scientists by simplifying the expert decision tree described above. Making the critical thinking process explicit to students, demonstrating how the process allows them to learn or make discoveries, and having them practice in a deliberate way with targeted feedback will help students understand the nature of scientific measurement and data uncertainty and, in time, adopt the new ways of thinking.
The decision tree and iterative process we have described could be provided in any setting in which data and models are introduced to students. Virtually all instructional laboratories in science offer such opportunities as students collect data and use it to explore various models and systems. Such laboratories are an ideal environment for developing students’ critical thinking, and this environment is arguably the laboratories’ greatest value.
We have tested this instructional concept in the context of a calculus-based introductory laboratory course in physics at a research-intensive university. The students repeatedly and explicitly make decisions and act on comparisons between datasets or between data and models as they work through a series of simple, introductory physics experiments. Although this study is in the context of a physics course, we believe the effect would be similar using experiments from any subject that involve quantitative data, opportunities to quantitatively compare data and models, and opportunities to improve data and models. With this simple intervention, we observed dramatic long-term improvements in students’ quantitative critical thinking behaviors compared with a control group that carried out the same laboratory experiments but with a structure more typical of instructional laboratories.
In our study, students in the experimental condition were explicitly instructed (and graded on their ability) to quantitatively compare multiple collected datasets, or a collected dataset and a model, and to decide how to act on the comparisons (Fig. 1). Although a variety of options for acting on comparisons, as listed above, were presented to students, striving to improve the quality of their data was the most rigorously enforced. For example, in one of the earliest experiments, students were told to make two sets of measurements and compare them quantitatively. The students were then prompted to devise a plan to improve the quality of their measurements, to discuss this plan with other groups, and to carry out the revised measurements and analysis. This explicit focus on measurements, rather than on improving models, was intended to address the fact that students in a laboratory course often assume the data they collect are inherently low quality compared with expert results (10). This perception can lead students to ignore disagreements between measurements or to artificially inflate uncertainties to disguise the disagreements (11). When disagreements do arise, students often attribute them to what they refer to as “human error” (12) or simply blame the equipment being used. As such, students are unlikely to adjust or discard an authoritative model, because they do not trust that their data are sufficiently high quality to make such a claim. We hypothesize that the focus on high-quality data will, over time, encourage students to critique models without explicit support.
Fig. 1.

To compare measurements quantitatively, students were taught a number of analysis tools used regularly by scientists in any field. Students were also taught a framework for how to use these tools to make decisions about how to act on the comparisons. For example, students were shown weighted χ² calculations for least-squares fitting of data to models and then were given a decision tree for interpreting the outcome. If students obtain a low χ², they would decide whether it means their data are in good agreement with the model or whether it means they have overestimated their uncertainties. If students obtain a large χ², they would decide whether there is an issue with the model or with the data. From these interpretations, the decision tree expands into deciding what to do. In both cases, students were encouraged to improve their data: to improve precision and decrease their uncertainties in the case of a low χ² or to identify measurement or systematic errors in the case of a large χ². Although students were told that a large χ² might reflect an issue with the model, they were not told what to do about it, leaving room for autonomous decision-making. Regardless of the outcome of the comparison, therefore, students had guidelines for how to act on the comparison, typically leading to additional measurements. This naturally led to iterative cycles of making and acting on comparisons, which could be used for any type of comparison.
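To make this decision tree concrete, the short sketch below encodes the branching logic described above; the function name and numeric cutoffs are illustrative assumptions rather than values prescribed in the course materials.

```python
# Illustrative encoding of the decision tree described above; the numeric
# cutoffs are assumptions, not values prescribed by the course.
def suggest_action(weighted_chi2, low=0.5, high=2.0):
    """Map a weighted chi-squared value onto the follow-up actions in the text."""
    if weighted_chi2 < low:
        # Agreement is suspiciously good: uncertainties may be overestimated.
        return "Improve precision and reduce uncertainties, then re-measure."
    if weighted_chi2 > high:
        # Clear disagreement: check for systematic errors, then question the model.
        return "Hunt for measurement/systematic errors; consider limitations of the model."
    return "Data and model are consistent; iterate further if time allows."

print(suggest_action(0.2))   # low chi-squared branch
print(suggest_action(5.3))   # high chi-squared branch
```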
Before working with fitting and models, students were first introduced to an index, t′, for comparing pairs of measured values with uncertainty (the ratio of the difference between two measured values to the uncertainty in the difference; see Supporting Information, Quantitative Comparison Tools for more details). Students were also taught to plot residuals (the point-by-point difference between measured data and a model) to visualize the comparison of data and models. Both of these tools, and any comparison tool that includes the variability in a measurement, lend themselves to the same decision process as the χ² value when identifying disagreements with models or improving data quality. A number of standard procedural tools for determining uncertainty in measurements or fit parameters were also taught (see Supporting Information, Quantitative Comparison Tools for the full list). As more tools were introduced during the course, the explicit instructions to make or act on the comparisons were faded (see Supporting Information, Comparison Cycles Instruction Across the Year for more details and for a week-by-week diagram of the fading).
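As an illustration of the residual comparison (not taken from the course materials), the following sketch plots the point-by-point difference between synthetic measurements and an assumed linear model.

```python
# Synthetic example of a residual plot: the point-by-point difference
# between measured values and an assumed linear model.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 20)
model = 2.0 * x + 1.0                          # assumed model
y = model + rng.normal(0.0, 0.5, x.size)       # fake measurements
dy = np.full(x.size, 0.5)                      # assumed uncertainties

residuals = y - model

fig, ax = plt.subplots()
ax.errorbar(x, residuals, yerr=dy, fmt="o")
ax.axhline(0.0, color="gray")                  # structure about zero flags model problems
ax.set_xlabel("x")
ax.set_ylabel("measured minus model")
plt.show()
```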
The students carried out different experiments each week and completed the analysis within the 3-h laboratory period. To evaluate the impact of the comparison cycles, we assessed students’ written laboratory work from three laboratory sessions (see Supporting Information, Student Experiments Included in the Study for a description of the experiments) from the course: one early in the course when the experimental group had explicit instructions to perform comparison cycles to improve data (week 2) and two when all instruction about making and acting on comparisons had been stopped (weeks 16 and 17). We also examined student work from a quite different laboratory course taken by the same students in the following year. Approximately a third of the students from the first-year laboratory course progressed into the second-year (sophomore) physics laboratory course. This course had different instructors, experiments, and structure. Students carried out a smaller number of more complex experiments, each one completed over two weeks, with final reports then submitted electronically. We analyzed the student work on the third experiment in this course.
Results
Students’ written work was evaluated for evidence of acting on comparisons, either suggesting or executing changes to measurement procedures or critiquing or modifying physical models in light of collected data. We also examined students’ reasoning about data to further inform the results (see Supporting Information, Interrater Reliability for interrater reliability of the coding process for these three measures). Student performance in the experimental group () was compared with a control group (). The control was a group of students who had taken the course the previous year with the same set of experiments. Analysis in Supporting Information, Participants demonstrates that the groups were equivalent in performance on conceptual physics diagnostic tests. Although both groups were taught similar data analysis methods (such as weighted fitting), the control group was neither instructed nor graded on making or acting on cycles of quantitative comparisons. The control group also was not introduced to plotting residuals or comparing differences of pairs of measurements as a ratio of the combined uncertainty. Since instructions given to the experimental group were faded over time, the instructions given to both groups were identical in week 16 and week 17.
We first compiled all instances where students decided to act on comparisons by proposing and/or making changes to their methods (Fig. 2), because this was the most explicitly structured behavior for the experimental group. When students in the experimental group were instructed to iterate and improve their measurements (week 2), nearly all students proposed or carried out such changes. By the end of the course, when the instructions had been removed, over half of the experimental group continued to make or propose changes to their data or methods. This fraction was similar for the sophomore laboratory experiment, where it was evident that the students were making changes, even though we were evaluating final reports rather than laboratory notebooks. Almost none of the control group wrote about making changes during any of the experiments in the study.
Fig. 2.

Next, we looked for instances where students decided to act on a comparison by critiquing the validity of a given physical model (Fig. 3). For both groups of students, many experiments asked them to verify the validity of a physical model. Neither group, however, received explicit prompts to identify or explain a disagreement with the model. Three experiments (week 2, week 17, and the sophomore laboratory) were included in this portion of the analysis, because these experiments involved physical models that were limited or insufficient for the quality of data achievable (Supporting Information, Student Experiments Included in the Study). In all three experiments, students’ written work was coded for whether they identified a disagreement between their data and the model and whether they correctly interpreted the disagreement in terms of the limitations of the model.
Fig. 3.

As shown in Fig. 3, few students in either group noted a disagreement in week 2. As previously observed, learners tend to defer to authoritative information (7, 10, 11). In fact, many students in the experimental group stated that they wanted to improve their data to get better agreement, ignoring the possibility that there could be something wrong with the model.
As the course progressed, however, dramatic changes emerged. In week 17, over three-fourths of the students in the experimental group identified the disagreement, nearly four times more than in the control group, and over half of the experimental group provided the correct physical interpretation. Students in the experimental group showed similar performance in the sophomore laboratory, indicating that the quantitative critical thinking was carried forward. The laboratory instructions for the sophomore experiment provided students with a hint that a technical modification to the model equation might be necessary if the fit was unsatisfactory and prompted them to explain why it might be necessary. This is probably why a larger percentage of students in the control group identified the disagreement in this experiment than in the week 2 and week 17 experiments. However, only 10% of the students in the control group provided the physical interpretation, compared with 40% in the experimental group.
This more sophisticated analysis of models builds on the repeated attempts to improve the quality of the measurements. Students obtain both better data and greater trust in the quality of those data, which gives them the confidence to question an authoritative model. This is evident when we examine how students reasoned about their data.
We coded students’ reasoning into four levels of sophistication, somewhat analogous to Bloom’s Taxonomy (13), with the highest level reached by a student in a given experiment being recorded. Level 1 comments reflect the simple application of analysis tools or comparisons without interpretation; level 2 comments analyze or interpret results; level 3 comments combine multiple ideas or propose something new; and level 4 comments evaluate or defend the new idea (see Supporting Information, Reflection Analysis for additional comments and Figs. S2 and S3 for examples of this coding).
Fig. S1.

Fig. S2.

Fig. S3.

In Fig. 4, we see only a moderate difference between the experimental and control groups in week 2, even though the experimental group was receiving significant behavioral support at that point. This suggests that the support alone is insufficient to create significant behavioral change. By week 16, there is a larger difference between the groups, with the control group shifting to lower levels of comment sophistication and the experimental group maintaining higher levels, despite the removal of the behavioral support. In week 17, when the model under investigation is inadequate to explain high-quality data, the difference between the groups becomes much more dramatic. For the experimental group, the unexpected disagreement triggers productive, deep analysis of the comparison, beyond the level seen the previous week (14–16). We attribute this primarily to attempts to correct or interpret the disagreement. In contrast, most of the students in the control group are reduced to simply writing about the analysis tools they had used.
Fig. 4.

Students in the control group had primarily been analyzing and interpreting results (levels 1 and 2) but not acting on them. Because students will continue to use strategies that have been successful in the past (17), they were not prepared to manage the unexpected outcome in week 17. Our data, however, are limited in that we evaluated only what was written in the students’ books by the end of the laboratory session. It is plausible that the students in the control group were holding high-level discussions about the disagreement but not writing them down. The students’ low-level written reflections are, at best, evidence that they needed more time to achieve the outcomes of the experimental group.
In the sophomore laboratory, the students in the experimental group continued to show a high level of sophistication in their reflective comments, indicating a sustained change in reasoning and epistemology. The students in the control group showed higher-level reflections in the sophomore laboratory than they did in the first-year laboratory, possibly because of the greater time given to analyze their data, the prompt about the model failing, or the selection of these students as physics majors. Nonetheless, they remained well below the level of the experimental group.
Discussion
The cycles of making and deciding how to act on quantitative comparisons gave students experience with making authentic scientific decisions about data and models. Because students ultimately had to decide how to proceed, the cycles provided a constrained experimental design space to prepare them for autonomous decision-making (18). With a focus on the quality of their data and how they could improve it, the students came to believe that they were able to test and evaluate models. This is not just an acquisition of skills; it is an attitudinal and epistemological shift unseen in the control group or in other studies of instructional laboratories (11, 12). The training in how to think like an expert inherently teaches students how experts think and, thus, how experts generate knowledge (7).
The simple nature of the structure used here gives students both a framework and a habit of mind that leaves them better prepared to transfer the skills and behaviors to new contexts (19–21). This simplicity also makes it easily generalizable to a very wide range of instructional settings: any venue that contains opportunities to make decisions based on comparisons.
Quantitative Comparison Tools
The first type of comparison encountered in a typical introductory physics laboratory is to compare two independently measured values of the same physical parameter, a task that is known to be challenging for students (3, 5, 10). In many instructional laboratories, students do so by assessing whether the uncertainty ranges defined by the measurements overlap. Scientists, however, generally refer to a continuous scale associated with the measurements’ probability distributions (22), such as the number of units of uncertainty by which two measurements differ (so-called 1σ, 2σ, or 3σ differences in physics, for example). Following the Guide to the Expression of Uncertainty in Measurement (23), this could be calculated as

t′ = (A − B) / √(δA² + δB²),   [S1]

where A and B are two measured values and δA and δB are their uncertainties, respectively. As such, a large t′ score means that the measurements differ by more than their combined uncertainties, and a small t′ score means the measurements are similar within their combined uncertainties. We use the letter t for the index in reference to the structural similarity to the Student’s t value, but we do not imply the index applies to the t distribution.
Interpreting the outcome of this comparison provides the necessary structure for deciding how to act on the comparison. For example, because overestimated uncertainties can lead to an artificially small t′ score, a low t′ score could mean that poor precision has hidden a small disagreement. As such, one could choose to improve the quality of the measurements. Under a model that predicts the two measurements should agree, a large t′ score could mean that the model is limited or inappropriate. One could then choose to evaluate, adjust, or discard this model. One could also attempt to identify possible measurement errors that are causing a systematic effect. In all of these cases, the t′ statistic compares the difference between measured quantities in units of variability. Rather than specifically comparing sample means according to the sample SDs, however, the t′ score uses any measurement value with its uncertainty. As such, we do not try to compare the t′ scores on the t distribution or make inferences about probabilities. Indeed, if the measurements were sample means from populations with the same variance, the t′ score would be equivalent to Student’s t for comparing independent samples (or, if homogeneity of variance is violated, the t′ score would be equivalent to Welch’s t).
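A minimal sketch of the t′ comparison defined in Eq. S1, using invented measurement values:

```python
# t' comparison of Eq. S1 with invented numbers.
import numpy as np

def t_prime(a, da, b, db):
    """(A - B) in units of the combined uncertainty of the two measurements."""
    return (a - b) / np.sqrt(da**2 + db**2)

period_1, dperiod_1 = 1.823, 0.004   # s, first measurement (invented)
period_2, dperiod_2 = 1.837, 0.005   # s, second measurement (invented)

score = t_prime(period_1, dperiod_1, period_2, dperiod_2)
print(f"t' = {score:.1f}")   # |t'| of about 2 here: the difference exceeds the combined uncertainty
```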
The equation for least-squares fitting lends itself to the same quantitative framework, defined by the weighted or reduced χ² statistic

χ²_w = (1/N) Σᵢ [(yᵢ − f(xᵢ)) / δyᵢ]²,   [S2]

where xᵢ and yᵢ are the measured independent and dependent values, δyᵢ is the uncertainty associated with each yᵢ, N is the number of data points, and f(xᵢ) are the model values associated with each xᵢ. This parameter evaluates the average difference between measured data and a model in units of uncertainty (squared). χ²_w values, therefore, are subject to the same interpretation and follow-up measurements as the t′ score (see Table S1).
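A corresponding sketch of the weighted χ² of Eq. S2, again with invented data, assumed uncertainties, and an assumed one-parameter model:

```python
# Sketch of the weighted chi-squared statistic in Eq. S2 (synthetic data).
import numpy as np

def chi2_weighted(y_meas, y_model, dy):
    """Average squared deviation between data and model, in units of the
    measurement uncertainty (Eq. S2 with N = number of points)."""
    return np.mean(((y_meas - y_model) / dy) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # invented measurements
dy = np.full(x.size, 0.2)                   # invented uncertainties
model = 2.0 * x                             # assumed one-parameter model y = m*x

print(f"chi2_w = {chi2_weighted(y, model, dy):.2f}")
# Values near 1 are consistent with the model; much larger values point to
# systematic errors or model limitations (Table S1).
```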
Table S1.
| Score (t′ or χ²_w) | Interpretation of measurements | Follow-up investigation |
|---|---|---|
| Low | Unlikely different; uncertainty may be overestimated | Improve measurements, reduce uncertainty |
| Intermediate | Unclear whether different | Improve measurements, reduce uncertainty |
| Large | Likely different | Improve measurements, correct systematic errors, evaluate model limitations or approximations |

t′ score comparisons are between pairs of measurements and χ²_w comparisons are between datasets and models.
Students were also taught a number of additional statistical analysis tools. The full set of tools taught to each condition is found in Table S2, which also specifies whether each tool informs a comparison or is primarily procedural.
Table S2.
| Comparison tools: control and experimental conditions | Comparison tools: experimental condition only | Procedural tools: control and experimental conditions |
|---|---|---|
| Overlapping uncertainty ranges | t′ score | Histograms |
| Unweighted χ² | Residual plots | Mean |
| Weighted χ² | | SD |
| | | Standard uncertainty in the mean (SE) |
| | | Semilog and log–log plots |
| | | Weighted average |
| | | Uncertainty in fit parameters of fit lines |
The statistical tools taught to students in each condition are specified by whether they are procedural or inform the comparison cycles.
Comparison Cycles Instruction Across the Year
Students in the experimental group were given explicit instructions to make comparisons between their measurements and/or models and to iterate to improve their measurements. These behaviors were also graded, with explicit reference in the grading rubric. This support was faded across the course: the explicit instructions in the text were the first to be removed, followed by the assigned marks, and eventually instructor support was removed as well. A map of this fading process across the year is included in Table S3.
Table S3.
Support | Week | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
Compare: instructions | X | X | X | X | X | X | X | ||||||||||||
Compare: marking | X | X | X | X | X | X | X | X | X | X | X | ||||||||
Iterate: instructions | X | X | X | ||||||||||||||||
Iterate: marking | X | X | X | X |
The experimental group received explicit support to make and act on comparisons. The support came in the form of explicit instructions and/or reference in the marking scheme and was faded over time. In the table, an X indicates that the behavior (comparing or iterating) was supported that week.
Student Experiments Included in the Study
Week 2: Period of a Pendulum as a Function of Amplitude.
In this experiment, students were asked to measure the period of a pendulum at two (experimental group) or three (control group) angles of amplitude and to compare their measurements. Students were not given a model for the process, but most of the students believed from previous experience (high school or college-level physics class) that the period was independent of angle, according to the equation

T = 2π √(L/g),   [S3]

where L is the length of the pendulum, g is the acceleration due to gravity, and T is the period of the pendulum. The derivation of this equation, however, involves the approximation that

sin θ ≈ θ   [S4]

for small angles θ. High-precision measurements, therefore, expose this approximation and reveal the difference in the periods at different amplitudes arising from the second-order correction to this approximation.
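To illustrate the size of this effect, the sketch below uses the standard second-order amplitude correction, T ≈ T₀(1 + θ₀²/16); the pendulum length and amplitudes are assumptions for illustration and are not taken from the course materials.

```python
# Sketch of how the finite-amplitude correction to Eq. S3 produces a
# measurable period difference (length and angles are illustrative).
import numpy as np

L = 1.0          # pendulum length, m (assumed)
g = 9.81         # m/s^2
T0 = 2 * np.pi * np.sqrt(L / g)          # small-angle period, Eq. S3

def period(theta_deg):
    """Period including the leading amplitude correction,
    T ~ T0 * (1 + theta0**2 / 16), the standard second-order result."""
    theta = np.radians(theta_deg)
    return T0 * (1 + theta**2 / 16)

for amp in (10, 20):
    print(f"amplitude {amp:2d} deg: T = {period(amp):.5f} s")
# The millisecond-scale difference is resolvable once students reduce their
# timing uncertainty by iterating on their measurement technique.
```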
Week 16: Resistor–Capacitor Circuit 2.
In this experiment, students studied the voltage decay across a resistor in a parallel resistor–capacitor (RC) circuit. This was the second experiment with this equipment and circuit. They measured the time constant (τ) of the voltage decay across the resistor as a function of the resistance R of the resistor, which is given by the model

τ = RC.   [S5]

In addition to verifying that the relationship between τ and R was in fact linear with an intercept through the origin, students could compare the capacitance of the capacitor with the value of the slope from a graph of τ versus R. Resistance from other parts of the circuit was negligible in this experiment.
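A sketch of this analysis with invented (R, τ) data: a one-parameter fit of Eq. S5 returns the capacitance and its uncertainty, which students could then compare with the nominal value using Eq. S1.

```python
# Sketch of the week 16 analysis: fit tau = R*C (Eq. S5) to extract C
# (all numbers are invented for illustration).
import numpy as np
from scipy.optimize import curve_fit

R = np.array([1.0e3, 2.2e3, 4.7e3, 6.8e3, 10.0e3])      # ohms (assumed settings)
tau = np.array([0.10, 0.22, 0.48, 0.69, 1.01]) * 1e-3    # seconds ("measured")
dtau = np.full(R.size, 0.01e-3)                          # assumed uncertainties

def model(R, C):
    return R * C          # one-parameter model through the origin

popt, pcov = curve_fit(model, R, tau, sigma=dtau, absolute_sigma=True)
C, dC = popt[0], np.sqrt(pcov[0, 0])
print(f"C = {C*1e9:.1f} +/- {dC*1e9:.1f} nF")
# The fitted C can then be compared with the capacitor's nominal value via Eq. S1.
```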
Week 17: Inductor–Resistor Circuit.
Using a measurement procedure similar to that of the week 16 experiment, students studied the time constant (τ) of the voltage decay across a resistor in a series inductor–resistor (LR) circuit, which is given by the model

τ = L/R.   [S6]

For this model, the time constant as a function of resistance, plotted as 1/τ versus resistance, would give a straight line with an intercept through the origin. Resistance in the additional components in the circuit, however, is nonnegligible here, resulting in a nonzero intercept in the plot. Students could choose whether to perform a one-parameter (y = mx) or two-parameter (y = mx + b) linear fit to their data, which would cause them to confront the issue of the intercept. Students did not know the inductance of the inductor and so could not make a comparison with the value from the fit. Students could check their circuit for a finite (noninfinite) time constant with the resistor set to zero resistance.
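The sketch below, with invented component values and synthetic data that include an extra series resistance, contrasts the one- and two-parameter fits; the weighted χ² quantifies how poorly the intercept-free model describes the data.

```python
# Sketch of the week 17 decision: one-parameter vs. two-parameter fit of
# 1/tau versus R, with synthetic data that include an extra series resistance.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)
L_true, r_extra = 0.50, 15.0                   # H and ohms, invented
R = np.array([20.0, 40.0, 60.0, 80.0, 100.0])  # ohms, invented settings
inv_tau = (R + r_extra) / L_true + rng.normal(0.0, 2.0, R.size)  # "measured" 1/tau
d_inv_tau = np.full(R.size, 2.0)               # assumed uncertainties

def one_param(R, m):          # line forced through the origin (tau = L/R)
    return m * R

def two_param(R, m, b):       # line with a free intercept
    return m * R + b

def chi2_w(f, params):
    return np.mean(((inv_tau - f(R, *params)) / d_inv_tau) ** 2)

p1, _ = curve_fit(one_param, R, inv_tau, sigma=d_inv_tau, absolute_sigma=True)
p2, _ = curve_fit(two_param, R, inv_tau, sigma=d_inv_tau, absolute_sigma=True)

print(f"one-parameter fit:  chi2_w = {chi2_w(one_param, p1):.1f}")
print(f"two-parameter fit:  chi2_w = {chi2_w(two_param, p2):.2f}")
# The large drop in chi2_w when the intercept is freed is the quantitative
# signal that the simple tau = L/R model misses the extra series resistance.
```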
Sophomore Laboratory: LRC Circuit.
In the LRC circuit experiment, an inductor (L), resistor (R), and capacitor (C) are connected in series, and the equation governing the voltage decay across the resistor is

V_R/V_0 = γω / √((ω₀² − ω²)² + γ²ω²),   [S7]

where V_R is the voltage across the resistor, V_0 is the amplitude of the input AC voltage source, ω is the angular frequency of the voltage source, ω₀ is the resonant frequency, and γ is the bandwidth. Students fit their data of V_R/V_0 as a function of frequency, ω, to determine the parameters ω₀ and γ. Additional resistance in the circuit beyond the resistance in the resistor, however, means that the ratio of V_R to V_0 will never be exactly 1, and so it is necessary to add a third scaling factor, A, to the model, such that

V_R/V_0 = A γω / √((ω₀² − ω²)² + γ²ω²).   [S8]
Students also measured the parameters ω₀ and γ through another experiment and could calculate their values (using measurements of the components R, L, and C) through the definitions of these parameters. As such, students had multiple comparisons to make to inform the quality of the fit beyond the analysis of the fit itself.
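A sketch of fitting Eq. S8 to synthetic resonance data; the parameter values, frequency range, and noise level are assumptions for illustration.

```python
# Sketch of the sophomore-laboratory fit: the resonance curve of Eq. S8,
# including the extra scaling factor A (synthetic data, invented parameters).
import numpy as np
from scipy.optimize import curve_fit

def vr_ratio(w, w0, gamma, A):
    """V_R / V_0 for a driven series LRC circuit, with scaling factor A."""
    return A * gamma * w / np.sqrt((w0**2 - w**2) ** 2 + (gamma * w) ** 2)

rng = np.random.default_rng(1)
w = np.linspace(2.0e3, 8.0e3, 40)                        # rad/s (assumed range)
true = vr_ratio(w, w0=5.0e3, gamma=1.2e3, A=0.85)        # assumed "true" values
ratio = true + rng.normal(0.0, 0.01, w.size)             # fake measurements
dr = np.full(w.size, 0.01)                               # assumed uncertainties

p0 = [5.0e3, 1.0e3, 1.0]                                 # initial guesses
popt, pcov = curve_fit(vr_ratio, w, ratio, p0=p0, sigma=dr, absolute_sigma=True)
w0, gamma, A = popt
print(f"w0 = {w0:.0f} rad/s, gamma = {gamma:.0f} rad/s, A = {A:.2f}")
# A fitted A noticeably below 1 is the signature of resistance elsewhere in
# the circuit, which students can compare against their other estimates.
```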
Interrater Reliability
For all of the data presented, one rater coded all items and another rater coded ∼10% of the items. The primary coder was never blind to condition because of the nature of the student products. In the control group, students printed their analysis work from spreadsheets and pasted them into their laboratory notes, whereas the experimental group submitted their spreadsheets electronically. The second rater, however, was given copies that made the rater blind to condition.
Interrater-reliability analysis using Cohen’s κ statistic was performed to evaluate consistency between raters. Values greater than 0.6 were considered substantial agreement and so do not suggest a need for blind coding. For the quality of reflective comments, the interrater reliability for the raters was found to be . For identifying whether students proposed or proposed and carried out changes to their methods and measurements, the interrater reliability for the raters was found to be . For identifying whether students identified and/or physically interpreted the disagreements with models, the interrater reliability for the raters was found to be .
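A minimal sketch of the κ calculation, with invented rater codes standing in for the actual coding data:

```python
# Sketch of the interrater-reliability check using Cohen's kappa
# (codes below are invented; the real coding used the categories in the text).
from sklearn.metrics import cohen_kappa_score

# Maximum reflection level (1-4) assigned to the same 12 students by two raters
rater_1 = [2, 3, 1, 4, 2, 2, 3, 1, 3, 4, 2, 3]
rater_2 = [2, 3, 1, 3, 2, 2, 3, 1, 3, 4, 2, 2]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")   # values above ~0.6 taken as substantial agreement
```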
Participants
Included in the study were two cohorts (groups) of students enrolled in the same introductory undergraduate physics course at a research-intensive university in Canada. The control group consisted of students enrolled in 2012/2013, whereas the experimental group consisted of students enrolled in 2013/2014. The course, in both years, was spread across two semesters, with eight or nine weekly 3-h laboratory sessions per semester. Each laboratory session included no more than 48 students and was facilitated by two graduate student teaching assistants and the course instructor. The number of students included in the analysis is found in Table S4. The variability in the number of students each week is attributable to students not attending all laboratories. In the control group, 109 students conducted all three first-year laboratories, and only 31 students conducted all three first-year laboratories and the sophomore laboratory. In the experimental group, 108 students conducted all three first-year laboratories, and only 36 students conducted all three first-year laboratories and the sophomore laboratory. Because the effects of the laboratory occurred throughout more than just the four laboratories evaluated, we include all students who participated in each particular week.
Table S4.
Group | Week 2 | Week 16 | Week 17 | Sophomore laboratory |
---|---|---|---|---|
Control | 146 | 132 | 131 | 39 |
Experiment | 159 | 138 | 133 | 48 |
On entering the course, the two groups had statistically equivalent pretest scores on the Force Concept Inventory (FCI) (24): control, ; experiment, . By the end of the first term, the groups had statistically equivalent scores on the Mechanics Baseline Test (MBT) (25): control, ; experiment, . By the end of the second term, the groups also had statistically equivalent scores on the Brief Electricity and Magnetism Survey (BEMA) (26): control, ; experiment, . These assessments have been used to evaluate the introductory physics students in the department for over 20 y, and, in the last decade, students’ incoming scores have been consistent within a 2% SD.
The critical thinking behaviors assessed in this study relate primarily to evaluating data and physical measurement systems. The questions on the FCI, MBT, and BEMA evaluate students’ ability to apply specific physics concepts in idealized situations. There is very little overlap between the knowledge and reasoning required to answer those questions and the real-world, data-driven critical thinking about data and measurement systems learned in the laboratory course. We also would expect the lecture and other components of the courses to dominate over any possible effect related to the laboratory. Therefore, it is not surprising that the scores are not correlated.
Students in the course in both years were almost all intending to major in a science, technology, engineering, or math field, although they do not declare their majors until their second year. The breakdown of students’ intended majors in the experimental group by the end of the course is given in Table S5. Unfortunately, these data were unavailable for the control group. We do have data showing that ∼15% of students in the control group and 20% of the students in the experimental group chose physics as a major by their second year of study.
Table S5.
Intended Major | Experimental group, % |
---|---|
Physics or astronomy | 14 |
Life sciences | 13 |
Engineering physics | 7 |
Non-STEM | 2 |
Computer science | 1 |
Chemistry | 1 |
Other STEM or undecided | 62 |
STEM, science, technology, engineering, and mathematics.
Evaluation of the Sophomore Students.
Here we further evaluate the students who continued into the sophomore laboratory course to explore whether the results seen in the sophomore laboratory are attributable to transfer or to selection effects. First, we perform a two-by-two comparison on the end-of-first-year MBT and BEMA scores (Table S6), comparing students who did and did not take the sophomore laboratory course and comparing the experimental and control groups from the first-year course.
Table S6.
| Group | Took laboratory, % | Did not take laboratory, % | Took laboratory vs. did not take laboratory | Experimental vs. control group |
|---|---|---|---|---|
| MBT | | | | |
| Control group | 77 (12) | 70 (16) | t(76.6) = 2.46; P = 0.016* | |
| Experimental group | 75 (17) | 66 (16) | t(80.6) = 2.81; P = 0.006** | |
| Took laboratory | | | | t(71.2) = 0.59; P = 0.556 |
| BEMA | | | | |
| Control group | 74 (9) | 65 (20) | t(34.8) = 1.85; P = 0.073 | |
| Experimental group | 68 (16) | 61 (16) | t(70.8) = 2.06; P = 0.04* | |
| Took laboratory | | | | t(44.3) = 1.71; P = 0.094 |
Students from the first year course (both from the control and experimental conditions) who did and did not take the sophomore laboratory are compared on MBT and BEMA diagnostics. Numbers are mean percentage on the test with SD in parentheses. *P < 0.05; **P < 0.01.
Overall, the students who went on to take the sophomore physics laboratory course outperformed the students who did not take the sophomore laboratory, as measured on both the MBT and the BEMA (note that, of the students in the control group, there was no difference between students who did and did not take the sophomore laboratory course on the BEMA). This tells us that the students in the sophomore physics laboratories generally had a stronger conceptual physics background than the students who did not continue in an upper-year physics laboratory course. This is consistent with the expected selection bias of students who choose to pursue more physics courses. Of the students who took the sophomore physics laboratory, however, there is a nonsignificant difference between the experimental and control groups on both the MBT and BEMA. This is consistent with the overall lack of differences on these measures between the full experiment and control conditions in the first-year laboratory course discussed in the previous section.
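The comparisons reported in Table S6 are Welch-type t-tests (unequal variances); a sketch with invented score samples:

```python
# Sketch of the Welch-type comparison reported in Table S6 (scores invented).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
took_lab = rng.normal(77, 12, 40)    # MBT %, students who took the sophomore laboratory
no_lab = rng.normal(70, 16, 100)     # MBT %, students who did not

t, p = stats.ttest_ind(took_lab, no_lab, equal_var=False)   # Welch's t-test
print(f"t = {t:.2f}, P = {p:.3f}")
```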
Next, we compare these two subgroups on their evaluation, iteration, and reflection behaviors throughout the first-year laboratories. The trends in Fig. S1 A–C, which include only the sophomore students, are very similar to those for the whole course (Figs. 1–3). This suggests that the students who continued into the sophomore course were not exceptional in their behaviors in the first year. This further suggests that the effects seen in the sophomore laboratory experiment are not attributable to selection effects. It remains that the upward shift in the control group’s reflective comments and evaluation of the model is attributable to something inherent in the sophomore laboratory course. Most likely, these shifts can be attributed to the prompt in the instructions to explain why there may be extra parameters in the model. This instruction would explain a shift in the model evaluation and reflective comments but not in iteration, as seen in the data.
Reflection Analysis
To analyze students’ reflections in the laboratory, we evaluated students’ reflective comments associated with their statistical data analysis and conclusions. The reflective comments were coded using a set of four classes based on Bloom’s Taxonomy (13). Fig. S2 A and B provide samples of this coding applied to student work. The four comment levels were:
i)
Application: a written reflection statement that offers the outcome of the procedural application of data analysis tools (e.g., The value is 2.1.) These comments were distinct from procedural statements (e.g., then we calculated the value).
ii)
Analysis: a written reflection statement that analyzes or interprets their data analysis or results (e.g., our χ² value is 0.84, which is close to one, indicating that our model fits the data well).
iii)
Synthesis: a written reflection statement that synthesizes multiple ideas, tool analyses, or reflections to propose a new idea. This could include suggesting ways to improve measurements (e.g., we will take more data in this range, because the data are sparse) or models (e.g., our data has an intercept so the model should have an intercept), as well as making comparisons (e.g., the χ² value for the one-parameter fit was 43.8 but for the two-parameter fit was 4.17, which is much smaller).
iv)
Evaluation: a written reflection statement that evaluates, criticizes, or judges the previous ideas presented. Evaluation can look similar to analysis, but the distinction is that evaluation must follow a synthesis comment. For example, after a synthesis that compared two different models and demonstrated that adding an intercept lowered the χ² value, an evaluation could follow as, “…the intercept was necessary due, most likely, to the inherent resistance within the circuit (such as in the wires).”
Fig. S2 A and B demonstrate how the coding scheme is applied to three excerpts from students’ books in the LR experiment (week 17). The levels build on one another, so a student making a level 4 evaluation statement would also have made lower-level statements, although level 1 comments (application) need not be present. Although it is important that students reflect on various parts of the data analysis, only the maximum reflection level a student reached was coded. It should be noted that the comments were not evaluated on correctness.
Analysis
For the first-year experiments, generalized linear mixed-effects models were fit using R (27) and the lme4 package (linear mixed-effects models using Eigen and S4) (28) to analyze all three outcome measures (proposing and/or carrying out measurement changes, identifying and/or interpreting disagreements with models, and levels of reflection/comments). For measurement changes and evaluating models, logistic regression analysis was performed because of the dichotomous nature of the outcome variables. For the reflection data, Poisson regression was used to account for the bounded nature of the outcome variables. All three analyses used condition, laboratory week, and the interaction between condition and laboratory week as fixed effects and student identifier code (student ID) as a random-effects intercept. Type 3 analysis of variance (ANOVA) was performed on the logistic regression models using the car package (An R Companion to Applied Regression) (29) to assess the overall impact of the variables. Sophomore laboratory data were analyzed using tests for independence of proportions.
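For illustration, the sketch below reproduces the fixed-effects structure (condition, week, and their interaction) in Python with statsmodels; unlike the R analysis described above, it omits the per-student random intercept, and the column names and simulated outcomes are assumptions.

```python
# Simplified fixed-effects analogue of the regressions described above
# (no per-student random intercept; data are simulated for illustration).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "condition": rng.choice(["control", "experiment"], n),
    "week": rng.choice(["2", "16", "17"], n),
})
# Fake outcomes: experiment students more likely to change their methods
p = np.where(df["condition"] == "experiment", 0.75, 0.25)
df["made_change"] = rng.binomial(1, p)
df["reflection_level"] = rng.integers(1, 5, n)

# Logistic regression for the dichotomous outcomes
logit_fit = smf.logit("made_change ~ C(condition) * C(week)", data=df).fit(disp=False)
print(logit_fit.params)

# Poisson regression for the reflection levels
poisson_fit = smf.poisson("reflection_level ~ C(condition) * C(week)", data=df).fit(disp=False)
print(poisson_fit.params)
```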
Proposing and/or Carrying Out Measurement Changes.
A logistic regression was carried out to compare the proportion of students in each group and across each experiment proposing and/or carrying out changes to their measurements (Table S7). Note, for this analysis, proposing versus proposing and carrying out changes were collapsed to a single dichotomous variable of proposing or carrying out changes. The logistic regression model was statistically significant, . A type 3 ANOVA of the logistic regression model demonstrated that condition and the interaction between condition and laboratory week were highly significant in the model, but laboratory week alone was not significant.
Table S7.
| Model coefficients | Estimate | SE | Wald z | P |
|---|---|---|---|---|
| Condition = experiment | 7.97 | 0.94 | 8.49 | <0.0001*** |
| Week = week 16 | −0.82 | 0.86 | −0.96 | 0.336 |
| Week = week 17 | −0.41 | 0.75 | −0.55 | 0.582 |
| [Condition = experiment] × [week = week 16] | −2.64 | 1.03 | −2.56 | 0.010** |
| [Condition = experiment] × [week = week 17] | −2.54 | 0.93 | −2.72 | 0.007** |

| Model variables | df | χ² | P |
|---|---|---|---|
| Condition | 1 | 83.02 | <0.001*** |
| Week | 2 | 28.99 | <0.001*** |
| Condition × week | 2 | 9.28 | 0.01* |
Analysis used logistic regression to compare the control and experimental groups across four experiments, three in the first-year course and one in the sophomore course. *P < 0.05; **P < 0.01; ***P < 0.001.
With significant effects for the interaction, we can compare the groups each week to explore where the significant differences exist. To do this, we use a test of proportions comparing groups on the distribution of the number of students who did not propose or change their measurements, who proposed changes to their measurements, and who proposed and made changes to their measurements (returning to the three-level, rather than dichotomous, variable). Taking into account the multiple comparisons across weeks, we use a Bonferroni correction to set the α level at 0.01. This gave statistically significant differences between groups on all four experiments: week 2, ; week 16, ; week 17, ; sophomore laboratory, . This demonstrates that the experimental group outperformed the control group on this measure on all experiments.
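A sketch of one such per-week test of proportions, using an invented 2 × 3 table of counts and the Bonferroni-corrected α described above:

```python
# Sketch of the per-week test of proportions with a Bonferroni-corrected
# alpha (counts below are invented; the real distributions are in Fig. 2).
from scipy.stats import chi2_contingency

# Rows: control, experiment. Columns: no change proposed, proposed only,
# proposed and carried out.
week_counts = [[120, 8, 4],
               [40, 45, 74]]

chi2, p, dof, expected = chi2_contingency(week_counts)
alpha = 0.01          # Bonferroni-corrected threshold described in the text
print(f"chi2({dof}) = {chi2:.1f}, P = {p:.2g}, significant: {p < alpha}")
```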
Evaluating Models.
A logistic regression was carried out to compare the proportion of students in each group and across each experiment identifying the disagreement with the model and/or physically interpreting the issue (Table S8). Note, for this analysis, identifying versus physically interpreting the disagreement with the model were collapsed to a single dichotomous variable. The logistic regression model was statistically significant, . A type 3 ANOVA of the logistic regression model demonstrated that condition and the interaction between condition and laboratory week were highly significant in the model, but laboratory week alone was not significant.
Table S8.
| Model coefficients | Estimate | SE | Wald z | P |
|---|---|---|---|---|
| Condition = experiment | −0.83 | 0.33 | −2.55 | 0.011* |
| Week = week 17 | −0.27 | 0.30 | −0.88 | 0.379 |
| [Condition = experiment] × [week = week 17] | 3.60 | 0.60 | 5.97 | <0.001*** |

| Model variables | df | χ² | P |
|---|---|---|---|
| Condition | 1 | 6.49 | 0.011* |
| Week | 1 | 0.77 | 0.379 |
| Condition × week | 1 | 35.62 | <0.001*** |
Analysis used logistic regression to compare the control and experimental groups across three experiments, two in the first-year course and one in the sophomore course. *P < 0.05; ***P < 0.001.
With significant effects for the interaction, we can compare the groups each week to explore where the significant differences exist. To do this, we use a test of proportions comparing groups on the distribution of the number of students who did not identify the disagreement with a model, who did identify the disagreement, and who identified and interpreted the disagreement. Taking into account the multiple comparisons across weeks, we use a Bonferroni correction to set the α level at 0.02. This gave significant differences between groups on all three experiments: week 2, ; week 17, ; sophomore laboratory, .
Reflection Behaviors.
A Poisson regression was carried out to analyze the quality of the reflective comments in each group across each experiment (Table S9). The regression model was statistically significant, . A type 3 ANOVA of the regression model demonstrated that condition and the interaction between condition and laboratory week were highly significant in the model, but laboratory week alone was not significant.
Table S9.
| Model coefficients | Estimate | SE | Wald z | P |
|---|---|---|---|---|
| Condition = experiment | 0.13 | 0.07 | 1.89 | 0.059 |
| Week = week 16 | −0.29 | 0.08 | −3.48 | <0.001*** |
| Week = week 17 | −0.40 | 0.09 | −4.59 | <0.001*** |
| [Condition = experiment] × [week = week 16] | 0.17 | 0.11 | 1.52 | 0.130 |
| [Condition = experiment] × [week = week 17] | 0.58 | 0.11 | 5.29 | <0.001*** |

| Model variables | df | χ² | P |
|---|---|---|---|
| Condition | 1 | 3.57 | 0.059 |
| Week | 2 | 24.48 | <0.001*** |
| Condition × week | 2 | 28.55 | <0.001*** |
Analysis used Poisson regression to compare the control and experimental groups across four experiments, three in the first-year course and one in the sophomore course. ***P < 0.001.
With a significant interaction, we can compare the groups each week to explore where the significant differences exist. To do this, we use a test of proportions comparing the distribution of the numbers of students in each group who reached each maximum comment level. Taking into account the multiple comparisons across weeks, we use a Bonferroni correction to set the α level at 0.01. This gave significant differences between groups on all three first-year experiments, but a nonsignificant difference on the sophomore laboratory: week 2, ; week 16, ; week 17, ; sophomore laboratory, .
Time on Task in the LR Experiment
One confounding issue in the week 17 LR circuit experiment was that students in the control group worked through a computer-based inquiry activity at the beginning of the experiment session. The activity taught students how to calculate the uncertainty in the slope of a best-fitting line, which they also used to reanalyze the previous week’s data. As such, the control group spent approximately 2 h on the LR circuit laboratory, whereas the experimental group spent 3 h. Not having enough time to reflect on data and act on that reflection may explain the different outcomes observed in the main text. To check for this, we observed students in the experimental group 2 h into the laboratory session to evaluate what analysis they had performed by that time. The observer recorded whether the group had by that time produced a one-parameter fit or a two-parameter fit.
The results, shown in Fig. S3, demonstrate that if the students in the experimental group had been given the same amount of time on task as students in the control group, more of them still would have made the modification to the model and included an intercept in their fit. Given additional time, however, even more students were able to think critically about the task and make better sense of their data. From this result, we conclude that the effects seen in this experiment are still primarily attributable to students’ overall improved behaviors. Indeed, the effect is much larger because of the additional time, which is an important feature of the intervention itself. It takes time for students to engage deeply in a task, think critically, and solve any problems that arise (30). Comparing between students in the experimental group at the 2-h mark and the final 3-h mark demonstrates the striking effect that an extra hour can make to students’ productivity.
The number of single-parameter fits decreased slightly between the 2-h observations and the final submitted materials for the experimental group. This could have occurred if students recognized that the one-parameter fit was not helpful in understanding their data, because of the additional intercept required. This is interesting to note in light of the limitations of the analysis methods used in this study. Analyzing laboratory books can only keep track of recorded activity, and many behaviors may have occurred without record. The result that some students created additional fits and then did not submit them at the end of the laboratory period demonstrates that students in the experimental group still may have engaged in additional reflective and iterative behaviors beyond what was recorded. Differences between the control and experimental groups, then, are unlikely to be attributable to students in the experimental group simply recording more while engaging in the same behaviors as students in the control group.
The slope uncertainty activity provided to the students in the control group just before the LR circuit laboratory may, however, have narrowed the focus of students’ analysis. That is, the activity first introduced students to the uncertainty in the slope of a one-parameter best fitting line (that is, with the intercept fixed at the origin). As such, it could be argued that these students were more likely to fix the intercept at the origin so that they could apply the learned formula. The activity, however, also included a follow-up task that introduced the uncertainty in the slope of a two-parameter best fitting line (intercept not fixed), and so students did have access to both options. Students also could have used their analysis to identify the issue even if they did not change their fit.
Acknowledgments
We acknowledge the support of Deborah Butler in preparing the manuscript. We also thank Jim Carolan for the diagnostic survey data about the study participants. This research was supported by the University of British Columbia’s Carl Wieman Science Education Initiative.
References
1
Z Kanari, R Millar, Reasoning from data: How students collect and interpret data in science investigations. J Res Sci Teach 41, 748–769 (2004).
2
E-K Kumassah, J-G Ampiah, E-J Adjei, An investigation into senior high school (shs3) physics students understanding of data processing of length and time of scientific measurement in the Volta region of Ghana. Int J Res Stud Educ Technol 3, 37–61 (2013).
3
R-L Kung, C Linder, University students’ ideas about data processing and data comparison in a physics laboratory course. Nordic Stud Sci Educ 2, 40–53 (2006).
4
J Ryder, J Leach, Interpreting experimental data: The views of upper secondary school and university science students. Int J Sci Educ 22, 1069–1084 (2000).
5
J Ryder, Data interpretation activities and students’ views of the epistemology of science during a university earth sciences field study course. Teaching and Learning in the Science Laboratory, eds D Psillos, H Niedderer (Kluwer Academic Publishers, Dordrecht, The Netherlands), pp. 151–162 (2002).
6
M-G Séré, R Journeaux, C Larcher, Learning the statistical analysis of measurement errors. Int J Sci Educ 15, 427–438 (1993).
7
J Baron, Why teach thinking? - An essay. Appl Psychol 42, 191–214 (1993).
8
K-A Ericsson, R-T Krampe, C Tesch-Romer, The role of deliberate practice in the acquisition of expert performance. Psychol Rev 100, 363–406 (1993).
9
D Kuhn, M Pease, What needs to develop in the development of inquiry skills? Cogn Instr 26, 512–559 (2008).
10
S Allie, A Buffler, B Campbell, F Lubben, First-year physics students’ perceptions of the quality of experimental measurements. Int J Sci Educ 20, 447–459 (1998).
11
N-G Holmes, D-A Bonn, Doing science or doing a lab? Engaging students with scientific reasoning during physics lab experiments. 2013 Physics Education Research Conference Proceedings, eds Engelhardt P-V, Churukian A-D, Jones D-L (Portland, OR), pp 185–188. (2013).
12
M-G Séré, et al., Images of science linked to labwork: A survey of secondary school and university students. Res Sci Educ 31, 499–523 (2001).
13
L-W Anderson, L-A Sosniak Bloom’s Taxonomy: A Forty-Year Retrospective; National Society for the Study of Education Yearbooks (Univ of Chicago Press, Chicago, 1994).
14
N-G Holmes, J Ives, D-A Bonn, The impact of targeting scientific reasoning on student attitudes about experimental physics. 2014 Physics Education Research Conference Proceedings, eds Engelhardt P-V, Churukian A-D, Jones D-L (Minneapolis, MN), pp 119–122. Available at www.compadre.org/Repository/document/ServeFile.cfm?ID=13463&DocID=4062. Accessed July 28, 2015. (2014).
15
M Kapur, Productive failure. Cogn Instr 26, 379–424 (2008).
16
K VanLehn, Toward a theory of impasse-driven learning, Learning Issues for Intelligent Tutoring Systems, Cognitive Sciences, eds Mandl H, Lesgold A (Springer-Verlag, New York), pp 19–41. (1988).
17
D-L Butler, Individualizing instruction in self-regulated learning. Theory Pract 41, 81–92 (2002).
18
M-G Séré, Towards renewed research questions from the outcomes of the European project Labwork in Science Education. Sci Educ 86, 624–644 (2002).
19
S Bulu, S Pedersen, Scaffolding middle school students’ content knowledge and ill-structured problem solving in a problem-based hypermedia learning environment. Educ Tech Res Dev 58, 507–529 (2010).
20
G Salomon, D-N Perkins, Rocky roads to transfer: Rethinking mechanism of a neglected phenomenon. Educ Psychol 24, 113–142 (1989).
21
R-J Sternberg, T Ben-Zeev, Complex Cognition: The Psychology of Human Thought (Oxford Univ Press, New York). (2001).
22
M Krzywinski, N Altman, Points of significance: Error bars. Nat Methods 10, 921–922 (2013).
23
; Bureau International des Poids et Mesures, International Electrotechnical Commission, International Federation for Clinical Chemistry and Laboratory Medicine, International Organization for Standardization, International Union of Pure and Applied Chemistry, International Union of Pure and Applied Physics, International Organization of Legal Metrology, Guide to the Expression of Uncertainty in Measurement (International Organization for Standardization, Geneva). (2008).
24
D Hestenes, M Wells, G Swackhamer, Force concept inventory. Phys Teach 30, 141–158 (1992).
25
D Hestenes, M Wells, A mechanics baseline test. Phys Teach 30, 159–166 (1992).
26
L Ding, R Chabay, B Sherwood, R Beichner, Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment. Phys Rev Spec Top-PH 2, 010105 (2006).
27
; R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna). (2014).
28
D Bates, M Maechler, B Bolker, S Walker, lme4: Linear Mixed-Effects Models Using Eigen and S4 (R Foundation for Statistical Computing, Vienna). Available at CRAN.R-project.org. Accessed July 28, 2015. (2014).
29
J Fox, S Weisberg, An R Companion to Applied Regression (Sage Publications, Inc., Thousand Oaks, CA), 2nd Ed. (2011).
30
A Hofstein, V-N Lunetta, The laboratory in science education: Foundations for the twenty-first century. Sci Educ 88, 28–54 (2004).
Submission history
Published online: August 17, 2015
Published in issue: September 8, 2015
Notes
This article is a PNAS Direct Submission. S.C.S. is a Guest Editor invited by the Editorial Board.
Competing Interests
The authors declare no conflict of interest.
Cite this article
Teaching critical thinking, Proc. Natl. Acad. Sci. U.S.A. 112 (36) 11199–11204, https://doi.org/10.1073/pnas.1505329112 (2015).