This paper reviews empirical research examining a number of common concerns about the accuracy and usefulness of “student evaluations of teaching” (SETs). It also outlines changes made to Stanford’s end-term course evaluation form and reports to address some of these issues, as well as recommendations for utilizing alternative forms of feedback on teaching effectiveness. Key points are summarized below.
Course and teaching feedback, often referred to as “student evaluations of teaching” (SETs) or “student evaluations,” has a long and complex history in higher education. The purpose and usefulness of SETs have come into question over the years, and more recent research has revealed important insights into these issues. The concerns include whether evaluation scores correlate directly with teaching effectiveness and student learning, whether results are confounded by factors such as bias and contextual conditions beyond an instructor’s control, and broader critiques of the drive to rate and quantify teaching and learning for institutional purposes. This paper examines these concerns and summarizes the research associated with them. The last section addresses changes that have been made at Stanford to the course evaluations process and offers recommendations for how instructors, departments, and schools can interpret and use end-term student course feedback.
Although it is common practice to include questions about teaching effectiveness in many SETs, studies suggest that such ratings alone should not be used to directly and reliably rate the teaching ability of instructors. While students are well-positioned to speak of their experience of the course 1 -- such as how difficult or easy they found the content, or whether classroom activities helped them complete assignments -- they lack the expert judgement to comment on the structure, relevance, or method behind the course content, or the instructor’s knowledge and scholarship 2 . Furthermore, students’ subjective experiences and contextual factors, such as lack of interest in a subject or poor attendance, may influence their reactions to an instructor and make interpreting results difficult.
Further complicating this point are concerns about the statistical validity of SETs. Stark and Freishtat (2014) point out that the variables in SETs are frequently “ordinal category variables”: possible values that have a natural order, such as “Excellent, Good, Fair, Poor, Very Poor”, but no standard quantitative distance between ratings. These variables are typically assigned numbers, such as 1 to 5, but the numbers are not measured quantities; they are merely labels, which make it convenient, though not necessarily meaningful, to calculate a mean, median, and so forth. One student’s choice of a number on a 1 to 5 scale may not mean the same as another’s, and the context for these choices is not always clear. When the scores are averaged, the precise value (such as a 4.2 in “teaching quality”) is difficult to interpret, because averages can obscure potentially meaningful variation across responses. For example, the average of two “3” scores is the same as the average of a “1” and a “5”, yet these distributions have very different implications when coming from a group of students. In addition, a “bootstrap” analysis of historical data at Stanford shows that a variation of +/- 0.3 (more than half a point in total) can be expected from the random selection of respondents from the pool of available students alone. Finally, if a departmental average is taken as a target for performance, a significant portion of instructors will always fall below it, regardless of their individual ability or commitment.
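To make these two statistical points concrete, the sketch below (in Python, with invented ratings on a 1 to 5 scale; it is an illustration, not an analysis of Stanford data) shows how identical averages can arise from very different response patterns, and how a simple bootstrap resampling of one class’s ratings indicates how much the mean can vary simply because of which students happen to respond.

import random
import statistics

# Identical means, very different distributions: one class clustered at the
# midpoint, another polarized at the extremes.
clustered = [3, 3, 3, 3, 3, 3]
polarized = [1, 5, 1, 5, 1, 5]
print(statistics.mean(clustered), statistics.mean(polarized))  # both are 3

# Bootstrap: resample the observed ratings (with replacement) many times to
# estimate how much the mean moves due to the composition of respondents.
ratings = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4, 1, 4, 5, 3, 4]  # invented example data
boot_means = sorted(
    statistics.mean(random.choices(ratings, k=len(ratings)))
    for _ in range(10000)
)
low, high = boot_means[250], boot_means[9749]  # approximate central 95% range
print(f"observed mean = {statistics.mean(ratings):.2f}, "
      f"bootstrap 95% range = [{low:.2f}, {high:.2f}]")

The particular numbers depend entirely on the invented data; the point is that the spread around the mean is often comparable in size to the fractional score differences sometimes compared across instructors.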
Several studies have found that SET ratings do not correlate highly with student learning, if effective instruction is taken to mean that students achieve better outcomes on assessments. Experimental studies assigning students at random to sections and courses generally find weak or negative associations between SET ratings and performance on graded assessments 3,4,5,6,7,8 . One study found similar learning gains for students in sections with the highest and lowest course evaluation scores 9 . Many have hypothesized that grading rigor is strongly correlated with student ratings, or that students’ expectations of high grades encourage high ratings, and vice versa 10 . While there is some evidence, as one might expect, of a moderate association between students’ expected grades and SET scores, Tripp et al. (2018) found that students’ perception of fairness in grading (clear rubrics, etc.) can offset the potential for student retaliation over grades 11 . To preclude the possibility of retaliation, Stanford’s end-term course feedback is collected before final grades are assigned, and instructors do not see results until after grades have been released. Perhaps of greater concern is that instructors may modify their teaching style to pursue higher student ratings rather than students’ mastery of the topic and academic rigor.
If SET scores are not highly correlated with student learning and teaching effectiveness, this may be because student ratings of instruction are influenced by a variety of factors that are difficult to correct for and often outside the instructor’s control. This raises concerns about whether it is possible to obtain a fair measure of teaching quality. For example, research suggests that the gender 12 and attractiveness 13,14 of instructors may influence students’ evaluation of teaching quality. Some studies have also suggested that students make unconscious judgements based on factors such as gestures and deportment, rather than their whole experience of the class, and that these judgements match very closely with end-term SET results 15 . Instructor race 16 and cultural factors such as accent 17 have also been examined as sources of bias. Given these factors, SETs should be used cautiously, particularly when making high-stakes decisions in departments and schools. A more robust approach would also call upon a variety of parallel measures that are sensitive to the complete context of the instructor’s individual career. These measures may include mid-quarter feedback from students, peer evaluations, and self-evaluations in the form of a teaching portfolio or similar artifact.
End-term course feedback began at Stanford as a student initiative in the late 1950s 18 , when a student publication, The Scratch Sheet, first published editorial descriptions of courses. Today, students’ feedback continues to be regarded as a useful source of insight into their experience of a course. Students’ end-term feedback on their experiences can inform instructors’ reflection on decisions about their pedagogy, but it should be considered one tool among a range of evaluation techniques and viewed in the context of the instructor’s entire academic and teaching career. With the revised end-term course evaluations process introduced in Autumn Quarter 2015-2016, Stanford reconfigured the end-term course feedback form to address some of the concerns about bias and about treating SETs as a calibrated measure of teaching effectiveness. Central to the Course Evaluation Committee’s thinking was shifting attention away from the performance of individual instructors and towards students’ learning in the course and their experience with specific elements of the course and learning process, such as textbooks, guest lecturers, and their achievement of the defined learning goals. Consequently, there are fewer questions about instructors and more questions about aspects of the course such as learning goals, course elements, and course organization. A question about the effectiveness of instruction was retained to provide some continuity with the previous feedback form (although the previous scores are not directly comparable), with the goal of using the customization options to supplement and provide context for this score.
The current process also provides the option to customize course feedback forms with specific learning goals, other course elements, and questions of the instructors’ own devising. These customizable questions enable instructors to direct their own enquiries about teaching effectiveness and student learning, and to gain a sense of student engagement with the topics of their course. While there is a set of common questions that apply across all participating schools, the option to add custom questions enables instructors to pose questions that speak to the specifics of their course. Instructors have the freedom to craft learning goals that reflect their own aims and models for student learning, from discrete units of skill or competency to outcomes such as an informed and deepened appreciation of the field or the ability to synthesize viewpoints into an argument. Well-defined learning goals can help students focus on their own achievement in the course, and help instructors judge whether this is consistent with the course design.
The section feedback form is used to evaluate instructors of course sections, such as labs and discussion sections, including TAs, CAs, and Teaching Fellows. The current section feedback form is based on the section leader evaluation form used before Autumn 2015-2016, but incorporates several changes. Bearing in mind that these results often become part of an instructor’s teaching portfolio and are used in job applications, the section form is designed to elicit feedback that is useful for the instructor specifically. The section form includes an extensive question bank from which the instructor can select items that reflect the particular elements of their teaching situation and style, and there are several opportunities for students to provide detailed, open-ended responses on the aspects of the instructor’s teaching that were most helpful or that offer potential for improvement. This gives wide scope for qualitative student feedback, which can be especially valuable for TA and CA instructors who may be contemplating a teaching career.
In step with changes to the course feedback form, the instructor reports were also redesigned to move away from reliance on averages and direct attention to the distribution of scores, enabling instructors to make more informed judgements about the range of student responses (Fig. 1).
Fig 1. Chart of frequency distributions for use in result reporting
Furthermore, averages of averages are no longer generated and their use is discouraged. Concentrating on the frequency distributions allows instructors to view the spread of responses and distinguish between, for example, a flat distribution, a clustering of responses at one extreme, or even multiple clusters. Standard statistics are still included, along with response rates, in data tables appended to the charts. There are also more opportunities for student comments, which often provide valuable context when interpreting the numerical results. The Center for Teaching and Learning (CTL) provides resources and access to consultations that can help instructors read and interpret evaluation reports and, if necessary, develop responses.
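As a rough illustration of what a distribution-focused view conveys (again using invented ratings, not the actual report-generation code), the short sketch below tabulates a frequency distribution instead of collapsing responses into a single average:

from collections import Counter

# Invented ratings on a 1 to 5 scale (1 = lowest, 5 = highest).
ratings = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4, 1, 4, 5, 3, 4]

counts = Counter(ratings)
total = len(ratings)

# A simple text histogram: the shape of the distribution is visible at a
# glance, rather than being hidden behind one number.
for score in range(5, 0, -1):
    n = counts.get(score, 0)
    print(f"{score}: {'#' * n:<10} {n:2d}  ({n / total:.0%})")

A flat spread, a cluster at one extreme, or two separate clusters all stand out immediately in such a view, even though they can produce similar averages.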
A review of the research literature suggests that due caution is appropriate when SETs are used in making critical decisions about personnel and faculty development. Studies to date have not shown compelling evidence that student ratings are highly correlated with student learning and teaching effectiveness, in part, the evidence suggests, because student ratings of their instructors are subject to a range of potential biases. There are also questions about the statistical validity of SETs: it can be misleading to put too much weight on averages, particularly combined averages used as ratings of teaching effectiveness, or on fractional differences in scores that fall within expected variation. However, as the Stanford Course Evaluation Committee observes, end-term student feedback is best understood as one element in a larger picture of an instructor’s teaching.
Part of that larger picture could include mid-term feedback from students (for which CTL provides a number of services), peer observations and evaluations, and self-evaluations in the form of a teaching portfolio or similar artifact.
Student responses, focused on their experience of the course and its goals rather than on individual instructors, can productively inform the growth of instructors and their pedagogical approaches. To reflect this, the current course form is shorter, structured around students’ experience of the course, and customizable. Results, when reported, emphasize the distribution of responses rather than average scores, which are still tabulated and presented for reference. When taken as one piece of evidence balanced with other important materials, the end-term course evaluation can be a useful tool in tracking the trajectory of an instructor’s teaching career, and a catalyst for improved teaching 19 .
1 Scriven, M., 1995. Student Ratings Offer Useful Input into Teacher Evaluations, Practical Assessment, Research & Evaluation, 4(7).
2 Stark, P.B., and R. Freishtat, 2014. An Evaluation of Course Evaluations, ScienceOpen, DOI 10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1
3 Carrell, S.E., and J.E. West, 2010. Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors, J. Political Economy, 118, 409-432.
4 Boring, A., K. Ottoboni, and P.B. Stark, 2016. Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness, ScienceOpen, DOI 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
5 Braga, M., M. Paccagnella, and M. Pellizzari, 2014. Evaluating Students’ Evaluations of Professors, Economics of Education Review, 41, 71-88.
6 Johnson, V.E., 2003. Grade Inflation: A Crisis in College Education, Springer-Verlag, NY, 262pp.
7 MacNell, L., A. Driscoll, and A.N. Hunt, 2015. What’s in a Name: Exposing Gender Bias in Student Ratings of Teaching, Innovative Higher Education, 40, 291-303. DOI 10.1007/s10755-014-9313-4
8 Uttl, B., C.A. White, and D.W. Gonzalez, 2016. Meta-analysis of Faculty’s Teaching Effectiveness: Student Evaluation of Teaching Ratings and Student Learning Are Not Related, Studies in Educational Evaluation, DOI 10.1016/j.stueduc.2016.08.007
9 Beleche, T., D. Fairris, and M. Marks, 2012. Do Course Evaluations Truly Reflect Student Learning? Evidence from an Objectively Graded Post-test, Economics of Education Review, 31(5), 709–719.
10 Boring, A., K. Ottoboni, and P.B. Stark, 2016. Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness, ScienceOpen, DOI 10.14293/S2199-1006.1.SOR-EDU.AETBZC.v1
11 Tripp, Thomas M., L. Jiang, K. Olson, and M. Graso, 2018. The Fair Process Effect in the Classroom: Reducing the Influence of Grades on Student Evaluations of Teachers, Journal of Marketing Education, DOI 10.1177/0273475318772618
12 Boring, A., 2015. Gender Bias in Student Evaluations of Teachers, OFCE-PRESAGE- Sciences-Po Working Paper, http://www.ofce.sciences-po.fr/pdf/dtravail/WP2015-13.pdf
13 Wolbring, T., and P. Riordan, 2016. How Beauty Works. Theoretical Mechanisms and Two Empirical Applications on Students’ Evaluations of Teaching, Social Science Research, DOI 10.1016/j.ssresearch.2015.12.009
14 Campbell, H., K. Gerdes, and S. Steiner, 2005. What’s Looks Got to Do with It? Instructor Appearance and Student Evaluations of Teaching, Journal of Policy Analysis and Management, 24, 611–620.
15 Ambady, N., and R. Rosenthal, 1993. Half a Minute: Predicting Teacher Evaluations from Thin Slices of Nonverbal Behavior and Physical Attractiveness, J. Personality and Social Psychology, 64, 431-441.
16 Archibeque, O., 2014. Bias in Student Evaluations of Minority Faculty: A Selected Bibliography of Recent Publications, 2005 to Present. http://library.auraria.edu/content/bias-student-evaluations-minority-fa… (last retrieved 30 September 2016)
17 Subtirelu, N.C., 2015. “She does have an accent but…”: Race and language ideology in students’ evaluations of mathematics instructors on RateMyProfessors.com, Language in Society, 44, 35-62. DOI 10.1017/S0047404514000736
18 Stanford Course Evaluation Committee. Course Evaluation Committee Report. 18 December, 2013.
19 Stanford Course Evaluation Committee. Course Evaluation Committee Report. 18 December, 2013.