Evaluating Teachers: Data-Driven vs. Peer-Driven, or, Wal-Mart vs. Etsy
A NYT article exposes flaws in the data-driven “value-added” approach to assessing a given teacher’s classroom skills: when using student standardized test scores as the key measure of a teacher’s effectiveness, significant gaps emerge not because of what the score is supposed to represent, but rather because a test score doesn’t reflect every classroom situation.
Teachers in team-teaching situations fall through the cracks, for example. Blips in the algorithms that tabulate scores can make the same students appear to fall short on one standardized test while scoring near the top on another. Data may be assigned incorrectly, or not at all. And there is the matter of trying to assess a teacher’s personality–the electricity, charisma, or sense of delight she or he may impart to students and that keeps parents requesting the teacher despite what scores might say.
Now that scores are public–in the case of the Los Angeles Times, published in a ranked scale–the newspaper has been forced to provide context given that the data was originally presented with few qualifiers or discussions of how it was arrived at. Parents grasp at straws in an effort to compare teachers within a school or schools to other schools.
It’s like our wish to comparison shop when buying a mattress. There’s no standardization and the model numbers and names change from mattress to mattress, frustrating our ability to assess price and value. Data gives the impression of objective solidity. We want to be able to say, this apple versus that apple, versus those oranges over there.
But learning isn’t about tufted versus firm or pillow-topped versus flat. Beyond a certain level of skill mastery, measuring learning is like asking, “How do you sleep?”
So when teachers take the lead in evaluating themselves, they go about it very differently. The objective, as Cynthia Danielson at the Association for Supervision and Curriculum Development (ASCD) notes in her post, is to do four things:
- develop a consistent definition of good teaching over the arc of a given teacher’s career
- develop a consistent definition of good teaching among an entire school’s teaching staff
- establish a conversation between a teacher and his/her evaluator that incorporates outside observation and the teacher’s own self-reflection
- train the evaluator to give good notes and make proper time and space for evaluation
What I notice is how the cart comes before the horse in the data-driven approach. High-stakes student test scores are the tail that wags the classroom experience. For teachers, by contrast, if the process works, good results should follow, whether or not they show up on a standardized test.
Without imputing bad faith, I see why the data-driven approach is comforting. It puts student outcomes first, possibly at the expense of other outcomes that have different criteria. It feels reassuringly objective. It’s also an incomplete look at what a particular student is capable of.
Without imputing bad faith, I also see why the teacher/peer-driven approach is appealing. It accounts for the process as much as the result, and it prioritizes the whole child and in its own way puts children first. But at its worst, it focuses too much on the teacher’s process and sets aside the student.
I’ve written before about how student evaluations of teachers can be one way to introduce the student’s voice in assessing good teaching and how well and how much the student needed to learn. And to some extent, I think adding essays to the SAT is tacit recognition that multiple choice questions alone slice achievement very narrowly–too narrowly.
But I think there may be an unresolvable tension that the scale and diversity of American students present–between the need to quantify results across a large population of students so they can be compared using consistent measures, and the fact that teaching is a hand-made, artisanal, highly personal dialogue between unique individuals. Scale prompts a Wal-Mart sort of response to testing that relies on measurement and mechanization; individuality necessitates an Etsy-like approach that takes into account quirky kids and idiosyncratic teachers. So long as elements of one method can be brought into the other, maybe it can be a more productive tension than a pendulum swing toward one end or the other.