Good tests make children fail – here’s why

Many parents and teachers are critical of the Standardised Assessment Tests (SATs) that have recently been taken by primary school children. One common complaint is that they are too hard. Teachers at my son’s school sent children home with example questions to quiz their parents on, hoping to show that getting full marks is next to impossible.

Invariably, when parents try out these tests, they focus on the most difficult or confusing items. Some parents and teachers can be heard complaining on social media that if they get questions wrong, surely the tests are too hard for ten-year-olds.

But how hard should tests for children be?

As a psychologist, I know we have some well-developed principles that can help us address the question. If we look at the SATs as measures of some kind of underlying ability, then we can turn to one of the oldest branches of psychology – “psychometrics” – for some guidance.

Getting it just right

A good test shouldn’t be too hard. If most people get most questions wrong, then you have what is called a “floor effect”. The result is that you can’t tell any difference in ability between the people taking the test.

If we started the school sports day high jump with the bar at two metres high (close to the world record), then we’d finish sports day with everybody getting the same – zero successful jumps – and no information about how good anyone is at the high jump.

But at the same time, a good test shouldn’t be too easy. If most people get everything right, then the effect is, as you might expect, called a “ceiling effect”. If everybody gets everything right then again we don’t get any information from the test.

The key idea is that tests must discriminate. In psychometric terms, the value of a test is about the match between the thing it is supposed to measure and the difficulty of the items on the test. If I wanted to gauge maths ability in six-year-olds and I gave them all an A-Level paper, we can presume that nearly everyone would score zero. Although the A-Level paper might be a good test, it is completely uninformative if it is badly matched to the ability of the people taking the test.
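
To make the floor and ceiling effects concrete, here is a minimal sketch. It assumes a deliberately crude scoring rule (a child passes an item whenever their ability is at least the item’s difficulty), and the ability and difficulty numbers are invented for illustration rather than taken from any real test.

```python
def score(ability, item_difficulties):
    """Count the items passed under a crude, assumed rule:
    an item is passed whenever ability >= difficulty (no guessing, no slips)."""
    return sum(1 for d in item_difficulties if ability >= d)

# Hypothetical abilities on an arbitrary 0-10 scale.
abilities = [2, 4, 6, 8]

too_hard = [11, 12, 13, 14]  # every item is beyond everyone: floor effect
too_easy = [0, 0, 1, 1]      # every item is within everyone's reach: ceiling effect
matched = [3, 5, 7, 9]       # difficulties spread across the ability range

floor_scores = [score(a, too_hard) for a in abilities]
ceiling_scores = [score(a, too_easy) for a in abilities]
matched_scores = [score(a, matched) for a in abilities]

print(floor_scores)    # [0, 0, 0, 0]
print(ceiling_scores)  # [4, 4, 4, 4]
print(matched_scores)  # [0, 1, 2, 3]
```

Only the matched paper separates the four children. The other two give everyone the same score, so they say nothing about who is better.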

Here’s the rub: for a test to be sensitive to differences in ability, it must contain items which people get wrong. Actually, there’s a precise answer to the proportion that you should get wrong – in the most sensitive test it should be half of the items. Questions which you are 50% likely to get right are the ones which are most informative.
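
One way to see where the 50% figure comes from is the one-parameter logistic (Rasch) model, a standard psychometric model that this article does not itself name, so treat it as an assumption. In that model the information a single right-or-wrong item carries about ability is p × (1 − p), where p is the chance of a correct answer. The sketch below evaluates that quantity for a few values of p; the function names are illustrative.

```python
import math

def p_correct(ability, difficulty):
    """Chance of a correct answer under an assumed one-parameter logistic (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(p):
    """Fisher information about ability from one right/wrong item in that model."""
    return p * (1.0 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P(correct) = {p:.1f} -> information = {item_information(p):.2f}")

# For a test-taker of ability 0.0, find the most informative of five hypothetical items.
ability = 0.0
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
best = max(difficulties, key=lambda d: item_information(p_correct(ability, d)))
print(best)  # 0.0
```

The information peaks at 0.25 when p is 0.5 and falls away symmetrically on either side, which is why the item whose difficulty equals the test-taker’s ability comes out as the most informative one.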

How we feel about measuring and labelling children according to their skill at taking these tests is a big issue, but it is important that we recognise that this is what tests do. A well designed test will make all children get some items wrong – it is inherent in their design. It is up to us how we conceptualise that: whether tests are an unnecessary distraction from true education, or a necessary challenge we all need to be exposed to.

Better tests?

If you adopt this psychometric perspective, it becomes clear that the tests we use are an inefficient way of measuring any individual child’s particular ability to do the test. Most children will be asked a bunch of questions which are too easy for them, before they get to the informative ones which are at the edge of their ability. Then they will go on to attempt a bunch of questions which are far too hard. And pity the people for whom the test is poorly matched to their ability and consists mostly of questions they’ll get wrong – which is both uninformative in psychometric terms, and dispiriting emotionally.

A hundred years ago, when we began our modern fixation with testing and measuring, it was hard to avoid the waste where many uninformative and potentially depressing questions were asked. This was simply because all children had to take the same exam paper.

Nowadays, however, examiners can administer tests via computer, and algorithmically identify the most informative questions for each child’s ability – making the tests shorter, more accurate, and less focused on the experience of failure. You could throw in enough easy questions that no child would ever have the experience of getting most of the questions wrong. But still there’s no getting around the fact that an informative test has to contain questions most people sitting it will get wrong.
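
As a rough illustration of how a computer might pick the next question, here is a toy sketch. It assumes a simulated child who always answers correctly up to a fixed difficulty and always fails beyond it, and it simply halves the range of difficulties still in doubt after each answer. Real adaptive tests use probabilistic models and are considerably more involved; every name and number below is hypothetical.

```python
def adaptive_test(answers_correctly, lowest=0, highest=100, max_questions=7):
    """Ask each question at the midpoint of the difficulty range still in doubt.

    answers_correctly(difficulty) returns True or False. This is a toy
    halving rule, not the procedure used by any real testing product.
    """
    asked = []
    low, high = lowest, highest
    for _ in range(max_questions):
        if high - low <= 1:
            break  # the range cannot be narrowed any further
        difficulty = (low + high) // 2
        correct = answers_correctly(difficulty)
        asked.append((difficulty, correct))
        if correct:
            low = difficulty   # ability is at least this high
        else:
            high = difficulty  # ability is below this level
    return asked, (low, high)

def child(difficulty):
    """Hypothetical child who gets an item right when its difficulty is at most 40."""
    return difficulty <= 40

asked, ability_range = adaptive_test(child)
print(asked)          # [(50, False), (25, True), (37, True), (43, False), (40, True), (41, False)]
print(ability_range)  # (40, 41)
```

Six questions are enough to place this child on a 100-point scale, and half of them are answered correctly. A fixed paper covering the same scale would need far more items, most of them too easy or too hard for this particular child.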

Even a good test can measure an educationally irrelevant ability (such as merely the ability to do the test, or memorise abstract grammar rules), or be used in ways that harm children. But having difficult items isn’t a problem with the SATs, it’s a problem with all tests.

This article was originally published on The Conversation. Read the original article.

21 thoughts on “Good tests make children fail – here’s why”

  1. I’ve been studying neuroscience as a post-doc for the last 12 years. Last year I took Peggy Mason’s Neuroscience online course on Coursera. What I noticed when I took the quizzes is that every time I came upon a question I didn’t know the answer to, I could feel a familiar flood of stress/dread hormones rushing through my body. They essentially compromised much of my reasoning/cognitive ability in the moment – I simply could not fully focus on the material to the point where I had difficulty even reading and understanding the question.

    To me, this was clear validation of neurophysiologist Steve Porges’s Neuroception Theory – the body and brain perceive all evaluation as threat. Why do we need to threaten children who have immature neural networks that are more powerfully adversely impacted than mine?

    1. I think you could stand to think the problem through more carefully. You have no reason to believe that your experience is typical. Some people have a fear of test-taking. Some people don’t. Determining how many are in each category would require a large sample of people. It seems likely to me that most children, when taking a test that is meant to evaluate their teachers and not them, are less powerfully adversely impacted by the experience than you were.

      1. The issue isn’t whether my experience is typical. The issue is the uber-vulnerability of children’s nervous systems to stress and threat. There are many reasons that, as Ken Robinson points out, kids go from reporting themselves as 98% creative in the beginning of schooling and 2% by its end. I would love to see some well-designed, controlled experiments assessing the impacts of evaluative testing on that outcome. Based upon numerous ancillary stress studies, I would hypothesize testing’s impact to be significant. Imagine grading Picasso on his Blue Nude or Mary Oliver on Wild Geese.

    2. Children aren’t made of sugar. Testing their knowledge won’t harm them a bit. There is plenty in life that presents difficulty. Some school test is a lot less of a problem than how to snuggle up to a cutie or hit a baseball.

      1. It’s not a question of what children are made of or if they should be tested. It’s what happens to their neurophysiology when we do what we do to them. Do we want kids to be 98% creative in kindergarten and only 2% creative by high school? If not, then we need to explore and examine what the critical cause-and-effect variables are. I would posit that creativity is a product of body and brain interacting with environments in novel and unexpected ways. Kindergarteners seem to have that capacity in spades. My hypothesis would be that allostatic load (bad stress) out of balance with eustress (good stress) adversely impacts children to a much greater extent than it does adults without us ever realizing it. And that the way we currently administer tests is very likely a part of that adverse impact. I don’t think it’s an accident that Sergey Brin and Larry Page (along with all these people http://gamontessori.com/newgams/famous.htm) went to Montessori Schools as kids.

    3. Where is Ken Robinson’s evidence for the statistics that you quote? You see the problem with quoting some random nonsense is that people will spot it. The fact that you don’t like tests is an issue. I have taught primary aged children for over a decade. There is maybe 3 or 4 children out of hundreds who really found tests painful. Most didn’t and a greater percentage actually liked them and wanted more.

      There is a whole generation of educationalists who don’t seem to have grown up and learnt to cope with their feelings – ditto journalists. Here’s an example of how figures are distorted by anti-testing people http://www.bbc.co.uk/news/education-36229995

      There are stresses and strains in life that are worse than testing. What are you going to do, ban everything? Children have to cope with childhood diseases, not all of which are preventable, bereavement, parents with addiction and MH problems. I grew up in a dysfunctional family and not only managed to do well at school but it was the way I could remove myself and find a better life.

      It is also incredibly of Sir Ken and his followers to ignore the hundreds of years of creativity and innovation by people who were tested. Shakespeare shouldn’t exist according to this line of thinking because he was reciting Latin verbs at the beginning of every day. Creativity cannot be taught; to be creative one needs to know what is already there in order to move away from it. Picasso knew all the standard art techniques and moved away from them. He could play around with techniques because he knew what they were and had mastered them.

      Small children need to master the basics – end of.

      1. If this is your actual experience, I suggest you start gathering real evidence based on paying much closer attention …

        “The fact that you don’t like tests is an issue. I have taught primary aged children for over a decade. There is maybe 3 or 4 children out of hundreds who really found tests painful. Most didn’t and a greater percentage actually liked them and wanted more.”

      2. Or maybe you could take the evidence for what it is and stop projecting your feelings onto millions of others? Occam’s razor on this one: the vast majority of those who take tests do not report MH issues or problems because they don’t have them.

    4. There’s something wrong with you if you freak out about getting one question wrong. Good grief.

      I also don’t like that the SAT has been dumbed down to the point of near meaninglessness. Severe ceiling effects are now in play.

  2. There’s a tension between using testing to distinguish and using it to evaluate. My professional organization created a test for the topic I teach, but the test is designed to distinguish, and an average program would get 50% on it. So I have never used it myself, because a 50% score would be viewed as unacceptable by every stakeholder in my program.
    And why should tests be designed to distinguish, anyway? A test should measure the content it measures. If all the students pass, that means they all learned the content.

    1. To add to this, some subjects don’t lend themselves to difficult tests. You can easily turn the difficulty dial on math and reading tests; history tests generally only achieve “difficulty” by drawing on increasingly obscure vocabulary, pressuring teachers to sacrifice depth for breadth.

      1. Define depth and breadth as you have used it here.

        Are you claiming that it is impossible to test for depth in history? If something cannot be measured, how do you know that it exists? If history tests are only achieving difficulty by increasing breadth and never depth, then that is a problem with the design of the tests.

    2. It’s different in primary school. In your profession, you’re teaching them a skill to use in their job, and your job is to ensure everyone who passes has mastered that skill. The goal then is to get them to mastery in the shortest amount of time possible. In primary school, none of the children are considered ready to enter the workforce, and the goal of primary school is to prepare them as much as possible. There is always more content which can be added to the curriculum. Teachers should be trying to maximize the amount of content taught in the curriculum. Further, if students are only learning the content they are taught, and unable ever to generalize the skills to a real world context, then many courses in primary school are a waste of time. Tests need to include questions the children were not explicitly taught the answers to so that teachers cannot use pure rote memorization of simple concepts.

  3. Did you know your site makes people log in to wordpress before commenting? I’ve often tried to comment here and given up because I didn’t have my wordpress password handy.

    1. Yes, we did know. Although you can also login via twitter or fb I think. If you saw the amount of comment spam there is you’d appreciate that we have to force a login, however regrettable this is for free-and-easy discourse among commentators

  4. “Nowadays, however, examiners can administer tests via computer, and algorithmically identify the most informative questions for each child’s ability – making the tests shorter, more accurate, and less focused on the experience of failure.”
    In my son’s school, they’ve been using a test called i-ready, similar to what you are describing, 4 times a year to monitor progress. Basically, if you answer a question wrong, the next question will be easier. If you answer a question right, the next question will be harder.
    http://www.curriculumassociates.com/products/iready/iready-adaptive-diagnostic-assessment.aspx
    The school’s ultimate goal is to prepare students for the PARCC test, a highly controversial test given in many states in the US, used to assess teachers and schools.

    1. Define harder. I can understand a math question can be scaled by difficulty, but how do you scale a history, language, science question? Those are mostly fact based questions, unless it is scaled by the amount of analysis & interpretation that is needed. A kid may know a lot about the civil war, but very little about the revolutionary war. If the questions are mixed,
      1. Revolutionary war
      2. Civil war
      3. Revolutionary war
      4. Gulf war
      5. Civil war
      the computer generator that increases/decreases difficulty is a waste, because mixed knowledge can’t be scored by difficulty due to an individual’s exposure, interest, & outside educational support.

      1. Example of scaled questions on one of your cited wars:
        Was there a gulf war?
        What countries were in the gulf war?
        Were there religious differences between the countries?
        What is Islam?
        What are the main branches of Islam?
        What sparked the Shiite Sunni split?
        Which uncle of the prophet was involved?
        Which American president sold the combatants chemical weapons for use in the war?
