Type I and Type II errors are, respectively, when you allow a statistical test to convince you of a false effect, and when you allow a statistical test to convince you to dismiss a true effect. Despite being fundamentally important concepts, they are terribly named. Who can ever remember which way around the two errors go? Well now I can, thanks to a comment from a friend I thought so useful I made it into a picture:
The Economist has an excellent video on consciousness, what it is, why and how it evolved.
The science section of The Economist has long had some of the best science reporting in the mainstream press and this video is a fantastic introduction to the science of consciousness.
It’s 12 minutes long and it’s worth every second of your time.
The Reproducibility Project results have just been published in Science, a massive, collaborative, ‘Open Science’ attempt to replicate 100 psychology experiments published in leading psychology journals. The results are sure to be widely debated – the biggest result being that many published results were not replicated. There’s an article in the New York Times about the study here: Many Psychology Findings Not as Strong as Claimed, Study Says
This is a landmark in meta-science: researchers collaborating to inspect how psychological science is carried out, how reliable it is, and what that means for how we should change what we do in the future. But, it is also an illustration of the process of Open Science. All the materials from the project, including the raw data and analysis code, can be downloaded from the OSF webpage. That means that if you have a question about the results, you can check it for yourself. So, by way of example, here’s a quick analysis I ran this morning: does the number of citations of a paper predict how large the effect size will be of a replication in the Reproducibility Project? Answer: not so much
That horizontal string of dots along the bottom is replications with close to zero-effect size, and high citations for the original paper (nearly all of which reported non-zero and statistically significant effects). Draw your own conclusions!
Link: my code for making this graph (in python)
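The linked code does the real work; for a flavour of how simple such a check is, here’s a minimal sketch. The function name and the placeholder numbers are mine for illustration – they are not the project’s data, which you should download from the OSF page.

```python
# Sketch: does an original paper's citation count predict the replication
# effect size? Uses a Spearman rank correlation, since citation counts are
# heavily skewed. Placeholder numbers only -- not the Reproducibility
# Project's data.
import numpy as np
from scipy import stats

def citation_effect_relation(citations, replication_effects):
    """Spearman rank correlation between citation counts and replication effect sizes."""
    rho, p = stats.spearmanr(citations, replication_effects)
    return rho, p

# Hypothetical illustration values:
citations = np.array([120, 45, 300, 80, 15, 210])
effects = np.array([0.05, 0.30, 0.02, 0.18, 0.40, 0.00])
rho, p = citation_effect_relation(citations, effects)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```

Swap the placeholder arrays for the project’s actual columns and you have essentially the whole analysis.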
Frontiers in Psychology has just published an article on ‘Fifty psychological and psychiatric terms to avoid’. These sorts of “here’s how to talk about” articles are popular but themselves can often be misleading, and the same applies to this one.
The article supposedly contains 50 “inaccurate, misleading, misused, ambiguous, and logically confused words and phrases”.
The first thing to say is that by recommending that people avoid certain words or phrases, the article is violating its own recommendations. That may seem like a trivial point but it isn’t when you’re giving advice about how to use language in scientific discussion.
It’s fine to use even plainly wrong terms to discuss how they’re used, the multiple meanings and misconceptions behind them. In fact, a lot of scientific writing does exactly this. When there are misconceptions that may cloud people’s understanding, it’s best to address them head on rather than avoid them.
Sometimes following the recommendations for ‘phrases to avoid’ would actually hinder this process.
For example, the piece recommends you avoid the term ‘autism epidemic’ as there is no good evidence that there is an actual epidemic. But this is not advice about language, it’s just an empirical point. According to this list, all the research that has used the term to discuss the actual evidence, contrary to the popular idea, should have avoided it and presumably referred to ‘the concept that shall not be named’.
The article also recommends against using ‘ambiguous’ words but this recommendation would basically kill the English language as many words have multiple meanings – like the word ‘meaning’ for example – but that doesn’t mean you should avoid them.
If you’re a fan of pedantry you may want to go through the article and highlight where the authors have used other ambiguous psychological phrases (starter for 10, “memory”) and post it to some obscure corner of the internet.
Many of the recommendations also rely on you agreeing with the narrow definition and limits of use that the authors premise their argument on. Do you agree that “antidepressant medication” means that the medication has a selective and specific effect on depression and no other conditions – as the authors suggest? Or do you think this just describes a property of the medication? This is exactly how medication description works throughout medicine. Aspirin is an analgesic medication and an anti-inflammatory medication, as well as having other properties. No banning needed here.
And in fact, this sort of naming is just a property of language. If you talk about an ‘off-road vehicle’, and someone pipes up to tell you “actually, off-road vehicles can also go on-road so I recommend you avoid that description” I recommend you ignore them.
The same applies to many of the definitions in this list. The ‘chemical imbalance’ theory of depression is not empirically supported, so don’t claim it is, but feel free to use the phrase if you want to discuss this misconception. Some conditions genuinely do involve a chemical imbalance though – like the accumulation of copper in Wilson’s disease – so you can use the phrase accurately in this case, being aware of how it’s misused in other contexts. Don’t avoid it, just use it clearly.
With ‘Lie detector test’, it’s true that no accurate test has ever been devised to detect lies. But you may be writing about research which is trying to develop one or research that has tested the idea. ‘No difference between groups’ is fine if there is genuinely no difference in your measure between the groups (i.e. they both score exactly the same).
Some of the recommendations are essentially based on the premise that you ‘shouldn’t use the term except for how it was first defined, or defined where we think is the authoritative source’. This is just daft advice. Terms evolve over time. Definitions shift and change. The article recommends against using ‘Fetish’ except in its DSM-5 definition, despite the fact this is different to how it’s used commonly and how it’s widely used in other academic literature. ‘Splitting’ is widely used to mean ‘team splitting’, which the article says is ‘wrong’. It isn’t wrong – the term has just evolved.
I think philosophers would be surprised to hear ‘reductionism’ is a term to be avoided – given the massive literature on reductionism. Similarly, sociologists might be a bit baffled by ‘medical model’ being a banned phrase, given the debates over it and, unsurprisingly, its meaning.
Some of the advice is just plain wrong. Don’t use “Prevalence of trait X” says the article because apparently prevalence only applies to things that are either present or absent and “not dimensionally distributed in the population, such as personality traits and intelligence”. Many traits are defined by cut-off scores along dimensionally defined constructs, making them categorical. If you couldn’t talk about the prevalence in this way, we’d be unable to talk about prevalence of intellectual disability (widely defined as involving an IQ of less than 70) or dementia – which is diagnosed by a cut-off score on dimensionally varying neuropsychological test performance.
Some of the recommended terms to avoid are probably best avoided in most contexts (“hard-wired”, “love molecule”) and some are inherently self-contradictory (“Observable symptom”, “Hierarchical stepwise regression”) but again, use them if you want to discuss how they’re used.
I have to say, the piece reminds me of Stephen Pinker’s criticism of ‘language mavens’ who have come up with rules for their particular version of English which they decide others must follow.
To be honest, I think the Frontiers in Psychology article is well-worth reading. It’s a great guide to how some concepts are used in different ways, but it’s not good advice for things to avoid.
The best advice is probably: communicate clearly, bearing in mind that terms and concepts can have multiple meanings and your audience may not be aware of which you want to communicate, so make an effort to clarify where needed.
Link to Frontiers in Psychology article.
Online testing is sure to play a large part in the future of psychology. Using Mechanical Turk or other crowdsourcing sites for research, psychologists can quickly and easily gather data for any study where the responses can be provided online. One concern, however, is that online samples may be less motivated to pay attention to the tasks they are participating in. Not only is nobody watching how they do these online experiments, the whole experience is framed as a work-for-cash gig, so there is pressure to complete any activity as quickly and with as little effort as possible. To the extent that the online participants are satisficing or skimping on their attention, can we trust the data?
A newly submitted paper uses data from the Many Labs 3 project, which recruited over 3000 participants from both online and University campus samples, to test the idea that online samples are different from the traditional offline samples used by academic psychologists:
The findings strike a note of optimism, if you’re into online testing (perhaps less so if you use traditional university samples):
Mechanical Turk workers report paying more attention and exerting more effort than undergraduate students. Mechanical Turk workers were also more likely to pass an instructional manipulation check than undergraduate students. Based on these results, it appears that concerns over participant inattentiveness may be more applicable to samples recruited from traditional university participant pools than from Mechanical Turk
This fits with previous reports showing high consistency when classic effects are tested online, and with reports that satisficing may have been very high in offline samples all along – we just weren’t testing for it.
However, an issue I haven’t seen discussed is whether, because of the relatively small pool of participants taking experiments on MTurk, online participants have an opportunity to get familiar with typical instructional manipulation checks (AKA ‘catch questions’, which are designed to check if you are paying attention). If online participants adapt to our manipulation checks, then the very experiments which set out to test if they are paying more attention may not be reliable.
This paper provides a useful overview: Conducting perception research over the internet: a tutorial review
“Face It,” says psychologist Gary Marcus in The New York Times, “Your Brain is a Computer”. The op-ed argues for understanding the brain in terms of computation which opens up to the interesting question – what does it mean for a brain to compute?
Marcus makes a clear distinction between the claim that the brain is built along the same lines as modern computer hardware, which is clearly false, and the claim that its purpose is to calculate and compute. “The sooner we can figure out what kind of computer the brain is,” he says, “the better.”
In this line of thinking, the mind is considered to be the brain’s computations at work and should be able to be described in terms of formal mathematics.
The idea that the mind and brain can be described in terms of information processing is the main contention of cognitive science but this raises a key but little asked question – is the brain a computer or is computation just a convenient way of describing its function?
Here’s an example if the distinction isn’t clear. If you throw a stone you can describe its trajectory using calculus. Here we could ask a similar question: is the stone ‘computing’ the answer to a calculus equation that describes its flight, or is calculus just a convenient way of describing its trajectory?
In one sense the stone is ‘computing’. The physical properties of the stone and its interaction with gravity produce the same outcome as the equation. But in another sense, it isn’t, because we don’t really see the stone as inherently ‘computing’ anything.
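The analogy can be made concrete with a toy calculation (the numbers here are made up for illustration): a crude step-by-step simulation of the stone’s physics and the closed-form kinematic equation arrive at the same answer, yet we only describe one of them as ‘doing calculus’.

```python
# "The stone": step the physics forward in small time increments (Euler
# integration), no calculus in sight.
g = 9.81      # gravity, m/s^2
v0 = 20.0     # initial upward velocity, m/s (illustrative value)
dt = 0.0001   # time step, s

y, v, t = 0.0, v0, 0.0
while t < 1.0:
    y += v * dt
    v -= g * dt
    t += dt

# "The calculus": the closed-form solution y(t) = v0*t - g*t^2/2 at t = 1 s.
y_closed = v0 * 1.0 - 0.5 * g * 1.0 ** 2

# The two heights agree to within numerical error.
print(y, y_closed)
```

Both routes give the same height, which is precisely why it’s tempting, and slippery, to say the stone is ‘computing’ its trajectory.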
This may seem like a trivial example but there are in fact a whole series of analog computers that use the physical properties of one system to give the answer to an entirely different problem. If analog computers are ‘really’ computing, why not our stone?
If this is the case, what makes brains any more or less of a computer than flying rocks, chemical reactions, or the path of radio waves? Here the question just dissolves into dust. Brains may be computers but then so is everything, so asking the question doesn’t tell us anything specific about the nature of brains.
One counter-point to this is to say that brains need to algorithmically adjust to a changing environment to aid survival which is why neurons encode properties (such as patterns of light stimulation) in another form (such as neuronal firing) which perhaps makes them a computer in a way that flying stones aren’t.
But this definition would also include plants that also encode physical properties through chemical signalling to allow them to adapt to their environment.
It is worth noting that there are other philosophical objections to the idea that brains are computers, largely based on the hard problem of consciousness (in brief – could maths ever feel?).
And then there are arguments based on the boundaries of computation. If the brain is a computer based on its physical properties and the blood is part of that system, does the blood also compute? Does the body compute? Does the ecosystem?
Psychologists drawing on the tradition of ecological psychology and JJ Gibson suggest that much of what is thought of as ‘information processing’ is actually done through the evolutionary adaptation of the body to the environment.
So are brains computers? They can be if you want them to be. The concept of computation is a tool. Probably the most useful one we have, but if you say the brain is a computer and nothing else, you may be limiting the way you can understand it.
Link to ‘Face It, Your Brain Is a Computer’ in The NYT.
Understanding statistical power is essential if you want to avoid wasting your time in psychology. The power of an experiment is its sensitivity – the likelihood that, if the effect tested for is real, your experiment will be able to detect it.
Statistical power is determined by the type of statistical test you are doing, the number of people you test and the effect size. The effect size is, in turn, determined by the reliability of the thing you are measuring, and how much it is pushed around by whatever you are manipulating.
Since it is a common test, I’ve been doing a power analysis for a two-sample (two-sided) t-test, for small, medium and large effects (as conventionally defined). The results should worry you.
This graph shows you how many people you need in each group for your test to have 80% power (a standard desirable level of power – meaning that if your effect is real you’ve an 80% chance of detecting it).
Things to note:
- even for a large (0.8) effect you need close to 30 people (total n = 60) to have 80% power
- for a medium effect (0.5) this is more like 70 people (total n = 140)
- the required sample size increases dramatically as effect size drops
- for small effects, the sample required for 80% is around 400 in each group (total n = 800).
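The numbers behind these bullets can be reproduced with a short script. Here’s a minimal sketch (not the code used for the graph above; the helper names are mine) using the noncentral t distribution from scipy to compute power for a two-sided, two-sample t-test, then searching for the smallest per-group n that reaches 80%:

```python
# Power of a two-sided, two-sample t-test, via the noncentral t distribution.
import numpy as np
from scipy import stats

def power_two_sample(d, n, alpha=0.05):
    """Power for effect size d (Cohen's d) with n participants per group."""
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # Probability the test statistic falls in either rejection region:
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

def n_for_power(d, target=0.8):
    """Smallest per-group sample size reaching the target power."""
    n = 2
    while power_two_sample(d, n) < target:
        n += 1
    return n

for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    print(f"{label} effect (d = {d}): n = {n_for_power(d)} per group")
```

This recovers the figures above: roughly 26 per group for a large effect, mid-60s for a medium effect, and close to 400 per group for a small one.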
What this means is that if you don’t have a large effect, studies with a between-groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.
Implications for anyone planning an experiment:
- Is your effect very strong? If so, you may rely on a smaller sample (For illustrative purposes, the effect size of the male-female height difference is ~1.7, so large enough to detect with a small sample. But if your effect is this obvious, why do you need an experiment?)
- You really should prefer within-subject analysis, whenever possible (power analysis of this left as an exercise)
- You can get away with smaller samples if you make your measure more reliable, or if you make your manipulation more impactful. Both of these will increase your effect size, the first by narrowing the variance within each group, the second by increasing the distance between them
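Those two levers can be shown with a toy calculation (made-up numbers, for illustration only): the same experiment gets a bigger standardized effect either by narrowing the within-group spread or by widening the group difference.

```python
# Cohen's d: the group mean difference divided by the pooled standard
# deviation. Both levers below double the effect size from 0.2 to 0.4.
def cohens_d(mean_diff, pooled_sd):
    """Standardized effect size for a two-group comparison."""
    return mean_diff / pooled_sd

baseline = cohens_d(2.0, 10.0)        # d = 0.2: a small effect
more_reliable = cohens_d(2.0, 5.0)    # halve the noise -> d doubles
stronger_manip = cohens_d(4.0, 10.0)  # double the difference -> d doubles
print(baseline, more_reliable, stronger_manip)
```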