Week 11
1879 - world’s first psychology lab established by Wilhelm Wundt
Defining point for Psychology as a separate discipline
Prior to that - part of philosophy
Use of experimental methods for the first time - smooth(ish) sciencing followed
Open Science Collaboration (2015): Estimating the reproducibility of psychological science
Attempted to replicate findings published in high-profile journals
Replicated effect sizes were, on average, half the size of those originally reported
97% of original p-values were statistically significant (p < 0.05)
Only 36% of replicated p-values were statistically significant
Vocabulary
Replication: The process of repeating (re-running) the same study using identical methodology.
Some famous findings that we can’t replicate:
Smiling will make you feel happier
Power posing will make you act bolder
Self-control is a limited resource
Revising after your exams can improve your earlier performance (Daryl Bem’s “pre-cognition” experiments)
Babies are born with the power to imitate
Read more here: https://www.bps.org.uk/research-digest/ten-famous-psychology-findings-have-been-difficult-replicate
Two teams of researchers set out to study the effect of mindfulness meditation on well-being. Both teams apply the same intervention using an identical protocol, and each team collects data from 200 participants. The teams analyse their own data using a t-test, comparing the well-being of the intervention group against the control group.
Team 1
Team 1 finds a statistically significant difference in well-being between the two groups (p = .034)
Team 2
Team 2 finds a non-significant difference in well-being between the two groups (p = .27)
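This split is exactly what sampling variability predicts. Below is a minimal simulation sketch in Python (the true effect of d = 0.2 and the split of 100 participants per group are illustrative assumptions, not figures from the scenario) showing how often two identical studies land on opposite sides of p = .05:

```python
import numpy as np
from scipy import stats

# Illustrative assumptions (not from the scenario): a modest true effect of d = 0.2
# and n = 100 per group, i.e. 200 participants per team.
rng = np.random.default_rng(11)
d, n, n_sims = 0.2, 100, 10_000

disagreements = 0
for _ in range(n_sims):
    significant = []
    for team in (1, 2):                           # two teams run the identical study
        control = rng.normal(0.0, 1.0, n)         # well-being scores, control group
        intervention = rng.normal(d, 1.0, n)      # same true effect for both teams
        p = stats.ttest_ind(intervention, control).pvalue
        significant.append(p < .05)
    disagreements += significant[0] != significant[1]

print(f"Teams disagree about significance in {disagreements / n_sims:.0%} of simulated pairs")
```

Under these assumed numbers, the two teams end up on opposite sides of the significance threshold in roughly 40% of simulated pairs, so neither team has done anything wrong.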
Both teams write up their reports as scientific papers. Which paper should get published?
Some results are more interesting than others. But are they more important?
The publication record is over-populated with statistically significant findings
Papers reporting significant p-values are 9 times more likely to be published than papers reporting non-significant findings.
Problem for evidence synthesis - publishing only (or mostly) significant findings might make it seem like an effect exists when in reality it doesn’t
Vocabulary
Publication bias or the “file drawer effect”: the bias in the publication system whereby statistically significant results are favoured for publication over non-significant findings. Papers reporting non-significant findings often end up in researchers’ file drawers, never to be seen again.
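To see why this matters for evidence synthesis, here is a minimal simulation sketch (the true effect of d = 0.3 and n = 30 per group are illustrative assumptions): when only significant studies escape the file drawer, the published literature roughly doubles the apparent effect size, which echoes the Open Science Collaboration’s finding that replication effects were about half the size of the originals.

```python
import numpy as np
from scipy import stats

# Illustrative assumptions (not from the slides): true effect d = 0.3, n = 30 per group.
# Many labs run the same study, but only significant results make it out of the file drawer.
rng = np.random.default_rng(42)
d, n, n_studies = 0.3, 30, 20_000

all_effects, published_effects = [], []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    p = stats.ttest_ind(treatment, control).pvalue
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_hat = (treatment.mean() - control.mean()) / pooled_sd   # sample Cohen's d
    all_effects.append(d_hat)
    if p < .05:                  # the file drawer: only "significant" studies get published
        published_effects.append(d_hat)

print(f"True effect:                     d = {d:.2f}")
print(f"Mean d across ALL studies:       {np.mean(all_effects):.2f}")
print(f"Mean d across PUBLISHED studies: {np.mean(published_effects):.2f}")
```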
Conduct research (requires funding)
Train research students (requires funding)
Publish papers (requires funding)
Teach (it’s free! 🤩)
Secure funding (apply for grants)
Funding bodies look at researchers’ publication records (among other things) to decide who gets awarded a grant.
failure to publish |> failure to secure funding |> failure to conduct further research |> job insecurity
Tight deadlines + high-pressure, competitive environment + “publish or perish” + lack of training + personal investment -> QRPs
Vocabulary
Questionable research practices: a range of practices that (intentionally or unintentionally) distort results, often motivated by the desire to find support for hypotheses and make research more publishable.
From the UK Research Integrity Office:
Hanlon’s Razor: “Never attribute to malice that which is adequately explained by ignorance or incompetence.”
Each analysis has many decision points - which tests to run, which participants to exclude, which steps to take during data cleaning, etc.
Each decision results in a unique analysis “path”
Different analysts might not necessarily take the same path and arrive at the same conclusion (Silberzahn et al., 2018)
Some “paths” can seem more sensible depending on your motivations
Selective inclusion/removal of cases
Subsetting/combining groups
Variable dichotomisation
Data transformation
Collecting more data (“data peeking”)
NOTE: None of these are “questionable” in their own right. Motivation matters!
Vocabulary
p-hacking: taking specific analytic steps in order to achieve statistical significance, rather than taking (pre-planned) steps that are more appropriate for answering the research question.
Masicampo, E. J., & Lalande, D. R. (2012). A peculiar prevalence of p values just below .05.
NOT the same as exploratory research
NOT the same as discussing explanations for your (surprising) results
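A minimal simulation sketch of one p-hacking “path” (all numbers are illustrative assumptions): there is no true effect on any outcome, but each study measures five outcome variables and reports whichever comparison happens to reach significance. The nominal 5% false-positive rate quietly inflates to roughly 1 - 0.95^5 ≈ 23%.

```python
import numpy as np
from scipy import stats

# Illustrative assumptions (not from any cited study): NO true effect on any outcome,
# n = 50 per group, 5 outcome variables per study, and a "significant" result on
# any one outcome is reported as the finding.
rng = np.random.default_rng(7)
n, n_outcomes, n_sims = 50, 5, 10_000

false_positive_studies = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, size=(n, n_outcomes))        # both groups drawn from
    treatment = rng.normal(0.0, 1.0, size=(n, n_outcomes))      # the same population
    pvals = stats.ttest_ind(treatment, control, axis=0).pvalue  # one t-test per outcome
    false_positive_studies += (pvals < .05).any()

print(f"Studies with at least one 'significant' outcome: {false_positive_studies / n_sims:.0%}")
# With 5 independent outcomes this is about 1 - 0.95**5 ≈ 23%, not the nominal 5%
```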
Vocabulary
HARKing: Hypothesizing After The Results are Known. Often involves collecting data without a clear hypothesis, deciding on a hypothesis based on what’s significant, and then presenting that hypothesis as if it had been decided on before running any analyses.
Often goes hand-in-hand with HARKing
Collecting a lot of variables and only reporting statistically significant relationships (without making it clear that you’ve also collected other data)
Picking and choosing which papers to cite in a way that fits your narrative
Citing papers as supporting a specific point when they don’t
Item | Self-admission rate (%) |
---|---|
Failing to report all of a study’s dependent measures | 63.4 |
Deciding whether to collect more data after looking to see whether the results were significant | 55.9 |
Failing to report all of a study’s conditions | 27.7 |
Stopping collecting data earlier than planned because one found the result that one had been looking for | 16.6 |
“Rounding off” a p value (e.g., reporting that a p value of .054 is less than .05) | 22.2 |
Selectively reporting studies that “worked” | 45.8 |
Deciding whether to exclude data after looking at the impact of doing so on the results | 38.2 |
Reporting an unexpected finding as having been predicted from the start | 27.0 |
Claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they do) | 3.0 |
Falsifying data | 0.6 |
YOU are the future
researchers (academic or industry)
practitioners
educators
Understanding how research works (and how the environment can affect the research process) allows you to:
How can we make science better?
Simine Vazire (University of Melbourne)
Openness (transparency) prevents researchers from being able to hide their QRPs
The Open Science movement has inspired many innovations in transparent research, only a few of which we’ll explore today:
Preregistration involves publicly sharing a time-stamped research plan (e.g., on the OSF) that includes:
Precise hypotheses to prevent HARKing
Information about all your variables and how they’re operationalised to prevent selective reporting
A detailed data analysis plan to prevent p-hacking
Other benefits include making sure you are collecting data you can actually analyse and front-loading a lot of the work
Registered Reports are like preregistration in that you’re specifying what you’re going to do in advance, but they have additional benefits.
Like preregistration, they aim to reduce QRPs such as HARKing, p-hacking, and selective reporting
They also improve research quality more generally as the methods are reviewed by peers before data collection commences
They are also more likely to reduce publication bias as journals agree to publish the study based on the quality of the methods, regardless of the results
Protzko et al. (2023) claimed that OS efforts are improving replicability
The methods are complex, but essentially they compared the replication rate of studies that had used OS practices with that of studies that hadn’t
Concluded that the 86% replication rate in the OS sample was due to “rigor-enhancing practices” such as preregistration, methodological transparency, confirmatory tests, and large sample sizes.
Bak-Coleman & Devezer (2023) wrote a response, arguing (amongst other things):
The definition of ‘replication’ was less conservative in the OS than the non-OS sample
The OS group of studies were chosen because they were more likely to replicate
The hypotheses and analyses in the study deviated from what had been preregistered
The story unfolds further in this blogpost.
The original peer reviews pointed out the flaws in this study
One of those reviewers spoke out on Twitter, saying the original authors’ response was that they were aware of the flaws, but thought the ends justified the means
In other words, it was okay to fudge the figures a bit to get more people on the OS bandwagon
This (meta)bias is the kind of QRP that OS set out to eradicate
Read the full thread here on Bluesky
Szöllősi, Navarro, van Rooij, and colleagues (2021) wrote a paper called “Is Preregistration Worthwhile?”, which argued:
Preregistration isn’t sufficient for good science
You can still preregister bad science from weak theories
Strong theories generate very specific hypotheses that are less susceptible to HARKing
Psychology needs stronger theories, not preregistration
It caused quite the Twitter storm!
Rubin (2023) argued that many OS proponents engage in questionable metascience practices, for example:
A lack of evidence of its efficacy
Metabias towards blaming researcher bias (QRPs) for the replication crisis, when there are other explanations (e.g., weak theory, poor measurement practice)
Rejecting or ignoring criticisms of metascience and/or science reform
Quick, superficial, dismissive, and/or mocking style of criticising others, predominantly from those in positions of power and privilege (bropen science)
Whitaker & Guest (2020): “#bropenscience is a tongue-in-cheek expression but also has a serious side, shedding light on the narrow demographics and off-putting behavioural patterns seen in open science.”
“Not all bros are men. And that’s true, but they are more likely to be from one or more of the following dominant social groups: male, white, cisgender, heterosexual, able-bodied, neurotypical, high socioeconomic status, English-speaking. That’s because structural privileges exist that benefit certain groups of people.”
The Society for the Improvement of Psychological Science’s Code of Conduct warns against behaviour typical of #bropenscience and is (visibly) more committed to diversity and inclusion
The Center for Open Science’s Symposium: Critical Perspectives on the Metascience Reform Movement is an example of embracing criticism
Non-English speaking OS groups are being set up, such as the Chinese Open Science Network
The Replication Crisis in Psychology triggered a chain of events, intended to improve Psychological Science
Initially, the discipline focused on identifying problems and a list of Questionable Research Practices emerged
In response, Open Science initiatives were developed that attempted to prevent QRPs
However, OS has been subject to Questionable Metascience Practices and must be held to the same standards as the research it is trying to improve
OS is showing signs of becoming more humble, more reflective, more inclusive, and will have no choice but to provide robust evidence of its effectiveness
FORRT (including the Open Science Glossary)
Mark Rubin’s Critical Metascience Reading List
Twitter/Bluesky (follow the folks cited in this lecture)
More references: