A reminder image so that I don't forget to record the lecture on Zoom. Again.

Today

  • Recap of sampling from distributions and confidence intervals

  • Forming a research question

  • Moving from research questions to hypotheses

  • Formally testing hypotheses with statistics

  • (Some of many) pitfalls of NHST

Recap

  • WEEK 2: Determining whether an individual’s score is unusual given a distribution with known characteristics.
  • If a population distribution of anxiety scores is normally distributed with a M = 30 and SD = 10, how common or unusual is an individual with a score of 41?
pnorm(41, mean = 30, sd = 10, lower.tail = FALSE)
[1] 0.1356661

density plot of a normal distribution centred at 30. Area from the score of 41 to the tail of the distribution is shaded

  • If a population distribution of anxiety scores is normally distributed with a M = 30 and SD = 10, what score would an individual need to get to be in the top 5%?
qnorm(p = 0.95, mean = 30, sd = 10)
[1] 46.44854

density plot of a normal distribution centred at 30. Area from the score of 46 to the tail of the distribution is shaded

Recap

  • WEEK 3: Estimating population distribution properties from the sample and quantifying uncertainty around our estimates.
  • Sample estimate - best guess of the population parameter based on our samples
  • Sampling distribution -
    • Distribution of (infinite) sample estimates (e.g. means) - normal regardless of sample or population shape.
    • Centred on the population value

Recap

  • Standard error is the standard deviation of sample means in a sampling distribution
  • Constructing confidence intervals around our sample estimate.

A histogram of a normal sampling distribution, with highlighted portion that falls under the 95% confidence interval

The "ladder" of confidence intervals. Means of different samples are plotted on the X axis along with confidence intervals, different samples are plotted on the y axis. Vertical line goes through the population value. 95% of confidence intervals cross this line.
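The recap above can be sketched in R with a hypothetical sample (the scores and seed below are invented for illustration):

```r
# Hypothetical sample of anxiety scores (illustrative, not lecture data)
set.seed(1)
scores <- rnorm(50, mean = 30, sd = 10)

m  <- mean(scores)                       # sample estimate of the population mean
se <- sd(scores) / sqrt(length(scores))  # standard error of the mean
ci <- m + c(-1.96, 1.96) * se            # approximate 95% confidence interval
ci
```

Across many repeated samples, roughly 95% of intervals built this way would contain the population mean - which is exactly what the "ladder" plot shows.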

Where are we?

Roadmap on the module. Box "Null hypothesis significance testing" is highlighted. Top row contains boxes "Introduction and distributions", "Standard error and confidence intervals" and "null hypothesis significance testing". Second box is labelled as "We're here!". Middle row is "t-test", "correlation" and "chi-square". Bottom row is "equation of a straight line", "linear model with one predictor", "linear model with multiple predictors"

The research process

The (quantitative) research process

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Confirmatory vs exploratory research

CONFIRMATORY RESEARCH

  • Hypothesis is determined before we run any statistical tests
  • Follows formula on the right.

  • “Null Hypothesis Significance Testing” (NHST)

  • Other approaches also exist

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Confirmatory vs exploratory research

EXPLORATORY RESEARCH

  • No hypothesis prior to running tests

  • Useful for generating hypotheses, but statistical tests can be difficult to interpret.

  • You should still have some analysis plan specifying how you’re going to explore the data

  • Should be followed up with replication to confirm the findings.

Fishing for results

If you torture your data for long enough, some “findings” will emerge eventually. These findings will not necessarily be reliable.

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Generate a research question

A good research question:

  • Researchable and realistic

  • Informed by the prior research (a gap in the literature or a need for replication)

  • Not too broad and not too narrow - proportional to the project at hand

  • What’s too broad for one project can be just right for another!

Diagram of hypothesis testing with highlighted step: "Generate a research question"

Generate a research question

How can we address the global mental health crisis? ❌ - Too broad

What are the key factors contributing to the mental health crisis among PhD students? (Woolston 2019) ✅ - Better!

Some other examples:

Can domesticated cats play fetch? (Forman, Renner, and Leavens 2023)

How do word frequency and semantic transparency influence synaesthetic colouring of compound words? (Mankin et al. 2016)

Can embedding statistical teaching within a fictional narrative help to reduce anxiety and increase comprehension? (Field and Terry 2018)

What are the effects of breathwork on stress and mental health? (Fincham et al. 2023)

What are the roles of socioeconomic status, ethnicity and teacher beliefs in academic grading? (Doyle, Easterbrook, and Harris 2023)

Can a digital intervention be helpful for individuals with subthreshold borderline personality disorder? (Drews-Windeck et al. 2022)

Define the hypotheses

A good research question allows us to generate testable predictions (hypotheses) that are relevant to our research aims.

Vocabulary

Hypothesis: A testable prediction - a statement about what we reasonably believe our data will show.

  • The prediction is based on some prior information (literature and prior research)

  • A hypothesis can be defined on different levels - conceptual, operational, statistical

Diagram of hypothesis testing with highlighted step: "Define the null and the alternative hypothesis"

Define the hypotheses

CONCEPTUAL HYPOTHESIS

  • Describes our prediction in conceptual terms. Can be defined in terms of the direction of the effect that we’re studying:
    • Non-directional: There will be differences in anxiety levels between participants in group A and group B

    • Directional: Participants in group A will show higher levels of anxiety than those in group B

Define the hypotheses

OPERATIONALISATION - the process of translating concepts into measures - i.e. how are we going to measure the concepts that we’re studying?

CONCEPTUAL HYPOTHESIS

  • “Participants in group A will show higher levels of anxiety than those in group B.”

OPERATIONAL HYPOTHESIS

  • “Participants in group A will score higher on the State-Trait Anxiety Inventory than those in group B”, OR:

  • “Participants in group A will show higher skin conductance response than those in group B”

STATISTICAL HYPOTHESIS

  • What do we expect to happen numerically? E.g.

  • \(M_{\text{GroupA}} \ne M_{\text{GroupB}}\) OR \(M_{\text{GroupA}} \gt M_{\text{GroupB}}\)

PollEverywhere: Operationalising variables

We’re interested in a relationship between wearing a hat and confidence.

How can we operationalise (measure) someone’s confidence?

Null vs alternative hypothesis

Once we know the operational hypothesis, we can define the null and the alternative hypotheses.

  • ALTERNATIVE HYPOTHESIS - our prediction, denoted as H1.

  • NULL HYPOTHESIS - denoted as H0. It represents the negation of the prediction that we’re making. Often describes a reality where the effect we’re interested in doesn’t exist.

Two realities

Reality 1: H1 is true

  • Effect we’re interested in exists. e.g. “There will be a difference in scores between group A and group B”

  • Sampling distribution is centred at the population value for this difference:

Reality 2: H0 is true

  • Effect we’re interested in doesn’t exist. e.g. “There will be no difference in scores between group A and group B.”

  • Sampling distribution is centred at 0:

Two realities

Null Hypothesis Significance Testing

  • If H0 is true, how likely/unlikely is the result (e.g. the group difference) we observed?

  • Can we reject H0 based on what we observed in our data?

  • “Statistically significant result” - result that is sufficiently unlikely under the null hypothesis

Reality 2: H0 is true

  • Effect we’re interested in doesn’t exist. e.g. “There will be no difference in scores between group A and group B.”

  • Sampling distribution is centred at 0:

Example: Heavy metal orcas 🤘

  • Increased orca attacks on boats crossing the Gibraltar Strait

  • Sailors started using heavy metal to deter them

  • Some (less than reputable) sources claim it works. Others advise against doing so!

An epic looking graphic of 5 orcas in a crowd at an Iron Maiden concert

Headline of an article saying "Is THIS how to stop killer whales ramming boats? Sailors say they're blasting heavy metal music out underwater to deter orcas"

Headline of an article saying "Orcas pummel boat after crew tries to deter them with heavy metal music"

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

  • Boat 1: Martina is blasting heavy metal music
  • Boat 2: Jennifer is blasting Shrek music (control group)

CONCEPTUAL HYPOTHESIS: Playing heavy metal music will be associated with reduced hostility in orcas.

Reduced hostility in orcas ➞ Shorter orca attack duration

OPERATIONAL HYPOTHESIS: Playing heavy metal music will be associated with shorter orca attack duration compared to Shrek music.

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Our hypotheses

  • H1: Orca attacks will be significantly shorter when playing heavy metal music compared to playing Shrek soundtracks.

  • H0: There will be no significant difference in attack duration between the two music styles.

Our hypotheses

STATISTICAL HYPOTHESES:

  • H1:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} \ne 0\\ M_{\text{(Shrek)}} > M_{\text{(Metal)}} \]

  • H0:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} = 0 \\ M_{\text{(Shrek)}} = M_{\text{(Metal)}} \]

Decide on the α (alpha) level

  • \(\alpha\) level - the rate of false-positive findings that we’re willing to accept if we’re living in a reality where the null hypothesis is true.

  • If we were to repeat our experiment over and over again, how often are we willing to incorrectly reject the null hypothesis?

  • The decision should be based on a cost-benefit analysis - how risky is it to be wrong?

  • Psychologists often use the \(\alpha\) rate of 5% as a blanket rule with no justification ¯\_(ツ)_/¯

Diagram of hypothesis testing with highlighted step: "Decide on the acceptable rate of false positive findings"

Calculate statistical power

It’s a way of deciding how large a sample you need to get meaningful results on a statistical test.

Let’s come back to this later!

Diagram of hypothesis testing with highlighted step: "Calculate statistical power"

Calculate the test statistic

  • Some numeric value that we use to test the hypothesis

  • There are different test statistics for different situations.

  • Some examples you might see reported in a paper: t, F, \(\chi^2\)

  • In our case, we can look at the mean difference between the two music styles i.e.:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \]

Diagram of hypothesis testing with highlighted step: "Calculate the test statistic"
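This test statistic is just a subtraction in R. The attack durations below are made-up values chosen to roughly match Example 1:

```r
# Hypothetical attack durations in minutes (illustrative values only)
shrek <- c(48, 55, 50, 53, 52, 49, 54, 51)
metal <- c(15, 18, 14, 17, 16, 15, 19, 14)

m_diff <- mean(shrek) - mean(metal)  # test statistic: the mean difference
m_diff
```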

Calculate the test statistic

Example 1:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 16.00 = 35.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 16

A difference of over 35 minutes is quite a lot!

Calculate the test statistic

Example 2:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 41 = 10.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 41

How about this difference? Large? Small? Could we realistically find this difference if the null hypothesis is true?

p-values

Compute the p-value

H0: There will be no significant difference in attack duration between the two music styles.

  • We want to know the probability of observing a test statistic at least as large as the one we observed if the null hypothesis is true - the p-value.

  • In our case, the “test statistic” is the mean difference between the two groups.

  • What kind of difference would we expect to find if the null hypothesis is true?

Diagram of hypothesis testing with highlighted step: "Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true."

Compute the p-value

  • If the null hypothesis is true, we would expect a difference of 0

  • But we’re taking a random sample - so 0 is the most probable value under the null hypothesis but other values are also possible.

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'

Compute the p-value

  • The difference in attack duration between Shrek music and Metal music detected in our sample was equal to 10.28.

  • Familiar territory: Given a normal sampling distribution centred at 0, how common is the value of 10.28?

  • The shaded proportion is 0.05 - top 5% of possible samples


A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

Compute the p-value

  • The probability of obtaining a difference of 10.28 under the null hypothesis is 0.043 \(\times\) 100 = 4.3%.
  • Therefore, given a distribution centred at 0, we would expect to find a difference at least as large as 10.28 in 4.3% of cases

Vocab

p-value: The probability of observing a test statistic at least as large as the one observed if the null hypothesis is true.

pnorm(10.28, mean = 0, sd = 6, lower.tail = FALSE)
[1] 0.04332562

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -50 to 50 and is labelled 'M(diff) in different samples'

Retain or reject the null hypothesis

H0: There will be no significant difference in attack duration between the two music styles.

  • Critical \(\alpha\): 5% (0.05 in probability terms)

  • To reject the null hypothesis, the probability of obtaining our test statistic (10.28) under the null hypothesis needs to be less than the critical \(\alpha\)

  • Our p-value was 0.043

  • 0.043 is smaller than 0.05 ➞ we reject the null hypothesis and conclude that this difference is statistically significant at \(\alpha\) of 0.05.

Diagram of hypothesis testing with highlighted step: "Reject the null hypothesis if the p-value is smaller than the acceptable rate of false-positive findings". There's an additional label that says "Remember this?". The label is next to the step "Decide on the acceptable rate of false-positive findings" which was covered earlier.
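The decision rule boils down to one comparison in R (reusing the standard error of 6 assumed in the slides' pnorm() call):

```r
alpha <- 0.05  # acceptable rate of false-positive findings
p <- pnorm(10.28, mean = 0, sd = 6, lower.tail = FALSE)  # p-value from the slides
p < alpha  # TRUE, so we reject the null hypothesis
```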

One more example…

Example 3:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 44 = 7.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 44

One more example…

  • Shaded area is top 5%

  • The probability of finding a difference of 7.28 in a distribution centred at 0 is 0.112 (or in 11.2% of samples)

  • Is the difference of 7.28 in the shaded area?

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

One more example…

  • The value of 0.11 is greater than 0.05, which is our acceptable threshold of false-positives.

  • The difference between the means we detected (MDiff = 7.28) is not considered statistically significant at \(\alpha\) of 0.05. We therefore cannot reject the null hypothesis.

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

PollEverywhere

Practising statistical significance

Let’s practice (1)

  • We set the alpha level to be 0.05 (or 5%)

  • After receiving an intervention, a group of patients reports average anxiety score of 25.

  • The control group reports a score of 31, Mdiff = 6

  • If the null hypothesis is true, probability of finding a difference of 6 is p = 0.02, or 2%.

  • Is this difference statistically significant?

Let’s practice (2)

  • We set the alpha level to be 0.05

  • We randomly sample and find that individuals who spend time with puppies report stress levels of 16, while individuals without puppies report stress levels of 19, Mdiff = 3

  • We work out that: p = 0.04

  • Is this difference statistically significant?

Let’s practice (3)

Reported in a journal:

Participants in the active control group (M = 645, SD = 112) showed slower reaction times in milliseconds compared to those in the intervention group (M = 598, SD = 132), Mdiff = 47, p = 0.073.

  • We set the alpha to be 0.05
  • Is this difference statistically significant?

What if we’re wrong?

  • Type I error:

    • False-positive

    • Concluding that an effect exists when it doesn’t

    • Incorrectly rejecting the null hypothesis

  • Type II error:

    • False-negative

    • Concluding that an effect doesn’t exist when it does

    • Failing to reject the null hypothesis

P-values and confidence intervals

  • Closely interlinked concepts

  • Extent of overlap of confidence intervals is indicative of statistical (non) significance

P-values and confidence intervals

  • Closely interlinked concepts

  • Extent of overlap of confidence intervals is indicative of statistical (non) significance

  • No overlap - statistically significant

  • Overlap of up to about half the length of one CI whisker - statistically significant, p < 0.05

  • Overlap of about half a whisker - p-value close to 0.05

P-values and confidence intervals

  • Overlap of more than half a CI whisker - non-significant

  • A useful rule of thumb, but not a universal one - it depends on the design.

Interpreting p-values

Interpreting p-values

Tip

A p-value is the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true.

  • If the p-value is smaller than some pre-defined \(\alpha\) cut-off (often 0.05), we reject the null hypothesis

  • Like confidence intervals, p-values should be interpreted in the context of wider research

    • If the effect (difference in means, correlation, etc.) is really 0, we can still find a whole range of effects in different samples because of random sampling.

    • With p-values, we’re answering the question: “If the sampling distribution of this effect is centred at 0, how probable is the effect that we found in our sample?”
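This logic can be checked by simulation. If we draw many mean differences from a null sampling distribution (assumed here to be normal with a standard error of 6, as in the slides), about 5% of them fall past the one-tailed 0.05 cut-off:

```r
set.seed(42)
n_sims <- 10000
# Mean differences drawn from a sampling distribution centred at 0 (H0 true)
diffs <- rnorm(n_sims, mean = 0, sd = 6)
p_vals <- pnorm(diffs, mean = 0, sd = 6, lower.tail = FALSE)
mean(p_vals < 0.05)  # close to 0.05: false positives occur at the alpha rate
```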

PollEverywhere:

What even are p-values?

🚨 Some INCORRECT definitions of p-values 🚨

Warning

The p-value is NOT the probability of a chance result.

Warning

The p-value is NOT the probability of the alternative hypothesis being true.

Warning

The p-value is NOT the probability of the null hypothesis being true.

The inverse probability fallacy

Warning

The p-value is NOT the probability of the null hypothesis being true.

  • p-value doesn’t tell us anything about the probability of null or alternative hypothesis.

  • It tells us how likely the detected effect is IF the null is true.

  • We don’t know whether the null is true or not

  • Other statistical approaches - like Bayesian hypothesis testing - can tell us more about the probability of the null/alternative hypothesis, but we don’t cover them on this module.

Some pitfalls of NHST

  • p-values are sensitive to sample size - any tiny difference can be statistically significant with large enough sample

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals overlap substantially.

\[ n = 50 \\ M_{Diff} = \text{2 minutes} \\ p = 0.17 \]

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals are narrow and don't overlap at all.

\[ n = 10000 \\ M_{Diff} = \text{2 minutes} \\ p = 0.0000006 \]
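A minimal sketch of why this happens, assuming a population SD of 20 minutes (an invented value): the standard error of the mean difference shrinks as n grows, so the same 2-minute difference becomes ever less probable under the null:

```r
# p-value for a fixed 2-minute difference at different sample sizes
# (the population SD of 20 is an assumption for illustration)
p_for_n <- function(n, m_diff = 2, pop_sd = 20) {
  se <- pop_sd * sqrt(2 / n)  # SE of the difference between two group means
  pnorm(m_diff, mean = 0, sd = se, lower.tail = FALSE)
}
p_for_n(50)     # around 0.31 - not significant
p_for_n(10000)  # vanishingly small - highly significant
```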

Some pitfalls of NHST

  • All-or-nothing thinking - is there a practical difference between 0.049 and 0.051?

🥳 “The mean difference between the groups was statistically significant,
MDiff = 10.4, p = 0.049.”

🥴 “The mean difference between the groups was not statistically significant,
MDiff = 10.2, p = 0.051.”

🤯 “The mean difference between the groups was *&%@@!%)£)@)!__!&^%_()!(!,
MDiff = 10.3, p = 0.05.”

  • A non-significant result is still a result

Getting the most out of p-values

  • EFFECT SIZES - how large/meaningful is the effect (e.g. mean difference or other estimate) that we found

    • E.g. a group difference of 2 on a scale of 1-10 vs the same difference on a scale of 1-100
  • CONFIDENCE INTERVALS - what are the plausible limits of our effect?

  • POWER ANALYSIS - determining the necessary sample size before we begin data collection

Power analysis

POWER ANALYSIS - determining the necessary sample size before we begin data collection to make sure we don’t under- or over-sample

STATISTICAL POWER - the probability of detecting an effect of a certain size as statistically significant, assuming the effect exists.

  • We assume a certain effect size - say that we’re only interested in an effect if it’s at least a difference of 10 minutes between Metal and Shrek music

  • We calculate the sample size necessary to detect this effect.

  • More participants are needed for (1) smaller effects and (2) complicated designs

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section.
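In R, the base function power.t.test() performs this calculation for a two-group comparison; the within-group SD of 15 minutes below is an assumed value for illustration:

```r
# How many orcas per group would we need to detect a 10-minute difference
# with 80% power at alpha = 0.05? (SD of 15 is an assumption)
power.t.test(delta = 10, sd = 15, sig.level = 0.05, power = 0.80)
```

The returned n is the required sample size per group; halving delta (a smaller effect of interest) roughly quadruples it.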

Power analysis

  • Power analysis makes non-significant effects easier to interpret

  • Reduces the probability of missing an important finding (a false-negative finding)

  • Reduces over-sampling and wasting resources

  • Generally, a statistical power of 80% is considered the standard to aim for

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section

In a nutshell…

  • NHST is the most commonly used way of testing hypotheses, but other methods exist

  • We start by defining our research questions and hypotheses.

  • We decide whether or not to reject the null hypothesis based on the p-value associated with our test statistic

  • p-values can be a useful tool when combined with effect sizes, confidence intervals, and power analysis

Complete diagram of hypothesis testing same as on one of the earlier slides

In a nutshell…

A p-value is:

  • The probability of observing a test statistic at least as large as the one we observed in our sample if the null hypothesis is true.

A p-value is NOT:

  • The probability that the null hypothesis is true

  • The probability that the alternative hypothesis is true

  • The probability of a chance result

Complete diagram of hypothesis testing same as on one of the earlier slides

Next week

  • Statistical foundations are over 🥳
  • Lectures, skills labs and tutorials start overlapping
  • Applying the NHST principles in practice - testing group differences using the t-test

References

Doyle, Lewis, Matthew J. Easterbrook, and Peter R. Harris. 2023. “Roles of Socioeconomic Status, Ethnicity and Teacher Beliefs in Academic Grading.” British Journal of Educational Psychology 93 (1): 91–112. https://doi.org/10.1111/bjep.12541.
Drews-Windeck, Elea, Lindsay Evans, Kathryn Greenwood, and Kate Cavanagh. 2022. “The Implementation of a Digital Group Intervention for Individuals with Subthreshold Borderline Personality Disorder.” Procedia Computer Science 206: 23–33. https://doi.org/10.1016/j.procs.2022.09.082.
Field, Andy P., and Jenny Terry. 2018. “A Pilot Study of Whether Fictional Narratives Are Useful in Teaching Statistical Concepts.” RSS Conference. https://discoveringstatistics.com/docs/rss_poster_2018.pdf.
Fincham, Guy William, Clara Strauss, Jesus Montero-Marin, and Kate Cavanagh. 2023. “Effect of Breathwork on Stress and Mental Health: A Meta-Analysis of Randomised-Controlled Trials.” Scientific Reports 13 (1): 432. https://doi.org/10.1038/s41598-022-27247-y.
Forman, Jemma, Elizabeth Renner, and David A. Leavens. 2023. “Fetching Felines: A Survey of Cat Owners on the Diversity of Cat (Felis Catus) Fetching Behaviour.” Scientific Reports 13 (1): 20456. https://doi.org/10.1038/s41598-023-47409-w.
Haller, Heiko, and Stefan Krauss. 2002. “Misinterpretations of Significance: A Problem Students Share with Their Teachers.” Methods of Psychological Research 7 (1): 1–20.
Mankin, Jennifer L., Christopher Thompson, Holly P. Branigan, and Julia Simner. 2016. “Processing Compound Words: Evidence from Synaesthesia.” Cognition 150: 1–9. https://doi.org/10.1016/j.cognition.2016.01.007.
Woolston, Chris. 2019. “PhDs: The Tortuous Truth.” Nature. https://www.nature.com/articles/d41586-019-03459-7.