Null Hypothesis Significance Testing

Martina Sladekova


Today

  • Recap of sampling from distributions and confidence intervals

  • Forming a research question

  • Moving from research questions to hypotheses

  • Formally testing hypotheses with statistics

  • (Some of many) pitfalls of NHST

Recap

  • WEEK 2: Determining whether an individual’s score is unusual given a distribution with known characteristics.
  • If a population distribution of anxiety scores is normally distributed with a \(\mu\) = 30 and \(\sigma\) = 10, how common or unusual is an individual with a score of 41?
pnorm(41, mean = 30, sd = 10, lower.tail = FALSE)
[1] 0.1356661

density plot of a normal distribution centred at 30. Area from the score of 41 to the tail of the distribution is shaded

  • If a population distribution of anxiety scores is normally distributed with a \(\mu\) = 30 and \(\sigma\) = 10, what score would an individual need to get to be in the top 5%?
qnorm(p = 0.95, mean = 30, sd = 10)
[1] 46.44854

density plot of a normal distribution centred at 30. Area from the score of 46 to the tail of the distribution is shaded

Recap

  • WEEK 3: Estimating population distribution properties from the sample and quantifying uncertainty around our estimates.
  • Sample estimate - a numeric value calculated from the sample, which we want to use to generalise to the population.
  • Standard error - the standard deviation of sample means in the sampling distribution

  • Constructing confidence intervals around our sample estimate.

A histogram of a normal sampling distribution, with highlighted portion that falls under the 95% confidence interval

The "ladder" of confidence intervals. Means of different samples are plotted on the X axis along with confidence intervals, different samples are plotted on the y axis. Vertical line goes through the population value. 95% of confidence intervals cross this line.

What’s the point? Dealing with uncertainty in science

  • We don’t know how close our estimate is to the population value

  • We also don’t know if our confidence interval even contains a population value.

  • Should we even bother?

TO THE RESCUE:

  1. Sample size
  2. Team effort

What’s the point? - sample size

Sample size matters - the larger our sample size, the closer our sample resembles the distribution.

histogram of a normal population distribution centred at the value of 15.3

histogram of a sample of 10 drawn from the normal population distribution, centred at the value of 16.6. The distribution is quite messy

histogram of a sample of 100 drawn from the normal population distribution, centred at the value of 16.4. The distribution is looking a little more like normal

histogram of a sample of 1000 drawn from the normal population distribution, centred at the value of 15.2. The distribution is extremely close to normal

What’s the point? - sample size

Sample size matters - the larger our sample size, the closer our sample resembles the distribution.

histogram of a normal population distribution centred at the value of 15.3

histogram of a sample of 10 drawn from the normal population distribution, centred at the value of 21. The distribution is quite messy

histogram of a sample of 100 drawn from the normal population distribution, centred at the value of 16.7. The distribution is looking a little more like normal

histogram of a sample of 1000 drawn from the normal population distribution, centred at the value of 15.4. The distribution is extremely close to normal
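The sketch below reproduces this idea in R, assuming a population mean of 15.3 and an arbitrary standard deviation of 5 (the SD is not given on the slides):

# draw samples of increasing size from the same hypothetical population
set.seed(42)
sample_10   <- rnorm(10,   mean = 15.3, sd = 5)
sample_100  <- rnorm(100,  mean = 15.3, sd = 5)
sample_1000 <- rnorm(1000, mean = 15.3, sd = 5)

# larger samples give means closer to the population mean...
mean(sample_10); mean(sample_100); mean(sample_1000)

# ...and histograms that look more and more like the population distribution
hist(sample_1000, main = "n = 1000")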

What’s the point? - team effort

  • Scientific work is cumulative - we build on each other’s ideas, rarely starting from scratch.

  • Small samples are not useless, we just need more of them.

    • Multi-study papers

    • Multi-lab studies (e.g. SMARVUS data)

  • Replication is crucial (especially for novel findings)

Vocab

Replication - attempt to reproduce a finding from another study in a new sample using identical methodology

Where are we?

Roadmap on the module. Box "Null hypothesis significance testing" is highlighted. Top row contains boxes "Introduction and distributions", "Standard error and confidence intervals" and "null hypothesis significance testing". Second box is labelled as "We're here!". Middle row is "t-test", "correlation" and "chi-square". Bottom row is "equation of a straight line", "linear model with one predictor", "linear model with multiple predictors"

The research process

The (quantitative) research process

CONFIRMATORY RESEARCH - focuses on testing specific hypotheses and, in general, follows the formula on the right.

  • “Null Hypothesis Significance Testing” (NHST)

  • Other approaches also exist

EXPLORATORY RESEARCH - doesn’t start with a hypothesis.

  • Can collect new data or work with existing data.

  • Useful for generating hypotheses, but statistical tests can be difficult to interpret.

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Generate a research question

A good research question:

  • Researchable and realistic

  • Informed by the prior research (a gap in the literature or a need for replication)

  • Not too broad and not too narrow - proportional to the project at hand

  • What’s too broad for one project can be just right for another!

Diagram of hypothesis testing with highlighted step: "Generate a research question"

Generate a research question

How can we address the global mental health crisis? ❌ - Too broad

What are the key factors contributing to the mental health crisis among PhD students? (Woolston 2019) ✅ - Better!

Some other examples:

Can domesticated cats play fetch? (Forman, Renner, and Leavens 2023)

How do word frequency and semantic transparency influence synaesthetic colouring of compound words? (Mankin et al. 2016)

Can embedding statistical teaching within a fictional narrative help to reduce anxiety and increase comprehension? (Field and Terry 2018)

What are the effects of breathwork on stress and mental health? (Fincham et al. 2023)

What are the roles of socioeconomic status, ethnicity and teacher beliefs in academic grading? (Doyle, Easterbrook, and Harris 2023)

Can a digital intervention be helpful for individuals with subthreshold borderline personality disorder? (Drews-Windeck et al. 2022)

Define the hypotheses

A good research question allows us to generate testable predictions (hypotheses) that are relevant to our research aims.

Vocab

Hypothesis: A testable prediction - a statement about what we reasonably believe our data will show.

  • The prediction is based on some prior information (literature and prior research)

  • A hypothesis can be defined on different levels - conceptual, operational, statistical

Diagram of hypothesis testing with highlighted step: "Define the null and the alternative hypothesis"

Define the hypotheses

CONCEPTUAL HYPOTHESIS

  • Describes our prediction in conceptual terms. Can be defined in terms of the direction of the effect that we’re studying:
    • “There will be differences in anxiety levels between participants in group A and group B” - non-directional hypothesis.

    • “Participants in group A will show higher levels of anxiety than those in group B” - directional hypothesis.

Define the hypotheses

OPERATIONALISATION - the process of translating concepts into measures - i.e. how are we going to measure the concepts that we’re studying?

CONCEPTUAL HYPOTHESIS

  • “Participants in group A will show higher levels of anxiety than those in group B.”

OPERATIONAL HYPOTHESIS

  • “Participants in group A will score higher on the State-Trait Anxiety Inventory than those in group B”, OR:

  • “Participants in group A will show higher skin conductance response than those in group B”

STATISTICAL HYPOTHESIS

  • What do we expect to happen numerically? E.g.

  • \(M_{\text{GroupA}} \ne M_{\text{GroupB}}\) OR \(M_{\text{GroupA}} \gt M_{\text{GroupB}}\)

Example: Heavy metal orcas 🤘

  • Reports of increased orca attacks on boats crossing the Gibraltar Strait

  • Sailors started using heavy metal to deter them

  • Mixed anecdotal results - some (less than reputable) sources claim it works. Others advise against doing so!

Headline of an article saying "Is THIS how to stop killer whales ramming boats? Sailors say they're blasting heavy metal music out underwarter to deter orcas"

Headline of an article saying "Orcas pummel boat after crew tries to deter them with heavy metal music"

An epic looking graphic of 3 orcas jumping out of the stage at a heavy metal concert

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

  • Boat 1: Martina is blasting heavy metal music
  • Boat 2: Jennifer is blasting Shrek music (control group)

CONCEPTUAL HYPOTHESIS: Playing heavy metal music will be associated with reduced hostility in orcas.

Reduced hostility in orcas ➞ Shorter orca attack duration

OPERATIONAL HYPOTHESIS: Playing heavy metal music will be associated with shorter orca attack duration compared to Shrek music.

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Null vs Alternative hypotheses

Once we know the operational hypothesis, we can define the null and the alternative hypotheses.

  • ALTERNATIVE HYPOTHESIS - our prediction, denoted as H1.

  • NULL HYPOTHESIS - denoted as H0. It represents the negation of the prediction that we’re making. Often describes a reality where the effect we’re interested in doesn’t exist.

H1: Orca attacks will be significantly shorter when playing heavy metal music compared to playing Shrek soundtracks.

H0: There will be no significant difference in attack duration between the two music styles.

Tip

H0 and H1 represent two alternative realities (like parallel worlds). When using Null Hypothesis Significance Testing, we’re trying to decide whether we can reject the null hypothesis.

Null vs Alternative hypotheses

STATISTICAL HYPOTHESES:

  • H1:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} \ne 0 \]

\[ M_{\text{(Shrek)}} > M_{\text{(Metal)}} \]

  • H0:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} = 0 \]

Decide on the α (alpha) level

  • \(\alpha\) level - the rate of false-positive findings that we’re willing to accept if we’re living in a reality where the null hypothesis is true.

  • If we were to repeat our experiment over and over again, how often are we willing to incorrectly reject the null hypothesis?

  • The decision should be based on a cost benefit analysis - how risky is it to be wrong?

  • Psychologists often use the \(\alpha\) rate of 5% as a blanket rule with no justification ¯\_(ツ)_/¯

Diagram of hypothesis testing with highlighted step: "Decide on the acceptable rate of false positive findings"

Calculate statistical power

It’s a way of deciding how large a sample you need to get meaningful results on a statistical test.

Let’s come back to this later!

Diagram of hypothesis testing with highlighted step: "Calculate statistical power"

Calculate the test statistic

  • Some numeric value that we use to test the hypothesis

  • There are different test statistics for different situations.

  • Some examples you might see reported in a paper: t, F, \(\chi^2\)

  • In our case, we can look at the mean difference between the two music styles i.e.:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \]

Diagram of hypothesis testing with highlighted step: "Calculate the test statistic"

Calculate the test statistic

Example 1:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 16.00 = 35.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value of 51 minutes; the histogram for Metal centres around the value of 16

Over 35 minutes of difference is quite a lot!
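If we had the raw attack durations in R, the test statistic is just the difference between the two group means. A minimal sketch with made-up vectors (shrek_attacks and metal_attacks are hypothetical, chosen to reproduce the means above):

# attack durations in minutes recorded from each boat (hypothetical data)
shrek_attacks <- c(48, 55, 50, 52, 51.4)   # mean = 51.28
metal_attacks <- c(14, 18, 16, 15, 17)     # mean = 16.00

# the test statistic: difference between the two group means
m_diff <- mean(shrek_attacks) - mean(metal_attacks)
m_diff
[1] 35.28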

Calculate the test statistic

Example 2:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 41 = 10.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value of 51 minutes; the histogram for Metal centres around the value of 41

How about this difference? Large? Small? We need a more formal way of deciding whether we can reject the null hypothesis.

p-values

Compute the p-value

H0: There will be no significant difference in attack duration between the two music styles.

  • We want to know the probability of observing a test statistic at least as large as the one we observed if the null hypothesis is true - the p-value.

  • In our case, the “test statistic” is the mean difference between the two groups.

  • What kind of difference would we expect to find if the null hypothesis is true?

Diagram of hypothesis testing with highlighted step: "Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true."

Compute the p-value

  • If the null hypothesis is true, we would expect a difference of 0

  • But we’re taking a random sample - so although 0 is the most probable value under the null hypothesis, other values are also possible.

A density distribution plot, with a shape of a bell curve centred on 0. the x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'

Compute the p-value

  • The difference in attack duration between Shrek music and Metal music detected in our sample was equal to 10.28.

  • Familiar territory: Given a normal sampling distribution centred at 0, how common is a value of 10.28 or larger?

pnorm(m_diff, 0, 6, lower.tail = FALSE)
[1] 0.04332562

A density distribution plot, with a shape of a bell curve centred on 0. the x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

Compute the p-value

  • The shaded proportion is 0.05 - top 5% of possible samples
  • The probability of obtaining a difference of at least 10.28 under the null hypothesis is 0.043, i.e. 0.043 \(\times\) 100 = 4.3%.
  • Therefore, given a distribution centred at 0, we would expect to find a difference of 10.28 or larger in 4.3% of cases

Vocab

p-value: The probability of observing a test statistic at least as large as the one observed if the null hypothesis is true.

pnorm(m_diff, 0, 6, lower.tail = FALSE)
[1] 0.04332562

A density distribution plot, with a shape of a bell curve centred on 0. the x axis ranges from -50 to 50 and is labelled 'M(diff) in different samples'
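To make this logic concrete, here is a small simulation sketch. It assumes (as in the pnorm() call above) that the sampling distribution of the mean difference under the null hypothesis is normal with a standard error of 6, and checks what proportion of simulated differences is at least as large as the one we found:

# simulate 100,000 mean differences we could observe if the null hypothesis were true
set.seed(1)
null_diffs <- rnorm(100000, mean = 0, sd = 6)

# proportion of simulated differences at least as large as ours (10.28)
mean(null_diffs >= 10.28)   # lands close to pnorm(10.28, 0, 6, lower.tail = FALSE)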

Retain or reject the null hypothesis

H0: There will be no significant difference in attack duration between the two music styles.

  • We previously decided that our critical \(\alpha\) is 5% - or 0.05 in probability terms

  • To reject the null hypothesis, the probability of obtaining our test statistic (10.28) under the null hypothesis needs to be less than the critical \(\alpha\)

  • Our p-value was 0.043

  • 0.043 is smaller than 0.05 ➞ we reject the null hypothesis and conclude that this difference is statistically significant at \(\alpha\) of 0.05.

Diagram of hypothesis testing with highlighted step: "Reject the null hypothesis if the p-value is smaller than the acceptable rate of false positive findings". There's an additional label that says "Remember this?". The label is next to a step "Decide on the acceptable rate of false-positive findings" which was covered earlier.
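In R, this final step is just a comparison of the p-value against the \(\alpha\) level (using the standard error of 6 assumed earlier):

alpha   <- 0.05
p_value <- pnorm(10.28, mean = 0, sd = 6, lower.tail = FALSE)

# is the p-value smaller than the acceptable rate of false positives?
p_value < alpha
[1] TRUE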

One more example…

Example 3:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 44 = 7.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value of 51 minutes; the histogram for Metal centres around the value of 44

One more example…

  • Shaded area is top 5%

  • The probability of finding a difference of 7.28 or larger in a distribution centred at 0 is 0.112 (i.e. in 11.2% of samples)

  • Is the difference of 7.28 in the shaded area?

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot, with a shape of a bell curve centred on 0. the x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

One more example…

  • The value of 0.11 is greater than 0.05, which is our acceptable threshold of false-positives.

  • The difference between the means we detected (MDiff = 7.28) is not considered statistically significant at \(\alpha\) of 0.05. We therefore cannot reject the null hypothesis.

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot, with a shape of a bell curve centred on 0. the x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

Let’s practice (1)

  • We set the alpha level to be 0.05 (or 5%)

  • After receiving an intervention, a group of patients reports an average anxiety score of 25.

  • The control group reports a score of 31, Mdiff = 6

  • If the null hypothesis is true, the probability of finding a difference of at least 6 is p = 0.02, or 2%.

  • Is this difference statistically significant?

Let’s practice (2)

  • We set the alpha level to be 0.05

  • We randomly sample and find that individuals who spend time with puppies report stress levels of 16, while individuals without puppies report stress levels of 19, Mdiff = 3

  • We work out that: p = 0.04

  • Is this difference statistically significant?

Let’s practice (3)

Reported in a journal:

Participants in the active control group (M = 645, SD = 112) showed slower reaction times in milliseconds compared to those in the intervention group (M = 598, SD = 132), Mdiff = 47, p = 0.073.

  • We set the alpha to be 0.05
  • Is this difference statistically significant?

Interpreting p-values

Interpreting p-values

Tip

The p-value is the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true.

  • If the p-value is smaller than some pre-defined \(\alpha\) cut-off (often 0.05), we reject the null hypothesis

  • Like confidence intervals, p-values should be interpreted in the context of wider research

    • If the effect (difference in means, correlation, etc.) is really 0, we can still find a whole range of effects in different samples because of random sampling.

    • With p-values, we’re answering the question: “If the sampling distribution of this effect is centred at 0, how probable is the effect that we found in our sample?”

🚨 Some INCORRECT definitions of p-values 🚨

Warning

The p-value is NOT the probability of a chance result.

Warning

The p-value is NOT the probability of the alternative hypothesis being true.

Warning

The p-value is NOT the probability of the null hypothesis being true.

The inverse probability fallacy

Warning

The p-value is NOT the probability of the null hypothesis being true.

  • The p-value doesn’t tell us anything about the probability of the null or the alternative hypothesis.

  • It tells us how likely the detected effect is IF the null is true.

  • We don’t know whether the null is true or not

  • Other statistical approaches - like Bayesian hypothesis testing - can tell us more about the probability of the null/alternative hypothesis, but we don’t cover them on this module.

Some pitfalls of NHST

  • p-values are sensitive to sample size - any tiny difference can be statistically significant with large enough sample

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals overlap substantially.

\[ n = 50 \\ M_{Diff} = \text{2 minutes} \\ p = 0.17 \]

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals are narrow and don't overlap at all.

\[ n = 10000 \\ M_{Diff} = \text{2 minutes} \\ p = 0.0000006 \]
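A quick simulation sketch shows the same behaviour. The group means and the within-group SD of 15 are made up for illustration; the true difference is fixed at 2 minutes in both cases, and only the sample size changes:

set.seed(7)
# the same 2-minute true difference at two very different sample sizes
small_shrek <- rnorm(50,    mean = 51, sd = 15)
small_metal <- rnorm(50,    mean = 49, sd = 15)
big_shrek   <- rnorm(10000, mean = 51, sd = 15)
big_metal   <- rnorm(10000, mean = 49, sd = 15)

t.test(small_shrek, small_metal)$p.value   # usually not significant with n = 50 per group
t.test(big_shrek, big_metal)$p.value       # tiny p-value for the same 2-minute difference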

Getting the most out of p-values

  • EFFECT SIZES - how large/meaningful is the effect (e.g. mean difference) that we found?

  • CONFIDENCE INTERVALS - what are the plausible limits of our effect?

  • POWER ANALYSIS - determining the necessary sample size before we begin data collection
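For example, Cohen's d expresses a mean difference in standard-deviation units. A minimal sketch (the helper function cohens_d is written here for illustration, not taken from a package):

cohens_d <- function(x, y) {
  # pooled standard deviation across the two groups
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  # mean difference expressed in units of the pooled SD
  (mean(x) - mean(y)) / pooled_sd
}

# e.g. cohens_d(shrek_attacks, metal_attacks) on the raw attack durations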

Power analysis

POWER ANALYSIS - determining the necessary sample size before we begin data collection to make sure we don’t under- or over-sample

STATISTICAL POWER - the probability of detecting an effect of a certain size as statistically significant, assuming the effect exists.

  • We assume a certain effect size - say that we’re only interested in an effect if it’s at least a difference of 10 minutes between Metal and Shrek music

  • We calculate the sample size necessary to detect this effect.

  • More participants are needed for (1) smaller effects and (2) complicated designs

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section.
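Base R can sketch this calculation with power.t.test(). Assuming we only care about a difference of at least 10 minutes and guessing a within-group SD of 15 (the SD is an assumption, not a value from the lecture):

power.t.test(delta = 10, sd = 15, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# returns the n needed per group to detect a 10-minute difference with 80% power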

Power analysis

  • Power analysis makes non-significant effects easier to interpret

  • Reduces the probability of missing an important finding (false-negative finding)

  • Reduces over-sampling and wasting resources

  • Generally, a statistical power of 80% is considered the standard to aim for

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section

In a nutshell…

  • NHST is the most commonly used way of testing hypotheses, but other methods exist

  • We start by defining our research questions and hypotheses.

  • We decide whether or not to reject the null hypothesis based on the p-value associated with our test statistic

  • p-values can be a useful tool when combined with effect sizes, confidence intervals, and power analysis

Complete diagram of hypothesis testing same as on one of the earlier slides

In a nutshell…

A p-value is:

  • The probability of observing a test statistic at least as large as the one we observed in our sample if the null hypothesis is true.

A p-value is NOT:

  • The probability that the null hypothesis is true

  • The probability that the alternative hypothesis is true

  • The probability of a chance result

Complete diagram of hypothesis testing same as on one of the earlier slides

Next week

  • Statistical foundations are over 🥳
  • Lectures, skills labs and tutorials start overlapping
  • Applying the NHST principles in practice - testing group differences using the t-test

References

Doyle, Lewis, Matthew J. Easterbrook, and Peter R. Harris. 2023. “Roles of Socioeconomic Status, Ethnicity and Teacher Beliefs in Academic Grading.” British Journal of Educational Psychology 93 (1): 91–112. https://doi.org/10.1111/bjep.12541.
Drews-Windeck, Elea, Lindsay Evans, Kathryn Greenwood, and Kate Cavanagh. 2022. “The Implementation of a Digital Group Intervention for Individuals with Subthreshold Borderline Personality Disorder.” Procedia Computer Science 206: 23–33. https://doi.org/10.1016/j.procs.2022.09.082.
Field, Andy P. 2018. “Discovering Statistics Using SPSS.”
Field, Andy P., and Jenny Terry. 2018. “A Pilot Study of Whether Fictional Narratives Are Useful in Teaching Statistical Concepts.” RSS Conference. https://discoveringstatistics.com/docs/rss_poster_2018.pdf.
Fincham, Guy William, Clara Strauss, Jesus Montero-Marin, and Kate Cavanagh. 2023. “Effect of Breathwork on Stress and Mental Health: A Meta-Analysis of Randomised-Controlled Trials.” Scientific Reports 13 (1): 432. https://doi.org/10.1038/s41598-022-27247-y.
Forman, Jemma, Elizabeth Renner, and David A. Leavens. 2023. “Fetching Felines: A Survey of Cat Owners on the Diversity of Cat (Felis Catus) Fetching Behaviour.” Scientific Reports 13 (1): 20456. https://doi.org/10.1038/s41598-023-47409-w.
Haller, Heiko, and Stefan Krauss. 2002. “Misinterpretations of Significance: A Problem Students Share with Their Teachers.” Methods of Psychological Research 7 (1): 1–20.
Mankin, Jennifer L., Christopher Thompson, Holly P. Branigan, and Julia Simner. 2016. “Processing Compound Words: Evidence from Synaesthesia.” Cognition 150: 1–9. https://doi.org/10.1016/j.cognition.2016.01.007.
Woolston, Chris. 2019. “PhDs: The Tortuous Truth.” Nature. https://www.nature.com/articles/d41586-019-03459-7.