A reminder image so that I don't forget to record the lecture on Zoom. Again.

Today

  • Recap of sampling from distributions and confidence intervals

  • Forming a research question

  • Moving from research questions to hypotheses

  • Formally testing hypotheses with statistics

  • (Some of many) pitfalls of NHST

Recap

  • WEEK 2: Determining whether an individual’s score is unusual given a distribution with known characteristics.
  • If a population distribution of anxiety scores is normally distributed with a M = 30 and SD = 10, how common or unusual is an individual with a score of 41?
pnorm(41, mean = 30, sd = 10, lower.tail = FALSE)
[1] 0.1356661

density plot of a normal distribution centred at 30. Area from the score of 41 to the tail of the distribution is shaded

  • If a population distribution of anxiety scores is normally distributed with a M = 30 and SD = 10, what score would an individual need to get to be in the top 5%?
qnorm(p = 0.95, mean = 30, sd = 10)
[1] 46.44854

density plot of a normal distribution centred at 30. Area from the score of 46 to the tail of the distribution is shaded

Recap

  • WEEK 3: Estimating population distribution properties from the sample and quantifying uncertainty around our estimates.
  • Sample estimate - best guess of the population parameter based on our samples
  • Sampling distribution -
    • Distribution of (infinite) sample estimates (e.g. means) - normal regardless of sample or population shape.
    • Centred on the population value

Recap

  • Standard error is the standard deviation of sample means in a sampling distribution
  • Constructing confidence intervals around our sample estimate.

A histogram of a normal sampling distribution, with highlighted portion that falls under the 95% confidence interval

The "ladder" of confidence intervals. Means of different samples are plotted on the X axis along with confidence intervals, different samples are plotted on the y axis. Vertical line goes through the population value. 95% of confidence intervals cross this line.
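The recap above can be sketched in R with a hypothetical sample (the scores and seed below are invented for illustration):

```r
# Hypothetical sample of anxiety scores (illustrative, not lecture data)
set.seed(1)
scores <- rnorm(50, mean = 30, sd = 10)

m  <- mean(scores)                       # sample estimate of the population mean
se <- sd(scores) / sqrt(length(scores))  # standard error of the mean
ci <- m + c(-1.96, 1.96) * se            # approximate 95% confidence interval
ci
```

Across many repeated samples, roughly 95% of intervals built this way would contain the population mean - which is exactly what the "ladder" plot shows.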

Where are we?

Roadmap on the module. Box "Null hypothesis significance testing" is highlighted. Top row contains boxes "Introduction and distributions", "Standard error and confidence intervals" and "null hypothesis significance testing". Second box is labelled as "We're here!". Middle row is "t-test", "correlation" and "chi-square". Bottom row is "equation of a straight line", "linear model with one predictor", "linear model with multiple predictors"

The research process

The (quantitative) research process

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Confirmatory vs exploratory research

CONFIRMATORY RESEARCH

  • Hypothesis is determined before we run any statistical tests
  • Follows formula on the right.

  • “Null Hypothesis Significance Testing” (NHST)

  • Other approaches also exist

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Confirmatory vs exploratory research

EXPLORATORY RESEARCH

  • No hypothesis prior to running tests

  • Useful for generating hypotheses, but statistical tests can be difficult to interpret.

  • You should still have some analysis plan specifying how you’re going to explore the data

  • Should be followed up with replication to confirm the findings.

Fishing for results

If you torture your data for long enough, some “findings” will emerge eventually. These findings will not necessarily be reliable.

Graphic of the hypothesis testing process with the following steps: 1. Literature search. 2. Generate a research question. 3. Define the null and the alternative hypothesis. 4. Decide on an acceptable rate of false-positive findings. 5. Calculate statistical power. 6. Collect a random sample. 7. Calculate the test statistic. 8. Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true. 9. Compare this probability to the acceptable rate of false positives - if it's smaller, reject the null hypothesis. If it's greater, we cannot reject the null hypothesis.

Generate a research question

A good research question:

  • Researchable and realistic

  • Informed by the prior research (a gap in the literature or a need for replication)

  • Not too broad and not too narrow - proportional to the project at hand

  • What’s too broad for one project can be just right for another!

Diagram of hypothesis testing with highlighted step: "Generate a research question"

Generate a research question

How can we address the global mental health crisis? ❌ - Too broad

What are the key factors contributing to the mental health crisis among PhD students? (Woolston 2019) ✅ - Better!

Some other examples:

Can domesticated cats play fetch? (Forman, Renner, and Leavens 2023)

How do word frequency and semantic transparency influence synaesthetic colouring of compound words? (Mankin et al. 2016)

Can embedding statistical teaching within a fictional narrative help to reduce anxiety and increase comprehension? (Field and Terry 2018)

What are the effects of breathwork on stress and mental health? (Fincham et al. 2023)

What are the roles of socioeconomic status, ethnicity and teacher beliefs in academic grading? (Doyle, Easterbrook, and Harris 2023)

Can a digital intervention be helpful for individuals with subthreshold borderline personality disorder? (Drews-Windeck et al. 2022)

Define the hypotheses

A good research question allows us to generate testable predictions (hypotheses) that are relevant to our research aims.

Vocabulary

Hypothesis: A testable prediction - a statement about what we reasonably believe our data will show.

  • The prediction is based on some prior information (literature and prior research)

  • A hypothesis can be defined on different levels - conceptual, operational, statistical

Diagram of hypothesis testing with highlighted step: "Define the null and the alternative hypothesis"

Define the hypotheses

CONCEPTUAL HYPOTHESIS

  • Describes our prediction in conceptual terms. Can be defined in terms of the direction of the effect that we’re studying:
    • Non-directional: There will be differences in anxiety levels between participants in group A and group B

    • Directional: Participants in group A will show higher levels of anxiety than those in group B

Define the hypotheses

OPERATIONALISATION - the process of translating concepts into measures - i.e. how are we going to measure the concepts that we’re studying?

CONCEPTUAL HYPOTHESIS

  • “Participants in group A will show higher levels of anxiety than those in group B.”

OPERATIONAL HYPOTHESIS

  • “Participants in group A will score higher on the State-Trait Anxiety Inventory than those in group B”, OR:

  • “Participants in group A will show higher skin conductance response than those in group B”

STATISTICAL HYPOTHESIS

  • What do we expect to happen numerically? E.g.

  • \(M_{\text{GroupA}} \ne M_{\text{GroupB}}\) OR \(M_{\text{GroupA}} \gt M_{\text{GroupB}}\)

PollEverywhere: Operationalising variables

We’re interested in a relationship between wearing a hat and confidence.

How can we operationalise (measure) someone’s confidence?

Null vs alternative hypothesis

Once we know the operational hypothesis, we can define the null and the alternative hypotheses.

  • ALTERNATIVE HYPOTHESIS - our prediction, denoted as H1.

  • NULL HYPOTHESIS - denoted as H0. It represents the negation of the prediction that we’re making. Often describes a reality where the effect we’re interested in doesn’t exist.

Two realities

Reality 1: H1 is true

  • Effect we’re interested in exists. e.g. “There will be a difference in scores between group A and group B”

  • Sampling distribution is centred at the population value for this difference:

Reality 2: H0 is true

  • Effect we’re interested in doesn’t exist. e.g. “There will be no difference in scores between group A and group B.”

  • Sampling distribution is centred at 0:

Two realities

Null Hypothesis Significance Testing

  • If H0 is true, how likely/unlikely is the result (e.g. the group difference) we observed?

  • Can we reject H0 based on what we observed in our data?

  • “Statistically significant result” - result that is sufficiently unlikely under the null hypothesis

Reality 2: H0 is true

  • Effect we’re interested in doesn’t exist. e.g. “There will be no difference in scores between group A and group B.”

  • Sampling distribution is centred at 0:

Example: Heavy metal orcas 🤘

  • Increased orca attacks on boats crossing the Gibraltar Strait

  • Sailors started using heavy metal to deter them

  • Some (less than reputable) sources claim it works. Others advise against doing so!

An epic looking graphic of 5 orcas in a crowd at an Iron Maiden concert

Headline of an article saying "Is THIS how to stop killer whales ramming boats? Sailors say they're blasting heavy metal music out underwater to deter orcas"

Headline of an article saying "Orcas pummel boat after crew tries to deter them with heavy metal music"

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Example: Heavy metal orcas 🤘

RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?

  • Boat 1: Martina is blasting heavy metal music
  • Boat 2: Jennifer is blasting Shrek music (control group)

CONCEPTUAL HYPOTHESIS: Playing heavy metal music will be associated with reduced hostility in orcas.

Reduced hostility in orcas ➞ Shorter orca attack duration

OPERATIONAL HYPOTHESIS: Playing heavy metal music will be associated with shorter orca attack duration compared to Shrek music.

Animated image of a little boat with Jennifer behind the wheel and Martina sitting at the back. There's a huge orca jumping over the boat, looking like it's going to body slam the boat.

Our hypotheses

  • H1: Orca attacks will be significantly shorter when playing heavy metal music compared to playing Shrek soundtracks.

  • H0: There will be no significant difference in attack duration between the two music styles.

Our hypotheses

STATISTICAL HYPOTHESES:

  • H1:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} \ne 0\\ M_{\text{(Shrek)}} > M_{\text{(Metal)}} \]

  • H0:

\[ M_{\text{(Shrek)}} - M_{\text{(Metal)}} = 0 \\ M_{\text{(Shrek)}} = M_{\text{(Metal)}} \]

Decide on the α (alpha) level

  • \(\alpha\) level - the rate of false-positive findings that we’re willing to accept if we’re living in a reality where the null hypothesis is true.

  • If we were to repeat our experiment over and over again, how often are we willing to incorrectly reject the null hypothesis?

  • The decision should be based on a cost-benefit analysis - how risky is it to be wrong?

  • Psychologists often use the \(\alpha\) rate of 5% as a blanket rule with no justification ¯\_(ツ)_/¯

Diagram of hypothesis testing with highlighted step: "Decide on the acceptable rate of false positive findings"

Calculate statistical power

It’s a way of deciding how large a sample you need to get meaningful results on a statistical test.

Let’s come back to this later!

Diagram of hypothesis testing with highlighted step: "Calculate statistical power"

Calculate the test statistic

  • Some numeric value that we use to test the hypothesis

  • There are different test statistics for different situations.

  • Some examples you might see reported in a paper: t, F, \(\chi^2\)

  • In our case, we can look at the mean difference between the two music styles i.e.:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \]

Diagram of hypothesis testing with highlighted step: "Calculate the test statistic"
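This test statistic is just a subtraction in R. The attack durations below are made-up values chosen to roughly match Example 1:

```r
# Hypothetical attack durations in minutes (illustrative values only)
shrek <- c(48, 55, 50, 53, 52, 49, 54, 51)
metal <- c(15, 18, 14, 17, 16, 15, 19, 14)

m_diff <- mean(shrek) - mean(metal)  # test statistic: the mean difference
m_diff
```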

Calculate the test statistic

Example 1:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 16.00 = 35.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 16

A difference of over 35 minutes is quite a lot!

Calculate the test statistic

Example 2:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 41 = 10.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 41

How about this difference? Large? Small? Could we realistically find this difference if the null hypothesis is true?

p-values

Compute the p-value

H0: There will be no significant difference in attack duration between the two music styles.

  • We want to know the probability of observing a test statistic at least as large as the one we observed if the null hypothesis is true - the p-value.

  • In our case, the “test statistic” is the mean difference between the two groups.

  • What kind of difference would we expect to find if the null hypothesis is true?

Diagram of hypothesis testing with highlighted step: "Compute the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true."

Compute the p-value

  • If the null hypothesis is true, we would expect a difference of 0

  • But we’re taking a random sample - so 0 is the most probable value under the null hypothesis but other values are also possible.

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'

Compute the p-value

  • The difference in attack duration between Shrek music and Metal music detected in our sample was equal to 10.28.

  • Familiar territory: Given a normal sampling distribution centred at 0, how common is the value of 10.28?

  • The shaded proportion is 0.05 - top 5% of possible samples


A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

Compute the p-value

  • The probability of obtaining a difference of 10.28 under the null hypothesis is 0.043 \(\times\) 100 = 4.3%.
  • Therefore, given a distribution centred at 0, we would expect to find a difference at least as large as 10.28 in 4.3% of cases

Vocab

p-value: The probability of observing a test statistic at least as large as the one observed if the null hypothesis is true.

pnorm(10.28, mean = 0, sd = 6, lower.tail = FALSE)
[1] 0.04332562

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -50 to 50 and is labelled 'M(diff) in different samples'

Retain or reject the null hypothesis

H0: There will be no significant difference in attack duration between the two music styles.

  • Critical \(\alpha\): 5% (0.05 in probability terms)

  • To reject the null hypothesis, the probability of obtaining our test statistic (10.28) under the null hypothesis needs to be less than the critical \(\alpha\)

  • Our p-value was 0.043

  • 0.043 is smaller than 0.05 ➞ we reject the null hypothesis and conclude that this difference is statistically significant at \(\alpha\) of 0.05.

Diagram of hypothesis testing with highlighted step: "Reject the null hypothesis if the p-value is smaller than the acceptable rate of false-positive findings". There's an additional label that says "Remember this?". The label is next to the step "Decide on the acceptable rate of false-positive findings" which was covered earlier.
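The decision rule boils down to one comparison in R (reusing the standard error of 6 assumed in the slides' pnorm() call):

```r
alpha <- 0.05  # acceptable rate of false-positive findings
p <- pnorm(10.28, mean = 0, sd = 6, lower.tail = FALSE)  # p-value from the slides
p < alpha  # TRUE, so we reject the null hypothesis
```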

One more example…

Example 3:

\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 44 = 7.28 \]

Two histograms labelled as Shrek and Metal respectively. The attack duration is on the x axis, which goes from 0 to 70. The histogram for Shrek centres around the value 51 minutes, the histogram for Metal centres around the value 44

One more example…

  • Shaded area is top 5%

  • The probability of finding a difference of 7.28 in a distribution centred at 0 is 0.112 (or in 11.2% of samples)

  • Is the difference of 7.28 in the shaded area?

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

One more example…

  • The value of 0.11 is greater than 0.05, which is our acceptable threshold of false-positives.

  • The difference between the means we detected (MDiff = 7.28) is not considered statistically significant at \(\alpha\) of 0.05. We therefore cannot reject the null hypothesis.

pnorm(7.28, 0, 6, lower.tail = FALSE)
[1] 0.1125012

A density distribution plot with the shape of a bell curve centred on 0. The x axis ranges from -30 to 30 and is labelled 'M(diff) in different samples'. Top 5% are shaded.

PollEverywhere

Practising statistical significance

Let’s practice (1)

  • We set the alpha level to be 0.05 (or 5%)

  • After receiving an intervention, a group of patients reports average anxiety score of 25.

  • The control group reports a score of 31, Mdiff = 6

  • If the null hypothesis is true, probability of finding a difference of 6 is p = 0.02, or 2%.

  • Is this difference statistically significant?

Let’s practice (2)

  • We set the alpha level to be 0.05

  • We randomly sample and find that individuals who spend time with puppies report stress levels of 16, while individuals without puppies report stress levels of 19, Mdiff = 3

  • We work out that: p = 0.04

  • Is this difference statistically significant?

Let’s practice (3)

Reported in a journal:

Participants in the active control group (M = 645, SD = 112) showed slower reaction times in milliseconds compared to those in the intervention group (M = 598, SD = 132), Mdiff = 47, p = 0.073.

  • We set the alpha to be 0.05
  • Is this difference statistically significant?

What if we’re wrong?

  • Type I error:

    • False-positive

    • Concluding that an effect exists when it doesn’t

    • Incorrectly rejecting the null hypothesis

  • Type II error:

    • False-negative

    • Concluding that an effect doesn’t exist when it does

    • Failing to reject the null hypothesis

P-values and confidence intervals

  • Closely interlinked concepts

  • Extent of overlap of confidence intervals is indicative of statistical (non) significance

P-values and confidence intervals

  • Closely interlinked concepts

  • Extent of overlap of confidence intervals is indicative of statistical (non) significance

  • No overlap - statistically significant

  • Overlap of up to about half the length of one CI whisker - statistically significant, p < 0.05

  • Overlap of about half a whisker - p-value close to 0.05

P-values and confidence intervals

  • Overlap of more than half a CI whisker - non-significant

  • A useful rule of thumb, but not a universal one - it depends on the design.

Interpreting p-values

Interpreting p-values

Tip

A p-value is the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true.

  • If the p-value is smaller than some pre-defined \(\alpha\) cut-off (often 0.05), we reject the null hypothesis

  • Like confidence intervals, p-values should be interpreted in the context of wider research

    • If the effect (difference in means, correlation, etc.) is really 0, we can still find a whole range of effects in different samples because of random sampling.

    • With p-values, we’re answering the question: “If the sampling distribution of this effect is centred at 0, how probable is the effect that we found in our sample?”
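This logic can be checked by simulation. If we draw many mean differences from a null sampling distribution (assumed here to be normal with a standard error of 6, as in the slides), about 5% of them fall past the one-tailed 0.05 cut-off:

```r
set.seed(42)
n_sims <- 10000
# Mean differences drawn from a sampling distribution centred at 0 (H0 true)
diffs <- rnorm(n_sims, mean = 0, sd = 6)
p_vals <- pnorm(diffs, mean = 0, sd = 6, lower.tail = FALSE)
mean(p_vals < 0.05)  # close to 0.05: false positives occur at the alpha rate
```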

PollEverywhere:

What even are p-values?

🚨 Some INCORRECT definitions of p-values 🚨

Warning

The p-value is NOT the probability of a chance result.

Warning

The p-value is NOT the probability of the alternative hypothesis being true.

Warning

The p-value is NOT the probability of the null hypothesis being true.

The inverse probability fallacy

Warning

The p-value is NOT the probability of the null hypothesis being true.

  • p-value doesn’t tell us anything about the probability of null or alternative hypothesis.

  • It tells us how likely the detected effect is IF the null is true.

  • We don’t know whether the null is true or not

  • Other statistical approaches - like Bayesian hypothesis testing - can tell us more about the probability of the null/alternative hypothesis, but we don’t cover them on this module.

Some pitfalls of NHST

  • p-values are sensitive to sample size - any tiny difference can be statistically significant with large enough sample

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals overlap substantially.

\[ n = 50 \\ M_{Diff} = \text{2 minutes} \\ p = 0.17 \]

A dot plot. The two groups - shrek and metal - are on the x axis. The value of the mean in each group is on the y axis. There are error bars around the dots representing confidence intervals. The difference between the two dots is 2. Confidence intervals are narrow and don't overlap at all.

\[ n = 10000 \\ M_{Diff} = \text{2 minutes} \\ p = 0.0000006 \]
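A minimal sketch of why this happens, assuming a population SD of 20 minutes (an invented value): the standard error of the mean difference shrinks as n grows, so the same 2-minute difference becomes ever less probable under the null:

```r
# p-value for a fixed 2-minute difference at different sample sizes
# (the population SD of 20 is an assumption for illustration)
p_for_n <- function(n, m_diff = 2, pop_sd = 20) {
  se <- pop_sd * sqrt(2 / n)  # SE of the difference between two group means
  pnorm(m_diff, mean = 0, sd = se, lower.tail = FALSE)
}
p_for_n(50)     # around 0.31 - not significant
p_for_n(10000)  # vanishingly small - highly significant
```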

Some pitfalls of NHST

  • All-or-nothing thinking - is there a practical difference between 0.049 and 0.051?

🥳 “The mean difference between the groups was statistically significant,
MDiff = 10.4, p = 0.049.”

🥴 “The mean difference between the groups was not statistically significant,
MDiff = 10.2, p = 0.051.”

🤯 “The mean difference between the groups was *&%@@!%)£)@)!__!&^%_()!(!,
MDiff = 10.3, p = 0.05.”

  • A non-significant result is still a result

Getting the most out of p-values

  • EFFECT SIZES - how large/meaningful is the effect (e.g. mean difference or other estimate) that we found

    • E.g. a group difference of 2 on a scale of 1-10 vs the same difference on a scale of 1-100
  • CONFIDENCE INTERVALS - what are the plausible limits of our effect?

  • POWER ANALYSIS - determining the necessary sample size before we begin data collection

Power analysis

POWER ANALYSIS - determining the necessary sample size before we begin data collection to make sure we don’t under- or over-sample

STATISTICAL POWER - the probability of detecting an effect of a certain size as statistically significant, assuming the effect exists.

  • We assume a certain effect size - say that we’re only interested in an effect if it’s at least a difference of 10 minutes between Metal and Shrek music

  • We calculate the sample size necessary to detect this effect.

  • More participants are needed for (1) smaller effects and (2) complicated designs

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section.
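In R, the base function power.t.test() performs this calculation for a two-group comparison; the within-group SD of 15 minutes below is an assumed value for illustration:

```r
# How many orcas per group would we need to detect a 10-minute difference
# with 80% power at alpha = 0.05? (SD of 15 is an assumption)
power.t.test(delta = 10, sd = 15, sig.level = 0.05, power = 0.80)
```

The returned n is the required sample size per group; halving delta (a smaller effect of interest) roughly quadruples it.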

Power analysis

  • Power analysis makes non-significant effects easier to interpret

  • Reduces the probability of missing an important finding (a false-negative finding)

  • Reduces over-sampling and wasting resources

  • Generally, a statistical power of 80% is considered the standard to aim for

Diagram of hypothesis testing with highlighted step: "Calculate statistical power" which is a previously skipped section

In a nutshell…

  • NHST is the most commonly used way of testing hypotheses, but other methods exist

  • We start by defining our research questions and hypotheses.

  • We decide whether or not to reject the null hypothesis based on the p-value associated with our test statistic

  • p-values can be a useful tool when combined with effect sizes, confidence intervals, and power analysis

Complete diagram of hypothesis testing same as on one of the earlier slides

In a nutshell…

A p-value is:

  • The probability of observing a test statistic at least as large as the one we observed in our sample if the null hypothesis is true.

A p-value is NOT:

  • The probability that the null hypothesis is true

  • The probability that the alternative hypothesis is true

  • The probability of a chance result

Complete diagram of hypothesis testing same as on one of the earlier slides

Next week

  • Statistical foundations are over 🥳
  • Lectures, skills labs and tutorials start overlapping
  • Applying the NHST principles in practice - testing group differences using the t-test

References

Doyle, Lewis, Matthew J. Easterbrook, and Peter R. Harris. 2023. “Roles of Socioeconomic Status, Ethnicity and Teacher Beliefs in Academic Grading.” British Journal of Educational Psychology 93 (1): 91–112. https://doi.org/10.1111/bjep.12541.
Drews-Windeck, Elea, Lindsay Evans, Kathryn Greenwood, and Kate Cavanagh. 2022. “The Implementation of a Digital Group Intervention for Individuals with Subthreshold Borderline Personality Disorder.” Procedia Computer Science 206: 23–33. https://doi.org/10.1016/j.procs.2022.09.082.
Field, Andy P., and Jenny Terry. 2018. “A Pilot Study of Whether Fictional Narratives Are Useful in Teaching Statistical Concepts.” RSS Conference. https://discoveringstatistics.com/docs/rss_poster_2018.pdf.
Fincham, Guy William, Clara Strauss, Jesus Montero-Marin, and Kate Cavanagh. 2023. “Effect of Breathwork on Stress and Mental Health: A Meta-Analysis of Randomised-Controlled Trials.” Scientific Reports 13 (1): 432. https://doi.org/10.1038/s41598-022-27247-y.
Forman, Jemma, Elizabeth Renner, and David A. Leavens. 2023. “Fetching Felines: A Survey of Cat Owners on the Diversity of Cat (Felis Catus) Fetching Behaviour.” Scientific Reports 13 (1): 20456. https://doi.org/10.1038/s41598-023-47409-w.
Haller, Heiko, and Stefan Krauss. 2002. “Misinterpretations of Significance: A Problem Students Share with Their Teachers.” Methods of Psychological Research 7 (1): 1–20.
Mankin, Jennifer L., Christopher Thompson, Holly P. Branigan, and Julia Simner. 2016. “Processing Compound Words: Evidence from Synaesthesia.” Cognition 150: 1–9. https://doi.org/10.1016/j.cognition.2016.01.007.
Woolston, Chris. 2019. “PhDs: The Tortuous Truth.” Nature. https://www.nature.com/articles/d41586-019-03459-7.