Recap of sampling from distributions and confidence intervals
Forming a research question
Moving from research questions to hypotheses
Formally testing hypotheses with statistics
(Some of many) pitfalls of NHST
Standard error is the standard deviation of sample means (in a sampling distribution)
Constructing confidence intervals around our sample estimate.
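As a quick reminder of how those two quantities are computed, here is a minimal Python sketch; the anxiety scores and the normal-approximation interval (±1.96 SE) are illustrative assumptions, not part of the lecture example:

```python
import numpy as np

# Made-up sample of anxiety scores (illustrative assumption, not lecture data)
sample = np.array([23, 31, 27, 35, 29, 24, 33, 28, 30, 26])

n = len(sample)
mean = sample.mean()
sd = sample.std(ddof=1)  # sample standard deviation

# Standard error: the standard deviation of sample means
se = sd / np.sqrt(n)

# 95% confidence interval around the sample mean (normal approximation)
lower, upper = mean - 1.96 * se, mean + 1.96 * se

print(f"M = {mean:.2f}, SE = {se:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```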
We don’t know how close our estimate is to the population value
We also don’t know if our confidence interval even contains the population value.
Should we even bother?
TO THE RESCUE:
Sample size matters - the larger our sample size, the closer our sample resembles the distribution.
Scientific work is cumulative - we build on each other’s ideas, rarely starting from scratch.
Small samples are not useless, we just need more of them.
Multi-study papers
Multi-lab studies (e.g. SMARVUS data)
Replication is crucial (especially for novel findings)
Vocab
Replication - attempt to reproduce a finding from another study in a new sample using identical methodology
CONFIRMATORY RESEARCH - focuses on testing specific hypotheses and generally follows a standard formula.
“Null Hypothesis Significance Testing” (NHST)
Other approaches also exist
EXPLORATORY RESEARCH - doesn’t start with a hypothesis.
Can collect new data or work with existing data.
Useful for generating hypotheses, but statistical tests can be difficult to interpret.
A good research question:
Researchable and realistic
Informed by prior research (a gap in the literature or a need for replication)
Not too broad and not too narrow - proportional to the project at hand
What’s too broad for one project can be just right for another!
How can we address the global mental health crisis? ❌ - Too broad
What are the key factors contributing to the mental health crisis among PhD students? (Woolston 2019) ✅ - Better!
Some other examples:
Can domesticated cats play fetch? (Forman, Renner, and Leavens 2023)
How do word frequency and semantic transparency influence synaesthetic colouring of compound words? (Mankin et al. 2016)
Can embedding statistical teaching within a fictional narrative help to reduce anxiety and increase comprehension? (Field and Terry 2018)
What are the effects of breathwork on stress and mental health? (Fincham et al. 2023)
What are the roles of socioeconomic status, ethnicity and teacher beliefs in academic grading? (Doyle, Easterbrook, and Harris 2023)
Can a digital intervention be helpful for individuals with subthreshold borderline personality disorder? (Drews-Windeck et al. 2022)
A good research question allows us to generate testable predictions (hypotheses) that are relevant to our research aims.
Vocab
Hypothesis: A testable prediction - a statement about what we reasonably believe our data will show.
The prediction is based on some prior information (literature and prior research)
A hypothesis can be defined on different levels - conceptual, operational, statistical
CONCEPTUAL HYPOTHESIS
“There will be differences in anxiety levels between participants in group A and group B” - non-directional hypothesis.
“Participants in group A will show higher levels of anxiety than those in group B” - directional hypothesis.
OPERATIONALISATION - the process of translating concepts into measures - i.e. how are we going to measure the concepts that we’re studying?
CONCEPTUAL HYPOTHESIS ➞ OPERATIONAL HYPOTHESIS
“Participants in group A will score higher on the State-Trait Anxiety Inventory than those in group B” OR:
“Participants in group A will show higher skin conductance response than those in group B”
STATISTICAL HYPOTHESIS
What do we expect to happen numerically? E.g.
\(M_{\text{GroupA}} \ne M_{\text{GroupB}}\) OR \(M_{\text{GroupA}} \gt M_{\text{GroupB}}\)
Reports of increased orca attacks on boats crossing the Gibraltar Strait
Sailors started using heavy metal to deter them
Mixed anecdotal results - some (less than reputable) sources claim it works. Others advise against doing so!
RESEARCH QUESTION: Does heavy metal music affect hostile behaviour in orcas?
CONCEPTUAL HYPOTHESIS: Playing heavy metal music will be associated with reduced hostility in orcas.
Reduced hostility in orcas ➞ Shorter orca attack duration
OPERATIONAL HYPOTHESIS: Playing heavy metal music will be associated with shorter orca attack duration compared to Shrek music.
Once we know the operational hypothesis, we can define the null and the alternative hypotheses.
ALTERNATIVE HYPOTHESIS - our prediction, denoted as H1.
NULL HYPOTHESIS - denoted as H0. It represents the negation of the prediction that we’re making. Often describes a reality where the effect we’re interested in doesn’t exist.
H1: Orca attacks will be significantly shorter when playing heavy metal music compared to playing Shrek soundtracks.
H0: There will be no significant difference in attack duration between the two music styles.
Tip
H0 and H1 represent two alternative realities (like parallel worlds). When using Null Hypothesis Significance Testing, we’re trying to decide whether we can reject the null hypothesis.
STATISTICAL HYPOTHESES:
\[ H_1 \text{ (non-directional)}: \; M_{\text{(Shrek)}} - M_{\text{(Metal)}} \ne 0 \]
\[ H_1 \text{ (directional)}: \; M_{\text{(Shrek)}} > M_{\text{(Metal)}} \]
\[ H_0: \; M_{\text{(Shrek)}} - M_{\text{(Metal)}} = 0 \]
\(\alpha\) level - the rate of false-positive findings that we’re willing to accept if we’re living in a reality where the null hypothesis is true.
If we were to repeat our experiment over and over again, how often are we willing to incorrectly reject the null hypothesis?
The decision should be based on a cost benefit analysis - how risky is it to be wrong?
Psychologists often use the \(\alpha\) rate of 5% as a blanket rule with no justification ¯\_(ツ)_/¯
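To make the “repeat the experiment over and over” idea concrete, here is a minimal simulation sketch (the group means, SDs, and sample sizes are illustrative assumptions): when the null hypothesis is true, roughly \(\alpha\) of experiments will still come out “significant”.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the SAME population, so the null hypothesis is true
    group_a = rng.normal(loc=50, scale=10, size=30)
    group_b = rng.normal(loc=50, scale=10, size=30)
    if stats.ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

# Expect roughly 5% of experiments to reject H0 even though it is true
print(f"False-positive rate: {false_positives / n_experiments:.3f}")
```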
Choosing \(\alpha\) also feeds into power analysis - a way of deciding how large a sample you need to get meaningful results from a statistical test.
Let’s come back to this later!
TEST STATISTIC - some numeric value that we use to test the hypothesis.
There are different test statistics for different situations.
Some examples you might see reported in a paper: t, F, \(\chi^2\)
In our case, we can look at the mean difference between the two music styles i.e.:
\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \]
Example 1:
\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 16.00 = 35.28 \]
Over 35 minutes of difference is quite a lot!
Example 2:
\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 41 = 10.28 \]
How about this difference? Large? Small? We need a more formal way of deciding whether we can reject the null hypothesis.
H0: There will be no significant difference in attack duration between the two music styles.
We want to know the probability of observing the test statistic at least as large as the one we observed if the null hypothesis is true - the p-value.
In our case, the “test statistic” is the mean difference between the two groups.
What kind of difference would we expect to find if the null hypothesis is true?
If the null hypothesis is true, we would expect a difference of 0
But we’re taking a random sample - so 0 is the most probable value under the null hypothesis but other values are also possible.
Vocab
p-value: The probability of observing a test statistic at least as large as the one observed if the null hypothesis is true.
H0: There will be no significant difference in attack duration between the two music styles.
We previously decided that our critical \(\alpha\) is 5% - or 0.05 in probability terms
To reject the null hypothesis, the probability of obtaining a test statistic at least as large as ours (10.28) under the null hypothesis needs to be less than the critical \(\alpha\)
Our p-value was 0.043
0.043 is smaller than 0.05 ➞ we reject the null hypothesis and conclude that this difference is statistically significant at \(\alpha\) of 0.05.
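The slides don’t state which test produced the p-value of 0.043, so as one hedged illustration of the underlying logic - building the sampling distribution of the mean difference under the null hypothesis - here is a permutation-test sketch; the attack durations are made up for illustration and won’t reproduce the slide’s numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up attack durations in minutes (illustrative assumption, not the lecture data)
shrek = np.array([55.0, 48.0, 60.0, 42.0, 51.0, 47.0, 58.0, 49.0])
metal = np.array([44.0, 39.0, 47.0, 35.0, 42.0, 40.0, 46.0, 38.0])

# Test statistic: the mean difference between the two music conditions
observed_diff = shrek.mean() - metal.mean()

# Sampling distribution of the mean difference under H0: shuffle the group
# labels many times, so any remaining difference is due to chance alone
pooled = np.concatenate([shrek, metal])
n_shrek = len(shrek)
perm_diffs = np.empty(10_000)
for i in range(perm_diffs.size):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_shrek].mean() - shuffled[n_shrek:].mean()

# One-sided p-value (matching the directional hypothesis M_Shrek > M_Metal):
# how often does chance alone produce a difference at least as large as ours?
p_value = np.mean(perm_diffs >= observed_diff)

print(f"M_diff = {observed_diff:.2f}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Cannot reject H0")
```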
Example 3:
\[ M_{Diff} = M_{(Shrek)} - M_{(Metal)} \\ M_{Diff} = 51.28 - 44 = 7.28 \]
The p-value of 0.11 is greater than 0.05, our acceptable threshold of false positives.
The difference between the means we detected (\(M_{Diff} = 7.28\)) is not considered statistically significant at \(\alpha\) of 0.05. We therefore cannot reject the null hypothesis.
We set the alpha level to be 0.05 (or 5%)
After receiving an intervention, a group of patients reports an average anxiety score of 25.
The control group reports a score of 31, \(M_{Diff} = 6\)
If the null hypothesis is true, the probability of finding a difference at least as large as 6 is p = 0.02, or 2%.
Is this difference statistically significant?
We set the alpha level to be 0.05
We randomly sample and find that individuals who spend time with puppies report stress levels of 16, while individuals without puppies report stress levels of 19, \(M_{Diff} = 3\)
We work out that: p = 0.04
Is this difference statistically significant?
Reported in a journal:
Participants in the active control group (M = 645, SD = 112) showed slower reaction times (in milliseconds) compared to those in the intervention group (M = 598, SD = 132), \(M_{Diff} = 47\), p = 0.073.
Tip
The p-value is the probability of obtaining a test statistic at least as large as the one observed if the null hypothesis is true.
If the p-value is smaller than some pre-defined \(\alpha\) cut-off (often 0.05), we reject the null hypothesis
Like confidence intervals, p-values should be interpreted in the context of wider research
If the effect (difference in means, correlation, etc.) is really 0, we can still find a whole range of effects in different samples because of random sampling.
With p-values, we’re answering the question: “If the sampling distribution of this effect is centered at 0, how probable is the effect that we found in our sample?”
Warning
The p-value is NOT the probability of a chance result.
Warning
The p-value is NOT the probability of the alternative hypothesis being true.
Warning
The p-value is NOT the probability of the null hypothesis being true.
The p-value doesn’t tell us anything about the probability of the null or alternative hypothesis.
It tells us how likely the detected effect is IF the null is true.
We don’t know whether the null is true or not
Other statistical approaches - like Bayesian hypothesis testing - can tell us more about the probability of the null/alternative hypothesis, but we don’t cover them in this module.
\[ n = 50 \\ M_{Diff} = \text{2 minutes} \\ p = 0.17 \]
\[ n = 10000 \\ M_{Diff} = \text{2 minutes} \\ p = 0.0000006 \]
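A minimal sketch of why the same 2-minute difference gives such different p-values at different sample sizes; the standard deviation of 10 minutes is an assumption, so the exact p-values won’t match the slide’s, but the pattern is the same:

```python
from scipy import stats

# Same 2-minute mean difference and the same assumed SD of 10 minutes,
# evaluated at two very different sample sizes per group
for n in (50, 10_000):
    result = stats.ttest_ind_from_stats(
        mean1=51, std1=10, nobs1=n,
        mean2=49, std2=10, nobs2=n,
    )
    print(f"n = {n:>5}: M_diff = 2, p = {result.pvalue:.7f}")
```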
EFFECT SIZES - how large/meaningful is the effect (e.g. mean difference) that we found.
CONFIDENCE INTERVALS - what are the plausible limits of our effect?
POWER ANALYSIS - determining the necessary sample size before we begin data collection to make sure we don’t under- or over-sample
STATISTICAL POWER - the probability of detecting an effect of a certain size as statistically significant, assuming the effect exists.
We assume a certain effect size - say that we’re only interested in an effect if it’s at least a difference of 10 minutes between Metal and Shrek music
We calculate the sample size necessary to detect this effect (a minimal sketch of this calculation follows below).
Power analysis makes non-significant effects easier to interpret
Reduces the probability of missing an important finding (a false-negative finding)
Reduces over-sampling and wasting resources
Generally, a statistical power of 80% is considered the standard to aim for
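A minimal power-analysis sketch using statsmodels: the 10-minute smallest effect of interest, \(\alpha\) of 0.05, and the 80% power target come from the slides, while the 15-minute standard deviation used to convert the difference into Cohen’s d is an illustrative assumption:

```python
from statsmodels.stats.power import TTestIndPower

# Smallest effect of interest: a 10-minute difference; assuming an SD of
# 15 minutes (illustrative), this corresponds to Cohen's d of about 0.67
effect_size = 10 / 15

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,        # tolerated false-positive rate
    power=0.80,        # conventional 80% target
    alternative="two-sided",
)

print(f"Required sample size per group: {n_per_group:.1f}")
```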
NHST is the most commonly used way of testing hypotheses, but other methods exist
We start by defining our research questions and hypotheses.
We decide whether or not to reject the null hypothesis based on the p-value associated with our test statistic.
p-values can be a useful tool when combined with effect sizes, confidence intervals, and power analysis
A p-value is: the probability of observing a test statistic at least as large as the one observed if the null hypothesis is true.
A p-value is NOT:
The probability that the null hypothesis is true
The probability that the alternative hypothesis is true
The probability of a chance result