Distributions describe how often different values occur - in a sample or a population.
“Mathematically defined” distributions are useful - we know how probable scores are.
Normal distribution - defined by mean, standard deviation, and proportions of scores expected above/below critical values
Key idea
Assuming a distribution of a particular shape, how common is a given value?
Average individual attends 127 social events per year.
Assume a population with M = 127 and SD = 40 (e.g. based on collected sample)
Is an individual who attends 57 social events per year unusual?
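One way to approach this question, as a minimal sketch: convert the score to a z-score and look up the proportion of the distribution at or below it. The sketch assumes the population is normally distributed with M = 127 and SD = 40, and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed population of yearly social events (values from the example)
population = NormalDist(mu=127, sigma=40)

score = 57
# z-score: how many SDs the score lies from the mean
z = (score - population.mean) / population.stdev
# Proportion of the population expected to score at or below 57
proportion_below = population.cdf(score)

print(f"z = {z:.2f}")
print(f"proportion at or below {score}: {proportion_below:.3f}")
```

Under these assumptions only about 4% of people would attend 57 or fewer events, so this individual is fairly unusual.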
Key idea
Assuming a distribution of a particular shape, is the value we’re interested in above or below a specific cut-off point?
Assuming the same distribution (M = 127, SD = 40), is an individual who attends 190 social events per year among the top 10% of event-goers?
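A cut-off question like this can be sketched by finding the score that separates the top 10% from the rest and comparing the individual against it. The sketch assumes a normal population with M = 127 and SD = 40 and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed population of yearly social events (values from the example)
population = NormalDist(mu=127, sigma=40)

# Score that separates the top 10% from the rest (the 90th percentile)
cutoff_top_10 = population.inv_cdf(0.90)

score = 190
in_top_10 = score > cutoff_top_10
# Proportion of the population expected to attend more than 190 events
proportion_above = 1 - population.cdf(score)

print(f"cut-off for top 10%: {cutoff_top_10:.1f}")
print(f"in top 10%: {in_top_10}")
print(f"proportion above {score}: {proportion_above:.3f}")
```

The cut-off lands at roughly 178 events, so under these assumptions someone attending 190 events is in the top 10%.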
Example 1: Average individual drinks 730 cups of coffee per year. Jennifer drinks 1100 cups of coffee per year. Is Jennifer in the top 5% of the distribution (shaded)?
Example 2: A critical value for the bottom 5% on an anxiety scale is 7.08. A study participant receives a score of 8. Are they in the bottom 5%?
Sampling from populations - the broader picture
Uncertainty in research and estimation
Sampling distributions and the Central Limit Theorem
Standard error of the mean
Confidence intervals: what they are, and what they are not
So far we:
In research, we often want to know:
THE PROBLEM: Samples are not perfect representations of populations
There is uncertainty around how closely the sample mean matches the true population value.
Standard errors and confidence intervals are tools we use to quantify that uncertainty.
The sample estimate is our best guess
The population value we’re trying to estimate is called the parameter
What is an estimate?
A parameter estimate can take many different forms. We might want to:
Estimate a typical value of a single variable in a population
Estimate group differences for a variable
Estimate strength of association between two or more variables (e.g. a correlation coefficient, or slope of a straight line).
The average person…
drinks 730 cups of coffee per year (twice as much for academics, incl. students) ☕
spends 192 minutes a day watching TV 📺
eats 250 cloves of garlic per year 🧄
takes 3500 steps each day 🚶
falls asleep in 7 minutes 😴
Doomscrolling
“… refers to a unique media habit where social media users persistently attend to negative information in their newsfeeds about crises, disasters, and tragedies.”
- Sharma, Lee, and Johnson (2022)
Research question…
How much does an average person doomscroll?
Group 1: Right side of the room
Group 2: Middle of the room
Group 3: Left side of the room
Each time we take a sample, we get a different estimate
This is because of random sampling
How do we know if our estimate is accurate and close to the real population value?
Describes how sampling distributions arise
Imagine we repeat the following process thousands of times (in theory, an infinite number of times):
Collect a sample of individuals
Calculate mean doom-scrolling time in that sample
Save this mean value and put it on the plot
Repeat
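The repeated-sampling process above can be sketched as a small simulation. The population values (mean of 100 minutes, SD of 30), the sample size of 25, and the 5000 repetitions are all hypothetical numbers chosen for illustration, and the population is assumed to be normal.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population values, chosen only for illustration
POP_MEAN = 100    # minutes of doomscrolling per day
POP_SD = 30
SAMPLE_SIZE = 25
N_SAMPLES = 5000  # stands in for "thousands of times"

sample_means = []
for _ in range(N_SAMPLES):
    # Collect a sample of individuals
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    # Calculate the mean for that sample and save it
    # (in a list here, rather than on a plot)
    sample_means.append(statistics.mean(sample))

# Centre and spread of the simulated sampling distribution
mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.1f}")
print(f"SD of sample means:   {sd_of_means:.1f}")
```

The saved means should cluster around the population mean, and they vary much less than individual scores do.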
We can describe every normal distribution using:
Mean - the central value
Standard deviation (SD) - the average difference from the mean
Proportions of scores at cut-off points
Around 68% of scores are within 1 SD of the mean
95% of scores are within \(\pm\) 1.96 SDs of the mean
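These proportions can be checked numerically. A minimal sketch using `statistics.NormalDist` from the Python standard library; `proportion_within` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of scores within k SDs of the mean of a normal distribution."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1_sd = proportion_within(1)
within_1_96_sd = proportion_within(1.96)

print(f"within 1 SD:     {within_1_sd:.3f}")
print(f"within 1.96 SDs: {within_1_96_sd:.3f}")
```

Because the helper works in SD units, the same proportions hold for any normal distribution, whatever its mean and SD.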
Same rules apply!
The mean of a sampling distribution will be centered on the population value
LANGUAGE CHANGE: when talking about standard deviation in the context of a sampling distribution, we call it standard error.
Standard deviation
The average difference between each score and the sample mean
Standard error
Standard deviation of sample means
The average difference between each sample mean and the population value
Standard error is a useful metric for quantifying uncertainty in estimates - it describes the extent to which samples differ from each other in a sampling distribution
We can use it to construct an interval within which a certain percentage of sample means will fall
However…
Sampling distributions don’t exist “in the wild”. They are a hypothetical statistical concept.
Remember: standard error refers to the standard deviation of the sampling distribution (created by re-sampling and computing the mean an infinite number of times), but we only have access to one sample with one mean.
Therefore, if we want to use the standard error to construct an interval, we need to estimate it from our sample.
Equation:
\[ SE = \frac{SD}{\sqrt N} \]
Translation:
\[ \text{standard error} = \frac{\text{sample standard deviation}}{\text{(the square root of) the sample size}} \]
In R:
We collect a sample of 4 individuals.
Each person reports their daily doomscrolling time (in minutes): 86, 114, 97, 107
The mean for the sample is 101 minutes
The standard deviation is:
\[ SD = \sqrt\frac{\sum(x_i - \bar{x})^2}{N - 1} = \sqrt\frac{(86-101)^2 + (114-101)^2 + (97 - 101)^2+(107-101)^2}{4 - 1} = 12.19 \]
\[ SE = \frac{SD}{\sqrt{N}} = \frac{12.19}{\sqrt{4}} = 6.095 \]
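The same calculation can be sketched in Python using only the standard library. `statistics.stdev` computes the sample standard deviation with the N - 1 denominator, which is what produces the value of 12.19; the variable names are illustrative.

```python
import math
import statistics

times = [86, 114, 97, 107]  # daily doomscrolling time in minutes

n = len(times)
sample_mean = statistics.mean(times)
# Sample standard deviation (N - 1 in the denominator)
sample_sd = statistics.stdev(times)
# Standard error of the mean: SD divided by the square root of N
standard_error = sample_sd / math.sqrt(n)

print(f"mean = {sample_mean}, SD = {sample_sd:.2f}, SE = {standard_error:.3f}")
```

The code works with the unrounded SD, so its standard error (about 6.096) differs in the third decimal place from the hand calculation, which rounds the SD to 12.19 before dividing.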
Average doomscrolling time for the sample: 101 minutes
Standard deviation: 12.19
Standard error: 6.095
\[ \text{Lower CI limit} = \text{sample mean} - 1.96 \times\text{SE} \\ \text{Upper CI limit} = \text{sample mean} + 1.96 \times\text{SE} \]
\[ \text{Lower CI limit} = 101 - 1.96 \times6.095 = 89.054\\ \text{Upper CI limit} = 101 + 1.96 \times6.095 = 112.946 \]
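As a sketch, the interval can also be computed with a small helper that looks up the critical value instead of hard-coding 1.96. `normal_ci` is a name introduced here for illustration, and the approach assumes a normal sampling distribution; the mean and standard error are the values from the worked example.

```python
from statistics import NormalDist

def normal_ci(mean, se, level=0.95):
    """Confidence interval limits, assuming a normal sampling distribution."""
    tail = (1 - level) / 2
    # Critical value leaving `tail` in each end of the distribution
    z_crit = NormalDist().inv_cdf(1 - tail)
    return mean - z_crit * se, mean + z_crit * se

sample_mean = 101       # minutes
standard_error = 6.095

lower, upper = normal_ci(sample_mean, standard_error)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

# A higher coverage level gives a wider interval
lower_99, upper_99 = normal_ci(sample_mean, standard_error, level=0.99)
print(f"99% CI: [{lower_99:.2f}, {upper_99:.2f}]")
```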
You might see in a paper…
“The average doomscrolling time in our sample was 101 minutes (SD = 12.19), 95% CI [89.05, 112.95].”
In papers:
“Error bars” on plots will often represent confidence intervals - labelled as “CI”
Sometimes they might represent standard errors - labelled as “SE” or “SEM” - always check the plot description.
Other times the authors just keep it a secret
📌 The sampling distribution of the mean will have a normal shape as long as the sample size is large enough
Smaller samples don’t approximate the normal sampling distribution very well. Because of this, we can’t rely on the value 1.96 to give us accurate intervals.
Instead, we can use the t-distribution
Looks like the normal distribution, but isn’t.
Defined by degrees of freedom (df) - calculated as N-1 (number of observations minus 1)
The “critical t value” will change for different degrees of freedom.
Instead of multiplying the standard error by 1.96, we multiply by the critical t value.
Critical t gets closer to 1.96 with larger samples - the t-distribution itself approximates the normal distribution more closely
For example, in our sample of 4, df = 4 - 1 = 3, and the critical t value for df = 3 is 3.182.
Average doomscrolling time for the sample: 101 minutes
Standard error: 6.095
Critical t value: 3.182
\[ \text{CI Limits} = \text{mean} \pm3.182 \times\text{SE} \\ \text{CI Limits} = 101 \pm3.182 \times\text{6.095} \\ \text{CI Limits} = [81.606, 120.394] \]
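A minimal sketch of the same calculation, comparing the t-based interval with the one based on 1.96. The critical t value is taken from the text rather than computed, because the Python standard library has no t-distribution; with SciPy installed, `scipy.stats.t.ppf(0.975, df)` is one way to obtain it.

```python
sample_mean = 101       # minutes
standard_error = 6.095
df = 4 - 1              # degrees of freedom: N - 1

# Critical values for a 95% interval
t_crit = 3.182          # t-distribution with df = 3 (value from the text)
z_crit = 1.96           # normal distribution

t_interval = (sample_mean - t_crit * standard_error,
              sample_mean + t_crit * standard_error)
z_interval = (sample_mean - z_crit * standard_error,
              sample_mean + z_crit * standard_error)

t_width = t_interval[1] - t_interval[0]
z_width = z_interval[1] - z_interval[0]

print(f"t-based 95% CI: [{t_interval[0]:.3f}, {t_interval[1]:.3f}]")
print(f"z-based 95% CI: [{z_interval[0]:.3f}, {z_interval[1]:.3f}]")
print(f"t-based interval is {t_width / z_width:.2f} times as wide")
```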
CIs will generally be wider in smaller samples - more uncertainty
t-distribution additionally accounts for the fact that small samples don’t always generate normal sampling distributions.
The larger the sample, the narrower the confidence intervals
note how t approaches 1.96 as the sample size (df) increases
We take samples over and over again, compute the mean for each, and construct confidence intervals around that mean - 95% of them will contain the population value, the remaining 5% will not.
This is known as an interval with 95% coverage. 95% is the most common value that we choose, but it can take on other values as well (e.g. 50%, 90%, 99%).
If we use the wrong critical value for calculation - e.g. assuming normal sampling distribution when it’s not there - the coverage will be inaccurate
I.e. we might expect 95% of CIs to contain the population value, when in reality the coverage is lower.
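Coverage can be illustrated with a simulation sketch: draw many small samples from a known population, build an interval around each sample mean, and count how often the interval contains the population value. The population values (mean 100, SD 30) and the 5000 repetitions are hypothetical choices, the population is assumed to be normal, and `covers` is a helper name introduced here.

```python
import math
import random
import statistics

random.seed(2024)  # fixed seed so the sketch is reproducible

# Hypothetical population, chosen only for illustration
POP_MEAN, POP_SD = 100, 30
N = 4              # sample size, as in the worked example
N_STUDIES = 5000   # number of repeated samples

Z_CRIT = 1.96      # critical value assuming a normal sampling distribution
T_CRIT = 3.182     # critical t value for df = N - 1 = 3

def covers(sample, critical_value):
    """True if the interval built from this sample contains POP_MEAN."""
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - critical_value * se <= POP_MEAN <= mean + critical_value * se

z_hits = 0
t_hits = 0
for _ in range(N_STUDIES):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    z_hits += covers(sample, Z_CRIT)
    t_hits += covers(sample, T_CRIT)

coverage_z = z_hits / N_STUDIES
coverage_t = t_hits / N_STUDIES

print(f"coverage using 1.96:       {coverage_z:.3f}")
print(f"coverage using critical t: {coverage_t:.3f}")
```

With samples this small, intervals built with 1.96 contain the population value noticeably less often than 95% of the time, while intervals built with the critical t value come close to the intended coverage.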
“The average doomscrolling time in our sample was 101 minutes (SD = 12.19), 95% CI [81.61, 120.39].”
Correct interpretation
ASSUMING THAT our sample is one of the 95% producing confidence intervals that contain the population value, then the population value for time spent doomscrolling per day falls somewhere between 81.61 and 120.39 minutes.
However…
There is no guarantee that the assumption above is correct! And we just have to live our lives not knowing…
Hoekstra et al. (2014) :
Both researchers and students endorsed, on average, more than three [incorrect] statements [about confidence intervals], indicating a gross misunderstanding of CIs. Self-declared experience with statistics was not related to researchers’ performance […] Researchers hardly outperformed the students, even though the students had not received any education on statistical inference whatsoever.
No:
“We can be 95% confident that the population value falls between 81.61 and 120.39.”
Also no:
“There is 95% probability that the population value falls between 81.61 and 120.39.”
More general correct interpretation
ASSUMING THAT our sample is one of the 95% producing confidence intervals that contain the population value, then the population value for the estimate of interest falls somewhere between the lower limit and the upper limit of the interval we’ve computed for our sample.
Memorise and practice!
When interpreting estimates and confidence intervals for your sample - always consider them as just one of many different possible estimates
This is why replication is important in science - our sample could easily be the one that misses the population value
Always be wary of studies placing too much certainty on a single finding
\[ SE = \frac{SD}{\sqrt N} \]
\[ \text{CI limits} = \text{mean} \pm (1.96 \times SE) \]
Putting it all into practice:
Research questions
Good and less good hypotheses
Testing hypotheses with Null Hypothesis Significance Testing
A disappointing answer to why we’re so obsessed with the value 95%.