Lecture 02
Make sure you can see all the channels on Discord:
Speaking stats
Distributions
Sampling
Sampling distribution
Central Limit Theorem
Learning to think in statistical terms is a skill just like:
Drawing/art/music
Weightlifting
Speaking a new language
You don’t need “innate talent” for any of these!
You do need (1) patience, (2) lots of practice, and (3) to take things step by step
| Skill | Language | Statistics | Year |
|---|---|---|---|
| Beginner | Learn vocabulary, basic sentences and grammar | Learn terminology, fundamental concepts | 1 |
| Intermediate | Extend to more situations, how to deal with irregular forms | Extend to more types of tests/data, how to deal with bias | 2 |
| Advanced | Create own sentences, have conversations | Create own study, apply to own data | 3 |
Treat stats (and R!) like you would learning a language
The core of generalising to new situations is grammar 😱
Generally: the rules for how to create new combinations in new situations
Not everyone’s favourite thing! But essential for fluency
Today’s lecture focuses on the “grammar” of statistics
See PaaS lectures 6 and 7 for thorough revision!
Mean
The sum of all the numbers in a set, divided by the number of numbers.
Example: The mean of 1, 4, 6, and 3 is \(\frac{1 + 4 + 6 + 3}{4}\) = 3.5
Standard Deviation (SD)
A measure of the spread of data around the mean, on average
Calculated as (roughly) the average distance of the scores from the mean: square the differences, sum them, divide by n - 1 (for a sample), and take the square root:
Example: The SD of 1, 4, 6, and 3 is
\[ \sqrt{\frac{(1-3.5)^2+(4-3.5)^2+(6-3.5)^2+(3-3.5)^2}{4-1}} = 2.08 \]
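As a quick check (a minimal R sketch; the object name x is just an example), sd() uses the n - 1 denominator and reproduces the value above:
x <- c(1, 4, 6, 3)   # the four example scores
mean(x)              # 3.5
sd(x)                # 2.081666 = sqrt(sum of squared deviations / (n - 1))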
Symbols
Greek is for populations, Latin is for samples, hat is for population estimates
| Meaning | Mean | SD |
|---|---|---|
| Population value | \(\mu\) | \(\sigma\) |
| Sample value | \(\bar{x}\) | \(s\) |
| Population estimate | \(\hat{\mu}\) | \(\hat{\sigma}\) |
Distribution
Numerically speaking, the number of observations for each value of a variable
6-sided die (unbiased)
Each value is equally likely (e.g. rolling a 6 is just as likely as rolling a 3)
Roll it 10 times, 50 times, and 1000 times - the more times we roll it, the more the distribution of rolls resembles the shape of the probability function
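A minimal simulation sketch in R (object names are just examples): roll a fair die 10, 50, and 1000 times and tabulate the counts - with more rolls, the counts per face get closer to equal, matching the flat probability function.
set.seed(1)                                    # for reproducible rolls
rolls_10   <- sample(1:6, size = 10,   replace = TRUE)
rolls_50   <- sample(1:6, size = 50,   replace = TRUE)
rolls_1000 <- sample(1:6, size = 1000, replace = TRUE)
table(rolls_10)     # counts per face - still quite lumpy
table(rolls_1000)   # roughly equal counts, close to the flat probability function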
Normal Distribution
Also called the Gaussian distribution, or the “bell curve”
The one you need to understand
Continuous, unimodal, symmetrical, and bell-shaped
It’s also about the proportions - a fixed share of values falls within a given distance of the mean
Proportions of a Normal Distribution
This essentially allows us to quantify (numerically) whether something is “unusual” or “surprising” given a particular baseline
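For example (a minimal R sketch), pnorm() gives the fixed proportions of a normal distribution lying within 1, 2, and 3 SDs of the mean:
pnorm(1) - pnorm(-1)   # ~0.68 of values within 1 SD
pnorm(2) - pnorm(-2)   # ~0.95 within 2 SDs
pnorm(3) - pnorm(-3)   # ~0.997 within 3 SDs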
How common is Charlie’s score of 57?
Shaded area: proportion of the population that attends more social events than Charlie.
Non-shaded area: proportion of people who attend fewer events than Charlie
We can convert our distribution into a standard normal distribution by standardising the scores (number of events attended)
In a standard normal distribution: \(\mu\) = 0, \(\sigma\) = 1
Standardisation
The process of transforming any distribution into one with a mean of 0 and SD of 1. Also known as the process of transforming variables into Z-scores. We can transform a score into a Z-score by subtracting the mean from the score and then dividing by the standard deviation.
For example, take scores 1, 4, 6 and 3. Their mean is M = 3.5, and SD = 2.08. We can work out the Z-scores as:
\(Z_1 = \frac{1-3.5}{2.08}\) = -1.20
\(Z_4 = \frac{4-3.5}{2.08}\) = 0.24
\(Z_6 = \frac{6-3.5}{2.08}\) = 1.20
\(Z_3 = \frac{3-3.5}{2.08}\) = -0.24
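The same calculation in R (a minimal sketch); scale() is a built-in helper that centres on the mean and divides by the n - 1 SD:
x <- c(1, 4, 6, 3)
(x - mean(x)) / sd(x)   # -1.20  0.24  1.20 -0.24
as.vector(scale(x))     # identical result using the built-in helper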
Let’s transform Charlie’s score of 57 into a Z-score:
M = 127
SD = 40
Z = (57 - 127) / 40 = -1.75
Perks of standardised scores - we know exactly how probable each Z-score is
Optional: Look up the probability of \(Z_\text{Charlie}\) = -1.75 in a Z-table - e.g. https://www.z-table.com/ (or use R)
The probability of obtaining a Z-score of -1.75 or lower is around 0.04 - so only 4% of people go to fewer social events than Charlie (non-shaded area)
The area under the curve is equal to 1 - so the remaining shaded area represents 1 - 0.04 = 0.96, or 96% (people who attend more social events per year than Charlie)
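In R (a minimal sketch), pnorm() returns this lower-tail probability directly, either from the Z-score or straight from the raw score:
pnorm(-1.75)                     # ~0.04: proportion attending fewer events than Charlie
pnorm(57, mean = 127, sd = 40)   # same answer from the raw score
1 - pnorm(-1.75)                 # ~0.96: proportion attending more events than Charlie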
Critical Value
A value that cuts off a specific proportion of a distribution
We can work backwards through Z-scores: find the Z-score that corresponds to a cumulative probability of 0.95 (the cut-off for the top 5%), then transform that Z-score back into a score on the original scale.
Charlie would need to attend 193 social events per year (that’s 3.7 events per week! 😱 ) if he wanted to be in the top 5% of event-goers.
This is 136 more events than he attends at the moment.
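In R (a minimal sketch), qnorm() does the backwards step, either on the Z scale or directly on the original scale:
z_crit <- qnorm(0.95)              # ~1.64: Z-score cutting off the top 5%
127 + z_crit * 40                  # ~192.8, i.e. about 193 events per year
qnorm(0.95, mean = 127, sd = 40)   # same critical value in one step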
Patty attends 190 events per year. Is she in the top 10% of event goers?
In a normal distribution with M = 127 (social events attended per year) and SD = 40, an individual would have to attend 178 events or more to be in the top 10% of event goers.
Therefore, with 190 events attended per year Patty is in the top 10%.
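The same backwards step gives the top-10% cut-off (a minimal R sketch):
qnorm(0.90, mean = 127, sd = 40)   # ~178.3, so Patty's 190 events clears it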
This idea of the probability of encountering a certain value, given a specific distribution, is absolutely fundamental to everything we will do this term!
For now, focus on:
Revising the logic above
Learning the definitions
We just saw the relationship between a value and its (known) distribution
Next let’s talk about the relationship between a sample statistic and its sampling distribution
Sample
A (usually randomly) selected subset of values of a particular size (e.g., 10, 50) taken from a larger pool of values, often the population
Many variables come from a normal distribution
Some variables might come from other distributions
Reaction times: log-normal distribution
Number of annual casualties due to horse kicks: Poisson distribution
Passes/fails on an exam: binomial distribution
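Base R has a random-number generator for each of these (a minimal sketch; the parameter values are just illustrative):
set.seed(1)
rnorm(5, mean = 127, sd = 40)         # normal: e.g. social events per year
rlnorm(5, meanlog = 6, sdlog = 0.3)   # log-normal: e.g. reaction times in ms
rpois(5, lambda = 0.7)                # Poisson: e.g. counts of rare events per year
rbinom(5, size = 1, prob = 0.8)       # binomial: e.g. pass (1) / fail (0) on an exam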
Statistics ( \(\bar{x}\), \(s\), etc.) of two samples will be different
Sample statistic (e.g., \(\bar{x}\)) will likely differ from the population parameter (e.g., \(\mu\))
Sampling Distribution (of the Mean)
The distribution of the means of many samples of a particular size.
The distribution is normal and centred around the true population mean, \(\mu\)
# means of 100,000 samples, each of size N = 50, drawn from a normal population
many_sample_means <- replicate(100000, mean(rnorm(50, mean = 127, sd = 40)))
mean(many_sample_means)   # the mean of the sample means sits right at the population mean
[1] 127.0259
You’ve just seen the Central Limit Theorem in action.
As N gets larger, the sampling distribution of \(\bar{x}\) tends towards a normal distribution with mean = \(\mu\)
True no matter the shape of the population distribution!
Take a sample
Compute the mean
Put it on the plot below
Repeat
We know that dice rolls are uniformly distributed - each number is equally likely
What if we calculate an average roll?
# 10,000 replications of "roll a fair die 50 times and take the mean roll"
dice_rolls_6 <- replicate(10000, mean(sample(1:6, size = 50, replace = TRUE)))
library(ggplot2)
ggplot() +
  geom_histogram(aes(x = dice_rolls_6), binwidth = 0.1, fill = "darkcyan", colour = "white") +
  labs(x = "Average roll value") +
  theme_minimal(base_size = 15)
The CLT governs a lot of processes where randomness and sampling are involved.
This is extremely useful for research - our tests are not immediately doomed if we collect a messy sample
There are many mathematically well-described distributions
Normal (Gaussian) distribution is one of them
Continuous, unimodal, symmetrical, bell-shaped
Must have the right proportions to be normal!
We can use these proportions to work out critical values
Statistics of random samples differ from parameters of a population
As N gets bigger, sample distribution approaches population distribution
Distribution of sample means (or other statistics) is the sampling distribution
Central Limit Theorem
Really important!
Sampling distribution of the mean tends to normal even if population distribution is not normal
Understanding distributions, sampling distributions and the CLT is most of what you need to understand all the stats techniques we will cover.
More sampling distributions
Quantifying uncertainty with standard errors and confidence intervals