Lecture 02
Make sure you can see all the channels on Discord:
Speaking stats
Distributions
Sampling
Sampling distribution
Central Limit Theorem
Learning to think in statistical terms is a skill just like:
Drawing/art/music
Weightlifting
Speaking a new language
You don’t need “innate talent” for any of these!
You do need (1) patience, (2) lots of practice, and (3) to take things step by step
| Skill | Language | Statistics | Year |
|---|---|---|---|
| Beginner | Learn vocabulary, basic sentences and grammar | Learn terminology, fundamental concepts | 1 |
| Intermediate | Extend to more situations, how to deal with irregular forms | Extend to more types of tests/data, how to deal with bias | 2 |
| Advanced | Create own sentences, have conversations | Create own study, apply to own data | 3 |
Treat stats (and R!) like you would learning a language
The core of generalising to new situations is grammar 😱
Generally: the rules for how to create new combinations in new situations
Not everyone’s favourite thing! But essential for fluency
Today’s lecture focuses on the “grammar” of statistics
See PaaS lectures 6 and 7 for thorough revision!
Mean
The sum of all the numbers in a set, divided by the number of numbers.
Example: The mean of 1, 4, 6, and 3 is \(\frac{1 + 4 + 6 + 3}{4}\) = 3.5
Standard Deviation (SD)
A measure of the spread of data around the mean, on average
Calculated as (roughly) the average distance of the scores from the mean: square the differences, sum them, divide by n - 1 (for a sample), and take the square root:
Example: The SD of 1, 4, 6, and 3 is
\[ \sqrt{\frac{(1-3.5)^2+(4-3.5)^2+(6-3.5)^2+(3-3.5)^2}{4-1}} = 2.08 \]
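As a quick check (a minimal R sketch; the object name x is just an example), sd() uses the n - 1 denominator and reproduces the value above:
x <- c(1, 4, 6, 3)   # the four example scores
mean(x)              # 3.5
sd(x)                # 2.081666 = sqrt(sum of squared deviations / (n - 1))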
Symbols
Greek is for populations, Latin is for samples, hat is for population estimates
| Meaning | Mean | SD |
|---|---|---|
| Population value | \(\mu\) | \(\sigma\) |
| Sample value | \(\bar{x}\) | \(s\) |
| Population estimate | \(\hat{\mu}\) | \(\hat{\sigma}\) |
Distribution
Numerically speaking, the number of observations for each value of a variable
6-sided die (unbiased)
Each value is equally likely (e.g. rolling a 6 is just as likely as rolling a 3)
Roll it 10 times, 50 times, and 1000 times - the more times we roll it, the more the distribution of rolls resembles the shape of the probability function
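A minimal simulation sketch in R (object names are just examples): roll a fair die 10, 50, and 1000 times and tabulate the counts - with more rolls, the counts per face get closer to equal, matching the flat probability function.
set.seed(1)                                    # for reproducible rolls
rolls_10   <- sample(1:6, size = 10,   replace = TRUE)
rolls_50   <- sample(1:6, size = 50,   replace = TRUE)
rolls_1000 <- sample(1:6, size = 1000, replace = TRUE)
table(rolls_10)     # counts per face - still quite lumpy
table(rolls_1000)   # roughly equal counts, close to the flat probability function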
Normal Distribution
Also called the Gaussian distribution, or the “bell curve”
The one you need to understand
Continuous, unimodal, symmetrical, and bell-shaped
It’s also about the proportions - a fixed share of values falls within a given distance of the mean
Proportions of a Normal Distribution
This essentially allows us to quantify (numerically) whether something is “unusual” or “surprising” given a particular baseline
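For example (a minimal R sketch), pnorm() gives the fixed proportions of a normal distribution lying within 1, 2, and 3 SDs of the mean:
pnorm(1) - pnorm(-1)   # ~0.68 of values within 1 SD
pnorm(2) - pnorm(-2)   # ~0.95 within 2 SDs
pnorm(3) - pnorm(-3)   # ~0.997 within 3 SDs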
How common is Charlie’s score of 57?
Shaded area: proportion of the population that attends more social events than Charlie.
Non-shaded area: proportion of people who attend fewer events than Charlie
We can convert our distribution into a standard normal distribution by standardising the scores (number of events attended)
In a standard normal distribution: \(\mu\) = 0, \(\sigma\) = 1
Standardisation
The process of transforming any distribution into one with a mean of 0 and SD of 1. Also known as the process of transforming variables into Z-scores. We can transform a score into a Z-score by subtracting the mean from the score and then dividing by the standard deviation.
For example, take scores 1, 4, 6 and 3. Their mean is M = 3.5, and SD = 2.08. We can work out the Z-scores as:
\(Z_1 = \frac{1-3.5}{2.08}\) = -1.20
\(Z_4 = \frac{4-3.5}{2.08}\) = 0.24
\(Z_6 = \frac{6-3.5}{2.08}\) = 1.20
\(Z_3 = \frac{3-3.5}{2.08}\) = -0.24
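The same calculation in R (a minimal sketch); scale() is a built-in helper that centres on the mean and divides by the n - 1 SD:
x <- c(1, 4, 6, 3)
(x - mean(x)) / sd(x)   # -1.20  0.24  1.20 -0.24
as.vector(scale(x))     # identical result using the built-in helper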
Let’s transform Charlie’s score of 57 into a Z-score:
M = 127
SD = 40
Z = (57 - 127) / 40 = -1.75
Perks of standardised scores - we know exactly how probable each Z-score is
Optional: Look up the probability of \(Z_\text{Charlie}\) = -1.75 in a Z-table - e.g. https://www.z-table.com/ (or use R)
The probability of obtaining a Z-score of -1.75 or lower is around 0.04 - so only 4% of people go to fewer social events than Charlie (non-shaded area)
The area under the curve is equal to 1 - so the remaining shaded area represents 1 - 0.04 = 0.96, or 96% (people who attend more social events per year than Charlie)
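In R (a minimal sketch), pnorm() returns this lower-tail probability directly, either from the Z-score or straight from the raw score:
pnorm(-1.75)                     # ~0.04: proportion attending fewer events than Charlie
pnorm(57, mean = 127, sd = 40)   # same answer from the raw score
1 - pnorm(-1.75)                 # ~0.96: proportion attending more events than Charlie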
Critical Value
A value that cuts off a specific proportion of a distribution
We can work backwards through Z-scores: find the Z-score that corresponds to a cumulative probability of 0.95 (the cut-off for the top 5%), then transform that Z-score back into a score on the original scale.
Charlie would need to attend 193 social events per year (that’s 3.7 events per week! 😱 ) if he wanted to be in the top 5% of event-goers.
This is 136 more events than he attends at the moment.
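In R (a minimal sketch), qnorm() does the backwards step, either on the Z scale or directly on the original scale:
z_crit <- qnorm(0.95)              # ~1.64: Z-score cutting off the top 5%
127 + z_crit * 40                  # ~192.8, i.e. about 193 events per year
qnorm(0.95, mean = 127, sd = 40)   # same critical value in one step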
Patty attends 190 events per year. Is she in the top 10% of event goers?
In a normal distribution with M = 127 (social events attended per year) and SD = 40, an individual would have to attend 178 events or more to be in the top 10% of event goers.
Therefore, with 190 events attended per year Patty is in the top 10%.
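The same backwards step gives the top-10% cut-off (a minimal R sketch):
qnorm(0.90, mean = 127, sd = 40)   # ~178.3, so Patty's 190 events clears it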
This idea of the probability of encountering a certain value, given a specific distribution, is absolutely fundamental to everything we will do this term!
For now, focus on:
Revising the logic above
Learning the definitions
We just saw the relationship between a value and its (known) distribution
Next let’s talk about the relationship between a sample statistic and its sampling distribution
Sample
A (usually randomly) selected subset of values of a particular size (e.g., 10, 50) taken from a larger pool of values, often the population
Many variables come from a normal distribution
Some variables might come from other distributions
Reaction times: log-normal distribution
Number of annual casualties due to horse kicks: Poisson distribution
Passes/fails on an exam: binomial distribution
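Base R has a random-number generator for each of these (a minimal sketch; the parameter values are just illustrative):
set.seed(1)
rnorm(5, mean = 127, sd = 40)         # normal: e.g. social events per year
rlnorm(5, meanlog = 6, sdlog = 0.3)   # log-normal: e.g. reaction times in ms
rpois(5, lambda = 0.7)                # Poisson: e.g. counts of rare events per year
rbinom(5, size = 1, prob = 0.8)       # binomial: e.g. pass (1) / fail (0) on an exam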
Statistics ( \(\bar{x}\), \(s\), etc.) of two samples will be different
Sample statistic (e.g., \(\bar{x}\)) will likely differ from the population parameter (e.g., \(\mu\))
Sampling Distribution (of the Mean)
The distribution of the means of many samples of a particular size.
The distribution is normal and centred around the true population mean, \(\mu\)
# means of 100,000 samples, each of size N = 50, drawn from a normal population
many_sample_means <- replicate(100000, mean(rnorm(50, mean = 127, sd = 40)))
mean(many_sample_means)   # the mean of the sample means sits right at the population mean
[1] 127.0259
You’ve just seen the Central Limit Theorem in action.
As N gets larger, the sampling distribution of \(\bar{x}\) tends towards a normal distribution with mean = \(\mu\)
True no matter the shape of the population distribution!
Take a sample
Compute the mean
Put it on the plot below
Repeat
We know that dice rolls are uniformly distributed - each number is equally likely
What if we calculate an average roll?
# 10,000 replications of "roll a fair die 50 times and take the mean roll"
dice_rolls_6 <- replicate(10000, mean(sample(1:6, size = 50, replace = TRUE)))
library(ggplot2)
ggplot() +
  geom_histogram(aes(x = dice_rolls_6), binwidth = 0.1, fill = "darkcyan", colour = "white") +
  labs(x = "Average roll value") +
  theme_minimal(base_size = 15)
The CLT governs a lot of processes where randomness and sampling are involved.
This is extremely useful for research - our tests are not immediately doomed if we collect a messy sample
There are many mathematically well-described distributions
Normal (Gaussian) distribution is one of them
Continuous, unimodal, symmetrical, bell-shaped
Must have the right proportions to be normal!
We can use these proportions to work out critical values
Statistics of random samples differ from parameters of a population
As N gets bigger, sample distribution approaches population distribution
Distribution of sample means (or other statistics) is the sampling distribution
Central Limit Theorem
Really important!
Sampling distribution of the mean tends to normal even if population distribution is not normal
Understanding distributions, sampling distributions and the CLT is most of what you need to understand all the stats techniques we will cover.
More sampling distributions
Quantifying uncertainty with standard errors and confidence intervals