Week 08
The story so far…
This week:
Coming up after the break:
“Is it bad if I don’t understand anything from the lectures?”
After this lecture, you will (begin to) understand:
What a statistical model is and why they are useful
The equation for a linear model with one predictor
\(b_0\) (the intercept)
\(b_1\) (the slope)
How to use the equation to predict an outcome
How to read scatterplots and lines of best fit
Talk to Me!
Open the Lecture Google Doc: bit.ly/and24_lecture08
Vocabulary: The General Model Equation
A conceptual representation of all statistical models, with the following form:
\[outcome = model + error\]
We can use models to predict the outcome for a particular case
The model itself is a mathematical formula that makes assumptions about the properties of a population
This is always subject to some degree of error
Why might predictive models like this be useful? Can you think of any recent examples?
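Though we haven't fitted a real model yet, a minimal R sketch of this identity, using made-up scores, may help:

```r
# Conceptual sketch with made-up scores: outcome = model + error
outcome <- c(5, 7, 9)          # observed scores for three cases
model   <- c(5.5, 7.0, 8.5)    # what a model predicts for each case
error   <- outcome - model     # whatever the model gets wrong
all(outcome == model + error)  # TRUE: the identity always holds
```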
The linear model is a fundamental and extremely common statistical testing paradigm
Most statistical tests (e.g., t-tests and chi-squared) are some form of linear model
Allows us to predict an outcome from one or more predictor variables
Our first (explicit) contact with statistical modeling
… and our variables merely players!
Vocabulary: Predictor(s)
The variable(s) that you hypothesise will predict the outcome.
In experimental studies, this is usually the treatment variable (the thing that is manipulated).
In observational studies (e.g., cross-sectional surveys), this will be one of the variables you’ve measured.
Also - and commonly in experimental research - called the independent variable, or IV.
Vocabulary: Outcome
The variable that we hypothesise will vary, depending on the predictor.
In experimental studies, this is what you think will change because of your manipulation.
In observational studies (e.g., cross-sectional surveys), this will also be one of the variables you’ve measured!
Also - and commonly in experimental research - called the dependent variable, or DV.
The linear model is the equation for a straight line:
\[ y = mx + b \]
It is usually written like this:
\[y_{i} = b_{0} + b_{1} x_{1i} + e_{i}\]
I will write it in full for now, so you can get used to it:
\[y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\]
| Term | Meaning |
|------|---------|
| \(y_i =\) | The outcome (\(y\)) for an individual's actual score (\(i\)) is equal to (or, is predicted by)… |
| \(b_0\) | … the value of beta-zero (the model's intercept)… |
| \(+\) | … plus… |
| \(b_1\) | … the value of beta-one (the model's slope)… |
| \(\times\) | … multiplied by… |
| \(x_{1i}\) | … the value of the predictor (\(x_1\)) for an individual's actual score (\(i\))… |
| \(+\) | … plus… |
| \(e_i\) | … the error (\(e\)) for the individual's actual score (\(i\)). |
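To see how the right-hand side of the equation behaves, here is a small sketch that writes its deterministic part as an R function (the names predict_y, b0, and b1 are illustrative, not from the lecture):

```r
# The deterministic part of the linear model as an R function.
# The error e_i is whatever is left over after prediction, so it is
# not something we compute in advance.
predict_y <- function(x, b0, b1) {
  b0 + b1 * x  # intercept plus slope times the predictor
}

# For example, with an intercept of 2 and a slope of 0.5:
predict_y(x = 4, b0 = 2, b1 = 0.5)  # 2 + 0.5 * 4 = 4
```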
Example 1: Predicting Masculinity from Femininity
A recognisable example (from your correlation lecture)
Visual, approximate representation (so you can get a sense of where the numbers come from)
Computational, precise calculations (where we actually get the numbers from)
Example 2: Predicting Better Sleep from Positive Psychology
Dr Mankin was interested in the relationship between femininity and masculinity
Participants took part in a cross-sectional, self-report survey that asked them to rate their:
Femininity
Masculinity
& a bunch of other things not relevant for today’s example
ChallengR: Why Not Correlation?
You have already seen this data, and this relationship, in the correlation analysis you did in a previous lecture.
Why are we doing something different? What do we get from the linear model that we don’t get from our correlation analysis?
We can make some guesses based on the plot:
The line would cross the y-axis (i.e., where femininity = 0) somewhere between 8 and 9
For every unit increase on the femininity (predictor, \(x\)) scale, masculinity (outcome, \(y\)) decreases by a little less than one point
We can then plug those values into our model…
We have already plugged in our outcome (masculinity) and predictor (femininity):
\[ Masculinity_i = b_0 + b_1\times Femininity_{1i} + e_i \]
We also know the intercept (aka \(b_0\), aka the predicted value of masculinity when femininity is 0) is \(\approx\) 8.5, so we can plug that in:
\[ Masculinity_i = \hat{8.5} + b_1\times Femininity_{1i} + e_i \]
We also know the slope (aka \(b_1\), aka the change in masculinity associated with a unit change in femininity) is \(\approx\) -0.8, so we can plug that in (note the sign change):
\[ Masculinity_i = \hat{8.5} - \hat{0.8}\times Femininity_{1i} + e_i \]
Before we use the equation to predict masculinity, let’s get the real beta values from R…
```
Call:
lm(formula = gender_masc ~ gender_fem, data = gensex)

Coefficients:
(Intercept)   gender_fem
     8.8246      -0.7976
```
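The Call: line above shows where this output comes from; assuming the gensex data are already loaded, the fitting code looks like this (masc_lm is an illustrative name):

```r
# Fit a linear model predicting masculinity from femininity
masc_lm <- lm(gender_masc ~ gender_fem, data = gensex)
masc_lm        # prints the Call and Coefficients shown above
coef(masc_lm)  # or extract just b0 (intercept) and b1 (slope)
```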
Adapt our equation to include the real \(b\) values:
Intercept (\(b_{0}\)): the predicted value of masculinity when femininity is 0
Slope (\(b_{1}\)): change in masculinity associated with a unit change in femininity
\[Masculinity_i = \hat{8.82} - \hat{0.8}\times Femininity_{1i} + e_i\]
For someone with a fairly low femininity rating of 3 (on a scale of 1-9):
\[Masculinity_i = \hat{8.82} - \hat{0.8}\times 3 + e_i\]
\[Masculinity_i = 6.42 + e_i\]
For someone with a fairly high femininity rating of 8 (on a scale of 1-9):
\[Masculinity_i = \hat{8.82} - \hat{0.8}\times 8 + e_i\]
\[Masculinity_i = 2.42 + e_i\]
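As a check, here is the same arithmetic in R, alongside R's own predict() function (using the hypothetical masc_lm object sketched earlier):

```r
# By hand, with the rounded coefficients from the slides:
8.82 - 0.8 * 3  # 6.42
8.82 - 0.8 * 8  # 2.42

# Or let predict() use the full-precision coefficients:
predict(masc_lm, newdata = data.frame(gender_fem = c(3, 8)))
# ~6.43 and ~2.44; slightly different because nothing was rounded
```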
ChallengR: Why Not Correlation?
You have already seen this data, and this relationship, in the correlation analysis you did in a previous lecture.
Why are we doing something different? What do we get from the linear model that we don’t get from our correlation analysis?
Tout et al. (2023) were interested in the effect of positive psychology upon sleep
Participants took part in a cross-sectional, self-report survey that asked them to rate their:
Positive psychology attributes (a composite of gratitude, optimism, self-compassion, and mindfulness)
Sleep quality and quantity (a composite of subjective sleep quality, sleep latency, sleep duration, sleep efficiency, sleep disturbances, sleep medication, daytime dysfunction)
& a bunch of other things not relevant for today’s example
Hypothesis: Positive psychology attributes are associated with better sleep
Predictor (\(x_1\)): Positive psychology attributes
Outcome (\(y\)) : Sleep quality & quantity
Where would our predictor and outcome fit into the linear model equation: \(y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\)?
Talk to Me!
Open the Lecture Google Doc: bit.ly/and24_lecture08
Model: \(Sleep_{i} = b_{0} + b_{1}\times PositivePsychology_{1i} + e_{i}\)
What about \(b_0\) and \(b_1\)…?
The individual scores (data points) tend to be higher up on the right and lower down on the left
As the variable on \(x\) (here, positive psychology attributes) increases, the variable on \(y\) (here, sleep) also increases
\[ Sleep_i = b_0 + b_1\times PositivePsychology_{1i} + e_i \]
Predictor (\(x_1\)): Positive psychology attributes
Outcome (\(y\)) : Sleep quality & quantity
Intercept (\(b_{0}\)): the predicted value of sleep when positive psychology is 0
Slope (\(b_{1}\)): change in sleep associated with a unit change in positive psychology
Before we plug the intercept and slope into the equation, let’s get the more precise beta values from R…
```
Call:
lm(formula = sleep ~ pos_psy, data = sleep_tib)

Coefficients:
(Intercept)      pos_psy
      3.520        2.287
```
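As before, the Call: line shows where this output comes from; assuming sleep_tib is loaded, the fitting code looks like this (sleep_lm is an illustrative name):

```r
# Fit a linear model predicting sleep from positive psychology
sleep_lm <- lm(sleep ~ pos_psy, data = sleep_tib)
sleep_lm  # prints the Call and Coefficients shown above
```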
Can you adapt our equation to include the real \(b\) values?
\[ Sleep_i = b_0 + b_1\times PositivePsychology_{1i} + e_i \]
Intercept (\(b_{0}\)): the predicted value of sleep when positive psychology is 0
Slope (\(b_{1}\)): change in sleep associated with a unit change in positive psychology
\[Sleep_i = \hat{3.52} + \hat{2.29}\times PositivePsychology_{1i} + e_i\]
For someone with a fairly low positive psychology rating of 1.5 (on a scale of 1-5):
\[Sleep_i = \hat{3.52} + \hat{2.29}\times 1.5 + e_i\]
\[Sleep_i = 6.95 + e_i\]
For someone with a fairly high positive psychology rating of 4 (on a scale of 1-5):
\[Sleep_i = \hat{3.52} + \hat{2.29}\times 4 + e_i\]
\[Sleep_i = 12.68 + e_i\]
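The same predictions in R, by hand and via predict() (using the hypothetical sleep_lm object sketched earlier):

```r
# By hand, with the rounded coefficients from the slides:
3.52 + 2.29 * 1.5  # 6.955, i.e. ~6.95
3.52 + 2.29 * 4    # 12.68

# Or with the full-precision coefficients:
predict(sleep_lm, newdata = data.frame(pos_psy = c(1.5, 4)))
# ~6.95 and ~12.67; tiny differences are just rounding
```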
Vocabulary: The Linear Model
\[y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\]
A statistical model representing the linear relationship between a predictor and outcome
\(y_{i}\): the (actual value of the) outcome
\(x_{1i}\): the (actual value of the) predictor
\(b_{0}\): the intercept - the value of the outcome when the predictor is 0
\(b_{1}\): the slope - the change in the outcome for every unit change in the predictor
\(e_{i}\): the (unknown and unknowable) error in prediction
lm()
The linear model (lm()) will be crucial for the rest of your degree
If that was a bit of a blur to you, it’s highly recommended that you spend some time working through it slowly, until it clicks.
These sources may be helpful:
Learning Statistics with R - see Section V, Chapter 15, Linear Regression
Andy Field’s statistics textbooks (SPSS version is fine, edition 4 onwards)
Next week (after the break):
Recap of the Linear Model
How do we know if it is a good model?
How do we know if it is a good prediction?