Linear Model 1: A New Equation

Week 08

Jenny Terry

Looking Ahead (and Behind)

  • The story so far…

    • Fundamentals of NHST & Statistical Tests
  • This week:

    • The Linear Model - Equation of a Line
  • Coming up after the break:

    • The Linear Model - Evaluating the Model with p-values, CIs, \(F\) & \(R^2\)
    • The Linear Model - Models with Multiple Predictors
    • Questionable Research Practices

“Is it bad if I don’t understand anything from the lectures?”

Today’s Objectives

After this lecture, you will (begin to) understand:

  • What a statistical model is and why they are useful

  • The equation for a linear model with one predictor

    • \(b_0\) (the intercept)

    • \(b_1\) (the slope)

  • How to use the equation to predict an outcome

  • How to read scatterplots and lines of best fit

Talk to Me!

Open the Lecture Google Doc: bit.ly/and24_lecture08

Statistical Models

Vocabulary: The General Model Equation

A conceptual representation of all statistical models, with the following form:

\[outcome = model + error\]

  • We can use models to predict the outcome for a particular case

  • The model itself is a mathematical representation (i.e., a formula) that makes assumptions about the properties of a population

  • This is always subject to some degree of error

  • Why might predictive models like this be useful? Can you think of any recent examples?

One Equation to Rule Them All!

  • The linear model is a fundamental and extremely common statistical modelling framework

  • Most common statistical tests (e.g., t-tests and chi-squared tests) can be expressed as some form of linear model

  • Allows us to predict an outcome from one or more predictor variables

  • Our first (explicit) contact with statistical modeling

All the World’s a Model…

… and our variables merely players!

Vocabulary: Predictor(s)

The variable(s) that you hypothesise will predict the outcome.

In experimental studies, this is usually the treatment variable (the thing that is manipulated).

In observational studies (e.g., cross-sectional surveys), this will be one of the variables you’ve measured.

Also - and commonly in experimental research - called the independent variable, or IV.

Vocabulary: Outcome

The variable that we hypothesise will vary, depending on the predictor.

In experimental studies, this is what you think will change because of your manipulation.

In observational studies (e.g., cross-sectional surveys), this will also be one of the variables you’ve measured!

Also - and commonly in experimental research - called the dependent variable, or DV.

The Linear Model Equation

The linear model is the equation for a straight line:

\[ y = mx + b \]

It is usually written like this:

\[y_{i} = b_{0} + b_{1} x_{1i} + e_{i}\]

I will write it in full for now, so you can get used to it:

\[y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\]

The Linear Model Equation

\[y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\]

Reading the equation term by term:

  • \(y_i\): the outcome (\(y\)) for an individual’s actual score (\(i\)) is equal to (or, is predicted by)…
  • \(b_0\): … the value of beta-zero (the model’s intercept)…
  • \(+\): … plus…
  • \(b_1\): … the value of beta-one (the model’s slope)…
  • \(\times\): … multiplied by…
  • \(x_{1i}\): … the value of the predictor (\(x_1\)) for that individual’s actual score (\(i\))…
  • \(+\): … plus…
  • \(e_i\): … the error (\(e\)) for that individual’s actual score (\(i\))
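The equation translates directly into code. Here is a minimal sketch with made-up numbers (purely for illustration; these are not values from any real model):

```r
# Hypothetical coefficients, purely for illustration
b0 <- 2    # intercept: predicted y when x = 0
b1 <- 0.5  # slope: change in y for every unit increase in x
x1 <- 4    # a chosen value of the predictor

# Predicted outcome, ignoring the error term (which we can't observe directly)
y_hat <- b0 + b1 * x1
y_hat
#> [1] 4
```

Changing `x1` and re-running the last line predicts the outcome for any value of the predictor.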

Using the Linear Model to Make Predictions

Overview

  • Example 1: Predicting Masculinity from Femininity

    • A recognisable example (from your correlation lecture)

    • Visual, approximate representation (so you can get a sense of where the numbers come from)

    • Computational, precise calculations (where we actually get the numbers from)

  • Example 2: Predicting Better Sleep from Positive Psychology

    • A new example for extrapolation
    • Visual, approximate representation (so you can get a sense of where the numbers come from)
    • Computational, precise calculations (where we actually get the numbers from)

Example 1: Predicting Masculinity from Femininity

  • Dr Mankin was interested in the relationship between femininity and masculinity

  • Participants took part in a cross-sectional, self-report survey that asked them to rate their:

    • Femininity

    • Masculinity

    • & a bunch of other things not relevant for today’s example

  • Hypothesis: Previous research (your correlation lecture!) suggests that… femininity will have a negative relationship with masculinity

Operationalisation

  • Hypothesis: Femininity will have a negative relationship with masculinity
  • Predictor (\(x_1\)): Femininity
  • Outcome (\(y\)): Masculinity
  • Model: \(Masculinity_{i} = b_{0} + b_{1}\times Femininity_{1i} + e_{i}\)
    • Masculinity doesn’t have a value because that is what we’re estimating
    • Femininity will be given a value, but we can pick different values and plug them in to solve the equation to get the value of masculinity for whatever value of femininity we choose
    • We’re not estimating error for now, so we don’t need to worry about that
    • But, where do the \(b_0\) and \(b_1\) come from?! 🤔
    • Hint: Remember that the linear model is the equation for a straight line…

Visualising our Model (the Line)

  • Where would you draw a straight line through these dots to best capture where they tend to fall?

Visualising our Model (the Line)

  • The line is our statistical model - it is not the data itself, but it is using the data to make a prediction

Visualising our Model (the Line)

  • The individual scores (data points) tend to be higher up on the left and lower down on the right
  • As the variable on \(x\) (here, ratings of femininity) increases…
  • … the variable on \(y\) (here, ratings of masculinity) tends to decrease
  • This represents a negative relationship between \(x\) and \(y\) - as one goes up, the other goes down

ChallengR: Why Not Correlation?

You already saw this same data, and relationship, with the correlation analysis you did with this data in a previous lecture.

Why are we doing something different? What do we get from the linear model that we don’t get from our correlation analysis?

Visualising our Model (the Line)

  • \(b_0\) - the intercept (the value of \(y\) where the line crosses the y-axis, i.e., where \(x = 0\))
  • \(b_1\) - the slope (the gradient of the line - the change in \(y\) for every unit increase in \(x\))
  • What would we estimate these values to be?

Estimating the Slope

Using the Model to Predict Masculinity

We can make some guesses based on the plot:

  • The line would cross the y-axis (i.e., where \(x = 0\)) somewhere between 8 and 9

    • \(b_{0} \approx 8.5\)
  • For every unit increase on the femininity (predictor, \(x\)) scale, masculinity (outcome, \(y\)) decreases by a little less than one point

    • \(b_{1} \approx -0.8\)

Using the Model to Predict Masculinity

We can then plug those values into our model…

We have already plugged in our outcome (masculinity) and predictor (femininity):

\[ Masculinity_i = b_0 + b_1\times Femininity_{1i} + e_i \]

We also know the intercept (aka \(b_0\), aka the predicted value of masculinity when femininity is 0) is \(\approx\) 8.5, so we can plug that in:

\[ Masculinity_i = \hat{8.5} + b_1\times Femininity_{1i} + e_i \]

We also know the slope (aka \(b_1\), aka the change in masculinity associated with a unit change in femininity) is \(\approx\) -0.8, so we can plug that in (note the sign change):

\[ Masculinity_i = \hat{8.5} - \hat{0.8}\times Femininity_{1i} + e_i \]

Using the Model to Predict Masculinity

Before we use the equation to predict masculinity, let’s get the real beta values from R…


Call:
lm(formula = gender_masc ~ gender_fem, data = gensex)

Coefficients:
(Intercept)   gender_fem  
     8.8246      -0.7976  
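The lecture’s `gensex` data isn’t reproduced here, but the same workflow can be sketched with simulated data in which we build in a similar negative relationship (all numbers below are made up for illustration):

```r
set.seed(42)

# Simulate femininity ratings and masculinity scores with a known
# negative relationship: intercept 8.8, slope -0.8, plus random noise
femininity  <- runif(200, min = 1, max = 9)
masculinity <- 8.8 - 0.8 * femininity + rnorm(200, sd = 0.5)

# Fit the linear model and extract the estimated coefficients
sim_lm <- lm(masculinity ~ femininity)
coef(sim_lm)  # intercept close to 8.8, slope close to -0.8
```

Because the model is estimated from noisy data, the fitted coefficients will be close to, but not exactly, the values we built in.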

Adapt our equation to include the real \(b\) values:

  • Intercept (\(b_{0}\)): the predicted value of masculinity when femininity is 0

    • = 8.82
  • Slope (\(b_{1}\)): change in masculinity associated with a unit change in femininity

    • = -0.80

\[Masculinity_i = \hat{8.82} - \hat{0.8}\times Femininity_{1i} + e_i\]

Using the Model to Predict Masculinity

\[Masculinity_i = \hat{8.82} - \hat{0.8}\times Femininity_{1i} + e_i\]

For someone with a fairly low (on a scale of 1-9) femininity rating of 3:

\[Masculinity_i = \hat{8.82} - \hat{0.8}\times 3 + e_i\]

\[Masculinity_i = 6.42 + e_i\]

For someone with a fairly high (on a scale of 1-9) femininity rating of 8:

\[Masculinity_i = \hat{8.82} - \hat{0.8}\times 8 + e_i\]

\[Masculinity_i = 2.42 + e_i\]
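The same predictions can be checked in R using the unrounded coefficients from the `lm()` output (the small differences from the slide values are just rounding):

```r
# Coefficients from the fitted model shown above
b0 <- 8.8246
b1 <- -0.7976

# Predicted masculinity for femininity ratings of 3 and 8
b0 + b1 * 3
#> [1] 6.4318
b0 + b1 * 8
#> [1] 2.4438
```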

Why Not Correlation?

ChallengR: Why Not Correlation?

You already saw this same data, and relationship, with the correlation analysis you did with this data in a previous lecture.

Why are we doing something different? What do we get from the linear model that we don’t get from our correlation analysis?

  • Both correlation and the linear model can tell us about the strength and direction of the relationship…
  • … but only the linear model can predict the outcome for any value of the predictor

Example 2: Predicting Better Sleep from Positive Psychology

  • Tout et al. (2023) were interested in the effect of positive psychology on sleep

  • Participants took part in a cross-sectional, self-report survey that asked them to rate their:

    • Positive psychology attributes (a composite of gratitude, optimism, self-compassion, and mindfulness)

    • Sleep quality and quantity (a composite of subjective sleep quality, sleep literacy, sleep duration, sleep efficiency, sleep disturbances, sleep medication, daytime dysfunction)

    • & a bunch of other things not relevant for today’s example

  • Hypothesis: Based on previous evidence that positive psychology attributes benefit sleep, Tout et al. hypothesised that… positive psychology attributes will have a positive relationship with sleep quality and quantity

Operationalisation

  • Hypothesis: Positive psychology attributes are associated with better sleep

  • Predictor (\(x_1\)): Positive psychology attributes

  • Outcome (\(y\)): Sleep quality & quantity

  • Where would our predictor and outcome fit into the linear model equation: \(y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\)?

Talk to Me!

Open the Lecture Google Doc: bit.ly/and24_lecture08

  • Model: \(Sleep_{i} = b_{0} + b_{1}\times PositivePsychology_{1i} + e_{i}\)

  • What about \(b_0\) and \(b_1\)…?

Visualising our Model (the Line)

  • Where would you draw a line through these dots that best captures where they tend to fall?

Visualising our Model (the Line)

  • Is this a positive or a negative relationship? How do we know?

Visualising our Model (the Line)

  • The individual scores (data points) tend to be higher up on the right and lower down on the left

  • As the variable on \(x\) (here, positive psychology attributes) increases…

  • … the variable on \(y\) (here, sleep quality & quantity) also increases
  • This represents a positive relationship between \(x\) and \(y\): as one goes up, the other goes up

Visualising our Model (the Line)

  • Two key elements:
    • \(b_0\) - the intercept (the value of \(y\) where the line crosses the y-axis, i.e., where \(x = 0\))
    • \(b_1\) - the slope (the gradient of the line - the change in \(y\) for every unit increase in \(x\))
  • What would you estimate these values to be?

Estimating the Intercept

Estimating the Slope

Using the Model to Predict Sleep

\[ Sleep_{i} = b_{0} + b_{1}\times PositivePsychology_{1i} + e_{i} \]

  • Predictor (\(x_1\)): Positive psychology attributes

  • Outcome (\(y\)): Sleep quality & quantity

  • Intercept (\(b_{0}\)): the predicted value of sleep when positive psychology is 0

    • \(\approx\) 3.5
  • Slope (\(b_{1}\)): change in sleep associated with a unit change in positive psychology

    • \(\approx\) 2.2

Using the Model to Predict Sleep

Before we plug the intercept and slope into the equation, let’s get the more precise beta values from R…


Call:
lm(formula = sleep ~ pos_psy, data = sleep_tib)

Coefficients:
(Intercept)      pos_psy  
      3.520        2.287  

Can you adapt our equation to include the real \(b\) values?

\[ Sleep_{i} = b_{0} + b_{1}\times PositivePsychology_{1i} + e_{i} \]

  • Intercept (\(b_{0}\)): the predicted value of sleep when positive psychology is 0

    • = 3.52
  • Slope (\(b_{1}\)): change in sleep associated with a unit change in positive psychology

    • = 2.29

Using the Model to Predict Sleep

\[Sleep_i = \hat{3.52} + \hat{2.29}\times PositivePsychology_{1i} + e_i\]

For someone with a fairly low positive psychology rating (on a scale of 1-5) of 1.5:

\[Sleep_i = \hat{3.52} + \hat{2.29}\times 1.5 + e_i\]

\[Sleep_i = 6.95 + e_i\]

For someone with a fairly high positive psychology rating (on a scale of 1-5) of 4:

\[Sleep_i = \hat{3.52} + \hat{2.29}\times 4 + e_i\]

\[Sleep_i = 12.68 + e_i\]
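As with the first example, these predictions can be reproduced in R. A small helper function (hypothetical, using the unrounded coefficients from the `lm()` output) makes it easy to try any predictor value; tiny differences from the slide values are just rounding:

```r
# Coefficients from the fitted model shown above
b0 <- 3.520
b1 <- 2.287

# Hypothetical helper: predicted sleep for a given positive psychology score
predict_sleep <- function(pos_psy) b0 + b1 * pos_psy

predict_sleep(1.5)
#> [1] 6.9505
predict_sleep(4)
#> [1] 12.668
```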

Summary

Vocabulary: The Linear Model

\[y_{i} = b_{0} + b_{1}\times x_{1i} + e_{i}\]

  • A statistical model representing the linear relationship between a predictor and outcome

  • \(y_{i}\): the (predicted value of the) outcome

    • What we’re trying to estimate
  • \(x_{1i}\): the (actual value of the) predictor

    • We can plug in any value of the predictor
  • \(b_{0}\): the intercept - the value of the outcome when the predictor is 0

    • Obtained from our model
  • \(b_{1}\): the slope - the change in the outcome for every unit change in the predictor

    • Obtained from our model
  • \(e_{i}\): the (unknown and unknowable) error in prediction

    • More on error next year

Welcome to the World of lm()

  • The linear model (lm()) will be crucial for the rest of your degree

  • If that was a bit of a blur to you, it’s highly recommended that you spend some time working through it slowly, until it clicks.

  • These sources may be helpful:

  • Next week (after the break):

    • Recap of the Linear Model

    • How do we know if it is a good model?

    • How do we know if it is a good prediction?