R Packages/Data for This Session

You should’ve already installed the R packages for this lab session. {tidyverse} will be for all things workflow and {stevedata} will have the data sets we’ll be using today.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.1     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stevedata)

# For those of you without `{tidyverse}`, do this:
# library(dplyr)
# library(tibble)
# library(magrittr)
# library(forcats)
# library(ggplot2)

Building Toward Multiple Regression

Last time we had a lab, we just focused on bivariate linear model with fake data that we created. The goal of that exercise was to show you what the linear model looked like when you had maximum control over the model parameters. We’ll use some actual data this time. This is a survey data set from the United Kingdom in 2018-19, largely probing individual attitudes toward immigration. You can find out more about this data set by typing ?ESS9GB into Rstudio. You can also go here for a web version of the codebook.

ESS9GB

## # A tibble: 1,905 × 19
##    name      essround edition  idno cntry region   brncntr stintrvw   endintrvw 
##    <chr>        <dbl> <chr>   <dbl> <chr> <chr>      <dbl> <date>     <date>    
##  1 ESS9e01_2        9 1.2        39 GB    West Mi…       1 2018-09-11 2018-09-11
##  2 ESS9e01_2        9 1.2       135 GB    South W…       1 2018-09-07 2018-09-07
##  3 ESS9e01_2        9 1.2       142 GB    South E…       1 2018-11-03 2018-11-03
##  4 ESS9e01_2        9 1.2       146 GB    Norther…       1 2018-10-06 2018-10-17
##  5 ESS9e01_2        9 1.2       164 GB    Norther…       1 2018-10-31 2018-10-31
##  6 ESS9e01_2        9 1.2       191 GB    Norther…       1 2018-10-27 2018-10-27
##  7 ESS9e01_2        9 1.2       213 GB    South W…       1 2018-11-08 2018-11-08
##  8 ESS9e01_2        9 1.2       238 GB    London         1 2019-01-31 2019-01-31
##  9 ESS9e01_2        9 1.2       245 GB    East of…       1 2018-09-07 2018-09-07
## 10 ESS9e01_2        9 1.2       265 GB    East Mi…       1 2019-02-10 2019-02-10
## # ℹ 1,895 more rows
## # ℹ 10 more variables: imbgeco <dbl>, imueclt <dbl>, imwbcnt <dbl>,
## #   immigsent <dbl>, agea <dbl>, female <dbl>, eduyrs <dbl>, uempla <dbl>,
## #   hinctnta <dbl>, lrscale <dbl>

You can also see read some blog posts where I make considerable use of these data here:

Understand that some of that is slightly more complex than what we’ll be doing here, but I’d rather inundate you with things to look at and read rather than give you nothing at all.

Our primary dependent variable in this model will be the immigsent variable, which is an additive index of three variables also included in the data set. These prompts, verbatim, each range from 0 to 10 and ask 1) Would you say it is generally bad or good for the United Kingdom’s economy that people come to live here from other countries? (imbgeco), 2) And… would you say that the United Kingdom’s cultural life is generally undermined or enriched by people coming to live here from other countries? (imueclt), and 3) Is the United Kingdom made a worse or a better place to live by people coming to live here from other countries? (imwbcnt). The immigsent variable simply adds all three together, resulting in a scale that ranges from 0 to 30. You can interpret 0 as being maximally anti-immigrant/immigration and 30 being maximally pro-immigrant/immigration. Higher values = more pro-immigration sentiment. Let’s build toward multiple regression by starting in the simple bivariate, like we did in lecture. We’ll start by exploring an uncontroversial argument that we should expect pro-immigration sentiment to be much lower in North East than other parts of the United Kingdom. Why “North East”? This is the region where Newcastle is. Newcastle, rightly or wrongly, has a reputation of being “Brexit Central”. We have the means to explore this with the region variable. Let’s take a look at it.

ESS9GB %>% distinct(region)

## # A tibble: 12 × 1
##    region                  
##    <chr>                   
##  1 West Midlands (England) 
##  2 South West (England)    
##  3 South East (England)    
##  4 Northern Ireland        
##  5 London                  
##  6 East of England         
##  7 East Midlands (England) 
##  8 North West (England)    
##  9 Scotland                
## 10 Yorkshire and the Humber
## 11 North East (England)    
## 12 Wales

Perfect. I see I have 12 familiar regions to me in the data to explore, but I have no “North East” dummy variable. No matter, it’s a simple matter to create one like this.

ESS9GB %>%
  mutate(northeast = ifelse(region == "North East (England)", 1, 0)) -> Data

Note: I’m creating a new data frame here for convenience sake.

Now let’s do a simple regression, regressing the immigration sentiment variable on whether the respondent is in the North East.

M1 <- lm(immigsent ~ northeast, data=Data)
summary(M1)

## 
## Call:
## lm(formula = immigsent ~ northeast, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0235  -4.0235   0.9765   4.9765  15.3495 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.0235     0.1668 102.051  < 2e-16 ***
## northeast    -2.3730     0.7070  -3.357 0.000805 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.972 on 1848 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.00606,    Adjusted R-squared:  0.005522 
## F-statistic: 11.27 on 1 and 1848 DF,  p-value: 0.0008051

By the way, in a super simple case like this where y is continuous and x is binary, this is basically what the t-test will tell you as well.

t.test(immigsent ~ northeast, Data)

## 
##  Welch Two Sample t-test
## 
## data:  immigsent by northeast
## t = 3.3873, df = 114.62, p-value = 0.0009679
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.9852718 3.7606949
## sample estimates:
## mean in group 0 mean in group 1 
##        17.02347        14.65049

In this simple difference of means, the mean immigration sentiment where northeast is 0 is 17.02. The mean immigration sentiment where northeast is 1 is 14.65. The difference between those two numbers is basically the regression coefficient. It’s the “North East” effect, which basically decreases immigration sentiment by about 2.37. That effect is highly unlikely to be zero. You knew that because we went over statistical significance and null hypothesis testing. You also know that, one assumes, from just kind of eye-balling the United Kingdom with a polite curiosity. Again, “rightly or wrongly.”

You can also do the same thing and look for differences between Scotland and not-Scotland.

Data %>%
  mutate(scotland = ifelse(region == "Scotland", 1, 0)) -> Data

M2 <- lm(immigsent ~ scotland, Data)
summary(M2)

## 
## Call:
## lm(formula = immigsent ~ scotland, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5030  -4.7335   0.2665   5.2665  13.2665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.7335     0.1699   98.47   <2e-16 ***
## scotland      1.7695     0.5690    3.11   0.0019 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.975 on 1848 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.005206,   Adjusted R-squared:  0.004668 
## F-statistic: 9.671 on 1 and 1848 DF,  p-value: 0.0019

The model here suggests that pro-immigration sentiment in not-Scotland is 16.73. The “Scotland” effect increases pro-immigration sentiment by 1.76. The t statistic suggest that if the true differences between Scotland and not-Scotland is 0, what we got would have been observed about .19% of the time, on average.

What if we wanted to tease out more information for all regions? This would be a kind of “fixed effects” analysis, which is a somewhat misleading and confusing way of saying “take a categorical variable, and create a series of dummy variables from it.” You could do this manually, or you can have R do it for you.

M3 <- lm(immigsent ~ factor(region), Data)
summary(M3)

## 
## Call:
## lm(formula = immigsent ~ factor(region), data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.503  -4.503   0.497   5.349  15.350 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                             16.4296     0.5811  28.275  < 2e-16 ***
## factor(region)East of England            0.9694     0.7655   1.266  0.20558    
## factor(region)London                     1.5526     0.8751   1.774  0.07619 .  
## factor(region)North East (England)      -1.7791     0.8962  -1.985  0.04727 *  
## factor(region)North West (England)      -0.9162     0.7428  -1.233  0.21755    
## factor(region)Northern Ireland           0.9038     1.0662   0.848  0.39674    
## factor(region)Scotland                   2.0735     0.7926   2.616  0.00897 ** 
## factor(region)South East (England)       1.4281     0.7129   2.003  0.04531 *  
## factor(region)South West (England)       1.1347     0.7781   1.458  0.14496    
## factor(region)Wales                     -0.5905     0.9427  -0.626  0.53115    
## factor(region)West Midlands (England)   -0.8676     0.8567  -1.013  0.31131    
## factor(region)Yorkshire and the Humber   0.2043     0.7744   0.264  0.79194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.924 on 1838 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.02502,    Adjusted R-squared:  0.01919 
## F-statistic: 4.289 on 11 and 1838 DF,  p-value: 2.437e-06

I want to stop right here and explain what this procedure is doing. First, you have 12 regions, but you get 11 regression coefficients. In this procedure, something must be a baseline. Second, if you don’t give R any guidance, the “baseline” becomes whatever is first in alphabetical order. In this case, that is “East Midlands (England)”, which is the region where Leicester and Nottingham are (if Premier League squads may help orient you). What this “fixed effects” communicates, then, is the difference between East of England (i.e. where Norwich, Essex and Cambridge are) relative to East Midlands (and if there is a discernible difference), the difference between London and East Midlands (and if it’s discernible), the difference between North East and East Midlands (and if it’s discernible), and so on. Here, we’d suggest discernible differences relative to the baseline of London, North East, Scotland, and South East (i.e. where Brighton, Canterbury, and Dover are). Pro-immigration sentiment is higher in London, South East, and Scotland relative to East Midlands and lower in North East relative to East Midlands.

A few other things. For one, you could basically tease much of this out through doing a comparison of means. This procedure doesn’t demonstrate “significance”, but calculating standard errors and creating lower and upper bounds around that mean will illustrate a little bit more about what the regression model is telling you.

Data %>%
  summarize(mean = mean(immigsent, na.rm=T),
            sd = sd(immigsent, na.rm=T),
            n = n(),
            se = sd/sqrt(n()),
            lwr = mean - se,
            upr = mean + se,
            .by = region) %>%
  data.frame

##                      region     mean       sd   n        se      lwr      upr
## 1   West Midlands (England) 15.56198 7.485422 125 0.6695165 14.89247 16.23150
## 2      South West (England) 17.56425 7.001609 187 0.5120084 17.05224 18.07625
## 3      South East (England) 17.85765 6.566774 288 0.3869509 17.47070 18.24460
## 4          Northern Ireland 17.33333 7.941879  62 1.0086197 16.32471 18.34195
## 5                    London 17.98214 7.345995 114 0.6880155 17.29413 18.67016
## 6           East of England 17.39896 6.704429 199 0.4752644 16.92370 17.87423
## 7   East Midlands (England) 16.42958 6.764947 149 0.5542061 15.87537 16.98378
## 8      North West (England) 15.51339 6.936599 226 0.4614157 15.05198 15.97481
## 9                  Scotland 18.50303 7.043960 169 0.5418431 17.96119 19.04487
## 10 Yorkshire and the Humber 16.63388 6.767111 190 0.4909378 16.14294 17.12482
## 11     North East (England) 14.65049 6.905111 106 0.6706837 13.97980 15.32117
## 12                    Wales 15.83908 6.594348  90 0.6951054 15.14398 16.53419

Graphing it will help.

Data %>%
  summarize(mean = mean(immigsent, na.rm=T),
            sd = sd(immigsent, na.rm=T),
            n = n(),
            se = sd/sqrt(n()),
            lwr = mean - se,
            upr = mean + se,
            .by = region) %>%
  # The next two lines are super-hacky ways of making the plot read in order.
  arrange(desc(region)) %>%
  mutate(region = fct_inorder(region)) %>%
  # Notice below: region is x-axis, mean is y-axis, and I've outlined that I
  # want the `lwr` and `upr` columns to be mins and maxes for what's coming
  # next...
  ggplot(., aes(region, mean, ymin=lwr, ymax=upr)) + 
  # ...the point-range. This will create a dot at `mean` and bounds 
  # that correspond with the ymin and ymax variables.
  geom_pointrange() +
  # Put your coords down flip them and reverse them.
  coord_flip()

What if you wanted to make London the baseline? You could manually create your own dummy variables, or you can turn that “region” variable into a more informative factor through the use of the {forcats} package (which is loaded with {tidyverse}). Let’s use fct_relevel() for this purpose. Using fct_relevel() in the way I do below, I’m basically telling R that this new region2 variable is a factor and it’s specific order is 1) London and 2) after that, I don’t care, and just proceed alphabetically as you wish.

Data %>%
  mutate(region2 = fct_relevel(region, "London")) -> Data

Data %>% summary

##      name              essround   edition               idno      
##  Length:1905        Min.   :9   Length:1905        Min.   :   39  
##  Class :character   1st Qu.:9   Class :character   1st Qu.:12350  
##  Mode  :character   Median :9   Mode  :character   Median :24627  
##                     Mean   :9                      Mean   :24597  
##                     3rd Qu.:9                      3rd Qu.:37088  
##                     Max.   :9                      Max.   :48626  
##                                                                   
##     cntry              region             brncntr     stintrvw         
##  Length:1905        Length:1905        Min.   :1   Min.   :2018-09-01  
##  Class :character   Class :character   1st Qu.:1   1st Qu.:2018-09-29  
##  Mode  :character   Mode  :character   Median :1   Median :2018-10-18  
##                                        Mean   :1   Mean   :2018-10-27  
##                                        3rd Qu.:1   3rd Qu.:2018-11-12  
##                                        Max.   :1   Max.   :2019-02-16  
##                                                                        
##    endintrvw             imbgeco          imueclt        imwbcnt      
##  Min.   :2018-09-01   Min.   : 0.000   Min.   : 0.0   Min.   : 0.000  
##  1st Qu.:2018-10-01   1st Qu.: 5.000   1st Qu.: 4.0   1st Qu.: 4.000  
##  Median :2018-10-18   Median : 6.000   Median : 6.0   Median : 5.000  
##  Mean   :2018-10-27   Mean   : 5.698   Mean   : 5.7   Mean   : 5.447  
##  3rd Qu.:2018-11-12   3rd Qu.: 7.000   3rd Qu.: 8.0   3rd Qu.: 7.000  
##  Max.   :2019-02-16   Max.   :10.000   Max.   :10.0   Max.   :10.000  
##  NA's   :4            NA's   :18       NA's   :28     NA's   :29      
##    immigsent          agea           female           eduyrs     
##  Min.   : 0.00   Min.   :15.00   Min.   :0.0000   Min.   : 3.00  
##  1st Qu.:12.25   1st Qu.:39.00   1st Qu.:0.0000   1st Qu.:11.00  
##  Median :17.00   Median :55.00   Median :1.0000   Median :13.00  
##  Mean   :16.89   Mean   :53.67   Mean   :0.5412   Mean   :14.05  
##  3rd Qu.:22.00   3rd Qu.:69.00   3rd Qu.:1.0000   3rd Qu.:16.00  
##  Max.   :30.00   Max.   :90.00   Max.   :1.0000   Max.   :41.00  
##  NA's   :55      NA's   :12                       NA's   :12     
##      uempla           hinctnta         lrscale         northeast      
##  Min.   :0.00000   Min.   : 1.000   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.: 2.000   1st Qu.: 4.000   1st Qu.:0.00000  
##  Median :0.00000   Median : 5.000   Median : 5.000   Median :0.00000  
##  Mean   :0.01995   Mean   : 5.171   Mean   : 4.961   Mean   :0.05564  
##  3rd Qu.:0.00000   3rd Qu.: 8.000   3rd Qu.: 6.000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :10.000   Max.   :10.000   Max.   :1.00000  
##                    NA's   :290      NA's   :179                       
##     scotland                           region2   
##  Min.   :0.00000   South East (England)    :288  
##  1st Qu.:0.00000   North West (England)    :226  
##  Median :0.00000   East of England         :199  
##  Mean   :0.08871   Yorkshire and the Humber:190  
##  3rd Qu.:0.00000   South West (England)    :187  
##  Max.   :1.00000   Scotland                :169  
##                    (Other)                 :646

Now, let’s do this again.

M4 <- lm(immigsent ~ region2, Data)
summary(M4)

## 
## Call:
## lm(formula = immigsent ~ region2, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.503  -4.503   0.497   5.349  15.350 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      17.9821     0.6543  27.484  < 2e-16 ***
## region2East Midlands (England)   -1.5526     0.8751  -1.774 0.076188 .  
## region2East of England           -0.5832     0.8225  -0.709 0.478393    
## region2North East (England)      -3.3317     0.9453  -3.524 0.000435 ***
## region2North West (England)      -2.4687     0.8013  -3.081 0.002095 ** 
## region2Northern Ireland          -0.6488     1.1078  -0.586 0.558158    
## region2Scotland                   0.5209     0.8477   0.614 0.538998    
## region2South East (England)      -0.1245     0.7738  -0.161 0.872196    
## region2South West (England)      -0.4179     0.8342  -0.501 0.616473    
## region2Wales                     -2.1431     0.9895  -2.166 0.030460 *  
## region2West Midlands (England)   -2.4202     0.9079  -2.666 0.007752 ** 
## region2Yorkshire and the Humber  -1.3483     0.8307  -1.623 0.104755    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.924 on 1838 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.02502,    Adjusted R-squared:  0.01919 
## F-statistic: 4.289 on 11 and 1838 DF,  p-value: 2.437e-06

This time, you’re benchmarking any comparisons to London. Just because there’s no difference between London and Scotland here doesn’t mean there’s no difference between Scotland and East Midlands. What you do with so-called “fixed effects” depends on what you want to accomplish. Does it matter that something is the baseline? If so, make it the baseline and compare everything else to it. Do you not care, and you’re just doing this because you want to partial out these potential categorical effects while you focus on some other regressors? Then don’t bother with the details, because it’s mostly immaterial to what you want to accomplish.

Multiple Regression

Now, let’s a do a multiple regression. Let’s see if we can’t regress the immigration sentiment variable on the Scotland dummy variable, whether the respondent is a woman, and the respondent’s ideology on a 0-10 scale where 0 is the most to the right, and 10 is the most to the left. You knew how to do this in the bivariate case, but what do you do to the formula when you have multiple independent variables? Simple as this:

M5 <- lm(immigsent ~ scotland + female + lrscale, Data)
summary(M5)

## 
## Call:
## lm(formula = immigsent ~ scotland + female + lrscale, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9503  -4.1965   0.7144   4.9605  14.0774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.38127    0.49521  41.156  < 2e-16 ***
## scotland     1.45623    0.57385   2.538   0.0112 *  
## female      -0.43092    0.32923  -1.309   0.1908    
## lrscale     -0.63696    0.08515  -7.480 1.19e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.757 on 1687 degrees of freedom
##   (214 observations deleted due to missingness)
## Multiple R-squared:  0.03758,    Adjusted R-squared:  0.03587 
## F-statistic: 21.96 on 3 and 1687 DF,  p-value: 6.007e-14

In other words: separate however many x variables you want in the equation with a + sign. The formula above could be read as “using the lm() function, regress immigsent on scotland and female and lrscale from the Data data set.” Let’s walk through the basic takeaways from the model:

The estimated pro-immigration sentiment for a man, not residing in Scotland, who is furthest to the political left is 20.38. In this case, the y-intercept actually exists. This parameter isn’t total nonsense, at least on paper.
Partialing out everything else in the model (“ceteris paribus”), the effect of going from not-Scotland to “Scotland” increases pro-immigration sentiment by 1.45 points on our scale. If the true “Scotland effect” were 0, we’d expect to have observed this effect, on average, about 1.12% of the time. We’d suggest that “Scotland effect” is statistically significant.
The partial effect of being a woman versus a man (“ceteris paribus”) is to decrease pro-immigration sentiment by -.43 points. However, the corresponding t-statistic and p-value suggests that such an effect would be observed about 19% of the time if there truly was no effect. Thus, the parameters are consistent with a null hypothesis of no relationship between pro-immigration sentiment and gender. We can’t discern a difference between men and women exist here with respect to this outcome variable.
Partialing out the Scotland effect and potential gender differences, we see an unsurprising effect of ideology on pro-immigration sentiment. Higher levels of ideological affinity with the right coincide with lower levels of pro-immigration sentiment. The unit effect is -.63 and that effect is incidentally the most precise in the model. There is a clear ideology effect in our model.

Interactive Effects

Let’s walk through the issue of what to do when you think there might be an interactive relationship among x and z in their relationship with y. I’m going to keep this example simple. We noticed there were no gender differences, but is that because it depends on ideology? Let’s see, and notice how we’re going to see it below. When you want to see if two (or more, but let’s focus on just two) things interact, you put an asterisk (*) between the variables in the right-hand side of the equation. Like this:

M7 <- lm(immigsent ~ scotland + female*lrscale, Data)
summary(M7)

## 
## Call:
## lm(formula = immigsent ~ scotland + female * lrscale, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6005  -4.2097   0.7691   5.1946  14.7929 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     21.1989     0.6717  31.562  < 2e-16 ***
## scotland         1.4440     0.5735   2.518   0.0119 *  
## female          -1.9481     0.9046  -2.154   0.0314 *  
## lrscale         -0.7992     0.1239  -6.449 1.47e-10 ***
## female:lrscale   0.3064     0.1702   1.801   0.0720 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.752 on 1686 degrees of freedom
##   (214 observations deleted due to missingness)
## Multiple R-squared:  0.03942,    Adjusted R-squared:  0.03715 
## F-statistic:  17.3 on 4 and 1686 DF,  p-value: 6.435e-14

Notice that two variables we interacted produced three regression coefficients. The female coefficient is the gender difference when lrscale is 0 (i.e. the respondent is furthest to the left). The lrscale coefficient is the effect of increasing ideological affinity to the right when female is 0 (i.e. the effect of ideology on men). The female:lrscale communicates a way of thinking about how the two interact with each other, and whether that interaction is “significant”. Here, it is (albeit at the .10 level). The “Scotland effect” variable is not affected by this and is still an “all else equal” partial effect.

I want to use this as an illustrative case. Be careful in how you interpret your interactions, and how you model them. We got lucky here that that the data have naturally occurring 0s, but what is “0”—and whether there’s a 0 at all—will influence what exactly the regression summary tells you. For example, let’s center the data so that 0 is the people in the middle of the scale. Observe:

Data %>%
  mutate(c_lrscale = lrscale - 5) -> Data

Data %>%
  distinct(lrscale, c_lrscale) %>%
  arrange(lrscale)

## # A tibble: 12 × 2
##    lrscale c_lrscale
##      <dbl>     <dbl>
##  1       0        -5
##  2       1        -4
##  3       2        -3
##  4       3        -2
##  5       4        -1
##  6       5         0
##  7       6         1
##  8       7         2
##  9       8         3
## 10       9         4
## 11      10         5
## 12      NA        NA

Take very quick inventory of what this did. We took the raw data and subtracted 5 from it. Now, 0 are people who would place themselves in the ideological middle. 5 are the people furthest to the right and -5 are the people furthest to the left. We didn’t really “change” 0, as much as we shifted it.

Now, watch.

M8 <- lm(immigsent ~ scotland + female*c_lrscale, Data)
summary(M8)

## 
## Call:
## lm(formula = immigsent ~ scotland + female * c_lrscale, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.6005  -4.2097   0.7691   5.1946  14.7929 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       17.2030     0.2426  70.914  < 2e-16 ***
## scotland           1.4440     0.5735   2.518   0.0119 *  
## female            -0.4161     0.3291  -1.264   0.2063    
## c_lrscale         -0.7992     0.1239  -6.449 1.47e-10 ***
## female:c_lrscale   0.3064     0.1702   1.801   0.0720 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.752 on 1686 degrees of freedom
##   (214 observations deleted due to missingness)
## Multiple R-squared:  0.03942,    Adjusted R-squared:  0.03715 
## F-statistic:  17.3 on 4 and 1686 DF,  p-value: 6.435e-14

Compare this to the output of M7. Notice how the gender dummy variable isn’t significant? Except, again, you have to be careful because the female dummy is now the difference between men and women for those in the middle, not the far left. The interactive coefficient didn’t move, but the gender variable did. The ideology constituent effect didn’t move because men are still the baseline. Just be careful about what’s actually happening in an interaction.

I wanted to close with a caveat about interactions. Just because the overall interactive effect is insignificant, or just because component terms of the interaction are insignificant, doesn’t mean there isn’t something potentially interesting happening in your interactions. If the interaction is statistically significant, you should definitely look at it to see what the interaction looks like. If it’s insignificant, you should still look at it. After all, you’re the one that felt compelled to interact two things. Go look at what you did just to see for yourself.

Here’s what we’ll do. We’ll create a simple prediction grid. Let the ideology variable be either 0 (furthest to the left), 5 (in the middle), or 10 (furthest to the right). Let the female variable range from 0 to 1. Then, given the model we estimated, we’re going to get predictions from the model and standard errors around those predictions. Let the Scotland effect be 0 for convenience sake. Then, we’ll create conceivable lower and upper bounds around these estimates and graph them.

Note: it’s killing me to not ask you to install {modelr} for this, but you should absolutely install {modelr} for this. It makes it way less tedious.

tibble(lrscale = c(0, 5, 10, 0, 5, 10),
       female = c(0,0,0,1,1,1),
       scotland = 0) -> newdat

newdat # Look at our hypothetical data/prediction grid.

## # A tibble: 6 × 3
##   lrscale female scotland
##     <dbl>  <dbl>    <dbl>
## 1       0      0        0
## 2       5      0        0
## 3      10      0        0
## 4       0      1        0
## 5       5      1        0
## 6      10      1        0

Preds <- predict(M7, newdata = newdat, se.fit = TRUE)

Preds # Look at our predictions.

## $fit
##        1        2        3        4        5        6 
## 21.19887 17.20297 13.20707 19.25074 16.78684 14.32294 
## 
## $se.fit
##         1         2         3         4         5         6 
## 0.6716501 0.2425894 0.6591887 0.6158585 0.2340980 0.6426349 
## 
## $df
## [1] 1686
## 
## $residual.scale
## [1] 6.752303

Preds %>% as_tibble() %>% # Coerce to a tibble, and...
  # bind_cols to the prediction grid/hypothetical data, and...
  bind_cols(newdat, .) %>% 
  # select just what we want, to look at  it...
  select(lrscale:se.fit)

## # A tibble: 6 × 5
##   lrscale female scotland   fit se.fit
##     <dbl>  <dbl>    <dbl> <dbl>  <dbl>
## 1       0      0        0  21.2  0.672
## 2       5      0        0  17.2  0.243
## 3      10      0        0  13.2  0.659
## 4       0      1        0  19.3  0.616
## 5       5      1        0  16.8  0.234
## 6      10      1        0  14.3  0.643

Since I can see that performed how I wanted it to. Let’s make it pretty.

Preds %>% as_tibble() %>%
  bind_cols(newdat, .) %>% 
  select(lrscale:se.fit) %>%
  # Okay, there's going to be some hackiness (sic). First, we're going to create
  # lower and upper bounds that are 90% intervals around +/- the standard error.
  # Next, we're going to create a categorical variable from the female variable
  # in a way that's super obvious. Next. We're going to create a categorical
  # variable from the `lrscale` variable (ideo_lbl) that adds more intuitive 
  # labels. The \n you see there forces a linebreak for legibility in the label.
  # You'll see what it does. Using the fct_inorder() forces it to be ordered
  # because, otherwise, it would display 0, 10, 5 (i.e. furthest, furthest, 
  # middle) because of the alphabetical order. Because our data are already
  # ordered, this is just a shortcut/cheat code to do what we want to do. In
  # this case, it's because I know y'all might help to see the left in red and 
  # the right in blue.
  mutate(lwr = fit - 1.645*se.fit,
         upr = fit + 1.645*se.fit,
         gndr_lbl = ifelse(female == 0, "Men", "Women"),
         ideo_lbl = case_when(
           lrscale == 0 ~ "Furthest\nLeft",
           lrscale == 5 ~ "Middle",
           lrscale == 10 ~ "Furthest\nRight"
         ),
         ideo_lbl = fct_inorder(ideo_lbl)) %>%
  ggplot(.,aes(gndr_lbl, fit, ymin=lwr, ymax=upr, color=ideo_lbl)) +
  geom_pointrange(position = position_dodge(width = .1)) +
  coord_flip() +
  theme_minimal() +
  labs(x = "", y = "Estimated Pro-Immigration Sentiment (with 90% Bounds)",
       color = "")

What this tells me: the effect of increasing ideology on pro-immigration sentiment is stronger for the men than it is the women. For women furthest to the left, their mean pro-immigration sentiment is 19.3. Furthest to the right: 14.3. That’s a min-max difference of about 5 points. For the men: the estimated pro-immigration sentiment for non-Scottish dudes furthest to the left is 21.2, which is interestingly higher than the women furthest to the left. However, the estimated pro-immigration sentiment for the men furthest to the right is 13.2. That’s a min-max difference of 8 points. Kinda cool, and kinda interesting as I’m doing this on the fly for the sake of this class.

There, is I suppose, an alternate interpretation that is largely consistent with the data but communicates a different mechanism happening here. In the above interpretation, I largely channeled the effect of ideology through men and women. There could also be a story to tell about why there is an apparent gender difference among those furthest to the left but no real gender differences among those furthest to the right. In this potential interpretation, there is a potential gender mechanism that’s being clustered on ideology, and why is it that the men and women are more like each other on this matter (in this simple model) among the most right whereas they are more different among those most to the left.

I’ll defer to you to make sense of the findings as you see fit, and why you think you see what you see (provided you treat such an obviously simple model as gospel truth about the UK population at this point in time). No matter, always look at your data. You might find something cool in there.

Extending OLS

Steven V. Miller, svmiller.com

18 December 2022

R Packages/Data for This Session

Building Toward Multiple Regression

Multiple Regression

Interactive Effects