
Tuesday, October 12, 2021

(2) Results: Using Traditional and Non-Traditional Predictors to Model Unemployment

1. Abstract

The rise and fall of unemployment during the COVID pandemic has created a very peculiar economic environment. It is under this environment that we think about traditional and non-traditional measures to predict unemployment. For traditional measures we use indexes which have been shown by previous research to relate to unemployment. For non-traditional measures we use time series data of a selection of relevant words from Google Trends. We find that the strongest model for predicting unemployment encompasses both traditional and non-traditional measures. This has implications for governmental bodies and other interested parties and their methodologies for measuring real unemployment.

2. Introduction

2.1 Overview

Unemployment affects thousands of people each year and has ramifications for all aspects of life. It is therefore important to measure unemployment so that government institutions can act in time to prevent problems for citizens. Unemployment can indicate the health of a country's economy and, in turn, of its people. It is also associated with negative outcomes such as crime and depression. When unemployment stays high for long periods, people may become discouraged and never return to the workforce even once the economy improves, something we may have been seeing during the pandemic. Measuring and predicting unemployment is therefore paramount to the well-being of a society. The unemployment rate is published each month by the Bureau of Labor Statistics of the U.S. Department of Labor, which estimates it from the Current Population Survey, a monthly sample survey of households, rather than from counts of unemployment insurance claims.

2.2 Data Collection

Predicting unemployment is a complicated endeavor since it involves multiple causes such as the inflation rate, labor force, job openings and education. In this analysis we try to create a model that uses both typical and atypical predictors of unemployment. For example, we use the CPI (Consumer Price Index), which is known to relate strongly to unemployment and is traditionally used by economists to estimate the true rate of unemployment. We also use other measures of inflation, as we would like to investigate the strength of inflation when predicting unemployment rates. All of these measures come from the Federal Reserve Economic Data (FRED) of the St. Louis Federal Reserve. We also use Google Trends, a website by Google that reports the relative search volume of queries over time. We assume that typical words unemployed people would search for could be "job", "job offer", "welfare", "bored" or even "alcohol".

We also look at different states of the economy. During the 2008 recession and the COVID pandemic we have seen many workers lose their jobs, become discouraged, or become too sick to work. Alongside the primary motivation of this analysis, it is interesting to see how differently these two recessions affected unemployment. This is why we create dummy variables to measure the strength of such effects.

2.3 The Data Set

The data set is composed of 17 different attributes and 213 observations spanning back to January 2004. Of these attributes, 7 are traditional, 7 are non-traditional, 2 are dummy variables and 1 is the date of the observation. Our data set goes back to 2004 because that is when Google Trends data begin. We use a monthly frequency because the unemployment rate is published monthly. Since we have the unemployment rate for the United States, we framed all of our variables around only this country.

In the following list, the full name of each variable is followed by its name in our data set, shortened for coding convenience:

  • Unemployment Rate (UNRATE): The unemployment rate is the number of unemployed persons over the age of 16 as a percentage of the labor force.

  • S&P / Case-Shiller U.S. National Home Price Index (HPI): The Case-Shiller US National Home Price Index measures the value of single-family housing within the US. It is actually a composite of single-family home price indices.

  • Personal Consumption Expenditures (PCE): The PCE reflects changes in the prices of a wide range of goods and services purchased by consumers in the US. This is a good measure of inflation and deflation.

  • Producer Price Index (PPI): This is a group of indexes that calculates and represents the average movement in selling prices from domestic production over a month.

  • Job Openings (JobOpenings): This is the number of job openings, in thousands, available each month in the United States.

  • Consumer Price Index (CPI): The CPI measures the weighted average of prices from a select basket of consumer goods and services spanning from transportation to medical care.

  • Gas: This gives us the retail prices for all grades of gasoline in the US.

  • Trends (Work, Job Offer, Job, Welfare, Depression, Alcohol and Bored): Each trend gives us the relative frequency of Google searches for the specific word compared to the overall volume of Google searches in that month.

  • COVID: We consider COVID to have affected the American economy from March 2020 until this day. This is the first of the dummy variables, with levels “Yes” or “No”.

  • Recession: Recessions after 2004 include the 2008 recession, which ran from December 2007 until June 2009 according to the Federal Reserve. This is the second of the dummy variables, with levels “Yes” or “No”.
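As a rough illustration of how the two dummy variables could be coded, the sketch below builds them from a monthly date index. The `dates` vector and the `dummies` data frame are hypothetical stand-ins for the columns in our data set; only the cut-off dates and the 213 monthly observations come from the description above.

```r
# Hypothetical monthly index: 213 months starting in January 2004
dates <- seq(as.Date("2004-01-01"), by = "month", length.out = 213)

# COVID: "Yes" from March 2020 onward
covid <- ifelse(dates >= as.Date("2020-03-01"), "Yes", "No")

# Recession: "Yes" from December 2007 through June 2009
recession <- ifelse(dates >= as.Date("2007-12-01") &
                      dates <= as.Date("2009-06-01"), "Yes", "No")

# Store both as two-level factors so lm() treats them as categorical
dummies <- data.frame(
  Date      = format(dates, "%Y-%m"),
  COVID     = factor(covid, levels = c("No", "Yes")),
  Recession = factor(recession, levels = c("No", "Yes"))
)

table(dummies$COVID)
table(dummies$Recession)
```

Setting "No" as the first factor level makes it the reference category, so each fitted coefficient would describe the shift in the response during the "Yes" months.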

3. Pre-data analysis

For our initial analysis we used various graphs to see how all these variables could possibly be related. One thing that stood out when visualizing the number of job openings and the unemployment rate (our response variable) is the stark difference between the 2008 recession and COVID. As seen in Graph 1 below, traditional recessions have few job openings because the economy is contracting. During COVID, for various reasons that economists are still studying, the number of job openings increased drastically despite the high number of unemployed people. The violin graph also shows that, according to our data set, unemployment was actually higher over time during COVID. This is why we make sure that both of these variables are included in our model.

3.1 Multicollinearity

Many of these indexes are highly related, which may create issues for a model that includes all of them. It is important to note which variables are highly correlated because they may cause collinearity problems further on in the model. This is why we checked the variance inflation factor (VIF) before applying any transformations to the variables. After checking for collinearity we find that we need to remove the Personal Consumption Expenditures, Producer Price Index, Job Openings and the Work Trend. In other words, the removed variables were highly related to the ones that remain in our model. Figure 1 shows the VIF for the full model.
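For readers unfamiliar with the VIF, the sketch below computes it by hand on simulated data: each predictor is regressed on the others and the VIF is 1 / (1 - R²) of that regression. The data frame `toy` and the helper `vif_manual` are hypothetical and only illustrate the idea; in our analysis we used the `vif()` function on the fitted model, which additionally handles categorical terms.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
toy <- data.frame(x1, x2, x3)

# VIF of each column: regress it on all other columns, then 1 / (1 - R^2)
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v),
                     data = predictors))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

In this toy example `x1` and `x2` should show very large values while `x3` stays close to 1, which mirrors the pattern that led us to drop variables one at a time.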

Looking at the histograms (Figure 2 in the appendix), most of the variables are fairly evenly distributed. The Unemployment Rate (UNRATE), JobTrend, HPI and CPI may be slightly skewed. The CPI is skewed positively because monetary policy keeps it that way as a means of combating unemployment, which is why the UNRATE and the CPI are skewed in a mirrored way.

In the QQ plots (Figure 3) we can see that very few variables are skewed: possibly the HPI, JobTrend and AlcoholTrend are slightly skewed and may need Box-Cox transformations.

3.2. Initial Model

After eliminating multicollinearity our proposed model therefore becomes:

UNRATE = β0 + β1(HPI) + β2(CPI) + β3(Gas) + β4(JobOfferTrend) + β5(JobTrend) + β6(WelfareTrend) + β7(DepressionTrend) + β8(AlcoholTrend) + β9(BoredTrend) + β10(COVID) + β11(Recession) + ε

This model is already a good predictor of unemployment, with an Adjusted R-squared of 0.784, but it can be significantly improved.

4. Model Diagnosis

4.1. Model Assumptions

First, we checked for a linear relationship between the response variable and the predictors. The scatterplots in Figure 4 show that the response variable “UNRATE” has linear relationships with many variables, but non-linear relationships with many others, so we might need to transform some of these variables. Then we checked for normality and constant variance. The residual plot does not seem to show random variation, whereas the QQ plot is close to a straight line (Figure 5). For these assumptions to be satisfied, we might have to transform the response variable and check for any possible outliers or influential points.

Finally, our outliers are marked by the COVID pandemic. After investigating them we find that the outlying months are March, April and May 2020. Given the importance of these months within our motivations we do not find it appropriate to eliminate them. We believe that by including the dummy variables COVID and Recession the leverage that these points have on our analysis will be mitigated. Regardless, the leverage we are seeing is still pretty low.
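The kind of check described here can be sketched with base R diagnostics. The example below uses simulated data with one artificially shocked observation; the data frame `toy`, the shocked row and the cut-off of 3 for studentized residuals are assumptions for illustration, not values from our data set.

```r
set.seed(11)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = 0.5)
toy$y[50] <- toy$y[50] + 8   # inject one large shock at row 50

fit <- lm(y ~ x, data = toy)

# Collect the usual influence measures for every observation
diagnostics <- data.frame(
  leverage = hatvalues(fit),       # diagonal of the hat matrix
  cooks    = cooks.distance(fit),  # overall influence on the fit
  studres  = rstudent(fit)         # externally studentized residuals
)

# Flag observations whose studentized residual exceeds 3 in absolute value
flagged <- which(abs(diagnostics$studres) > 3)
diagnostics[flagged, ]
```

A point can have a large residual but low leverage, as in this toy example where the shock is in `y` only; that is the situation we describe for the COVID months.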

4.2. Missing Points

Our missing values are not missing at random (NMAR): they are missing because the last few months of certain indexes had not been released yet. We wanted to include as many months of COVID as possible, since multiple new policies put into effect in the last few months may have a strong impact on the data. We therefore decided to keep the rows with missing values in the data set, because their non-missing values may be very relevant to understanding unemployment during COVID.
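One caveat worth checking: with R's default `na.action` (`na.omit`), `lm()` drops any row that has a missing value in a variable used by the model, so keeping a row in the data frame does not by itself mean it enters a given fit. A minimal sketch on a made-up data frame (`toy` and its columns are hypothetical):

```r
# Six observations, two of which have a missing predictor
toy <- data.frame(
  y  = c(5.1, 4.9, 6.2, 7.0, 6.5, 5.8),
  x1 = c(1.0, 1.2, 2.1, 2.9, NA, 2.0),
  x2 = c(10, 11, 9, NA, 8, 10)
)

fit <- lm(y ~ x1 + x2, data = toy)

rows_in_data <- nrow(toy)                  # rows stored in the data frame
rows_in_fit  <- nobs(fit)                  # rows actually used by lm()
dropped      <- as.integer(fit$na.action)  # row indices that were omitted

rows_in_data
rows_in_fit
dropped
```

Comparing `nobs()` across models is a quick way to see whether removing a predictor with missing months changes the sample a model is fitted on.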

4.3. Transformations

Before we refit the model, we must first perform some adjustments to the data. As previously mentioned, the histograms of the predictors JobTrend and CPI raise some concern as they are slightly skewed. In hopes of making their distributions look more normal, we apply a log transformation to each. Additionally, the Box-Cox plot for the unemployment rate (Figure 6) suggests that the relation is not exactly linear, so we try the inverse of UNRATE squared, (UNRATE)^(-2). We can see in Figure 7 that after the transformation the residual plot looks random. This is definitely an improvement on the previous model, so we use these transformations in the models we fit in the following sections.

Our new model looks like:

(UNRATE)^(-2) = β0 + β1(HPI) + β2 log(CPI) + β3(Gas) + β4(JobOfferTrend) + β5 log(JobTrend) + β6(WelfareTrend) + β7(DepressionTrend) + β8(AlcoholTrend) + β9(BoredTrend) + β10(COVID) + β11(Recession) + ε

This model not only did a better job in satisfying the assumptions, but also improved the adjusted R-squared to 0.9419.
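To illustrate how a Box-Cox profile can point towards an inverse-square response, here is a minimal sketch on simulated data (not our data set). The response `y` is constructed so that y^(-2) is linear in `x`; the variable names, coefficients and noise level are assumptions chosen for the illustration.

```r
library(MASS)  # provides boxcox()

set.seed(42)
n <- 200
x <- runif(n, 1, 5)

# Build a positive response whose inverse square is linear in x
y <- (0.02 + 0.01 * x + rnorm(n, sd = 0.001))^(-1/2)

fit <- lm(y ~ x, y = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting
bc <- boxcox(fit, lambda = seq(-4, 2, by = 0.1), plotit = FALSE)

# The lambda with the highest log-likelihood is the suggested power
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

In practice one rounds the suggested power to a convenient value (here something near -2) and then checks the residual plots again, which is the step we took for UNRATE.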

5. Model Selection

5.1. Stepwise Selection
We decided that BIC selection does not work for us because we have a model with a lot of predictors. We chose the AIC model because our goal was to find a model that does the best at predicting the unemployment rate, not to find the best predictors of it. We ran AIC selection for our model*, and the model with the smallest AIC looks like:

(UNRATE)^(-2) = β0 + β1(HPI) + β2(JobOfferTrend) + β3(JobTrend) + β4(DepressionTrend) + β5(AlcoholTrend) + β6(COVID) + ε

*due to errors with the missing values, we removed the log transformations for the stepwise model selection
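The difference between the two criteria can be sketched with `stepAIC()` from MASS, whose `k` argument sets the penalty per parameter: `k = 2` gives AIC and `k = log(n)` gives BIC. The simulated data frame `toy` below is hypothetical; only `x1` and `x2` truly affect the response.

```r
library(MASS)  # provides stepAIC()

set.seed(7)
n <- 213
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = toy)

# Backward elimination under the AIC penalty (k = 2, the default)
step_aic <- stepAIC(full, direction = "backward", trace = 0)

# Same search under the heavier BIC penalty (k = log(n))
step_bic <- stepAIC(full, direction = "backward", trace = 0, k = log(n))

aic_terms <- attr(terms(step_aic), "term.labels")
bic_terms <- attr(terms(step_bic), "term.labels")
aic_terms
bic_terms
```

Because log(213) is larger than 2, the BIC search penalizes each extra predictor more heavily and tends to return smaller models, which is why AIC suited our prediction goal better.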

5.2. Proposed Model

(UNRATE)^(-2) = β0 + β1(HPI) + β2(CPI) + β3(JobOfferTrend) + β4(JobTrend) + β5(DepressionTrend) + β6(AlcoholTrend) + β7(COVID) + ε

This model yields an Adjusted R-squared of 0.928, as seen below. It is important to note that we decided to keep the predictors from the second-to-last step of the AIC stepwise selection, which retains CPI. This gives us a more balanced model in terms of traditional and non-traditional predictors without a big impact on the Adjusted R-squared. This model performs well in terms of both reasonable fit and prediction.

6. Conclusion

Figure 8. Final Model Summary

After comparing models with only one or the other type of predictor, we found that the model which includes both traditional and non-traditional predictors has the highest R-squared. This answers our research question: a model with both types of predictors performs best.

The model that we created can be modified by using a different number and variety of indexes. It would also be interesting to see what other words could be used to predict unemployment. Since Google Trends data only go back to 2004, other methods would be needed to look at word usage before 2004. One suggestion would be to take a basket of books and see how often certain words or phrases were used in these books. This could extend our model to years before 2004 and have implications for internal validity. A further suggested analysis would be to use this model for an out-of-sample test in the coming months. By looking at what happens with unemployment and our predictors in the rest of 2021, we can confirm the true strength of our model.
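Such an out-of-sample test could look roughly like the sketch below, which holds out the last 12 months, predicts on the transformed scale and converts the predictions back to a rate. Everything here is simulated: the data frame `toy`, the single predictor `x` and the 201/12 split are assumptions made for the illustration, not our actual data or results.

```r
set.seed(3)
n <- 213
toy <- data.frame(x = runif(n, 1, 5))

# Simulated rate whose inverse square is linear in x
toy$UNRATE <- (0.02 + 0.01 * toy$x + rnorm(n, sd = 0.001))^(-1/2)

train <- toy[1:201, ]    # earlier months used for fitting
test  <- toy[202:213, ]  # last 12 months held out

fit <- lm(I(UNRATE^(-2)) ~ x, data = train)

# Predictions are on the (UNRATE)^(-2) scale ...
pred_inv <- predict(fit, newdata = test)

# ... so raise them to the power -1/2 to get back to the rate itself
pred_unrate <- pred_inv^(-1/2)

# Root mean squared error on the original scale
rmse <- sqrt(mean((test$UNRATE - pred_unrate)^2))
rmse
```

Reporting the error on the original scale matters because an R-squared computed on (UNRATE)^(-2) is not directly comparable with one computed on UNRATE.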

The analysis has various implications for the way we predict unemployment, chiefly that the set of predictors can be expanded to include unusual ones such as those explored here. When we use both types of predictors we get a better model for predicting our measure of unemployment, and this may be an even stronger measure of true unemployment. There are also implications for the use of Google search queries in targeted advertising: for example, the government could use such ads to reach those who may be looking for a job. These different kinds of predictors can be used as sentinels to alert for periods of higher or lower unemployment.

7. Appendix

Figure 1: VIF for Full Model



Figure 2: Histograms

Figure 3: QQ Plots

Figure 4: Scatterplots for the variables

Figure 5: Residuals and QQ plot for First Model

Figure 6: BoxCox transformation for UNRATE

Figure 7: Residuals and QQ plot for Model after transformations

8. Code

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(car)
library(formatR)
library(MASS)
library(faraway)
library(leaps)
library(Rcmdr)
```

```{r CLEAN, include= FALSE}
# load data into R and give the columns shorter names
unemployment <- read.csv("Data.csv") %>%
  dplyr::select(Date = Month,
                UNRATE,
                HPI = CSUSHPISA,
                PCE,
                JobOpenings = JTSJOL,
                PPI = PPIACO,
                CPI = CPALTT01USM657N,
                Gas,
                WorkTrend = work...United.States.,
                JobOfferTrend = job.offer...United.States.,
                JobTrend = job...United.States.,
                WelfareTrend = welfare...United.States.,
                DepressionTrend = depression...United.States.,
                AlcoholTrend = alcohol...United.States.,
                BoredTrend = bored...United.States.,
                COVID,
                Recession)
```

- The Full Model
```{r}
# Model 1: Full Model
mod1 <- lm(UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
summary(mod1)
```
- Variance Inflation Factors
```{r}
vif(mod1)
```
```{r}
# Model 2: Full Model - PCE
mod2 <- lm(UNRATE ~ HPI + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod2)
```

There are still many variables with high VIFs. Below we remove the variables with the highest VIFs, one at a time, until we reach a simplified model in which all variables have a VIF <4.
```{r}
# Model 3: Full Model - PCE - PPI

mod3 <- lm(UNRATE ~ HPI + JobOpenings + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod3)
# Model 4: Full Model - PCE - PPI - JobOpenings

mod4 <- lm(UNRATE ~ HPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod4)
# Model 5: Full Model - PCE - PPI - JobOpenings - WorkTrend

mod5 <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod5)
```

```{r}
# Data Set for Model 5
unemployment5 <- read.csv("Data.csv") %>%

dplyr::select(UNRATE,
HPI=CSUSHPISA,
CPI = CPALTT01USM657N,
Gas,
JobOfferTrend = job.offer...United.States., JobTrend = job...United.States.,
WelfareTrend = welfare...United.States., DepressionTrend = depression...United.States., AlcoholTrend = alcohol...United.States., BoredTrend = bored...United.States.,
COVID,
Recession

)
par(mfrow=c(3,4))
for(i in c(1:10)){
x <- names(unemployment5)[i]
hist(unemployment5[,i],xlab=x,main=x)
}
```
- qq plots
```{r}
par(mfrow=c(3,4))
for(i in c(1:10)){
x <- names(unemployment5)[i]

qqnorm(unemployment5[,i],xlab=x)

qqline(unemployment5[,i]) }
```
- Box Cox transformations
```{r}
par(mfrow=c(2,2))
for(i in c(6)){
x <- names(unemployment5)[i]
boxcox(I(unemployment5[,i]+0.001)~1,xlab=x)### add 0.001 to make all values positive
}
```
```{r}
par(mfrow=c(2,2))
for(i in c(2:10)){
x <- names(unemployment5)[i]
plot(unemployment5[,i],unemployment5$UNRATE,xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(1,2))
for(i in c(11,12)){
x <- names(unemployment5)[i]
boxplot(unemployment5$UNRATE~unemployment5[,i],xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(2,2))
plot(mod5)
shapiro.test(residuals(mod5))
```
The residual plot does not seem random and the QQ plot is close to a straight line. There might be some influential/outliers.

```{r}
boxcox(mod5, lambda = seq (-3,3))
```
The graph suggests that the relation is not exactly linear. We should try (UNRATE)^(-2).
```{r}
# fit the model with the inverse-square response, then inspect it
model.inv <- lm((UNRATE)^(-2)~HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
plot(model.inv)
summary(model.inv)
```
- Leverage Plots / Cook's distance
```{r}
par(mfrow=c(1,2))
lev <- hatvalues(mod5) #extract the leverage values
labels <- row.names(unemployment)
halfnorm(lev,labs=labels,ylab="Leverages",xlim=c(0,3))
cook <- cooks.distance(mod5)#find Cook's distance
halfnorm(cook,labs=labels,ylab="Cook's distance",xlim=c(0,3))
```
Rows 189 and 195 are outliers. We now investigate to see why.

(189) 2019-09

(195) 2020-03
```{r}
outliers <- unemployment5[c(189,195), ]
head(outliers)
summary(unemployment5)
```

- Studentized Residuals (graph)
Look at the values of the special cases from graphs
```{r}
### Extract studentized residuals from the fitted model
studres <- rstudent(mod5) # get studentized residuals
range(studres)
out.ind <- which(abs(studres)>3) # rows with large studentized residuals
summary(unemployment5)
unemployment5[out.ind,]
```
- Partial Residual Plot
```{r}
# creating a model without categorical variables
mod5cont <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend , data =unemployment)
par(mfrow=c(4,2))
termplot(mod5cont,partial.resid=TRUE,pch=16)
```
- TRANSFORMATIONS
```{r}
# Logging the CPI because its residuals were skewed; using the Box-Cox power (-2) for Y
mod6 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod6)
```
```{r}
# Logging the JobTrend
mod7 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod7)
```
- Multicollinearity
```{r}
vif(mod7)
```

The VIFs of “HPI” and “AlcoholTrend” are a bit high (5.380891 and 5.179086), but all other VIFs seem to be ok. We can try a model selection technique to remove some of the predictors and see if that removes the multicollinearity.
- Mallow's CP
```{r}
b <- regsubsets((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment,nvmax=11)
rs <- summary(b)
par(mfrow=c(1,2))
plot(1:11,rs$cp,ylab="Mallow's Cp",xlab="No.of Predictors",type="l",lwd=2)
plot(1:11,rs$bic,ylab="BIC",xlab="No.of Predictors",type="l",lwd=2)
```
```{r}
# Best model selected by Mallow's cp
rs$which[which.min(rs$cp),]
mod7.cp = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + log(JobTrend) + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
```
```{r}
#Best model selected by BIC
rs$which[which.min(rs$bic),]
mod7.bic = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + WelfareTrend + DepressionTrend + AlcoholTrend + COVID, data =unemployment)
```
```{r}
summary(mod7.cp)
vif(mod7.cp)
```
```{r}
summary(mod7.bic)
vif(mod7.bic)

```
- AIC
```{r}
#using model without logs because of NaNs error
mod5.7 =lm((UNRATE)^(-2) ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
```
```{r}
## Stepwise backward selection based on AIC
step_b <- stepAIC(mod5.7, trace = TRUE, direction= "backward")
## Stepwise forward selection based on AIC
## (no `scope` is given, so the starting model is also the upper model and no terms can be added)
step_f <- stepAIC(mod5.7, trace = TRUE, direction= "forward")
## Stepwise both ways selection based on AIC
step_both <- stepAIC(mod5.7, trace = TRUE, direction= "both")
library(betareg)
coef(mod5.7)
``` 

Sunday, October 10, 2021

(1) Analysis: Using Traditional and Non-Traditional Predictors to Model Unemployment

Initial Report

Introduction

Motivation

Unemployment affects thousands of people each year and has ramifications in all aspects of life. This is why it is important to find ways to measure unemployment so that government institutions can act in time to prevent problems for citizens. Unemployment can also indicate the health of a country's economy and, in turn, of its people. It is also associated with negative outcomes such as crime and depression. When unemployment stays high for long periods, people may become discouraged and never return to the workforce even when the economy improves. Measuring and predicting unemployment is therefore paramount to the well-being of a society. The unemployment rate is published each month by the Bureau of Labor Statistics of the U.S. Department of Labor, which estimates it from the Current Population Survey, a monthly sample survey of households, rather than from counts of unemployment insurance claims.

Predicting unemployment is a complicated endeavor as it involves multiple causes such as the inflation rate, labor force, job openings and education. In this analysis we will try to create a model that uses both typical and atypical predictors of unemployment. For example, we will be using the CPI (Consumer Price Index), which is known to relate strongly to unemployment and is traditionally used by economists to estimate the true rate of unemployment. We will also be using other measures of inflation as we would like to investigate the strength of inflation when predicting unemployment rates. We will also be using Google Trends, a website by Google that reports the relative search volume of queries over time. We assume that typical words unemployed people would search for could be “job”, “job offer”, “bored” or even “alcohol”.

We will also be looking at different states of the economy. With the recent 2008 recession and the COVID pandemic we have seen many workers lose their jobs, become discouraged, or too sick to work. It would be interesting to see the differences these two recessions have had on unemployment along with the primary motivation of this analysis.

Research Question

Are traditional measures more successful at predicting unemployment than non-traditional measures?

The Data Set

The data set is composed of 17 different attributes (as shown in Table 1 below) and 213 observations spanning back to January 2004.

This data set includes monthly observations for each variable. We decided to use a monthly frequency as the Federal Reserve calculates the unemployment rate monthly. Since we have the unemployment rate for the United States, we framed all of our variables around only this country.

Predictor Descriptions

| Variable | Notes |
|---|---|
| Date | Year and Month |
| UNRATE | Unemployment Rate, Percent, Seasonally Adjusted |
| HPI | S&P/Case-Shiller U.S. National Home Price Index, Seasonally Adjusted |
| PCE | Personal Consumption Expenditures, Billions of Dollars, Seasonally Adjusted Annual Rate |
| JobOpenings | Job Openings: Total Nonfarm, Level in Thousands, Seasonally Adjusted |
| PPI | Producer Price Index by Commodity: All Commodities, Not Seasonally Adjusted |
| CPI | Consumer Price Index: Total All Items for the United States, Growth Rate Previous Period, Not Seasonally Adjusted |
| Gas | U.S. All Grades All Formulations Retail Gasoline Prices (Dollars per Gallon) |
| WorkTrend | Frequency of the word ‘Work’ relative to total search volume on Google |
| JobOfferTrend | Frequency of the words ‘Job Offer’ relative to total search volume on Google |
| JobTrend | Frequency of the word ‘Job’ relative to total search volume on Google |
| WelfareTrend | Frequency of the word ‘Welfare’ relative to total search volume on Google |
| DepressionTrend | Frequency of the word ‘Depression’ relative to total search volume on Google |
| AlcoholTrend | Frequency of the word ‘Alcohol’ relative to total search volume on Google |
| BoredTrend | Frequency of the word ‘Bored’ relative to total search volume on Google |
| COVID | If USA was experiencing the COVID pandemic (Yes/No) |
| Recession | If USA was in a recession (Yes/No) |

Unemployment Rate (UNRATE)

The unemployment rate is the number of unemployed persons over the age of 16 as a percentage of the labor force.

S&P / Case-Shiller U.S. National Home Price Index (HPI)

The Case-Shiller US National Home Price Index measures the value of single-family housing within the US. It is actually a composite of single-family home price indices.

Personal Consumption Expenditures (PCE)

The PCE reflects changes in the prices of a wide range of goods and services purchased by consumers in the US. This is a good measure of inflation and deflation.

(To do: describe each predictor – where it is from, – what it shows, – how each number was calculated)

Job Openings

This is the amount in thousands of job openings available each month in the United States.

Producer Price Index (PPI)

This is a group of indexes that calculates and represents the average movement in selling prices from domestic production over a month.

Consumer Price Index (CPI)

The CPI measures the weighted average of prices from a select basket of consumer goods and services spanning from transportation to medical care.

Gas

This gives us the retail prices for all grades of gasoline in the US.

COVID

We consider COVID to have affected the American economy from March 2020 until this day.

Recession

Recessions after 2004 include the 2008 recession starting on December 2007 until June 2009, as according to the Federal Reserve.

Types of predictors

| Variable | Type | Levels |
|---|---|---|
| Date | Numerical | N/A |
| UNRATE | Numerical | N/A |
| HPI | Numerical | N/A |
| PCE | Numerical | N/A |
| JobOpenings | Numerical | N/A |
| PPI | Numerical | N/A |
| CPI | Numerical | N/A |
| Gas | Numerical | N/A |
| WorkTrend | Numerical | N/A |
| JobOfferTrend | Numerical | N/A |
| JobTrend | Numerical | N/A |
| WelfareTrend | Numerical | N/A |
| DepressionTrend | Numerical | N/A |
| AlcoholTrend | Numerical | N/A |
| BoredTrend | Numerical | N/A |
| COVID | Categorical | 2: Yes, No |
| Recession | Categorical | 2: Yes, No |

Summary Statistics

The two scatterplots represented in graph 1 and 2 show us the different relationships for (1) traditional predictors of unemployment and (2) non-traditional predictors of unemployment.

While some of the predictors from graph 1 have a very strong linear relationship (for example the HPI, PCE and Job Openings), we see some divergence with the variables PPI, CPI and Gas. The relationship between gas and the CPI stands out, as you would assume the consumer price index to include gas prices and therefore to have a much more linear relationship with them. We will investigate this further in Graph 7. Another interesting sight is the relationship of the CPI with the PPI, PCE and HPI. You would expect the Consumer Price Index to be more linearly related to the Producer Price Index since consumers buy goods and services from producers. The Personal Consumption Expenditures index should also be more closely related to the CPI since it is highly correlated with the PPI. With such small differences, it will be interesting to see which one of these indexes will be a stronger predictor of unemployment.

Graph 2 shows us the much more erratic relationships between the Google Trends series. Of course, the searches for “Job” and “Job Trends” have a very strong relationship. Another relationship that stands out immediately is the positive relationship between searches for “Alcohol” and “Work”, implying that the more often “Work” is googled, the more often “Alcohol” is too. The words “Bored” and “Depression” have a looser relationship, which may have multiple implications for psychology and the way we view the relationship between depression and boredom. Sadly, “Depression” also has a weak positive relationship with “Welfare”. This may mean that as more people become unemployed (as measured by an increase in searches for “Welfare”) they also become more depressed.


As you can see from graph 3 and 4 COVID was not your typical recession. Whilst during the 2008 recession job openings decreased, during COVID they increased. This may be due to multiple reasons, for example, a lot of jobs in construction were lost in 2008 and during COVID a lot of people left work for health reasons. This makes little sense when looking at the unemployment rate (in graph 4), that during COVID dipped even more than during the 2008 recession. Further investigation is needed as these two recessions acted very differently on unemployment.


Graph 5 shows that Personal Consumption Expenditures and the Producer Price Index seem to be very correlated, since they move in a similar trend, with a significant increase in 2008 and a decrease after 2020. The US National Home Price Index follows them except for the years 2006–2012, due to the largest crash in global real estate markets in recent history.

GRAPH 6


– TALK ABOUT GAS PRICES AND CPI: THE CPI INCLUDES TRANSPORTATION AND THEREFORE GAS, HIGHLY RELATED, MAYBE ONE CAN GO IN OUR FINAL MODEL

– TALK ABOUT MISSING DATA (SOME OF THE LAST MONTHS WERE NOT AVAILABLE YET), HOW CAN THIS AFFECT OUR ANALYSIS – ARE THERE ANY OUTLIERS? YES THE RECESSION

Pre - Analysis

Correlation

As seen from the first correlation matrix, some of the traditional predictors are highly correlated with each other (close to 1). This may create issues for a model that includes all of them. Only after checking the residuals will we be able to eliminate the redundant variables that are present in the first correlation matrix. It is important for us to note which variables are highly correlated because they may create issues with collinearity further on in the model.

The correlation matrix for the non-traditional predictors shows variables that are more weakly related to each other. Some of the least correlated pairs are “Depression” and “Work”, and “JobOpenings” and “Job”. Interestingly, depression correlates differently with Job Offer and with Job: both relationships are weak, but the one with Job Offer is slightly negative and the one with Job is slightly positive. There are also some very strong negative correlations, such as that between “bored” and “work”, and some very strong positive correlations, such as that between “alcohol” and “work”.
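
To keep track of the highly correlated pairs, they can be listed directly from the correlation matrix. The following is a minimal sketch on simulated data, not our actual data set: the variable values, the way PCE is constructed from HPI, and the 0.8 cutoff are all assumptions made for illustration.

```r
# Simulated stand-ins for a few predictors; the real analysis uses the
# `unemployment` data frame, which is not reproduced here.
set.seed(1)
n <- 200
HPI <- rnorm(n, mean = 180, sd = 20)
sim <- data.frame(
  HPI = HPI,
  PCE = 50 * HPI + rnorm(n, sd = 100),   # built to track HPI closely
  CPI = rnorm(n),                        # unrelated noise
  BoredTrend = rnorm(n, mean = 50, sd = 10)
)

# Pairwise Pearson correlations, ignoring missing values pair by pair
cors <- cor(sim, use = "pairwise.complete.obs")

# List each pair once (upper triangle) whose |r| exceeds the cutoff
cutoff <- 0.8  # illustrative threshold, not taken from the analysis
idx <- which(abs(cors) > cutoff & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  r    = cors[idx]
)
print(high_pairs)
```

In this simulated example only the HPI and PCE pair is built to be strongly related, so it is the pair we would expect to be flagged.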

Proposed Model

\[ UNRATE = \beta_0 + \beta_1*HPI + \beta_2*PCE + \beta_3*JobOpenings + \beta_4*PPI + \beta_5*CPI + \beta_6*Gas + \beta_7*WorkTrend \]

\[ + \beta_8*JobOfferTrend + \beta_9*JobTrend + \beta_{10}*WelfareTrend + \beta_{11}*DepressionTrend + \beta_{12}*AlcoholTrend + \beta_{13}*BoredTrend + \beta_{14}*COVID + \beta_{15}*Recession + u \]

  • The Full Model
# Model 1: Full Model 
mod1 <- lm(UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

summary(mod1)
## 
## Call:
## lm(formula = UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + 
##     WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + 
##     AlcoholTrend + BoredTrend + COVID + Recession, data = unemployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4043 -0.3367 -0.0509  0.3030  3.8841 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.7612278  2.5082142   1.500 0.135353    
## HPI             -0.0305594  0.0089167  -3.427 0.000744 ***
## PCE              0.0001388  0.0002021   0.687 0.492954    
## JobOpenings     -0.0012393  0.0001554  -7.977 1.27e-13 ***
## PPI              0.0683890  0.0184877   3.699 0.000282 ***
## CPI              0.2362076  0.1553136   1.521 0.129928    
## Gas             -1.1909642  0.3465234  -3.437 0.000720 ***
## WorkTrend        0.0119620  0.0155582   0.769 0.442912    
## JobOfferTrend    0.0007640  0.0066946   0.114 0.909265    
## JobTrend        -0.0004922  0.0102250  -0.048 0.961654    
## WelfareTrend    -0.0122233  0.0094054  -1.300 0.195279    
## DepressionTrend -0.0043688  0.0075178  -0.581 0.561829    
## AlcoholTrend     0.0129157  0.0142024   0.909 0.364267    
## BoredTrend       0.0311738  0.0065705   4.745 4.05e-06 ***
## COVIDYes         4.4751654  0.3976397  11.254  < 2e-16 ***
## RecessionYes    -0.9210334  0.2067848  -4.454 1.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7306 on 194 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.8857, Adjusted R-squared:  0.8769 
## F-statistic: 100.2 on 15 and 194 DF,  p-value: < 2.2e-16
  • Variance Inflation Factors
vif(mod1)
##             HPI             PCE     JobOpenings             PPI             CPI 
##       22.916956       61.923449       22.157429       43.516928        1.460353 
##             Gas       WorkTrend   JobOfferTrend        JobTrend    WelfareTrend 
##       16.179010       27.399694        5.452429        1.658003        3.584180 
## DepressionTrend    AlcoholTrend      BoredTrend           COVID       Recession 
##        2.861029        5.965220        4.371834        4.378716        1.514133
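
For reference, and assuming vif() here is the function from the car package, the VIF of a single numeric predictor \(x_j\) is obtained by regressing \(x_j\) on all the other predictors and taking the \(R_j^2\) of that auxiliary regression:

\[ VIF_j = \frac{1}{1 - R_j^2} \]

A VIF of 4 therefore corresponds to \(R_j^2 = 0.75\), and the VIF of about 61.9 reported for PCE corresponds to \(R_j^2 \approx 0.984\), meaning that almost all of the variation in PCE is explained by the other predictors.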

In the first model (model 1), most variables (10 of 15) have a VIF above 4; in decreasing order they are PCE, PPI, WorkTrend, HPI, JobOpenings, Gas, AlcoholTrend, JobOfferTrend, COVID and BoredTrend. This suggests that some of these variables should not be included in the model because of high collinearity. The variables with the highest VIFs were also the ones that were highly correlated in the matrix above. The PCE (Personal Consumption Expenditures Index) shows severe multicollinearity, with a VIF above 60. Since we have very similar indexes in our data set, it makes sense to remove the PCE.
Next we compute the Variance Inflation Factors for the new model (model 2).

# Model 2: Full Model - PCE
mod2 <- lm(UNRATE ~ HPI + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod2)
##             HPI     JobOpenings             PPI             CPI             Gas 
##       17.915047       21.968008       26.836192        1.443887       12.575707 
##       WorkTrend   JobOfferTrend        JobTrend    WelfareTrend DepressionTrend 
##       20.293214        5.449336        1.655320        3.458456        2.735920 
##    AlcoholTrend      BoredTrend           COVID       Recession 
##        5.856248        4.268493        3.632208        1.496279

There are still many variables with high VIFs. Below we remove the variable with the highest VIF one at a time, aiming for a simplified model in which every variable has a VIF below 4.

# Model 3: Full Model - PCE - PPI
mod3 <- lm(UNRATE ~ HPI + JobOpenings + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod3)
##             HPI     JobOpenings             CPI             Gas       WorkTrend 
##       17.044158       20.808234        1.293691        2.217710        8.107706 
##   JobOfferTrend        JobTrend    WelfareTrend DepressionTrend    AlcoholTrend 
##        5.443782        1.581495        3.063756        2.730156        5.304130 
##      BoredTrend           COVID       Recession 
##        4.255631        3.228722        1.403928
# Model 4: Full Model - PCE - PPI - JobOpenings
mod4 <- lm(UNRATE ~ HPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod4)
##             HPI             CPI             Gas       WorkTrend   JobOfferTrend 
##        5.373952        1.272638        2.217710        8.093479        4.824244 
##        JobTrend    WelfareTrend DepressionTrend    AlcoholTrend      BoredTrend 
##        1.567550        3.063128        2.727614        5.127353        3.278874 
##           COVID       Recession 
##        2.546438        1.356260
# Model 5: Full Model - PCE - PPI - JobOpenings - WorkTrend
mod5 <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod5)
##             HPI             CPI             Gas   JobOfferTrend        JobTrend 
##        4.967087        1.272174        2.095290        3.333240        1.542669 
##    WelfareTrend DepressionTrend    AlcoholTrend      BoredTrend           COVID 
##        3.039860        2.593271        3.107085        2.538216        2.545564 
##       Recession 
##        1.280639
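
The elimination carried out by hand above can also be written as a loop. This is only a sketch under stated assumptions: it runs on simulated data, the helpers `vif_manual` and `drop_by_vif` are hypothetical names defined here, and the VIF is computed as 1 / (1 - R^2) from auxiliary regressions, which only works for numeric predictors (factors such as COVID and Recession would need separate handling).

```r
# Manual VIF: regress each predictor on the others and use 1 / (1 - R^2).
# `vif_manual` and `drop_by_vif` are hypothetical helper names.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Repeatedly drop the predictor with the largest VIF until all are <= cutoff
drop_by_vif <- function(data, predictors, cutoff = 4) {
  repeat {
    v <- vif_manual(data, predictors)
    if (max(v) <= cutoff || length(predictors) <= 2) break
    predictors <- setdiff(predictors, names(which.max(v)))
  }
  list(kept = predictors, vif = v)
}

# Simulated data: x2 is nearly a copy of x1, x3 and x4 are independent
set.seed(42)
n <- 200
x1 <- rnorm(n)
sim <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.1),
                  x3 = rnorm(n), x4 = rnorm(n))

res <- drop_by_vif(sim, c("x1", "x2", "x3", "x4"))
print(res)
```

An automatic loop like this is convenient, but removing variables one at a time by hand, as we did, leaves room to decide which of two similar indexes is more meaningful to keep.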

As seen above, PCE and PPI were highly collinear with the other traditional predictors, and JobOpenings and WorkTrend were highly collinear with the non-traditional predictors. After removing these four variables, every remaining VIF is below 4 except that of HPI (about 4.97), which stays in the model. Our model after checking for multicollinearity (Model 5) is therefore the following:

\[ UNRATE = \beta_0 + \beta_1*HPI + \beta_2*CPI + \beta_3*Gas + \] \[\beta_4*JobOfferTrend + \beta_5*JobTrend + \beta_6*WelfareTrend + \] \[ \beta_7*DepressionTrend + \beta_8*AlcoholTrend + \beta_9*BoredTrend + \] \[ \beta_{10}*COVID + \beta_{11}*Recession + u\]

  • summary
summary(mod5)
## 
## Call:
## lm(formula = UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + 
##     WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + 
##     COVID + Recession, data = unemployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5093 -0.5045 -0.0531  0.5009  4.1947 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     20.962801   1.877755  11.164  < 2e-16 ***
## HPI             -0.091345   0.005498 -16.613  < 2e-16 ***
## CPI             -0.118898   0.192001  -0.619  0.53646    
## Gas             -0.112260   0.165169  -0.680  0.49751    
## JobOfferTrend    0.014686   0.006933   2.118  0.03539 *  
## JobTrend        -0.016061   0.013063  -1.229  0.22035    
## WelfareTrend    -0.035628   0.011472  -3.105  0.00218 ** 
## DepressionTrend -0.016449   0.009480  -1.735  0.08427 .  
## AlcoholTrend     0.047285   0.013576   3.483  0.00061 ***
## BoredTrend       0.026338   0.006631   3.972 9.98e-05 ***
## COVIDYes         6.132897   0.401566  15.272  < 2e-16 ***
## RecessionYes    -0.007767   0.251883  -0.031  0.97543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9676 on 198 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7954, Adjusted R-squared:  0.784 
## F-statistic: 69.98 on 11 and 198 DF,  p-value: < 2.2e-16
  • Missing points

Our missing values are not missing at random (NMAR): they are related to the data itself, since the last few months of certain indexes have not been released yet. We wanted to include as many months of COVID as possible, because multiple new policies put into effect in the last few months may have a strong impact on the data. We therefore decided to keep the rows with missing values in the data set, since their non-missing values may be very relevant to understanding unemployment during COVID. Note, however, that lm() still leaves these rows out when fitting: the summaries above report 3 observations deleted due to missingness.
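
The difference between keeping incomplete rows in the data set and using them in a fit can be checked directly. The sketch below uses simulated values, not our data; the three missing HPI months are an assumption chosen to mimic unreleased figures.

```r
# Simulated monthly data in which the last three months of one index
# have not been released yet (values are made up for illustration)
set.seed(7)
n <- 24
sim <- data.frame(
  UNRATE = rnorm(n, mean = 6, sd = 1),
  HPI    = rnorm(n, mean = 200, sd = 10),
  Gas    = rnorm(n, mean = 3, sd = 0.3)
)
sim$HPI[(n - 2):n] <- NA

# Rows that have a value for every variable
n_complete <- sum(complete.cases(sim))

# lm() uses na.action = na.omit by default, so incomplete rows are
# dropped at fit time even though they remain in the data frame
fit <- lm(UNRATE ~ HPI + Gas, data = sim)

cat("rows in data:", nrow(sim), "\n")
cat("complete rows:", n_complete, "\n")
cat("rows used by lm():", nobs(fit), "\n")
```

The rows with missing values remain available for plots and for models that do not use the incomplete index, but any model that includes that index is fitted on the complete rows only.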

Validation of Model Assumptions

1) Linear Relationship between y and x1, x2, x3

  • histograms
# Data Set for Model 5
unemployment5 <- read.csv("Data.csv")  %>%
  select(UNRATE,
    HPI=CSUSHPISA,
    CPI = CPALTT01USM657N,
    Gas,
    JobOfferTrend = job.offer...United.States.,
    JobTrend = job...United.States.,
    WelfareTrend = welfare...United.States.,
    DepressionTrend = depression...United.States.,
    AlcoholTrend = alcohol...United.States.,
    BoredTrend = bored...United.States.,
    COVID,
    Recession
  ) 

# One histogram per numeric column (columns 1 to 10; COVID and Recession are left out)
par(mfrow = c(3, 4))
for (i in 1:10) {
  x <- names(unemployment5)[i]
  hist(unemployment5[, i], xlab = x, main = x)
}

From these histograms we can see that most of the variables are fairly evenly distributed, while the Unemployment Rate (UNRATE), JobTrend, HPI and CPI may be slightly skewed.
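
Judging skewness by eye can be backed up with a number. The following sketch computes a moment-based sample skewness on simulated data; `skewness` is a helper defined here (base R does not provide one), and the two simulated columns are assumptions used only to show what a symmetric and a skewed variable look like.

```r
# Moment-based sample skewness; `skewness` is a helper defined here,
# not a base R function
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(3)
sim <- data.frame(
  symmetric    = rnorm(500),
  right_skewed = rexp(500)   # exponential data has a long right tail
)

skew <- sapply(sim, skewness)
print(round(skew, 2))
```

Values near 0 indicate a roughly symmetric distribution, positive values a longer right tail and negative values a longer left tail.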

2) Normality of the random errors “Ei’s”

3) Constant Variance Assumptions for Random Errors

  • qq plots
par(mfrow = c(3, 4))
for (i in 1:10) {
  x <- names(unemployment5)[i]
  # Normal Q-Q plot of each numeric column, with a reference line
  qqnorm(unemployment5[, i], xlab = x)
  qqline(unemployment5[, i])
}

Here we can see that very few variables are skewed: possibly HPI, WelfareTrend and AlcoholTrend are slightly skewed and may need Box-Cox transformations.
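
The Q-Q plots above are drawn for the individual variables, whereas assumptions 2 and 3 concern the random errors of the model. A minimal sketch of the same kind of check applied to the residuals of a fitted model is given below. It uses simulated data and a made-up model, so it illustrates the procedure only and says nothing about Model 5.

```r
# Simulated data standing in for the real predictors
set.seed(11)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
res <- residuals(fit)

# Normal Q-Q plot of the residuals (not of the raw variables)
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Residuals against fitted values: a funnel shape would suggest
# non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Shapiro-Wilk test; a small p-value is evidence against normality
sw <- shapiro.test(res)
print(sw)
```

For our analysis the same calls would be applied to residuals(mod5) rather than to a simulated fit.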

  • skewed variables -> Box Cox transformations

  • pairwise assumptions: Shapiro-Wilk test

  • Residual Errors

  • Fit Models (Check F-test)

  • Shapiro Test

  • More Box Cox?
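
As a sketch of how the Box-Cox and Shapiro steps listed above could fit together, the example below estimates the Box-Cox parameter for a simulated, right-skewed response and compares residual normality before and after transforming. The data are simulated and the lambda grid is an assumption; boxcox() comes from the MASS package and is called as MASS::boxcox so that MASS does not mask dplyr's select().

```r
# Simulated positive, right-skewed response: log(y) is linear in x
set.seed(5)
n <- 200
x <- runif(n, 0, 2)
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.3))

fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values (no plot drawn)
bc <- MASS::boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
cat("lambda with highest log-likelihood:", lambda_hat, "\n")

# Apply the transformation; lambda = 0 corresponds to log(y)
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat
fit_bc <- lm(y_bc ~ x)

# Shapiro-Wilk p-values for the residuals before and after transforming
print(shapiro.test(residuals(fit))$p.value)
print(shapiro.test(residuals(fit_bc))$p.value)
```

Box-Cox requires a strictly positive response, which would need to be checked for any variable we decide to transform.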

4) No influential points / outliers

Simple Linear Regression: use a scatter plot. Multiple Linear Regression: use other tools.

  • Leverage Plots

  • Cook’s Distance

look at points if they are far away

  • Studentized Residuals (graph) Look at the values of the special cases from graphs

  • Partial Residual Plot

  • Multicollinearity

  • Mallows’s Cp

  • BIC

  • AIC

  • Stepwise backward / forward and stepwise forward / backward

  • transform predictors
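
Several of the tools listed above are available in base R. The sketch below shows leverage, Cook's distance, studentized residuals, AIC, BIC and backward stepwise selection on simulated data with one planted unusual observation. The data, the 4 / n rule of thumb and the choice of backward selection are assumptions for illustration; Mallows's Cp and partial residual plots are not covered here.

```r
# Simulated data with one planted unusual observation
set.seed(9)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - sim$x2 + rnorm(n, sd = 0.5)
sim$x1[1] <- 6    # far from the other x1 values (high leverage)
sim$y[1]  <- -10  # and far from the fitted line

fit <- lm(y ~ x1 + x2 + noise, data = sim)

# Influence measures from base R's stats package
lev  <- hatvalues(fit)        # leverage
cook <- cooks.distance(fit)   # Cook's distance
stud <- rstudent(fit)         # externally studentized residuals

# Rule-of-thumb flag (4 / n); other cutoffs are also in use
flagged <- which(cook > 4 / n)
print(flagged)

# Information criteria for the full model
cat("AIC:", AIC(fit), " BIC:", BIC(fit), "\n")

# Backward stepwise selection by AIC; k = log(n) would use BIC instead
fit_step <- step(fit, direction = "backward", trace = 0)
print(formula(fit_step))
```

Flagged points are candidates for a closer look, not for automatic removal: in our data the COVID months are exactly the observations we want to understand.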