1. Abstract
The rise and fall of unemployment during the COVID pandemic has created a peculiar economic environment. It is in this environment that we consider traditional and non-traditional measures for predicting unemployment. For traditional measures we use indexes that previous research has shown to relate to unemployment. For non-traditional measures we use time series data for a selection of relevant search terms from Google Trends. We find that the strongest model for predicting unemployment combines traditional and non-traditional measures. This has implications for governmental bodies and other interested parties and their methodologies for measuring real unemployment.
2. Introduction
2.1 Overview
Unemployment affects millions of people each year and has ramifications for all aspects of life. It is therefore important to find ways to measure unemployment so that government institutions can act in time to prevent problems for citizens. Unemployment can indicate the health of a country's economy and, in turn, of its people. Unemployment is also associated with negative outcomes such as crime and depression. During extended bouts of unemployment, people may become discouraged and never return to the workforce even once the economy improves, something we may have been seeing during the pandemic. Measuring and predicting unemployment is therefore paramount to the well-being of a society. The unemployment rate is measured each month by the Bureau of Labor Statistics of the U.S. Department of Labor, which estimates it from the Current Population Survey, a monthly survey of households.
2.2. Data Collection
Predicting unemployment is a complicated endeavor, since it involves multiple factors such as the inflation rate, the labor force, job openings, and education. In this analysis we will try to create a model that uses both typical and atypical predictors of unemployment. For example, we will use the CPI (Consumer Price Index), which is known to relate strongly to unemployment and is traditionally used by economists when estimating the true rate of unemployment. We will also use other measures of inflation, as we would like to investigate the strength of inflation in predicting unemployment rates. All of these measures come from the Federal Reserve Economic Data (FRED) of the St. Louis Federal Reserve. We will also use Google Trends, a Google service that reports the relative search volume of queries over time. We assume that typical words unemployed people would search for include "job", "job offer", "welfare", "bored", or even "alcohol".
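The Trends series can, in principle, also be pulled programmatically. Below is a minimal sketch, assuming the third-party gtrendsR package; our data was exported from the Google Trends website itself.
```{r eval=FALSE}
# Sketch: monthly relative search volume for "job" in the US since 2004.
# Assumes the gtrendsR package; time = "all" requests the full history (2004 onward).
library(gtrendsR)
trend_job <- gtrends(keyword = "job", geo = "US", time = "all")
head(trend_job$interest_over_time) # columns include date, hits (0-100), keyword, geo
```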
We will also look at different states of the economy. With the 2008 recession and the COVID pandemic we have seen many workers lose their jobs, become discouraged, or become too sick to work. Alongside the primary motivation of this analysis, it would be interesting to see the different effects these two recessions have had on unemployment. This is why we will create dummy variables to examine the strength of such effects.
2.3. The Data Set
The data set is composed of 17 attributes and 213 observations spanning back to January 2004. Of these attributes, 7 are traditional, 7 are non-traditional, 2 are dummy variables, and 1 is the date of the observation. Our data set goes back to 2004 because that is when Google Trends data begins. We decided to use a monthly frequency because the unemployment rate is published monthly. Since we have the unemployment rate for the United States, we framed all of our variables around this country only.
In the following list, the full name of each variable is given, followed by its name in our data set, shortened for coding convenience:
• Unemployment Rate (UNRATE): The unemployment rate is the number of unemployed persons aged 16 and over as a percentage of the labor force.
• S&P/Case-Shiller U.S. National Home Price Index (HPI): The Case-Shiller U.S. National Home Price Index measures the value of single-family housing within the U.S. It is a composite of single-family home price indices.
• Personal Consumption Expenditures (PCE): The PCE reflects changes in the prices of a wide range of goods and services purchased by consumers in the U.S. This is a good measure of inflation and deflation.
• Producer Price Index (PPI): A group of indexes representing the average monthly movement in selling prices received by domestic producers.
• Job Openings: The number of job openings available each month in the United States, in thousands.
• Consumer Price Index (CPI): The CPI measures the weighted average of prices from a select basket of consumer goods and services, spanning from transportation to medical care.
• Gas: The retail price for all grades of gasoline in the U.S.
• Trends (Work, Job Offer, Job, Welfare, Depression, Alcohol, and Bored): Each trend gives the relative frequency of Google searches for the specific word compared to the overall volume of Google searches in that month.
• COVID: We consider COVID to have affected the American economy from March 2020 through the present. This is the first of the dummy variables, with levels "Yes" and "No" (see the sketch after this list).
• Recession: Recessions after 2004 include the 2008 recession, from December 2007 until June 2009 according to the Federal Reserve. This is the second of the dummy variables, with levels "Yes" and "No".
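As a minimal sketch of how these two dummy variables could be built from the date column (assuming a Date-typed `Date` column and the period definitions above):
```{r eval=FALSE}
library(dplyr)

# Sketch: derive the COVID and Recession dummies from the observation date.
unemployment <- unemployment %>%
  mutate(
    COVID = ifelse(Date >= as.Date("2020-03-01"), "Yes", "No"),
    Recession = ifelse(Date >= as.Date("2007-12-01") &
                       Date <= as.Date("2009-06-30"), "Yes", "No")
  )
```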
3. Preliminary Data Analysis
For our initial analysis we used various graphs to see how all these variables could be related. One thing that stood out is that when visualizing the number of job openings and the unemployment rate (our response variable), we can see some very stark differences between the 2008 recession and COVID. As seen in Graph 1 below, traditional recessions have a small number of job openings because the economy contracts. During COVID, for reasons economists are still studying, the number of job openings increased drastically despite the high number of unemployed people. The violin graph shows that, according to our data set, unemployment was on average higher during COVID. This means we should make sure both of these variables are included in our model.
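A violin plot of this kind can be sketched as follows, assuming ggplot2 and the COVID dummy described in the previous section:
```{r eval=FALSE}
library(ggplot2)

# Sketch: distribution of the unemployment rate inside vs. outside the COVID period.
ggplot(unemployment, aes(x = COVID, y = UNRATE)) +
  geom_violin() +
  labs(x = "COVID period", y = "Unemployment rate (%)")
```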
3.1. Multicollinearity
Many of these indexes are highly related, which may create issues for a model that includes all of them. It is important to note which variables are highly correlated, because they may cause collinearity problems further on in the model. This is why we decided to check the VIF before doing any additional transformations of the variables. After checking for collinearity we find that we need to remove Personal Consumption Expenditures, the Producer Price Index, Job Openings, and the Work trend. This implies that the variables remaining in our model were highly related to the ones we removed. Figure 1 shows the VIFs for the full model.
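The code appendix performs this pruning by hand, one predictor at a time. The same procedure can be automated; below is a sketch assuming car::vif() and the VIF < 4 cutoff used in the appendix, where prune_vif() is a hypothetical helper rather than a package function.
```{r eval=FALSE}
library(car)

# Sketch: repeatedly drop the highest-VIF predictor until all VIFs fall below the cutoff.
prune_vif <- function(model, cutoff = 4) {
  repeat {
    v <- vif(model)
    # vif() returns a matrix of GVIFs when the model contains factors
    vals <- if (is.matrix(v)) v[, 1] else v
    if (max(vals) < cutoff) return(model)
    worst <- names(which.max(vals))
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
}

mod5_auto <- prune_vif(mod1) # should arrive at Model 5 from the appendix
```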
Once we look at the histograms (Figure 2 in the appendix) we can see that most of the variables are fairly evenly distributed. The Unemployment Rate (UNRATE), JobTrend, HPI, and CPI may be slightly skewed. The CPI is positively skewed because monetary policy keeps it that way as a way to combat unemployment. This is why the UNRATE and the CPI are skewed in mirrored directions. In the QQ plots (Figure 3) we can see that very few variables are skewed; the HPI, JobTrend, and AlcoholTrend seem slightly skewed and may need Box-Cox transformations.
3.2. Initial Model
After eliminating multicollinearity, our proposed model becomes:
UNRATE = β0 + β1(HPI) + β2(CPI) + β3(Gas) + β4(JobOfferTrend) + β5(JobTrend) + β6(WelfareTrend) + β7(DepressionTrend) + β8(AlcoholTrend) + β9(BoredTrend) + β10(COVID) + β11(Recession) + ε
This model is already a good predictor of unemployment, since the adjusted R-squared is 0.784, but it can be significantly improved.
4. Model Diagnosis
4.1. Model Assumptions
First, we checked for a linear relationship between the response variable and the predictors. Looking at the scatterplots in Figure 4, the response variable UNRATE seems to have linear relationships with many variables, but also many non-linear relationships; we might need to transform some of these variables. Then we check for normality and constant variance: the residual plot does not show random variation, whereas the QQ plot is close to a straight line (Figure 5). For our final assumptions to be satisfied, we may have to transform the response variable and check for any possible outliers or influential points.
Finally, our outliers are marked by the COVID pandemic. After investigating them we find that the outlying months are March, April, and May 2020. Given the importance of these months to our motivation, we do not find it appropriate to eliminate them. We believe that by including the dummy variables COVID and Recession, the leverage these points have on our analysis will be mitigated. Regardless, the leverage we observe is still fairly low.
4.2. Missing Points
Our missing values are not missing at random (NMAR); the missingness is related to the data itself, since the last few months of certain indexes have not been released yet. We wanted to include as many months of COVID as possible, and therefore did not feel the need to eliminate the months with missing data; multiple new policies put into effect in recent months may have a strong impact on the data. We therefore decided to keep the rows with missing values in the data set, because their non-missing values may be very relevant to understanding unemployment during COVID.
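A quick sketch of how this pattern can be checked (column names as in our data set):
```{r eval=FALSE}
# Sketch: count missing values per column, then list the incomplete months;
# under our NMAR explanation these should cluster at the end of the series.
colSums(is.na(unemployment))
unemployment$Date[!complete.cases(unemployment)]
```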
4.3. Transformations
Before we fit a model, we must first make some adjustments to the data. As previously mentioned, the histograms of the predictors JobTrend and CPI raise some concern, as they are slightly skewed. In hopes of making their distributions look more normal, we apply a log transformation. Additionally, the Box-Cox transformation for the Unemployment Rate (Figure 6) suggests that the relation is not exactly linear; we should try the inverse square of UNRATE, i.e. (UNRATE)^(-2). We can see in Figure 7 that after the transformation the residual plot looks random. This is definitely an improvement over the previous model, so we will use these transformations in the models we fit in the following sections.
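As a sketch of how this choice of exponent can be read off programmatically, assuming MASS::boxcox() and the pruned model mod5 from the appendix:
```{r eval=FALSE}
library(MASS)

# Sketch: profile the Box-Cox log-likelihood over lambda and take the maximizer.
bc <- boxcox(mod5, lambda = seq(-3, 3, 0.1), plotit = FALSE)
bc$x[which.max(bc$y)] # a value near -2 motivates the (UNRATE)^(-2) response
```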
Our new model looks like:
(UNRATE)^(-2) = β0 + β1(HPI) + β2 log(CPI) + β3(Gas) + β4(JobOfferTrend) + β5 log(JobTrend) + β6(WelfareTrend) + β7(DepressionTrend) + β8(AlcoholTrend) + β9(BoredTrend) + β10(COVID) + β11(Recession) + ε
This model not only does a better job of satisfying the assumptions, but also improves the adjusted R-squared to 0.9419.
5. Model Selection
5.1. Stepwise Selection
We decided that BIC selection does not work for us because it penalizes larger models more heavily, and we have a model with many predictors. We chose AIC because our goal was to find the model that best predicts the unemployment rate, not to find the best predictors of it. We ran AIC selection on our model*, and the model with the smallest AIC is:
(UNRATE)^(-2) = β0 + β1(HPI) + β2(JobOfferTrend) + β3(JobTrend) + β4(DepressionTrend) + β5(AlcoholTrend) + β6(COVID) + ε
*Due to errors caused by the missing values, we removed the log transformations for the stepwise model selection.
5.2. Proposed Model
(UNRATE)^(-2) = β0 + β1(HPI) + β2(CPI) + β3(JobOfferTrend) + β4(JobTrend) + β5(DepressionTrend) + β6(AlcoholTrend) + β7(COVID) + ε
This model yields an adjusted R-squared of 0.928, as seen below. It is important to note that we decided to keep the predictors from the second-to-last step of the AIC stepwise selection. This way we have a model that is more balanced between non-traditional and traditional predictors, without a big impact on the adjusted R-squared. This model performs well in terms of both fit and prediction.
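A minimal sketch of this final fit (variable names as in our cleaned data set; the adjusted R-squared value is the one reported above):
```{r eval=FALSE}
# Sketch: fit the proposed model and extract its adjusted R-squared.
mod_final <- lm((UNRATE)^(-2) ~ HPI + CPI + JobOfferTrend + JobTrend +
                  DepressionTrend + AlcoholTrend + COVID, data = unemployment)
summary(mod_final)$adj.r.squared # reported as 0.928 in the text
```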
Figure 8. Final Model Summary
6. Conclusion
After comparing models with only one type of predictor or the other, we found that the model which includes both traditional and non-traditional predictors has the highest adjusted R-squared. This answers our research question: a model with both is the most effective.
The model that we created could be varied by using a different number and variety of indexes. It would also be very interesting to see what other words could be used to predict unemployment. Since Google Trends data only begins in 2004, other methods would be needed to look at word usage before 2004. One suggestion would be to take a basket of books and see how often certain words or phrases were used in them. This could expand our model beyond 2004 and has implications for external validity. A further suggested analysis would be to use this model for an out-of-sample test in the coming months; by looking at what happens with unemployment and our predictors in the rest of 2021, we could confirm the true strength of our model.
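Such an out-of-sample check could be sketched as follows, assuming a Date-typed `Date` column and the final model above; the split date is illustrative:
```{r eval=FALSE}
# Sketch: fit on data through 2020, predict the 2021 months, and back-transform
# from the (UNRATE)^(-2) scale via UNRATE = y^(-1/2).
train <- subset(unemployment, Date < as.Date("2021-01-01"))
test  <- subset(unemployment, Date >= as.Date("2021-01-01"))
fit <- lm((UNRATE)^(-2) ~ HPI + CPI + JobOfferTrend + JobTrend +
            DepressionTrend + AlcoholTrend + COVID, data = train)
pred_unrate <- predict(fit, newdata = test)^(-1/2) # invert the transformation
```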
The analysis we created has various implications for the way we predict unemployment, namely expanding the types of predictors one might use to include unusual ones such as those explored here. When we use both types of predictors we have a better model for predicting our measure of unemployment, and this may be an even stronger measure of true unemployment. There are also implications for the strength of Google search queries and their targeted advertising; for example, the government could use such ads to target those who may be looking for a job. These different forms of predictors can serve as sentinels that alert us to periods of higher or lower unemployment.
7. Code
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(car)
library(formatR)
library(MASS)
library(faraway)
library(leaps)
library(Rcmdr)
```
```{r CLEAN, include= FALSE}
#load data into R
unemployment <- read.csv("Data.csv") %>%
dplyr::select(Date = Month,
UNRATE,
HPI=CSUSHPISA,
PCE,
JobOpenings = JTSJOL,
PPI = PPIACO,
CPI = CPALTT01USM657N,
Gas,
WorkTrend = work...United.States.,
JobOfferTrend = job.offer...United.States.,
JobTrend = job...United.States.,
WelfareTrend = welfare...United.States.,
DepressionTrend = depression...United.States.,
AlcoholTrend = alcohol...United.States.,
BoredTrend = bored...United.States.,
COVID,
Recession
)
```
- The Full Model
```{r}
# Model 1: Full Model
mod1 <- lm(UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend +
WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
summary(mod1)
```
- Variance Inflation Factors
```{r}
vif(mod1)
```
```{r}
# Model 2: Full Model - PCE
mod2 <- lm(UNRATE ~ HPI + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend +
DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod2)
```
There are still many variables with high VIFs. Below we remove the variables with the highest VIFs until we reach a simplified model in which all variables have a VIF < 4.
```{r}
# Model 3: Full Model - PCE - PPI
mod3 <- lm(UNRATE ~ HPI + JobOpenings + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend +
DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod3)
# Model 4: Full Model - PCE - PPI - JobOpenings
mod4 <- lm(UNRATE ~ HPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend +
AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod4)
# Model 5: Full Model - PCE - PPI - JobOpenings - WorkTrend
mod5 <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend +
BoredTrend + COVID + Recession, data =unemployment)
vif(mod5)
```
```{r}
# Data Set for Model 5
unemployment5 <- read.csv("Data.csv") %>%
dplyr::select(UNRATE,
HPI=CSUSHPISA,
CPI = CPALTT01USM657N,
Gas,
JobOfferTrend = job.offer...United.States.,
JobTrend = job...United.States.,
WelfareTrend = welfare...United.States.,
DepressionTrend = depression...United.States.,
AlcoholTrend = alcohol...United.States.,
BoredTrend = bored...United.States.,
COVID,
Recession
)
par(mfrow=c(3,4))
for(i in c(1:10)){
x <- names(unemployment5)[i]
hist(unemployment5[,i],xlab=x,main=x)
}
```
- QQ Plots
```{r}
par(mfrow=c(3,4))
for(i in c(1:10)){
x <- names(unemployment5)[i]
qqnorm(unemployment5[,i],xlab=x)
qqline(unemployment5[,i])
}
```
- Box-Cox Transformations
```{r}
par(mfrow=c(2,2))
for(i in c(6)){ # column 6 of unemployment5 is JobTrend
x <- names(unemployment5)[i]
boxcox(I(unemployment5[,i]+0.001)~1,xlab=x)### add 0.001 to make all values positive
}
```
```{r}
par(mfrow=c(2,2))
for(i in c(2:10)){
x <- names(unemployment5)[i]
plot(unemployment5[,i],unemployment5$UNRATE,xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(1,2))
for(i in c(11,12)){
x <- names(unemployment5)[i]
boxplot(unemployment5$UNRATE~unemployment5[,i],xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(2,2))
plot(mod5)
shapiro.test(residuals(mod5))
```
The residual plot does not seem random, and the QQ plot is close to a straight line. There might be some influential points or outliers.
```{r}
boxcox(mod5, lambda = seq(-3, 3, 0.1)) # fine grid so the optimal lambda is visible
```
The graph suggests that the relation is not exactly linear. We should try (UNRATE)^(-2).
```{r}
model.inv <- lm((UNRATE)^(-2) ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend +
AlcoholTrend + BoredTrend + COVID + Recession, data = unemployment)
plot(model.inv)
summary(model.inv)
```
- Leverage Plots / Cook's distance
```{r}
par(mfrow=c(1,2))
lev <- hatvalues(mod5) #extract the leverage values
labels <- row.names(unemployment)
halfnorm(lev,labs=labels,ylab="Leverages",xlim=c(0,3))
cook <- cooks.distance(mod5)#find Cook's distance
halfnorm(cook,labs=labels,ylab="Cook's distance",xlim=c(0,3))
```
Observations 189 and 195 are outliers. We now investigate why.
(189) 2019-09
(195) 2020-03
```{r}
outliers <- unemployment5[c(189,195), ]
head(outliers)
summary(unemployment5)
```
- Studentized Residuals (graph)
Look at the values of the special cases from graphs
```{r}
###Extract studentized residuals from the fitted model
studres <- rstudent(mod5)#get studentized residuals
range(studres)
out.ind <- which(abs(studres)>3) #looking at residuals
summary(unemployment5)
unemployment5[out.ind,]
```
- Partial Residual Plots
```{r}
# creating a model without categorical variables
mod5cont <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend
+ BoredTrend , data =unemployment)
par(mfrow=c(3,3)) # nine continuous predictors, so a 3x3 grid
termplot(mod5cont,partial.resid=TRUE,pch=16)
```
- TRANSFORMATIONS
```{r}
# Log-transforming the CPI, because its distribution was skewed; applying the Box-Cox transformation to the response
mod6 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend +
AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod6)
```
```{r}
# Logging the JobTrend
mod7 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend +
AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod7)
```
- Multicollinearity
```{r}
vif(mod7)
```
The VIFs of "HPI" and "AlcoholTrend" are a bit high (5.38 and 5.18), but all other VIFs seem fine. We can try a model selection technique to remove some of the predictors and see if that resolves the multicollinearity.
- Mallows' Cp
```{r}
b <- regsubsets((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend
+ AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment,nvmax=11)
rs <- summary(b)
par(mfrow=c(1,2))
plot(1:11,rs$cp,ylab="Mallows' Cp",xlab="No. of Predictors",type="l",lwd=2)
plot(1:11,rs$bic,ylab="BIC",xlab="No. of Predictors",type="l",lwd=2)
```
```{r}
# Best model selected by Mallow's cp
rs$which[which.min(rs$cp),]
mod7.cp = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + log(JobTrend) + DepressionTrend + AlcoholTrend + BoredTrend +
COVID + Recession , data =unemployment)
```
```{r}
#Best model selected by BIC
rs$which[which.min(rs$bic),]
mod7.bic = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + WelfareTrend + DepressionTrend + AlcoholTrend + COVID, data
=unemployment)
```
```{r}
summary(mod7.cp)
vif(mod7.cp)
```
```{r}
summary(mod7.bic)
vif(mod7.bic)
```
- AIC
```{r}
#using the model without log transformations because the missing values produced NaN errors
mod5.7 =lm((UNRATE)^(-2) ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend +
AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
```
```{r}
##Stepwise backward selection based on AIC
step_b <- stepAIC(mod5.7, trace = TRUE, direction= "backward")
##Stepwise forward selection based on AIC (note: starting from the full model with no
##larger scope, forward selection has nothing to add, so this returns mod5.7 unchanged)
step_f <- stepAIC(mod5.7, trace = TRUE, direction= "forward")
##Stepwise both ways selection based on AIC
step_both <- stepAIC(mod5.7, trace = TRUE, direction= "both")
coef(mod5.7)
```