Sunday, October 10, 2021

(1) Analysis: Using Traditional and Non-Traditional Predictors to Model Unemployment

Initial Report

Introduction

Motivation

Unemployment affects thousands of people each year and has ramifications in all aspects of life. It is therefore important to measure unemployment so that government institutions can act in time to prevent problems for citizens. Unemployment can also indicate the health of a country's economy and, in turn, of its people, and it is associated with negative effects such as crime and depression. When unemployment persists for long periods, people may become discouraged and never return to the workforce, even when the economy improves. Measuring and predicting unemployment is therefore paramount to the well-being of a society. In the United States, the unemployment rate is published each month by the Bureau of Labor Statistics of the U.S. Department of Labor, which estimates it from the Current Population Survey, a monthly sample survey of households, rather than from counts of unemployment insurance claims.

Predicting unemployment is a complicated endeavor, as it involves multiple factors such as the inflation rate, the labor force, job openings and education. In this analysis we will try to create a model that uses both typical and atypical predictors of unemployment. For example, we will use the CPI (Consumer Price Index), which is known to relate strongly to unemployment and is traditionally used by economists when studying it. We will also use other measures of inflation, as we would like to investigate how strongly inflation predicts unemployment rates. Finally, we will use Google Trends, a website by Google that reports the relative search volume of queries over time. We assume that terms unemployed people might search for include “job”, “job offer”, “bored” or even “alcohol”.

We will also be looking at different states of the economy. With the 2008 recession and the COVID pandemic we have seen many workers lose their jobs, become discouraged, or become too sick to work. Alongside the primary motivation of this analysis, it would be interesting to see how differently these two downturns have affected unemployment.

Research Question

Are traditional measures more successful at predicting unemployment than non-traditional measures?

The Data Set

The data set is composed of 17 different attributes (as shown in Table 1 below) and 213 observations going back to January 2004.

This data set includes monthly observations for each variable. We decided to use a monthly frequency because the unemployment rate is published monthly. Since our unemployment rate is for the United States, we restricted all of our variables to this country.

Predictor Descriptions

Variable Notes
Date Year and Month
UNRATE Unemployment Rate, Percent, Seasonally Adjusted
HPI S&P/Case-Shiller U.S. National Home Price Index, Seasonally Adjusted
PCE Personal Consumption Expenditures, Billions of Dollars, Seasonally Adjusted Annual Rate
JobOpenings Job Openings: Total Nonfarm, Level in Thousands, Seasonally Adjusted
PPI Producer Price Index by Commodity: All Commodities, Not Seasonally Adjusted
CPI Consumer Price Index: Total All Items for the United States, Growth Rate Previous Period, Not Seasonally Adjusted
Gas U.S. All Grades All Formulations Retail Gasoline Prices (Dollars per Gallon)
WorkTrend Frequency of the word ‘Work’ relative to total search volume on Google
JobOfferTrend Frequency of the words ‘Job Offer’ relative to total search volume on Google
JobTrend Frequency of the word ‘Job’ relative to total search volume on Google
WelfareTrend Frequency of the word ‘Welfare’ relative to total search volume on Google
DepressionTrend Frequency of the word ‘Depression’ relative to total search volume on Google
AlcoholTrend Frequency of the word ‘Alcohol’ relative to total search volume on Google
BoredTrend Frequency of the word ‘Bored’ relative to total search volume on Google
COVID If USA was experiencing the COVID pandemic (Yes/No)
Recession If USA was in a recession (Yes/No)

Unemployment Rate (UNRATE)

The unemployment rate is the number of unemployed persons over the age of 16 as a percentage of the labor force.

S&P / Case-Shiller U.S. National Home Price Index (HPI)

The Case-Shiller US National Home Price Index measures the value of single-family housing within the US. It is actually a composite of single-family home price indices.

Personal Consumption Expenditures (PCE)

The PCE series records, in billions of dollars, what consumers in the US spend on a wide range of goods and services. Prices derived from it are a widely used measure of inflation and deflation.

– Describe each predictor: where it is from, what it shows, how each number was calculated

Job Openings

This is the number of job openings, in thousands, available each month in the United States.

Producer Price Index (PPI)

This is a group of indexes that measures the average change in selling prices from domestic production over a month.

Consumer Price Index (CPI)

The CPI measures the weighted average of prices from a select basket of consumer goods and services spanning from transportation to medical care.

Gas

This gives us the retail prices for all grades of gasoline in the US.

COVID

We consider COVID to have affected the American economy from March 2020 until this day.

Recession

Recessions after 2004 include the 2008 recession, which lasted from December 2007 until June 2009 according to the Federal Reserve.
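The two indicator variables described above could be built directly from the dates. The following is a minimal sketch in base R, not the code used for this report: the `dates` vector is generated here rather than read from the data, and its end date is an assumption that follows from 213 monthly observations starting in January 2004.

```r
# Monthly dates; generated here as an assumption (213 months from January 2004)
dates <- seq(as.Date("2004-01-01"), by = "month", length.out = 213)

# Recession: December 2007 through June 2009, inclusive
recession <- ifelse(dates >= as.Date("2007-12-01") &
                      dates <= as.Date("2009-06-01"), "Yes", "No")

# COVID: March 2020 onward
covid <- ifelse(dates >= as.Date("2020-03-01"), "Yes", "No")

indicators <- data.frame(
  Date      = dates,
  COVID     = factor(covid, levels = c("No", "Yes")),
  Recession = factor(recession, levels = c("No", "Yes"))
)

# Count the months flagged by each indicator
table(indicators$Recession)
table(indicators$COVID)
```

Putting "No" first in the factor levels makes it the reference level, which is why the regression output further down reports coefficients named COVIDYes and RecessionYes.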

Types of predictors

Variable Type Levels
Date Numerical N/A
UNRATE Numerical N/A
HPI Numerical N/A
PCE Numerical N/A
JobOpenings Numerical N/A
PPI Numerical N/A
CPI Numerical N/A
Gas Numerical N/A
WorkTrend Numerical N/A
JobOfferTrend Numerical N/A
JobTrend Numerical N/A
WelfareTrend Numerical N/A
DepressionTrend Numerical N/A
AlcoholTrend Numerical N/A
BoredTrend Numerical N/A
COVID Categorical 2: Yes, No
Recession Categorical 2: Yes, No

Summary Statistics

The two scatterplots in graphs 1 and 2 show the relationships among (1) the traditional predictors of unemployment and (2) the non-traditional predictors of unemployment.

While some of the predictors in graph 1 have a very strong linear relationship (for example the HPI, PCE and Job Openings), the variables PPI, CPI and Gas diverge from this pattern. The relationship between Gas and the CPI stands out: since the Consumer Price Index includes gas prices, one would expect a much more linear relationship. We will investigate this further in graph 7. Another interesting observation is the relationship of the CPI with the PPI, PCE and HPI. One would expect the CPI to be more linearly related to the Producer Price Index, since consumers buy goods and services from producers. One would also expect the PCE to be more closely related to the CPI, since it is highly correlated with the PPI. One possible explanation is that our CPI series is a growth rate over the previous period, whereas the other indexes are levels. With such small differences between these indexes, it will be interesting to see which one is the stronger predictor of unemployment.

Graph 2 shows the much more erratic relationships among the Google Trends variables. Unsurprisingly, the searches “Job” and “Job Offer” have a very strong relationship. Another relationship that stands out immediately is the positive one between searches for “Alcohol” and “Work”: the more often “Work” is googled, the more often “Alcohol” is as well. “Bored” and “Depression” have a looser relationship, which may have implications for how we view the link between depression and boredom. Sadly, “Depression” also has a weak positive relationship with “Welfare”. This may mean that as more people become unemployed (as proxied by an increase in searches for “Welfare”), more of them also become depressed.

## Warning: Removed 2 rows containing non-finite values (stat_ydensity).
## Warning: Removed 1 rows containing non-finite values (stat_ydensity).

As graphs 3 and 4 show, COVID was not a typical recession. Whereas job openings decreased during the 2008 recession, they increased during COVID. There may be multiple reasons for this: for example, many construction jobs were lost in 2008, whereas during COVID many people left work for health reasons. This is hard to reconcile with the unemployment rate (graph 4), which rose even more during COVID than during the 2008 recession. Further investigation is needed, as these two recessions affected unemployment very differently.

## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 2 row(s) containing missing values (geom_path).

Graph 5 shows that Personal Consumption Expenditures and the Producer Price Index appear to be highly correlated, since they follow a similar trend, with a significant increase in 2008 and a decrease after 2020. The U.S. National Home Price Index follows them except for the years 2006–2012, owing to the largest crash in global real estate markets in recent history.

GRAPH 6

## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).

– Talk about gas prices and CPI: the CPI includes transportation and therefore gas, so they are highly related; maybe only one can go in our final model

– Talk about missing data (some of the last months were not available yet) and how this can affect our analysis

– Are there any outliers? Yes, the recession

Pre - Analysis

Correlation

As seen in the first correlation matrix, some of the traditional predictors are highly correlated with each other (correlations close to 1). This may create issues for a model that includes all of them. Only after fitting the model and checking its diagnostics will we be able to eliminate the redundant variables present in the first correlation matrix. It is important to note which variables are highly correlated because they may cause collinearity problems later in the model.

The correlation matrix for the non-traditional predictors shows variables that are more weakly related to each other. Some of the least correlated pairs involve “Depression” and “Work”, and “JobOpenings” and “Job”. Interestingly, “Depression” is correlated differently with “Job Offer” and “Job”: both relationships are weak, but the one with “Job Offer” is slightly negative while the one with “Job” is slightly positive. There are also some very strong negative correlations, such as that between “Bored” and “Work”, and some very strong positive correlations, such as that between “Alcohol” and “Work”.
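As an illustration of how such a correlation matrix could be computed and screened for highly correlated pairs, here is a minimal sketch in base R. The data are simulated, the column names only mimic ours, and the 0.8 cutoff is an arbitrary assumption, so none of the numbers correspond to the matrices discussed above.

```r
set.seed(42)
n <- 213

# Simulated stand-ins for four Google Trends series (not the real data)
sim <- data.frame(WorkTrend = rnorm(n, mean = 70, sd = 10))
sim$AlcoholTrend <- 0.8 * sim$WorkTrend + rnorm(n, sd = 4)         # built to correlate positively
sim$BoredTrend   <- 100 - 0.7 * sim$WorkTrend + rnorm(n, sd = 5)   # built to correlate negatively
sim$WelfareTrend <- rnorm(n, mean = 40, sd = 8)                    # unrelated to the others
sim$WorkTrend[211:213] <- NA   # mimic final months that are not yet released

# Use every pair of rows where both variables are observed
corr <- cor(sim, use = "pairwise.complete.obs")
round(corr, 2)

# List each pair (once) whose absolute correlation exceeds the assumed cutoff
high <- which(abs(corr) > 0.8 & upper.tri(corr), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
```

The `use = "pairwise.complete.obs"` argument matters when some months are missing: without it, a single NA in a column turns every correlation involving that column into NA.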

Proposed Model

\[ UNRATE = \beta_0 + \beta_1*HPI + \beta_2*PCE + \beta_3*JobOpenings + \beta_4*PPI + \beta_5*CPI + \beta_6*Gas + \beta_7*WorkTrend + \]

\[ \beta_8*JobOfferTrend + \beta_9*JobTrend + \beta_{10}*WelfareTrend + \beta_{11}*DepressionTrend + \beta_{12}*AlcoholTrend + \]

\[ \beta_{13}*BoredTrend + \beta_{14}*COVID + \beta_{15}*Recession + u \]

  • The Full Model
# Model 1: Full Model 
mod1 <- lm(UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

summary(mod1)
## 
## Call:
## lm(formula = UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + 
##     WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + 
##     AlcoholTrend + BoredTrend + COVID + Recession, data = unemployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4043 -0.3367 -0.0509  0.3030  3.8841 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.7612278  2.5082142   1.500 0.135353    
## HPI             -0.0305594  0.0089167  -3.427 0.000744 ***
## PCE              0.0001388  0.0002021   0.687 0.492954    
## JobOpenings     -0.0012393  0.0001554  -7.977 1.27e-13 ***
## PPI              0.0683890  0.0184877   3.699 0.000282 ***
## CPI              0.2362076  0.1553136   1.521 0.129928    
## Gas             -1.1909642  0.3465234  -3.437 0.000720 ***
## WorkTrend        0.0119620  0.0155582   0.769 0.442912    
## JobOfferTrend    0.0007640  0.0066946   0.114 0.909265    
## JobTrend        -0.0004922  0.0102250  -0.048 0.961654    
## WelfareTrend    -0.0122233  0.0094054  -1.300 0.195279    
## DepressionTrend -0.0043688  0.0075178  -0.581 0.561829    
## AlcoholTrend     0.0129157  0.0142024   0.909 0.364267    
## BoredTrend       0.0311738  0.0065705   4.745 4.05e-06 ***
## COVIDYes         4.4751654  0.3976397  11.254  < 2e-16 ***
## RecessionYes    -0.9210334  0.2067848  -4.454 1.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7306 on 194 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.8857, Adjusted R-squared:  0.8769 
## F-statistic: 100.2 on 15 and 194 DF,  p-value: < 2.2e-16
  • Variance Inflation Factors
library(car)  # assumed source of vif(); the output format below matches car::vif
vif(mod1)
##             HPI             PCE     JobOpenings             PPI             CPI 
##       22.916956       61.923449       22.157429       43.516928        1.460353 
##             Gas       WorkTrend   JobOfferTrend        JobTrend    WelfareTrend 
##       16.179010       27.399694        5.452429        1.658003        3.584180 
## DepressionTrend    AlcoholTrend      BoredTrend           COVID       Recession 
##        2.861029        5.965220        4.371834        4.378716        1.514133

In the first model (model 1), most variables have a VIF greater than 4 (in decreasing order: PCE, PPI, WorkTrend, HPI, JobOpenings, Gas, AlcoholTrend, JobOfferTrend, COVID and BoredTrend). This suggests that some of these variables should not be included in the model because of high collinearity. The variables with the highest VIFs were also the ones with high correlations in the matrices above. The PCE (Personal Consumption Expenditures) showed severe multicollinearity, with a VIF above 60. Since we have several very similar indexes in our data set, it makes sense to remove the PCE first.
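To make the meaning of these numbers concrete: for a numeric predictor, the VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below reproduces this by hand in base R on simulated data. `manual_vif()` is a hypothetical helper written for this illustration; it handles numeric predictors only and is not the function used for the output above.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # nearly a copy of x1, so collinear with it
x3 <- rnorm(n)                  # independent of the other two
y  <- 1 + 2 * x1 - x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(sim, c("x1", "x2", "x3"))
round(vifs, 2)
```

In this simulation x1 and x2 carry almost the same information, so each is predicted very well by the other and both receive a large VIF, while x3 stays close to 1. The same logic explains why closely related price indexes inflate one another's VIF.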
Next, we compute the variance inflation factors for the new model (model 2).

# Model 2: Full Model - PCE
mod2 <- lm(UNRATE ~ HPI + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod2)
##             HPI     JobOpenings             PPI             CPI             Gas 
##       17.915047       21.968008       26.836192        1.443887       12.575707 
##       WorkTrend   JobOfferTrend        JobTrend    WelfareTrend DepressionTrend 
##       20.293214        5.449336        1.655320        3.458456        2.735920 
##    AlcoholTrend      BoredTrend           COVID       Recession 
##        5.856248        4.268493        3.632208        1.496279

There are still many variables with high VIFs. Below we remove the variable with the highest VIF at each step, aiming for a simplified model in which every variable has a VIF below 4.

# Model 3: Full Model - PCE - PPI
mod3 <- lm(UNRATE ~ HPI + JobOpenings + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod3)
##             HPI     JobOpenings             CPI             Gas       WorkTrend 
##       17.044158       20.808234        1.293691        2.217710        8.107706 
##   JobOfferTrend        JobTrend    WelfareTrend DepressionTrend    AlcoholTrend 
##        5.443782        1.581495        3.063756        2.730156        5.304130 
##      BoredTrend           COVID       Recession 
##        4.255631        3.228722        1.403928
# Model 4: Full Model - PCE - PPI - JobOpenings
mod4 <- lm(UNRATE ~ HPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod4)
##             HPI             CPI             Gas       WorkTrend   JobOfferTrend 
##        5.373952        1.272638        2.217710        8.093479        4.824244 
##        JobTrend    WelfareTrend DepressionTrend    AlcoholTrend      BoredTrend 
##        1.567550        3.063128        2.727614        5.127353        3.278874 
##           COVID       Recession 
##        2.546438        1.356260
# Model 5: Full Model - PCE - PPI - JobOpenings - WorkTrend
mod5 <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)

vif(mod5)
##             HPI             CPI             Gas   JobOfferTrend        JobTrend 
##        4.967087        1.272174        2.095290        3.333240        1.542669 
##    WelfareTrend DepressionTrend    AlcoholTrend      BoredTrend           COVID 
##        3.039860        2.593271        3.107085        2.538216        2.545564 
##       Recession 
##        1.280639
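The step-by-step removal in models 2 to 5 could also be automated. The sketch below is an assumption about how that might look, not the code used for this report: it runs on simulated data, `manual_vif()` and `drop_high_vif()` are hypothetical helpers, and only numeric predictors are handled.

```r
set.seed(2)
n <- 200
a <- rnorm(n)   # shared component that makes x1, x2 and x3 collinear
sim <- data.frame(
  x1 = a + rnorm(n, sd = 0.1),
  x2 = a + rnorm(n, sd = 0.1),
  x3 = a + rnorm(n, sd = 0.3),
  x4 = rnorm(n),
  x5 = rnorm(n)
)
sim$y <- 1 + sim$x1 - sim$x4 + rnorm(n)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Repeatedly drop the predictor with the largest VIF until all fall below the threshold
drop_high_vif <- function(data, predictors, threshold = 4) {
  repeat {
    vifs <- manual_vif(data, predictors)
    if (max(vifs) < threshold || length(predictors) <= 2) break
    worst <- names(which.max(vifs))
    message("Dropping ", worst, " (VIF = ", round(max(vifs), 2), ")")
    predictors <- setdiff(predictors, worst)
  }
  predictors
}

kept <- drop_high_vif(sim, paste0("x", 1:5))
final_fit <- lm(reformulate(kept, response = "y"), data = sim)
round(manual_vif(sim, kept), 2)
```

Removing one variable at a time and recomputing is deliberate: as the outputs above show, dropping a single collinear predictor can lower the VIFs of several others at once.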

As seen above, the PCE and PPI were highly collinear with the other traditional predictors, and JobOpenings and WorkTrend were highly collinear with the remaining predictors. After these removals, HPI still has a VIF of about 4.97, slightly above our threshold of 4, while all other VIFs are below it. Our model after checking for multicollinearity (model 5) is therefore the following:

\[ UNRATE = \beta_0 + \beta_1*HPI + \beta_2*CPI + \beta_3*Gas + \]

\[ \beta_4*JobOfferTrend + \beta_5*JobTrend + \beta_6*WelfareTrend + \beta_7*DepressionTrend + \]

\[ \beta_8*AlcoholTrend + \beta_9*BoredTrend + \beta_{10}*COVID + \beta_{11}*Recession + u \]

  • summary
summary(mod5)
## 
## Call:
## lm(formula = UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + 
##     WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + 
##     COVID + Recession, data = unemployment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5093 -0.5045 -0.0531  0.5009  4.1947 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     20.962801   1.877755  11.164  < 2e-16 ***
## HPI             -0.091345   0.005498 -16.613  < 2e-16 ***
## CPI             -0.118898   0.192001  -0.619  0.53646    
## Gas             -0.112260   0.165169  -0.680  0.49751    
## JobOfferTrend    0.014686   0.006933   2.118  0.03539 *  
## JobTrend        -0.016061   0.013063  -1.229  0.22035    
## WelfareTrend    -0.035628   0.011472  -3.105  0.00218 ** 
## DepressionTrend -0.016449   0.009480  -1.735  0.08427 .  
## AlcoholTrend     0.047285   0.013576   3.483  0.00061 ***
## BoredTrend       0.026338   0.006631   3.972 9.98e-05 ***
## COVIDYes         6.132897   0.401566  15.272  < 2e-16 ***
## RecessionYes    -0.007767   0.251883  -0.031  0.97543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9676 on 198 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.7954, Adjusted R-squared:  0.784 
## F-statistic: 69.98 on 11 and 198 DF,  p-value: < 2.2e-16
  • missing points (decide what we need)

Our missing values are not missing at random (NMAR): the missingness is related to the data itself, since the last few months of certain indexes have not been released yet. We wanted to include as many months of COVID as possible, and therefore chose not to eliminate the months with missing data, because multiple new policies put into effect in the last few months may have a strong impact on the data. We therefore keep the rows with missing values in the data set, because their non-missing values may be very relevant to understanding unemployment during COVID.
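One practical consequence is worth illustrating. With R's default settings, `lm()` silently drops any row that has a missing value in a variable used by the model, which is what the line "3 observations deleted due to missingness" in the summaries above reports. The sketch below shows this behaviour on simulated data; the variables are placeholders and not our series.

```r
set.seed(3)
n <- 213
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 5 + sim$x1 - 0.5 * sim$x2 + rnorm(n)
sim$x2[(n - 2):n] <- NA   # pretend the last three months of x2 are unpublished

colSums(is.na(sim))         # missing values per column
sum(!complete.cases(sim))   # rows with at least one missing value

# Default na.action is na.omit: incomplete rows are excluded from this fit
fit_both <- lm(y ~ x1 + x2, data = sim)
nobs(fit_both)

# A model that does not use x2 can still use every row
fit_x1 <- lm(y ~ x1, data = sim)
nobs(fit_x1)
```

Keeping the incomplete rows in the data set therefore only helps plots and models that do not involve the missing variable; any regression that includes it is fitted on the complete rows alone.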

Validation of Model Assumptions

1) Linear Relationship between y and x1, x2, x3

  • histograms
# Data Set for Model 5
library(dplyr)  # provides %>% and select()

unemployment5 <- read.csv("Data.csv") %>%
  select(
    UNRATE,
    HPI = CSUSHPISA,
    CPI = CPALTT01USM657N,
    Gas,
    JobOfferTrend = job.offer...United.States.,
    JobTrend = job...United.States.,
    WelfareTrend = welfare...United.States.,
    DepressionTrend = depression...United.States.,
    AlcoholTrend = alcohol...United.States.,
    BoredTrend = bored...United.States.,
    COVID,
    Recession
  )

# Histograms of the ten numeric columns (COVID and Recession are categorical)
par(mfrow = c(3, 4))
for (i in 1:10) {
  x <- names(unemployment5)[i]
  hist(unemployment5[, i], xlab = x, main = x)
}

From these histograms we can see that most of the variables are fairly evenly distributed. The unemployment rate (UNRATE), JobTrend, HPI and CPI may be slightly skewed.

2) Normality of the random errors “Ei’s”

3) Constant Variance Assumptions for Random Errors

  • qq plots
par(mfrow = c(3, 4))
for (i in 1:10) {
  x <- names(unemployment5)[i]
  qqnorm(unemployment5[, i], xlab = x)
  qqline(unemployment5[, i])
}

Here we can see that very few variables are skewed. HPI, WelfareTrend and AlcoholTrend seem to be slightly skewed and may need Box-Cox transformations.
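As a rough illustration of what a Box-Cox transformation does to a skewed variable, the sketch below picks the power λ by maximising the normal profile log-likelihood over a grid, using base R only. The data are simulated, and `skewness()`, `box_cox()` and `profile_loglik()` are hypothetical helpers written for this example. Note that Box-Cox requires strictly positive values, so a series containing zeros would need shifting first.

```r
set.seed(4)
x <- rlnorm(200, meanlog = 3, sdlog = 0.6)   # right-skewed, strictly positive

# Sample skewness (close to 0 for symmetric data)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Box-Cox transform: log(v) when lambda is 0, (v^lambda - 1) / lambda otherwise
box_cox <- function(v, lambda) {
  if (abs(lambda) < 1e-8) log(v) else (v^lambda - 1) / lambda
}

# Normal profile log-likelihood of lambda, up to an additive constant
profile_loglik <- function(lambda, v) {
  z <- box_cox(v, lambda)
  n <- length(v)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(v))
}

lambdas <- seq(-2, 2, by = 0.05)
loglik  <- sapply(lambdas, profile_loglik, v = x)
best_lambda <- lambdas[which.max(loglik)]

x_bc <- box_cox(x, best_lambda)
c(lambda = best_lambda,
  skew_before = skewness(x),
  skew_after  = skewness(x_bc))
```

A λ near 1 would mean no transformation is needed, while a λ near 0 corresponds to taking logs.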

  • skewed variables -> Box Cox transformations

  • pairwise assumptions Shapiro-Wilk test

  • Residual Errors

  • Fit Models (Check F-test)

  • Shapiro Test

  • More Box Cox?
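Assumptions 2) and 3) concern the random errors of the model, so they are usually checked on the residuals of a fitted model rather than on the raw variables. The sketch below shows one way to do this in base R, assuming simulated data and a placeholder model; it is not a result from our data.

```r
set.seed(5)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value is evidence against normal errors
sw <- shapiro.test(res)
sw$p.value

# Overall F-test of the fitted model (statistic and degrees of freedom)
summary(fit)$fstatistic

par(mfrow = c(1, 2))
# Normal QQ plot of the residuals
qqnorm(res)
qqline(res)
# Residuals vs fitted values: a funnel shape would suggest non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The residual summaries above (minimum of -6.5 and maximum of 4.2 for model 5) suggest that a few months sit far from the fitted values, which is the kind of pattern these plots would make visible.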

4) No influential points / outliers

Simple Linear Regression: Use Scatter Plot
Multiple Linear Regression: Other tools

  • Leverage Plots

  • Cook’s Distance

look at points if they are far away

  • Studentized Residuals (graph) Look at the values of the special cases from graphs

  • Partial Residual Plot

  • Multicollinearity

  • Mallows’s Cp

  • BIC

  • AIC

  • Stepwise backward / forward and stepwise forward / backward

  • transform predictors
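Several of the checks listed above are available in base R. The following sketch shows how they might be applied, under stated assumptions: the data are simulated with one planted outlier, the model is a placeholder, and the cutoffs (Cook's distance above 4/n, absolute studentized residual above 3) are common rules of thumb rather than choices made in this report.

```r
set.seed(6)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)   # x3 has no real effect
sim$y[1] <- sim$y[1] + 10                     # plant one outlier in row 1

full <- lm(y ~ x1 + x2 + x3, data = sim)

# Influence diagnostics
cooks <- cooks.distance(full)   # Cook's distance
stud  <- rstudent(full)         # externally studentized residuals
lev   <- hatvalues(full)        # leverage
flagged <- which(cooks > 4 / n | abs(stud) > 3)
flagged

# Information criteria for the full model
c(AIC = AIC(full), BIC = BIC(full))

# Backward stepwise selection by AIC (k = 2) and by BIC (k = log(n))
by_aic <- step(full, direction = "backward", trace = 0)
by_bic <- step(full, direction = "backward", k = log(n), trace = 0)
formula(by_aic)
formula(by_bic)
```

Flagged rows are candidates for a closer look, not automatic deletions. In our setting the COVID months are likely to be flagged, and they are precisely the observations we want to understand.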