
Wednesday, December 8, 2021

Improving the Predictive Modeling Selection Process Using Lean Tools and Methods

Introduction:

This investigation looks at the process of creating a linear regression model in RStudio for a random data set. The process name, mission, and definitions are identified, followed by a flowchart and metrics. Then, several Deming-based Lean Six Sigma tools are applied, namely 5S, Total Productive Maintenance (TPM), Quick Changeovers (SMED), and Mistake Proofing (Poka Yoke). Theoretical areas for improvement and the corresponding methodology are described to optimize the key CTQs: cycle time, number of errors, and run time of the entire process.

I. Naming the process and describing its mission
Process Name: Identifying and applying a regression model to any data set with a continuous response variable
Process Mission: Finding a best-fit model by eliminating unnecessary steps, reducing process complexity, and decreasing the amount of time needed to find such a model.

II. Mission of the process
Mission Statement: Improving the process of my statistical analysis to become a better business analyst

III. Flowchart and dashboard of the process’ objectives and metrics

Dashboard
Strategy: Lean Six Sigma Tools and Methods for Process Improvement

| Objective | Metric |
| --- | --- |
| Reducing the time between accessing the data set and creating a best-fit model | Cycle time from start to end |
| Reducing the number of errors in the code | The number of times the program reports an error |
| Reducing the number of steps needed to obtain the optimal model | The number of lines of code needed to find the final model |
| Reducing the amount of time it takes to run all of the code | After completing the code, the number of seconds it takes to produce a document from the script |

Model Selection Flowchart

IV. Operationally Define Each Metric

Each CTQ is defined below, along with its definition of a defect and its opportunities for defects.

Cycle Time
Definition: Cycle time is the amount of time between opening a clean data set and finding the best-fitting model for that data set.
Definition of a defect: For an efficient process it should take less than 2 hours to find a model for a data set with 20 variables, and every 10 additional variables adds about 0.5 hours of cycle time. Optimal cycle time (in hours) is therefore given by:

Cycle Time ≤ 2 + (n − 20) × 0.05,

where n is the number of variables in the data set. For example, a data set with 40 variables should take at most 2 + (40 − 20) × 0.05 = 3 hours. A cycle time that does not satisfy the inequality above is a defect.
Opportunities for defects:
• Erroneous code
• Inefficient code

Number of Errors
Definition: The number of times the output of a line of code reports an error (see Figure 1 below for an example).
Definition of a defect: For an efficient process the script should not include any errors, and the code used to build a model should be applicable to any data set without error. Number of Errors = 0; any error is a defect in our process.
Opportunities for defects:
• Program needs an update
• Wi-Fi connection lost
• Typing errors

Run Time
Definition: Run time is the amount of time it takes to run all the lines of code one after the other to produce the best-fitting model.
Definition of a defect: For an efficient process the program should be able to run all of the code and produce an output in less than 1 minute, with an additional minute for every additional 100,000 observations in the data set. Optimal run time (in minutes) is given by:

Run Time ≤ 1 + (n − 100,000) × 0.00001,

where n is the number of observations in the data set. For example, a data set with 300,000 observations should run in at most 1 + 200,000 × 0.00001 = 3 minutes. A run time that does not satisfy the inequality above is a defect.
Opportunities for defects:
• Computer memory
• Extra packages installed in the program
• Wi-Fi speed
• Inefficient code

Figure 1

V. Using 5S, TPM, Quick Changeovers (SMED) and Mistake Proofing (Poka Yoke) to improve our model selection process.

5S

Seiri: Elimination of unnecessary packages and data sets (waste)
Most analyses require multiple statistical tools that come in add-on packages installed prior to running the code. Many of these are unnecessary, and downloading them may create lag in the run time or produce errors in the code. This happens because unwanted packages override commands in the packages that are needed to perform the analysis, resulting in errors and increased cycle time. Previous data sets already loaded in the program can also result in higher run times, so it is important to identify unnecessary data that is already loaded in the program.

Figure 2 shows the red tagging process for the first ten packages installed. We can remove these packages to reduce our run time.

Figure 3 and Figure 4 show the environment with previous data sets, before and after red tagging.

Figure 3 Figure 4

Figure 2
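
A minimal sketch of how the red-tagged items could be removed from within R; the package and object names below are hypothetical placeholders rather than the ones shown in the figures:

```{r}
# Detach red-tagged add-on packages that are loaded but not needed
unused_pkgs <- c("package:formatR", "package:Rcmdr")   # hypothetical red-tagged packages
for (pkg in unused_pkgs) {
  if (pkg %in% search()) detach(pkg, unload = TRUE, character.only = TRUE)
}

# Remove previously loaded data sets that are no longer needed
stale_objects <- c("old_data", "practice_df")          # hypothetical red-tagged objects
rm(list = intersect(stale_objects, ls()))
```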

Seiton: Keeping a clean coding space
We can organize the code so that it is easy to read in multiple ways:

(1) Using "chunks" to contain code and naming them according to the step they correspond to, so that they can easily be viewed and accessed by the user. Figure 4 shows lines of code contained in a "chunk" delimited by the symbols " ``` " and " ``` ". The chunk is named "r histograms" to indicate that we are at Step 5 of our process.

Figure 4

(2) Using the same format for the entire code. In our program the arrow symbol (<-) and the equal sign (=) are interchangeable for assignment. However, it is much easier to read code when we use a standardized symbol system. In Figure 5, even though all the lines have the same function, lines 4 and 5 have a much cleaner and more organized look than lines 1 and 2. We want all lines to be standardized like lines 4 and 5 so that it is much easier to spot mistakes and read the code (see the sketch after Figure 5).

Figure 5
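
A minimal sketch of a chunk that follows both conventions above, named after its process step and using "<-" consistently (the data set and variable names are hypothetical):

```{r histograms}
# Step 5: histograms for the numeric variables (chunk named after the process step)
# Standardized style: always "<-" for assignment, with spaces around operators
my_data  <- read.csv("data.csv")          # hypothetical data set
num_cols <- sapply(my_data, is.numeric)
par(mfrow = c(3, 4))
for (v in names(my_data)[num_cols]) {
  hist(my_data[[v]], main = v, xlab = v)
}
```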

Seiso: Cleaning our workspace (laptop)
The biggest factor that affects speed is RAM (Random Access Memory), a laptop's short-term memory. When many applications are open on a computer, more of its short-term memory is used, slowing down overall performance. We want the majority of the memory to be dedicated to the application we are using. To clear memory, in phase 1 we can access the short-term memory through the system's task manager and click the X at the top, circled in Figure 6. This reduces one of the opportunities for defects in the run time mentioned above. In phase 3, after cleaning the computer, we can proceed to clean the program itself, as shown in Figure 7, by using the broom function circled in red. This eliminates the slow-down from previous build-up in the program and addresses the problem at its root cause.

Figure 6

Figure 7
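
A minimal sketch of the same clean-up done from code instead of the broom button, using two base R calls:

```{r}
# Clear the program's workspace (equivalent in spirit to the broom in Figure 7)
rm(list = ls())   # remove all objects from the global environment
gc()              # run garbage collection to free unused memory
```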

Seiketsu: Developing best practices
The processes above can be automated rather than left to the analyst's discretion at the beginning of each session. To clear computer memory, shutting down the computer rather than putting it to sleep will clear all RAM. Upon closing, the program offers the option to either save or not save the "workspace image" (seen in Figure 8). By choosing "don't save", one prevents clutter from forming for future use. If these two best practices are implemented after every use, the process becomes standardized.

Figure 8
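
A minimal sketch of ending a session with a clean, unsaved workspace from code, mirroring the "don't save" choice in Figure 8:

```{r}
# End the session without saving a workspace image, so no clutter carries over
rm(list = ls())       # leave nothing behind in the environment
quit(save = "no")     # equivalent to choosing "don't save" at the prompt
```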

Shitsuke: Self-discipline
Through the automation of the processes above we can reduce the amount of work each analyst needs to do. The reminders to shut down and to save or discard the workspace image appear automatically when a file is closed or left unattended for a long time, providing an extrinsic reminder to clean the workspace. Hopefully, returning to a clean workplace will provide intrinsic motivation, and a reminder at the beginning of each session, to keep shutting down and clearing the workspace image going forward.

Total Productive Maintenance

Jishu Hozen

Operators of the code can be more involved with finding the optimal model by learning and understanding what each step means. Through this, coders are better equipped to diagnose errors when they occur. There are many steps involved in the flowchart, and knowing why one comes after the other and how each relates to the others is very important. Using Jishu Hozen, one could run each individual section out of order to see whether each part works and to learn more about how the parts relate to one another. If one line of code does not work, the analyst will be able to understand why, rather than having to examine the whole process to figure out the problem.

(1) Breakdown Maintenance

We do not want breakdown maintenance to occur because it will increase the cycle time of the process. However, steps should be identified to move through this type of maintenance quickly. The list below gives the first five reactions one should have to an error in the code; a minimal R sketch of some of these checks follows the list.

1. Check for syntax errors in the code
2. Check for missing packages needed to run the code
3. Check for missing data
4. Check for mismatched vector lengths
5. Check for previous examples of the error
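
A minimal R sketch of checks 2 through 4 from the list above; the file and column names are hypothetical placeholders:

```{r}
# 2. Check for missing packages needed to run the code
needed <- c("dplyr", "MASS")                               # example package list
missing_pkgs <- needed[!sapply(needed, requireNamespace, quietly = TRUE)]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# 3. Check for missing data
my_data <- read.csv("data.csv")                            # hypothetical data set
colSums(is.na(my_data))                                    # missing values per column

# 4. Check that vectors have matching lengths before combining them
x <- my_data$predictor                                     # hypothetical columns
y <- my_data$response
stopifnot(length(x) == length(y))
```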

(2) Preventative Maintenance

The operator of our process should be up to date on the most common errors found in regression analysis. This can be done by visiting websites such as stackoverflow.com and looking for the most frequent errors. A statistical analysis of the errors found in these programming forums can be done to identify them and educate people about them. This will help the operator avoid such errors and fix them during breakdown maintenance.

(3) Corrective Maintenance

Many times there are too many errors compiled in the code and one must start from scratch. One should always clean the workspace using the techniques mentioned in the Seiso section. Then one can remove the latest line of code, one line at a time, until the program runs again. Sometimes one will end up back at a blank page and will have to restart the process entirely; this returns the system to an operational condition. Removing one line at a time is more time-consuming, but the problem can often be identified without eliminating all the work that has been done.

Quick Changeovers (SMED)

The majority of activities in regression analysis are internal, meaning that they occur while the machine is stopped. This is because computer run time is usually very short: each individual line of code takes seconds or even milliseconds to run. The only time the machine works for a long stretch is in our second-to-last step (number 3 in the table below), when we fit the model with missing data. Rather than having the operator remain idle during this time, we can have them look deeper into what the missing variables are, shifting that one minute from external to internal so that both the analyst and the machine are working at the same time.

| Number | Task / Operation | Current Internal | Current External | Improvement | Proposed Internal | Proposed External |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Receiving the data set | 2 minutes | 1 second | Use the 5S to eliminate the need to remove packages before use | 0.5 minutes | 1 second |
| 2 | Statistical analysis (each step after 1) | 3.5 minutes per step (30 steps) | ~5 seconds | Perform function checks and periodic preventive maintenance | 3.5 minutes per step | ~3 seconds |
| 3 | Fitting model with omitted data | 1 minute | 1 minute | Further understand missing variables by looking at what they are | 2 minutes | 1 minute |
| | Total | 108 minutes | 3.6 minutes | | 107.5 minutes | 1.5 minutes |

Mistake Proofing (Poka Yoke): Contact Method
We can easily use the contact method for many parts of our process. For example, one can proceed to a certain step only after having completed the previous one; otherwise the program will not compute the calculation. This contact check for errors is built into the machine. For instance, if one has not received the data, one cannot continue to check for missing data. Another, more complicated example is performing certain statistical analyses with outliers: results will not be significant if the outlier is kept, and one will not be able to deliver the final product, a significant predictive model. The built-in mistake proofing in the program reduces the variation that comes from the analyst having to make sure that everything is correct.
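
A minimal sketch of how such contact checks can be written directly into a script with stopifnot(), so a step simply cannot run unless the previous step finished correctly (the data set and column names are hypothetical):

```{r}
my_data <- read.csv("data.csv")                      # hypothetical data set

# Gate 1: the data must have been received before checking for missing values
stopifnot(exists("my_data"), nrow(my_data) > 0)
missing_by_column <- colSums(is.na(my_data))

# Gate 2: a response column must exist before any model is fit
stopifnot("response" %in% names(my_data))
fit <- lm(response ~ ., data = my_data)
```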

Having a template with checks after the code, or even encouraging copying and pasting from the template, can reduce the number of mistakes made by the analyst. Following the same template allows for an easier flow of the process and reduces overall cycle time. A template can be perfected over time as the process manager learns what is closest to their preference and works best for them.

VI. Conclusion

We applied different tools to make our process leaner and to improve the three main CTQs: cycle time, errors, and run time. Many of these tools are versatile and tackle similar areas of improvement. An important implication I learned is that these tools can be applied universally to both the internal processes (the analyst) and the external processes (the computer) to achieve the most efficient and simple process.




Tuesday, October 12, 2021

(2) Results: Using Traditional and Non-Traditional Predictors to Model Unemployment

1. Abstract

The rise and fall of unemployment during the COVID pandemic has created a very peculiar economic environment. It is in this environment that we consider traditional and non-traditional measures to predict unemployment. For traditional measures we use indexes that previous research has shown to be related to unemployment. For non-traditional measures we use time series data for a selection of relevant words from Google Trends. We find that the strongest model for predicting unemployment encompasses both traditional and non-traditional measures. This has implications for governmental bodies and other interested parties and their methodologies for measuring real unemployment.

2. Introduction

2.1 Overview

Unemployment affects thousands of people each year and has ramifications for all aspects of life. This is why it is important to find ways to measure unemployment, so that government institutions can act in time to prevent problems for citizens. Unemployment can indicate the health of a country's economy and, in turn, of its people. It is also related to negative effects such as crime and depression. When long periods with extended bouts of unemployment occur, people may become discouraged and never return to the workforce even once the economy improves, something we may have been seeing during the pandemic. Measuring and predicting unemployment therefore becomes paramount to the well-being of a society. The unemployment rate is measured each month by the Bureau of Labor Statistics of the U.S. Department of Labor, which collects a stratified sample of unemployment insurance claims from each state.

2.2 Data Collection

Predicting unemployment is a complicated endeavor since it involves multiple causes such as the inflation rate, the labor force, job openings, and education. In this analysis we will try to create a model that uses both typical and atypical predictors of unemployment. For example, we will be using the CPI (Consumer Price Index), which is known to relate strongly to unemployment and is traditionally used by economists to estimate the true rate of unemployment. We will also be using other measures of inflation, as we would like to investigate the strength of inflation in predicting unemployment rates. All of these measures come from the Federal Reserve Economic Data of the St. Louis Federal Reserve. We will also be using Google Trends, a Google website that allows you to search for queries and their relative volumes. We assume that typical words unemployed people would search for could be "job", "job offer", "welfare", "bored" or even "alcohol".

We will also be looking at different states of the economy. With the 2008 recession and the COVID pandemic we have seen many workers lose their jobs, become discouraged, or become too sick to work. It would be interesting to see the different effects these two recessions have had on unemployment, alongside the primary motivation of this analysis. This is why we will create dummy variables to look at the strength of such effects.

2.3 The Data Set

The data set is composed of 17 different attributes and 213 observations spanning back to January 2004. Of these attributes, 7 are traditional, 7 are non-traditional, 2 are dummy variables and 1 is the date of the observation. Our data set goes back to 2004 because that is when Google Trends data begins. We use a monthly frequency because the Bureau of Labor Statistics calculates the unemployment rate monthly. Since we use the unemployment rate for the United States, we framed all of our variables around this country only.

On the following list, the full name of each variable is given, followed by its name in our data set, shortened for coding convenience:

• Unemployment Rate (UNRATE): The unemployment rate is the number of unemployed persons over the age of 16 as a percentage of the labor force.

• S&P / Case-Shiller U.S. National Home Price Index (HPI): The Case-Shiller US National Home Price Index measures the value of single-family housing within the US. It is a composite of single-family home price indices.

• Personal Consumption Expenditures (PCE): The PCE reflects changes in the prices of a wide range of goods and services purchased by consumers in the US. This is a good measure of inflation and deflation.

• Producer Price Index (PPI): A group of indexes that represents the average monthly movement in selling prices received by domestic producers.

• Job Openings: The number, in thousands, of job openings available each month in the United States.

• Consumer Price Index (CPI): The CPI measures the weighted average of prices for a select basket of consumer goods and services, spanning from transportation to medical care.

• Gas: The retail price for all grades of gasoline in the US.

• Trends (Work, Job Offer, Job, Welfare, Depression, Alcohol and Bored): Each trend gives the relative frequency of Google searches for the specific word compared to the overall volume of Google searches in that month.

• COVID: We consider COVID to have affected the American economy from March 2020 until the present. This is the first of the dummy variables, with levels "Yes" and "No".

• Recession: Recessions after 2004 include the 2008 recession, lasting from December 2007 until June 2009 according to the Federal Reserve. This is the second of the dummy variables, with levels "Yes" and "No".

3. Pre-data analysis

For our initial analysis we used various graphs to see how all these variables could be related. One thing that stood out is that, when visualizing the number of job openings and the unemployment rate (our response variable), we can see some very stark differences between the 2008 recession and COVID. As seen in Graph 1 below, traditional recessions have a small number of job openings because the economy contracts. During COVID, for various reasons that economists are still studying, the number of job openings increased drastically despite there being a high number of unemployed people. You can also see from the violin graph that, according to our data set, unemployment was actually higher over time during COVID. This has implications for making sure that these two variables are both included in our model.

3.1 Multicollinearity

Many of these indexes are highly related, which may create issues for a model that includes all of them. It is important for us to note the variables that are highly correlated because they may create collinearity issues further on in the model. This is why we initially decided to check the VIFs before doing any additional transformations of the variables. After checking for collinearity we find that we need to remove Personal Consumption Expenditures, the Producer Price Index, Job Openings and the Work Trend. This implies that the removed variables were highly related to the ones we kept in our model. Figure 1 shows the VIFs for the full model.

Once we look at the histograms (Figure 2 in the appendix), we can see that most of the variables are fairly evenly distributed. The Unemployment Rate (UNRATE), JobTrend, HPI and CPI may be slightly skewed. The CPI is skewed positively because monetary policy keeps it that way as a way to combat unemployment. This is why UNRATE and CPI are skewed in mirrored directions. In the QQ plots (Figure 3) we can see that very few variables are skewed; the HPI, JobTrend and AlcoholTrend seem to be slightly skewed and may need Box-Cox transformations.

3.2. Initial Model

After eliminating multicollinearity our proposed model therefore becomes:

UNRATE = β0 + β1(HPI) + β2 (CPI) + β3 (Gas) + β4 (JobOfferTrend) +β5 (JobTrend) + β6(WelfareTrend) + β7 (DepressionTrend) + β8 (AlcoholTrend) + β9 (BoredTrend) + β10 (COVID) + β11 (Recession) + c

This model is already a good predictor of Unemployment, since the Adjusted R-squared is 0.784. But it can be significantly improved.

4. Model Diagnosis

4.1. Model Assumptions

First, we checked for a linear relationship between the response variable and the predictors. Looking at the scatterplots in Figure 4, the response variable UNRATE seems to have linear relationships with many variables, but also several non-linear ones, so we might need to run some transformations on these variables. Then we check for normality and constant variance: the residual plot does not seem to show random variation, whereas the QQ plot is close to a straight line (Figure 5). In order for our final assumptions to be satisfied, we might have to transform the response variable and check for any possible outliers or influential points.

Finally, our outliers are marked by the COVID pandemic. After investigating them we find that the outlying months are March, April and May 2020. Given the importance of these months to our motivation, we do not find it appropriate to eliminate them. We believe that by including the dummy variables COVID and Recession, the leverage that these points have on our analysis will be mitigated. Regardless, the leverage we are seeing is still fairly low.

4.2. Missing Points

Our missing values are not missing at random (NMAR): their absence is related to the data itself, since the last few months of certain indexes have not been released yet. We wanted to include as many months of COVID as possible and therefore did not feel the need to eliminate the months with missing data, as multiple new policies put into effect in the last few months may have a strong impact on the data. We therefore decided to keep the rows with missing values in the data set, because their non-missing values may be very relevant to understanding unemployment during COVID.

4.3. Transformations

Before we fit a model, we must first perform some adjustments to the data. As previously mentioned, the histograms of the predictors JobTrend and CPI raise some concern, as they are slightly skewed. In hopes of making their distributions look more normal, we apply a log transformation. Additionally, the Box-Cox transformation for the Unemployment Rate (Figure 6) suggests that the relation is not exactly linear; we should try the inverse of UNRATE squared. We can see in Figure 7 that after the transformation, the residual plot looks random. This is definitely an improvement over the previous model, so we will use these transformations in the models we fit in the following sections.

Our new model looks like:

(UNRATE)^(-2) = β0 + β1(HPI) + β2 log(CPI) + β3 (Gas) + β4 (JobOfferTrend) + β5 log(JobTrend) + β6(WelfareTrend) + β7 (DepressionTrend) + β8 (AlcoholTrend) + β9 (BoredTrend) + β10 (COVID) + β11 (Recession) + c

This model not only does a better job of satisfying the assumptions, but also improves the adjusted R-squared to 0.9419.

5. Model Selection

5.1. Stepwise Selection
We decided that BIC selection does not work for us because our model has a lot of predictors. We chose AIC because our goal was to find a model that does the best job of predicting the unemployment rate, not to find its best predictors. We ran AIC selection for our model*, and the model with the smallest AIC looks like:

(UNRATE)^(-2) = β0 + β1(HPI) + β2 (JobOfferTrend) + β3 (JobTrend) + β4(DepressionTrend) + β5 (AlcoholTrend) + β6 (COVID) + c

*due to errors with the missing values, we removed the log transformations for the stepwise model selection

5.2. Proposed Model

(UNRATE)^(-2) = β0 + β1(HPI) + β2(CPI) + β3 (JobOfferTrend) + β4 (JobTrend) + β5(DepressionTrend) + β6 (AlcoholTrend) + β7 (COVID) + c

This model yields an adjusted R-squared of 0.928, as seen below. It is important to note that we decided to keep the predictors from the second-to-last step of the AIC stepwise selection. This way we have a more balanced model in terms of non-traditional and traditional predictors, without a big impact on the adjusted R-squared. This model performs well in terms of both fit and prediction.

6. Conclusion

Figure 8. Final Model Summary

After comparing models with only one or the other type of predictor, we found that a model which includes both traditional and non-traditional predictors has the highest R-squared. This answers our research question: a model with both types of predictors is the most efficient.

The model that we created can be varied by using a different number and variety of indexes. It would also be interesting to see what other words could be used to predict unemployment. Since Google Trends only opened in 2004, other methods may be needed to look at words predating 2004. One suggestion would be to take a basket of books and see how often certain words or phrases were used in them. This could expand our model beyond 2004 and have implications for internal validity. A further suggested analysis would be to use this model for an out-of-sample test in the coming months: by looking at what happens to unemployment and our predictors in the rest of 2021, we can confirm the true strength of our model.

The analysis we created has various implications for the way we predict unemployment, namely, expanding the types of predictors one might use to include unusual ones such as those explored here. When we use both kinds of predictors we have a better model for predicting our measure of unemployment, and this may also be a stronger measure of true unemployment. There are also implications for the strength of Google search queries and their targeted advertising; for example, the government could use such ads to target those who may be looking for a job. These different forms of predictors can be used as sentinels to alert for periods of higher or lower unemployment.

7. Appendix

Figure 1: VIF for Full Model



Figure 2: Histograms

Figure 3: QQ Plots

Figure 4: Scatterplots for the variables

Figure 5: Residuals and QQ plot for First Model

Figure 6: BoxCox transformation for UNRATE

Figure 7: Residuals and QQ plot for Model after transformations

8. Code

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(car)
library(formatR)
library(MASS)
library(faraway)
library(leaps)
library(Rcmdr)
```

```{r CLEAN, include=FALSE}
# load data into R
unemployment <- read.csv("Data.csv") %>%
  dplyr::select(Date = Month,
                UNRATE,
                HPI = CSUSHPISA,
                PCE,
                JobOpenings = JTSJOL,
                PPI = PPIACO,
                CPI = CPALTT01USM657N,
                Gas,
                WorkTrend = work...United.States.,
                JobOfferTrend = job.offer...United.States.,
                JobTrend = job...United.States.,
                WelfareTrend = welfare...United.States.,
                DepressionTrend = depression...United.States.,
                AlcoholTrend = alcohol...United.States.,
                BoredTrend = bored...United.States.,
                COVID,
                Recession)
```

- The Full Model
```{r}
# Model 1: Full Model
mod1 <- lm(UNRATE ~ HPI + PCE + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data = unemployment)
summary(mod1)
```
- Variance Inflation Factors
```{r}
vif(mod1)
```
```{r}
# Model 2: Full Model - PCE
mod2 <- lm(UNRATE ~ HPI + JobOpenings + PPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod2)
```

There are still many variables with high VIFs. Below we remove the variables with the highest VIFs until we reach a simplified model in which all variables have a VIF < 4.
```{r}
# Model 3: Full Model - PCE - PPI

mod3 <- lm(UNRATE ~ HPI + JobOpenings + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod3)
# Model 4: Full Model - PCE - PPI - JobOpenings

mod4 <- lm(UNRATE ~ HPI + CPI + Gas + WorkTrend + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod4)
# Model 5: Full Model - PCE - PPI - JobOpenings - WorkTrend

mod5 <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data =unemployment)
vif(mod5)
```

```{r}
# Data set for Model 5
unemployment5 <- read.csv("Data.csv") %>%
  dplyr::select(UNRATE,
                HPI = CSUSHPISA,
                CPI = CPALTT01USM657N,
                Gas,
                JobOfferTrend = job.offer...United.States.,
                JobTrend = job...United.States.,
                WelfareTrend = welfare...United.States.,
                DepressionTrend = depression...United.States.,
                AlcoholTrend = alcohol...United.States.,
                BoredTrend = bored...United.States.,
                COVID,
                Recession)

# Histograms for the first ten variables
par(mfrow = c(3, 4))
for (i in c(1:10)) {
  x <- names(unemployment5)[i]
  hist(unemployment5[, i], xlab = x, main = x)
}
```
- qq plots
```{r}
par(mfrow = c(3, 4))
for (i in c(1:10)) {
  x <- names(unemployment5)[i]
  qqnorm(unemployment5[, i], xlab = x)
  qqline(unemployment5[, i])
}
```
- Box-Cox transformations
```{r}
par(mfrow=c(2,2))
for(i in c(6)){
x <- names(unemployment5)[i]
boxcox(I(unemployment5[,i]+0.001)~1,xlab=x)### add 0.001 to make all values positive
}
```
```{r}
par(mfrow=c(2,2))
for(i in c(2:10)){
x <- names(unemployment5)[i]
plot(unemployment5[,i],unemployment5$UNRATE,xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(1,2))
for(i in c(11,12)){
x <- names(unemployment5)[i]
boxplot(unemployment5$UNRATE~unemployment5[,i],xlab=x,ylab="UNRATE")
}
```
```{r}
par(mfrow=c(2,2))
plot(mod5)
shapiro.test(residuals(mod5))
```
The residual plot does not seem random and the QQ plot is close to a straight line. There might be some influential points or outliers.

```{r}
boxcox(mod5, lambda = seq (-3,3))
```
The graph suggests that the relation is not exactly linear. We should try (UNRATE)^(-2).
```{r}
model.inv <- lm((UNRATE)^(-2) ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession, data = unemployment)
plot(model.inv)
summary(model.inv)
```
- Leverage Plots / Cook's distance
```{r}
par(mfrow=c(1,2))
lev <- hatvalues(mod5) #extract the leverage values
labels <- row.names(unemployment)
halfnorm(lev,labs=labels,ylab="Leverages",xlim=c(0,3))
cook <- cooks.distance(mod5)#find Cook's distance
halfnorm(cook,labs=labels,ylab="Cook's distance",xlim=c(0,3))
```
Rows 189 and 195 are outliers. We now investigate to see why.

(189) 2019-09
(195) 2020-03

```{r}
outliers <- unemployment5[c(189, 195), ]
head(outliers)
summary(unemployment5)
```

- Studentized Residuals (graph)
Look at the values of the special cases from the graphs.
```{r}
### Extract studentized residuals from the fitted model
studres <- rstudent(mod5)            # get studentized residuals
range(studres)
out.ind <- which(abs(studres) > 3)   # flag large residuals
summary(unemployment5)
unemployment5[out.ind, ]
```
- Partial Residual Plot
```{r}

# creating a model without categorical variables
mod5cont <- lm(UNRATE ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend , data =unemployment)
par(mfrow=c(4,2))
termplot(mod5cont,partial.resid=TRUE,pch=16)
```
- TRANSFORMATIONS
```{r}
# Logging the CPI, because its residuals were skewed. Adding the Box-Cox transformation for Y
mod6 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod6)
```
```{r}
# Logging the JobTrend
mod7 <- lm((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
summary(mod7)
```
- Multicollinearity
```{r}
vif(mod7)
```

The VIFs of "HPI" and "AlcoholTrend" are a bit high (5.380891 and 5.179086), but all other VIFs seem to be OK. We can try a model selection technique to remove some of the predictors and see if that removes the multicollinearity.
- Mallow's CP
```{r}
b <- regsubsets((UNRATE)^(-2) ~ HPI + log(CPI) + Gas + JobOfferTrend + log(JobTrend) + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment,nvmax=11)
rs <- summary(b)
par(mfrow=c(1,2))
plot(1:11,rs$cp,ylab="Mallow's Cp",xlab="No.of Predictors",type="l",lwd=2)
plot(1:11,rs$bic,ylab="BIC",xlab="No.of Predictors",type="l",lwd=2)
```
```{r}
# Best model selected by Mallow's cp
rs$which[which.min(rs$cp),]
mod7.cp = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + log(JobTrend) + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
```
```{r}
#Best model selected by BIC
rs$which[which.min(rs$bic),]
mod7.bic = lm((UNRATE)^(-2) ~ HPI + JobOfferTrend + WelfareTrend + DepressionTrend + AlcoholTrend + COVID, data =unemployment)
```
```{r}
summary(mod7.cp)
vif(mod7.cp)
```
```{r}
summary(mod7.bic)
vif(mod7.bic)

```
- AIC
```{r}
#using model without logs because of NaNs error
mod5.7 =lm((UNRATE)^(-2) ~ HPI + CPI + Gas + JobOfferTrend + JobTrend + WelfareTrend + DepressionTrend + AlcoholTrend + BoredTrend + COVID + Recession , data =unemployment)
```
```{r}
##Stepwise backward selection based on AIC
step_b <- stepAIC(mod5.7, trace = TRUE, direction= "backward")
##Stepwise forward selection based on AIC
step_f <- stepAIC(mod5.7, trace = TRUE, direction= "forward")
##Stepwise both ways selection based on AIC
step_both <- stepAIC(mod5.7, trace = TRUE, direction= "both")
library(betareg)
coef(mod5.7)
```
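
The chunks above stop at the AIC selection step. A minimal sketch of refitting the proposed model from Section 5.2, assuming the same `unemployment` data frame loaded in the CLEAN chunk (the object name `mod.final` is ours):

```{r}
# Proposed model (Section 5.2): traditional + non-traditional predictors
mod.final <- lm((UNRATE)^(-2) ~ HPI + CPI + JobOfferTrend + JobTrend +
                  DepressionTrend + AlcoholTrend + COVID, data = unemployment)
summary(mod.final)   # adjusted R-squared reported in Section 5.2
vif(mod.final)       # check that collinearity stays acceptable
```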