Introduction
This data set contains the national home price index for the United States: the S&P/Case-Shiller U.S. National Home Price Index (CSUSHPINSA), obtained from FRED, recorded monthly since 1987. The housing market can seem unpredictable at times, which makes this series interesting to analyze; by studying past market crashes, it may even be possible to anticipate when another will occur.
data <- read.csv('CSUSHPINSA.csv')
CSHP <- ts(as.numeric(data[, -1]), start = c(1987, 1), frequency = 12)
Plot the raw data:
plot(CSHP, ylab = "Case-Shiller Index", xlab = "Year", type = "o")
We can see that the series is highly persistent ("sticky"), with a clear upward trend over time. There are no obvious outliers, apart from the sharp rise and decline surrounding the 2008 recession. Seasonality needs further investigation, and the variance looks relatively constant and steady.
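As a quick supplementary check (not part of the original code), a classical decomposition gives a visual read on the trend and any seasonal component:
plot(decompose(CSHP))  # trend, seasonal, and random panels for the monthly series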
Model Specification
Create the BoxCox function:
# Condensed, runnable version of the BoxCox.ar1 helper: a lightly modified
# TSA::BoxCox.ar that chooses the AR order automatically when none is given.
BoxCox.ar1 <- function(y, order, lambda = seq(-2, 2, 0.01), plotit = TRUE,
                       method = c("mle", "yule-walker", "burg", "ols", "yw"), ...) {
  method <- match.arg(method)
  y <- as.vector(y / (max(abs(y)) + 1))  # rescale; values must be positive
  if (any(y <= 0)) stop("Data values must be positive")
  if (missing(order)) order <- ar(log(y), method = "mle")$order
  n <- length(y); nlngmy <- sum(log(y))
  loglik <- sapply(lambda, function(xl) {  # profile log-likelihood over lambda
    z <- if (xl != 0) (y^xl - 1) / xl else log(y)
    -n/2 * log(ar(z, method = method, order.max = order)$var.pred) +
      (xl - 1) * nlngmy
  })
  mle <- lambda[which.max(loglik)]
  limit <- max(loglik) - 0.5 * qchisq(0.95, 1)  # 95% confidence cutoff
  ci <- range(lambda[loglik >= limit])
  if (plotit) {
    plot(lambda, loglik, type = "l", xlab = expression(lambda),
         ylab = "Log Likelihood", ...)
    abline(v = c(ci[1], mle, ci[2]), lty = 2)
    abline(h = limit, lty = 2)
    text(lambda[1] + 0.1, limit, " 95%")
  }
  invisible(list(lambda = lambda, loglike = loglik, mle = mle, ci = ci))
}
library(FitAR)
BoxCox.ts(CSHP)
Graph 2
BoxCox.ar1(CSHP,method="burg")
Graph 3
The Box-Cox procedure suggests a power transformation with lambda of about -0.5. Because this transformation is uncommon and hard to interpret, we decide against using it.
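Since BoxCox.ar1 invisibly returns the lambda grid, log-likelihoods, MLE, and confidence interval, the suggested power can also be read off directly rather than from the graph; a small sketch (bc is a hypothetical name):
bc <- BoxCox.ar1(CSHP, method = "burg")
bc$mle  # maximum-likelihood estimate of lambda (about -0.5 here)
bc$ci   # 95% confidence interval for lambda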
acf(CSHP)
Graph 4
pacf(CSHP)
[Figures: Box-Cox log-likelihood curve; ACF and PACF of the CSHP series]
eacf(CSHP)
This looks like it suggests an AR(1) model: the PACF cuts off after lag 1 while the ACF decays extremely slowly, although the EACF is not very convincing on this point. There is clearly a trend element throughout the plot, and the ACF shows significant autocorrelations at every lag with very slow decay. The data should therefore be differenced to remove this autocorrelation.
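A unit-root test points the same way. This supplementary check is not in the original analysis and assumes the tseries package is installed:
library(tseries)
adf.test(CSHP)  # failing to reject the unit-root null supports differencing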
plot(diff(CSHP),ylab="Differenced CSHP",xlab="Time", type="o")
[Figures: PACF of CSHP; first difference of the CSHP series]
Here we address the non-stationarity in our data through differencing. This is needed because the series was highly persistent, with a mean that increased over time. The differenced plot shows that the persistence has been moderated and the upward trend removed. Now we check the ACF plot again to see if the data needs more differencing.
par(mfrow = c(1, 2))
acf(diff(CSHP))
pacf(diff(CSHP))
[Figure: ACF and PACF of diff(CSHP)]
After first-order differencing, the ACF now decays geometrically, so we can move on to the next step of our model specification. The differenced data also looks seasonal.
par(mfrow = c(1, 2))
acf(diff(CSHP), lag.max = 12)
pacf(diff(CSHP), lag.max = 12)
[Figure: ACF and PACF of diff(CSHP) out to the seasonal lag]
eacf(diff(CSHP))
When we extend the plots to the seasonal lag of 12, the decay pattern is clearer, with a more pronounced cut-off in the PACF plot. The ACF and PACF suggest an ARIMA(2,1,0) with seasonal period s = 12, while the EACF suggests an ARIMA(1,1,3) or ARIMA(2,1,3).
plot(diff(CSHP, lag = 12), ylab = "Difference CSHP with S=12", type = "o")
[Figure: seasonally differenced series (lag 12)]
Despite the apparent seasonality at lag 12, adding the seasonal difference does not actually produce a better-behaved series, so we opt against using s = 12 at this stage.
Model Fitting
#Model fitting
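The chunk that fits this first model did not survive into the text. Judging from the calls echoed in the output and diagnostics below, it was presumably along these lines:
library(TSA)  # assumed loaded earlier; provides rstandard() for Arima objects, eacf(), runs()
arima113 <- arima(CSHP, order = c(1, 1, 3))
hist(rstandard(arima113), xlab = "Standardised residuals")  # residual histogram
qqnorm(rstandard(arima113)); qqline(rstandard(arima113))    # Q-Q plot
tsdiag(arima113, gof = 15, omit.initial = F)                # diagnostic panel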
[Figure: histogram and Q-Q plot of the standardized residuals from ARIMA(1,1,3)]
The standardized residuals depart from the reference line in the Q-Q plot above, suggesting non-normality.

# normality and independence
shapiro.test(rstandard(arima113))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(arima113)
## W = 0.84397, p-value < 2.2e-16
Given the very small p-value, the Shapiro-Wilk test strongly rejects normality of the residuals, so our confidence intervals may be too optimistic. This is likely due to the outliers around 2010, which are not “expected” under the assumption of normality.
runs(rstandard(arima113))
## $pvalue
## [1] 0.567
## 
## $observed.runs
## [1] 215
## 
## $expected.runs
## [1] 208.6952
## 
## $n1
## [1] 188
## 
## $n2
## [1] 232
Here we fail to reject the null, and we can conclude that our residuals are independent.
[Figure: tsdiag diagnostics for ARIMA(1,1,3): standardized residuals, ACF of residuals, Ljung-Box p-values]
The Ljung-Box tests do not suggest lack of fit, and the ARIMA(1,1,3) model appears to do a fairly good job. Next we compare the ARIMA(1,1,3) with its overfitted counterpart, ARIMA(2,1,3).
Overfitting
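The fitting chunk for the overfitted model is likewise missing; presumably the same calls with the larger order:
arima213 <- arima(CSHP, order = c(2, 1, 3))
hist(rstandard(arima213), xlab = "Standardised residuals")
qqnorm(rstandard(arima213)); qqline(rstandard(arima213))
tsdiag(arima213, gof = 15, omit.initial = F)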
[Figure: histogram and Q-Q plot of the standardized residuals from ARIMA(2,1,3)]
The standardized residuals depart from the reference line in the Q-Q plot above, suggesting non-normality.

# normality and independence
shapiro.test(rstandard(arima213))

## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(arima213)
## W = 0.84606, p-value < 2.2e-16
The ARIMA(2,1,3) also has a very small p-value, so the Shapiro-Wilk test again strongly rejects normality of the residuals. As before, our confidence intervals may be too optimistic, likely because of the outliers around 2010, which are not “expected” under the assumption of normality.
runs(rstandard(arima213))
## $pvalue
## [1] 0.738
## 
## $observed.runs
## [1] 213
## 
## $expected.runs
## [1] 209.0952
## 
## $n1
## [1] 190
## 
## $n2
## [1] 230
Here we fail to reject the null, and conclude that our residuals are independent.
[Figure: tsdiag diagnostics for ARIMA(2,1,3): standardized residuals, ACF of residuals, Ljung-Box p-values]
Histograms: both models' residuals look roughly bell-shaped, although the formal tests above reject normality.
QQ Plots: the ARIMA(1,1,3) residuals sit slightly closer to the reference line.
Shapiro-Wilk Test: ARIMA(1,1,3) (p-value < 2.2e-16) is effectively tied with ARIMA(2,1,3) (p-value < 2.2e-16).
Runs Test: ARIMA(1,1,3) (p-value = 0.567) < ARIMA(2,1,3) (p-value = 0.738); the larger p-value gives slightly less evidence against independence for the ARIMA(2,1,3) residuals.
Ljung-Box Test: p-values look fairly similar for both.
Given that both models perform similarly on these tests, we will use AIC to determine the better fit.
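The AICs can also be pulled directly from the fitted objects, which is equivalent to reading them off the printed summaries below:
AIC(arima113)  # 304.3
AIC(arima213)  # 306.32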
arima113;arima213
## 
## Call:
## arima(x = CSHP, order = c(1, 1, 3))
## 
## Coefficients:
##          ar1     ma1     ma2     ma3
##       0.8065  0.7316  0.5195  0.1275
## s.e.  0.0372  0.0597  0.0676  0.0527
## 
## sigma^2 estimated as 0.1179:  log likelihood = -148.15,  aic = 304.3
## 
## Call:
## arima(x = CSHP, order = c(2, 1, 3))
## 
## Coefficients:
##          ar1     ar2     ma1     ma2     ma3
##       0.6212  0.1564  0.9171  0.6446  0.2058
## s.e.  0.7386  0.6259  0.7347  0.4978  0.3159
## 
## sigma^2 estimated as 0.1179:  log likelihood = -148.16,  aic = 306.32
The ARIMA(1,1,3) has the smaller AIC, which is in line with the principle of parsimony: we choose the simpler model, and we will use it for our forecasting. All of its coefficients are significant: dividing each estimate by its standard error gives ratios well above 2 (for example, ar1: 0.8065/0.0372 is about 21.7).
By contrast, in the overfitted ARIMA(2,1,3) model no coefficient remains significant; the ratios of estimates to standard errors are all well below 2.
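These ratios can be computed directly from the fitted objects; the |z| > 2 rule of thumb corresponds roughly to significance at the 5% level:
round(arima113$coef / sqrt(diag(arima113$var.coef)), 2)  # all |z| > 2
round(arima213$coef / sqrt(diag(arima213$var.coef)), 2)  # all |z| < 2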
We will also overfit with ARIMA(1,1,4) and try some seasonal models. Before moving on to forecasting, we think it is worth examining how a seasonal component fits alongside the ARIMA(1,1,3) model.
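The fitting calls themselves were garbled in the source; reconstructed here from the calls echoed in the output below:
arima114   <- arima(CSHP, order = c(1, 1, 4))
arima00212 <- arima(CSHP, order = c(1, 1, 3),
                    seasonal = list(order = c(0, 0, 2), period = 12), method = "ML")
arima00312 <- arima(CSHP, order = c(1, 1, 3),
                    seasonal = list(order = c(0, 0, 3), period = 12), method = "ML")
arima10212 <- arima(CSHP, order = c(1, 1, 3),
                    seasonal = list(order = c(1, 0, 2), period = 12), method = "ML")
arima114; arima00212; arima00312; arima10212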
## 
## Call:
## arima(x = CSHP, order = c(1, 1, 4))
## 
## Coefficients:
##          ar1     ma1     ma2     ma3    ma4
##       0.8035  0.7345  0.5251  0.1350  0.008
## s.e.  0.0476  0.0652  0.0879  0.0935  0.083
## 
## sigma^2 estimated as 0.1179:  log likelihood = -148.15,  aic = 306.3
## 
## Call:
## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(0, 0, 2), period = 12),
##     method = "ML")
## 
## Coefficients:
##          ar1     ma1     ma2     ma3    sma1    sma2
##       0.8243  0.5231  0.3855  0.0339  0.4241  0.2826
## s.e.  0.0369  0.0621  0.0644  0.0580  0.0551  0.0452
## 
## sigma^2 estimated as 0.08963:  log likelihood = -92.26,  aic = 196.53
## 
## Call:
## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(0, 0, 3), period = 12),
##     method = "ML")
## 
## Coefficients:
##          ar1     ma1     ma2     ma3    sma1    sma2    sma3
##       0.8441  0.4747  0.3698  0.0012  0.3822  0.3266  0.2431
## s.e.  0.0353  0.0632  0.0639  0.0609  0.0590  0.0460  0.0544
## 
## sigma^2 estimated as 0.08507:  log likelihood = -81.94,  aic = 177.88
## 
## Call:
## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(1, 0, 2), period = 12),
##     method = "ML")
## 
## Coefficients:
##          ar1     ma1     ma2      ma3    sar1     sma1     sma2
##       0.9028  0.3600  0.2518  -0.0920  0.9676  -0.6525  -0.0681
## s.e.  0.0294  0.0656  0.0715   0.0669  0.0154   0.0560   0.0539
## 
## sigma^2 estimated as 0.07128:  log likelihood = -49.06,  aic = 112.12
The seasonal ARIMA(1,1,3) × (1,0,2) model with s = 12 has a much smaller AIC. We will check whether its residuals violate normality or independence.
[Figure: histogram and Q-Q plot of the standardized residuals from the seasonal model]
#normality and independence
shapiro.test(rstandard(arima10212))
## 
##  Shapiro-Wilk normality test
## 
## data:  rstandard(arima10212)
## W = 0.82945, p-value < 2.2e-16
runs(rstandard(arima10212))
## $pvalue
## [1] 0.978
## 
## $observed.runs
## [1] 209
## 
## $expected.runs
## [1] 209.781
## 
## $n1
## [1] 194
## 
## $n2
## [1] 226
## 
## $k
## [1] 0
tsdiag(arima10212,gof=15,omit.initial=F)
[Figure: tsdiag diagnostics for the seasonal model: standardized residuals, ACF of residuals, Ljung-Box p-values]
The results look much stronger for the multiplicative seasonal model; for example, its Ljung-Box p-values are predominantly non-significant. We will use this model to make our predictions.
Model Forecasting
arima10212 <- arima(CSHP, order = c(1, 1, 3), method = 'ML',
                    seasonal = list(order = c(1, 0, 2), period = 12))
arima10212predict <- predict(arima10212, n.ahead = 36)
round(arima10212predict$pred, 3)

##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2022 281.245 284.172 288.247 292.598 296.655 300.493 303.591 306.137 308.219
## 2023 315.233 317.244 320.359 323.777 326.940 329.920 332.308 334.265 335.851
## 2024 341.172 342.804 345.536 348.588 351.418 354.094 356.217 357.941 359.322
##          Oct     Nov     Dec
## 2022 310.077 311.899 313.585
## 2023 337.276 338.662 339.927
## 2024 360.564 361.780 362.892

round(arima10212predict$se, 3)

##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2022  0.267  0.660  1.178  1.745  2.345  2.969  3.608  4.255  4.906  5.558
## 2023  7.523  8.225  8.956  9.703 10.459 11.219 11.978 12.735 13.485 14.228
## 2024 16.422 17.166 17.922 18.684 19.447 20.210 20.971 21.727 22.477 23.221
##         Nov    Dec
## 2022  6.207  6.851
## 2023 14.963 15.688
## 2024 23.957 24.684
# prediction intervals
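The chunk that builds Month, lower.pi, and upper.pi is missing from the text. A sketch consistent with the table below, which matches normal 95% intervals (point forecast plus or minus 1.96 standard errors) and indexes months 421 to 456, i.e. months since the January 1987 start:
lower.pi <- arima10212predict$pred - qnorm(0.975) * arima10212predict$se
upper.pi <- arima10212predict$pred + qnorm(0.975) * arima10212predict$se
data.frame(Month = 421:456, lower.pi = c(lower.pi), upper.pi = c(upper.pi))
Month <- time(arima10212predict$pred)  # calendar time, used for the dashed bands in the plot below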
##    Month lower.pi upper.pi
## 1    421 280.7218 281.7683
## 2    422 282.8771 285.4662
## 3    423 285.9373 290.5561
## 4    424 289.1784 296.0173
## 5    425 292.0585 301.2520
## 6    426 294.6733 306.3121
## 7    427 296.5199 310.6623
## 8    428 297.7975 314.4774
## 9    429 298.6027 317.8354
## 10   430 299.1835 320.9698
## 11   431 299.7345 324.0643
## 12   432 300.1580 327.0126
## 13   433 300.4882 329.9773
## 14   434 301.1229 333.3657
## 15   435 302.8049 337.9134
## 16   436 304.7587 342.7951
## 17   437 306.4403 347.4399
## 18   438 307.9309 351.9084
## 19   439 308.8307 355.7851
## 20   440 309.3058 359.2244
## 21   441 309.4200 362.2810
## 22   442 309.3891 365.1637
## 23   443 309.3349 367.9895
## 24   444 309.1783 370.6755
## 25   445 308.9859 373.3583
## 26   446 309.1587 376.4499
## 27   447 310.4086 380.6631
## 28   448 311.9683 385.2074
## 29   449 313.3023 389.5345
## 30   450 314.4819 393.7053
## 31   451 315.1148 397.3189
## 32   452 315.3574 400.5252
## 33   453 315.2679 403.3769
## 34   454 315.0523 406.0758
## 35   455 314.8264 408.7344
## 36   456 314.5115 411.2717
plot(arima10212, n.ahead = 36, col = 'red', type = 'b', pch = 16, n1 = 2015,
     ylab = "Prediction", xlab = "Year")
lines(y = lower.pi, x = Month, lwd = 2, col = "red", lty = "dashed")
lines(y = upper.pi, x = Month, lwd = 2, col = "red", lty = "dashed")
[Figure: 36-month forecast for 2022-2024 with 95% prediction bands]
Our predictions continue the upward trend in the data, suggesting that the Case-Shiller Home Price Index will keep rising, at a gradually decreasing rate, over the next 36 months. The prediction intervals fan out over time, as expected for a nonstationary process.
Discussion
Our final model was ARIMA(1, 1, 3) × ARIMA(1, 0, 2) with seasonal period 12. After comparing it to alternative and overfitted models, we found that it gave the best fit, with the lowest AIC and the most significant coefficients, while also accounting for seasonality. We ran diagnostics such as the Shapiro-Wilk, runs, and Ljung-Box tests to check the model's assumptions of normality and independence. We were then able to forecast the price index 36 months (three years) ahead, through the end of 2024, and to display the forecast graphically. In a housing market that can seem unpredictable, this is a genuinely useful tool: as 2008 showed, no one can predict exactly what will happen, but a model that incorporates seasonal structure gives us more confidence when projecting housing price indices.
As prices are ever-changing, there will certainly be opportunities to revise and update the model, especially as new trends emerge. With more data over time, we should be able to build increasingly accurate models for this series.