
Friday, October 8, 2021

Time Series Analysis of the S&P/Case-Shiller Housing Price Index

Introduction

This data set contains the national home price index for the US: a monthly price index dating back to 1987. This is interesting to analyze because the housing market can seem unpredictable at times, and by studying past market crashes it may be possible to anticipate when another will occur. The data come from FRED; the series used is the S&P/Case-Shiller U.S. National Home Price Index (CSUSHPINSA).
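The CSV can be downloaded from FRED by hand, but as a sketch (assuming the quantmod package is installed), the same series can also be pulled programmatically:

library(quantmod)
getSymbols("CSUSHPINSA", src = "FRED")  # loads the series as an xts object named CSUSHPINSA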

Read in the data:
library(TSA)  # provides eacf(), runs(), and rstandard() for ARIMA fits, used below
data <- read.csv('CSUSHPINSA.csv')
CSHP <- ts(as.numeric(data[, 2]), start = c(1987, 1), frequency = 12)  # column 2 holds the index values; column 1 is the date

Plot the raw data:
plot(CSHP, ylab = "Index", xlab = "Time", type = "o")

[Graph 1: CSUSHPINSA index, monthly, 1987 onward]

We can see that this is very sticky (highly persistent) data, with an upward trend over time. There are no obvious outliers apart from the sharp rise and decline surrounding the 2008 recession. Seasonality needs further investigation, and the variance looks relatively constant and steady.
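Before modeling, a quick decomposition can make the trend and any seasonal component easier to see; a minimal sketch using base R's stl():

# STL decomposition: trend, seasonal, and remainder panels
plot(stl(CSHP, s.window = "periodic"))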


Model Specification

Create the BoxCox function:

BoxCox.ar1 = function(y, order, lambda = seq(-2, 2, 0.01), plotit = TRUE,
                      method = c("mle", "yule-walker", "burg", "ols", "yw"), ...)
{
    if (missing(method))
        method = "mle"
    y = as.vector(y/(max(abs(y)) + 1))
    if (any(y <= 0))
        stop("Data values must be positive")
    order = ar(log(y), method = method)$order
    nlngmy <- sum(log(y))
    if (!missing(lambda))
        xl <- lambda
    else xl <- seq(-2, 2, 0.1)
    loglik <- as.vector(xl)
    for (i in 1:length(xl)) if (abs(xl[i]) > 0) {
        if (missing(order))
            ar.result = ar((y^xl[i] - 1)/xl[i], method = method)
        else ar.result = ar((y^xl[i] - 1)/xl[i], method = method,
                            order.max = order)
        n = length(y) - ar.result$order
        ar.res = ar.result$resid
        n = length(y)
        loglik[i] <- -n/2 * log(ar.result$var.pred) + (xl[i] - 1) * nlngmy
    }
    else {
        if (missing(order))
            ar.result = ar(log(y), method = method)
        else ar.result = ar(log(y), method = method, order.max = order)
        n = length(y) - ar.result$order
        ar.res = ar.result$resid
        n = length(y)
        loglik[i] <- -n/2 * log(ar.result$var.pred) - nlngmy
    }
    if (plotit) {
        plot(xl, loglik, xlab = expression(lambda), ylab = "Log Likelihood",
             type = "l", ylim = c(min(loglik), max(loglik)))
        lambdahat <- loglik[loglik == max(loglik)]
        limit <- lambdahat - 0.5 * qchisq(0.95, 1)
        in.interval = xl[loglik >= limit]
        lower = in.interval[1]
        upper = rev(in.interval)[1]
        mle = (xl[loglik == max(loglik)])[1]
        lines(x = c(lower, lower), y = c(min(loglik), limit), lty = 2)
        lines(x = c(upper, upper), y = c(min(loglik), limit), lty = 2)
        lines(x = c(mle, mle), y = c(min(loglik), max(loglik)), lty = 2)
        abline(limit, 0, lty = 2)
        scal <- (par("usr")[4] - par("usr")[3])/par("pin")[2]
        text(c(xl[1]) + 0.1, limit + 0.08 * scal, " 95%")
    }
    invisible(list(lambda = xl, loglike = loglik, mle = mle, ci = c(lower, upper)))
}

library(FitAR)

BoxCox.ts(CSHP)

[Graph 2: Box-Cox log-likelihood from FitAR's BoxCox]

BoxCox.ar1(CSHP,method="burg")

[Graph 3: BoxCox.ar1 log-likelihood vs lambda, burg method]

The Box-Cox procedure suggests a power transformation with lambda = -0.5. Such a transformation is very uncommon and hard to interpret, so we decide against using it.
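Since BoxCox.ar1() returns its results invisibly, the estimate and its confidence interval can also be read off programmatically; a minimal sketch:

bc <- BoxCox.ar1(CSHP, method = "burg")
bc$mle  # point estimate of lambda (about -0.5 here)
bc$ci   # 95% confidence interval for lambda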

acf(CSHP)

[Graph 4: sample ACF of CSHP]

pacf(CSHP)

[Graph 5: sample PACF of CSHP]

eacf(CSHP)

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x
## 1 x x x x x x x x x x x  x  x  x
## 2 x x o x x x x x o x x  x  x  x
## 3 x x o x x x x x o o x  x  x  o
## 4 x x x o o o o o o o o  x  o  o
## 5 x x x o x o o o o o o  x  x  o
## 6 x x x o x o o o o o o  x  x  o
## 7 x x x x o o o o o o o  x  o  o


This looks like it suggests an AR(1) model: the PACF cuts off after lag 1 and the ACF decays extremely slowly, although the EACF is not very convincing on this point. There is clearly a trend element throughout the plot. The ACF shows significant autocorrelations at all lags and decays very slowly, so the data should be differenced in order to remove this autocorrelation.
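A formal unit-root test can back up this visual evidence; a sketch assuming the tseries package is installed:

library(tseries)
adf.test(CSHP)        # a large p-value means we cannot reject a unit root
adf.test(diff(CSHP))  # a small p-value after differencing supports stationarity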

plot(diff(CSHP),ylab="Differenced CSHP",xlab="Time", type="o")

[Graph 6: first difference of CSHP over time]

Here we address the non-stationarity in our data through differencing. This is needed because the data were very sticky, with a mean that increased over time. After differencing, the stickiness has been moderated and so has the upward trend. Now we check the ACF plot again to see whether the data need further differencing.

par(mfrow = c(1, 2))
acf(diff(CSHP))
pacf(diff(CSHP))

[Graph 7: sample ACF and PACF of diff(CSHP)]

The ACF now decays in a geometric fashion, confirming that taking the first-order difference was the right move, so we can proceed to the next step of our model specification. The differenced data also looks seasonal.
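To check that impression, a month plot groups the differenced series by calendar month; a minimal sketch using base R:

# each panel shows one month's values across the years, with its mean as a bar
monthplot(diff(CSHP), ylab = "Differenced CSHP")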

par(mfrow = c(1, 2))
acf(diff(CSHP), lag.max = 12)
pacf(diff(CSHP), lag.max = 12)

[Graph 8: sample ACF and PACF of diff(CSHP) out to lag 12]

eacf(diff(CSHP))

## AR/MA
##   0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0 x x x x x x x x x x x  x  x  x
## 1 x x x o x x x o o x x  x  x  x
## 2 x x x o o x x o o o x  x  x  o
## 3 x x x o o o o o o o o  x  o  o
## 4 x x x o x o o o o o o  x  x  o
## 5 x x x o x o o o o o o  x  x  o
## 6 x x x x o o o o o o o  x  o  o
## 7 x x x x x o o o o o o  x  o  x


After adding the seasonal component at lag 12, the correlations seem to decay better, with much clearer cutoffs in the PACF plot. The ACF and PACF plots suggest an ARIMA(2,1,0) with seasonal period s = 12. The EACF plot suggests an ARIMA(1,1,3) or ARIMA(2,1,3).

plot(diff(CSHP, lag = 12), ylab = "Differenced CSHP with s = 12", type = "o")

[Graph 9: seasonal difference of CSHP at lag 12 over time]
Despite the data appearing to have seasonality at lag 12, adding the seasonal difference does not actually produce a better fit, so we opt against using s = 12 differencing.

Model Fitting

#Model fitting

arima113 <- arima(CSHP,order=c(1,1,3))
arima213 <- arima(CSHP,order=c(2,1,3)) # possible overfitting

par(mfrow=c(1,2))
hist(rstandard(arima113), xlab = "Standardised residuals", main = "")
qqnorm(rstandard(arima113), main = "")
qqline(rstandard(arima113))

[Graph 10: histogram and QQ plot of ARIMA(1,1,3) standardized residuals]

The standardized residuals depart from the mean in the QQ plot above, suggesting non-normality.

#normality and independence
shapiro.test(rstandard(arima113))

##
## Shapiro-Wilk normality test
##
## data: rstandard(arima113)
## W = 0.84397, p-value < 2.2e-16

The Shapiro-Wilk test strongly rejects normality of the residuals given the very small p-value, so our confidence intervals may be too optimistic. This is likely due to the outliers around 2010, which are not "expected" under the assumption of normality.

runs(rstandard(arima113))

## $pvalue
## [1] 0.567
##
## $observed.runs
## [1] 215
##
## $expected.runs
## [1] 208.6952
##
## $n1
## [1] 188
##
## $n2
## [1] 232
##
## $k
## [1] 0

Here we fail to reject the null, so the residuals appear consistent with independence.

tsdiag(arima113,gof=15,omit.initial=F)
[Graph 11: tsdiag(arima113): standardized residuals, ACF of residuals, Ljung-Box p-values]

The Ljung-Box tests do not suggest lack of fit; the ARIMA(1,1,3) model appears to do a fairly good job. Now we compare the ARIMA(1,1,3) against its overfitted counterpart, the ARIMA(2,1,3).

Overfitting

par(mfrow=c(1,2))
hist(rstandard(arima213), xlab = "Standardised residuals", main = "")
qqnorm(rstandard(arima213), main = "")
qqline(rstandard(arima213))

[Graph 12: histogram and QQ plot of ARIMA(2,1,3) standardized residuals]

The standardized residuals depart from the mean in the QQ plot above, suggesting non-normality.

#normality and independence
shapiro.test(rstandard(arima213))

##
## Shapiro-Wilk normality test
##
## data: rstandard(arima213)
## W = 0.84606, p-value < 2.2e-16

The ARIMA(2,1,3) residuals also yield a very small p-value, so here too the Shapiro-Wilk test strongly rejects normality, and our confidence intervals may be too optimistic. Again, this is likely due to the outliers around 2010, which are not "expected" under the assumption of normality.

runs(rstandard(arima213))

## $pvalue
## [1] 0.738
##
## $observed.runs
## [1] 213
##
## $expected.runs
## [1] 209.0952
##
## $n1
## [1] 190
##
## $n2
## [1] 230
##
## $k
## [1] 0

Here we fail to reject the null, so these residuals also appear consistent with independence.

tsdiag(arima213,gof=15,omit.initial=F)
[Graph 13: tsdiag(arima213): standardized residuals, ACF of residuals, Ljung-Box p-values]

Histograms: both residual histograms look roughly bell-shaped, although the Shapiro-Wilk tests reject normality for both models. QQ plots: the ARIMA(1,1,3) residuals appear slightly more linear around the mean. Shapiro-Wilk test: ARIMA(1,1,3) (p-value < 2.2e-16) is essentially equal to ARIMA(2,1,3) (p-value < 2.2e-16). Runs test: ARIMA(1,1,3) (p-value = 0.567) < ARIMA(2,1,3) (p-value = 0.738); the smaller p-value for the ARIMA(1,1,3) can suggest that the residuals are, if anything, slightly more consistent with independence for the ARIMA(2,1,3). Ljung-Box test: the p-values look fairly similar for both. Given that the two models produce fairly similar diagnostics, we will use AIC to determine the better fit.

arima113;arima213
##
## Call:
## arima(x = CSHP, order = c(1, 1, 3))
##
## Coefficients:
##          ar1     ma1     ma2     ma3
##       0.8065  0.7316  0.5195  0.1275
## s.e.  0.0372  0.0597  0.0676  0.0527
##
## sigma^2 estimated as 0.1179: log likelihood = -148.15, aic = 304.3


##
## Call:
## arima(x = CSHP, order = c(2, 1, 3))
##
## Coefficients:
##          ar1     ar2     ma1     ma2     ma3
##       0.6212  0.1564  0.9171  0.6446  0.2058
## s.e.  0.7386  0.6259  0.7347  0.4978  0.3159
##
## sigma^2 estimated as 0.1179: log likelihood = -148.16, aic = 306.32

The ARIMA(1,1,3) has the smaller AIC, which is in line with the principle of parsimony: we choose the simpler model, and we will use it for our forecasting. All of its coefficients are significant, since each estimate divided by its standard error is well above 2 in absolute value.

Also, in the overfitted ARIMA(2,1,3) model, the coefficients are no longer significant: dividing each by its standard error gives values well below 2.
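These t-ratios can be computed directly from the fitted objects; a minimal sketch (arima() stores the estimates in $coef and their covariance matrix in $var.coef):

# approximate t-ratios: estimate / standard error; absolute values above
# roughly 2 indicate significance at about the 5% level
round(arima113$coef / sqrt(diag(arima113$var.coef)), 2)
round(arima213$coef / sqrt(diag(arima213$var.coef)), 2)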

We will also overfit with an ARIMA(1,1,4) and with several seasonal models. Before moving on to forecasting, we think that looking at the seasonal fit of the ARIMA(1,1,3) model may be interesting.

# more overfitting
arima114 <- arima(CSHP, order = c(1,1,4))  # ma4 term is not significant
arima00212 <- arima(CSHP, order = c(1,1,3), method = 'ML',
                    seasonal = list(order = c(0,0,2), period = 12))
arima00312 <- arima(CSHP, order = c(1,1,3), method = 'ML',
                    seasonal = list(order = c(0,0,3), period = 12))
arima10212 <- arima(CSHP, order = c(1,1,3), method = 'ML',
                    seasonal = list(order = c(1,0,2), period = 12))

arima114; arima00212; arima00312; arima10212

##
## Call:
## arima(x = CSHP, order = c(1, 1, 4))
##
## Coefficients:
##          ar1     ma1     ma2     ma3    ma4
##       0.8035  0.7345  0.5251  0.1350  0.008
## s.e.  0.0476  0.0652  0.0879  0.0935  0.083
##
## sigma^2 estimated as 0.1179: log likelihood = -148.15, aic = 306.3
##
## Call:
## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(0, 0, 2), period = 12),
##     method = "ML")
##
## Coefficients:
##          ar1     ma1     ma2     ma3    sma1    sma2
##       0.8243  0.5231  0.3855  0.0339  0.4241  0.2826
## s.e.  0.0369  0.0621  0.0644  0.0580  0.0551  0.0452
##
## sigma^2 estimated as 0.08963: log likelihood = -92.26, aic = 196.53
##
## Call:

## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(0, 0, 3), period = 12),
## method = "ML")
##
## Coefficients:
##          ar1     ma1     ma2     ma3    sma1    sma2    sma3
##       0.8441  0.4747  0.3698  0.0012  0.3822  0.3266  0.2431
## s.e.  0.0353  0.0632  0.0639  0.0609  0.0590  0.0460  0.0544
##
## sigma^2 estimated as 0.08507:  log likelihood = -81.94,  aic = 177.88
##
## Call:
## arima(x = CSHP, order = c(1, 1, 3), seasonal = list(order = c(1, 0, 2), period = 12),
## method = "ML")
##
## Coefficients:
##          ar1     ma1     ma2      ma3    sar1     sma1     sma2
##       0.9028  0.3600  0.2518  -0.0920  0.9676  -0.6525  -0.0681
## s.e.  0.0294  0.0656  0.0715   0.0669  0.0154   0.0560   0.0539
##
## sigma^2 estimated as 0.07128: log likelihood = -49.06, aic = 112.12

The ARIMA(1,1,3) x (1,0,2) model with period 12 has a much smaller AIC. We will investigate whether normality and independence are violated.
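To see all of the candidates side by side, the AICs can be collected in one call; a minimal sketch using the fitted objects above:

# named vector of AICs, one per candidate model
sapply(list(arima113 = arima113, arima213 = arima213, arima114 = arima114,
            arima00212 = arima00212, arima00312 = arima00312,
            arima10212 = arima10212), AIC)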

par(mfrow = c(1, 2))
hist(rstandard(arima10212), xlab = "Standardised residuals", main = "")
qqnorm(rstandard(arima10212), main = "")
qqline(rstandard(arima10212))

[Graph 14: histogram and QQ plot of the seasonal model's standardized residuals]

#normality and independence

shapiro.test(rstandard(arima10212))

##
## Shapiro-Wilk normality test
##
## data: rstandard(arima10212)
## W = 0.82945, p-value < 2.2e-16

runs(rstandard(arima10212))

## $pvalue
## [1] 0.978
##
## $observed.runs
## [1] 209
##
## $expected.runs
## [1] 209.781
##
## $n1
## [1] 194
##
## $n2
## [1] 226
##
## $k
## [1] 0

tsdiag(arima10212,gof=15,omit.initial=F)

[Graph 15: tsdiag(arima10212): standardized residuals, ACF of residuals, Ljung-Box p-values]

The results look much stronger for the multiplicative model: for example, the Ljung-Box p-values are predominantly non-significant. We will use this model to make our predictions.
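A single Ljung-Box test mirrors what tsdiag() plots at each lag; a minimal sketch at lag 15, where fitdf = 7 is our assumption matching the seven estimated ARMA coefficients:

# Ljung-Box test on the standardized residuals; fitdf adjusts the
# degrees of freedom for the number of estimated coefficients
Box.test(rstandard(arima10212), lag = 15, type = "Ljung-Box", fitdf = 7)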

Model Forecasting

arima10212predict <- predict(arima10212, n.ahead = 36)
round(arima10212predict$pred, 3)

##          Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep
## 2022 281.245 284.172 288.247 292.598 296.655 300.493 303.591 306.137 308.219
## 2023 315.233 317.244 320.359 323.777 326.940 329.920 332.308 334.265 335.851
## 2024 341.172 342.804 345.536 348.588 351.418 354.094 356.217 357.941 359.322
##          Oct     Nov     Dec
## 2022 310.077 311.899 313.585
## 2023 337.276 338.662 339.927
## 2024 360.564 361.780 362.892

round(arima10212predict$se, 3)

##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2022  0.267  0.660  1.178  1.745  2.345  2.969  3.608  4.255  4.906  5.558
## 2023  7.523  8.225  8.956  9.703 10.459 11.219 11.978 12.735 13.485 14.228
## 2024 16.422 17.166 17.922 18.684 19.447 20.210 20.971 21.727 22.477 23.221
##         Nov    Dec
## 2022  6.207  6.851
## 2023 14.963 15.688
## 2024 23.957 24.684
# prediction intervals

lower.pi <- arima10212predict$pred - qnorm(0.975, 0, 1) * arima10212predict$se
upper.pi <- arima10212predict$pred + qnorm(0.975, 0, 1) * arima10212predict$se
Month = c(421:456)  # months 421-456: the 36 months after the 420 observations in the sample
data.frame(Month,lower.pi,upper.pi)

##    Month lower.pi upper.pi
## 1    421 280.7218 281.7683
## 2    422 282.8771 285.4662
## 3    423 285.9373 290.5561
## 4    424 289.1784 296.0173
## 5    425 292.0585 301.2520
## 6    426 294.6733 306.3121
## 7    427 296.5199 310.6623
## 8    428 297.7975 314.4774
## 9    429 298.6027 317.8354
## 10   430 299.1835 320.9698
## 11   431 299.7345 324.0643
## 12   432 300.1580 327.0126
## 13   433 300.4882 329.9773
## 14   434 301.1229 333.3657
## 15   435 302.8049 337.9134
## 16   436 304.7587 342.7951
## 17   437 306.4403 347.4399
## 18   438 307.9309 351.9084
## 19   439 308.8307 355.7851
## 20   440 309.3058 359.2244
## 21   441 309.4200 362.2810
## 22   442 309.3891 365.1637
## 23   443 309.3349 367.9895
## 24   444 309.1783 370.6755
## 25   445 308.9859 373.3583
## 26   446 309.1587 376.4499
## 27   447 310.4086 380.6631
## 28   448 311.9683 385.2074
## 29   449 313.3023 389.5345
## 30   450 314.4819 393.7053
## 31   451 315.1148 397.3189
## 32   452 315.3574 400.5252
## 33   453 315.2679 403.3769
## 34   454 315.0523 406.0758
## 35   455 314.8264 408.7344
## 36   456 314.5115 411.2717
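These are the usual normal-theory 95% limits, pred ± 1.96 × se. As a cross-check, a sketch assuming the forecast package is installed, which produces the same style of intervals directly:

library(forecast)
fc <- forecast(arima10212, h = 36, level = 95)  # point forecasts plus 95% intervals
head(data.frame(lower = as.numeric(fc$lower), upper = as.numeric(fc$upper)))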

plot(arima10212, n.ahead = 36, col = 'red', type = 'b', pch = 16, n1 = 2015,
     ylab = "Prediction", xlab = "Year")
lines(y = lower.pi, x = Month, lwd = 2, col = "red", lty = "dashed")
lines(y = upper.pi, x = Month, lwd = 2, col = "red", lty = "dashed")

[Graph 16: CSHP forecasts for 2022-2024 with 95% prediction bands, plotted from 2015]

Our forecasts continue the upward trend in the data, suggesting that the Case-Shiller Home Price Index will keep rising, at a slowly decreasing rate, over the next 36 months. The forecast variance fans out over time, as the process is nonstationary.

Discussion:

Our final model was the ARIMA(1,1,3) x (1,0,2) model with period 12. After comparing this model to the others and overfitting, we found that it was the best supported: after running diagnostics such as the Shapiro-Wilk, runs, and Ljung-Box tests to check the model assumptions (normality, independence), this model has the most significant coefficients of the candidates considered, and it accounts for seasonality. We were able to forecast the price index 36 months, or 3 years, ahead and to graph the forecasts, extending into 2025. With such an unpredictable housing market, a model like this is a useful forecasting tool. As we saw in 2008, it is impossible to know what will actually happen, but a model that includes seasonal trends gives us more comfort in predicting housing price indices.

As prices are ever-changing, there will definitely be opportunities to revise and update the model, especially as new trends emerge. With time, we will be able to build more accurate models for this data.
