As a preface, I will define here some functions that will be useful later on in the analysis.
## Functions used in Time Series Project -- Lake Erie Water Levels
library(RColorBrewer)
# Build a long vector of qualitative colors for distinguishing the many
# model-prediction lines plotted later.
qual_col_pals <- brewer.pal.info[brewer.pal.info$category == 'qual', ]
col_vector <- unlist(mapply(brewer.pal, qual_col_pals$maxcolors, rownames(qual_col_pals)))
# Sum of squared errors between a vector of predictions and the true values.
sse<-function(pred, data){
  true<-as.vector(data)
  sum((pred-true)^2)
}
colors2<-col_vector
# Overlay a model's forecasts (appended to the training series) on an
# existing plot, using the time index of the full series for the x-axis.
addplot<-function(pred,range=1:length(lkerie),color1){
  lines(x=c(time(lkerie))[range],y=c(lkerietrain,pred)[range],col=color1,lty="dashed")
}
# Exhaustive search over SARIMA orders up to maxord, keeping the fit that
# minimizes a consistent-AIC criterion, -2*logLik + (log(n) + 1)*k, whose
# log(n) penalty punishes extra parameters more heavily than ordinary AIC.
# The search loop follows Cowpertwait & Metcalfe, Introductory Time Series with R.
get.best.arima <- function(x_ts, maxord = c(1,1,1,1,1,1))
{
  best.aic <- 1e8
  n <- length(x_ts)
  for (p in 0:maxord[1]) for (d in 0:maxord[2]) for (q in 0:maxord[3])
    for (P in 0:maxord[4]) for (D in 0:maxord[5]) for (Q in 0:maxord[6])
    {
      fit <- try(arima(x_ts, order = c(p, d, q),
                       seasonal = list(order = c(P, D, Q),
                                       period = frequency(x_ts)),
                       method = "CSS"), silent = TRUE)
      if (inherits(fit, "try-error")) next  # skip orders where CSS estimation fails
      fit.aic <- -2 * fit$loglik + (log(n) + 1) * length(fit$coef)
      if (fit.aic < best.aic)
      {
        best.aic <- fit.aic
        best.fit <- fit
        best.model <- c(p,d,q,P,D,Q)
      }
    }
  list(best.aic, best.fit, best.model)
}
library(forecast)
library(prophet)
library(tidyr)
library(TSA)
library(lubridate)
The dataset I have chosen is a monthly recording of water levels in Lake Erie from 1921 to 1970. The dataset was originally available at the following link: https://datamarket.com/data/set/22pw/monthly-lake-erie-levels-1921-1970#!ds=22pw&display=line&numberformat=n1. Below is a plot of the time series:
## Importing the data
#lkerie<-dmseries("https://datamarket.com/data/set/22pw/monthly-lake-erie-levels-1921-1970#!ds=22pw&display=line")
#lkerie <- as.ts(lkerie,frequency=12)
# Note: as of 2019, DataMarket is no longer available. As an alternative, I am
# using data from a different source that also has the Lake Erie water levels.
lkerie <- read.csv('http://lre-wm.usace.army.mil/ForecastData/GLHYD_data_english.csv', skip=12)
mask <- (lkerie$year <= 1970) & (lkerie$year >= 1921)
lkerie <- lkerie[mask, 'Erie']
lkerie <- ts(lkerie, frequency=12, start=c(1921, 1))
plot(lkerie)
We can see that the water levels move somewhat cyclically over time. This is typical of physical processes, which makes this an interesting dataset to analyze. The water levels decreased initially, and then began to increase again fairly steadily from the 1940s onwards. While there is clearly a seasonal component at work, it is not obvious from the plot above how that seasonal component behaves.
I first fit a seasonal means model with a linear trend to summarize the seasonal component of the data. I then build four models – a Holt-Winters model, a SARIMA model, a model that captures the seasonal component with spectral analysis, and a model from the prophet package. Finally, I compare the models by their sum of squared errors and by plotting their forecasts against the true values.
I have separated the time series into two sets, as shown below. These are the training and test sets I will use to evaluate how effective each fitted model is at estimating the true time series. The following code creates the training and test sets; the test set is the last 12 observations, or one year, of the series.
# Hold out the final year (12 monthly observations) as a test set.
lkerietrain<-ts(lkerie[1:(length(lkerie)-12)], frequency=12)
lkerietest<-lkerie[(length(lkerie)-11):length(lkerie)]
tail(lkerietrain)
After this initialization, the dataset is in good condition and doesn’t need any further cleaning. The last 6 values of the training series are shown above.
## Seasonal Means Model
The first model I fit is a seasonal means model, to get a sense of the seasonality of the dataset. Although this simplistic model will not be used for forecasting, it provides good insight into the seasonal nature of the series. Below is a plot of the time series with the month of each observation marked:
season_<-season(lkerie)
time_<-time(lkerie)
We can see that the time series tends to peak around the summer time, and has troughs in the colder months. The seasonal means model gives us more insight into how this is occurring:
seasonmean<-lm(lkerie~season_+time_)
summary(seasonmean)
plot(lkerie)
# Mark each observation with the first letter of its month
points(y=lkerie,x=time(lkerie),pch=as.vector(season(lkerie)))
Indeed, we can see that the seasonal means model has several significant components, suggesting a good amount of seasonality in the data. The effect is not uniform, however: some months have insignificant coefficients. In general, the summer months have positive coefficients, while the colder months (which are not nearly as significant) sit close to zero. This agrees with the earlier findings from the time series plot.
Next, I perform a decomposition of the time series to see how it breaks up into seasonal, trend and random components.
madecomp<-decompose(lkerie)
plot(madecomp)
madecomp$seasonal[1:12]
We can see again that there is clearly a seasonal component to the time series, along with the general trend of decreasing and then increasing. When these components are removed, we are left with a relatively stationary random process. This cyclical seasonality led me to consider spectral analysis. First, however, we will look at two other models that can handle seasonality – the Holt-Winters model and a seasonal ARIMA model.
## Holt-Winters Model
I use two Holt-Winters models (an additive and a multiplicative model) to model the data. Below, I fit the two models and plot the resulting fits. Both Holt-Winters models include seasonal components, although the multiplicative and additive models arrive at different values for the coefficients (see the appendix for the summaries). The smoothing parameters, however, are similar in character, with alpha=0.8973698, beta=0.003048736 and gamma=1. The large alpha and gamma indicate that the level and seasonal estimates rely almost entirely on the most recent observations (gamma = 1 means the seasonal component is updated using only the latest observation), while the very small beta means the trend estimate is smoothed heavily over the whole history and changes slowly. I have plotted one of the models below, but the other is very similar in nature. However, as we will see later, the prediction error is quite high – (and the prediction error for the multiplicative model is, predictably, much higher).
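For reference, here is a minimal sketch of one step of the additive Holt-Winters recursions (with period m = 12), just to make explicit what alpha, beta and gamma each control. hw_step is an illustrative helper introduced here, not part of any package; the actual fitting is done by HoltWinters() below.
# One step of the additive Holt-Winters update (illustrative sketch only).
# Parameters near 1 mean "trust the newest observation"; parameters near 0
# mean "smooth heavily over the past".
hw_step <- function(x_t, level, trend, seas_lag, alpha, beta, gamma) {
  new_level <- alpha * (x_t - seas_lag) + (1 - alpha) * (level + trend)
  new_trend <- beta * (new_level - level) + (1 - beta) * trend
  new_seas  <- gamma * (x_t - new_level) + (1 - gamma) * seas_lag
  c(level = new_level, trend = new_trend, seasonal = new_seas)
}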
hw1<-HoltWinters(lkerietrain)
hw1
plot(hw1)
hw2<-HoltWinters(lkerietrain,beta=NULL,seasonal="multiplicative")
hw2
plot(hw2)
hw1pred<-predict(hw1,n.ahead=12,prediction.interval=TRUE)
hw2pred<-predict(hw2,n.ahead=12,prediction.interval=TRUE)
plot(forecast(hw1,12))
plot(forecast(hw2,12))
ssehw1<-sse(hw1pred[,1], lkerietest)
ssehw2<-sse(hw2pred[,1], lkerietest)
## SARIMA Modeling
Next, we turn to an ARIMA model (in this case, a seasonal ARIMA model) to model and predict the data. The time series above is not stationary, so we must transform it; ARIMA modeling works best on a series that is roughly stationary. After looking through various transformations (including logarithms, differencing and Box-Cox transformations), I settled on differencing the logs. Differencing the logarithms of the series gives a result that is very close to stationary. The result is plotted below.
ts.plot(lkerie)
ts.plot(log(lkerie))
ts.plot(diff(lkerie))
ts.plot(diff(log(lkerie)))
BxCx<-BoxCox.ar(lkerie)
root<-BxCx$mle
ts.plot(diff(lkerie^(root)))
# Differenced log series used for model identification below (the models
# themselves are fit on the training set).
lkerietrain.trans<-diff(log(lkerie))
plot(lkerietrain.trans)
Except for the one outlier just before 1940, we can see that this time series is now stationary, and we can proceed to fit ARIMA models to it.
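As an optional numeric check (this assumes the tseries package, which is not loaded above), an augmented Dickey-Fuller test can back up the visual impression; a small p-value is consistent with the differenced log series being stationary.
library(tseries)
# ADF unit-root test on the differenced log series
adf.test(diff(log(lkerie)))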
Using the ACF, PACF and EACF plots, we can see that it will be difficult to find an ARMA model (without a seasonal component) that fits the data well. Since we believe there is a good degree of seasonality in this data, this is not surprising. In the ACF and PACF plots below, we see sine- and cosine-like patterns in the autocorrelations, which indicates that a seasonal model may be more appropriate than an ordinary ARMA or ARIMA model.
mean(lkerie)
mean(diff(lkerie))
mean(diff(log(lkerie), lag=12))
mean(diff(lkerie^root))
acf(lkerietrain.trans)
pacf(lkerietrain.trans)
Using the armasubsets function gives us a good idea of the seasonality, and we can better infer the best models to use from it.
a <- armasubsets(lkerietrain.trans,nar=12,nma=12)
plot(a)
We can see the most significant lags are those at times 1, 11, and 12 for the AR component, and lag 12 for the MA component. Using the auto.arima and get.best.arima functions, we can get the optimal model based on the AIC.
auto.arima(lkerietrain.trans)
get.best.arima(lkerietrain.trans,maxord=c(2,2,2,2,2,2))
The model from get.best.arima attains the minimal AIC and is therefore the preferred model. It is very similar to the models suggested by the subsets plot, adding an MA component to the seasonal part of the model. I consider both of the models fit below on the log series – SARIMA[0,1,1]x[2,0,2] and SARIMA[1,0,1]x[0,1,1]. In both models all of the terms are significant (see the check sketched after the fits), so there is no need to remove any.
Looking at the residuals of these SARIMA models, we can see that they are roughly normally distributed, with the outliers skewing the distribution somewhat. This is encouraging. Unfortunately, the residuals (and squared residuals) do still exhibit some autocorrelation. However, the squared residuals do not appear to be drastically more correlated than the residuals themselves. I have included the Q-Q plot and the ACF of the residuals below. For brevity, I have omitted the PACF, which shows similar results. Regardless, the remaining autocorrelation is disappointing. It led me to use spectral analysis to capture the seasonality, as I believed there was an underlying seasonal pattern that these standard SARIMA processes could not find.
eacf(lkerietrain.trans)
sarima1<-arima(log(lkerietrain), order = c(0, 1, 1),
               seasonal = list(order = c(2, 0, 2), period = frequency(lkerietrain)),
               method = "CSS")
sarima2<-arima(log(lkerietrain), order = c(1, 0, 1),
               seasonal = list(order = c(0, 1, 1), period = frequency(lkerietrain)),
               method = "CSS")
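As a quick sketch of how the significance claim can be verified, the helper below (introduced here for illustration; it is not part of any package) compares each coefficient to its asymptotic standard error. Ratios comfortably above 2 in absolute value indicate significant terms.
# Coefficient estimates, standard errors from the asymptotic covariance
# matrix, and their ratios; |t| > 2 is a rough significance threshold.
check_coefs <- function(fit) {
  est <- coef(fit)
  se <- sqrt(diag(vcov(fit)))
  cbind(estimate = est, se = se, t.ratio = est / se)
}
check_coefs(sarima1)
check_coefs(sarima2)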
sarima1pred<-predict(sarima1,n.ahead=12)
sarima2pred<-predict(sarima2,n.ahead=12)
# Back-transform from the log scale; note the prediction interval is
# exp(pred ± 1.96*se), not exp(pred) ± 1.96*exp(se).
predarima1<-exp(sarima1pred$pred)
predarima2<-exp(sarima2pred$pred)
uparima1<-exp(sarima1pred$pred+1.96*sarima1pred$se)
lowarima1<-exp(sarima1pred$pred-1.96*sarima1pred$se)
uparima2<-exp(sarima2pred$pred+1.96*sarima2pred$se)
lowarima2<-exp(sarima2pred$pred-1.96*sarima2pred$se)
plot(time(lkerie),c(lkerietrain,predarima1),ty="l",col="red",main="SARIMA 1",xlab="Time",ylab="Water Level")
lines(x=c(time(lkerie)),y=c(lkerietrain,uparima1),col="red",lty="dashed")
lines(x=c(time(lkerie)),y=c(lkerietrain,lowarima1),col="red",lty="dashed")
lines(time(lkerie)[1:length(lkerietrain)],lkerietrain,col="black")
plot(time(lkerie),c(lkerietrain,predarima2),ty="l",col="blue",main="SARIMA 2",xlab="Time",ylab="Water Level")
lines(x=c(time(lkerie)),y=c(lkerietrain,uparima2),col="blue",lty="dashed")
lines(x=c(time(lkerie)),y=c(lkerietrain,lowarima2),col="blue",lty="dashed")
lines(time(lkerie)[1:length(lkerietrain)],lkerietrain,col="black")
ssesarima1<-sse(predarima1, lkerietest)
ssesarima2<-sse(predarima2, lkerietest)
plot(residuals(sarima1))
plot(residuals(sarima2))
qqnorm(y=residuals(sarima1))
acf(residuals(sarima1))
acf(residuals(sarima1)^2)
pacf(residuals(sarima1))
pacf(residuals(sarima1)^2)
acf(residuals(sarima2))
acf(residuals(sarima2)^2)
pacf(residuals(sarima2))
pacf(residuals(sarima2)^2)
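To put a number on the residual autocorrelation seen in these plots, a Ljung-Box test could be added (a sketch; fitdf is set to 5, the number of ARMA coefficients in the [0,1,1]x[2,0,2] model).
# Portmanteau test on the residuals; a small p-value confirms the leftover
# autocorrelation visible in the ACF plot.
Box.test(residuals(sarima1), lag = 24, type = "Ljung-Box", fitdf = 5)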
## Spectral Analysis + ARIMA
The cyclical component of the time series led me to believe that spectral analysis of the seasonality could be fruitful. The method used here is to find the frequencies with the largest peaks in the periodogram and use them to model the seasonal components.
Using the ten frequencies with the highest spectrum on the log of the time series, I created twenty regressors – one cosine and one sine term for each frequency. I then regressed the log-transformed series on these regressors, plus a linear time term, to create a harmonic model.
spec.per<-spec(log(lkerietrain))
frequencies<-spec.per$freq[order(spec.per$spec,decreasing=TRUE)]
t<-1:length(lkerietrain)
length(t)
# Cosine/sine regressors for the ten leading frequencies. The first cosine
# term is named cc1 rather than c1, presumably to avoid clashing with base R's c().
cc1<-cos(2*pi*frequencies[1]*t)
d1<-sin(2*pi*frequencies[1]*t)
for (k in 2:10) {
  assign(paste0("c", k), cos(2*pi*frequencies[k]*t))
  assign(paste0("d", k), sin(2*pi*frequencies[k]*t))
}
spec.m1<-lm(log(lkerietrain)~cc1+c2+c3+c4+c5+c6+c7+c8+c9+c10+d1+d2+d3+d4+d5+d6+d7+d8+d9+d10+t)
summary(spec.m1)
The regression is highly significant, with a high R-squared (0.8804). The plot below shows a moderately good fit, considering it is just a sum of sine and cosine waves. However, the danger of overfitting is present. To mitigate this, I used the training data and an ARIMA(1,0,1) on the harmonic-regression residuals to find the sum of squared errors for different numbers of frequencies (see the loop below). The error plateaus at 2 frequencies. I kept the models with 2 and 4 frequencies and used them as predictions of the seasonal component.
plot(exp(fitted(spec.m1)),type="l",col="blue")
lines(t,lkerietrain)
# Regressor values for the 12 held-out months (time indices 589 to 600),
# in the same column order and with the same names as the training regressors.
tt<-589:600
test<-cbind(tt, do.call(cbind, lapply(1:10, function(k)
  cbind(cos(2*pi*frequencies[k]*tt), sin(2*pi*frequencies[k]*tt)))))
colnames(test)=c("t","cc1","d1","c2","d2","c3","d3","c4","d4","c5","d5","c6","d6","c7","d7","c8","d8","c9","d9","c10","d10")
plot(lkerietrain)
# For each number of frequencies i: fit the harmonic regression, model the
# residual series with an ARIMA(1,0,1), and print the in-sample SSE of the
# combined fit.
models<-list()
for(i in 1:10){
  beg<-"cc1+d1"
  if(i>1){
    for(s in 2:i){
      beg<-paste(beg,paste("c",s,sep=""),paste("d",s,sep=""),sep="+")
    }
  }
  end<-paste(beg,"t",sep="+")
  models[[i]]<-lm(formula(paste("log(lkerietrain)","~",end,sep="")))
  pre<-exp(fitted(models[[i]]))
  sub<-lkerietrain-pre
  arimasub<-Arima(sub, order=c(1,0,1))
  print(i)
  print(sum(((arimasub$fitted+pre)-lkerietrain)^2))
}
summary(models[[2]])
summary(models[[4]])  # c4 is insignificant, so remove it below
models[[4]]<-lm(log(lkerietrain)~cc1+c2+c3+d1+d2+d3+d4+t)
Using the models with 2 and 4 frequencies, I subtracted these fitted spectral models from the original time series. As the plots are very similar in nature, I will only produce one of them here:
sub1<-lkerietrain-exp(fitted(models[[2]]))
sub2<-lkerietrain-exp(fitted(models[[4]]))
plot(sub1)
plot(sub2)
This produced two time series very reminiscent of random noise, which is what we expect, since the goal was to remove the seasonal component from the data. After an analysis similar to the SARIMA modeling above, I arrived at two models for this white-noise-like data: SARIMA[1,0,1]x[0,1,1] for the first series and SARIMA[2,1,1]x[1,0,1] for the second (the code is included in the appendix; these had the smallest AIC from the get.best.arima() function).
acf(sub1)
pacf(sub1)
acf(sub2)
pacf(sub2)
eacf(sub1)
eacf(sub2)
plot(armasubsets(sub1,nar=10,nma=10))
plot(armasubsets(sub2,nar=10,nma=10))
auto.arima(sub1)
auto.arima(sub2)
get.best.arima(sub1,c(2,2,2,2,2,2))
get.best.arima(sub2,c(2,2,2,2,2,2))
a1spec1<-Arima(sub1,c(1,0,1))
a2spec1<-auto.arima(sub1)
a1spec2<-Arima(sub2,c(1,0,1))
a2spec2<-auto.arima(sub2)
a3spec1<-Arima(y = sub1, order = c(1, 0, 1), seasonal = c(0, 1, 1),
method = "CSS")
a3spec2<-Arima(y = sub2, order = c(2, 1, 1), seasonal = c(1, 0, 1),
method = "CSS")
The plots below show that the residuals for the first model are relatively normal and that they do not exhibit high autocorrelation or partial autocorrelation even when squared, which suggests that these ARIMA models are sufficient and a GARCH model is unnecessary.
resid1<-residuals(a3spec1)
resid2<-residuals(a3spec2)
plot(resid1)
acf(resid1)
acf(resid1^2)
plot(resid2)
pacf(resid2)
pacf(resid2^2)
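As a numeric complement to the squared-residual plots, the McLeod-Li test from the TSA package (already loaded above) applies Ljung-Box tests to the squared residuals over a range of lags; consistently large p-values support the conclusion that no GARCH model is needed.
# Ljung-Box p-values for the squared residuals at increasing lags;
# large p-values indicate no ARCH effects.
McLeod.Li.test(y = resid1)
McLeod.Li.test(y = resid2)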
# Predictions
specpred1<-exp(predict(models[[2]],newdata=as.data.frame(test)))
specpred2<-exp(predict(models[[4]],newdata=as.data.frame(test)))
arimapred1<-predict(a3spec1,n.ahead=12)$pred
arimapred2<-predict(a3spec2,n.ahead=12)$pred
arimase1<-predict(a3spec1,n.ahead=12)$se
arimase2<-predict(a3spec2,n.ahead=12)$se
finalpred1<-specpred1+arimapred1
finalpred2<-specpred2+arimapred2
ssesp1<-sse(finalpred1, lkerietest)
ssesp2<-sse(finalpred2, lkerietest)
ssesp1/12  # average squared error per forecast month
ssesp2/12
## Prophet
The final model I considered is produced by Facebook’s prophet package for R. All that is needed is to set the data up in the format prophet expects. Although the prophet package is somewhat of a “black box” – and it is most effective on daily data – I thought it would be interesting to include it and see how it performs compared with the other models.
# prophet expects a data frame with a Date column 'ds' and a numeric 'y'.
# Since lkerietrain was built without a calendar start, the monthly dates are
# reconstructed explicitly from the known start of the data (January 1921).
ds <- seq(as.Date("1921-01-01"), by = "month", length.out = length(lkerietrain))
y <- as.numeric(log(lkerietrain))
df <- data.frame(ds, y)
prop <- prophet(df, weekly.seasonality = TRUE)
future <- make_future_dataframe(prop, periods = 12, freq = "month")
fcst <- predict(prop, future)  # not named 'forecast', to avoid masking forecast::forecast()
fitproph <- exp(fcst$yhat[1:(length(fcst$yhat) - 12)])
predproph <- exp(fcst$yhat[(length(fcst$yhat) - 11):length(fcst$yhat)])
confintup <- exp(fcst$yhat_upper[(length(fcst$yhat) - 11):length(fcst$yhat)])
confintlow <- exp(fcst$yhat_lower[(length(fcst$yhat) - 11):length(fcst$yhat)])
sseproph <- sse(predproph, lkerietest)
## Results
Below is a table of the sum of squared errors for each model when forecasting the next 12 months, measured against the test data.
results<-data.frame(c(ssehw1,ssehw2,ssesarima1,ssesarima2,ssesp1,ssesp2,sseproph))
colnames(results)<-"SSE"
rownames(results)<-(c("HW 1","HW 2","SARIMA 1","SARIMA 2","Spec 1","Spec 2","Prophet"))
results
We can see that the models which use frequencies from spectral analysis to remove the seasonal component are the best at predicting the data. This makes sense, as many geophysical processes can be modeled well by the sine and cosine waves of spectral analysis. After these two models, the SARIMA 2 model predicts with the least error. Below is a plot showing all of the models’ predictions along with the true values of the time series.
length(lkerie)
range<-560:600
plot(time(lkerie)[range],lkerie[range],ty="l",xlab="Time",ylab="Water Level",main="Overview of Models Used")
addplot(finalpred1,range,colors2[1])
addplot(finalpred2,range,colors2[2])
addplot(predarima1,range,colors2[3])
addplot(predarima2,range,colors2[4])
addplot(hw1pred[,1],range,colors2[5])  # first column is the point forecast
addplot(hw2pred[,1],range,colors2[6])
addplot(predproph,range,colors2[7])
lines(time(lkerie)[range],lkerie[range])
legend('topleft', c("True","Spec 1","Spec 2","SARIMA 1","SARIMA 2","HW 1","HW 2","Prophet"),
       col=c("black",colors2[1:7]), lty=c("solid",rep("dashed",7)), bty='n', cex=.75)
Finally, I produce graphs with confidence intervals for the most effective models of each type.
plot(forecast(hw1,12))
plot(x=time(lkerie),y=c(lkerietrain,finalpred2),ty="l",main="Forecast of Spec 2",xlab="Time",ylab="Water Level")
lines(x=c(time(lkerie)),y=c(lkerietrain,finalpred2+1.96*arimase2),lty="dashed",col="blue")
lines(x=c(time(lkerie)),y=c(lkerietrain,finalpred2-1.96*arimase2),lty="dashed",col="blue")
lines(time(lkerie)[1:length(lkerietrain)],lkerietrain,col="black")
plot(time(lkerie), c(lkerietrain,predarima2), ty='l', col="blue",main="SARIMA 2",xlab="Time",ylab="Water Level")
lines(x=c(time(lkerie)),y=c(lkerietrain,uparima2),col="blue",lty="dashed")
lines(x=c(time(lkerie)),y=c(lkerietrain,lowarima2),col="blue",lty="dashed")
lines(time(lkerie)[1:length(lkerietrain)],lkerietrain)