Thursday, June 16, 2016

Time Series in R .. to be continued

library(quantmod)

SBUX <- getSymbols("SBUX", from="2012-1-1", to="2012-12-31", auto.assign=F)
plot(SBUX$SBUX.Adjusted, main="SBUX Closing Price (adjusted) for 2012")

# Typically there are many things we might want to do in time series analysis:
# * generate forecasts of future values (predictive distributions)
# * gain an understanding of the process itself

# Constructing indicators:
# how can a ts of past info be "transformed" into indicators useful for forecasting?

# Standard ts models fit a process to the series, requiring the estimation of a few parameters.

# Standard approach
# 1. Plot the time series. There are a lot of features that are still most readily identified by a human.
# 2. Transform the original ts to a stationary ts:
#    (a) determine whether to apply a Box-Cox/log transformation or not (a sketch follows this list)
#    (b) check for trend, seasonality, sharp changes, outliers, etc., then apply the appropriate transformation
# 3. Fit a model to the stationary ts.
# 4. Run diagnostic tests to check the model is reasonable; if it's not, go back to step 2.
# 5. Generate forecasts (predictive distributions) from the model.
# 6. Invert the transformation made in step 2 to get forecasts of the original ts.
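
# A quick sketch of step 2(a), assuming the forecast package; AirPassengers is a
# built-in R dataset used purely for illustration, not data from these notes.
library(forecast)
lambda <- BoxCox.lambda(AirPassengers) # estimate lambda; a value near 0 suggests a log transform
plot(BoxCox(AirPassengers, lambda))    # transformed series should have stabler variance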

# Stationarity

# Weakly stationary ts if:
# (1) Var(X(t)) is finite for all t
# (2) E[X(t)] = mu for all t
# (3) autocov(r,r+h) = Cov(X(r), X(r+h)) = f(h) = autocov(h), i.e., depends only on the lag h

# Also (3) implies that Var(X(t)) is independent of t, that is, a constant. So (1) can be restated.

# Examples
# (1) White noise process WN(mu, sigmaSq) is stationary
# (2) Random walk process X(t) = sum over i=1..t of Z(i), with Z ~ WN(0, sigmaSq), is NOT stationary
# (3) MA(1) process X(t) = Z(t) + theta*Z(t-1) is stationary
# (4) ARCH(1) process is stationary; in fact an ARCH(1) process is also a white noise process
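
# A minimal sketch contrasting (1) and (2): white noise stays centered, while its
# cumulative sum (a random walk) wanders because its variance grows with t.
set.seed(1)
wn <- rnorm(500) # WN(0,1), stationary
rw <- cumsum(wn) # random walk, NOT stationary
par(mfrow=c(2,1)); plot.ts(wn); plot.ts(rw); par(mfrow=c(1,1))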


acf(SBUX$SBUX.Adjusted)

Box.test(SBUX$SBUX.Adjusted,lag=10,type="Ljung-Box")

pacf(SBUX$SBUX.Adjusted)

#-----------------------------

# AR(1)

# Stationarity of AR(1): the process is stationary if |phi1| < 1

# use arima() to fit; a simulated sketch follows
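
# A minimal sketch (values illustrative, not from the original notes): simulate an
# AR(1) with phi1=0.7, which satisfies |phi1| < 1, and recover phi1 with arima().
set.seed(2)
ar1dat <- arima.sim(model=list(ar=0.7), n=500)
arima(ar1dat, order=c(1,0,0)) # the ar1 estimate should come out near 0.7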

plot(SBUX$SBUX.Adjusted)
Box.test(SBUX$SBUX.Adjusted,lag=10,type="Ljung-Box", fitdf=1) # fitdf=1 adjusts the test's degrees of freedom for the one fitted AR(1) parameter
arima(SBUX$SBUX.Adjusted, order=c(1,0,0))

plot(diff(log(SBUX$SBUX.Adjusted))[-1])
Box.test(diff(log(SBUX$SBUX.Adjusted))[-1],lag=10,type="Ljung-Box", fitdf=1)
arima(diff(log(SBUX$SBUX.Adjusted))[-1], order=c(1,0,0))

# AR(p)

DGS3 <- getSymbols("DGS3", src="FRED", auto.assign=F) # 3-year treasury constant maturity rate
DGS3 <- last(DGS3, "5 years") # keep only recent data (the exact window was garbled in the source; "5 years" is a stand-in)
DGS3diff <- diff(DGS3, 1)                # first differences
DGS3diff <- DGS3diff[!is.na(DGS3diff)]   # drop the leading NA from differencing
head(DGS3diff)
pacf(DGS3diff) # AR(1) not appropriate: under AR(1), lags 2,3,... shouldn't matter after accounting for lag 1
# In general, in the PACF plot for an AR(p), you shouldn't see significant values after lag p --- Important

# The ACF formula is hard to derive for AR(p), but the function ARMAacf computes it.
# See how the ACF and PACF for an AR(2) model with phi1=0.5 and phi2=-0.2 look:
plot(0:10, ARMAacf(ar=c(0.5,-0.2), lag.max=10), type="b", pch=16) #ACF
plot(1:10, ARMAacf(ar=c(0.5,-0.2), lag.max=10, pacf=TRUE), type="b", pch=16)  #PACF

# using auto.arima() to fit.
# 1. argument max.p sets the largest value of p to consider; the default max is 5.
# 2. to fit a pure AR(p) model, either use arima() or specify max.q=0 in auto.arima()
# 3. ic="aic", "bic", or "aicc" (corrected AIC) - used for model selection by auto.arima()
# 4. input data to this fn should be a vector, not xts. Apply as.numeric()
# 5. specify stepwise=FALSE if you want an exhaustive search
# 6. seasonal=FALSE if you don't want to consider models with seasonal components
# 7. argument max.q sets the largest value of q to consider; the default max is 5.
# 8. d can be specified explicitly (e.g. d=2), else it will be determined automatically (not preferred).


# Consider daily trading volume of MSFT for 2012, find first difference of series.
# what order AR is chosen with aicc?

# Answer

library(forecast)
MSFT <- getSymbols("MSFT", from="2012-1-1", to="2012-12-31", auto.assign=F)
MSFTVolumeDiff <- diff(MSFT$MSFT.Volume)[-1] # first differences; [-1] drops the leading NA
auto.arima(as.numeric(MSFTVolumeDiff), max.p=6, ic="aicc")

# It fit a model with p=1 and q=1. In addition, the mean parameter was found to be unnecessary.

# Stationarity of AR(p)
# Recall that a root of a function f(x) is a value of x that makes f(x) equal 0.
# It can be shown that AR(p) is stationary if EVERY root of
#   f(x) = 1 - phi1*x - phi2*x^2 - ... - phip*x^p
# has absolute value greater than 1.
# To get the roots of f(x) above, in R use polyroot(c(1, -phi1, -phi2, ...))

# For example, say an AR(2) with phi1 = 1.2074 and phi2 = -0.2237 is fit. Is it stationary?
polyroot(c(1,-1.2074,0.2237))
# both roots have modulus greater than 1, but the first one is pretty close.
# Ruppert says, "since the roots are estimated with error, there is reason to suspect nonstationarity"
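
# polyroot() returns complex roots; Mod() gives their absolute values for the comparison against 1:
Mod(polyroot(c(1,-1.2074,0.2237)))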

# Unit Root Tests
# These are hypothesis tests where the null hypothesis is a unit root (non-stationarity)
# the simplest one being the Dickey-Fuller test - but it is useful only for AR(1)
# The Augmented Dickey-Fuller (ADF) test is more commonly used: adf.test()

library(tseries)
adf.test(as.numeric(DGS3diff), alternative="s") #"s" for stationarity, "e" for explosive
#p-value of 0.01 suggests we can reject the null in favor of alternative, that is "stationary"

# Let's see what auto.arima fits on DGS3diff
auto.arima(as.numeric(DGS3diff), max.p=6, ic="aicc")
# It fit ARIMA(2,0,2) with some parameters. Can't use polyroot on it since it has MA component.

# MA, ARMA, ARIMA

# MA(q) process - Stationary, no correlation beyond lag q in ACF plot.

# Let's simulate MA(2) process data of length 250 with theta1=0.3, theta2=-0.2, and sigmaSq=1
simdat <- arima.sim(model=list(ma=c(0.3,-0.2)), n=250, sd=sqrt(1))
# See if the simulated data is stationary by using ADF test
adf.test(as.numeric(simdat),alternative="s") #small p value suggests stationary data

# See if any correlation beyond lag 2 in ACF plot
acf(simdat) # no correlation beyond lag 2, as expected, since data was simulated from MA(2) process

# Let's fit an MA(2) model to this data (as if we didn't know it was simulated from the specified MA(2))
holdfit <- arima(simdat, order=c(0,0,2))
holdfit # returns an MA(2) with theta1 0.24, theta2 -0.16 and sigmaSq 0.88

# But the theta1 estimate has a std. error - let's see the confidence interval for it
holdfit$coef[1] - 1.96*sqrt(holdfit$var.coef[1,1])
holdfit$coef[1] + 1.96*sqrt(holdfit$var.coef[1,1]) # 95% C.I for theta1 is (0.11,0.35)

#Ljung-Box Test (H1: "at least some correl at some lag") applied to residuals of the fit
Box.test(holdfit$residuals, fitdf=2, lag=20, type="Ljung-Box") #p not small enough to reject null-> no correl
acf(holdfit$residuals) # acf confirms box test result
qqnorm(holdfit$residuals) # normal probability plot of the residuals (qqline adds the reference line)
qqline(holdfit$residuals)
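
# tsdiag() bundles these residual diagnostics (standardized residuals, residual ACF,
# and Ljung-Box p-values at increasing lags) into a single display:
tsdiag(holdfit)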

# Comparing acf from simulated data to acf from true process
acf(simdat, cex.axis=1.3,cex.lab=1.3,main="") # Sample ACF of simulated data
trueacf=ARMAacf(ma=c(0.3,-0.2), lag.max=30) # true ACF
points(0:30,trueacf, pch=16, col=2)
fitacf=ARMAacf(ma=holdfit$coef[1:2],lag.max=30) #ACF of the fitted model
points(0:30,fitacf, col=4)
legend(23,1,legend=c("True ACF","Fitted model ACF"), pch=c(16,1),col=c(2,4), xjust=1, cex=1.3)

# fitting ARIMA

DTWEXM <- getSymbols("DTWEXM", src="FRED", auto.assign=F) # trade-weighted dollar index
DTWEXM <- DTWEXM[!is.na(DTWEXM)] # (this cleanup line was garbled in the source; dropping NAs is assumed)
plot(DTWEXM) # clearly not stationary
plot(diff(DTWEXM,1,1)) #diff(vec,n,d) is lag n difference (X(t)-X(t-n)) applied d times #now looks stationary
plot(diff(DTWEXM,1,2)) #2nd (lag-1) differences .. this too looks stationary.

# We want smallest d that's good enough
adf.test(as.numeric(diff(DTWEXM,1,1)[!is.na(diff(DTWEXM,1,1))]), alternative="s")
adf.test(as.numeric(diff(DTWEXM,1,2)[!is.na(diff(DTWEXM,1,2))]), alternative="s")
# Looks like 1st differencing was sufficient, so d=1
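
# As a cross-check (not in the original notes), forecast's ndiffs() estimates the
# number of differences required for stationarity; compare with the d=1 chosen above.
ndiffs(as.numeric(DTWEXM))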

# REVISION: using auto.arima() to fit.
# 1. argument max.p sets the largest value of p to consider; the default max is 5.
# 2. to fit a pure AR(p) model, either use arima() or specify max.q=0 in auto.arima()
# 3. ic="aic", "bic", or "aicc" (corrected AIC) - used for model selection by auto.arima()
# 4. input data to this fn should be a vector, not xts. Apply as.numeric()
# 5. specify stepwise=FALSE if you want an exhaustive search
# 6. seasonal=FALSE if you don't want to consider models with seasonal components
# 7. argument max.q sets the largest value of q to consider; the default max is 5.
# 8. d can be specified explicitly (e.g. d=2), else it will be determined automatically (not preferred).
# 9. approx=F so that true maximum likelihood estimates are found

fitted_model <- auto.arima(as.numeric(DTWEXM), d=1, ic="aic") # we fit an ARIMA to it and see what it chooses
fitted_model
#---------------------------------------------------------------------------------------

# Forecasting

# REVISION: Standard approach
# 1. Plot the time series. There are a lot of features that are still most readily identified by a human.
# 2. Transform the original ts to a stationary ts:
#    (a) determine whether to apply a Box-Cox/log transformation or not
#    (b) check for trend, seasonality, sharp changes, outliers, etc., then apply the appropriate transformation
# 3. Fit a model to the stationary ts.
# 4. Run diagnostic tests to check the model is reasonable; if it's not, go back to step 2.
# 5. Generate forecasts (predictive distributions) from the model.
# 6. Invert the transformation made in step 2 to get forecasts of the original ts.

k=3 # number of future predictions
# Recall that holdfit was the model fitted to the simulated MA(2) values
next_k_preds <- predict(holdfit, n.ahead=k)
next_k_preds
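
# predict() returns point forecasts ($pred) and standard errors ($se); a rough 95%
# interval for each step ahead:
next_k_preds$pred + 1.96*next_k_preds$se
next_k_preds$pred - 1.96*next_k_preds$se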
# Let's do another example

AAA <- getSymbols("AAA", src="FRED", auto.assign=F) # Moody's AAA corporate bond yield
AAA <- AAA[!is.na(AAA)] # (this cleanup line was garbled in the source; dropping NAs is assumed)
plot(AAA) # Clearly not stationary, there is a trend
plot(diff(AAA)) # looks like trend has been removed, could be stationary
adf.test(as.numeric(diff(AAA)[-1]),alt="s") # small p value suggests stationary
holdfit <- auto.arima(as.numeric(AAA), d=1, approx=F) # see what ARIMA process it fits
#diagnostic tests
Box.test(holdfit$residuals, fitdf=1, lag=10, type="Ljung-Box") #high p-value, fail to reject null: no correl
acf(holdfit$residuals)# suggests no serial correlation among residuals
pacf(holdfit$residuals) # suggests no remaining partial autocorrelation among residuals
plot(holdfit$residuals) # looks good

#generate forecasts
holdpreds <- predict(holdfit, n.ahead=6) # note that R also takes care of step 6, inverting the transformation
holdpreds

#--------------------------------------------------------------------------------------------

# Seasonality

HOUSTNSA <- getSymbols("HOUSTNSA", src="FRED", auto.assign=F) # housing starts, not seasonally adjusted
HOUSTNSA <- HOUSTNSA[!is.na(HOUSTNSA)] # (this cleanup line was garbled in the source; dropping NAs is assumed)
plot(HOUSTNSA, ylab="Housing Starts (thousands of units)")

# We notice a trend - the series is not stationary
# We also notice seasonality
acf(HOUSTNSA) # confirms seasonality, peaks every 12 lags

plot(diff(HOUSTNSA)[-1]) # this removes the trend, but seasonality remains
acf(diff(HOUSTNSA)[-1])  # seasonal peaks still visible
plot(diff(HOUSTNSA, 12,1)) #this removes seasonality but trends remain. (seasonally adjusted)
plot(diff(diff(HOUSTNSA, 12,1))) # this removes both and should be good. ------------(A)
# we can fit ARMA to it, but in practice this turns out to not yield models with sufficient flexibility
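
# A quick check (not in the original notes) that (A) worked: the ACF of the doubly
# differenced series should show neither trend nor prominent seasonal peaks.
hold <- diff(diff(HOUSTNSA, 12, 1))
acf(as.numeric(hold[!is.na(hold)]))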

# Multiplicative ARIMA models for accommodating seasonality (preferred)
# ARIMA((p,d,q)*(P,D,Q)s) where s is the period of the seasonality (12 for monthly data with yearly seasonality, for example)
# Think of it as fitting ARIMA(p,d,q) to the non-seasonal component and ARIMA(P,D,Q) to the seasonal component of the ts.
# This multiplicative ARIMA model is fitted to the ts obtained in (A) above
# (In R you needn't do the differencings first)

# Fitting a multiplicative ARIMA model

# METHOD 1: using arima(), fitting an ARIMA((1,1,1)*(1,1,0)12) to the housing data again.
holdfit1 <- arima(HOUSTNSA, order=c(1,1,1), seasonal=list(order=c(1,1,0), period=12))
# METHOD 2: using auto.arima()

# s is typically already known, or estimated by exploring the ACF; it's not something auto.arima() finds.
# ts object fed to auto.arima should already be in a form that specifies the period of the series, i.e. s

HOUSts <- ts(as.numeric(HOUSTNSA), frequency=12) # ts() creates an R time series object (not xts) that carries the period s
# d and D should always be specified yourself based on inspection of the series, although auto.arima can search for them
holdfit2 <- auto.arima(HOUSts, d=1, D=1, approx=F)
# We got a better (smaller) aic than that in the totally self-specified arima() model holdfit1.

# Use it for forecasting now.
holdpreds <- predict(holdfit2, n.ahead=24)
plot(HOUSts, xlab="Year", xlim=c(0,22))
lines(holdpreds$pred,col=2)
lines(holdpreds$pred + (1.96*holdpreds$se), col=2, lty=2)
lines(holdpreds$pred - (1.96*holdpreds$se), col=2, lty=2)

#---------------------------------------------------------------------------------------------

# Heteroskedasticity

#---------------------------------------------------------------------------------------------

# ARFIMA models

#---------------------------------------------------------------------------------------------

# Revisiting Regression

# One key assumption in regression was that the errors are iid normal, i.e., uncorrelated with constant variance
holdlm=lm(HOUSts~sqrt(HOUSts)) # just some random regression

# In a time series regression, to test whether the errors really are uncorrelated, it makes sense to look at the ACF of the residuals
acf(holdlm$residuals, lag.max=10)

# Another way is the durbin-watson hypothesis test, made specifically to test serial correl in a linear model
# H0: there is no correlation. If small p-value, reject the null: there is correlation.
library(car)
durbinWatsonTest(holdlm,10)

# DW suffers from the same multiple testing problem as acf. Can use Ljung-Box on holdlm$residuals, if needed.
# In our example we can see that errors ARE correlated.

# How to deal with this problem?
# 1. might want to log transform the predictor or response variable or both in the linear model
# 2. If there is significant serial correlation, model the error as an ARIMA(p,d,q) process instead of white noise.

# Example. Response var is the 3-year treasury rate, predictor var is the 1-year treasury rate.
OneYearTreas <- getSymbols("DGS1", src="FRED", auto.assign=F) # ("DGS1" assumed; the FRED symbol was garbled in the source)
ThreeYearTreas <- getSymbols("DGS3", src="FRED", auto.assign=F)

holdfit1 <- lm(ThreeYearTreas ~ OneYearTreas)
holdfit2 <- arima(ThreeYearTreas, order=c(1,1,1), xreg=OneYearTreas)
holdfit3 <- auto.arima(as.numeric(ThreeYearTreas), d=1, xreg=as.numeric(OneYearTreas))
acf(holdfit3$residuals)

# This solves the problem of serial correlation. What about non-constant variance?
# One way is to model the errors as GARCH. Unfortunately it's not as straightforward as with ARIMA above. Follow these steps:
# 1. Fit a regular lm model.
# 2. Fit a GARCH time series model to the residuals from the model fit above.
# 3. Refit the model using weighted least squares, the weights being reciprocals of the conditional variances from step 2.
#    Conditional variances from garchFit() are found as: holdgarchfit@sigma.t
#    weights = 1/(holdgarchfit@sigma.t^2)
# Do it yourself - a sketch follows.
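
# A minimal sketch of steps 1-3, assuming the fGarch package; variable names are
# illustrative, and the series are assumed NA-cleaned and aligned.
library(fGarch)
y <- as.numeric(ThreeYearTreas); x <- as.numeric(OneYearTreas)
step1 <- lm(y ~ x)                             # 1. ordinary least squares fit
holdgarchfit <- garchFit(~ garch(1,1),         # 2. GARCH(1,1) fit to the residuals
                         data=residuals(step1), trace=FALSE)
w <- 1/(holdgarchfit@sigma.t^2)                # reciprocal conditional variances
step3 <- lm(y ~ x, weights=w)                  # 3. weighted least squares refit
summary(step3)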

#----------------------------------------------------------------------------------------------
