Showing posts with label Data Analysis. Show all posts
Showing posts with label Data Analysis. Show all posts

Thursday, June 16, 2016

Time Series in R .. to be continued

library(quantmod)

SBUX <- auto.assign="F)</p" from="2012-1-1" getsymbols="" to="2012-12-31">
plot(SBUX$SBUX.Adjusted, main="SBUX Closing Price (adjusted) for 2012")

// Typically there are many things we might want to do in time series analysis
// *generate forecasts of future values (predictive distributions)
// *gain an understanding of the process itself

//Constructing Indicators
//how can a ts of past info be "transformed" into indicators useful for forecasting

//Standard ts models fit a process to the series, requiring the estimation of a few parameters

//Standard approach
//1. Plot the time series. There are a lot of features that are still most readilty identified by a human
//2. Transform original ts to a stationary ts
//(a) determine whether to apply boxcox/log transformation or not
//(b) check for trend, seasonality, sharp changes, outliers, etc then do appropriate transformation
//3. Fit a model to the stationary ts
//4. Disgnostic tests to check model is reasonable, if its not go back to step 2
//5. Generate forecasts (predictive distributions) for the model
//6. Invert the transformation made in step 2, get forecasts of original ts.

# Stationarity

//Weakly stationary ts if :
//Var(X(t))//E[X(t)]=mu for all t ---------------------------------(2)
//autocov(r,r+h)=Cov(X(r), X(r+h))= f(h) = autocov(h) --(3)

//Also (3) implies that Var(X(t)) is independent of t, that is, a constant. So (1) can be restated.

//Examples
//(1) White noise process WN(mu, SigmaSq) is stationary
//(2) Random Walk process X(t) = Summ(1,t) [WN(0,SigmaSq)] is NOT stationary
//(3) MA(1) process X(t) = WN(t) + theta*WN(t-1) is stationary
//(4) ARCH(1) process is stationary, actually ARCH(1) process is also a white noise process


acf(SBUX$SBUX.Adjusted)

Box.test(SBUX$SBUX.Adjusted,lag=10,type="Ljung-Box")

pacf(SBUX$SBUX.Adjusted)

//-----------------------------

//AR 1

//Stationarity of AR1. Stationary if |fi1| < 1

//use arima() to fit

plot(SBUX$SBUX.Adjusted)
Box.test(SBUX$SBUX.Adjusted,lag=10,type="Ljung-Box", fitdf=1) //fitdf=1 since AR(1), prevents overfitting
arima(SBUX$SBUX.Adjusted, order=c(1,0,0))

plot(diff(log(SBUX$SBUX.Adjusted))[-1])
Box.test(diff(log(SBUX$SBUX.Adjusted))[-1],lag=10,type="Ljung-Box", fitdf=1)
arima(diff(log(SBUX$SBUX.Adjusted))[-1], order=c(1,0,0))

//AR(p)

DGS3 <- auto.assign="F)</p" getsymbols="" src="FRED">DGS3<-last p="" years="">head(DGS3diff)
DGS3diff<-diff 1="" p="">DGS3diff<-dgs3diff diff="" is.na="" p="">
pacf(DGS3diff) #AR1 not appropriate as in AR1 lags 2,3.. shouldn't matter after accounting for lag 1
#In general in PACF plot for AR(p), shouldn't see significant numbers after lag p --- Important

#acf formula is hard for AR(p) but there is function ARMAacf
#see how the ACF and PACF for an AR(2) model with fi1=0.5 and fi2=-0.2 looks:
plot(0:10, ARMAacf(ar=c(0.5,-0.2), lag.max=10), type="b", pch=16) #ACF
plot(1:10, ARMAacf(ar=c(0.5,-0.2), lag.max=10, pacf=TRUE), type="b", pch=16)  #PACF

# using auto.arima() to fit.
# 1. argument max.p sets the largest value of p to consider, max allowed is 5.
# 2. to fit a pure AR(p) model, either use arima() or in auto.arima() specify max.q=0
# 3. ic="aic" or "bic", or "aicc" which is corrected AIC - used for model selction by auto.arima()
# 4. input data to this fn should be vector, not xts. Apply as.numeric()
# 5. stepwise=FALSE to be specified if you want exhaustive search
# 6. seasonal=FALSE if don't want to consider models with seasonal components (default)
# 7. argument max.q sets the largest value of q to consider, max allowed is 5.
# 8. d can be specified explictly (e.g. d=2) else it will determine itself (not preferred).


# Consider daily trading volume of MSFT for 2012, find first difference of series.
# what order AR is chosen with aicc?

# Answer

library(forecast)
MSFT <- auto.assign="F)</p" from="2012-1-1" getsymbols="" to="2012-12-31">MSFTVolumeDiff <- diff="" olume="" p="">auto.arima(as.numeric(MSFTVolumeDiff), max.p=6, ic="aicc")

//It fit a model with p=1 and q=1. In addition, mean parameter was found to be unnecessary.

//Stationarity of AR(p)
//Recall that root of a fn f(x) is value of x that makes f(x) equal 0.
//It can be shown that AR(p) is stationary if EVERY root of f(x) = 1 - Summ:n(1,p)[FI(n)*x^n]
//has absolute value greater than 1.
//To get the roots of f(x) shown above we can use in R, polyroot(c(1, -FI(1), -FI(2),...))

# for example, lets say an AR(2) with FI(1) -1.2074 and FI(2) -0.2237 is fit. Is it stationary?
polyroot(c(1,-1.2074,0.2237))
# both are greater than 1, but the first one is pretty close.
# Ruppert says, "since the roots are estimated with error, reason to suspect nonstationarity"

# Unit Root Tests
# These are hypothesis tests for stationarity
# the simplest one being Dickey-Fuller test - but useful only for AR(1)
# Augmented Dickey Fuller (ADF) is more commonly used: adf.test()

library(tseries)
adf.test(as.numeric(DGS3diff), alternative="s") #"s" for stationarity, "e" for explosive
#p-value of 0.01 suggests we can reject the null in favor of alternative, that is "stationary"

# Lets see what auto.arima fits on DGS3diff
auto.arima(as.numeric(DGS3diff), max.p=6, ic="aicc")
# It fit ARIMA(2,0,2) with some parameters. Can't use polyroot on it since it has MA component.

// MA, ARMA, ARIMA

# MA(q) process - Stationary, no correlation beyond lag q in ACF plot.

# Lets simulate MA(2) process data of length 250 with theta1=0.3 and theta2=-0.2 and sigmaSq=1
simdat <- arima.sim="" model="list(ma=c(0.3,-0.2))," n="250," sd="sqrt(1))</p">
# See if the simulated data is stationary by using ADF test
adf.test(as.numeric(simdat),alternative="s") #small p value suggests stationary data

# See if any correlation beyond lag 2 in ACF plot
acf(simdat) # no correlation beyond lag 2, as expected, since data was simulated from MA(2) process

# Let's fit a MA(2) model to this data (as if we didn't know it was simulated data from specified MA(2))
holdfit <- arima="" order="c(0,0,2))</p" simdat="">holdfit # returns an MA(2) with theta1 0.24, theta2 -0.16 and sigmaSq 0.88

# But the theta1 estimate has a std error - lets see the confidence interval for it
holdfit$coef[1] - 1.96*sqrt(holdfit$var.coef[1,1])
holdfit$coef[1] + 1.96*sqrt(holdfit$var.coef[1,1]) # 95% C.I for theta1 is (0.11,0.35)

#Ljung-Box Test (H1: "at least some correl at some lag") applied to residuals of the fit
Box.test(holdfit$residuals, fitdf=2, lag=20, type="Ljung-Box") #p not small enough to reject null-> no correl
acf(holdfit$residuals) # acf confirms box test result
qqnorm(holdfit$residuals) # normal probability plot.. need to make this work.
qqline(holdfit$residuals)

# Comparing acf from simulated data to acf from true process
acf(simdat, cex.axis=1.3,cex.lab=1.3,main="") # Sample ACF of simulated data
trueacf=ARMAacf(ma=c(0.3,-0.2), lag.max=30) # true ACF
points(0:30,trueacf, pch=16, col=2)
fitacf=ARMAacf(ma=holdfit$coef[1:2],lag.max=30) #ACF of the fitted model
points(0:30,fitacf, col=4)
legend(23,1,legend=c("True ACF","Fitted model ACF"), pch=c(16,1),col=c(2,4), xjust=1, cex=1.3)

# fitting ARIMA

DTWEXM <- auto.assign="F)" dollar="" getsymbols="" index="" p="" src="FRED" trade-weighted="">DTWEXM <- dtwexm="" p="">plot(DTWEXM) # clearly not stationary
plot(diff(DTWEXM,1,1)) #diff(vec,n,d) is lag n difference (X(t)-X(t-n)) applied d times #now looks stationary
plot(diff(DTWEXM,1,2)) #2nd (lag-1) differences .. this too looks stationary.

# We want smallest d that's good enough
adf.test(as.numeric(diff(DTWEXM,1,1)[!is.na(diff(DTWEXM,1,1))]), alternative="s")
adf.test(as.numeric(diff(DTWEXM,1,2)[!is.na(diff(DTWEXM,1,2))]), alternative="s")
# Looks like 1st differencing was sufficient, so d=1

# REVISION: using auto.arima() to fit.
# 1. argument max.p sets the largest value of p to consider, max allowed is 5.
# 2. to fit a pure AR(p) model, either use arima() or in auto.arima() specify max.q=0
# 3. ic="aic" or "bic", or "aicc" which is corrected AIC - used for model selction by auto.arima()
# 4. input data to this fn should be vector, not xts. Apply as.numeric()
# 5. stepwise=FALSE to be specified if you want exhaustive search
# 6. seasonal=FALSE if don't want to consider models with seasonal components (default)
# 7. argument max.q sets the largest value of q to consider, max allowed is 5.
# 8. d can be specified explictly (e.g. d=2) else it will determine itself (not preferred).
# 9. approx=F tso that true maximum likelihood estimates are found

fitted_model<-auto .arima="" a="" arima="" as.numeric="" d="1," fit="" ic="aic" it.="" it="" p="" see="" that="" to="" we="">
#---------------------------------------------------------------------------------------

# Forecasting

# REVISION: Standard approach
//1. Plot the time series. There are a lot of features that are still most readilty identified by a human
//2. Transform original ts to a stationary ts
//(a) determine whether to apply boxcox/log transformation or not
//(b) check for trend, seasonality, sharp changes, outliers, etc then do appropriate transformation
//3. Fit a model to the stationary ts
//4. Disgnostic tests to check model is reasonable, if its not go back to step 2
//5. Generate forecasts (predictive distributions) for the model
//6. Invert the transformation made in step 2, get forecasts of original ts.

k=3 # number of future predictions
# Rem that holdfit was the model fitted to MA(2) simulated values
next_k_preds <-predict holdfit="" n.ahead="k)</p">
# Lets do another example

AAA <- auto.assign="F," getsymbols="" p="" src="FRED">AAA <- aaa="" p="">
plot(AAA) # Clearly not stationary, there is a trend
plot(diff(AAA)) # looks like trend has been removed, could be stationary
adf.test(as.numeric(diff(AAA)[-1]),alt="s") # small p value suggests stationary
holdfit <- a="" approx="F)" arima="" as.numeric="" auto.arima="" d="1," fits="" it="" p="" process="">
#diagnostic tests
Box.test(holdfit$residuals, fitdf=1, lag=10, type="Ljung-Box") #high p-value, fail to reject null: no correl
acf(holdfit$residuals)# suggests no serial correlation among residuals
pacf(holdfit$residuals)# suggests no serial conditional correlation among residuals
plot(holdfit$residuals) # looks good

#generate forecasts
holdpreds <- 6="" also="" care="" holdfit="" inverting="" n.ahead="6)" note="" of="" p="" predict="" r="" step="" takes="" that="" the="" transformation="">holdpreds

#--------------------------------------------------------------------------------------------

# Seasonality

HOUSTNSA <- auto.assign="F)</p" getsymbols="" src="FRED">HOUSTNSA <- houstnsa="" p="">plot(HOUSTNSA, ylab="Housing Starts (thousands of units)")

# We notice that there are trends not stationary
# We also notice seasonality
acf(HOUSTNSA) # confirms seasonality, peaks every 12 lags

plot(diff(HOUSTNSA)[-1]) #this removes the trend, but seasonality remains acf(diff(HOUSTNSA)[-1])
plot(diff(HOUSTNSA, 12,1)) #this removes seasonality but trends remain. (seasonally adjusted)
plot(diff(diff(HOUSTNSA, 12,1))) # this removes both and should be good. ------------(A)
# we can fit ARMA to it, but in practice this turns out to not yield models with sufficient flexibility

# Multiplicative ARIMA models for accomodating seasonality (preferred)
# ARIMA((p,d,q)*(P,D,Q)s) where s is the period of seasonality (12 for a year, for example)
# Think of it as fitting ARIMA(p,d,q) on non-seasonal component and ARIMA(P,D,Q) on seasonal component of ts.
# This multiplicative ARIMA model is fitted upon the ts obtained in (A) above
#(In R you needn't do the differencings first)

# Fitting a multiplicative ARIMA model

# METHOD 1: using arima() fitting a ARIMA((1,1,1)*(1,1,0)12) to the housing data again.
holdfit1 <- arima="" order="c(1,1,1)," period="12))</p" seasonal="list(order=c(1,1,0),">
# METHOD 1: using auto.arima()

# s is typically already known, or estimated exploring the ACF, not something auto.arima() finds.
# ts object fed to auto.arima should already be in a form that specifies the period of the series, i.e. s

HOUSts<-ts an="" creates="" frequency="12)" function="" not="" object="" p="" r="" series="" the="" time="" ts="" xts=""># d, D should always be specified yourself based on inspection of series, although auto.arima can search for it
holdfit2<-auto .arima="" d="1,D=1,approx=F)</p" ts="">
# We got a better (smaller) aic than that in the totally self-specified arima() model holdfit1.

# Use it for forecasting now.
holdpreds <- holdfit2="" n.ahead="24)</p" predict="">plot(HOUSts, xlab="Year", xlim=c(0,22))
lines(holdpreds$pred,col=2)
lines(holdpreds$pred + (1.96*holdpreds$se), col=2, lty=2)
lines(holdpreds$pred - (1.96*holdpreds$se), col=2, lty=2)

#---------------------------------------------------------------------------------------------

# Heteroskedasticity

#---------------------------------------------------------------------------------------------

# ARFIMA models

#---------------------------------------------------------------------------------------------

# Revisiting Regression

# One key assumption in regression was that errors iid normal, i.e, uncorrelated with const variance
holdlm=lm(HOUSts~sqrt(HOUSts)) # just some random regression

# In a time series regression test whether they are really uncorrelated makes sense to see ACF of residuals
acf(holdlm$residuals, lag.max=10)

# Another way is the durbin-watson hypothesis test, made specifically to test serial correl in a linear model
# H0: there is no correlation. If small p-value, reject the null: there is correlation.
library(car)
durbinWatsonTest(holdlm,10)

# DW suffers from the same multiple testing problem as acf. Can use Ljung-Box on holdlm$residuals, if needed.
# In our example we can see that errors ARE correlated.

# How to deal with this problem?
# 1. might want to log transform the predictor or response variable or both in the linear model
# 2. If significant serial correlation, fit an ARIMA(p,d,q) process in place of model error, instead of WN.

# Example. Response var is 3-year tsy rate, predictor var is 1-yr tsy-rate.
OneYearTreas <- auto.assign="F)</p" getsymbols="" src="FRED">ThreeYearTreas<-getsymbols auto.assign="F)</p" src="FRED">
holdfit1 <- hreeyeartreas="" lm="" oneyeartreas="" p="">holdfit2 <- arima="" hreeyeartreas="" order="c(1,1,1)," xreg="OneYearTreas)</p">holdfit3 <- as.numeric="" auto.arima="" d="1)</p" hreeyeartreas="" xreg="as.numeric(OneYearTreas),">acf(holdfit3$residuals)

# This solves the problem of serial correlation. What about non-constant variance?
# One way is model the errors as GARCH. Unfortunately not straightforward like with ARIMA above. Follow steps:
#1. Fit a regular lm model.
#2. Fit a time series model to the residuals  from model fit above.
#3. Refit the model using weighted least squares, weights being reciprocals of conditional variances from step2
#   Conditional variance from garchFit() are found as: holdgarchfit@sigma.t
#   weights = 1/(holdfit@sigma.t^2)
# Do it yourself.

#----------------------------------------------------------------------------------------------

Sunday, February 1, 2015

R for Basic Statistics - 1

R for Simulation, Sampling and Inference


Simulation


outcomes = c("heads", "tails")
sim_fair_coin = sample(outcomes, prob=c(0.4,0.6) , size=100, replace=TRUE)
barplot(table(sim_fair_coin))


Another use of sample() is to sample n elements randomly from a vector v.
sample(v, n)


To create a vector of size 15 all of whose value are identical:
vector1=rep(0,15)
vector2=rep(NA, 15). NA is often used as placeholder for missing data in R.


For loop in R
for (i in 1:50) {}


Compare to Python (later)


Divide a plot into multiple plots using (following example divides plotting area into three rows and 1 column):


par(mfrow = c(3, 1))


Set the scale of any graph using xlim and ylim arguments.


range() when applied on vector gives a vector of length 2 showing the smallest and largest element of that vector. It is useful to set the scale of graphs using xlim and ylim. For example:


# Define the limits for the x-axis:
xlimits = range(sample_means10)
# Draw the histogram:
hist(sample_means10, breaks=20, xlim=xlimits)


A complete confidence-interval example (comment code later):


# Initialize 'samp_mean', 'samp_sd' and 'n':
samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60


for (i in 1:50) {
   samp = sample(population, n)
   samp_mean[i] = mean(samp)
   samp_sd[i] = sd(samp)
}


# Calculate the interval bounds here:
lower=samp_mean - 1.96*samp_sd/sqrt(n)
upper=samp_mean + 1.96*samp_sd/sqrt(n)


# Plotting the confidence intervals:
pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)


Please note below in the output of the program above, a great use case for plot_ci chart.

Friday, January 30, 2015

Basic R revision - Part 4

Something I should have covered in part 1

Logical Operators in R: & and |

Lists in R

A list in R, much like a Python list, allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. To construct a list simply wrap list():
list(var1, var2, var3)

Naming the elements of a list

my_list = list(VECTOR=my_vector,MATRIX=my_matrix,DATAFRAME=my_df)
Now VECTOR, MATRIX and DATAFRAME are names of the first, second and third elements of the list.

Indexing in Lists
[[ ]] is used instead of [ ], for example mylist[[3]] gives third element of the list mylist.

To append an element to a list use c(): mylist = c (mylist, newelement)


Reading data from web
Use the read.table() function to read data from a url and then assign it to a dataset present:
present  = read.table("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/present.txt")

If the table is already in the form of an R dataset, then just load it using:
A dataframe called cdc is now in your R workspace.

Plotting

To plot frequency tables use barplot(). This frequency chart function is suitable for categorical variable, after it has been converted to a frequency table by using table(categoricalVarVectorname) or summary(factor(categoricalVarVectorname)).

To plot frequency chart for continuous variables, use histogram (it buckets into ranges, and then draws bars for each range): hist(vectorname, breaks=50)

To plot xy plane use plot(x,y)

The table() command is used to create a frequency table for a categorical variable. We can also input more than one categorical variables as input arguments to the table() command. It can give you, for instance, a frequency distribution in 2 variables, such as this:
              nonsmoker   smoker
 excellent 2879 1778
 very good 3758 3214
 good       2782 2893
 fair        911 1108
 poor        229   448
mosaicplot() is a good plot to display this data

boxplot() can be used on a vector to get graph showing the various quartiles.A table of values of the various quartiles can be generated by using summary() on the vector.

Another good use is boxplot(aContinuousVarOfDataset ~ aCategoricalVarOfDataset)
This shows a graph of quartiles of continuous var for each value of categorical variable.

Here the continuous var vector can be an existing continuous variable of dataset, of course, but also a constructed vector from various continuous variables of the dataset.

Thursday, January 29, 2015

Basic R revision - Part 3

Dataframes : Datasets in R


When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset. To broadly view the structure of a dataset use head() to look at the header columns and first few observations. (tail() shows the last few). Another method is to use str() which gives you the number of observations (rows), number of features or columns for each observation, a list of variable or column names and their datatypes, with their first few observations. Other useful functions are names() which gives column names, and dim() which gives a vector of two elements - nrows and ncols of the dataframe.


Creating a data frame


Normally you need your data in a very customized form before you can run any statistical algorithms on them. You can either perform that customization at the database level, that is, by querying in SQL to generate your output of the most suitably customized form, or, you can import the raw data onto your R (or Python) environment as it is, and use R (or Python) to create custom dataframes afterwards. (I do not currently have an opinion on what is the best practice - mostly common sense dictates what to do - but I will add to this post if and when I do have any nuggets of wisdom on this).* Here we will learn how to use R to create custom dataframes.


added 21Oct2015
I now feel that it is generally better to do the latter - that is - don't try to work with the SQL query too much to get customized data output - there are much better tools to deal with customization at the language level. R has data.tables and dplyr, for example. For an example, suppose there are two cols a and b and you only want to output the part of the whole dataset where a>5. Easily do-able in SQL. But suppose you only want to output the part of the whole dataset where a+b>5 - not doable as far as I know in SQL. But at R level you can do it.

We can use the data.frame() function to wrap around all the vectors we want to combine in the dataframe. All the vectors, of course, should have the same length (equal number of observations). You can think of this function as similar to cbind, except it deals with vectors of potentially different datatypes. It's not really that similar to cbind actually, as each argument to to data.frame() should be a vector, whereas arguments to cbind can be some vectors and some matrices too!


planets     = c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type        = c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gas giant","Gas giant","Gas giant","Gas giant")
diameter    = c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883);
rotation    = c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);
rings       = c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);


# Create the data frame:
planets_df  = data.frame(planets,type,diameter,rotation,rings)


Indexing and subsetting in dataframes works similar to matrices.


To get diameters of the first 3 planets in planets_df, we can use any of the following:
fpd1 = planets_df[1:3,"diameter"]
fpd2 = planets_df[1:3,3]
fpd3 = planets_df$diameter[1:3]


Example to get only those observations of dataset where planet has rings: planets_df[planets_df$rings,]


For an alternate way to do the same thing, use subset(): subset(planets_df, subset=(planets_df$rings == TRUE))
Use this way to get observations of dataset where planets smaller than earth: subset(planets_df, subset=(planets_df$diameter<1 span="">


To add a new feature or column or attribute to the dataframe planet_df, let's say sun_closeness_rank, simply define it while referring to it as an attribute of that dataframe:
planets_df$sun_closeness_rank = c(1,2,3,4,5,6,7,8)


Sorting a vector in R


The order() function, when applied to a vector, returns a vector with the rank of each element.
For example, order(c(6,3,8)) = {2, 1, 3} vector. Now this vector can be given as index to the original vector, to get a sorted version of original vector.
a = c(100, 9, 101)
order(a)
[1] 2 1 3
a[order(a)]
[1]   9 100 101


Sorting a dataframe by a particular column


For example, if we want to sort planets_df by diameter descending and create largest_first_df:
positions = order(-planets_df$diameter)
largest_first_df =planets_df[positions , ]

Basic R revision - Part 2

Factors

Factors are used to store categorical variables, where categorical variables are those whose value can only be one amongst a well-defined, discrete set of values. For example factor_gender is a factor that stores variables that can contain elements: "male" and "female".

To construct a factor variable out of a vector of values, just wrap the vector using factor(). For example:

> gender_vector = c("Male", "Female", "Female", "Male", "Male")
> factor_gender_vector = factor(gender_vector)
> factor_gender_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Categorical variables are of two types: nominal and ordinal.
factor_gender would be nominal as there is no grading from lower to higher between male and female unless you are a sexist asshole.
factor_bondratings would be ordinal as there is a natural grading, where we know :



AAA > AA > A > BBB > BB > CCC > CC > C > D

In R, the assumption in for the categorical nominal variable to be nominal. If you wish to specify ordinal, use the order and levels keywords:



temperature_vector = c("High","Low","High","Low","Medium")
factor_temperature_vector = factor(temperature_vector, order=TRUE, levels=c("Low","Medium","High"))
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

Renaming the elements of a factor variable

Use the levels() function to do this.



> survey_vector = c("M", "F", "F", "M", "M")
> factor_survey_vector = factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M

> levels(factor_survey_vector) = c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Note that it is important to follow the correct order while naming. Using
levels(factor_survey_vector) = c("Female", "Male")
would have been incorrect, since I had run the code earlier to see the unnamed output being "Levels: F M"

Using summary()

summary() is a general R function but it's very useful with factors. For example:



> summary(factor_survey_vector)
Female   Male
     2      3

If a factor is nominal, then the comparison operator > becomes invalid. See the following (continuation) code for my favorite proof for the equality of sexes:



> # Battle of the sexes:
> # Male
> factor_survey_vector[1]
[1] Male
Levels: Female Male
> # Female
> factor_survey_vector[2]
[1] Female
Levels: Female Male
> # Male larger than female?
> factor_survey_vector[1] > factor_survey_vector[2]
'>' not meaningful for factors

Comparison operators meaningful for ordinal categorical variables. See:



> speed_vector = c("Fast", "Slow", "Slow", "Fast", "Ultra-fast")
> # Add your code below
> factor_speed_vector = factor(speed_vector, order = TRUE, levels = c("Slow", "Fast", "Ultra-fast"))
> # Print
> factor_speed_vector
[1] Fast       Slow       Slow       Fast       Ultra-fast
Levels: Slow < Fast < Ultra-fast
> # R prints automagically in the right order
> summary(factor_speed_vector)
      Slow       Fast Ultra-fast
         2          2          1

> compare_them = factor_speed_vector[2] > factor_speed_vector[5]
> # Is data analyst 2 faster than data analyst 5?
> compare_them
[1] FALSE

So Analyst 2 is not faster than Analyst 5.

Wednesday, January 28, 2015

Basic R revision - Part 1



A random useful function

To get the data type of of variable in R, use the function class().

my_numeric = 42
my_character = "forty-two"
my_logical = FALSE

> class(my_numeric)
[1] "numeric"
> class(my_character)
[1] "character"
> class(my_logical)
[1] "logical"

Always remember: Python/C++ vector indices start with 0, R vector indices start with 1
 Subset a vector in R, use vectorname[c(starting index: ending index)]
If disparate (non-adjacent elements): vectorname[c(index1, index2, index3 ..)]

Compare to Python:
Subset a vector in Python, use vectorname[starting index: ending index + 1]
Note that index numbers will be defined as per Python convention
Suppose from a vector v = ['P','O','K','E','R'], we need to output ['O','K','E']
In Python, use v[1:4]
In R, use v[c(2:4)] or just v[2:4]

Also to get all elements in Python a way is to do Mymatrix[3, : ] (gets row 3)
To do the same exercise in R the way to do is Mymatrix[3,  ]

Comparison Operators in R vs VBA and Python

Comparison Operators in R and C++
<, >, >=, ==, !=

Comparison operators in VBA
<, >, <=, =, <>
Python supports both != and
<>

For equality, Python supports == (like R and C++)

In R you can use comparison operator between a vector and a number and get a binary vector which compares each element of the vector to that number.
(Not sure if you can do that in Python. Will check later and update.) Also, you can use that binary vector as an index to get a subset of the original vector.

Matrix in R: To construct a matrix in R you need to add a matrix() wrapper to a vector. e.g. matrix(c(1:9), byrow=TRUE, nrow=3)

Naming elements of a vector and rows/cols of a matrix

Naming can often be useful later. Syntax is simple:
For vector:
vectorv = c(2,3,4)
names(vectorv)=c("a","b","c")
Now, vectorv[“a”]=2
Now, vectorv[“a”]=2

For matrix:
new_hope = c( 460.998007, 314.4)
empire_strikes = c(290.475067, 247.900000)
return_jedi = c(309.306177,165.8)
# Construct the matrix
star_wars_matrix = matrix(c(new_hope,empire_strikes,return_jedi), nrow=3, byrow=TRUE)
# Add your code here such that rows and columns of star_wars_matrix have a name!
rownames(star_wars_matrix) = c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)= c("US", "non-US")

Another way can be to include these in the matrix definition itself:
movie_names = c("A New Hope","The Empire Strikes Back","Return of the Jedi")
col_titles = c("US","non-US")
star_wars_matrix = matrix(box_office_all, nrow=3, byrow=TRUE, dimnames=list(movie_names,col_titles))
Summing all elements of entire rows or columns, or summing all elements of any vector

To do row sums or column sums in R for a matrix just use rowSums(matrixname) or colSums(matrixname). Note that it is important to capitalize S in rowSums or colSums. Another way can be to reference the needed vector by using something like Mymatrix[3, ] and then wrapping sum() around it.

Combining/Appending functions

cbind(vectorname) can append a vector to an existing matrix as a new column, provided vector's length is same as number of matrix rows. Similarly, rbind. Note the similarity to c() wrapper to construct any vector.

Arithmetic Operators

+,-,*,/ work in an elementwise way for both vectors and matrices
matrix1 * matrix2 does elementwise multiplication, not matrix multiplication as in Linear Algebra for which we use %*% in R