Showing posts with label R. Show all posts
Showing posts with label R. Show all posts

Sunday, February 1, 2015

R for Basic Statistics - 1

R for Simulation, Sampling and Inference


Simulation


outcomes = c("heads", "tails")
sim_fair_coin = sample(outcomes, prob=c(0.4,0.6) , size=100, replace=TRUE)
barplot(table(sim_fair_coin))


Another use of sample() is to sample n elements randomly from a vector v.
sample(v, n)


To create a vector of size 15 all of whose value are identical:
vector1=rep(0,15)
vector2=rep(NA, 15). NA is often used as placeholder for missing data in R.


For loop in R
for (i in 1:50) {}


Compare to Python (later)


Divide a plot into multiple plots using (following example divides plotting area into three rows and 1 column):


par(mfrow = c(3, 1))


Set the scale of any graph using xlim and ylim arguments.


range() when applied on vector gives a vector of length 2 showing the smallest and largest element of that vector. It is useful to set the scale of graphs using xlim and ylim. For example:


# Define the limits for the x-axis:
xlimits = range(sample_means10)
# Draw the histogram:
hist(sample_means10, breaks=20, xlim=xlimits)


A complete confidence-interval example (comment code later):


# Initialize 'samp_mean', 'samp_sd' and 'n':
samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60


for (i in 1:50) {
   samp = sample(population, n)
   samp_mean[i] = mean(samp)
   samp_sd[i] = sd(samp)
}


# Calculate the interval bounds here:
lower=samp_mean - 1.96*samp_sd/sqrt(n)
upper=samp_mean + 1.96*samp_sd/sqrt(n)


# Plotting the confidence intervals:
pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)


Please note below in the output of the program above, a great use case for plot_ci chart.

Friday, January 30, 2015

Basic R revision - Part 4

Something I should have covered in part 1

Logical Operators in R: & and |

Lists in R

A list in R, much like a Python list, allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. To construct a list simply wrap list():
list(var1, var2, var3)

Naming the elements of a list

my_list = list(VECTOR=my_vector,MATRIX=my_matrix,DATAFRAME=my_df)
Now VECTOR, MATRIX and DATAFRAME are names of the first, second and third elements of the list.

Indexing in Lists
[[ ]] is used instead of [ ], for example mylist[[3]] gives third element of the list mylist.

To append an element to a list use c(): mylist = c (mylist, newelement)


Reading data from web
Use the read.table() function to read data from a url and then assign it to a dataset present:
present  = read.table("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/present.txt")

If the table is already in the form of an R dataset, then just load it using:
A dataframe called cdc is now in your R workspace.

Plotting

To plot frequency tables use barplot(). This frequency chart function is suitable for categorical variable, after it has been converted to a frequency table by using table(categoricalVarVectorname) or summary(factor(categoricalVarVectorname)).

To plot frequency chart for continuous variables, use histogram (it buckets into ranges, and then draws bars for each range): hist(vectorname, breaks=50)

To plot xy plane use plot(x,y)

The table() command is used to create a frequency table for a categorical variable. We can also input more than one categorical variables as input arguments to the table() command. It can give you, for instance, a frequency distribution in 2 variables, such as this:
              nonsmoker   smoker
 excellent 2879 1778
 very good 3758 3214
 good       2782 2893
 fair        911 1108
 poor        229   448
mosaicplot() is a good plot to display this data

boxplot() can be used on a vector to get graph showing the various quartiles.A table of values of the various quartiles can be generated by using summary() on the vector.

Another good use is boxplot(aContinuousVarOfDataset ~ aCategoricalVarOfDataset)
This shows a graph of quartiles of continuous var for each value of categorical variable.

Here the continuous var vector can be an existing continuous variable of dataset, of course, but also a constructed vector from various continuous variables of the dataset.

Thursday, January 29, 2015

Basic R revision - Part 3

Dataframes : Datasets in R


When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset. To broadly view the structure of a dataset use head() to look at the header columns and first few observations. (tail() shows the last few). Another method is to use str() which gives you the number of observations (rows), number of features or columns for each observation, a list of variable or column names and their datatypes, with their first few observations. Other useful functions are names() which gives column names, and dim() which gives a vector of two elements - nrows and ncols of the dataframe.


Creating a data frame


Normally you need your data in a very customized form before you can run any statistical algorithms on them. You can either perform that customization at the database level, that is, by querying in SQL to generate your output of the most suitably customized form, or, you can import the raw data onto your R (or Python) environment as it is, and use R (or Python) to create custom dataframes afterwards. (I do not currently have an opinion on what is the best practice - mostly common sense dictates what to do - but I will add to this post if and when I do have any nuggets of wisdom on this).* Here we will learn how to use R to create custom dataframes.


added 21Oct2015
I now feel that it is generally better to do the latter - that is - don't try to work with the SQL query too much to get customized data output - there are much better tools to deal with customization at the language level. R has data.tables and dplyr, for example. For an example, suppose there are two cols a and b and you only want to output the part of the whole dataset where a>5. Easily do-able in SQL. But suppose you only want to output the part of the whole dataset where a+b>5 - not doable as far as I know in SQL. But at R level you can do it.

We can use the data.frame() function to wrap around all the vectors we want to combine in the dataframe. All the vectors, of course, should have the same length (equal number of observations). You can think of this function as similar to cbind, except it deals with vectors of potentially different datatypes. It's not really that similar to cbind actually, as each argument to to data.frame() should be a vector, whereas arguments to cbind can be some vectors and some matrices too!


planets     = c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type        = c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gas giant","Gas giant","Gas giant","Gas giant")
diameter    = c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883);
rotation    = c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);
rings       = c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);


# Create the data frame:
planets_df  = data.frame(planets,type,diameter,rotation,rings)


Indexing and subsetting in dataframes works similar to matrices.


To get diameters of the first 3 planets in planets_df, we can use any of the following:
fpd1 = planets_df[1:3,"diameter"]
fpd2 = planets_df[1:3,3]
fpd3 = planets_df$diameter[1:3]


Example to get only those observations of dataset where planet has rings: planets_df[planets_df$rings,]


For an alternate way to do the same thing, use subset(): subset(planets_df, subset=(planets_df$rings == TRUE))
Use this way to get observations of dataset where planets smaller than earth: subset(planets_df, subset=(planets_df$diameter<1 span="">


To add a new feature or column or attribute to the dataframe planet_df, let's say sun_closeness_rank, simply define it while referring to it as an attribute of that dataframe:
planets_df$sun_closeness_rank = c(1,2,3,4,5,6,7,8)


Sorting a vector in R


The order() function, when applied to a vector, returns a vector with the rank of each element.
For example, order(c(6,3,8)) = {2, 1, 3} vector. Now this vector can be given as index to the original vector, to get a sorted version of original vector.
a = c(100, 9, 101)
order(a)
[1] 2 1 3
a[order(a)]
[1]   9 100 101


Sorting a dataframe by a particular column


For example, if we want to sort planets_df by diameter descending and create largest_first_df:
positions = order(-planets_df$diameter)
largest_first_df =planets_df[positions , ]

Basic R revision - Part 2

Factors

Factors are used to store categorical variables, where categorical variables are those whose value can only be one amongst a well-defined, discrete set of values. For example factor_gender is a factor that stores variables that can contain elements: "male" and "female".

To construct a factor variable out of a vector of values, just wrap the vector using factor(). For example:

> gender_vector = c("Male", "Female", "Female", "Male", "Male")
> factor_gender_vector = factor(gender_vector)
> factor_gender_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Categorical variables are of two types: nominal and ordinal.
factor_gender would be nominal as there is no grading from lower to higher between male and female unless you are a sexist asshole.
factor_bondratings would be ordinal as there is a natural grading, where we know :



AAA > AA > A > BBB > BB > CCC > CC > C > D

In R, the assumption in for the categorical nominal variable to be nominal. If you wish to specify ordinal, use the order and levels keywords:



temperature_vector = c("High","Low","High","Low","Medium")
factor_temperature_vector = factor(temperature_vector, order=TRUE, levels=c("Low","Medium","High"))
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

Renaming the elements of a factor variable

Use the levels() function to do this.



> survey_vector = c("M", "F", "F", "M", "M")
> factor_survey_vector = factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M

> levels(factor_survey_vector) = c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Note that it is important to follow the correct order while naming. Using
levels(factor_survey_vector) = c("Female", "Male")
would have been incorrect, since I had run the code earlier to see the unnamed output being "Levels: F M"

Using summary()

summary() is a general R function but it's very useful with factors. For example:



> summary(factor_survey_vector)
Female   Male
     2      3

If a factor is nominal, then the comparison operator > becomes invalid. See the following (continuation) code for my favorite proof for the equality of sexes:



> # Battle of the sexes:
> # Male
> factor_survey_vector[1]
[1] Male
Levels: Female Male
> # Female
> factor_survey_vector[2]
[1] Female
Levels: Female Male
> # Male larger than female?
> factor_survey_vector[1] > factor_survey_vector[2]
'>' not meaningful for factors

Comparison operators meaningful for ordinal categorical variables. See:



> speed_vector = c("Fast", "Slow", "Slow", "Fast", "Ultra-fast")
> # Add your code below
> factor_speed_vector = factor(speed_vector, order = TRUE, levels = c("Slow", "Fast", "Ultra-fast"))
> # Print
> factor_speed_vector
[1] Fast       Slow       Slow       Fast       Ultra-fast
Levels: Slow < Fast < Ultra-fast
> # R prints automagically in the right order
> summary(factor_speed_vector)
      Slow       Fast Ultra-fast
         2          2          1

> compare_them = factor_speed_vector[2] > factor_speed_vector[5]
> # Is data analyst 2 faster than data analyst 5?
> compare_them
[1] FALSE

So Analyst 2 is not faster than Analyst 5.

Wednesday, January 28, 2015

Basic R revision - Part 1



A random useful function

To get the data type of of variable in R, use the function class().

my_numeric = 42
my_character = "forty-two"
my_logical = FALSE

> class(my_numeric)
[1] "numeric"
> class(my_character)
[1] "character"
> class(my_logical)
[1] "logical"

Always remember: Python/C++ vector indices start with 0, R vector indices start with 1
 Subset a vector in R, use vectorname[c(starting index: ending index)]
If disparate (non-adjacent elements): vectorname[c(index1, index2, index3 ..)]

Compare to Python:
Subset a vector in Python, use vectorname[starting index: ending index + 1]
Note that index numbers will be defined as per Python convention
Suppose from a vector v = ['P','O','K','E','R'], we need to output ['O','K','E']
In Python, use v[1:4]
In R, use v[c(2:4)] or just v[2:4]

Also to get all elements in Python a way is to do Mymatrix[3, : ] (gets row 3)
To do the same exercise in R the way to do is Mymatrix[3,  ]

Comparison Operators in R vs VBA and Python

Comparison Operators in R and C++
<, >, >=, ==, !=

Comparison operators in VBA
<, >, <=, =, <>
Python supports both != and
<>

For equality, Python supports == (like R and C++)

In R you can use comparison operator between a vector and a number and get a binary vector which compares each element of the vector to that number.
(Not sure if you can do that in Python. Will check later and update.) Also, you can use that binary vector as an index to get a subset of the original vector.

Matrix in R: To construct a matrix in R you need to add a matrix() wrapper to a vector. e.g. matrix(c(1:9), byrow=TRUE, nrow=3)

Naming elements of a vector and rows/cols of a matrix

Naming can often be useful later. Syntax is simple:
For vector:
vectorv = c(2,3,4)
names(vectorv)=c("a","b","c")
Now, vectorv[“a”]=2
Now, vectorv[“a”]=2

For matrix:
new_hope = c( 460.998007, 314.4)
empire_strikes = c(290.475067, 247.900000)
return_jedi = c(309.306177,165.8)
# Construct the matrix
star_wars_matrix = matrix(c(new_hope,empire_strikes,return_jedi), nrow=3, byrow=TRUE)
# Add your code here such that rows and columns of star_wars_matrix have a name!
rownames(star_wars_matrix) = c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)= c("US", "non-US")

Another way can be to include these in the matrix definition itself:
movie_names = c("A New Hope","The Empire Strikes Back","Return of the Jedi")
col_titles = c("US","non-US")
star_wars_matrix = matrix(box_office_all, nrow=3, byrow=TRUE, dimnames=list(movie_names,col_titles))
Summing all elements of entire rows or columns, or summing all elements of any vector

To do row sums or column sums in R for a matrix just use rowSums(matrixname) or colSums(matrixname). Note that it is important to capitalize S in rowSums or colSums. Another way can be to reference the needed vector by using something like Mymatrix[3, ] and then wrapping sum() around it.

Combining/Appending functions

cbind(vectorname) can append a vector to an existing matrix as a new column, provided vector's length is same as number of matrix rows. Similarly, rbind. Note the similarity to c() wrapper to construct any vector.

Arithmetic Operators

+,-,*,/ work in an elementwise way for both vectors and matrices
matrix1 * matrix2 does elementwise multiplication, not matrix multiplication as in Linear Algebra for which we use %*% in R