Thursday, January 29, 2015

Basic R revision - Part 3

Dataframes: Datasets in R


When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of their structure and main elements. It is therefore often useful to look at only a small part of the dataset. To get a quick view of a dataset, use head() to see the column names and the first few observations (tail() shows the last few). Another option is str(), which gives the number of observations (rows), the number of features (columns), and a list of the column names with their datatypes and first few values. Other useful functions are names(), which gives the column names, and dim(), which gives a vector of two elements: the number of rows and the number of columns of the dataframe.
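Here is a quick sketch of these inspection functions. It uses mtcars, a small dataset that ships with base R, purely as a stand-in; the name df is my own choice for illustration.

```r
# mtcars comes with base R (datasets package); it stands in for "your" dataset here
df = mtcars

head(df, 3)   # column names plus the first 3 observations
tail(df, 3)   # the last 3 observations
str(df)       # number of rows and columns, each column's datatype and first few values
names(df)     # column names only
dim(df)       # vector of two elements: number of rows, number of columns
```

On a really large dataset, str() and dim() are usually the cheapest first look, since they never print more than a few values per column.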


Creating a data frame


Normally you need your data in a very customized form before you can run any statistical algorithms on it. You can either perform that customization at the database level, that is, by writing an SQL query whose output is already in the most suitable form, or you can import the raw data into your R (or Python) environment as it is, and use R (or Python) to create custom dataframes afterwards. (I do not currently have an opinion on which is the best practice - mostly common sense dictates what to do - but I will add to this post if and when I do have any nuggets of wisdom on this).* Here we will learn how to use R to create custom dataframes.


added 21Oct2015
I now feel that it is generally better to do the latter - that is - don't try to work with the SQL query too much to get customized data output - there are much better tools to deal with customization at the language level. R has data.table and dplyr, for example. For an example, suppose there are two cols a and b and you only want to output the part of the whole dataset where a>5. Easily do-able in SQL. Now suppose you only want the part of the whole dataset where a+b>5. That can also be written in a WHERE clause, but at R level it is just as easy, and it stays easy as the conditions and derived columns pile up.
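As a minimal sketch of the a+b>5 case in base R (the data frame d and its values are made up for illustration):

```r
# Hypothetical toy data with two numeric columns, a and b
d = data.frame(a = c(1, 4, 6, 2),
               b = c(1, 3, 0, 9))

# Keep only the rows where the sum of the two columns exceeds 5
big_sum = d[d$a + d$b > 5, ]
big_sum
```

The condition inside the square brackets is just a logical vector, so any expression built from the columns works the same way.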

We can use the data.frame() function to combine all the vectors we want in the dataframe. All the vectors, of course, should have the same length (equal number of observations). You can think of this function as similar to cbind(), except that it copes with vectors of different datatypes. The similarity only goes so far, though: cbind() on plain vectors builds a matrix and coerces everything to a single datatype, whereas data.frame() returns a data frame in which each column keeps its own datatype.
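A small sketch of that difference, using two short made-up vectors (the names m and d are only for illustration):

```r
name     = c("Earth", "Mars")
diameter = c(1, 0.532)

m = cbind(name, diameter)        # a matrix: everything is coerced to character
d = data.frame(name, diameter)   # a data frame: each column keeps its own type

class(m[, "diameter"])   # "character"
class(d$diameter)        # "numeric"
```

This is why cbind() is fine for building numeric matrices but is usually the wrong tool for mixed data.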


planets     = c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune")
type        = c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gas giant","Gas giant","Gas giant","Gas giant")
diameter    = c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883)
rotation    = c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67)
rings       = c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE)


# Create the data frame:
planets_df  = data.frame(planets,type,diameter,rotation,rings)


Indexing and subsetting in dataframes work much as they do in matrices.


To get diameters of the first 3 planets in planets_df, we can use any of the following:
fpd1 = planets_df[1:3,"diameter"]
fpd2 = planets_df[1:3,3]
fpd3 = planets_df$diameter[1:3]
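One thing worth knowing about the matrix-style indexing: picking a single column gives back a plain vector, while picking several columns gives back a dataframe. A rough sketch, on a cut-down copy of the planets data (df and the result names are my own, just for illustration):

```r
# A smaller stand-in for planets_df so this snippet runs on its own
df = data.frame(planets  = c("Mercury", "Venus", "Earth", "Mars"),
                diameter = c(0.382, 0.949, 1, 0.532),
                rings    = c(FALSE, FALSE, FALSE, FALSE))

two_cols    = df[1:3, c("planets", "diameter")]     # still a data frame (3 rows, 2 columns)
one_col_vec = df[1:3, "diameter"]                   # a plain numeric vector
one_col_df  = df[1:3, "diameter", drop = FALSE]     # drop = FALSE keeps it a data frame
```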


To get only those observations of the dataset where the planet has rings:
planets_df[planets_df$rings,]


An alternative way to do the same thing is to use subset():
subset(planets_df, subset=(planets_df$rings == TRUE))
In the same way, to get the observations for planets smaller than Earth:
subset(planets_df, subset=(planets_df$diameter < 1))
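subset() also lets you refer to columns by their bare names, and its select argument picks columns at the same time. A short sketch on a cut-down copy of the planets data (df, small and small_names are names I made up for this example):

```r
# A smaller stand-in for planets_df so this snippet runs on its own
df = data.frame(planets  = c("Mercury", "Venus", "Earth", "Mars", "Jupiter"),
                diameter = c(0.382, 0.949, 1, 0.532, 11.209),
                rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE))

# Column names can be used directly inside subset(), no df$ prefix needed
small = subset(df, subset = diameter < 1)

# Combine conditions with & and keep only some columns with select
small_names = subset(df, subset = diameter < 1 & !rings,
                     select = c(planets, diameter))
```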


To add a new feature (column or attribute) to the dataframe planets_df, let's say sun_closeness_rank, simply assign to it as if it were already a column of that dataframe:
planets_df$sun_closeness_rank = c(1,2,3,4,5,6,7,8)


Sorting a vector in R


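The order() function, when applied to a vector, returns the positions of its elements in sorted order: the first value is the index of the smallest element, the second is the index of the next smallest, and so on. (This is not the rank of each element, which is what rank() gives, although the two happen to coincide in the examples here.)
For example, order(c(6,3,8)) returns the vector 2 1 3. This vector can be given as an index to the original vector to get a sorted version of it.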
a = c(100, 9, 101)
order(a)
[1] 2 1 3
a[order(a)]
[1]   9 100 101
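order() and rank() are easy to mix up, so here is a small sketch with a vector where they give different answers (the values are arbitrary):

```r
a = c(30, 10, 20)

order(a)   # 2 3 1 : positions of the smallest, middle and largest element
rank(a)    # 3 1 2 : where each element would land after sorting

a[order(a)]                      # 10 20 30, same result as sort(a)
a[order(a, decreasing = TRUE)]   # 30 20 10
```

Only the order() result is meant to be used as an index; indexing with rank(a) would not sort the vector in general.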


Sorting a dataframe by a particular column


For example, to sort planets_df by diameter in descending order and store the result in largest_first_df:
positions = order(-planets_df$diameter)
largest_first_df = planets_df[positions, ]
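order() also accepts several vectors, with the later ones breaking ties in the earlier ones. A sketch that sorts by type first and then by diameter descending, on a cut-down copy of the planets data (df and by_type_df are names I made up for this example):

```r
# A smaller stand-in for planets_df so this snippet runs on its own
df = data.frame(planets  = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn"),
                type     = c(rep("Terrestrial planet", 4), rep("Gas giant", 2)),
                diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449))

# Sort by type (alphabetical), then by diameter, largest first, within each type
positions  = order(df$type, -df$diameter)
by_type_df = df[positions, ]
by_type_df
```

The minus-sign trick only works for numeric columns; for a single text column, order(x, decreasing = TRUE) does the same job.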
