Thursday, April 30, 2015

Life as the sum of all embarrassments

All of life, I guess, is a giant exercise in gradually humbling yourself further. I find it amazing, and this is truly different from saying that I find it annoying, that there are any people at all who get more sure of what they know as they grow older.

Personally, I think that the more I learn and get to know about things, the more embarrassingly clear it becomes to me that there is so, so much I do not know. One of the reasons I like working in finance is that it has so much to never come to know. (Yes, that sounds convoluted, but that is what I wanted to say; it's not in error.)

At the start of this year, I made big, elaborate plans for what I will learn in 2015. Two things have changed since then. One, I realize that executing them alone will take not one year but probably three, and two, that there are going to be many diversions on the way where I'd go about learning things I hadn't initially planned to, so in effect it will take more than three, maybe five or six. But the thing that I am not yet accounting for is that some of those diversions will become full-fledged highways in and of themselves, and that they will have their own diversions. And so on ad infinitum. And with this, I'm pretty confident that the whole thing will take not five years but fifty, or however many I have left.

Which makes me wonder - is it worth planning for a period as long as a year, when so much changes during it? I would argue that there is value in planning, even when the failure of the plan is a foregone conclusion. It reminds us of the highway, when we're on our joyrides along diversions. It enhances our realization that the diversions are lovely, but also forces us to examine how lovely they are compared to what's at the end of the highway. And sometimes, to pave new paths by which the diversion would eventually join the highway again.

That, I guess, makes us more creative.




Friday, March 27, 2015

Not normal

For the last three days I've been sleeping the whole time I'm home. After coming back home at 6 PM, I'm asleep at 6:20, waking up only at about 6:40 AM the next day, at which time I quickly get ready for work and reach the office at 7:30. It's a miracle I've been getting to work on time throughout. This isn't normal. It has got to be something in the medication.

Thursday, March 26, 2015

Lull

February and March haven't been great, healthwise. In February, I first contracted a seasonal flu that lasted four days, and just when I thought I had recovered, a week later I was down with a terribly irritating allergy that made breathing unbearable for a week. There is nothing like a blocked nose to make you realize how beautiful your immediately preceding life was. Then March began with a stomach that couldn't cope with the move to India, and finally, a day before I was to travel back to the US, I contracted a viral infection that led to a bacterial infection and tonsillitis that I'm still battling. I'm recovering well and hoping to be fine in the next couple of days and hoping, also, that this will be the end of it for this year. All this has already set me back at least a month in my plans, and in addition to the time lost, what I worry about is the momentum lost. I think that it will now take me a couple of weeks just to get back into the groove of things and get the motivation going. Anyway. Hope that happens very very soon.

Saturday, February 21, 2015

Google Chrome

Never open more than 3 tabs at a time. If you must open another tab, first choose a tab to close.

Friday, February 6, 2015

Blogging platforms and moonlight

I'm contemplating moving the blog to WordPress. Since a big chunk of what I post nowadays tends to be code with all these greater than and less than signs, it totally throws the HTML on Blogger off. It is amazing to me that Blogger won't add this small little feature that lets you insert code seamlessly with something like the code tag that WordPress provides. It is 2015 and it is very surprising to see that Blogger would let so many users slip away but not pluck the low-hanging fruit of adding code tags. I personally do not want to move the blog, but that is out of two entirely irrational reasons: one, that I'm lazy and do not have the time or enthusiasm to set up a blog from scratch, and two, that I've been on here for ten years now and it feels like home and so I don't want to leave.

I've been meaning to add tutorials for some great R libraries like dplyr, ggplot2, and shiny, and also some Python tutorials, but before that I have to fulfill the promise I made to myself of adding tutorials for the relatively uninspiring parts of R that form its basics. And, at any rate, to get started with any of that - it seems like moving to WordPress is the way. A couple of days ago, I tried to embed code snippets in earlier R posts but the results were clumsy and off-putting.

In any case, an interesting thing I learned today (let's assume I won't be adding anything else tonight): according to the leading giant-impact hypothesis, the earth's moon formed when a roughly Mars-sized body struck the early earth, and the hot molten debris thrown off by the collision gradually coalesced into the moon. All that romantic poetry with the moon as the evergreen metaphor seems a little incongruent now, doesn't it?

Monday, February 2, 2015

Some less talked about positives of the American Civil War

This is not a defense of wars, but the American Civil War had several positives coming out of it, and I thought I'd outline them briefly. Of course, it was at a great cost of 600,000 lives, about 2 percent of the entire US population at the time. Everybody knows about the biggest achievement of the war, which was also in large part the main engine behind the war: the abolition of slavery. Here I will mention some of the other, often overlooked positives:

1. For the first time, the recently discovered Bromine was used for healing and cleaning wounds. It improved standards of hygiene in wartime medical assistance in a big way, significantly reducing casualties where soldiers succumbed not to bullet wounds but to the ensuing gangrene.

2. Before the civil war, nursing was a primarily male occupation. With this war, the need for nursing outgrew the supply that men alone could furnish, bringing women in large numbers into the nursing profession. Alongside textile mills of the same period, this paved the way for women getting out of their houses for work in large numbers.

3. Working as a nurse during the war inspired Clara Barton to start the American Red Cross after the war, an organization that has since saved millions of lives.

4. Embalming, the practice of using zinc chloride and arsenic to preserve the dead bodies of soldiers, came into wide use during the civil war, which meant that the bodies could be received by their families, even weeks later, in recognizable and non-decomposed conditions for their last rites.

5. The telegraph got a boost from Lincoln as a method of mapping and devising macro and micro level strategies. This would continue to remain a masterstroke of wartime planning for decades to come, including during the world wars.

6. In many ways, it was the first modern war. Before the civil war, the battle capacity of any regiment was limited by how much artillery they could carry with them. Once they were out of supplies, they could fight no longer. This was the first war to isolate supplies from actual offensives, by employing railroads to continually and strategically supply new arms to the fighting troops.

Sunday, February 1, 2015

R for Basic Statistics - 1

R for Simulation, Sampling and Inference


Simulation


outcomes = c("heads", "tails")
# prob weights the outcomes: heads 40%, tails 60%, so this coin is biased, not fair
sim_unfair_coin = sample(outcomes, prob=c(0.4,0.6), size=100, replace=TRUE)
barplot(table(sim_unfair_coin))
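To see the effect of the prob argument without a plot, a minimal sketch is to compare the observed proportions for a fair and a biased coin. The seed value, the number of flips and the names fair_flips and biased_flips are illustrative assumptions, not part of the original notes.

```r
# Fix the random seed so the simulation is reproducible (the value is arbitrary)
set.seed(42)

outcomes = c("heads", "tails")

# Without prob, every outcome is equally likely: a fair coin
fair_flips = sample(outcomes, size = 10000, replace = TRUE)

# With prob, outcomes are weighted: heads 40% of the time, tails 60%
biased_flips = sample(outcomes, size = 10000, replace = TRUE, prob = c(0.4, 0.6))

# Observed proportions should be close to the specified probabilities
fair_props = prop.table(table(fair_flips))
biased_props = prop.table(table(biased_flips))
print(fair_props)
print(biased_props)
```

With many flips, the observed proportions settle near 0.5/0.5 and 0.4/0.6 respectively; with only 100 flips, as in the notes, they can wander noticeably.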


Another use of sample() is to sample n elements randomly from a vector v.
sample(v, n)


To create a vector of size 15 all of whose values are identical:
vector1=rep(0,15)
vector2=rep(NA, 15)
NA is often used as a placeholder for missing data in R.


For loop in R
for (i in 1:50) {}
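A small sketch of how rep() and a for loop are typically combined: pre-allocate a vector of placeholders, then fill it one element at a time. The names squares and squares_vectorised are made up for illustration.

```r
# Pre-allocate a vector of placeholders, then fill it inside a for loop
squares = rep(NA, 5)
for (i in 1:5) {
  squares[i] = i^2
}
print(squares)

# The same result without a loop, using R's vectorised arithmetic
squares_vectorised = (1:5)^2
print(squares_vectorised)
```

The loop form is the one used in the sampling examples in this post; the vectorised form is usually shorter when each element can be computed independently.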


Compare to Python (later)


Divide the plotting area into multiple plots using par(mfrow = ...). The following example divides the plotting area into three rows and one column:


par(mfrow = c(3, 1))


Set the scale of any graph using xlim and ylim arguments.


range(), when applied to a vector, gives a vector of length 2 showing the smallest and largest elements of that vector. It is useful for setting the scale of graphs using xlim and ylim. For example:


# Define the limits for the x-axis:
xlimits = range(sample_means10)
# Draw the histogram:
hist(sample_means10, breaks=20, xlim=xlimits)


A complete confidence-interval example (comment code later):


# Initialize 'samp_mean', 'samp_sd' and 'n':
samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60


for (i in 1:50) {
   samp = sample(population, n)
   samp_mean[i] = mean(samp)
   samp_sd[i] = sd(samp)
}


# Calculate the interval bounds here:
lower=samp_mean - 1.96*samp_sd/sqrt(n)
upper=samp_mean + 1.96*samp_sd/sqrt(n)


# Plotting the confidence intervals:
pop_mean = mean(population)
plot_ci(lower, upper, pop_mean)


Note that the output of the program above is a great use case for the plot_ci chart.
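The code above relies on a population vector and a plot_ci() helper that are defined elsewhere (plot_ci() is not part of base R). A minimal self-contained sketch, assuming a simulated normal population and counting how many intervals cover the true mean instead of plotting them, might look like this. The population parameters, the seed and the names n_intervals, covered and coverage are arbitrary assumptions.

```r
set.seed(1)

# Hypothetical population: 10,000 draws from a normal distribution
population = rnorm(10000, mean = 50, sd = 10)
pop_mean = mean(population)

n_intervals = 50
n = 60
samp_mean = rep(NA, n_intervals)
samp_sd = rep(NA, n_intervals)

# Draw 50 samples of size 60 and record each sample's mean and sd
for (i in 1:n_intervals) {
  samp = sample(population, n)
  samp_mean[i] = mean(samp)
  samp_sd[i] = sd(samp)
}

# 95% confidence interval bounds based on the normal approximation
lower = samp_mean - 1.96 * samp_sd / sqrt(n)
upper = samp_mean + 1.96 * samp_sd / sqrt(n)

# Fraction of the intervals that contain the true population mean
covered = lower <= pop_mean & pop_mean <= upper
coverage = mean(covered)
print(coverage)
```

Roughly 95% of such intervals are expected to contain the population mean, though with only 50 intervals the observed fraction will vary from run to run.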

Saturday, January 31, 2015

What seems like work to other people that doesn't seem like work to you?

In a recent post, Paul Graham suggests that we ask ourselves this question, and that our answers to it are the things we are well suited for. I totally agree.

I have always wanted to do a lot of things. Ever since I assumed a semblance of adulthood, I have wanted to do many different things, have many different occupations. I don't use the term "wanted to" very loosely. When I say "wanted to" I mean that I have actively tried to better myself at those things for at least a month, with a view to doing them professionally. These have included becoming a poet, a programmer, a short story writer, a singer, an investor, a quant, a film critic, a photographer, a cartoonist. Fundamentally, I am not a subscriber to the notion that one has to become this one specific thing in life. Right from my teenage years, the one thing that the popularly reinforced idea of "you have just one life" has made me a little frantic about is the desire to pack a number of different professions into this one life. To many other people, this same idea is a great motivator pushing them in the opposite direction, of devoting themselves entirely to one great pursuit, and making a mark in it. I admire those people, but for some reason, doing many different things holds more sway over me than being a master of any one thing, and I think this is guided by my regret minimization utility function. Would I regret it more if I couldn't be great at one thing, or would I regret it more if I hadn't tried so many others? For me, it is the latter.

At the same time, I very much believe in the other popular notion that "if something is worth doing, it is worth doing well". And it would be foolish not to concede the oft-proven point that trying to do several things is a big impediment to developing expertise in any one thing. Therefore, for people with dispositions such as mine, it is all that much more important that they choose their targets well, because they are only going to be good at so many things.

Which brings me back to Paul Graham's question. It is a great guide.
My answers: Writing essays, Studying statistics and probability, walking. I wish debugging was also on this list. But I suppose this list will change.
It is useful to create one's own list in answer to this question, and to come back to it periodically: both as a reminder to follow it, and as a reminder to update it.




Friday, January 30, 2015

Basic R revision - Part 4

Something I should have covered in part 1

Logical Operators in R: & and |

Lists in R

A list in R, much like a Python list, allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. To construct a list simply wrap the objects in list():
list(var1, var2, var3)

Naming the elements of a list

my_list = list(VECTOR=my_vector,MATRIX=my_matrix,DATAFRAME=my_df)
Now VECTOR, MATRIX and DATAFRAME are names of the first, second and third elements of the list.

Indexing in Lists
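[[ ]] is used instead of [ ] to extract a single element; for example, my_list[[3]] gives the third element of the list my_list. (my_list[3] would instead give a list of length one containing that element.)

To append an element to a list use c(): my_list = c(my_list, new_element)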
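A short sketch of list indexing and appending. One subtlety worth hedging: when the new element is itself a vector with several values, c() splices each value in as a separate list element unless the vector is wrapped in list() first. The objects below are made up for illustration.

```r
my_vector = 1:3
my_matrix = matrix(1:4, nrow = 2)
my_list = list(VECTOR = my_vector, MATRIX = my_matrix)

# [ ] keeps the list wrapper; [[ ]] and $ extract the element itself
class(my_list[1])        # "list"
class(my_list[[1]])      # "integer"
my_list$VECTOR           # same as my_list[["VECTOR"]]

# Appending with c(): wrap the new element in list() so that a
# multi-element vector is added as ONE element rather than spliced in
new_element = c("a", "b")
spliced = c(my_list, new_element)
appended = c(my_list, list(NEW = new_element))
length(spliced)          # 4
length(appended)         # 3
```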


Reading data from web
Use the read.table() function to read data from a url and then assign it to a dataset present:
present  = read.table("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/present.txt")

If the table is already in the form of an R dataset, then just load it using:
A dataframe called cdc is now in your R workspace.

Plotting

To plot frequency tables use barplot(). This frequency chart function is suitable for a categorical variable, after it has been converted to a frequency table by using table(categoricalVarVectorname) or summary(factor(categoricalVarVectorname)).

To plot a frequency chart for continuous variables, use a histogram (it buckets values into ranges, and then draws bars for each range): hist(vectorname, breaks=50)

To draw a scatter plot of y against x, use plot(x,y)

The table() command is used to create a frequency table for a categorical variable. We can also pass more than one categorical variable as input arguments to table(). It can give you, for instance, a frequency distribution over 2 variables, such as this:
            nonsmoker   smoker
excellent        2879     1778
very good        3758     3214
good             2782     2893
fair              911     1108
poor              229      448
mosaicplot() is a good plot to display this data.

boxplot() can be used on a vector to get a graph showing the various quartiles. A table of values of the various quartiles can be generated by using summary() on the vector.

Another good use is boxplot(aContinuousVarOfDataset ~ aCategoricalVarOfDataset)
This shows a graph of quartiles of continuous var for each value of categorical variable.

Here the continuous var vector can be an existing continuous variable of dataset, of course, but also a constructed vector from various continuous variables of the dataset.
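A minimal sketch of the numbers behind these plots, using a tiny made-up data frame (survey and all of its values are hypothetical): a two-way frequency table of two categorical variables, and the median of a continuous variable for each level of a categorical one.

```r
# Hypothetical survey data, made up purely for illustration
survey = data.frame(
  health = c("good", "good", "poor", "good", "poor", "poor"),
  smoker = c("no", "yes", "yes", "no", "no", "yes"),
  weight = c(60, 72, 85, 65, 70, 90)
)

# Two-way frequency table: rows are health, columns are smoker
counts = table(survey$health, survey$smoker)
print(counts)

# Median of a continuous variable for each level of a categorical one,
# i.e. the centre line of each box in boxplot(weight ~ smoker, data = survey)
median_weight = tapply(survey$weight, survey$smoker, median)
print(median_weight)
```

Passing counts to mosaicplot(), or the formula weight ~ smoker to boxplot(), would draw the corresponding charts.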

Thursday, January 29, 2015

Basic R revision - Part 3

Dataframes : Datasets in R


When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset. To broadly view the structure of a dataset use head() to look at the header columns and first few observations. (tail() shows the last few). Another method is to use str() which gives you the number of observations (rows), number of features or columns for each observation, a list of variable or column names and their datatypes, with their first few observations. Other useful functions are names() which gives column names, and dim() which gives a vector of two elements - nrows and ncols of the dataframe.


Creating a data frame


Normally you need your data in a very customized form before you can run any statistical algorithms on them. You can either perform that customization at the database level, that is, by querying in SQL to generate your output of the most suitably customized form, or, you can import the raw data onto your R (or Python) environment as it is, and use R (or Python) to create custom dataframes afterwards. (I do not currently have an opinion on what is the best practice - mostly common sense dictates what to do - but I will add to this post if and when I do have any nuggets of wisdom on this).* Here we will learn how to use R to create custom dataframes.


added 21Oct2015
I now feel that it is generally better to do the latter - that is - don't try to work with the SQL query too much to get customized data output - there are much better tools to deal with customization at the language level. R has data.table and dplyr, for example. For an example, suppose there are two cols a and b and you only want to output the part of the whole dataset where a>5. Easily doable in SQL. Filtering on a derived quantity such as a+b>5 is also possible in SQL (WHERE a + b > 5), but as the conditions get more involved, they tend to be easier to express at the R level.
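At the R level, a row filter on a derived quantity such as a + b is a one-liner. A minimal sketch with a made-up data frame (df, its values and the name big_sum are illustrative assumptions):

```r
# Hypothetical data with two numeric columns a and b
df = data.frame(a = c(1, 6, 3, 2), b = c(1, 0, 4, 2))

# Rows where a > 5
only_big_a = df[df$a > 5, ]
print(only_big_a)

# Rows where a + b > 5: the condition can be any vectorised expression
big_sum = df[df$a + df$b > 5, ]
print(big_sum)
```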

We can use the data.frame() function to wrap around all the vectors we want to combine in the dataframe. All the vectors, of course, should have the same length (equal number of observations). You can think of this function as similar to cbind, except it deals with vectors of potentially different datatypes. It's not really that similar to cbind actually, as each argument to data.frame() should be a vector, whereas arguments to cbind can be some vectors and some matrices too!


planets     = c("Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune");
type        = c("Terrestrial planet","Terrestrial planet","Terrestrial planet","Terrestrial planet","Gas giant","Gas giant","Gas giant","Gas giant")
diameter    = c(0.382,0.949,1,0.532,11.209,9.449,4.007,3.883);
rotation    = c(58.64,-243.02,1,1.03,0.41,0.43,-0.72,0.67);
rings       = c(FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,TRUE);


# Create the data frame:
planets_df  = data.frame(planets,type,diameter,rotation,rings)


Indexing and subsetting in dataframes works similarly to matrices.


To get diameters of the first 3 planets in planets_df, we can use any of the following:
fpd1 = planets_df[1:3,"diameter"]
fpd2 = planets_df[1:3,3]
fpd3 = planets_df$diameter[1:3]


Example to get only those observations of dataset where planet has rings: planets_df[planets_df$rings,]


For an alternate way to do the same thing, use subset(): subset(planets_df, subset=(planets_df$rings == TRUE))
Use this way to get observations of dataset where planets are smaller than earth: subset(planets_df, subset=(planets_df$diameter < 1))


To add a new feature or column or attribute to the dataframe planets_df, let's say sun_closeness_rank, simply define it while referring to it as an attribute of that dataframe:
planets_df$sun_closeness_rank = c(1,2,3,4,5,6,7,8)


Sorting a vector in R


The order() function, when applied to a vector, returns the indices that would arrange its elements in ascending order (this is not the same as the rank of each element, although the two coincide in the examples here).
For example, order(c(6,3,8)) gives the vector 2 1 3. This vector can then be used as an index into the original vector to get a sorted version of it.
a = c(100, 9, 101)
order(a)
[1] 2 1 3
a[order(a)]
[1]   9 100 101
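order() and rank() happen to agree on the examples above, which can hide the difference between them. A small sketch with a vector where they differ (x, sorted_x and sorted_desc are illustrative names):

```r
x = c(30, 10, 20)

# order() gives the positions of the elements in sorted order:
# the smallest value is at position 2, then position 3, then position 1
order(x)      # 2 3 1

# rank() gives the rank of each element in place:
# 30 is the 3rd smallest, 10 the 1st, 20 the 2nd
rank(x)       # 3 1 2

# Indexing by order() sorts the vector; decreasing = TRUE reverses it
sorted_x = x[order(x)]
sorted_desc = x[order(x, decreasing = TRUE)]
print(sorted_x)
print(sorted_desc)
```

The decreasing = TRUE form is an alternative to negating the values, and it also works for character vectors, where negation is not defined.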


Sorting a dataframe by a particular column


For example, if we want to sort planets_df by diameter descending and create largest_first_df:
positions = order(-planets_df$diameter)
largest_first_df = planets_df[positions, ]

Being regular

1. Regularity is everything. Learning something for 2 hours for 25 alternate days is far, far superior to learning the same thing for 10 hours each on the first 5 days and then not coming back to it. On day 51, you will be in a much better position by following the first strategy.

2. If you don't intend to keep using a skill (WebDev, ML - anything) at least 2 to 3 times a week, for at least a couple of years, you might as well not learn it. You will unlearn it in as little as two to three months, and if you need that skill again you will have to start from the very beginning, making you question why you spent all that time learning it in the first place.

3. Set all non-focus but essential things on a reminder until it becomes an autopilot thing. This includes things such as exercising - essential, yes, but should not occupy your mental space and time. The time you spend actually exercising is all the time you should devote to it, nothing more. 

Basic R revision - Part 2

Factors

Factors are used to store categorical variables, that is, variables whose value can only be one of a well-defined, discrete set of values. For example, factor_gender is a factor whose elements can only be "male" or "female".

To construct a factor variable out of a vector of values, just wrap the vector using factor(). For example:

> gender_vector = c("Male", "Female", "Female", "Male", "Male")
> factor_gender_vector = factor(gender_vector)
> factor_gender_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Categorical variables are of two types: nominal and ordinal.
factor_gender would be nominal as there is no grading from lower to higher between male and female unless you are a sexist asshole.
factor_bondratings would be ordinal as there is a natural grading, where we know :



AAA > AA > A > BBB > BB > B > CCC > CC > C > D

In R, a factor is assumed to be nominal by default. If you wish to specify an ordinal factor, use the order and levels arguments (order is matched to the full argument name, ordered):



temperature_vector = c("High","Low","High","Low","Medium")
factor_temperature_vector = factor(temperature_vector, order=TRUE, levels=c("Low","Medium","High"))
> factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

Renaming the elements of a factor variable

Use the levels() function to do this.



> survey_vector = c("M", "F", "F", "M", "M")
> factor_survey_vector = factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M

> levels(factor_survey_vector) = c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male 
Levels: Female Male

Note that it is important to follow the correct order while naming. Using
levels(factor_survey_vector) = c("Male", "Female")
would have been incorrect, since I had run the code earlier and seen the unnamed output to be "Levels: F M"
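One way to avoid depending on the default (alphabetical) level order is to state the levels and their labels together when the factor is created. This is a sketch of that alternative using the levels and labels arguments of factor(), not the approach taken in the notes above; the name labelled is made up.

```r
survey_vector = c("M", "F", "F", "M", "M")

# Levels are sorted alphabetically by default, so "F" comes before "M"
factor_survey_vector = factor(survey_vector)
levels(factor_survey_vector)    # "F" "M"

# Stating levels and labels together avoids relying on the default order:
# "M" is explicitly paired with "Male" and "F" with "Female"
labelled = factor(survey_vector,
                  levels = c("M", "F"),
                  labels = c("Male", "Female"))
print(labelled)
```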

Using summary()

summary() is a general R function but it's very useful with factors. For example:



> summary(factor_survey_vector)
Female   Male
     2      3

If a factor is nominal, then the comparison operator > becomes invalid. See the following (continuation) code for my favorite proof for the equality of sexes:



> # Battle of the sexes:
> # Male
> factor_survey_vector[1]
[1] Male
Levels: Female Male
> # Female
> factor_survey_vector[2]
[1] Female
Levels: Female Male
> # Male larger than female?
> factor_survey_vector[1] > factor_survey_vector[2]
'>' not meaningful for factors

Comparison operators are meaningful for ordinal categorical variables. See:



> speed_vector = c("Fast", "Slow", "Slow", "Fast", "Ultra-fast")
> # Add your code below
> factor_speed_vector = factor(speed_vector, order = TRUE, levels = c("Slow", "Fast", "Ultra-fast"))
> # Print
> factor_speed_vector
[1] Fast       Slow       Slow       Fast       Ultra-fast
Levels: Slow < Fast < Ultra-fast
> # R prints automagically in the right order
> summary(factor_speed_vector)
      Slow       Fast Ultra-fast
         2          2          1

> compare_them = factor_speed_vector[2] > factor_speed_vector[5]
> # Is data analyst 2 faster than data analyst 5?
> compare_them
[1] FALSE

So Analyst 2 is not faster than Analyst 5.

Wednesday, January 28, 2015

Basic R revision - Part 1



A random useful function

To get the data type of a variable in R, use the function class().

my_numeric = 42
my_character = "forty-two"
my_logical = FALSE

> class(my_numeric)
[1] "numeric"
> class(my_character)
[1] "character"
> class(my_logical)
[1] "logical"

Always remember: Python/C++ vector indices start with 0, R vector indices start with 1.
To subset a vector in R, use vectorname[c(starting index: ending index)]
For disparate (non-adjacent) elements: vectorname[c(index1, index2, index3 ..)]

Compare to Python:
Subset a vector in Python, use vectorname[starting index: ending index + 1]
Note that index numbers will be defined as per Python convention
Suppose from a vector v = ['P','O','K','E','R'], we need to output ['O','K','E']
In Python, use v[1:4]
In R, use v[c(2:4)] or just v[2:4]

Also, to get all elements of a row in Python (for a NumPy array), one way is Mymatrix[2, : ] (gets the third row, since indices start at 0)
To do the same in R, use Mymatrix[3,  ]

Comparison Operators in R vs VBA and Python

Comparison Operators in R and C++
<, >, <=, >=, ==, !=

Comparison operators in VBA
<, >, <=, >=, =, <>
Python uses != for inequality; the older <> form was accepted in Python 2 but was removed in Python 3.

For equality, Python supports == (like R and C++)

In R you can use a comparison operator between a vector and a number and get a logical vector which compares each element of the vector to that number.
(Not sure if you can do that in Python. Will check later and update.) Also, you can use that logical vector as an index to get a subset of the original vector.
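A quick sketch of comparing a vector with a number and using the result as an index (scores and above_ten are made-up names):

```r
scores = c(12, 7, 25, 3, 18)

# Comparing a vector with a number compares every element
above_ten = scores > 10
above_ten              # TRUE FALSE TRUE FALSE TRUE

# The logical vector can index the original vector
scores[above_ten]      # 12 25 18

# & and | combine logical vectors element by element
scores[scores > 5 & scores < 20]   # 12 7 18
```

For what it's worth, NumPy arrays in Python support the same kind of comparison and boolean indexing, while plain Python lists do not.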

Matrix in R: To construct a matrix in R you need to add a matrix() wrapper to a vector. e.g. matrix(c(1:9), byrow=TRUE, nrow=3)

Naming elements of a vector and rows/cols of a matrix

Naming can often be useful later. Syntax is simple:
For vector:
vectorv = c(2,3,4)
names(vectorv)=c("a","b","c")
Now, vectorv["a"] returns 2

For matrix:
new_hope = c( 460.998007, 314.4)
empire_strikes = c(290.475067, 247.900000)
return_jedi = c(309.306177,165.8)
# Construct the matrix
star_wars_matrix = matrix(c(new_hope,empire_strikes,return_jedi), nrow=3, byrow=TRUE)
# Name the rows and columns of star_wars_matrix:
rownames(star_wars_matrix) = c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix)= c("US", "non-US")

Another way can be to include these in the matrix definition itself:
# All box office figures in one vector (the three vectors defined above)
box_office_all = c(new_hope, empire_strikes, return_jedi)
movie_names = c("A New Hope","The Empire Strikes Back","Return of the Jedi")
col_titles = c("US","non-US")
star_wars_matrix = matrix(box_office_all, nrow=3, byrow=TRUE, dimnames=list(movie_names,col_titles))

Summing all elements of entire rows or columns, or summing all elements of any vector

To do row sums or column sums in R for a matrix just use rowSums(matrixname) or colSums(matrixname). Note that it is important to capitalize S in rowSums or colSums. Another way can be to reference the needed vector by using something like Mymatrix[3, ] and then wrapping sum() around it.

Combining/Appending functions

cbind(matrixname, vectorname) can append a vector to an existing matrix as a new column, provided the vector's length is the same as the number of matrix rows. Similarly, rbind() appends a vector as a new row. Note the similarity to the c() wrapper used to construct any vector.
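A small sketch tying together cbind(), rbind() and the row and column sums described above, on a made-up 2 x 3 matrix (m, m_wide and m_tall are illustrative names):

```r
m = matrix(1:6, nrow = 2, byrow = TRUE)   # rows: 1 2 3 and 4 5 6

# cbind() adds a column; the vector needs one value per row
m_wide = cbind(m, c(10, 20))

# rbind() adds a row; the vector needs one value per column
m_tall = rbind(m, c(7, 8, 9))

dim(m_wide)       # 2 4
dim(m_tall)       # 3 3
rowSums(m)        # 6 15
colSums(m)        # 5 7 9
sum(m[2, ])       # 15, same as rowSums(m)[2]
```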

Arithmetic Operators

+,-,*,/ work in an elementwise way for both vectors and matrices
matrix1 * matrix2 does elementwise multiplication, not matrix multiplication as in Linear Algebra for which we use %*% in R
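To make the difference concrete, here is a minimal sketch with two small made-up matrices (A and B are illustrative names):

```r
A = matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)   # rows: 1 2 and 3 4
B = matrix(c(0, 1, 1, 0), nrow = 2, byrow = TRUE)   # rows: 0 1 and 1 0

# Elementwise product: each entry of A times the matching entry of B
elementwise = A * B
print(elementwise)        # rows: 0 2 and 3 0

# Matrix product as in linear algebra: rows of A times columns of B
matrix_product = A %*% B
print(matrix_product)     # rows: 2 1 and 4 3
```

Here B swaps the two columns of A under %*%, while under * it simply zeroes out the diagonal entries.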

Tuesday, January 27, 2015

Carbs vs Fats

In a comparison of macronutrients, the often misunderstood fats are way better than carbohydrates. Fats, especially the unsaturated ones, provide several essential functions such as protecting our inner organs, maintaining good cholesterol levels and reducing bad cholesterol. Outside of them, Omega 3 fats found in foods such as Tuna, Walnuts and Beans are one of the healthiest things you can consume for your brain function - improving memory, fighting depression, bipolar disorder and ADHD. In addition, it is good for your cardiovascular system and bone joints. Even some saturated fats, such as those found in Desi Ghee, help break down other hard to digest food and reduce the negative effects of other fried and spicy food you eat, making them easily digestible. The one category of fat that is unequivocally unhealthy is trans fats, which would be all fatty food items with a good shelf life - biscuits, chips, pies, donuts, cake etc. In comparison, there are no “essential” carbohydrates, that is, there are no essential body functions that require carbohydrates. So the only function carbs serve is to provide energy, which is provided in just as much quantity by more functional nutrients like fats and proteins.

The Indian diet is very heavy on carbs, which provide energy but aid no body function, except that a basic amount is needed to aid digestion. You would be surprised that that basic amount is as little as 3 slices of bread a day for a full grown adult, and that is if that were all the carbs he were consuming. Of course, an average adult will also be consuming carbs in decent quantities from vegetables, beans and fruits.
__


Disclaimer: This is a log for my personal use and does not constitute medical advice from me. I am not qualified to give medical advice to others and you should consult your doctor before making any nutritional or medicinal changes. 

Making good on promises past.

So on the first day of the year I promised to write a blog post here about something I learn "almost daily" and then stopped after day 2. That is a familiar theme with new year resolutions. I see a problem there in that statement I made. It was the qualifier "almost". Qualified promises are like qualified love: imaginary. Secondly, vague goals detract from implementation. So, yeah, I shamelessly claim that I'll be updating it daily. Yes, that would be every day. One of the main motivations for writing things down on the blog is the belief that writing notes is not only helpful as a revision tool for committing things to memory, but also that there are things you learn while writing about a subject that you had not learned while reading about it before.

I have been spending a few hours every day learning some interesting stuff, only I never got around to writing about that here - so I'm hoping to also backfill some entries retroactively for the lost 26 days of the year whenever I go back to revising that stuff.

Here we go again.