Wednesday 20 July 2011

Pump Up The Volume

The HHP competition convinced me that it was time to upgrade my tired old computer. As data sets get bigger, having more RAM and a 64-bit operating system can make things a little quicker.

For those who aren't aware, if you have a 32-bit operating system then there is no point having more than 4GB of RAM, as anything beyond that can't be addressed. So if you need more RAM, you will also need a 64-bit operating system.
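
You can check what you have from within R itself - a minimal sketch (note that memory.limit() only exists on Windows builds of R, and has been retired in recent versions):

#check whether R is a 64-bit build (8-byte pointers)
.Machine$sizeof.pointer    #8 on 64-bit, 4 on 32-bit

#RAM available to R in MB (Windows builds only)
memory.limit()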

I had a chat with the guy in the computer shop and he convinced me that it was going to be worth my while just to buy a new PC rather than upgrade my old one. My specifications were 16GB RAM, Windows 7 64-bit and the latest i7 processor. Size was also a consideration - due to the impending takeover of my office by my daughter, with me being relegated to the cupboard under the stairs.

I considered a laptop, but apparently the i7 processors that go in them are not as fast as the desktop versions. Instead I opted for a really small case, which increased the price, as the case itself was more expensive and apparently you need a smaller motherboard to fit in it. A bog standard graphics card was added and Bob's your uncle.

The next day the shop rang me with some bad news. The power supply required for the i7 processor won't fit into the nice little case I had opted for, so they recommended another that is closer to normal size (but still half the size and weight of my incumbent, which comes with 5 fans and neon lights!).

The downgrade in case saved me some money, and the total cost was AU$1,300 - no monitor, keyboard or mouse included though. Incidentally, it would still have cost about $800 to upgrade my old PC (or so the salesman told me), and I now have 2 PCs (actually about 8, as I never seem to get around to getting rid of my old ones).

Keen to see what it could do, I ran my old algorithms through the new beast, and it accomplished the same tasks as my Win 7 64-bit i5 laptop in 66% of the time. I then looked at the Task Manager to see what was going on under the hood.

The first thing I noticed was that with no applications running at all, just the operating system, it was already using 1.71GB, which seems a little excessive.


I then kicked off my algorithms and was a bit dismayed to see that only 14% of the available CPU was being used - so essentially only 1 of the 8 cores.
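
Incidentally, you can also ask R itself how many cores it can see - a one-liner, assuming a version of R recent enough to ship with the parallel package:

#how many cores (including hyper-threaded ones) R can see
parallel::detectCores()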


This is because the code I was running was doing things serially: it was a set of tasks where each needed the prior task to be completed before the next one could begin. The extra cores would let me work on different tasks at the same time (such as browsing the Internet) - but the whole point of the investment was to speed up the number crunching.
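
To illustrate (a toy sketch, not my actual algorithm), a chain like this is stuck on one core no matter how many you have, because each step consumes the result of the one before it:

#a serial chain - step i needs the result of step i-1,
#so extra cores can't speed it up
x <- 1
for(i in 1:1000000){
    x <- sqrt(x + i)
}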

One way to utilise the extra processing power would have been to run several algorithms at the same time - but I wanted one algorithm to complete 8 times quicker.

A particularly popular algorithm is Random Forests. This is essentially an ensemble of lots of individual decision trees that are all built independently. There is no need to wait for one tree to finish before we can start building the next - an ideal candidate for parallel processing.

It took me a while to hack together a solution, but eventually I got 100% CPU usage and my model built in 1/8th of the time.




Below is the R code that will multithread the building of Random Forests. The multithreading piece came from a generous Kaggle post in the Don't Overfit competition by Chris Pardy, and with that bit of info it was trivial to adapt for my purposes.

Incidentally, Chris posted this code completely unexpectedly in response to some other code I had already shared. This was a nice surprise and is why the Kaggle concept is such a great learning environment.

There is now a dilemma to sharing though - a slight matter of $3 million.

From what I understand, in order to win one of the milestone cash prizes you essentially have to tell everyone what you did, in enough detail that they can replicate your results. If you do this, you are essentially putting yourself at the bottom of the leaderboard, as everyone else then has your solution plus their own.

The milestone prizes increase in value every 6 months, so would it be a good strategy for the leader not to take the $30,000 in the first one and hold out for $50,000 6 months later or $60,000 12 months later, given that you will potentially be improving your model by learning from those who do take the money?

Any thoughts?

R code for parallel random forests 


######################################
# some example code to multithread
# the building of random forests
######################################


#load the required libraries
library(randomForest)
library(snowfall)
library(rlecuyer)

#load the data
theData <- iris

#set the formula
theFormula <- as.formula("Species ~ .")

#specify number of threads
threads <- 8

#total number of trees in the random forest
totalTrees <- 1000

#number of trees to build per thread
#(if totalTrees isn't divisible by threads, the total
#will come up slightly short)
trees <- as.integer(totalTrees / threads)


###########################
# now the funky stuff
###########################

#the function each thread calls
parallelRF <- function(i){
    return(randomForest(theFormula, data=theData, ntree=trees))
}

#initialise the "cluster"
sfInit(parallel = TRUE, cpus = threads, type = "SOCK")

#make the data available to the other R instances / nodes
sfExport(list = c("theData", "theFormula", "trees"))

#load the randomForest library on each R instance / node
sfClusterEval(library(randomForest))

#use a parallel RNG to avoid correlated random numbers
#(requires library(rlecuyer) installed on all nodes)
sfClusterSetupRNG()

#build the random forests, load balanced across the threads
allForests <- sfClusterApplyLB(1:threads, parallelRF)

#shut the cluster down
sfStop()

#everything finished, so
#merge all the forests into one
myRandomForest <- allForests[[1]]
for(i in 2:threads){
    myRandomForest <- combine(myRandomForest, allForests[[i]])
}

#convince ourselves we have them all combined
myRandomForest$ntree

#what is important
myRandomForest$importance
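
A quick sanity check that the combined object behaves like any other random forest - here just predicting back onto the training data, so it's purely illustrative:

#the combined forest can be used like a normal one
#(predicting on the training data here, so purely illustrative)
predictions <- predict(myRandomForest, theData)
table(predictions, theData$Species)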

7 comments:

  1. Thanks for the post. I recently needed to get a hard drive from an iMac and I was surprised how jam-packed things were inside.

    I'm using Python on a dual-core 4GB RAM PC and I'm not planning to upgrade anytime soon, since Python can only use one CPU; I'm using the other core for surfing the web and watching videos.

    Best Wishes!

  2. IMHO, you should take the money. If you trust yourself to have an edge over the others, imagine your edge plus $30,000 in computing power.
    Maybe I would have some doubts at the last milestone, where the money jumps from 60k to 3 million, but probably the tricks you use now won't be that valuable in two years.
    Another exception would be if you are really crushing the rest - for example, your algorithm is scoring 0.4300 against others doing 0.4500 - then you probably have some serious information you don't want to give away.
    Very interesting blog, and congrats on your current third place.
    Cheers!

  3. Thanks for the interesting post! Seems like a nice rig you've assembled. A few things:

    1. I see 8 "cores" on your Task Manager -- but with the i7, I think that means you have 4 physical cores & hyper-threading enabled. Sometimes hyper-threading helps, and sometimes (counter-intuitively) it hurts; it's kind of application specific. In any case, you might try disabling hyper-threading in the BIOS to see if that speeds things up even faster (& change threads to 4, instead of 8).

    2. Maybe you should hook up all your old PCs & create a cluster!

    3. Finally, the prize structure of the HHP is interesting -- since there's a 1st & 2nd place prize, there's actually a DIS-incentive for the top two teams to merge or cooperate to win (i.e. if position 1 & 2 merged & won & split the 1st-place prize, they'd make less than if they remained unmerged). It'll be interesting to see how the competition plays out.

  4. @Chris - that is what I recall the guy in the shop also telling me. Messing with the BIOS sounds scary though, but I might give it a try.
    A cluster sounds fun to try just for the experience, although I suspect my new PC would be doing 95% of the work and my other 7 (one of them still with Windows 95!) not a lot.
    The prize structure is interesting, especially as it is not clear exactly how much detail you have to give away to claim the milestone prizes. It would be possible for positions 2 & 3 to still 'merge' but not tell anybody, become 1 & 2, take out the top 2 prizes and split the cash - something that wasn't an issue in Netflix as there was only one prize. I guess that would upset no. 1 a bit.

  5. SaliMali - Great blog! I was trying the parallel Random Forest code, but it gave me an error

    "Error in checkForRemoteErrors(val) : 8 nodes produced errors; first error: object '.lec.Random.seed.table' not found

    Not sure if you have seen this before. Thought I would check.

  6. Not seen this. I suspect though that one of the libraries is not loaded.

    library(rlecuyer)

    probably?

  7. I believe I do have all the libraries. I am trying on Windows though (I don't know if that makes a difference, but there are some libraries that are not available on Windows).
