Saturday 30 July 2011

No Change at the Top

I just spotted something on the HHP leaderboard that I have not seen before. There has been no change in the top 12 positions for 1 week. 


Does this mean we are all running out of ideas? Teams are still submitting, but the incremental improvements seem minuscule.


Dave (he is smarter than me!) and Willem seem to have a big gap at the top, and Edward is close on my tail for 3rd place. The team I would look out for, though, are some Old Dogs With New Tricks.




Friday 29 July 2011

A1 Melbourne and MelbURN: R vs SAS

Last night was another double header in the Melbourne analytic scene. Someone else has already described the events I attended, so pointing to their post will save me some typing.

A1 Melbourne and MelbURN: R vs SAS

Wednesday 20 July 2011

Damn Computers II

In an earlier post I grumbled a bit about how computers can be very frustrating...

http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html

Here is another one for the collection!



Chrome v Internet Explorer 9


As described in the previous post, I have just upgraded my PC. Now, my general experience with Windows is that computers are quite quick when new, but over time they seem to slow down and will eventually take 30 minutes just to turn on. There is a school of thought that this is an inbuilt Microsoft conspiracy to ensure you keep upgrading your operating system whenever they release a new one - which seems to be every second week at the moment. The other theory is that over time you install more and more software that wants to do things for you without asking (why does Adobe think I need it to check for updates every 5 minutes?).


Not wanting to clutter my new machine, I am resisting installing anything that is not essential at the moment. I declined the 'trial' version of the latest Microsoft Office suite during the install of the OS. This caused a problem when I quickly wanted to jot some numbers down in a spreadsheet. Rather than install my old licensed version of Office 2000, I downloaded the free OpenOffice and was very impressed - it opens and saves Microsoft Office files seamlessly, so there are no issues reading the files others send you, and it basically looks the same. I think I will be sticking with it from now on.


I also resisted downloading the Firefox browser and stuck with IE, until now that is.


I wanted to reply to a blog post. Having spent a while composing the reply, I tried to submit it and got stuck in an endless loop of being prompted to log into my Google account. Having experienced things like this before, my first option these days is to try a different browser. I went for Chrome this time, on the hunch that since Chrome is a Google offering, as is Blogger, it should just work. And it did - first time.


Having developed a few web sites, I find these differences between browsers a bit of a pain. You really have to test everything out for all scenarios these days. On the Kaggle leaderboard I'm getting different views depending on which browser I am using. In IE my team name is just in bold; in Firefox and Chrome the whole cell is red, making it easier to spot me - which I think is the intended look.


Internet Explorer - no colouring of the cells




Chrome & Firefox - I think this is the intended look
      






Keyboard Issues


With this new PC, I am having trouble typing a single quote ( ' ), which is required quite a lot in programming. If I press the key once, nothing happens. If I press it again, two of them appear and I have to delete one to get the single quote I need.


I'm sure there is a setting somewhere to prevent this, as I've never had this issue before. Rather than waste all day trying to figure it out, I'm hoping someone can point me in the right direction.


Thanks in advance....

Pump Up The Volume

The HHP competition convinced me that it was time to upgrade my tired old computer. As data sets get bigger, having more RAM and a 64 bit operating system can make things a little quicker.

For those who aren't aware, if you have a 32-bit operating system there is no point having more than 4GB of RAM, as anything beyond that can't be addressed. So if you need more RAM, you will also need a 64-bit operating system.
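
If you are not sure which flavour you are running, R will tell you from within a session. Something like the little check below should do it (memory.limit() only exists on Windows):

#is this a 64-bit build of R?
.Machine$sizeof.pointer   # 8 means a 64-bit build, 4 means 32-bit
R.version$arch            # e.g. "x86_64" on a 64-bit install

#on Windows, how much RAM (in MB) will R let itself use?
if (.Platform$OS.type == "windows") memory.limit()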

I had a chat with the guy in the computer shop and he convinced me that it was going to be worth my while just to buy a new PC rather than upgrade my old one. My specification was 16GB of RAM, 64-bit Windows 7 and the latest i7 processor. Size was also a consideration, due to the impending takeover of my office by my daughter and me being relegated to the cupboard under the stairs.

I considered a laptop, but apparently the i7 processors that go in them are not as fast as the desktop versions. Instead I opted for a really small case, which increased the price, as the case itself was more expensive and apparently you need a smaller motherboard to fit in it. A bog-standard graphics card was added and Bob's your uncle.

The next day the shop rang me with some bad news. The power supply required for the i7 processor wouldn't fit into the nice little case I had opted for, so they recommended another that is closer to normal size (but still half the size and weight of my incumbent, which comes with 5 fans and neon lights!).

The downgrade in case saved me some money, and the total cost was AU$1,300 - no monitor, keyboard or mouse included though. Incidentally, it would still have cost about $800 to upgrade my old PC (or so the salesman told me), and I now have 2 PCs (actually about 8, as I never seem to get around to getting rid of my old ones).

Keen to see what it could do, I ran my old algorithms through the new beast, and it seemed to complete the same tasks in about 66% of the time my Win 7 64-bit i5 laptop takes. I then looked at Task Manager to see what was going on under the hood.
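
If you want to do a similar back-of-the-envelope comparison between machines, wrapping the model build in system.time() is enough. A rough sketch below, using the iris data rather than my actual HHP code:

#time a model build - "elapsed" is the wall-clock time in seconds
library(randomForest)
timing <- system.time(
    fit <- randomForest(Species ~ ., data = iris, ntree = 500)
)
timing["elapsed"]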

The first thing I noticed is that with no applications running at all, just the operating system, it was using 1.71GB of RAM, which seems a little excessive.


I then kicked off my algorithms and was a bit dismayed to see that only 14% of the available CPU was being used - so essentially only 1 of the 8 cores.


This is because the code I was running was doing things serially - a set of tasks where each one needed the prior task to complete before the next could begin. The extra cores would let me work on different things at the same time (such as browsing the Internet), but the whole point of the investment was to speed up the number crunching.

One solution to utilise the extra processing power would have been to run several algorithms at the same time - but I wanted one algorithm to complete 8 times quicker.

A particularly popular algorithm is Random Forests. This is essentially an ensemble of lots of individual decision trees that are all built independently. There is no need to wait for one tree to finish before starting to build the next - an ideal candidate for parallel processing.

It took me a while to hack together a solution, but eventually I got 100% CPU usage and my model built in 1/8th of the time.




Below is the R code that will multithread the building of Random Forests. The multithreading piece came from a generous Kaggle post in the Don't Overfit competition by Chris Pardy, and with that bit of info it was trivial to adapt for my purposes.

Incidentally, Chris posted this code completely unexpectedly in response to some other code I had already shared. This was a nice surprise and is why the Kaggle concept is such a great learning environment.

There is now a dilemma to sharing though - the slight matter of $3 million.

From what I understand, in order to win one of the milestone cash prizes you essentially have to tell everyone what you did in a way that allows them to replicate your results. If you do this, you are effectively putting yourself at the bottom of the leaderboard, as everyone else then has your solution plus their own.

The milestone prizes increase in value every 6 months, so would it be a good strategy for the leader not to take the $30,000 in the first one and instead hold out for the $50,000 6 months later, or the $60,000 12 months later, given that you will potentially be improving your model by learning from those who do take the money?

Any thoughts?

R code for parallel random forests 


######################################
# some example code to multithread
# the building of random forests
######################################


#load the required libraries
library(randomForest)
library(snowfall)    
library(rlecuyer)

#load the data
theData <- iris

#set the formula
theFormula <- as.formula("Species ~ . ")

#specify number of threads
threads <- 8

#total number of trees in the random forest
totalTrees <- 1000

#number of trees to build per thread
trees <- as.integer(totalTrees / threads)


###########################
# now the funky stuff
###########################

#the function each thread calls
parallelRF <- function(i){
    return(randomForest(theFormula, data=theData, ntree=trees))
}
 
 # Initialise "cluster"
 sfInit(parallel = TRUE, cpus = threads, type = "SOCK")

 # Make data available to other R instances / nodes
 sfExport(list = c("theData", "theFormula","trees"))

 # To load a library on each R instance / node
 sfClusterEval(library(randomForest))
 
 # Use a parallel RNG to avoid correlated random numbers
 # Requires library(rlecuyer) installed on all nodes
 sfClusterSetupRNG()

# build the random forests
 allForests <- sfClusterApplyLB(1:threads, parallelRF)

 sfStop()
 
#everything finished, so
#merge all the forests into one
myRandomForest <- allForests[[1]]
for(i in 2:threads){
    myRandomForest <- combine(myRandomForest, allForests[[i]])
}

#convince ourselves we have them all combined
myRandomForest$ntree

#what is important
myRandomForest$importance
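
And as a final sanity check, the combined forest can be used to score data with predict() just like any other randomForest object - here scored back against the iris training data itself, so don't read anything into the accuracy:

#score some data with the combined forest
predictions <- predict(myRandomForest, newdata = iris)
table(predictions, iris$Species)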

Thursday 14 July 2011

Small World in Melbourne

Having two young kids and no family in Melbourne (Australia) to babysit, my evening social activities are very minimal at the moment. They normally revolve around analytic type user groups that start after work and last for a couple of hours - then home for bath time.

Last night was a double header.

The Melbourne Tableau User Group had a meeting to demonstrate the new features of Tableau that are due to be released later this month. Tableau is a package that will draw you graphs and plot geographic data on maps. It is one of my favourite pieces of software for quickly drilling down on issues - you get to what you want very quickly. It costs ~$1,000 for a single licence or $2,000 for an advanced licence that lets you connect to ODBC data sources.

They gave away an iPad at the presentation, as one of the new features is dashboards for mobile devices. With only 27 people there I was feeling lucky, but #5 won (who incidentally already had one and was using it in the meeting to take notes); I was #9.

I spoke to the new Melbourne-based Tableau employee and asked him what he did, as I thought the biggest salespeople for Tableau are the users themselves. He agreed - it is a great product and I've not heard a bad word about it.

The meeting concluded with free beer and some nice finger food. Here I met one of only two previous Australian KDD Cup winners, Bhavani Raskutti, whom I know very well (the other was the University of Melbourne).

As this event drew to a close at 6:30, it was immediately 100m down the road to a Data Melbourne talk. This one was even more social, as a beer was thrust into my hand before the talk even began.

The reason I attended this event (permission for a leave pass had to be booked in advance) was that the speaker was none other than Matt Kwan from the HHP team Planet Melbourne, who are currently 6th on the leaderboard. Matt talked about his PhD research on using mobile phone data to analyse daytime population movements, which was really quite interesting. He then talked about the HHP, basically just introducing the concept of the competition.

He didn't give away any secrets, but some interesting takeaways were that he had never heard of Kaggle before and was drawn to this by the lure of big money. He is an accomplished programmer but had done no analytics, so he had to learn linear algebra from scratch; he is coding all his own software and wishes he had a PC with more grunt. All in all he is doing very well.

One interesting comment was: wouldn't it be more profitable to put all this effort into creating your own startup rather than spending all these hours chasing a potential but not very likely $500k? It is worth noting that several companies started up as a result of the Netflix Prize, including Commendo.

Melbourne is a small place. It is also the spiritual home of Kaggle. In fact I saw Anthony Goldbloom's first ever presentation of the Kaggle concept at AusDM 2009, well before it was actually launched (incidentally, I was running a data mining comp for that conference about blending solutions from the Netflix Prize to get the best mega ensemble - which may be handy for the HHP?).

At an R user group about 12 months ago, Anthony and I sat at the same table as one Jeremy Howard, who became intrigued by the Kaggle concept and went on to become Kaggle's most successful competitor; he is now part of the Kaggle team.

So, small world in Melbourne - my adopted home and a great place to live and work.

  

Friday 8 July 2011

The Power of Twitter and Analyst First

I don't use Twitter, but it must be important, as I have just bought a new Panasonic DVD player that can connect to the internet - yet the only things you seem to be able to do with it are look at Twitter, YouTube, Picasa and the weather. I also watched a presentation from a colleague recently, and every 10 seconds little messages kept popping up on his screen saying who had tweeted what. Not being a Twitter user I had not seen this before, but it makes you realise that information exchange is instantaneous these days.


I am saying this because I had another big spike in views of this blog, all within a specific half hour. Being a data miner and a generally curious chap, I obviously wanted to know what the cause was - or, more precisely, whether I had the skills to find out.

The plot below shows the spike...




and it occurred in a specific half hour exactly 19 hours ago...




and a look at the web stats for the day pointed me to the source...



and then a search of Twitter pointed me to the guilty party...




Note this was tweeted exactly 19 hours ago, the exact time the spike occurred.

Now this is all very interesting you may say, but so what?

My point is that you need a curious mind to be a good data scientist - not expensive software.


In business today, those who make the decisions think that buying the most expensive analytic software is going to solve all the problems known to man, and that you can pluck out any monkey to use it. To me this is putting the cart before the horse - it could not be further from the truth. If you want the best results, you need to spend money on the best analysts first and then think about the software.

I am currently 2nd on the HHP leaderboard, and to generate my predictions I have not used any software that was not freely available or written by myself (except for some commercial compilers bought years ago, and maybe Windows if you count that).


There is a new movement called Analyst First that has started up here in Australia. If you are an analytics manager, I recommend watching the video below.


Analyst First 101