Thursday 30 June 2011

If you build it, they will come

Or will they?

(Apparently, the actual quote is "If you build it, he will come" from Field of Dreams.)

I figured that if I am to make money out of this blog by writing about the HHP, it would be a good advert if I were actually doing well in the competition. I have recently been putting effort in on this front, and team Sali Mali eventually got to the top of the leaderboard.

Being a data geek, I looked at the subsequent stats of this blog and saw a huge spike in a particular half hour around the time I got to the top. 'Bingo,' I thought - that's the way to generate traffic.

I then looked at my LinkedIn stats, as the only real way to get to this blog is via my Kaggle profile, which takes you to my LinkedIn page and then on to this blog. Surprisingly, there was no such spike there - so what was going on?


Blog views:

LinkedIn views:

As I don't know exactly what time zones everything is in I can't be certain, but my blog stats pointed me to this post on the HHP forum, which I am assuming is the cause of the spike.

http://www.heritagehealthprize.com/c/hhp/forums/t/664/cross-validation-discrepancies/4381#post4381


Anyway, the data mining point of all this is that sometimes people are quick to jump to conclusions that are completely wrong. The real answers are always in the data - which is why I think the HHP will be won by a data scientist and prior expert medical knowledge will play no part at all.

Saturday 18 June 2011

The more you practice, the luckier you become

Chris R has now opened up a gap at the top of the leaderboard. It may look like nothing, but when you plot it, it is quite significant.



If you plot the number of entries per team, you see that progress seems to come with more effort. Chris R and a few other teams are making almost daily submissions.




I am now progressing steadily as I investigate more of the data and more algorithms. I have gone from 40th to 19th by just refining the algorithm, rather than adding any different data.

The technique I am using to determine if things are improving is quite simple. Rather than hold out an independent set to test the errors against, I randomly split the data in half, build a model on one half and then calculate the error on the other. The error on the building set is the 'in bag' error and the error on the unseen set is the 'out of bag' error. This random splitting is repeated many times, with the predictions just averaged. What I am looking for is the OOB error going down with improvements in the algorithm. This seems to work, and the leaderboard scores improve as my OOB scores improve, although the leaderboard error is about 0.01 higher than my OOB error.
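
For anyone wanting to try the same thing, the code below is a minimal sketch of that repeated random splitting. It assumes a data frame called dat holding a DaysInHospital target plus some features, and it uses a plain linear model on log(1 + days) as a stand-in for the real algorithm; it also averages the per-split errors rather than the predictions, which is enough to watch the OOB error move.

##########################################
# Sketch of the repeated random-split
# (in bag / out of bag) error check
##########################################

repeatedSplitError <- function(dat, target = "DaysInHospital", nReps = 20){

    #work on the log(1 + days) scale, as the HHP metric does
    d <- dat[, setdiff(names(dat), target), drop = FALSE]
    d$y <- log1p(dat[[target]])

    rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
    ib <- oob <- numeric(nReps)

    for (i in 1:nReps){

        #random 50/50 split
        idx    <- sample(nrow(d), floor(nrow(d)/2))
        build  <- d[idx, ]
        unseen <- d[-idx, ]

        #stand-in model - the real algorithm would go here
        fit <- lm(y ~ ., data = build)

        #'in bag' error on the building half, 'out of bag' on the unseen half
        ib[i]  <- rmse(build$y,  predict(fit, build))
        oob[i] <- rmse(unseen$y, predict(fit, unseen))
    }

    c(inBag = mean(ib), outOfBag = mean(oob))
}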

For a sanity check, you can plot the distributions of the predictions for the in bag, out of bag and leaderboard sets. The plot below shows a remarkable similarity, meaning the data distributions in the sets are very similar.
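
The plot itself is only a few lines - a rough sketch is below, assuming you already have three vectors of predictions (predIB, predOOB and predLB for the in bag, out of bag and leaderboard sets - the names are just for illustration).

##########################################
# Sketch of the prediction-distribution
# sanity check
##########################################

plotPredDistributions <- function(predIB, predOOB, predLB){

    #estimate the three densities
    dIB  <- density(predIB)
    dOOB <- density(predOOB)
    dLB  <- density(predLB)

    #overlay them on one set of axes
    plot(dIB, col = "blue",
         ylim = range(dIB$y, dOOB$y, dLB$y),
         main = "Prediction distributions",
         xlab = "Predicted days in hospital")
    lines(dOOB, col = "red")
    lines(dLB,  col = "darkgreen")

    legend("topright",
           c("in bag", "out of bag", "leaderboard"),
           col = c("blue", "red", "darkgreen"), lty = 1)
}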



Wednesday 8 June 2011

Data Anomalies

This is a place where I will put some quirks of the HHP data that may or may not be relevant to winning the $3 million.

1) Be aware of the claims truncated flag for the claims in Y1 - a quick check is sketched below.
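
The sketch assumes the released file DaysInHospital_Y2.csv with columns MemberID, ClaimsTruncated and DaysInHospital, so check the names against your own copy of the data.

#how common is the Y1 truncation flag, and does it matter?
#(assumes DaysInHospital_Y2.csv as released - adjust names if your copy differs)
dih2 <- read.csv("DaysInHospital_Y2.csv")

#how many members had their Y1 claims truncated?
table(dih2$ClaimsTruncated)

#average days in hospital by truncation flag
tapply(dih2$DaysInHospital, dih2$ClaimsTruncated, mean)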




2) There are some members with pediatric claims who are clearly not children. Might this indicate family accounts or incorrect ages? (A way to pick these out is sketched below.)
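
The sketch assumes Claims.csv (with MemberID and Specialty) and Members.csv (with MemberID and AgeAtFirstClaim) as released - again, check the column names and the exact 'Pediatrics' spelling against your own copy.

#members with at least one pediatric claim, by age band
#(assumes Claims.csv and Members.csv as released)
claims  <- read.csv("Claims.csv")
members <- read.csv("Members.csv")

ped_ids <- unique(claims$MemberID[claims$Specialty == "Pediatrics"])

#age bands of those members - anything above the child bands looks odd
table(members$AgeAtFirstClaim[members$MemberID %in% ped_ids])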





Sunday 5 June 2011

Scraping Up the Leaderboard

Leaderboards in analytic competitions can be very revealing.


From a competitor's viewpoint, you can see if you are in the right ballpark or completely missing something that others have obviously spotted.


If you are the competitor in the lead, then revealing your hand can be very detrimental: it encourages others to try harder, and you tend to take your foot off the gas when you are ahead.


I have been on both ends...

The Right End (a)


The PAKDD09 Data Mining Competition was about predicting whether credit card customers would pay their bills. As with the HHP, you had to predict into the future. You can see the leaderboard here.

Now, the team at the top of the training data leaderboard, 'Tiberius Data Mining Solutions', is my team, but we weren't always there. You will notice the team in 2nd place, Suncorp, is way ahead of the field; they had obviously spotted something no one else had. Curiosity to figure out exactly what it was eventually led us to discover new insights and overtake them.

Curiously, in this competition we ended up 4th, and the winners came from nowhere on the training data leaderboard to win on the out-of-time dataset.

The Right End (b)


The plot above shows the final standings for the Tourism2 Kaggle competition. On the training leaderboard, Kaggle's own Jeremy Howard (in his pre-Kaggle days) was in front by a country mile after only a handful of submissions. This set a benchmark to aspire to and pushed others to try to figure out what Jeremy had discovered. In the end I got to 2nd place on the training leaderboard but was still way off Jeremy. Eventually, though, on the holdout sample, my submission did prove to be the winner.




The Wrong End


The Kaggle INFORMS contest was about predicting stock prices. This was one where I was surprised to make a submission that shot to the top of the leaderboard by a wide margin and stayed there for a while, until the last few hours when I was overtaken by 3 teams. I had taken my eye off the ball, but I would like to think I had encouraged the eventual winners to squeeze every last bit of juice out of the data.


Leaderboard Plots

This blog post on the recent Chess comp is very interesting; it studies how the leaderboard evolved through time and how big leaps in scores are made.

I ran a recent Kaggle competition called 'Don't Overfit' and posted some leaderboard plots to the forum during the course of the comp. The way I did this seemed quick the first time around - just copy and paste into Excel and plot a graph. Doing this as a one-off let me see what I wanted fast, but it became tedious over time, doing it every week.

I then figured this task could be automated, and a bit of Googling soon provided me with the clues I needed to proceed. If I run the R code below, it will plot the current leaderboard for any Kaggle comp and show me where I am (and how much more effort I need to put in).

What is interesting about the current scores is that maybe the top 5 teams are a bit ahead, but everyone else is on a nice trajectory, unlike some of the other leaderboards we have seen where there are marked step increases.

My own score is slowly going in the right direction as I explore more of the data and tighten up the techniques, but my position is going backwards as others are doing this at a quicker pace!



The R code below will plot the live leaderboard positions as shown above...



##########################################
# Some R function to plot your position
# on a Kaggle leaderboard
##########################################

plotKaggleLeaderboard <- function(theURL,myTeam,topX=100,title){
    
#this library does all the hard work
#for windows users, to install this package see
#http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe
#and http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/
library(XML)

#go and read the tables from the web page
tables <- readHTMLTable(theURL)

#get the table of interest
#names(tables)
lb <- tables[['leaderboard-table']]

#see what the columns are
#colnames(lb)
#nrow(lb)

#convert to numeric - see ?readHTMLTable
#numeric_cols <- c('AUC','Entries')
#numeric_cols <- c('RMSLE','Entries')
numeric_cols <- c(4,5)
lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(".* ", "", as.character(x))))


#extract the team name from a messy field
team_col <- c('Team Name')
#lb[team_col]

#split the field on the "\r\n" that denotes the breaks within the field, keeping the first part (the team name)
lb[,team_col] <- sapply(strsplit(as.character(lb[,team_col]), "[\r\n]"), function (x) x[1]) 


#####################
#      plot
#####################
myRank <- which(lb[team_col] == myTeam)
mytext = paste("Team",myTeam,"is in position",myRank,"out of",nrow(lb))

error_data <- lb[,4]

plot(error_data[1:topX]
    ,col = 'blue'
    ,main = paste(title,"- leaderboard at",Sys.Date())
    , sub = mytext
    , xlab = 'Rank'
    , ylab = 'Error'
    )

#mark my position
points(myRank,error_data[myRank],col="red",pch=19)

}
###########################
# End of Function
# plotKaggleLeaderboard
###########################




compURL <- "http://www.heritagehealthprize.com/c/hhp/Leaderboard"
compTeam <- 'Sali Mali'
compTopX <- 125
compTitle <- 'HHP'



plotKaggleLeaderboard(theURL = compURL
                      ,myTeam = compTeam
                      ,topX = compTopX
                      ,title = compTitle 
                      )