Sunday, 5 June 2011

Scraping Up the Leaderboard

Leaderboards in analytic competitions can be very revealing.


From a competitor viewpoint, you can see if you are in the right ballpark or completely missing something that others have obviously spotted.


If you are the competitor in the lead, revealing your hand can be detrimental: it encourages others to try harder, and you tend to take your own foot off the gas when you are ahead.


I have been on both ends...

The Right End a)


The PAKDD09 Data Mining Competition was about predicting whether credit card customers would pay their bills. As with the HHP, you had to predict into the future. You can see the leaderboard here.

Now the team at the top of the training data leaderboard, 'Tiberius Data Mining Solutions', is my team, but we weren't always there. You will notice that the team in 2nd place, Suncorp, is way ahead of the rest of the field; they had obviously spotted something no one else had. Curiosity to figure out exactly what it was eventually led us to discover new insights and overtake them.

Curiously, in this competition we ended up 4th, and the winners came from nowhere on the training data leaderboard to win on the out-of-time dataset.

The Right End b)


The plot above shows the final standings for the Tourism2 Kaggle competition. On the training leaderboard, Kaggle's own Jeremy Howard (in his pre-Kaggle days) was in front by a country mile after only a handful of submissions. This set a benchmark to aspire to, and pushed others to try to figure out what Jeremy had discovered. In the end I got to 2nd place on the leaderboard but was still way off Jeremy. On the holdout sample, though, my submission eventually proved to be the winner.




The Wrong End


The Kaggle Informs contest was about predicting stock prices. Here I was surprised to make a submission that shot to the top of the leaderboard by a wide margin and stayed there for a while, until the last few hours, when I was overtaken by 3 teams. I had taken my eye off the ball, but I would like to think I had encouraged the eventual winners to squeeze every last bit of juice out of the data.


Leaderboard Plots

This blog post on the recent Chess comp is very interesting; it studies how the leaderboard evolved through time and how big leaps in scores are made.

I ran a recent Kaggle competition called 'Don't Overfit' and posted some leaderboard plots to the forum during the course of the comp. The way I did this seemed quick the first time around - just copy and paste into Excel and plot a graph. As a one-off this showed me what I wanted fast, but repeating it every week soon became tedious.

I then figured this task could be automated, and a bit of Googling soon provided me with the clues I needed to proceed. If I run the R code below, it will plot me the current leaderboard for any Kaggle comp and show me where I am (and how much more effort I need to put in).

What is interesting about the current scores is that maybe the top 5 teams are a bit ahead, but everyone else is on a nice trajectory, unlike some of the other leaderboards we have seen where there are marked step increases.

My own score is going slowly in the right direction as I explore more of the data and tighten up the techniques, but I am going backwards as others are doing this at a quicker pace!



The R code below will plot the live leaderboard positions as shown above...



##########################################
# An R function to plot your position
# on a Kaggle leaderboard
##########################################

plotKaggleLeaderboard <- function(theURL,myTeam,topX=100,title){
    
#this library does all the hard work
#for Windows users, to install this package see
#http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe
#and http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/
library(XML)

#go and read the tables from the web page
tables <- readHTMLTable(theURL)

#get the table of interest
#names(tables)
lb <- tables[['leaderboard-table']]

#see what the columns are
#colnames(lb)
#nrow(lb)

#convert to numeric - see ?readHTMLTable
#numeric_cols <- c('AUC','Entries')
#numeric_cols <- c('RMSLE','Entries')
numeric_cols <- c(4,5)
lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(".* ", "", as.character(x))))


#extract the team name from a messy field
team_col <- c('Team Name')
#lb[team_col]

#split the field on "\r" or "\n", which mark the breaks between fields within a field, and keep the first piece
lb[,team_col] <- sapply(strsplit(as.character(lb[,team_col]), "[\r\n]"), function (x) x[1]) 


#####################
#      plot
#####################
myRank <- which(lb[team_col] == myTeam)
mytext = paste("Team",myTeam,"is in position",myRank,"out of",nrow(lb))

error_data <- lb[,4]

plot(error_data[1:topX]
    ,col = 'blue'
    ,main = paste(title,"- leaderboard at",Sys.Date())
    , sub = mytext
    , xlab = 'Rank'
    , ylab = 'Error'
    )

#mark my position
points(myRank,error_data[myRank],col="red",pch=19)

}
###########################
# End of Function
# plotKaggleLeaderboard
###########################




compURL <- "http://www.heritagehealthprize.com/c/hhp/Leaderboard"
compTeam <- 'Sali Mali'
compTopX <- 125
compTitle <- 'HHP'



plotKaggleLeaderboard(theURL = compURL
                      ,myTeam = compTeam
                      ,topX = compTopX
                      ,title = compTitle 
                      )
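
The two cleaning steps inside the function are easier to follow on a tiny example that does not need a web connection. The sketch below is only an illustration: the data frame, its column names (Team, Score, Entries) and all of its values are made up to imitate the kind of messy text readHTMLTable might return, and the helper names clean_number and first_line are hypothetical. It applies the same gsub and strsplit calls as the function above, then looks up a team's rank and its gap to the leader.

```r
# Made-up rows imitating a scraped leaderboard table (an assumption,
# not real competition data or the real page layout).
lb <- data.frame(
  Team    = c("Alpha\r\n  member one", "Sali Mali\r\n  member two", "Gamma"),
  Score   = c("new 0.4561", "0.4602", "0.4710"),
  Entries = c("12", "7", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop everything up to the last space, then convert
# what is left to a number (same gsub pattern as in the function above).
clean_number <- function(x) as.numeric(gsub(".* ", "", as.character(x)))

# Hypothetical helper: keep only the text before the first line break.
first_line <- function(x) {
  sapply(strsplit(as.character(x), "[\r\n]"), function(parts) parts[1])
}

lb$Score   <- clean_number(lb$Score)
lb$Entries <- clean_number(lb$Entries)
lb$Team    <- first_line(lb$Team)

# Find my row; which() returns an empty vector if the team is not listed.
my_team <- "Sali Mali"
my_rank <- which(lb$Team == my_team)

if (length(my_rank) == 1) {
  # Rows are assumed to be in rank order, so row 1 is the leader.
  gap_to_leader <- lb$Score[my_rank] - lb$Score[1]
  cat("Team", my_team, "is in position", my_rank, "out of", nrow(lb), "\n")
  cat("Gap to the leader:", gap_to_leader, "\n")
} else {
  cat("Team", my_team, "was not found on the leaderboard\n")
}
```

Checking length(my_rank) before using it is a small addition that the function above does not make; it avoids a confusing subtitle on the plot if the team name is misspelled.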




