Friday 7 October 2011

If you can't measure it, you can't manage it!

"If you can't measure it, you can't manage it!" - this is often quoted in the marketing and analytics world.

Apparently
"It comes from a remark attributed to GALILEO, who said 'Count what is
countable, measure what is measurable. What is not measurable, make
measurable'."

see here for more details on the origins of the quote.

Anyway, the point is that if you don't measure the before and after, you have no idea if the actions made a difference.

Last week in my previous blog post, I made available some code that would result in a score of around 0.4635 on the HHP leaderboard. The hope was that others would be able to get this code to work and re-ignite interest in the HHP and solving the worlds health problems.

When I looked at the leaderboard today, it looked on the face of it that this was happening, with a few high risers around the 0.4635 mark.


This was the after, but the before is not shown, so on this evidence there is a bit of a hole in the claim that the code was responsible for this change - we are only showing the part of the leaderboard that we want to - such movement could be going on all over the place.

Good old Jeff Moser at Kaggle has been hard at work making the leaderboard dynamic, so we can actually go back in time. Knowing this I modified a previous R script I posted to plot the leaderboard at 3 points in time, the day I released the code, and a week before and after (the new script can be found at the end of this post).

This resulted in the following 2 plots...



These are now a lot more convincing that the code did make a difference. If you follow Eu Jin Lok on the first plot, you will see he has improved his score but actually gone backwards in rank over the two week period. The second plot shows that the number of submissions to score 0.4635 has dropped.

Anyway, the point of this post is that data scientist have the power to make massive differences to the bottom lines of companies. I have been there and done it, but failed to get too excited as I am unassuming and knew exactly what the result would be because my models told me in advance. If analytics - and us analysts - are to be taken more seriously and given the recognition we deserve, we need to start banging our own drum more than we do. Those who get the recognition and rewards are not always those who do the best work, but those who shout the loudest!

And here is the R script to generate those plots...



##########################################
# Some R function to plot the Kaggle 
# leaderboard at different points in time
##########################################

plotKaggleLeaderboard <- function(theURL
,theDates
,myTeam
,topX=100
,title
,plottype=1){
    
#this library does all the hard work
#for windows users to install this packages see
#http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe
#and http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/
library(XML)

theColours <- c(552,254,26)

for (i in 1:length(compDates))
{

#go and read the tables from the web page
thisDate <- paste(theURL,"?asOf=",compDates[i],sep="")
tables <- readHTMLTable(thisDate)

#get the table of interest
#names(tables)
lb <- tables[['leaderboard-table']]

#see what the columns are
#colnames(lb)
#nrow(lb)

#convert to numeric - see ?readHTMLTable
#numeric_cols <- c('AUC','Entries')
#numeric_cols <- c('RMSLE','Entries')
numeric_cols <- c(4,5)
lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(".* ", "", as.character(x))))


#extract the team name from a messy field
team_col <- c('Team Name')
#lb[team_col]

#split the field by "\r\n" than denotes the break between fields within a field 
lb[,team_col] <- sapply(strsplit(as.character(lb[,team_col]), "[\r\n]"), function (x) x[1]) 


#####################
#      plot
#####################
myRank <- which(lb[team_col] == myTeam)
myText = paste("following team",myTeam)

error_data <- lb[,4]
entries_data <- lb[,5]

if(plottype==1){
if(i==1)  plot(error_data[1:topX],col = theColours[i],type="l",xlab='Rank',ylab='Error',main = title, sub=myText)
if(i>1) points(error_data[1:topX],col = theColours[i],type="l")
          
#mark position
points(myRank,error_data[myRank],col=theColours[i],pch=19)
}


if(plottype==2){
if(i==1){ plot(error_data[1:topX],entries_data[1:topX],type = 'p'
            ,main = title
            , xlab = 'Error'
            , ylab = 'Entries'
            , col= theColours[i]
    )}

if(i>1) points(error_data[1:topX],entries_data[1:topX],type = 'p', col= theColours[i])

lines(lowess(error_data[1:topX],entries_data[1:topX]), col=theColours[i], lwd=2.5) # lowess line (x,y)

#mark my position
points(error_data[myRank],entries_data[myRank],col=theColours[i],pch=19)
}

}

legend("topleft",legend=theDates,text.col=theColours)

}
###########################
# End of Function
# plotKaggleLeaderboard
###########################



compURL <- "http://www.heritagehealthprize.com/c/hhp/Leaderboard"
compDates <- c('2011-09-24','2011-10-01','2011-10-07')
compTeam <- 'Eu Jin Lok'
compTopX <- 150
compTitle <- 'HHP Leaderboard Through Time'

plotKaggleLeaderboard(theURL = compURL
            ,theDates = compDates
                      ,myTeam = compTeam
                      ,topX = compTopX
                      ,title = compTitle
                       ,plottype=1
                      )









3 comments:

  1. Your code made a difference for me, even if it was only to introduce me to the gbm() function [I'd been stuck on linear regression prior to reading the milestone papers]. Thanks for the help. :-)

    ReplyDelete
  2. Thanks for sharing such a nice information on Data mining.You have used the pictures and codes in between Which made me to understand the topic more easily.

    ReplyDelete
  3. Thanks for sharing it. I always enjoy reading such superb content with valuable information. The ideas presented are excellent and really cool, making the post truly enjoyable. Keep up the fantastic work.
    visit: Exploratory Testing in Agile: Uncovering Hidden Bugs

    ReplyDelete