Another Data Mining Blog: Learning from the Leaderboard

It is now day 12 of the HHP. I know this as I have been trying to make a submission every day, and the leaderboard says Sali Mali has made 12 submissions.

Dave is keeping his position near the top but Sali Mali is going downhill, as others overtake me.

I've now started to look at the problem in a little more detail, and decided I needed to a way of replicating the error function that is being used to judge this competiton. This will help in figuring out what sort of leaderboard score I expect my submissions to achieve.

I've also not been totally wasting my submissions during the time when I have had no useful model to submit. In previous time series competitons, it has been possible to learn a lot from the leaderboard (real examples will be given part 2 of this post), so I figured pinging a few constant valued submissions might help me extract some insight about the future (although I've not really thought about what that insight might be at the moment).

The following R code demonstrates

how the function to calculate the error was put together
how we can easily generate the errors for a sequence of constant valued predictions on Y2 and Y3
how the Y4 leaderboard results were read from an Excel file and compared to the Y2 and Y3 results.

Doing this was a sanity check that my error function gave similar results to the leaderboard calculations. It shows that a constant of 0.2 is a good guess and also that the Y4 curve looks more like Y2 than Y3. This is interesting and needs further thought on exactly what this means, if anything!

###########################################
#function to calculate the model error
###########################################
calc_error <- function(act,pred)
{
    aact <- as.matrix(act)
    ppred <- as.matrix(pred)
    
    if(nrow(aact) == nrow(ppred)){
    return (sqrt(colSums((log(ppred+1) - log(aact+1)) ^ 2) / nrow(aact)))
    } else {
    return (-99)
    }

}
#### EOF to calcualte model error ####



###########################################
# function to calculate the model errors
# given a sequence of constant values for
# the predictions and the known outcome
###########################################
err_seq <- function(sequence,act)
{
    
    actual <- as.matrix(act)

    #vector of errors
    errors <- vector(length=length(sequence))

    #vector for the predictions (will be constants)
    predictions <- vector(length=nrow(actual))

        ind <- 0
        for(i in sequence){
            
            #report progress
            ind = ind + 1
            cat("\n",i)
            flush.console()
            
            #set the prediction to the constant     
            predictions[] <- i    
            #calculate the error
            errors[ind] <- calc_error(act=actual,pred=predictions)
        }

    return (errors)
}
###EOF to calculate sequence of errors ####




#connect to data
library(RODBC) #for data connection
conn <- odbcConnect("sql_server_HHP")

#load the actual days in hospital
actualY2 <- sqlQuery(conn,"select DaysInHospital from DaysInHospital_Y2")
actualY3 <- sqlQuery(conn,"select DaysInHospital from DaysInHospital_Y3")

#set up a sequence of constants to be used as the predictions
const_preds <- seq(from=0, to=1, by=0.01)

#calculate the error sequences for Y2 & Y3
Y2Err <- err_seq(sequence=const_preds,act=actualY2)
Y3Err <- err_seq(sequence=const_preds,act=actualY3)

#read in the errors for Y4 from an excel file 
#where they were recorded
library(xlsReadWrite)
#this is also required to be run
#see http://www.swissr.org/software/xlsreadwrite
#xls.getshlib() 
Y4 <- read.xls("E:\\comps\\hhp\\constants.xls")



#############################
#plot prediction v error
#############################
ymin <- min(Y2Err,Y3Err)
ymax <- max(Y2Err,Y3Err)
plot(const_preds,Y2Err
    ,type='l'
    ,main='Learning from the Leaderboard'
    ,xlab='Predicted Constant'
    ,ylab='Error'
    ,ylim=c(ymin, ymax)
    ,col='blue'
    )
lines(const_preds,Y3Err,col='green')

#add the line for Y4, discovered via leaderboard
lines(Y4[,1],Y4[,2],col='red',lwd=2)

legend('bottomright', c('Y2 - known','Y3 - known','Y4 - via leaderboard'),lty=1, col=c("blue","green","red"))

Another Data Mining Blog

Monday, 16 May 2011

Learning from the Leaderboard - Part 1

No comments:

Post a Comment