Wednesday, 11 May 2011

Beware the HHP submission sort order

When a solution is submitted for the HHP, the file has to contain the memberID and the predicted Days In Hospital for year 4. 

It was noticed early on by a competitor that the memberIDs in the example file provided were not in numerical order, in fact they were quite random. Kaggle confirmed that the submissions had to be in the same memberID order as the supplied example file.

This now causes a concern, being as the data is now in a database. There is no real concept of row order in a database, just because you enter data in a particular order, there is no guarantee it will come out in that order. If you need a specific order, then the order by statement should be used in the SQL to specify the order required.

So how can we get the specified order?

We need to add a new column to the Target.csv file that contains the row order. A simple way to do this in Excel is just to add an extras column going from 1 to the number of cases, in ascending order.

Another way to create this extra column is to use R. When R reads data it conserves the order the file is read. The code below will create a new file with an extra column.



#########################################################
# the customer ID is not in any particular order in
# the submission file. This code creates a sort order
# that we can use to ensure things are sorted correctly.
#########################################################
setwd('E:\\comps\\hhp\\raw data\\HHP_release2')

#read in the original
Target <- read.csv('Target.csv')

#add an extra column called orderbyme
#with the row number as the value
Target$orderbyme <- 1:nrow(Target)

#take a look at the result
head(Target)

#write out a new file
write.csv(Target, file='Target_with_order.csv',row.names=FALSE,na = "")


(the syntax highlighting is not perfect for R, it is for Perl, but lets you see the comments easily)

Now if we load this into SQL Server with the rest of the data, we have a field we can always sort by.



 Once a submission file is generated, we can also double check it in R and plot the distribution to give it a sanity check.



#########################################################
# this R code reads in a submission file, checks it and
# plots the distribution of predicted days in hospital
#########################################################

#change this to the name of the submission file to check
file_name <- 'submission6.csv'

#the submission files and
#Target.csv should be here
setwd('E:\\comps\\hhp\\submissions')

#read in the files
sub_file <- read.csv(file_name, header=TRUE)
compare_file <- read.csv('Target.csv',header=TRUE)
 
#check the file has the correct number of rows and columns
if(nrow(sub_file) != 70942){cat('\n\n***invalid number of rows***\n\n')}
if(ncol(sub_file) != 3){cat('\n\n***invalid number of columns***\n\n')}

#check the example file
if(nrow(compare_file) != 70942){cat('\n\n***Target.csv is corrupted***\n\n')}
if(ncol(compare_file) != 3){cat('\n\n***Target.csv is corrupted***\n\n')}

#check the ids match
differences <- abs(sub_file[,1] - compare_file[,1])

if(max(differences) != 0){
cat('\n\n***IDs in wrong order***\n\n')
}else{
cat('\n\n***IDs are in the correct order***\n\n')
}

#check the range
if(max(sub_file[,3]) > 15){
cat('\n\n***you have predictions > 15 days!***\n\n')}

if(min(sub_file[,3]) < 0){
cat('\n\n***you have predictions < 0 days!***\n\n')}


#####################################
#plot the submission distribution
#####################################
num_bins <- 50
theplot <- hist(sub_file[,3]
    ,breaks=num_bins
    ,xlim = c(0, max(sub_file[,3]) + 0.1)
    ,main = paste("Distribution of ",file_name)
    ,xlab = "Predicted Days in Hospital"
    )

#the mean prediction
mean_pred <- colMeans(sub_file[3])

xpos = max(sub_file[,3]) * 0.5
ypos = max(theplot$counts) * 0.75
mytext = paste("Mean submitted DIH= ",sprintf("%.4f", mean_pred))
text(xpos,ypos,mytext,pos=4)


and this is what I  get for one of my submissions to date...


and interestingly if you change num_bins to 5,000, you get...


which will lead to some investigation why we get this funny distribution.



 

No comments:

Post a Comment