When a solution is submitted for the HHP, the file has to contain the memberID and the predicted Days In Hospital for year 4.
It was noticed early on by a competitor that the memberIDs in the example file provided were not in numerical order, in fact they were quite random. Kaggle confirmed that the submissions had to be in the same memberID order as the supplied example file.
This now causes a concern, being as the data is now in a database. There is no real concept of row order in a database, just because you enter data in a particular order, there is no guarantee it will come out in that order. If you need a specific order, then the order by statement should be used in the SQL to specify the order required.
So how can we get the specified order?
We need to add a new column to the Target.csv file that contains the row order. A simple way to do this in Excel is just to add an extras column going from 1 to the number of cases, in ascending order.
Another way to create this extra column is to use R. When R reads data it conserves the order the file is read. The code below will create a new file with an extra column.
(the syntax highlighting is not perfect for R, it is for Perl, but lets you see the comments easily)
Now if we load this into SQL Server with the rest of the data, we have a field we can always sort by.
Once a submission file is generated, we can also double check it in R and plot the distribution to give it a sanity check.
and this is what I get for one of my submissions to date...
and interestingly if you change num_bins to 5,000, you get...
which will lead to some investigation why we get this funny distribution.
It was noticed early on by a competitor that the memberIDs in the example file provided were not in numerical order, in fact they were quite random. Kaggle confirmed that the submissions had to be in the same memberID order as the supplied example file.
This now causes a concern, being as the data is now in a database. There is no real concept of row order in a database, just because you enter data in a particular order, there is no guarantee it will come out in that order. If you need a specific order, then the order by statement should be used in the SQL to specify the order required.
So how can we get the specified order?
We need to add a new column to the Target.csv file that contains the row order. A simple way to do this in Excel is just to add an extras column going from 1 to the number of cases, in ascending order.
Another way to create this extra column is to use R. When R reads data it conserves the order the file is read. The code below will create a new file with an extra column.
######################################################### # the customer ID is not in any particular order in # the submission file. This code creates a sort order # that we can use to ensure things are sorted correctly. ######################################################### setwd('E:\\comps\\hhp\\raw data\\HHP_release2') #read in the original Target <- read.csv('Target.csv') #add an extra column called orderbyme #with the row number as the value Target$orderbyme <- 1:nrow(Target) #take a look at the result head(Target) #write out a new file write.csv(Target, file='Target_with_order.csv',row.names=FALSE,na = "")
(the syntax highlighting is not perfect for R, it is for Perl, but lets you see the comments easily)
Now if we load this into SQL Server with the rest of the data, we have a field we can always sort by.
Once a submission file is generated, we can also double check it in R and plot the distribution to give it a sanity check.
######################################################### # this R code reads in a submission file, checks it and # plots the distribution of predicted days in hospital ######################################################### #change this to the name of the submission file to check file_name <- 'submission6.csv' #the submission files and #Target.csv should be here setwd('E:\\comps\\hhp\\submissions') #read in the files sub_file <- read.csv(file_name, header=TRUE) compare_file <- read.csv('Target.csv',header=TRUE) #check the file has the correct number of rows and columns if(nrow(sub_file) != 70942){cat('\n\n***invalid number of rows***\n\n')} if(ncol(sub_file) != 3){cat('\n\n***invalid number of columns***\n\n')} #check the example file if(nrow(compare_file) != 70942){cat('\n\n***Target.csv is corrupted***\n\n')} if(ncol(compare_file) != 3){cat('\n\n***Target.csv is corrupted***\n\n')} #check the ids match differences <- abs(sub_file[,1] - compare_file[,1]) if(max(differences) != 0){ cat('\n\n***IDs in wrong order***\n\n') }else{ cat('\n\n***IDs are in the correct order***\n\n') } #check the range if(max(sub_file[,3]) > 15){ cat('\n\n***you have predictions > 15 days!***\n\n')} if(min(sub_file[,3]) < 0){ cat('\n\n***you have predictions < 0 days!***\n\n')} ##################################### #plot the submission distribution ##################################### num_bins <- 50 theplot <- hist(sub_file[,3] ,breaks=num_bins ,xlim = c(0, max(sub_file[,3]) + 0.1) ,main = paste("Distribution of ",file_name) ,xlab = "Predicted Days in Hospital" ) #the mean prediction mean_pred <- colMeans(sub_file[3]) xpos = max(sub_file[,3]) * 0.5 ypos = max(theplot$counts) * 0.75 mytext = paste("Mean submitted DIH= ",sprintf("%.4f", mean_pred)) text(xpos,ypos,mytext,pos=4)
and this is what I get for one of my submissions to date...
and interestingly if you change num_bins to 5,000, you get...
which will lead to some investigation why we get this funny distribution.
No comments:
Post a Comment