Friday 7 October 2011

If you can't measure it, you can't manage it!

"If you can't measure it, you can't manage it!" - this is often quoted in the marketing and analytics world.

Apparently
"It comes from a remark attributed to GALILEO, who said 'Count what is
countable, measure what is measurable. What is not measurable, make
measurable'."

see here for more details on the origins of the quote.

Anyway, the point is that if you don't measure the before and after, you have no idea if the actions made a difference.

Last week in my previous blog post, I made available some code that would result in a score of around 0.4635 on the HHP leaderboard. The hope was that others would be able to get this code to work and re-ignite interest in the HHP and solving the worlds health problems.

When I looked at the leaderboard today, it looked on the face of it that this was happening, with a few high risers around the 0.4635 mark.


This was the after, but the before is not shown, so on this evidence there is a bit of a hole in the claim that the code was responsible for this change - we are only showing the part of the leaderboard that we want to - such movement could be going on all over the place.

Good old Jeff Moser at Kaggle has been hard at work making the leaderboard dynamic, so we can actually go back in time. Knowing this I modified a previous R script I posted to plot the leaderboard at 3 points in time, the day I released the code, and a week before and after (the new script can be found at the end of this post).

This resulted in the following 2 plots...



These are now a lot more convincing that the code did make a difference. If you follow Eu Jin Lok on the first plot, you will see he has improved his score but actually gone backwards in rank over the two week period. The second plot shows that the number of submissions to score 0.4635 has dropped.

Anyway, the point of this post is that data scientist have the power to make massive differences to the bottom lines of companies. I have been there and done it, but failed to get too excited as I am unassuming and knew exactly what the result would be because my models told me in advance. If analytics - and us analysts - are to be taken more seriously and given the recognition we deserve, we need to start banging our own drum more than we do. Those who get the recognition and rewards are not always those who do the best work, but those who shout the loudest!

And here is the R script to generate those plots...



##########################################
# Some R function to plot the Kaggle 
# leaderboard at different points in time
##########################################

plotKaggleLeaderboard <- function(theURL
,theDates
,myTeam
,topX=100
,title
,plottype=1){
    
#this library does all the hard work
#for windows users to install this packages see
#http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe
#and http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/
library(XML)

theColours <- c(552,254,26)

for (i in 1:length(compDates))
{

#go and read the tables from the web page
thisDate <- paste(theURL,"?asOf=",compDates[i],sep="")
tables <- readHTMLTable(thisDate)

#get the table of interest
#names(tables)
lb <- tables[['leaderboard-table']]

#see what the columns are
#colnames(lb)
#nrow(lb)

#convert to numeric - see ?readHTMLTable
#numeric_cols <- c('AUC','Entries')
#numeric_cols <- c('RMSLE','Entries')
numeric_cols <- c(4,5)
lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(".* ", "", as.character(x))))


#extract the team name from a messy field
team_col <- c('Team Name')
#lb[team_col]

#split the field by "\r\n" than denotes the break between fields within a field 
lb[,team_col] <- sapply(strsplit(as.character(lb[,team_col]), "[\r\n]"), function (x) x[1]) 


#####################
#      plot
#####################
myRank <- which(lb[team_col] == myTeam)
myText = paste("following team",myTeam)

error_data <- lb[,4]
entries_data <- lb[,5]

if(plottype==1){
if(i==1)  plot(error_data[1:topX],col = theColours[i],type="l",xlab='Rank',ylab='Error',main = title, sub=myText)
if(i>1) points(error_data[1:topX],col = theColours[i],type="l")
          
#mark position
points(myRank,error_data[myRank],col=theColours[i],pch=19)
}


if(plottype==2){
if(i==1){ plot(error_data[1:topX],entries_data[1:topX],type = 'p'
            ,main = title
            , xlab = 'Error'
            , ylab = 'Entries'
            , col= theColours[i]
    )}

if(i>1) points(error_data[1:topX],entries_data[1:topX],type = 'p', col= theColours[i])

lines(lowess(error_data[1:topX],entries_data[1:topX]), col=theColours[i], lwd=2.5) # lowess line (x,y)

#mark my position
points(error_data[myRank],entries_data[myRank],col=theColours[i],pch=19)
}

}

legend("topleft",legend=theDates,text.col=theColours)

}
###########################
# End of Function
# plotKaggleLeaderboard
###########################



compURL <- "http://www.heritagehealthprize.com/c/hhp/Leaderboard"
compDates <- c('2011-09-24','2011-10-01','2011-10-07')
compTeam <- 'Eu Jin Lok'
compTopX <- 150
compTitle <- 'HHP Leaderboard Through Time'

plotKaggleLeaderboard(theURL = compURL
            ,theDates = compDates
                      ,myTeam = compTeam
                      ,topX = compTopX
                      ,title = compTitle
                       ,plottype=1
                      )









Saturday 1 October 2011

Code for a Respectable HHP Model

Our team Market Makers did well in the HHP first milestone prize, taking out top spot just above Willem Mestrom. Congratulations to all who took part.

For those of you who haven't seen the presentation, there is a recording at http://www.youtube.com/watch?v=xziMY3nDB_E

Reports were written on our methods and are available here...
http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1

Other competitors now have 30 days to read the reports and come back with any concerns.

In our report we included some source code, which is a bit of a pain to copy and paste, so I have reproduced it here. There are two bits of code to run, and at the end a file ready for submission should appear.

The first bit is the data preparation which I did in SQL. I used SQL Server, but it should be easy to run it in any SQL database without much messing around. This took about 8 minutes to run, and at the end is a table ready for use by the second part, which is the modelling performed in R, which takes about 15 minutes.

There are two modifications to the code to make - in the SQL you need to set the correct path to the source files - and in R you need to put in the correct server name so the data can be read directly from the database.

So hopefully in two mouse clicks there will be a file sitting there ready for submission.

If you are not sure about using SQL Server, then the earlier blog posts should point you in the right direction...

http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html

Note that the SQL provided does not generate all the variables we used - others were created, all using SQL, but the code was a little more complicated. The extra variables are described in our write up. Also the parameters used in the GBM modelling are not the most efficient. You will need more trees and a lower shrinkage factor to tease out a better model (see the R documentaion) - but this one runs in a reasonable time.

I am keen to see if this code is useful to anybody - if you can't get it to work then please post a comment hear. If you do use it then I would also like to hear from you!

SQL Code



/**************************************************************
* SQL Code to create an example data set for the HHP
*
* Edit the path in the 'bulk insert' commands to locate
* the source data
* The end result is a table called 'modelling_set' which can
* then be used to build predictive models
*
* created in SQL server express
* http://www.microsoft.com/sqlserver/en/us/editions/express.aspx
*****************************************************************/




/**************************
create a new database
**************************/
CREATE DATABASE HHP_comp
GO
USE HHP_comp




/**************************
load in the raw data
**************************/


--claims
CREATE TABLE Claims
(
MemberID VARCHAR(8) --integers starting with 0, could be text!
, ProviderID VARCHAR(7) --integers starting with 0, could be text!
, Vendor VARCHAR(6) --integers starting with 0, could be text!
, PCP VARCHAR(5) --integers starting with 0, could be text!
, Year VARCHAR(2)
, Specialty VARCHAR(25)
, PlaceSvc VARCHAR(19)
, PayDelay VARCHAR(4)
, LengthOfStay VARCHAR(10)
, DSFS VARCHAR(12)
, PrimaryConditionGroup VARCHAR(8)
, CharlsonIndex VARCHAR(3)
, ProcedureGroup VARCHAR(4)
, SupLOS TINYINT
)


BULK INSERT Claims
FROM 'F:\comps\hhp\raw data\HHP_release2\Claims.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




--members
CREATE TABLE Members
(
MemberID_M VARCHAR(8) --integers starting with 0, could be text!
, AgeAtFirstClaim VARCHAR(5)
, Sex VARCHAR(1)
)


BULK INSERT Members
FROM 'F:\comps\hhp\raw data\HHP_release2\Members.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




-- drug count
CREATE TABLE DrugCount
(
MemberID INT
, Year VARCHAR(2)
, DSFS VARCHAR(12)
, DrugCount VARCHAR(2)
)


BULK INSERT DrugCount
FROM 'F:\comps\hhp\raw data\HHP_release3\DrugCount.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




-- Lab Count
CREATE TABLE LabCount
(
MemberID INT
, Year VARCHAR(2)
, DSFS VARCHAR(12)
, LabCount VARCHAR(3)
)




BULK INSERT LabCount
FROM 'F:\comps\hhp\raw data\HHP_release3\LabCount.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




--DaysInHospital_Y2
CREATE TABLE DaysInHospital_Y2
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)




BULK INSERT DaysInHospital_Y2
FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y2.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




-- DaysInHospital_Y3
CREATE TABLE DaysInHospital_Y3
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)




BULK INSERT DaysInHospital_Y3
FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y3.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)




-- Target
CREATE TABLE Target
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)




BULK INSERT Target
FROM 'F:\comps\hhp\raw data\HHP_release2\Target.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)






/*************************
adjust the claims data to
convert text to integers
**************************/




-- PayDelay
ALTER TABLE Claims
ADD PayDelayI integer
GO


UPDATE Claims
SET PayDelayI = CASE WHEN PayDelay = '162+' THEN 162 ELSE CAST(PayDelay AS integer) END




--dsfs
ALTER TABLE Claims
ADD dsfsI integer
GO


UPDATE Claims
SET dsfsI =
CASE
WHEN dsfs = '0- 1 month' THEN 1
WHEN dsfs = '1- 2 months' THEN 2
WHEN dsfs = '2- 3 months' THEN 3
WHEN dsfs = '3- 4 months' THEN 4
WHEN dsfs = '4- 5 months' THEN 5
WHEN dsfs = '5- 6 months' THEN 6
WHEN dsfs = '6- 7 months' THEN 7
WHEN dsfs = '7- 8 months' THEN 8
WHEN dsfs = '8- 9 months' THEN 9
WHEN dsfs = '9-10 months' THEN 10
WHEN dsfs = '10-11 months' THEN 11
WHEN dsfs = '11-12 months' THEN 12
WHEN dsfs IS NULL THEN NULL
END




-- CharlsonIndex
ALTER TABLE Claims
ADD CharlsonIndexI INTEGER
GO


UPDATE Claims
SET CharlsonIndexI =
CASE
WHEN CharlsonIndex = '0' THEN 0
WHEN CharlsonIndex = '1-2' THEN 2
WHEN CharlsonIndex = '3-4' THEN 4
WHEN CharlsonIndex = '5+' THEN 6
END




-- LengthOfStay
ALTER TABLE Claims
ADD LengthOfStayI INTEGER
GO


UPDATE Claims
SET LengthOfStayI =
CASE
WHEN LengthOfStay = '1 day' THEN 1
WHEN LengthOfStay = '2 days' THEN 2
WHEN LengthOfStay = '3 days' THEN 3
WHEN LengthOfStay = '4 days' THEN 4
WHEN LengthOfStay = '5 days' THEN 5
WHEN LengthOfStay = '6 days' THEN 6
WHEN LengthOfStay = '1- 2 weeks' THEN 11
WHEN LengthOfStay = '2- 4 weeks' THEN 21
WHEN LengthOfStay = '4- 8 weeks' THEN 42
WHEN LengthOfStay = '26+ weeks' THEN 180
WHEN LengthOfStay IS NULL THEN null
END




/**************************
create a summary table
at the member/year level
***************************/
SELECT
year
,Memberid


,COUNT(*) AS no_Claims
,COUNT(DISTINCT ProviderID) AS no_Providers
,COUNT(DISTINCT Vendor) AS no_Vendors
,COUNT(DISTINCT PCP) AS no_PCPs
,COUNT(DISTINCT PlaceSvc) AS no_PlaceSvcs
,COUNT(DISTINCT Specialty) AS no_Specialities
,COUNT(DISTINCT PrimaryConditionGroup) AS no_PrimaryConditionGroups
,COUNT(DISTINCT ProcedureGroup) AS no_ProcedureGroups


,MAX(PayDelayI) AS PayDelay_max
,MIN(PayDelayI) AS PayDelay_min
,AVG(PayDelayI) AS PayDelay_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(PayDelayI) END) AS PayDelay_stdev


,MAX(LengthOfStayI) AS LOS_max
,MIN(LengthOfStayI) AS LOS_min
,AVG(LengthOfStayI) AS LOS_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(LengthOfStayI) END) AS LOS_stdev


,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 0 THEN 1 ELSE 0 END) AS LOS_TOT_UNKNOWN
,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 1 THEN 1 ELSE 0 END) AS LOS_TOT_SUPRESSED
,SUM(CASE WHEN LENGTHOFSTAY IS NOT NULL THEN 1 ELSE 0 END) AS LOS_TOT_KNOWN


,MAX(dsfsI) AS dsfs_max
,MIN(dsfsI) AS dsfs_min
,MAX(dsfsI) - MIN(dsfsI) AS dsfs_range
,AVG(dsfsI) AS dsfs_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(dsfsI) END) AS dsfs_stdev


,MAX(CharlsonIndexI) AS CharlsonIndexI_max
,MIN(CharlsonIndexI) AS CharlsonIndexI_min
,AVG(CharlsonIndexI) AS CharlsonIndexI_ave
,MAX(CharlsonIndexI) - MIN(CharlsonIndexI) AS CharlsonIndexI_range
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(CharlsonIndexI) END) AS CharlsonIndexI_stdev




,SUM(CASE WHEN PrimaryConditionGroup = 'MSC2a3' THEN 1 ELSE 0 END) AS pcg1
,SUM(CASE WHEN PrimaryConditionGroup = 'METAB3' THEN 1 ELSE 0 END) AS pcg2
,SUM(CASE WHEN PrimaryConditionGroup = 'ARTHSPIN' THEN 1 ELSE 0 END) AS pcg3
,SUM(CASE WHEN PrimaryConditionGroup = 'NEUMENT' THEN 1 ELSE 0 END) AS pcg4
,SUM(CASE WHEN PrimaryConditionGroup = 'RESPR4' THEN 1 ELSE 0 END) AS pcg5
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCHRT' THEN 1 ELSE 0 END) AS pcg6
,SUM(CASE WHEN PrimaryConditionGroup = 'SKNAUT' THEN 1 ELSE 0 END) AS pcg7
,SUM(CASE WHEN PrimaryConditionGroup = 'GIBLEED' THEN 1 ELSE 0 END) AS pcg8
,SUM(CASE WHEN PrimaryConditionGroup = 'INFEC4' THEN 1 ELSE 0 END) AS pcg9
,SUM(CASE WHEN PrimaryConditionGroup = 'TRAUMA' THEN 1 ELSE 0 END) AS pcg10
,SUM(CASE WHEN PrimaryConditionGroup = 'HEART2' THEN 1 ELSE 0 END) AS pcg11
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL3' THEN 1 ELSE 0 END) AS pcg12
,SUM(CASE WHEN PrimaryConditionGroup = 'ROAMI' THEN 1 ELSE 0 END) AS pcg13
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL5' THEN 1 ELSE 0 END) AS pcg14
,SUM(CASE WHEN PrimaryConditionGroup = 'ODaBNCA' THEN 1 ELSE 0 END) AS pcg15
,SUM(CASE WHEN PrimaryConditionGroup = 'UTI' THEN 1 ELSE 0 END) AS pcg16
,SUM(CASE WHEN PrimaryConditionGroup = 'COPD' THEN 1 ELSE 0 END) AS pcg17
,SUM(CASE WHEN PrimaryConditionGroup = 'GYNEC1' THEN 1 ELSE 0 END) AS pcg18
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRB' THEN 1 ELSE 0 END) AS pcg19
,SUM(CASE WHEN PrimaryConditionGroup = 'FXDISLC' THEN 1 ELSE 0 END) AS pcg20
,SUM(CASE WHEN PrimaryConditionGroup = 'AMI' THEN 1 ELSE 0 END) AS pcg21
,SUM(CASE WHEN PrimaryConditionGroup = 'PRGNCY' THEN 1 ELSE 0 END) AS pcg22
,SUM(CASE WHEN PrimaryConditionGroup = 'HEMTOL' THEN 1 ELSE 0 END) AS pcg23
,SUM(CASE WHEN PrimaryConditionGroup = 'HEART4' THEN 1 ELSE 0 END) AS pcg24
,SUM(CASE WHEN PrimaryConditionGroup = 'SEIZURE' THEN 1 ELSE 0 END) AS pcg25
,SUM(CASE WHEN PrimaryConditionGroup = 'APPCHOL' THEN 1 ELSE 0 END) AS pcg26
,SUM(CASE WHEN PrimaryConditionGroup = 'CHF' THEN 1 ELSE 0 END) AS pcg27
,SUM(CASE WHEN PrimaryConditionGroup = 'GYNECA' THEN 1 ELSE 0 END) AS pcg28
,SUM(CASE WHEN PrimaryConditionGroup IS NULL THEN 1 ELSE 0 END) AS pcg29
,SUM(CASE WHEN PrimaryConditionGroup = 'PNEUM' THEN 1 ELSE 0 END) AS pcg30
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL2' THEN 1 ELSE 0 END) AS pcg31
,SUM(CASE WHEN PrimaryConditionGroup = 'GIOBSENT' THEN 1 ELSE 0 END) AS pcg32
,SUM(CASE WHEN PrimaryConditionGroup = 'STROKE' THEN 1 ELSE 0 END) AS pcg33
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRA' THEN 1 ELSE 0 END) AS pcg34
,SUM(CASE WHEN PrimaryConditionGroup = 'FLaELEC' THEN 1 ELSE 0 END) AS pcg35
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL1' THEN 1 ELSE 0 END) AS pcg36
,SUM(CASE WHEN PrimaryConditionGroup = 'HIPFX' THEN 1 ELSE 0 END) AS pcg37
,SUM(CASE WHEN PrimaryConditionGroup = 'METAB1' THEN 1 ELSE 0 END) AS pcg38
,SUM(CASE WHEN PrimaryConditionGroup = 'PERVALV' THEN 1 ELSE 0 END) AS pcg39
,SUM(CASE WHEN PrimaryConditionGroup = 'LIVERDZ' THEN 1 ELSE 0 END) AS pcg40
,SUM(CASE WHEN PrimaryConditionGroup = 'CATAST' THEN 1 ELSE 0 END) AS pcg41
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRM' THEN 1 ELSE 0 END) AS pcg42
,SUM(CASE WHEN PrimaryConditionGroup = 'PERINTL' THEN 1 ELSE 0 END) AS pcg43
,SUM(CASE WHEN PrimaryConditionGroup = 'PNCRDZ' THEN 1 ELSE 0 END) AS pcg44
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL1' THEN 1 ELSE 0 END) AS pcg45
,SUM(CASE WHEN PrimaryConditionGroup = 'SEPSIS' THEN 1 ELSE 0 END) AS pcg46


,SUM(CASE WHEN Specialty = 'Internal' THEN 1 ELSE 0 END) AS sp1
,SUM(CASE WHEN Specialty = 'Laboratory' THEN 1 ELSE 0 END) AS sp2
,SUM(CASE WHEN Specialty = 'General Practice' THEN 1 ELSE 0 END) AS sp3
,SUM(CASE WHEN Specialty = 'Surgery' THEN 1 ELSE 0 END) AS sp4
,SUM(CASE WHEN Specialty = 'Diagnostic Imaging' THEN 1 ELSE 0 END) AS sp5
,SUM(CASE WHEN Specialty = 'Emergency' THEN 1 ELSE 0 END) AS sp6
,SUM(CASE WHEN Specialty = 'Other' THEN 1 ELSE 0 END) AS sp7
,SUM(CASE WHEN Specialty = 'Pediatrics' THEN 1 ELSE 0 END) AS sp8
,SUM(CASE WHEN Specialty = 'Rehabilitation' THEN 1 ELSE 0 END) AS sp9
,SUM(CASE WHEN Specialty = 'Obstetrics and Gynecology' THEN 1 ELSE 0 END) AS sp10
,SUM(CASE WHEN Specialty = 'Anesthesiology' THEN 1 ELSE 0 END) AS sp11
,SUM(CASE WHEN Specialty = 'Pathology' THEN 1 ELSE 0 END) AS sp12
,SUM(CASE WHEN Specialty IS NULL THEN 1 ELSE 0 END) AS sp13


,SUM(CASE WHEN ProcedureGroup = 'EM' THEN 1 ELSE 0 END ) AS pg1
,SUM(CASE WHEN ProcedureGroup = 'PL' THEN 1 ELSE 0 END ) AS pg2
,SUM(CASE WHEN ProcedureGroup = 'MED' THEN 1 ELSE 0 END ) AS pg3
,SUM(CASE WHEN ProcedureGroup = 'SCS' THEN 1 ELSE 0 END ) AS pg4
,SUM(CASE WHEN ProcedureGroup = 'RAD' THEN 1 ELSE 0 END ) AS pg5
,SUM(CASE WHEN ProcedureGroup = 'SDS' THEN 1 ELSE 0 END ) AS pg6
,SUM(CASE WHEN ProcedureGroup = 'SIS' THEN 1 ELSE 0 END ) AS pg7
,SUM(CASE WHEN ProcedureGroup = 'SMS' THEN 1 ELSE 0 END ) AS pg8
,SUM(CASE WHEN ProcedureGroup = 'ANES' THEN 1 ELSE 0 END ) AS pg9
,SUM(CASE WHEN ProcedureGroup = 'SGS' THEN 1 ELSE 0 END ) AS pg10
,SUM(CASE WHEN ProcedureGroup = 'SEOA' THEN 1 ELSE 0 END ) AS pg11
,SUM(CASE WHEN ProcedureGroup = 'SRS' THEN 1 ELSE 0 END ) AS pg12
,SUM(CASE WHEN ProcedureGroup = 'SNS' THEN 1 ELSE 0 END ) AS pg13
,SUM(CASE WHEN ProcedureGroup = 'SAS' THEN 1 ELSE 0 END ) AS pg14
,SUM(CASE WHEN ProcedureGroup = 'SUS' THEN 1 ELSE 0 END ) AS pg15
,SUM(CASE WHEN ProcedureGroup IS NULL THEN 1 ELSE 0 END ) AS pg16
,SUM(CASE WHEN ProcedureGroup = 'SMCD' THEN 1 ELSE 0 END ) AS pg17
,SUM(CASE WHEN ProcedureGroup = 'SO' THEN 1 ELSE 0 END ) AS pg18


,SUM(CASE WHEN PlaceSvc = 'Office' THEN 1 ELSE 0 END) AS ps1
,SUM(CASE WHEN PlaceSvc = 'Independent Lab' THEN 1 ELSE 0 END) AS ps2
,SUM(CASE WHEN PlaceSvc = 'Urgent Care' THEN 1 ELSE 0 END) AS ps3
,SUM(CASE WHEN PlaceSvc = 'Outpatient Hospital' THEN 1 ELSE 0 END) AS ps4
,SUM(CASE WHEN PlaceSvc = 'Inpatient Hospital' THEN 1 ELSE 0 END) AS ps5
,SUM(CASE WHEN PlaceSvc = 'Ambulance' THEN 1 ELSE 0 END) AS ps6
,SUM(CASE WHEN PlaceSvc = 'Other' THEN 1 ELSE 0 END) AS ps7
,SUM(CASE WHEN PlaceSvc = 'Home' THEN 1 ELSE 0 END) AS ps8
,SUM(CASE WHEN PlaceSvc IS NULL THEN 1 ELSE 0 END) AS ps9


INTO claims_per_member
FROM Claims
GROUP BY year,Memberid


-- remove some nulls
UPDATE claims_per_member
SET LOS_max = 0 WHERE LOS_max IS NULL


UPDATE claims_per_member
SET LOS_min = 0 WHERE LOS_min IS NULL


UPDATE claims_per_member
SET LOS_ave = 0 WHERE LOS_ave IS NULL


UPDATE claims_per_member
SET LOS_stdev = -1 WHERE LOS_stdev IS NULL


UPDATE claims_per_member
SET dsfs_max = 0 WHERE dsfs_max IS NULL


UPDATE claims_per_member
SET dsfs_min = 0 WHERE dsfs_min IS NULL


UPDATE claims_per_member
SET dsfs_ave = 0 WHERE dsfs_ave IS NULL


UPDATE claims_per_member
SET dsfs_stdev = -1 WHERE dsfs_stdev IS NULL


UPDATE claims_per_member
SET dsfs_range = -1 WHERE dsfs_range IS NULL


UPDATE claims_per_member
SET CharlsonIndexI_range = -1 WHERE CharlsonIndexI_range IS NULL






/***********************************
Members
***********************************/


-- create binary flags for age
ALTER TABLE Members ADD age_05 INT
ALTER TABLE Members ADD age_15 INT
ALTER TABLE Members ADD age_25 INT
ALTER TABLE Members ADD age_35 INT
ALTER TABLE Members ADD age_45 INT
ALTER TABLE Members ADD age_55 INT
ALTER TABLE Members ADD age_65 INT
ALTER TABLE Members ADD age_75 INT
ALTER TABLE Members ADD age_85 INT
ALTER TABLE Members ADD age_MISS INT


GO


UPDATE Members SET age_05 = CASE WHEN ageATfirstclaim = '0-9' THEN 1 ELSE 0 END
UPDATE Members SET age_15 = CASE WHEN ageATfirstclaim = '10-19' THEN 1 ELSE 0 END
UPDATE Members SET age_25 = CASE WHEN ageATfirstclaim = '20-29' THEN 1 ELSE 0 END
UPDATE Members SET age_35 = CASE WHEN ageATfirstclaim = '30-39' THEN 1 ELSE 0 END
UPDATE Members SET age_45 = CASE WHEN ageATfirstclaim = '40-49' THEN 1 ELSE 0 END
UPDATE Members SET age_55 = CASE WHEN ageATfirstclaim = '50-59' THEN 1 ELSE 0 END
UPDATE Members SET age_65 = CASE WHEN ageATfirstclaim = '60-69' THEN 1 ELSE 0 END
UPDATE Members SET age_75 = CASE WHEN ageATfirstclaim = '70-79' THEN 1 ELSE 0 END
UPDATE Members SET age_85 = CASE WHEN ageATfirstclaim = '80+' THEN 1 ELSE 0 END
UPDATE Members SET age_MISS = CASE WHEN ageATfirstclaim IS NULL THEN 1 ELSE 0 END




--create binary flags for sex
ALTER TABLE Members
ADD sexMALE INT
GO


UPDATE Members
SET SexMALE =
CASE
WHEN Sex = 'M' THEN 1 ELSE 0
END




ALTER TABLE Members
ADD sexFEMALE INT
GO


UPDATE Members
SET SexFEMALE =
CASE
WHEN Sex = 'F' THEN 1 ELSE 0
END




ALTER TABLE Members
ADD sexMISS INT
GO


UPDATE Members
SET SexMISS =
CASE
WHEN Sex IS NULL THEN 1 ELSE 0
END






/******************
DRUG COUNTS
******************/


-- convert to integers
ALTER TABLE drugcount ADD DrugCountI INT
GO
UPDATE DRUGCOUNT
SET DrugCountI =
CASE WHEN DrugCount = '7+' THEN 7 ELSE DrugCount END




SELECT
memberID AS memberID_dc
,Year AS YEAR_dc
,MAX(drugcountI) AS drugCount_max
,MIN(drugcountI) AS drugCount_min
,AVG(drugcountI * 1.0) AS drugCount_ave
,COUNT(*) AS drugcount_months
INTO DRUGCOUNT_SUMMARY
FROM
drugcount
GROUP BY
memberID
,Year






/******************
LAB COUNTS
******************/


-- convert to integers
ALTER TABLE LabCount ADD LabCountI INT
GO
UPDATE LabCount
SET LabCountI =
CASE WHEN LabCount = '10+' THEN 10 ELSE LabCount END


SELECT
memberID AS memberID_lc
,Year AS YEAR_lc
,MAX(labcountI) AS labCount_max
,MIN(labcountI) AS labCount_min
,AVG(labcountI * 1.0) AS labCount_ave
,COUNT(*) AS labcount_months
INTO LABCOUNT_SUMMARY
FROM
labcount
GROUP BY
memberID
,Year




/********************************
Targets
********************************/


SELECT *
INTO DIH
FROM
(
SELECT
MemberID AS MemberID_t
,'Y1' AS YEAR_t
,ClaimsTruncated
,DaysInHospital
,1 AS trainset
FROM DaysInHospital_Y2


UNION ALL


SELECT
MemberID AS MemberID_t
,'Y2' AS YEAR_t
,ClaimsTruncated
,DaysInHospital
,1 AS trainset
FROM DaysInHospital_Y3


UNION ALL


SELECT
MemberID AS MemberID_t
,'Y3' AS YEAR_t
,ClaimsTruncated
,null AS DaysInHospital
,0 AS trainset
FROM Target
) a






/*****************************
Now merge them all together to
create the modeling data SET
******************************/
SELECT a.*,b.*
INTO #temp1
FROM
DIH a
LEFT OUTER JOIN
members b
on a.MemberID_t = B.Memberid_M


ALTER TABLE #temp1 DROP COLUMN Memberid_M
ALTER TABLE #temp1 DROP COLUMN AgeAtFirstClaim
ALTER TABLE #temp1 DROP COLUMN Sex
GO






SELECT a.*,b.*
INTO #temp2
FROM
#temp1 a
LEFT OUTER JOIN
claims_per_member b
on a.MemberID_t = B.Memberid
AND a.YEAR_t = b.year


ALTER TABLE #temp2 DROP COLUMN Memberid
ALTER TABLE #temp2 DROP COLUMN year
GO




SELECT a.*,b.*
INTO #temp3
FROM
#temp2 a
LEFT OUTER JOIN
DRUGCOUNT_SUMMARY b
on a.MemberID_t = B.Memberid_dc
AND a.YEAR_t = b.YEAR_dc


ALTER TABLE #temp3 DROP COLUMN Memberid_dc
ALTER TABLE #temp3 DROP COLUMN YEAR_dc
GO






SELECT a.*,b.*
INTO #temp4
FROM
#temp3 a
LEFT OUTER JOIN
LABCOUNT_SUMMARY b
on a.MemberID_t = B.Memberid_lc
AND a.YEAR_t = b.YEAR_lc


ALTER TABLE #temp4 DROP COLUMN Memberid_lc
ALTER TABLE #temp4 DROP COLUMN YEAR_lc
GO






-- removel nulls for those who had
-- no lab or drug information
ALTER TABLE #temp4 ADD labNull INT
ALTER TABLE #temp4 ADD drugNull INT
GO


UPDATE #temp4 SET labNull = 0
UPDATE #temp4 SET labNull = 1 WHERE labCount_max IS NULL


UPDATE #temp4 SET drugNull = 0
UPDATE #temp4 SET drugNull = 1 WHERE drugCount_max IS NULL


UPDATE #temp4 SET labCount_max = 0 WHERE labCount_max IS NULL
UPDATE #temp4 SET labCount_min = 0 WHERE labCount_min IS NULL
UPDATE #temp4 SET labCount_ave = 0 WHERE labCount_ave IS NULL
UPDATE #temp4 SET labcount_months = 0 WHERE labcount_months IS NULL


UPDATE #temp4 SET drugCount_max = 0 WHERE drugCount_max IS NULL
UPDATE #temp4 SET drugCount_min = 0 WHERE drugCount_min IS NULL
UPDATE #temp4 SET drugCount_ave = 0 WHERE drugCount_ave IS NULL
UPDATE #temp4 SET drugcount_months = 0 WHERE drugcount_months IS NULL




SELECT *
INTO modelling_set
FROM #temp4




R code



########################################
# Example GBM model for HHP
# scores ~ 0.4635 on leaderboard
# which would be 55th position of 510
# as at 9th Sept 2011
#
# Requires the data having been prepared
# using the SQL supplied
#
########################################

starttime <- proc.time()

########################################
#load the data
########################################
library(RODBC)

#set a connection to the database 
conn <- odbcDriverConnect("driver=SQL Server;database=HHP_comp;server=servername\\SQLEXPRESS;")

#or this method involves setting up a DSN (Data Source Name) called HHP_comp
#conn <- odbcConnect("HHP_comp")

alldata <- sqlQuery(conn,"select * from modelling_set")


########################################
# arrange the data
########################################

#identify train and leaderboard data
trainrows <- which(alldata$trainset == 1)
scorerows <- which(alldata$trainset == 0)

#sanity check the size of each set
length(trainrows)
length(scorerows)

#display the column names
colnames(alldata)

#memberid is required as key for submission set
memberid <- alldata[scorerows,'MemberID_t']

#remove redundant fields
alldata$MemberID_t <- NULL
alldata$YEAR_t <- NULL
alldata$trainset <- NULL

#target - what we are predicting
theTarget <- 'DaysInHospital'

#put the target on the log scale
alldata[trainrows,theTarget] <- log1p(alldata[trainrows,theTarget]) 

#find the position of the target
targindex <-  which(names(alldata)==theTarget)


########################################
# build the model
########################################

#GBM model settings, these can be varied
GBM_NTREES = 500
GBM_SHRINKAGE = 0.05
GBM_DEPTH = 4
GBM_MINOBS = 50

#build the GBM model
library(gbm)
GBM_model <- gbm.fit(
             x = alldata[trainrows,-targindex]
            ,y = alldata[trainrows,targindex]
            ,distribution = "gaussian"
            ,n.trees = GBM_NTREES
            ,shrinkage = GBM_SHRINKAGE
            ,interaction.depth = GBM_DEPTH
            ,n.minobsinnode = GBM_MINOBS
            ,verbose = TRUE) 

#list variable importance
summary(GBM_model,GBM_NTREES)

#predict for the leaderboard data
prediction <- predict.gbm(object = GBM_model
              ,newdata = alldata[scorerows,-targindex]
              ,GBM_NTREES)

#put on correct scale and cap
prediction <- expm1(prediction)
prediction <- pmin(15,prediction)
prediction <- pmax(0,prediction)

#plot the submission distribution
hist(prediction, breaks=500)


########################################
#write the submission to file
########################################
submission <- cbind(memberid,prediction)
colnames(submission) <- c("MemberID","DaysInHospital")
fnname <- "C:\\GBM_demo1.csv"
write.csv(submission, file=fnname, row.names = FALSE)

elapsedtime <- proc.time() - starttime
cat("\nFinished\n",elapsedtime)