Another Data Mining Blog: Code for a Respectable HHP Model

Our team Market Makers did well in the HHP first milestone prize, taking out top spot just above Willem Mestrom. Congratulations to all who took part.

For those of you who haven't seen the presentation, there is a recording at http://www.youtube.com/watch?v=xziMY3nDB_E

Reports were written on our methods and are available here...
http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1

Other competitors now have 30 days to read the reports and come back with any concerns.

In our report we included some source code, which is a bit of a pain to copy and paste, so I have reproduced it here. There are two bits of code to run, and at the end a file ready for submission should appear.

The first bit is the data preparation which I did in SQL. I used SQL Server, but it should be easy to run it in any SQL database without much messing around. This took about 8 minutes to run, and at the end is a table ready for use by the second part, which is the modelling performed in R, which takes about 15 minutes.

There are two modifications to the code to make - in the SQL you need to set the correct path to the source files - and in R you need to put in the correct server name so the data can be read directly from the database.

So hopefully in two mouse clicks there will be a file sitting there ready for submission.

If you are not sure about using SQL Server, then the earlier blog posts should point you in the right direction...

http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html

Note that the SQL provided does not generate all the variables we used - others were created, all using SQL, but the code was a little more complicated. The extra variables are described in our write up. Also the parameters used in the GBM modelling are not the most efficient. You will need more trees and a lower shrinkage factor to tease out a better model (see the R documentaion) - but this one runs in a reasonable time.

I am keen to see if this code is useful to anybody - if you can't get it to work then please post a comment hear. If you do use it then I would also like to hear from you!

SQL Code

/**************************************************************
* SQL Code to create an example data set for the HHP
*
* Edit the path in the 'bulk insert' commands to locate
* the source data
* The end result is a table called 'modelling_set' which can
* then be used to build predictive models
*
* created in SQL server express
* http://www.microsoft.com/sqlserver/en/us/editions/express.aspx
*****************************************************************/

/**************************
create a new database
**************************/
CREATE DATABASE HHP_comp
GO
USE HHP_comp

/**************************
load in the raw data
**************************/

--claims
CREATE TABLE Claims
(
MemberID VARCHAR(8) --integers starting with 0, could be text!
, ProviderID VARCHAR(7) --integers starting with 0, could be text!
, Vendor VARCHAR(6) --integers starting with 0, could be text!
, PCP VARCHAR(5) --integers starting with 0, could be text!
, Year VARCHAR(2)
, Specialty VARCHAR(25)
, PlaceSvc VARCHAR(19)
, PayDelay VARCHAR(4)
, LengthOfStay VARCHAR(10)
, DSFS VARCHAR(12)
, PrimaryConditionGroup VARCHAR(8)
, CharlsonIndex VARCHAR(3)
, ProcedureGroup VARCHAR(4)
, SupLOS TINYINT
)

BULK INSERT Claims
FROM 'F:\comps\hhp\raw data\HHP_release2\Claims.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

--members
CREATE TABLE Members
(
MemberID_M VARCHAR(8) --integers starting with 0, could be text!
, AgeAtFirstClaim VARCHAR(5)
, Sex VARCHAR(1)
)

BULK INSERT Members
FROM 'F:\comps\hhp\raw data\HHP_release2\Members.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

-- drug count
CREATE TABLE DrugCount
(
MemberID INT
, Year VARCHAR(2)
, DSFS VARCHAR(12)
, DrugCount VARCHAR(2)
)

BULK INSERT DrugCount
FROM 'F:\comps\hhp\raw data\HHP_release3\DrugCount.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

-- Lab Count
CREATE TABLE LabCount
(
MemberID INT
, Year VARCHAR(2)
, DSFS VARCHAR(12)
, LabCount VARCHAR(3)
)

BULK INSERT LabCount
FROM 'F:\comps\hhp\raw data\HHP_release3\LabCount.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

--DaysInHospital_Y2
CREATE TABLE DaysInHospital_Y2
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)

BULK INSERT DaysInHospital_Y2
FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y2.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

-- DaysInHospital_Y3
CREATE TABLE DaysInHospital_Y3
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)

BULK INSERT DaysInHospital_Y3
FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y3.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

-- Target
CREATE TABLE Target
(
MemberID INT
, ClaimsTruncated TINYINT
, DaysInHospital TINYINT
)

BULK INSERT Target
FROM 'F:\comps\hhp\raw data\HHP_release2\Target.csv'
WITH
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)

/*************************
adjust the claims data to
convert text to integers
**************************/

-- PayDelay
ALTER TABLE Claims
ADD PayDelayI integer
GO

UPDATE Claims
SET PayDelayI = CASE WHEN PayDelay = '162+' THEN 162 ELSE CAST(PayDelay AS integer) END

--dsfs
ALTER TABLE Claims
ADD dsfsI integer
GO

UPDATE Claims
SET dsfsI =
CASE
WHEN dsfs = '0- 1 month' THEN 1
WHEN dsfs = '1- 2 months' THEN 2
WHEN dsfs = '2- 3 months' THEN 3
WHEN dsfs = '3- 4 months' THEN 4
WHEN dsfs = '4- 5 months' THEN 5
WHEN dsfs = '5- 6 months' THEN 6
WHEN dsfs = '6- 7 months' THEN 7
WHEN dsfs = '7- 8 months' THEN 8
WHEN dsfs = '8- 9 months' THEN 9
WHEN dsfs = '9-10 months' THEN 10
WHEN dsfs = '10-11 months' THEN 11
WHEN dsfs = '11-12 months' THEN 12
WHEN dsfs IS NULL THEN NULL
END

-- CharlsonIndex
ALTER TABLE Claims
ADD CharlsonIndexI INTEGER
GO

UPDATE Claims
SET CharlsonIndexI =
CASE
WHEN CharlsonIndex = '0' THEN 0
WHEN CharlsonIndex = '1-2' THEN 2
WHEN CharlsonIndex = '3-4' THEN 4
WHEN CharlsonIndex = '5+' THEN 6
END

-- LengthOfStay
ALTER TABLE Claims
ADD LengthOfStayI INTEGER
GO

UPDATE Claims
SET LengthOfStayI =
CASE
WHEN LengthOfStay = '1 day' THEN 1
WHEN LengthOfStay = '2 days' THEN 2
WHEN LengthOfStay = '3 days' THEN 3
WHEN LengthOfStay = '4 days' THEN 4
WHEN LengthOfStay = '5 days' THEN 5
WHEN LengthOfStay = '6 days' THEN 6
WHEN LengthOfStay = '1- 2 weeks' THEN 11
WHEN LengthOfStay = '2- 4 weeks' THEN 21
WHEN LengthOfStay = '4- 8 weeks' THEN 42
WHEN LengthOfStay = '26+ weeks' THEN 180
WHEN LengthOfStay IS NULL THEN null
END

/**************************
create a summary table
at the member/year level
***************************/
SELECT
year
,Memberid

,COUNT(*) AS no_Claims
,COUNT(DISTINCT ProviderID) AS no_Providers
,COUNT(DISTINCT Vendor) AS no_Vendors
,COUNT(DISTINCT PCP) AS no_PCPs
,COUNT(DISTINCT PlaceSvc) AS no_PlaceSvcs
,COUNT(DISTINCT Specialty) AS no_Specialities
,COUNT(DISTINCT PrimaryConditionGroup) AS no_PrimaryConditionGroups
,COUNT(DISTINCT ProcedureGroup) AS no_ProcedureGroups

,MAX(PayDelayI) AS PayDelay_max
,MIN(PayDelayI) AS PayDelay_min
,AVG(PayDelayI) AS PayDelay_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(PayDelayI) END) AS PayDelay_stdev

,MAX(LengthOfStayI) AS LOS_max
,MIN(LengthOfStayI) AS LOS_min
,AVG(LengthOfStayI) AS LOS_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(LengthOfStayI) END) AS LOS_stdev

,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 0 THEN 1 ELSE 0 END) AS LOS_TOT_UNKNOWN
,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 1 THEN 1 ELSE 0 END) AS LOS_TOT_SUPRESSED
,SUM(CASE WHEN LENGTHOFSTAY IS NOT NULL THEN 1 ELSE 0 END) AS LOS_TOT_KNOWN

,MAX(dsfsI) AS dsfs_max
,MIN(dsfsI) AS dsfs_min
,MAX(dsfsI) - MIN(dsfsI) AS dsfs_range
,AVG(dsfsI) AS dsfs_ave
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(dsfsI) END) AS dsfs_stdev

,MAX(CharlsonIndexI) AS CharlsonIndexI_max
,MIN(CharlsonIndexI) AS CharlsonIndexI_min
,AVG(CharlsonIndexI) AS CharlsonIndexI_ave
,MAX(CharlsonIndexI) - MIN(CharlsonIndexI) AS CharlsonIndexI_range
,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(CharlsonIndexI) END) AS CharlsonIndexI_stdev

,SUM(CASE WHEN PrimaryConditionGroup = 'MSC2a3' THEN 1 ELSE 0 END) AS pcg1
,SUM(CASE WHEN PrimaryConditionGroup = 'METAB3' THEN 1 ELSE 0 END) AS pcg2
,SUM(CASE WHEN PrimaryConditionGroup = 'ARTHSPIN' THEN 1 ELSE 0 END) AS pcg3
,SUM(CASE WHEN PrimaryConditionGroup = 'NEUMENT' THEN 1 ELSE 0 END) AS pcg4
,SUM(CASE WHEN PrimaryConditionGroup = 'RESPR4' THEN 1 ELSE 0 END) AS pcg5
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCHRT' THEN 1 ELSE 0 END) AS pcg6
,SUM(CASE WHEN PrimaryConditionGroup = 'SKNAUT' THEN 1 ELSE 0 END) AS pcg7
,SUM(CASE WHEN PrimaryConditionGroup = 'GIBLEED' THEN 1 ELSE 0 END) AS pcg8
,SUM(CASE WHEN PrimaryConditionGroup = 'INFEC4' THEN 1 ELSE 0 END) AS pcg9
,SUM(CASE WHEN PrimaryConditionGroup = 'TRAUMA' THEN 1 ELSE 0 END) AS pcg10
,SUM(CASE WHEN PrimaryConditionGroup = 'HEART2' THEN 1 ELSE 0 END) AS pcg11
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL3' THEN 1 ELSE 0 END) AS pcg12
,SUM(CASE WHEN PrimaryConditionGroup = 'ROAMI' THEN 1 ELSE 0 END) AS pcg13
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL5' THEN 1 ELSE 0 END) AS pcg14
,SUM(CASE WHEN PrimaryConditionGroup = 'ODaBNCA' THEN 1 ELSE 0 END) AS pcg15
,SUM(CASE WHEN PrimaryConditionGroup = 'UTI' THEN 1 ELSE 0 END) AS pcg16
,SUM(CASE WHEN PrimaryConditionGroup = 'COPD' THEN 1 ELSE 0 END) AS pcg17
,SUM(CASE WHEN PrimaryConditionGroup = 'GYNEC1' THEN 1 ELSE 0 END) AS pcg18
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRB' THEN 1 ELSE 0 END) AS pcg19
,SUM(CASE WHEN PrimaryConditionGroup = 'FXDISLC' THEN 1 ELSE 0 END) AS pcg20
,SUM(CASE WHEN PrimaryConditionGroup = 'AMI' THEN 1 ELSE 0 END) AS pcg21
,SUM(CASE WHEN PrimaryConditionGroup = 'PRGNCY' THEN 1 ELSE 0 END) AS pcg22
,SUM(CASE WHEN PrimaryConditionGroup = 'HEMTOL' THEN 1 ELSE 0 END) AS pcg23
,SUM(CASE WHEN PrimaryConditionGroup = 'HEART4' THEN 1 ELSE 0 END) AS pcg24
,SUM(CASE WHEN PrimaryConditionGroup = 'SEIZURE' THEN 1 ELSE 0 END) AS pcg25
,SUM(CASE WHEN PrimaryConditionGroup = 'APPCHOL' THEN 1 ELSE 0 END) AS pcg26
,SUM(CASE WHEN PrimaryConditionGroup = 'CHF' THEN 1 ELSE 0 END) AS pcg27
,SUM(CASE WHEN PrimaryConditionGroup = 'GYNECA' THEN 1 ELSE 0 END) AS pcg28
,SUM(CASE WHEN PrimaryConditionGroup IS NULL THEN 1 ELSE 0 END) AS pcg29
,SUM(CASE WHEN PrimaryConditionGroup = 'PNEUM' THEN 1 ELSE 0 END) AS pcg30
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL2' THEN 1 ELSE 0 END) AS pcg31
,SUM(CASE WHEN PrimaryConditionGroup = 'GIOBSENT' THEN 1 ELSE 0 END) AS pcg32
,SUM(CASE WHEN PrimaryConditionGroup = 'STROKE' THEN 1 ELSE 0 END) AS pcg33
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRA' THEN 1 ELSE 0 END) AS pcg34
,SUM(CASE WHEN PrimaryConditionGroup = 'FLaELEC' THEN 1 ELSE 0 END) AS pcg35
,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL1' THEN 1 ELSE 0 END) AS pcg36
,SUM(CASE WHEN PrimaryConditionGroup = 'HIPFX' THEN 1 ELSE 0 END) AS pcg37
,SUM(CASE WHEN PrimaryConditionGroup = 'METAB1' THEN 1 ELSE 0 END) AS pcg38
,SUM(CASE WHEN PrimaryConditionGroup = 'PERVALV' THEN 1 ELSE 0 END) AS pcg39
,SUM(CASE WHEN PrimaryConditionGroup = 'LIVERDZ' THEN 1 ELSE 0 END) AS pcg40
,SUM(CASE WHEN PrimaryConditionGroup = 'CATAST' THEN 1 ELSE 0 END) AS pcg41
,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRM' THEN 1 ELSE 0 END) AS pcg42
,SUM(CASE WHEN PrimaryConditionGroup = 'PERINTL' THEN 1 ELSE 0 END) AS pcg43
,SUM(CASE WHEN PrimaryConditionGroup = 'PNCRDZ' THEN 1 ELSE 0 END) AS pcg44
,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL1' THEN 1 ELSE 0 END) AS pcg45
,SUM(CASE WHEN PrimaryConditionGroup = 'SEPSIS' THEN 1 ELSE 0 END) AS pcg46

,SUM(CASE WHEN Specialty = 'Internal' THEN 1 ELSE 0 END) AS sp1
,SUM(CASE WHEN Specialty = 'Laboratory' THEN 1 ELSE 0 END) AS sp2
,SUM(CASE WHEN Specialty = 'General Practice' THEN 1 ELSE 0 END) AS sp3
,SUM(CASE WHEN Specialty = 'Surgery' THEN 1 ELSE 0 END) AS sp4
,SUM(CASE WHEN Specialty = 'Diagnostic Imaging' THEN 1 ELSE 0 END) AS sp5
,SUM(CASE WHEN Specialty = 'Emergency' THEN 1 ELSE 0 END) AS sp6
,SUM(CASE WHEN Specialty = 'Other' THEN 1 ELSE 0 END) AS sp7
,SUM(CASE WHEN Specialty = 'Pediatrics' THEN 1 ELSE 0 END) AS sp8
,SUM(CASE WHEN Specialty = 'Rehabilitation' THEN 1 ELSE 0 END) AS sp9
,SUM(CASE WHEN Specialty = 'Obstetrics and Gynecology' THEN 1 ELSE 0 END) AS sp10
,SUM(CASE WHEN Specialty = 'Anesthesiology' THEN 1 ELSE 0 END) AS sp11
,SUM(CASE WHEN Specialty = 'Pathology' THEN 1 ELSE 0 END) AS sp12
,SUM(CASE WHEN Specialty IS NULL THEN 1 ELSE 0 END) AS sp13

,SUM(CASE WHEN ProcedureGroup = 'EM' THEN 1 ELSE 0 END ) AS pg1
,SUM(CASE WHEN ProcedureGroup = 'PL' THEN 1 ELSE 0 END ) AS pg2
,SUM(CASE WHEN ProcedureGroup = 'MED' THEN 1 ELSE 0 END ) AS pg3
,SUM(CASE WHEN ProcedureGroup = 'SCS' THEN 1 ELSE 0 END ) AS pg4
,SUM(CASE WHEN ProcedureGroup = 'RAD' THEN 1 ELSE 0 END ) AS pg5
,SUM(CASE WHEN ProcedureGroup = 'SDS' THEN 1 ELSE 0 END ) AS pg6
,SUM(CASE WHEN ProcedureGroup = 'SIS' THEN 1 ELSE 0 END ) AS pg7
,SUM(CASE WHEN ProcedureGroup = 'SMS' THEN 1 ELSE 0 END ) AS pg8
,SUM(CASE WHEN ProcedureGroup = 'ANES' THEN 1 ELSE 0 END ) AS pg9
,SUM(CASE WHEN ProcedureGroup = 'SGS' THEN 1 ELSE 0 END ) AS pg10
,SUM(CASE WHEN ProcedureGroup = 'SEOA' THEN 1 ELSE 0 END ) AS pg11
,SUM(CASE WHEN ProcedureGroup = 'SRS' THEN 1 ELSE 0 END ) AS pg12
,SUM(CASE WHEN ProcedureGroup = 'SNS' THEN 1 ELSE 0 END ) AS pg13
,SUM(CASE WHEN ProcedureGroup = 'SAS' THEN 1 ELSE 0 END ) AS pg14
,SUM(CASE WHEN ProcedureGroup = 'SUS' THEN 1 ELSE 0 END ) AS pg15
,SUM(CASE WHEN ProcedureGroup IS NULL THEN 1 ELSE 0 END ) AS pg16
,SUM(CASE WHEN ProcedureGroup = 'SMCD' THEN 1 ELSE 0 END ) AS pg17
,SUM(CASE WHEN ProcedureGroup = 'SO' THEN 1 ELSE 0 END ) AS pg18

,SUM(CASE WHEN PlaceSvc = 'Office' THEN 1 ELSE 0 END) AS ps1
,SUM(CASE WHEN PlaceSvc = 'Independent Lab' THEN 1 ELSE 0 END) AS ps2
,SUM(CASE WHEN PlaceSvc = 'Urgent Care' THEN 1 ELSE 0 END) AS ps3
,SUM(CASE WHEN PlaceSvc = 'Outpatient Hospital' THEN 1 ELSE 0 END) AS ps4
,SUM(CASE WHEN PlaceSvc = 'Inpatient Hospital' THEN 1 ELSE 0 END) AS ps5
,SUM(CASE WHEN PlaceSvc = 'Ambulance' THEN 1 ELSE 0 END) AS ps6
,SUM(CASE WHEN PlaceSvc = 'Other' THEN 1 ELSE 0 END) AS ps7
,SUM(CASE WHEN PlaceSvc = 'Home' THEN 1 ELSE 0 END) AS ps8
,SUM(CASE WHEN PlaceSvc IS NULL THEN 1 ELSE 0 END) AS ps9

INTO claims_per_member
FROM Claims
GROUP BY year,Memberid

-- remove some nulls
UPDATE claims_per_member
SET LOS_max = 0 WHERE LOS_max IS NULL

UPDATE claims_per_member
SET LOS_min = 0 WHERE LOS_min IS NULL

UPDATE claims_per_member
SET LOS_ave = 0 WHERE LOS_ave IS NULL

UPDATE claims_per_member
SET LOS_stdev = -1 WHERE LOS_stdev IS NULL

UPDATE claims_per_member
SET dsfs_max = 0 WHERE dsfs_max IS NULL

UPDATE claims_per_member
SET dsfs_min = 0 WHERE dsfs_min IS NULL

UPDATE claims_per_member
SET dsfs_ave = 0 WHERE dsfs_ave IS NULL

UPDATE claims_per_member
SET dsfs_stdev = -1 WHERE dsfs_stdev IS NULL

UPDATE claims_per_member
SET dsfs_range = -1 WHERE dsfs_range IS NULL

UPDATE claims_per_member
SET CharlsonIndexI_range = -1 WHERE CharlsonIndexI_range IS NULL

/***********************************
Members
***********************************/

-- create binary flags for age
ALTER TABLE Members ADD age_05 INT
ALTER TABLE Members ADD age_15 INT
ALTER TABLE Members ADD age_25 INT
ALTER TABLE Members ADD age_35 INT
ALTER TABLE Members ADD age_45 INT
ALTER TABLE Members ADD age_55 INT
ALTER TABLE Members ADD age_65 INT
ALTER TABLE Members ADD age_75 INT
ALTER TABLE Members ADD age_85 INT
ALTER TABLE Members ADD age_MISS INT

GO

UPDATE Members SET age_05 = CASE WHEN ageATfirstclaim = '0-9' THEN 1 ELSE 0 END
UPDATE Members SET age_15 = CASE WHEN ageATfirstclaim = '10-19' THEN 1 ELSE 0 END
UPDATE Members SET age_25 = CASE WHEN ageATfirstclaim = '20-29' THEN 1 ELSE 0 END
UPDATE Members SET age_35 = CASE WHEN ageATfirstclaim = '30-39' THEN 1 ELSE 0 END
UPDATE Members SET age_45 = CASE WHEN ageATfirstclaim = '40-49' THEN 1 ELSE 0 END
UPDATE Members SET age_55 = CASE WHEN ageATfirstclaim = '50-59' THEN 1 ELSE 0 END
UPDATE Members SET age_65 = CASE WHEN ageATfirstclaim = '60-69' THEN 1 ELSE 0 END
UPDATE Members SET age_75 = CASE WHEN ageATfirstclaim = '70-79' THEN 1 ELSE 0 END
UPDATE Members SET age_85 = CASE WHEN ageATfirstclaim = '80+' THEN 1 ELSE 0 END
UPDATE Members SET age_MISS = CASE WHEN ageATfirstclaim IS NULL THEN 1 ELSE 0 END

--create binary flags for sex
ALTER TABLE Members
ADD sexMALE INT
GO

UPDATE Members
SET SexMALE =
CASE
WHEN Sex = 'M' THEN 1 ELSE 0
END

ALTER TABLE Members
ADD sexFEMALE INT
GO

UPDATE Members
SET SexFEMALE =
CASE
WHEN Sex = 'F' THEN 1 ELSE 0
END

ALTER TABLE Members
ADD sexMISS INT
GO

UPDATE Members
SET SexMISS =
CASE
WHEN Sex IS NULL THEN 1 ELSE 0
END

/******************
DRUG COUNTS
******************/

-- convert to integers
ALTER TABLE drugcount ADD DrugCountI INT
GO
UPDATE DRUGCOUNT
SET DrugCountI =
CASE WHEN DrugCount = '7+' THEN 7 ELSE DrugCount END

SELECT
memberID AS memberID_dc
,Year AS YEAR_dc
,MAX(drugcountI) AS drugCount_max
,MIN(drugcountI) AS drugCount_min
,AVG(drugcountI * 1.0) AS drugCount_ave
,COUNT(*) AS drugcount_months
INTO DRUGCOUNT_SUMMARY
FROM
drugcount
GROUP BY
memberID
,Year

/******************
LAB COUNTS
******************/

-- convert to integers
ALTER TABLE LabCount ADD LabCountI INT
GO
UPDATE LabCount
SET LabCountI =
CASE WHEN LabCount = '10+' THEN 10 ELSE LabCount END

SELECT
memberID AS memberID_lc
,Year AS YEAR_lc
,MAX(labcountI) AS labCount_max
,MIN(labcountI) AS labCount_min
,AVG(labcountI * 1.0) AS labCount_ave
,COUNT(*) AS labcount_months
INTO LABCOUNT_SUMMARY
FROM
labcount
GROUP BY
memberID
,Year

/********************************
Targets
********************************/

SELECT *
INTO DIH
FROM
(
SELECT
MemberID AS MemberID_t
,'Y1' AS YEAR_t
,ClaimsTruncated
,DaysInHospital
,1 AS trainset
FROM DaysInHospital_Y2

UNION ALL

SELECT
MemberID AS MemberID_t
,'Y2' AS YEAR_t
,ClaimsTruncated
,DaysInHospital
,1 AS trainset
FROM DaysInHospital_Y3

UNION ALL

SELECT
MemberID AS MemberID_t
,'Y3' AS YEAR_t
,ClaimsTruncated
,null AS DaysInHospital
,0 AS trainset
FROM Target
) a

/*****************************
Now merge them all together to
create the modeling data SET
******************************/
SELECT a.*,b.*
INTO #temp1
FROM
DIH a
LEFT OUTER JOIN
members b
on a.MemberID_t = B.Memberid_M

ALTER TABLE #temp1 DROP COLUMN Memberid_M
ALTER TABLE #temp1 DROP COLUMN AgeAtFirstClaim
ALTER TABLE #temp1 DROP COLUMN Sex
GO

SELECT a.*,b.*
INTO #temp2
FROM
#temp1 a
LEFT OUTER JOIN
claims_per_member b
on a.MemberID_t = B.Memberid
AND a.YEAR_t = b.year

ALTER TABLE #temp2 DROP COLUMN Memberid
ALTER TABLE #temp2 DROP COLUMN year
GO

SELECT a.*,b.*
INTO #temp3
FROM
#temp2 a
LEFT OUTER JOIN
DRUGCOUNT_SUMMARY b
on a.MemberID_t = B.Memberid_dc
AND a.YEAR_t = b.YEAR_dc

ALTER TABLE #temp3 DROP COLUMN Memberid_dc
ALTER TABLE #temp3 DROP COLUMN YEAR_dc
GO

SELECT a.*,b.*
INTO #temp4
FROM
#temp3 a
LEFT OUTER JOIN
LABCOUNT_SUMMARY b
on a.MemberID_t = B.Memberid_lc
AND a.YEAR_t = b.YEAR_lc

ALTER TABLE #temp4 DROP COLUMN Memberid_lc
ALTER TABLE #temp4 DROP COLUMN YEAR_lc
GO

-- removel nulls for those who had
-- no lab or drug information
ALTER TABLE #temp4 ADD labNull INT
ALTER TABLE #temp4 ADD drugNull INT
GO

UPDATE #temp4 SET labNull = 0
UPDATE #temp4 SET labNull = 1 WHERE labCount_max IS NULL

UPDATE #temp4 SET drugNull = 0
UPDATE #temp4 SET drugNull = 1 WHERE drugCount_max IS NULL

UPDATE #temp4 SET labCount_max = 0 WHERE labCount_max IS NULL
UPDATE #temp4 SET labCount_min = 0 WHERE labCount_min IS NULL
UPDATE #temp4 SET labCount_ave = 0 WHERE labCount_ave IS NULL
UPDATE #temp4 SET labcount_months = 0 WHERE labcount_months IS NULL

UPDATE #temp4 SET drugCount_max = 0 WHERE drugCount_max IS NULL
UPDATE #temp4 SET drugCount_min = 0 WHERE drugCount_min IS NULL
UPDATE #temp4 SET drugCount_ave = 0 WHERE drugCount_ave IS NULL
UPDATE #temp4 SET drugcount_months = 0 WHERE drugcount_months IS NULL

SELECT *
INTO modelling_set
FROM #temp4

R code

########################################
# Example GBM model for HHP
# scores ~ 0.4635 on leaderboard
# which would be 55th position of 510
# as at 9th Sept 2011
#
# Requires the data having been prepared
# using the SQL supplied
#
########################################

starttime <- proc.time()

########################################
#load the data
########################################
library(RODBC)

#set a connection to the database 
conn <- odbcDriverConnect("driver=SQL Server;database=HHP_comp;server=servername\\SQLEXPRESS;")

#or this method involves setting up a DSN (Data Source Name) called HHP_comp
#conn <- odbcConnect("HHP_comp")

alldata <- sqlQuery(conn,"select * from modelling_set")


########################################
# arrange the data
########################################

#identify train and leaderboard data
trainrows <- which(alldata$trainset == 1)
scorerows <- which(alldata$trainset == 0)

#sanity check the size of each set
length(trainrows)
length(scorerows)

#display the column names
colnames(alldata)

#memberid is required as key for submission set
memberid <- alldata[scorerows,'MemberID_t']

#remove redundant fields
alldata$MemberID_t <- NULL
alldata$YEAR_t <- NULL
alldata$trainset <- NULL

#target - what we are predicting
theTarget <- 'DaysInHospital'

#put the target on the log scale
alldata[trainrows,theTarget] <- log1p(alldata[trainrows,theTarget]) 

#find the position of the target
targindex <-  which(names(alldata)==theTarget)


########################################
# build the model
########################################

#GBM model settings, these can be varied
GBM_NTREES = 500
GBM_SHRINKAGE = 0.05
GBM_DEPTH = 4
GBM_MINOBS = 50

#build the GBM model
library(gbm)
GBM_model <- gbm.fit(
             x = alldata[trainrows,-targindex]
            ,y = alldata[trainrows,targindex]
            ,distribution = "gaussian"
            ,n.trees = GBM_NTREES
            ,shrinkage = GBM_SHRINKAGE
            ,interaction.depth = GBM_DEPTH
            ,n.minobsinnode = GBM_MINOBS
            ,verbose = TRUE) 

#list variable importance
summary(GBM_model,GBM_NTREES)

#predict for the leaderboard data
prediction <- predict.gbm(object = GBM_model
              ,newdata = alldata[scorerows,-targindex]
              ,GBM_NTREES)

#put on correct scale and cap
prediction <- expm1(prediction)
prediction <- pmin(15,prediction)
prediction <- pmax(0,prediction)

#plot the submission distribution
hist(prediction, breaks=500)


########################################
#write the submission to file
########################################
submission <- cbind(memberid,prediction)
colnames(submission) <- c("MemberID","DaysInHospital")
fnname <- "C:\\GBM_demo1.csv"
write.csv(submission, file=fnname, row.names = FALSE)

elapsedtime <- proc.time() - starttime
cat("\nFinished\n",elapsedtime)

14 comments:

Sargis1 October 2011 at 11:02
Thank you so very much. I was able to get public leaderboard score of 0.463549 with this code. Congratulations on the well deserved Heritage Health Prize Round 1 Milestone prize!
Zach Deane-Mayer1 October 2011 at 14:25
Could your provide the result of the SQL as a datafile?
Sali Mali1 October 2011 at 22:37
Sorry Zach, I don't think the rules allow redistributing the data. There are many free tools though to let you generate it yourself. In my earlier post on loading the data I think there is a link to a site that lists all the software.
Brian Chase2 October 2011 at 11:09
Thank you for sharing this and congratulations on winning the first milestone prize. I scored 0.463372 on the public leaderboard using this approach.
Sali Mali2 October 2011 at 11:25
Thanks Sarkis & Brian for sharing this. Interestingly you both got slightly different results, which is to do with the stochastic nature of the algorithm. GBMs will overfit, and if you read the documentation, you will discover you can find the optimal number of trees for a specific learning rate by cross validation. Also if you read our paper, you will discover that there are some variables in the SQL that we chose not to use in the modelling! If you optimise this model and add the extra variables as described, a score of 0.461 should be achievable with a single model.
Christian5 October 2011 at 11:11
Thanks for continuing to blog while competing. I haven't been active recently, but this has given me a great place to jump back in!
pidtis5 October 2011 at 17:08
Thank you so much for posting these details. I'm a newbie and this blog has served as a great learning resource.
B. Riemann8 October 2011 at 15:55
I was able to reproduce the results from using the 'gbm' only model. Are you going to post the parameters from the other models that you used in your ensemble methods?
Rstohr23 October 2011 at 13:38
Thanks for these details and congratulations with the first milestone price!

I'm participating in this contest as a grad student in engineering and writing a thesis about it. But i still seem to lack a theoretical base to fully understand the given problem. I now finished reading "Machine Learning, Tom Mitchell, McGraw Hill, 1997" as an introductionary work, are there other must read books, papers, resources, .. that you could recommend?

Thanks in advance!
Signipinnis25 October 2011 at 11:49
You are Heroes of the RRRRRRRRevolution. Your generous sharing is much appreciated.

I've never used SQL to create a lot of varied additional data elements/features, so that part also was very educational to me. Your code has an elegantly sparse, honed directness to it. Clearly "not your first rodeo" as they say.

Best of luck the rest of the way. (Not that there seems to be much "luck" about what you're doing.)
Sali Mali25 October 2011 at 12:37
@Signipinnis

Thanks for the comments. R is relatively new to me (<2 years) but the more I need to do stuff, the more I find that someone has already done it before and has written a nice function for me.

I did the data prep in SQL because that is what I am more familiar with. I'm sure the same thing can be done more easily in R, so if anyone wants to replicate the SQL in R then I'd appreciate the kickstart lesson.
train29 August 2012 at 02:11
Thank you so much for posting.
I'm a student. I'm interested in Kaggle.

Could you provide the more SQL code ?
including code to make Subset5,Subset6 (Data Set 1) and Data Set 2
Sali Mali29 August 2012 at 02:25
Sure, but will cost you $3.1 million ;-)
train29 August 2012 at 03:41
wow,,, thanks for your reply :)

Saturday, 1 October 2011

Code for a Respectable HHP Model

14 comments: