Another Data Mining Blog: A blog about playing with data and other maybe unrelated stuff<br />
By Sali Mali<br />
Milestone Prize Solution (posted 2019-10-20)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s1600/datathonLogoPB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2019<br />
<a href="https://anotherdataminingblog.blogspot.com/2019/09/buy-low-sell-high-leaderboard-and-rule.html" target="_blank"><< previous</a></td></tr>
</tbody></table>
<br />
<br />
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Congratulations to Samuel Messina for winning the $1,000 milestone prize. The write-up of his technique can be found <a href="https://drive.google.com/open?id=1mViDHsVJq2tsPDr9s0FYAPYnEFFlN3q3" target="_blank">here</a>.</span></div>
Buy Low Sell High - leaderboard and rule clarification (posted 2019-09-15, updated 2019-10-20)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><br />
<h4>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Buy Low Sell High - leaderboard and rule clarification</span></h4>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s1600/datathonLogoPB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2019<br />
<a href="http://anotherdataminingblog.blogspot.com/2019/09/2019-s-next-in-previous-post-there-was.html" target="_blank"><<Previous</a> <a href="https://anotherdataminingblog.blogspot.com/2019/10/milestone-prize-solution.html" target="_blank">Next>></a></td></tr>
</tbody></table>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: 12.8px;"><br /></span>
</div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">You can now register your team and submit your trade files to see how you do on the leaderboard:</span></div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="font-size: 12.8px;"> </span></span><span style="font-family: "arial" , "helvetica" , sans-serif;"><a href="https://melbournedatathon.com/">https://melbournedatathon.com</a></span></div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Just to clarify a few things:</span></div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br /></span></div>
<div style="text-align: left;">
</div>
<ol style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">You should only use the values of the <i><b>current</b></i> and<b> <i>previous</i></b> predictions to make a decision</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">If you use future prediction values then you will probably get a good result but this logic is unimplementable</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Any strategies found to be using future prediction values will be disqualified, so there is no point in trying </span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">Do not use anything else in your algorithm such as pair name, price or absolute time</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">If you want to use previous prediction values, then use the relative time differences to determine them</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The new file we provide will be consistent in having predictions generated at 5-minute intervals, but the absolute time values could be anything</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">After the deadline, you will be asked to nominate your 3 long and 3 short strategies you want to be evaluated</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">All teams beating the Benchmark solution on the private leader board for their nominated strategies will then qualify for stage 2</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">We will then invite those teams to run their code over <i><b>several</b></i> new files. They must only use the strategies nominated and the strategies must be locked in with no further parameter tuning allowed</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The new files will have different pair names and the start time for the field minutesSinceStart will not necessarily be the same start time as the file already provided</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The winner will be the team that gives the best return on the new data, provided they still beat the Benchmark and we are confident no future prediction values have been used.</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">There will be a winner for Short and a winner for Long</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">If we suspect future predictions are being used then we will say how we came to this conclusion and the team will have a right of reply to prove otherwise</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><span style="color: blue;">There will be a benchmark prize for the first leading team on the private leaderboard as at 12 pm on Thu Oct 17th that wishes to reveal their method. In order to receive the prize the team must write a blog post describing their method so it can be reproduced by others. It must not use future information. It is not compulsory to reveal your method, so we will proceed down the ranking and award the prize to the first team that wishes to do so.</span></span></li>
</ol>
<div style="text-align: left;">
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">To clarify what we mean by 'future information': the data set does contain records that are 'in the future' relative to the times we have asked you to make decisions for. It is OK to use this data to come up with a set of coefficients for a model.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">What is NOT OK is to use the raw prediction values at time 'x' as inputs to a model making a decision for a time prior to 'x'. Such a solution cannot be implemented, because those values would not yet exist when the decision has to be made.</span></div>
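<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Rules 1 and 5 in the list above can be sketched as follows. This is a minimal illustration on made-up data, not the organisers' code: the helper <i>addLaggedPred</i>, the 'prediction is rising' rule and the toy values are assumptions made for the example. Only the column names (keys_pair, minutesSinceStart, Lpred7b) and the 5 minute spacing come from the posts. Base R is used so that it runs without extra packages.</span><br />

```r
# Toy data: one pair with a prediction every 5 minutes (values are made up).
toy <- data.frame(
  keys_pair = "pair_1",
  minutesSinceStart = seq(1000, 1025, by = 5),
  Lpred7b = c(0.001, 0.003, 0.002, 0.004, 0.001, 0.005)
)

# Hypothetical helper: fetch the prediction made `lagMinutes` earlier for the
# same pair. The lookup uses a relative offset only, never the absolute time,
# and it can only ever reach backwards.
addLaggedPred <- function(d, predCol, lagMinutes) {
  key       <- paste(d$keys_pair, d$minutesSinceStart)
  keyWanted <- paste(d$keys_pair, d$minutesSinceStart - lagMinutes)
  d[[paste0(predCol, "_lag", lagMinutes)]] <- d[[predCol]][match(keyWanted, key)]
  d
}

toy <- addLaggedPred(toy, "Lpred7b", 5)

# An illustrative rule that uses only the current and the previous prediction:
# enter when the prediction is at or above a threshold and higher than it was
# 5 minutes ago. Rows without a previous prediction never trigger an entry.
longThresh <- 0.002
toy$longEntry <- as.integer(
  !is.na(toy$Lpred7b_lag5) &
    toy$Lpred7b >= longThresh &
    toy$Lpred7b > toy$Lpred7b_lag5
)

print(toy)
```

<span style="font-family: Arial, Helvetica, sans-serif;">Because the lag is found by matching on minutesSinceStart minus an offset, the same code should behave identically on a file whose start time or pair names differ.</span><br />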
<br />
<br />
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">If there is anything else that needs clarifying, ask below and we will add it to the list</span></div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Good Luck</span><br />
<div style="text-align: left;">
</div>
<div style="text-align: left;">
<br /></div>
<div style="text-align: left;">
<br /></div>
<br />
<div style="text-align: left;">
<br /></div>
</div>
Buy Low Sell High - file checking (posted 2019-09-05, updated 2019-10-16)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s1600/datathonLogoPB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2019<br />
<a href="http://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high-data-and-code.html" target="_blank"><<previous</a> <a href="https://anotherdataminingblog.blogspot.com/2019/09/buy-low-sell-high-leaderboard-and-rule.html" target="_blank">next>></a> </td></tr>
</tbody></table>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">In the previous post there was code to generate a submission file.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Submission files need to be valid so we can process them, otherwise they will be rejected by the calculation engine. Below is some code which your file should pass to ensure it will not be rejected.</span><br />
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>#----------------------------------------------------------
#
# some R code to test your submission file is valid
#
#----------------------------------------------------------
library(data.table)
#define where the data is
fileToTest <- 'test_submission_file_1.csv'
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,fileToTest)
#read in the data
tradeData <- fread(theDataFile)
#set error message
ermsg <- 'OK'
#these are the column names required
requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
#------------------------
#do some basic checks
#-----------------------
if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
missing_cols <- setdiff(requiredCols,colnames(tradeData))
ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
} else if (class(tradeData$enterTime) == "character") {
ermsg <- '\n the field enterTime contains non numeric values'
} else if (class(tradeData$exitTime) == "character") {
ermsg <- '\n the field exitTime contains non numeric values'
} else if (class(tradeData$direction) == "character") {
ermsg <- '\n the field direction contains non numeric values'
} else if (class(tradeData$percentPair) == "character") {
ermsg <- '\n the field percentPair contains non numeric values'
} else if (nrow(tradeData) == 0){
ermsg <- '\n the file contains no records'
} else if (anyNA(tradeData)) {
ermsg <- '\n the file contains missing values'
} else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
ermsg <- '\n the exit time needs to be after the entry time'
}
if (ermsg != 'OK'){
stop(ermsg)
}
#-------------------------------------
# check that we are not over 100% in
# a trade at any particular time
#-------------------------------------
timeStep <- 5
tradeData[,xKey := paste(keys_pair,strategyName,direction,sep="_")]
allKeys <- unique(tradeData$xKey)
for (thisKey in allKeys){
pairStrategyTradeData <- subset(tradeData,xKey == thisKey)
E_numberOfTrades <- nrow(pairStrategyTradeData)
cat("\n",thisKey,E_numberOfTrades)
#get the times we are in a trade
for (tradeNumber in 1:E_numberOfTrades){
thisTimes1 <- seq(from = pairStrategyTradeData[tradeNumber,enterTime],to=pairStrategyTradeData[tradeNumber,exitTime- timeStep],by=timeStep )
thisPercent1 <- rep(pairStrategyTradeData[tradeNumber,percentPair],length(thisTimes1))
if (tradeNumber==1){
allTimes1 <- thisTimes1
allPercent1 <- thisPercent1
} else {
allTimes1 <- c(allTimes1,thisTimes1)
allPercent1 <- c(allPercent1,thisPercent1)
}
} #tradeNumber
#make sure the max percent at any one time is not > 1
d <- cbind.data.frame(allTimes1,allPercent1)
#total percent held at each time step
totals <- aggregate(allPercent1 ~ allTimes1,data=d,sum)
maxPercent <- max(totals$allPercent1)
if (maxPercent > 1){
#report the first time step where the total exceeds 100%
aBadTime <- totals$allTimes1[which(totals$allPercent1 > 1)[1]]
ermsg <- paste("\n you cannot be more than 100% in a trade at any one time:\n",thisKey,"\ntime=",aBadTime,"\npercent =",maxPercent)
stop(ermsg)
}
} # thisKey
cat('\nthe file appears to be OK!')
</code></pre>
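<br />
<span style="font-family: Arial, Helvetica, sans-serif;">The 100% check above expands every trade into its 5 minute steps. As a cross-check, the same peak exposure can be found with an event sweep, which is a different technique from the one in the validator: add the percent at each entry, subtract it at each exit, and take the running total. The sketch below is only an illustration; the helper <i>peakExposure</i> and the toy trades are assumptions, and it treats a position as closed at exitTime, in line with the loop above stopping at exitTime minus one time step.</span><br />

```r
# Toy trades for a single pair / strategy / direction (values are made up).
trades <- data.frame(
  enterTime   = c(0, 10, 20),
  exitTime    = c(15, 30, 40),
  percentPair = c(0.5, 0.6, 0.3)
)

# Hypothetical helper: peak total percent held at any time, via an event sweep.
peakExposure <- function(enterTime, exitTime, percentPair) {
  events <- data.frame(
    time  = c(enterTime, exitTime),
    delta = c(percentPair, -percentPair)
  )
  # Net the changes at each distinct time first, so an exit and an entry at
  # the same time offset each other before the running total is taken.
  net <- aggregate(delta ~ time, data = events, sum)
  net <- net[order(net$time), ]
  max(cumsum(net$delta))
}

peak <- peakExposure(trades$enterTime, trades$exitTime, trades$percentPair)
print(peak)

# A small tolerance guards against floating point noise in the sums.
if (peak > 1 + 1e-9) {
  cat("\nthese toy trades would be rejected: more than 100% in a trade\n")
}
```

<span style="font-family: Arial, Helvetica, sans-serif;">In the toy data the first two trades overlap between times 10 and 15, so the peak is 0.5 + 0.6 and the trades would fail the check.</span><br />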
<br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Update 17th Oct 2019. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">The file validator is now stricter. The basic checks section in the code above has been expanded to the following set of rules:</span><br />
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>
requiredCols <- c('strategyName','keys_pair','enterTime','exitTime','direction','percentPair')
requiredPairs <- paste0('pair_',1:22)
#do some checks
if (!identical(sort(intersect(colnames(tradeData),requiredCols)) , sort(requiredCols))) {
missing_cols <- setdiff(requiredCols,colnames(tradeData))
ermsg <- paste('\n incorrect column names in the trades file. We are missing:\n',paste(missing_cols,collapse = ","))
} else if (class(tradeData$enterTime) == "character") {
ermsg <- '\n the field enterTime contains non numeric values'
} else if (class(tradeData$exitTime) == "character") {
ermsg <- '\n the field exitTime contains non numeric values'
} else if (class(tradeData$direction) == "character") {
ermsg <- '\n the field direction contains non numeric values'
} else if (class(tradeData$percentPair) == "character") {
ermsg <- '\n the field percentPair contains non numeric values'
} else if (nrow(tradeData) == 0){
ermsg <- '\n the file contains no records'
} else if (anyNA(tradeData)) {
ermsg <- '\n the file contains missing values'
} else if (min(tradeData$exitTime - tradeData$enterTime) <= 0) {
ermsg <- '\n the exit time needs to be after the entry time'
} else if (nrow(tradeData) < 50){
ermsg <- '\n not enough trades'
} else if (max(tradeData$direction) > 1){
ermsg <- '\n trade direction > 1'
} else if (min(tradeData$direction) < 0){
ermsg <- '\n trade direction < 0'
} else if (length(unique(tradeData$direction)) > 2){
ermsg <- '\n too many trade directions'
} else if (max(tradeData$enterTime %% 5) > 0) {
ermsg <- '\n wrong times'
} else if (max(tradeData$exitTime %% 5) > 0) {
ermsg <- '\n wrong times'
} else if (min(tradeData$percentPair) < 0) {
ermsg <- '\n negative percent pairs'
} else if ( length(intersect(unique(tradeData$keys_pair),requiredPairs)) == 0 ){
ermsg <- '\n you need to predict for all pairs'
} else if ( !identical(sort(intersect(unique(tradeData$keys_pair),requiredPairs)),sort(requiredPairs))){
ermsg <- '\n you need to predict for all pairs'
} else if (min(tradeData[,.N,by = c('keys_pair','direction')]$N) < 5) {
ermsg <- '\n you need at least 5 trades per pair per direction'
}
</code></pre>
<br />
<br /></div>
Buy Low Sell High - a simple strategy (posted 2019-08-27, updated 2019-09-05)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s1600/datathonLogoPB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2019<br />
<br />
<a href="http://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high-data-and-code.html" target="_blank"> << previous post (getting the data)</a> <a href="https://anotherdataminingblog.blogspot.com/2019/09/2019-s-next-in-previous-post-there-was.html" target="_blank"> next>></a></td></tr>
</tbody></table>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">Below is some R code to generate a valid submission file for a very simple strategy.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">It is up to you to find the strategies that will give the highest returns.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">In the next post we'll introduce the leaderboard, where you can upload your strategies.</span><br />
<br /></div>
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>
############################################################
#
# sample R code for the 2019 Melbourne Datathon that will
# generate a valid submission file.
#
############################################################
library(data.table)
#----------------------------------------------
#define where the data is & load
#----------------------------------------------
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
dt <- fread(theDataFile)
#-------------------------
# strategy settings
#-------------------------
feePerc <- 0.0015 #fixed const for the competition
longThresh <- 0.002 #the value the prediction has to be higher than
longRemainWindow <- 12 #the number of hours to remain in a trade
#---------------------------------------------------
# flag if the criteria for entry is passed
#---------------------------------------------------
dt[,longEntry := 0]
dt[Lpred7b >= longThresh, longEntry := 1]
#----------------------
# set the exit times
#----------------------
exitLongs <- subset(dt,longEntry == 1,select = c('keys_pair','minutesSinceStart'))
exitLongs[,minutesSinceStart := minutesSinceStart + (longRemainWindow * 60)]
exitLongs[,longExit := 1]
#-------------------------------
#merge the entry and exit times
#-------------------------------
myKeys <- c("keys_pair","minutesSinceStart")
setkeyv(dt,myKeys)
setkeyv(exitLongs,myKeys)
dt <- merge(dt,exitLongs,all.x=TRUE)
#-----------------------------
# no match = no exit
#-----------------------------
dt[is.na(longExit), longExit := 0]
#--------------------------------------------------
# if there is an exit and enter, change the exit
#--------------------------------------------------
dt[longEntry == 1 & longExit == 1, longExit := 0]
#-----------------------------------------------------
#assign a group number to consecutive runs of entries
#-----------------------------------------------------
setkeyv(dt,myKeys)
dt[,grp := rleidv(dt, cols="longEntry")]
dt[,grp1 := paste0(keys_pair,grp)]
dt[, inc := cumsum(longEntry),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]
#--------------------------------------
# the first entry in the run is a BUY
#--------------------------------------
dt[,buy := 0]
dt[longEntry==1 & inc1 == 1,buy :=1 , by=grp1]
#-----------------------------------------------------
#assign a group number to consecutive runs of exits
#-----------------------------------------------------
dt[,grp := rleidv(dt, cols="longExit")]
dt[,grp1 := paste0(keys_pair,grp)]
dt[, inc := cumsum(longExit),by=grp1]
dt[, inc1 := cumsum(inc),by=grp1]
#-------------------------------------
# The first exit in the run is a SELL
#-------------------------------------
dt[,sell := 0]
dt[longExit==1 & inc1 == 1,sell :=1 , by=grp1]
#---------------------------------------------------
# Keep just the rows where we are BUYING or SELLING
#---------------------------------------------------
d1 <- subset(dt,(sell==1 & buy==0 | sell==0 & buy==1),select = c('keys_pair','minutesSinceStart','tradePrice','buy','sell'))
#------------------------------------------
# Make sure each buy is followed by a sell
#------------------------------------------
d1[,prevRowBuy := shift(buy, 1, type="lag") , by=keys_pair]
d1[,prevRowSell := shift(sell, 1, type="lag") , by=keys_pair]
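#keep the first row of each pair (it has no previous row) and any row that differs from the row before it
d1 <- subset(d1,is.na(prevRowBuy) | (prevRowBuy != buy & prevRowSell != sell))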
#--------------------------------------------------------------
#find the next price (for buy rows it will be the sell price)
#--------------------------------------------------------------
d1[,sellPrice := shift(tradePrice, 1, type="lead"), by=keys_pair]
d1[,minutesSinceStartExit := shift(minutesSinceStart, 1, type="lead"), by=keys_pair]
d1[, tradeLength := minutesSinceStartExit - minutesSinceStart]
#----------------------------------------------------------------
# now we only need the buy rows as we have the sell time & price
#----------------------------------------------------------------
d1 <- subset(d1,buy==1 & !is.na(minutesSinceStartExit))
#-------------------------------------------------------------------
# long profit estimate (we only have prices for some pairs though)
# this does not include a stoploss calculation
#--------------------------------------------------------------------
d1[,tradeProfit := ( (sellPrice * (1 - feePerc)) - (tradePrice * (1 + feePerc)) )/ tradePrice]
hist(d1$tradeProfit,breaks=100,col='blue');abline(v=0,col='red')
#------------------------------
# generate a submission file
#------------------------------
stratName <- 'demo_LongOnly'
d1[,enterTime := minutesSinceStart]
d1[,exitTime := minutesSinceStartExit]
d1[,direction := 1]
d1[,percentPair := 1]
d1[,strategyName := stratName]
d1 <- subset(d1,select = c('keys_pair','enterTime','exitTime','direction','percentPair','strategyName'))
fwrite(d1,paste0(dataFolder,stratName,'.csv'))
</code></pre>
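<br />
<span style="font-family: Arial, Helvetica, sans-serif;">The tradeProfit formula in the code above charges the fee on both the buy and the sell. A small worked example may help to see what that means for a single trade. The function name <i>longProfit</i> and the prices are made up for illustration; only feePerc and the formula itself come from the post.</span><br />

```r
feePerc <- 0.0015   # fixed fee for the competition, as in the post above

# Net return of one long trade, using the same formula as tradeProfit.
longProfit <- function(buyPrice, sellPrice, fee = feePerc) {
  ((sellPrice * (1 - fee)) - (buyPrice * (1 + fee))) / buyPrice
}

# Made-up prices: buy at 100 and sell at 101, a 1% gross move.
netReturn <- longProfit(100, 101)
print(netReturn)

# Smallest gross price rise needed to break even after both fees.
breakEvenMove <- (1 + feePerc) / (1 - feePerc) - 1
print(breakEvenMove)
```

<span style="font-family: Arial, Helvetica, sans-serif;">With these numbers a 1% gross rise leaves a net return of just under 0.7%, and the price has to rise by a little over 0.3% before a long trade breaks even.</span><br />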
</div>
Buy Low, Sell High - DATA and Code (posted 2019-08-27)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s1600/datathonLogoPB.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">2019</td></tr>
</tbody></table>
<div style="text-align: center;">
<span style="font-family: Arial, Helvetica, sans-serif;"><a href="https://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high.html" target="_blank"><<previous post</a> <a href="https://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high-simple-strategy.html" target="_blank">next post>></a></span></div>
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">This is information about the 2019 Melbourne Datathon - <a href="http://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high.html" target="_blank">Please read the previous post first</a></span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">The data is a single file and can be downloaded from the following links:</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><a href="https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR">https://drive.google.com/open?id=11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR</a> (553 MB zipped)</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><a href="https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP">https://drive.google.com/open?id=1WuFFk7NkBgYJxgDx8jeohwiLwwOMzLZP</a> (1.5 GB unzipped)</span><br />
<br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">The column names should be self-explanatory. There are two price columns:</span><br />
<div style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"> <br />
<b>barClosePrice </b>- this is the price at the current time stamp 'minutesSinceStart'. It is the price whose change the predictions (Lpred1b to Lpred14b) are predicting.<br />
<br />
<b>tradePrice </b>- this is the price 5 minutes later. It is the price that will be used in the returns calculations. The reason for this is that we are assuming it takes 5 minutes from the time we decide to trade to actually executing the trade.<br />
<br />
</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Below is some R code to get you going, and we'll be posting more in the coming days.</span></div>
<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed #999999; color: black; font-family: "andale mono" , "lucida console" , "monaco" , "fixed" , monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>#--------------------------------------------------------------------------------------
# some R code to get you started with the 2019 Melbourne Datathon analytic challenge
# download the data from the link below and unzip to 'dataFolder'
# https://drive.google.com/file/d/11bWbg9kSmGXNBUOdGXv7MtWszTfVTUfR (553 mb)
#--------------------------------------------------------------------------------------
library(data.table)
#define where the data is
dataFolder <- "D:/buylowsellhigh/downloaded/"
theDataFile <- paste0(dataFolder,"melbdatathon2019_buylowsellhigh.csv")
#read in the data
dt <- fread(theDataFile)
nrow(dt)
#4,907,361
colnames(dt)
# [1] "keys_pair" "minutesSinceStart" "gap" "barClosePrice" "tradePrice" "Lpred1b" "Lpred2b" "Lpred3b" "Lpred4b"
#[10] "Lpred5b" "Lpred6b" "Lpred7b" "Lpred8b" "Lpred9b" "Lpred10b" "Lpred11b" "Lpred12b" "Lpred13b"
#[19] "Lpred14b"
unique(dt$keys_pair)
#[1] "0x_bitcoin" "bitcoin_usdollar" "bitcoincash_usdollar" "cardano_bitcoin" "dash_usdollar" "litecoin_bitcoin" "litecoin_tetherusd" "monero_bitcoin"
#[9] "qtum_bitcoin" "ripple_bitcoin" "ripple_usdollar" "stratis_bitcoin" "tron_tetherusd" "zcash_bitcoin" "pair_1" "pair_2"
#[17] "pair_3" "pair_4" "pair_5" "pair_6" "pair_7" "pair_8" "pair_9" "pair_10"
#[25] "pair_11" "pair_12" "pair_13" "pair_14" "pair_15" "pair_16" "pair_17" "pair_18"
#[33] "pair_19" "pair_20" "pair_21" "pair_22"
#price information is missing for the unnamed pairs
x1 <- subset(dt,keys_pair== "pair_1")
nrow(x1)
#132,721
summary(x1)
#distribution of the predictions
hist(dt$Lpred1b,breaks=100)
#time series of prices and predictions
x <- subset(dt,keys_pair== "bitcoin_usdollar")[1:1000]
plot(x$minutesSinceStart,x$barClosePrice,type='l',col="blue")
plot(x$minutesSinceStart,x$Lpred1b,type='l',col="blue")
</code></pre>
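<br />
<span style="font-family: Arial, Helvetica, sans-serif;">The relationship between the two price columns described above (tradePrice is the price 5 minutes after barClosePrice) can be illustrated on a toy table. This is a sketch under assumptions: the helper <i>priceAhead</i>, the pair names and the prices are invented for the example, and it assumes one row per pair every 5 minutes.</span><br />

```r
# Toy table: two made-up pairs with a close price every 5 minutes.
toy <- data.frame(
  keys_pair = rep(c("a_b", "c_d"), each = 4),
  minutesSinceStart = rep(seq(0, 15, by = 5), times = 2),
  barClosePrice = c(10, 11, 12, 13, 50, 49, 48, 47)
)

# Hypothetical helper: the close price `ahead` minutes later for the same
# pair, or NA when no such row exists (for example the last row of a pair).
priceAhead <- function(d, ahead = 5) {
  key       <- paste(d$keys_pair, d$minutesSinceStart)
  keyWanted <- paste(d$keys_pair, d$minutesSinceStart + ahead)
  d$barClosePrice[match(keyWanted, key)]
}

# The price a decision made at each row would actually trade at.
toy$tradePrice <- priceAhead(toy)
print(toy)
```

<span style="font-family: Arial, Helvetica, sans-serif;">On the real file, a column built this way could be compared with the supplied tradePrice for the named pairs to confirm your reading of the data. It is a price used for calculating returns, not something to feed into a trading decision.</span><br />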
<br />
<br /></div>
'Buy Low Sell High' 2019 Melbourne Datathon - INTRODUCTION (posted 2019-08-27, updated 2019-08-28)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<br />
<h2 style="text-align: center;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">"Buy Low - Sell High"</span></h2>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h4 style="text-align: left;">
</h4>
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><span style="margin-left: auto; margin-right: auto;"><a href="http://www.datasciencemelbourne.com/datathon/" target="_blank"><img border="0" data-original-height="304" data-original-width="1500" height="64" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5RxEQZSVpB6hYY5BONoIXoPNc74F8P5lQmZpFdLSuVpEQEHNIVTw2ABZpFYadoab4hxGUvM0hS2ElHeaqIJkcNA0_6PhLXu2uKqoQOO__9ZNi6igl7hyphenhyphentONvFe2ufvksGTku1Q2MOIUfF/s320/datathonLogoPB.png" width="320" /></a></span></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><a href="http://www.datasciencemelbourne.com/datathon/" target="_blank">2019</a></td></tr>
</tbody></table>
<div style="text-align: center;">
<a href="http://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high-data-and-code.html" target="_blank">next post >></a></div>
<br />
<br />
<br />
<div>
<div style="text-align: left;">
<h4 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"> </span></h4>
</div>
</div>
<br />
<br />
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
</h4>
<h4 style="text-align: left;">
</h4>
<h3 style="text-align: left;">
</h3>
<div>
<div style="text-align: left;">
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><a href="http://www.datasciencemelbourne.com/datathon/" target="_blank">The analytics challenge for the 2019 Melbourne Datathon</a></span></h2>
</div>
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhWiIbMJUX0DyisoWBD4i0LG6QLcTi4RmqLfKGHJlE73kSsz66n_D33zECZwta9sPhoV4xo3QN0ssXEjSGhpsVz0p53MnPtaAd9DAjHco2F5vbgHRV-QARGKtl0oTzYQC_oKKdRiu2NpjB/s1600/tradingpit.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="394" data-original-width="700" height="180" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhWiIbMJUX0DyisoWBD4i0LG6QLcTi4RmqLfKGHJlE73kSsz66n_D33zECZwta9sPhoV4xo3QN0ssXEjSGhpsVz0p53MnPtaAd9DAjHco2F5vbgHRV-QARGKtl0oTzYQC_oKKdRiu2NpjB/s320/tradingpit.jpg" width="320" /></a></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><i><br />
</i></span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><i><br />
</i></span></div>
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Scenario</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">You are a trader for a pension fund that makes money for its investors by constantly buying and selling financial instruments such as shares in companies, foreign currencies or futures contracts. The general idea is that you aim to buy when the price is low and sell when it is high, hence making a profit.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Traditionally you'd stare at the live prices on a screen and through your experience know exactly when to trade. Recently your company has employed a team of machine learning experts to predict if prices are going up or down so as to hopefully remove the guesswork from your trading decisions.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Every 5 minutes you are now receiving 14 different predictions from mathematical models. What they forgot to tell you is how to actually use those predictions. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">The challenge is to figure out how to use these predictions to determine trading decisions. The winner of the challenge will be the 'trader' whose 'strategy' makes the most money.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNVOG9AovraO-ve7QfkQwCePHfTL9QXS-P08n8meIgTyznNgIVhZ1cm_5_RxLFzLgYP6I3yTPRmeCXL5_c-aeJ2zyiX4sC4KNh3AVxnIyrA7Mn8SYW-0cpLN6fZW7Wj4wnJL5vhuApTbg1/s1600/trader1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="423" data-original-width="1024" height="132" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNVOG9AovraO-ve7QfkQwCePHfTL9QXS-P08n8meIgTyznNgIVhZ1cm_5_RxLFzLgYP6I3yTPRmeCXL5_c-aeJ2zyiX4sC4KNh3AVxnIyrA7Mn8SYW-0cpLN6fZW7Wj4wnJL5vhuApTbg1/s320/trader1.png" width="320" /></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>The Data</b></span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">You have been provided with over 12 months of actual prices and predictions of future price movements at 5 minute intervals for some crypto-currency pairs. This is the data you will use to develop your 'trading strategy'.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">You have also been provided with just the predictions (no prices) for some further currency pairs over the same time period. This is the data to which you will apply your strategy to come up with the 'buy' and 'sell' times that define each 'trade'.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">These times will be submitted and the returns calculated.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheC_t6JmomOk2JdGyzE0TTprEtf3OnGtxjyxWkE7aJpWlIk59SkBVwXDlAO86cMWGCohw1GWDL4ZFcC3FgmrxjVQuaag2rgSkT0cAbk_hcrUSSi716IcDZ00I6E-ByD2IpSLSn3TbhmWKd/s1600/CRYPTO_IMAGE.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="539" data-original-width="960" height="179" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheC_t6JmomOk2JdGyzE0TTprEtf3OnGtxjyxWkE7aJpWlIk59SkBVwXDlAO86cMWGCohw1GWDL4ZFcC3FgmrxjVQuaag2rgSkT0cAbk_hcrUSSi716IcDZ00I6E-ByD2IpSLSn3TbhmWKd/s320/CRYPTO_IMAGE.png" width="320" /></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">What does a 'strategy' look like?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">A trading strategy uses only the predictions to make '<i>entry</i>' and '<i>exit</i>' trading decisions. It can be very simple or very sophisticated logic; coming up with that logic is the challenge. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">e.g.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">If prediction4 > 0.4 & </span><span style="font-family: "arial" , "helvetica" , sans-serif;">prediction7 > 0.2</span><span style="font-family: "arial" , "helvetica" , sans-serif;"> </span><span style="font-family: "arial" , "helvetica" , sans-serif;"> then 'Enter Long' (buy)</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;">If prediction4 < 0.3 & </span><span style="font-family: "arial" , "helvetica" , sans-serif;">prediction7 < 0.2 </span><span style="font-family: "arial" , "helvetica" , sans-serif;">then 'Exit Long' (sell)</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Code for an example strategy will be supplied</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
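<span style="font-family: "arial" , "helvetica" , sans-serif;">As a rough sketch of how threshold rules like the ones above might be turned into code, the R snippet below steps through a small made-up table of predictions, records an entry when the 'Enter Long' condition fires and an exit when the 'Exit Long' condition fires. The column names (prediction4, prediction7, minutesSinceStart), the toy values and the helper name simple_long_strategy are assumptions for illustration only; this is not the example strategy supplied with the data.</span><br />

```r
# Hypothetical helper (illustration only): turn threshold rules into long trades
simple_long_strategy <- function(d) {
  trades <- data.frame(enterTime = integer(0), exitTime = integer(0))
  in_trade <- FALSE
  enter_at <- NA_integer_
  for (i in seq_len(nrow(d))) {
    enter_signal <- d$prediction4[i] > 0.4 & d$prediction7[i] > 0.2
    exit_signal  <- d$prediction4[i] < 0.3 & d$prediction7[i] < 0.2
    if (!in_trade && enter_signal) {
      # open a long position
      in_trade <- TRUE
      enter_at <- d$minutesSinceStart[i]
    } else if (in_trade && exit_signal) {
      # close the open position and record the trade
      trades <- rbind(trades, data.frame(enterTime = enter_at,
                                         exitTime = d$minutesSinceStart[i]))
      in_trade <- FALSE
    }
  }
  trades
}

# Made-up predictions at 5 minute intervals
toy <- data.frame(
  minutesSinceStart = c(0L, 5L, 10L, 15L, 20L, 25L),
  prediction4 = c(0.1, 0.5, 0.45, 0.2, 0.6, 0.1),
  prediction7 = c(0.1, 0.3, 0.25, 0.1, 0.3, 0.1)
)
toy_trades <- simple_long_strategy(toy)
print(toy_trades)
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">Note that this sketch holds at most one open trade at a time and silently drops a position that is still open when the data ends; both are design choices of the sketch, not rules of the challenge.</span><br />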
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhizViEFl8iwmuG8Rp0zy_NRNVxkwunkKL9HHq36cTdGyED47XCP_AM3B1ateh-zcDZaHGiH0msCjkcVqig2sIk14rCBKCofITAyR4LUrWMrULKxkc8u-uQUjm7KBLZlAsrq1VZO-pOp0Bk/s1600/trader2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="436" data-original-width="1024" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhizViEFl8iwmuG8Rp0zy_NRNVxkwunkKL9HHq36cTdGyED47XCP_AM3B1ateh-zcDZaHGiH0msCjkcVqig2sIk14rCBKCofITAyR4LUrWMrULKxkc8u-uQUjm7KBLZlAsrq1VZO-pOp0Bk/s320/trader2.png" width="320" /></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">What does a Submission look like?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">You will be submitting a file with 6 columns. </span><span style="font-family: "arial" , "helvetica" , sans-serif;">Each row of the file will be a trade that will have its return calculated.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"> </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs627Z43ydFNDDrtg0yw05HmXDSpAviETyTHyHIqX_UuZM7VKjw7rPjE6Z3tKcVW4dDhoV0eDTFCWh_7AbBj8xothOiEarB_wuP7nLFnIV9B_czw8KWB7tzVXzrhVvriZ8D1lrzN2H72b9/s1600/submission.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="212" data-original-width="652" height="130" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs627Z43ydFNDDrtg0yw05HmXDSpAviETyTHyHIqX_UuZM7VKjw7rPjE6Z3tKcVW4dDhoV0eDTFCWh_7AbBj8xothOiEarB_wuP7nLFnIV9B_czw8KWB7tzVXzrhVvriZ8D1lrzN2H72b9/s400/submission.PNG" width="400" /></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><b>keys_pair</b>: the pair to trade </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>enterTime</b>: the time to enter the trade</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>exitTime</b>: the time to exit the trade</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>direction</b>: are we going long (1) or short (0) </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>percentPair </b>: what percent of the 'pair balance' do we want to put on this trade</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><b>strategyName</b>: what are you calling this strategy (for reference)</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">The times we supply are just integers which represent 'minutes from start', hence no need to mess around with time stamps.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><i>Direction </i>means are we predicting the price will go up (<i>'going long'</i>) or down (<i>'short selling'</i>).</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;"><i>percentPair </i>is the percent of the pair 'balance' that we want to put on each trade. There can be multiple trades for a single pair that overlap in time, but the sum of percentPair for that time stamp cannot be more than 1 (for each direction - see below*). </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
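<span style="font-family: "arial" , "helvetica" , sans-serif;">To make the six columns concrete, here is a minimal sketch in R that builds a two-row trades table and writes it out. Every value is made up, and the CSV format and the file name are assumptions, since the exact file format is not described here.</span><br />

```r
# Two made-up trades; every value here is illustrative
trades <- data.frame(
  keys_pair    = c("pair_1", "pair_1"),
  enterTime    = c(100L, 400L),   # minutes from start
  exitTime     = c(160L, 700L),   # minutes from start
  direction    = c(1L, 0L),       # 1 = long, 0 = short
  percentPair  = c(0.5, 1.0),     # fraction of the pair balance
  strategyName = c("demo_long", "demo_short"),
  stringsAsFactors = FALSE
)

# Basic sanity checks before writing the file
stopifnot(all(trades$exitTime > trades$enterTime))
stopifnot(all(trades$direction %in% c(0L, 1L)))
stopifnot(all(trades$percentPair > 0 & trades$percentPair <= 1))

# Assumption: a plain CSV with a header row; the file name is hypothetical
out_file <- file.path(tempdir(), "demo_trades.csv")
write.csv(trades, out_file, row.names = FALSE)
```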
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Long and Short?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">It's still possible to make money if we think the price is going down, by what is called<i> 'short selling'</i>. This is how it's done:</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Say X are 10 shares in IBM and we think the share price is dropping.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Rather than buy X, we borrow X from a stockbroker and immediately sell it on the stock exchange. When the price drops we buy X back from the exchange and return X to the stockbroker. We have made a profit as we sold X for more than we bought it back. We would pay the stockbroker 'interest' out of our profit.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">'<i>Going long</i>' is simpler as we expect the price to go up, so the stockbroker does not need to be involved. We simply <i>'buy low and sell high</i>' directly on the stock exchange.</span><br />
<br />
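<span style="font-family: "arial" , "helvetica" , sans-serif;">The arithmetic of the short sale described above can be checked with a few lines of R. The prices and the interest amount are made-up numbers chosen only to illustrate the mechanics.</span><br />

```r
shares     <- 10    # X = 10 shares
sell_price <- 100   # price when we borrow X and sell it (made-up)
buy_price  <- 90    # price when we buy X back (made-up)
interest   <- 5     # what we pay the stockbroker (made-up)

proceeds <- shares * sell_price   # cash received from the initial sale
cost     <- shares * buy_price    # cash paid to buy the shares back
short_profit <- proceeds - cost - interest
print(short_profit)
```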
<span style="font-family: "arial" , "helvetica" , sans-serif;">*For this exercise, Long and Short trade directions are evaluated separately, which means you will get a total return for long trades and a total return for short trades. Each submitted trades file can contain just longs, just shorts, or both longs and shorts. </span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhS1FxijbCqNI4htokXbpkW0ddqzJ7NIuDkk6PIN9e08FU7TvQ8MfB4NLmQgGkwrL_oLI9P9VZbAzrj408agBOc3pfNtMlZseLBdrJzW84a90rYC5IvyOf0Gzp4R7baUYUGEBQn43QKLO0X/s1600/1668-long-and-short-trading.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="110" data-original-width="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhS1FxijbCqNI4htokXbpkW0ddqzJ7NIuDkk6PIN9e08FU7TvQ8MfB4NLmQgGkwrL_oLI9P9VZbAzrj408agBOc3pfNtMlZseLBdrJzW84a90rYC5IvyOf0Gzp4R7baUYUGEBQn43QKLO0X/s1600/1668-long-and-short-trading.gif" /></a></div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">How is the Strategy Evaluated?</span></h3>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">In general, a strategy's return is just the sum of the individual percentage gains of its trades. There is no re-investment of any profit.</span></div>
<br />
<ol style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">for each trade we determine the enter (EN) and exit (EX) prices<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">the enter and exit prices are adjusted for the trading fee percentage (F), and then the trade percentage gain (TPG) calculated as a percentage of the entry price<br />
<br />
long: ( (EX*(1-F)) - (EN*(1+F)) )/ EN<br />
<br />
short: ( (EN*(1-F)) - (EX*(1+F)) )/ EN<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The percent weighting is applied<br />
TPG = TPG * percentPair<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">TPG is capped<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">It is determined if a stoploss has occurred<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">TPG is then summed over all trades to give a total return</span></li>
</ol>
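<span style="font-family: "arial" , "helvetica" , sans-serif;">The long and short formulas and the percentPair weighting above can be sketched in R as follows. The function name trade_gain and the example prices are assumptions; capping and stop losses are left out, so this is only the core calculation and not the official scoring code.</span><br />

```r
# Hypothetical helper: trade percentage gain (TPG) for vectors of trades
# EN = enter price, EX = exit price, direction: 1 = long, 0 = short
trade_gain <- function(EN, EX, direction, percentPair = 1, fee = 0.0015) {
  long  <- ((EX * (1 - fee)) - (EN * (1 + fee))) / EN
  short <- ((EN * (1 - fee)) - (EX * (1 + fee))) / EN
  ifelse(direction == 1, long, short) * percentPair
}

# Made-up trades: a long from 100 to 102, and a half-weighted short from 100 to 98
gains <- trade_gain(EN = c(100, 100), EX = c(102, 98),
                    direction = c(1, 0), percentPair = c(1, 0.5))
total_return <- sum(gains)
print(gains)
print(total_return)
```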
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<br />
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Fees?</span></h3>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">There is a cost associated with each trade made, called the fee. This means that we don't quite get the actual price, as the exchange takes its cut. When we want to buy, we pay slightly more, and when we want to sell we get slightly less. The fee is applied to the price on entering and exiting a trade.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Hence when we make a trade, we need to be confident the returns will cover the fees we are charged. For this exercise, the fee F = 0.15%.</span></div>
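<span style="font-family: "arial" , "helvetica" , sans-serif;">As a quick worked example of what "covering the fees" means, the R lines below solve the long formula for the exit price at which the gain is exactly zero. The variable names are assumptions for illustration.</span><br />

```r
fee <- 0.0015  # F = 0.15% expressed as a fraction

# A long trade breaks even when (EX*(1-F)) - (EN*(1+F)) = 0,
# i.e. when EX/EN = (1+F)/(1-F)
break_even_ratio <- (1 + fee) / (1 - fee)
break_even_pct   <- (break_even_ratio - 1) * 100

cat(sprintf("A long trade needs a rise of about %.3f%% to cover fees\n",
            break_even_pct))
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">In other words, the price has to move by roughly twice the fee in your favour before a trade makes anything at all.</span><br />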
<div>
<br /></div>
<h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Capping?</span></h3>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">We need to have confidence that the return of each strategy is not influenced by one or two 'lucky' large trades. To do this we take all raw TPGs and calculate the mean (M) and standard deviation (SD) of these individual gains. Any trade that has a return greater than M + (2*SD) or less than M - (2*SD) is capped at that value. This has the effect of removing extreme values. </span></div>
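<span style="font-family: "arial" , "helvetica" , sans-serif;">A minimal sketch of the capping step in R is shown below. The function name cap_gains and the toy gains are assumptions, and R's sd() uses the sample (n - 1) standard deviation, which may or may not match the official calculation.</span><br />

```r
# Hypothetical helper: cap each gain at mean +/- 2 standard deviations
cap_gains <- function(tpg) {
  m <- mean(tpg)
  s <- sd(tpg)  # sample standard deviation (assumption)
  pmin(pmax(tpg, m - 2 * s), m + 2 * s)
}

# Nine ordinary trades and one 'lucky' large one (made-up values)
gains  <- c(rep(0.01, 9), 0.50)
capped <- cap_gains(gains)
print(round(capped, 4))
```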
<br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Stop Losses?</span></h3>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">If the price goes too far in the wrong direction then a<i> 'stop loss'</i> will be executed for that trade. For example, if we are 'going long' (hoping the price will go up) but the price falls by more than X% before the trade exit time is reached, then the trade will be immediately exited for you. This trade will use the price at the time the stop loss was triggered to calculate the trade returns.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">For this exercise the stop loss value is 5%</span></div>
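<span style="font-family: "arial" , "helvetica" , sans-serif;">The following R sketch shows one way a stop loss could be simulated on a price path. It assumes the adverse move is measured relative to the entry price and checked at each 5 minute bar; the function name apply_stop_loss and the prices are made up, and the official evaluation may differ in these details.</span><br />

```r
# Hypothetical helper: find where a trade actually exits given a stop loss
# times/prices cover the trade from its enter time to its planned exit time
apply_stop_loss <- function(times, prices, direction, stop = 0.05) {
  entry <- prices[1]
  # signed move in the trade's favour (negative = losing)
  move <- if (direction == 1) (prices - entry) / entry else (entry - prices) / entry
  hit <- which(move < -stop)
  idx <- if (length(hit) > 0) hit[1] else length(prices)
  list(exitTime = times[idx], exitPrice = prices[idx], stopped = length(hit) > 0)
}

times  <- c(0, 5, 10, 15, 20)
prices <- c(100, 98, 94, 97, 103)   # made-up prices

long_res  <- apply_stop_loss(times, prices, direction = 1)
short_res <- apply_stop_loss(times, prices, direction = 0)
str(long_res)
str(short_res)
```

<span style="font-family: "arial" , "helvetica" , sans-serif;">Here the long trade is stopped out at minute 10 (the price is 6% below entry), while the short trade runs to its planned exit.</span><br />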
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Are there any other trade restrictions?</span></h3>
<h4 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></h4>
<h4 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">trade length</span></h4>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">A 'dumb' strategy would just be to do a single trade for each currency pair that lasts the whole of the time period and guess whether it should be a long (price going up) or a short (price going down). Such a strategy might end up making a profit but it would be just guessing and not using the predictions. To prevent this, any trades that are longer than 4 days (60 x 24 x 4 minutes) in length will be eliminated.</span></div>
<h4 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></h4>
<h4 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">number of trades</span></h4>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">If you think the predictions are no good, it would also be possible to submit zero trades and have a return of 0, hoping everyone else's strategy loses money. To prevent this, all currency pairs must have at least 5 trades each.</span></div>
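<span style="font-family: "arial" , "helvetica" , sans-serif;">Both restrictions are easy to check before submitting. The R sketch below drops trades longer than 4 days and then lists any pair left with fewer than 5 trades. The toy trades are made up, and counting the minimum after the over-long trades are removed is an assumption, as the order is not stated.</span><br />

```r
max_len <- 60 * 24 * 4   # 4 days in minutes

# Made-up trades: 5 per pair, but the last pair_2 trade is far too long
trades <- data.frame(
  keys_pair = c(rep("pair_1", 5), rep("pair_2", 5)),
  enterTime = c(0, 100, 200, 300, 400, 0, 100, 200, 300, 400),
  exitTime  = c(50, 150, 250, 350, 450, 50, 150, 250, 350, 6400),
  stringsAsFactors = FALSE
)

# Restriction 1: eliminate trades longer than 4 days
trades$length <- trades$exitTime - trades$enterTime
kept <- trades[trades$length <= max_len, ]

# Restriction 2: every pair needs at least 5 trades
counts <- table(factor(kept$keys_pair, levels = unique(trades$keys_pair)))
short_pairs <- names(counts)[as.vector(counts) < 5]
print(short_pairs)
```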
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">How do we Submit our Trades?</span></h2>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">Once you have created a trades file, you will upload it via our web portal. Each team can only upload 1 file every 6 hours, so be prudent. Each file can only contain 1 long strategy and 1 short strategy. Make sure your file names for each submission are different and that each strategy has a different name.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">Several times a day we will calculate the returns and display them along with other metrics on the leaderboard.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;">You must submit trades for all the pairs in the data set. These will be divided into 3 groups and returns calculated for each group. </span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<ol style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Train group</b> - these are the currency pairs that have real names and prices. Use this set to make sure your calculations of the returns match ours.<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Public group</b> - this will be a subset of the remaining currency pairs (which have masked names and no prices). You will be able to see the returns for these pairs.<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;"><b>Private group</b> - the remaining currency pairs. You won't be able to see the returns until the competition is over.</span></li>
</ol>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
</div>
<div>
</div>
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Who Wins?</span></h3>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">In order to win there are several criteria:</span><br />
<ol style="text-align: left;"><span style="font-family: "arial" , "helvetica" , sans-serif;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">The top 3 teams whose combined best long and best short returns on the <b>private group</b> exceed those of the benchmark solution will qualify for the next stage.<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">We will then release a further data set. We will need to witness your code crunching this new data to generate the predictions file.<br />
<br />
<i>Note: we don't need the code itself or the logic - that is your IP. We just need to see stuff happening to make sure there has been no dodgy business occurring!<br />
</i></span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">It is the returns on this set that will determine the grand champion, so long as they still beat the benchmark solution</span></li>
</span></ol>
<span style="font-family: "arial" , "helvetica" , sans-serif;"> </span> <br />
<h3 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"> What do we Win?</span></h3>
<span style="font-family: "arial" , "helvetica" , sans-serif;"> </span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">There will be a cash prize for the top 3 teams beating the benchmark solution. Any team or individual may be asked to come and develop their strategy for a real trading company, so there is a good chance that this could lead to a job and wealth directly correlated to the performance of your solution!</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
</div>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></h2>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Skills, Team size etc...</span></h2>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">In order to do well in this challenge, you must have a couple of skills:</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span></div>
<div>
<ol style="text-align: left;">
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">the ability to think logically to come up with ideas<br />
</span></li>
<li><span style="font-family: "arial" , "helvetica" , sans-serif;">the ability to write code to test those ideas</span></li>
</ol>
<span style="font-family: "arial" , "helvetica" , sans-serif;">It will not be advantageous to form teams from the start. The benefit of team forming will come later on when you want to combine ideas.</span></div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <br />
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Can you tell us more about the predictions?</span></h2>
</div>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">There are 14 predictions for each currency pair for each time stamp. The predictions are weakly correlated to the actual percentage change in price at some point in the future. Prediction1 is the shortest-term prediction (1 hour ahead) and prediction14 is the longest-term prediction (96 hours ahead), with the remaining being sequential in their prediction horizon.</span><br />
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">The predictions are simulated but do represent what can be achieved. We cannot divulge any further information about the predictions.</span><br />
<br />
<br /></div>
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Where is the Data?</span></h2>
<div>
<span style="font-family: "arial" , "helvetica" , sans-serif;"><br />
</span> <span style="font-family: "arial" , "helvetica" , sans-serif;">See our <a href="http://anotherdataminingblog.blogspot.com/2019/08/buy-low-sell-high-data-and-code.html" target="_blank">next post</a> for information on the data and some sample code. </span></div>
<br />
</div>
</div>
Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com13Melbourne VIC, Australia-37.8136276 144.96305759999996-39.415753599999995 142.38127059999997 -36.2115016 147.54484459999995tag:blogger.com,1999:blog-8236580214478682937.post-43070033151613879362013-10-17T03:23:00.000-07:002013-10-23T04:10:24.691-07:00Techniques to improve the accuracy of your Predictive Models<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">I did a presentation at the MelbURN R User Group last night on a few techniques to improve the accuracy of your predictive models and also talked a bit about the Heritage Health Prize story and how our win unfolded. The video of the talk is below and the R code I used in the simulations below that.</span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">You will probably be able to enhance the code and make it more efficient. If you do, please let us all know. </span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">Thanks for all your positive feedback - hope you all had an enjoyable night. </span><br />
<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="344" src="//www.youtube.com/embed/1fIyQL9FiAk" width="459"></iframe></div>
<br />
<br />
<br /></div>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 12px;
margin: 8px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre {
margin-top: 0;
max-width: 95%;
border: 1px solid #ccc;
white-space: pre-wrap;
}
pre code {
display: block; padding: 0.5em;
}
code.r, code.cpp {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
<!-- Styles for R syntax highlighter -->
<style type="text/css">
pre .operator,
pre .paren {
color: rgb(104, 118, 135)
}
pre .literal {
color: rgb(88, 72, 246)
}
pre .number {
color: rgb(0, 0, 205);
}
pre .comment {
color: rgb(76, 136, 107);
}
pre .keyword {
color: rgb(0, 0, 255);
}
pre .identifier {
color: rgb(0, 0, 0);
}
pre .string {
color: rgb(3, 106, 7);
}
</style>
<!-- R syntax highlighter -->
<script type="text/javascript">
var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.childNodes[r].nodeName=="BR"){p+="\n"}else{p+=h(t.childNodes[r])}}}if(/MSIE [678]/.test(navigator.userAgent)){p=p.replace(/\r/g,"\n")}return p}function a(s){var r=s.className.split(/\s+/);r=r.concat(s.parentNode.className.split(/\s+/));for(var q=0;q<r.length;q++){var p=r[q].replace(/^language-/,"");if(e[p]){return p}}}function c(q){var p=[];(function(s,t){for(var r=0;r<s.childNodes.length;r++){if(s.childNodes[r].nodeType==3){t+=s.childNodes[r].nodeValue.length}else{if(s.childNodes[r].nodeName=="BR"){t+=1}else{if(s.childNodes[r].nodeType==1){p.push({event:"start",offset:t,node:s.childNodes[r]});t=arguments.callee(s.childNodes[r],t);p.push({event:"stop",offset:t,node:s.childNodes[r]})}}}}return t})(q,0);return p}function k(y,w,x){var q=0;var z="";var s=[];function u(){if(y.length&&w.length){if(y[0].offset!=w[0].offset){return(y[0].offset<w[0].offset)?y:w}else{return w[0].event=="start"?y:w}}else{return y.length?y:w}}function t(D){var A="<"+D.nodeName.toLowerCase();for(var B=0;B<D.attributes.length;B++){var C=D.attributes[B];A+=" "+C.nodeName.toLowerCase();if(C.value!==undefined&&C.value!==false&&C.value!==null){A+='="'+m(C.value)+'"'}}return A+">"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("</"+p.nodeName.toLowerCase()+">")}while(p!=v.node);s.splice(r,1);while(r<s.length){z+=t(s[r]);r++}}}}return z+m(x.substr(q))}function j(){function 
q(x,y,v){if(x.compiled){return}var u;var s=[];if(x.k){x.lR=f(y,x.l||hljs.IR,true);for(var w in x.k){if(!x.k.hasOwnProperty(w)){continue}if(x.k[w] instanceof Object){u=x.k[w]}else{u=x.k;w="keyword"}for(var r in u){if(!u.hasOwnProperty(r)){continue}x.k[r]=[w,u[r]];s.push(r)}}}if(!v){if(x.bWK){x.b="\\b("+s.join("|")+")\\s"}x.bR=f(y,x.b?x.b:"\\B|\\b");if(!x.e&&!x.eW){x.e="\\B|\\b"}if(x.e){x.eR=f(y,x.e)}}if(x.i){x.iR=f(y,x.i)}if(x.r===undefined){x.r=1}if(!x.c){x.c=[]}x.compiled=true;for(var t=0;t<x.c.length;t++){if(x.c[t]=="self"){x.c[t]=x}q(x.c[t],y,false)}if(x.starts){q(x.starts,y,false)}}for(var p in e){if(!e.hasOwnProperty(p)){continue}q(e[p].dM,e[p],true)}}function d(B,C){if(!j.called){j();j.called=true}function q(r,M){for(var L=0;L<M.c.length;L++){if((M.c[L].bR.exec(r)||[null])[0]==r){return M.c[L]}}}function v(L,r){if(D[L].e&&D[L].eR.test(r)){return 1}if(D[L].eW){var M=v(L-1,r);return M?M+1:0}return 0}function w(r,L){return L.i&&L.iR.test(r)}function K(N,O){var M=[];for(var L=0;L<N.c.length;L++){M.push(N.c[L].b)}var r=D.length-1;do{if(D[r].e){M.push(D[r].e)}r--}while(D[r+1].eW);if(N.i){M.push(N.i)}return f(O,M.join("|"),true)}function p(M,L){var N=D[D.length-1];if(!N.t){N.t=K(N,E)}N.t.lastIndex=L;var r=N.t.exec(M);return r?[M.substr(L,r.index-L),r[0],false]:[M.substr(L),"",true]}function z(N,r){var L=E.cI?r[0].toLowerCase():r[0];var M=N.k[L];if(M&&M instanceof Array){return M}return false}function F(L,P){L=m(L);if(!P.k){return L}var r="";var O=0;P.lR.lastIndex=0;var M=P.lR.exec(L);while(M){r+=L.substr(O,M.index-O);var N=z(P,M);if(N){x+=N[1];r+='<span class="'+N[0]+'">'+M[0]+"</span>"}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'<span class="'+M.cN+'">':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var 
R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"</span>":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L>1){O=D[D.length-2].cN?"</span>":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length>1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.r>r.keyword_count+r.r){r=s}if(s.keyword_count+s.r>p.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((<[^>]+>|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"<br>")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML="<pre><code>"+y.value+"</code></pre>
";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p<r.length;p++){var q=b(r[p]);if(q){n(q,hljs.tabReplace)}}}function l(){if(window.addEventListener){window.addEventListener("DOMContentLoaded",o,false);window.addEventListener("load",o,false)}else{if(window.attachEvent){window.attachEvent("onload",o)}else{window.onload=o}}}var e={};this.LANGUAGES=e;this.highlight=d;this.highlightAuto=g;this.fixMarkup=i;this.highlightBlock=n;this.initHighlighting=o;this.initHighlightingOnLoad=l;this.IR="[a-zA-Z][a-zA-Z0-9_]*";this.UIR="[a-zA-Z_][a-zA-Z0-9_]*";this.NR="\\b\\d+(\\.\\d+)?";this.CNR="\\b(0[xX][a-fA-F0-9]+|(\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)";this.BNR="\\b(0b[01]+)";this.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|\\.|-|-=|/|/=|:|;|<|<<|<<=|<=|=|==|===|>|>=|>>|>>=|>>>|>>>=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var 
a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"</",c:[hljs.CLCM,hljs.CBLCLM,hljs.QSM,{cN:"string",b:"'\\\\?.",e:"'",i:"."},{cN:"number",b:"\\b(\\d+(\\.\\d*)?|\\.\\d+)(u|U|l|L|ul|UL|f|F)"},hljs.CNM,{cN:"preprocessor",b:"#",e:"$"},{cN:"stl_container",b:"\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\s*<",e:">",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\
\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"identifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"<\\-(?!\\s*\\d)",e:hljs.IMMEDIATE_RE,r:2},{cN:"operator",b:"\\->|<\\-",e:hljs.IMMEDIATE_RE,r:1},{cN:"operator",b:"%%|~",e:hljs.IMMEDIATE_RE},{cN:"operator",b:">=|<=|==|!=|\\|\\||&&|=|\\+|\\-|\\*|/|\\^|>|<|!|&|\\||\\$|:",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"%",e:"%",i:"\\n",r:1},{cN:"identifier",b:"`",e:"`",r:0},{cN:"string",b:'"',e:'"',c:[hljs.BE],r:0},{cN:"string",b:"'",e:"'",c:[hljs.BE],r:0},{cN:"paren",b:"[[({\\])}]",e:hljs.IMMEDIATE_RE,r:0}]}};
hljs.initHighlightingOnLoad();
</script>
<br />
<h1>
Code from the presentation</h1>
<h2>
Footy Tipping</h2>
<pre><code class="r">#--------------------------------------
# Footy Tipping Simulation
# Ensembling of Tipsters
#
# Phil Brierley
# Oct 2013
#---------------------------------------
#clean all objects from memory
rm(list=ls())
#------------------------------
# adjustable parameters to set
#------------------------------
Number_of_Tipsters <- 12
Number_of_Games <- 2000
Number_ofSeasons <- 100
Tipster_Strength <- 0.6
#------------------------------
#simulate each season
for (Season in 1:Number_ofSeasons){
#generate random tips
Results = matrix(rbinom(Number_of_Games*Number_of_Tipsters,1,Tipster_Strength),Number_of_Games,Number_of_Tipsters)
#majority vote = median score
Results[,Number_of_Tipsters] <- apply(Results[,1:(Number_of_Tipsters-1)],1,median)
#find the mean score per tipster over the season
seasonSummary <- apply(Results,2,mean)
#stack the seasons together
if (Season == 1) {
runningSummary <- seasonSummary
} else {
runningSummary <- rbind(runningSummary,seasonSummary)
}
} #Number_ofSeasons
#give the columns sensible names
colnames(Results)[1:Number_of_Tipsters] <- paste('Tipster',1:Number_of_Tipsters)
colnames(Results)[Number_of_Tipsters] <- paste('Majority Vote')
colnames(runningSummary) <- colnames(Results)
#plot the results
bestPunter <- apply(runningSummary[,1:(Number_of_Tipsters-1)],1,max)
plot(runningSummary[,Number_of_Tipsters]
,type='l'
,col='red'
,ylim=c(Tipster_Strength - 0.1,1)
,xlab='Season'
,ylab='% of Games Correct')
lines(bestPunter,col='blue')
abline(h=Tipster_Strength,col='green')
bp <- mean(bestPunter)
abline(h=bp,col='blue')
mv <- mean(runningSummary[,Number_of_Tipsters])
abline(h=mv,col='red')
legend("topright"
, inset=.05
, c(paste("Majority Vote (avg=",mv,")"),paste("Best Tipster (avg=",bp,")"),paste('expected (',Tipster_Strength,')'))
,fill=c('red','blue','green')
,horiz=FALSE)
</code></pre>
<img alt="plot of chunk unnamed-chunk-1" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAAsVBMVEX9/v0AAAAAADkAAGUAAP8AOWUAOY8AZo8AZrUA/wA5AAA5ADk5AGU5OWU5OY85Zo85ZrU5j9plAABlADllAGVlOQBlOTllOY9lZgBlZjllZmVlZrVltf2POQCPOTmPOWWPZgCPZo+PjzmPj2WPtY+P27WP29qP2/21ZgC1Zjm1tWW124+1/rW1/tq1/v3ajznaj2Xa24/a/rXa/tra/v39tWX924/9/rX9/tr9/v3/AADBAAOaAAAAO3RSTlP///////////////////////////////////////////////////////////////////////////8A/x/iKeEAAAAJcEhZcwAACxIAAAsSAdLdfvwAABXtSURBVHic7Z0Le6PIlYYjt6bXcTvekbs7nUxkTzY7GSs7E2szsWTB//9hoS5AAQeoEoUAfd/7dFs2oAPFWzcoLr9LCCS/m3oDyDRQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoQ8SvyJwZUfyA75KxoXhQKB4UigeF4kEZLP54p/uIH15DQ5NJGSr+9LTVn4ePb4GhyaQMFf/+5bXy6R+aTApLPCiD2/j3R7bxS4S9elAoHpRY4p3OnecwAJkUlnhQKB4UigeF4kEZfObu0fbkmgfyyxd/6YtiRkNKW3fS+0v86WnTttt6vzt3lp8Cwyjik/fPL/6rWxbLT4FhHPFBq1sWy0+BgeIDWX4KDBQfyPJTYKD4QJafAgPFB7L8FBgoPpDlp8BA8YEsPwUGig9k+SkwUHwgTgo6T3/OHYoPxBWfWpxpx7vb7OeuGKU4PbvnMI/3r9m/fJa+JvX48FbMc9ezbzvrbXl/XJlrWfc66231Z7ZadU/DNsk/DmbAxM5z1kbxofSJ//Tfb8n7H++b15nmC5SztNtCcFV8aUhG5Zr9bf6Xup55py9tVifLj59e7IeKqZYy84p11tIhpU2C4vNfJfH3f852+Q/3edFTJT4rnKoo/uHrh1/uf81+/9PWeFU/ivnlDx1nl5dau1A257s/OpWHuqytyCpKs61aDrrC2doPs0WvTrVTXA5H8YH0iv+/bfL/P92/6jJn9vluo4rk8W5rq/pD9rcqeGqe8mrnZxqz30wxVn5MhH+ahdQyNy8HW62bTFMMhanv6LFwU7DtdPuRzXTmFYWf4gPpFf/L96cffzFlMdOnCquRqMQa8e/f3v5hnGzUv3K+UmWKZF7TZ3/phdTUSndBVe65eP2ZVex2CTsobj6Odzcvzryyrqf4QHrF//N/fvuLroR3qkOV7W/1h/0w4k/Pf/+mvR4f/v2s6wU7X5fMGyXI3ISkI+iF8lahXI9T4ss7llRxfn/Uau1HOURuizrFn0nP4Vym7x//u9ESt7aqb5T4ZP+D2ftZ1aD8lSW+uB5de80j/PTwZkt8WdW7bfyuOADI5KoWRQW42zoTnQ+KP5OeFKgm/ObFCs6qWKeNL8WrqldjckCljTfFN/eqltyvNmaZG6fEq4rc9upNTaC+d/rxterdNgh2nl6abfyZ9IvPi/Vq9d3XrdOrN+JPT1n9byXYHGDmqxnZbzdlxWwimIWyOf9VOSVgjuNV0LwTp1sJe1TvHNyrgPYjYa/+fAJTIF6Edvy+93vCcfyx/eSAPzyOP5ewFOxXzVvFk/1NyxWJlYUqZ+5OT0WZHQTP3J3N8lNgoPhAlp8CA8UH
wtG5aKtbFq74taUyOrcqT4+WU1W/TLXTGR//Ve+kCb0277E5FdU0/vZIwu3T2xFAYWyO4oPpE68kFsfplalJW8e8OTVgbC477tOH/od8eMeexzMfe6VfGJuj+GB8xJcH72ZAvBh0M9lCnaL7my6n5dxyBE8t6D82lx+X725+suMD7ghNFm+bSGNzFB+MV4k3Q25qXOyLHhCvlHh9Pvfj2+HDqzPXLG5PuAWMzR0f/upU9UmSVwT64/T8c1Y1SGNzFB+MVxv/8c0OtNny1xSvC2I51y5ulwsYm9ODvXpxWwO4BX6/UW2CNDZH8cF4VfVP23ygLcsINy9N8cqLHlGxc+3idrmAsTnnN/1dt4XP5uWPHKwP0VB8MD7is93stKZ2AKaYa8Xnl8zouXZxp8R7js29f6uKtyN1+sN07u0Eih9I/7CsqWjNQJsqeZL4x9va3HKELgkbm9tVqnqbm4rqQZV4aWyO4oPpG53Lj+PtQJuur/XIm55blPg/1+aaxYsR9oCxucoRQ/Wqq/I4vj42R/HBRElB64MjLKOPzVF8MBcRP/rYHMUHs/wUGCg+kOWnwEDxgSw/BQaKD2TgsGy1j5afX9nbK6KbQ249ozXnQ/GBuOJ/bxkq3ugVhtz6x2fPhuID6RNvjqv1mEs+BmcPtdWh+t+dO+TUkNtXLX5XDM1Uh9yqv46Wjs5pDhSf/yqJz0flfv2Sj8HZSeYk2r/KO+TU6bjiBGwiD7m559pGS0fnNAeKz38VxOe3v6kzrc4YXDHyVo7E6RPw7i3y0pDbeHU9xQfSJz4flfv0ktgxuHzkLT+j7o7E6fJsm3NpyI3iZ0Of+PyKmB9u89vWKyNv5Uhco8RLQ24UPxv6DufsMJsaRrdjcHaSvibi/pfyDrl6Gy8NubGNnw19KdDDbKYj95sdg7MDdXZOcYdcdrBe6dVLQ27s1c+GgBT0DsXkdJym4XH8XBhDfLtenrmbDctPgYHiA1l+CgwUH8jyU2Cg+ECWnwIDxQey/BQYKD6Q5afAMI74XXHpf1joBbD8FBhGEa8vC98mu9vGnOXvtq6XNy4KKW3dSe9/xegXc8eocLpx+eKvmaHi9Q08h1vnUZu+ocmkDK7q1Yjzx7eDcA8Axc8Z9upBoXhQYol3783s6EuSucASDwrFgzJYvHk6wOpD86ohip8zg4/j7R1hPI5fGFHO3CXihYIUP2dY4kGJcuaObfzyYK8eFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UL/HmQbWh77Wn+DnjId4+tFS9cyhmaDIpASU+cmgyKWzjQfETf1htk73warkhocmk+FX1n5Xz4z07d9eDl3jzNgLhXQRDQpNJ8avqdcdeeBfBkNBkUti5A4XiQfETf3paffztM3v1V4Rn525zfHhj5+6a8D1zl4nnufprIqDE78USf7xb3e75+rHF4d3Gt4zRnJ5fkv1t5v+BLxxcFDFeMbrf8BWji2Po6BxL/ELxa+Of24/ksjZ+wzZ+efiV+Nb3xw4ITSaFZ+5AiXUFjrPIKmfgppExGdzGnx2aTArbeFAGt/FZr74lU1D8nBkq3lycI16eQ/FzZugp27xTxzN3C8N3kCb7KQ7SsMQvlMG3ULV3/Ch+zgwt8eeHJpMytI0/PzSZFJ6yBYXiQfESv7tVnbht3NBkUnzEm17d++MmamgyKT4PRrAX1AsX2QwJTSbFR3z7ybkhocmkeIjvODk3JDSZFJ82/vhJ3x5/F9i7o/g54z8eH3qX9Jni07O+dYlg18XsjuMp/jLMTXwqu2qZfF6w5nLnBF84sxCfur8KFlJxqk9cv691LLX8PNGSgnmIL4pmKm1ompwpwFN812ILER+edechvqiVJfGp8zMgYksuEhdO3TVV
Zl2BeHmen/j9x7f9ajXW4VxpVizbFfEhrban+GKlwi46s40Zj+AuUFsKfJ9zl/2L8Jy7VPyz/JDENxbyIW3NRvJG2YwiZLopxKdpq8kW8e2bOUz8l9eszI8t3qnuha/k8/wr77zU9y+c5LluLuJbt7vtoGck8cl+dfNyGF7V1zYirYl3K/U0tdn+TPH5kv2LVwI3c+Zg8eEBOmoqOf3lZoo9YzHWRTt3dfFyv80x4FZ5Xc12vSCEi7chahsUQbxP97B9BzSX7BEvlaW5ipcadWFb28U3D/PLEL27vVqj6BWklYnnmi/W77EJQusmfiuVk+VMzguMG2uA+DjPuattRGszLO8spw9WmZw20+Yu1rPbm3VFvX3pDNAxU2+YT85pLOMjPhWXL3dp2phX45LPuRtBfFpJYCL8kdaXrCwotonVn2eKT8vw1Q1O6332+lo6qipP8c4OTKs/Si74nLt6glLnf23BloZMDFH/rfZHuaPaJEt4ipcOAJu20/o8aW6ZHzzEp80IxY9KnVXGqscb/Jy7dlIyZ3zEx7mhosh9adqSdYsFW7J7vbxIpSqpR0ybXV3xT2FrhRJfrWlrJc9dT22D6ntabKIa/Qthm+oVZerUE80ao1ioEetyvfpKY9spvuUQqFN8e+VcPyJs+UvYiGZ+SYsDkTIX1xeXataevkRlaoD4tGOlVfHNWL7vpBn+RIxqgsq92pZAYWptnwjtelsXv+d3mcZqnLxabVeTRnfNM3qQ+FrRbtYhjR3SsRUB76QJpTv0KOLTvp3eVvq7qFRV+XZUxIvnj/rjSmlvVmv1DWkT2jgb0L0VU793LuRkeLPo1WemvQHl4z+P1QrnEJtzyiW8wne2cm3i24I3J3f24DzP1QfeROMT2hAy4N0pviyGfW13LZbfahvLO+KFrwSEl9bWukbz6Sm+E99BmtGeenW2+NbmsS+iLLLzC3LnzKtJOZ96FphA/BhtfM5Zu0ja694J7+n1CF8I39ETiA9c59Rt/JkMuljGr+9VWV6cGsOux2rrY1kXFD9iG38mcrfW/9sRnCGIX8aTLQPER1E2snf3uK5aR0VZ8Syusr04SxJvuyQUj4NzkJJOJH7EwznSiis+evXiezh3uNWvnokYmvTh9ukmEv/l1fyLGZr0EnzGIQDfFxVk/yJcV0/CCD3jEIBfG585P6xWfOrVxRnv3j326ucNxZO4eInP+vPyOwWHhCaT4vtky/2Gh3NXhe+TLTPx7NVfE3yyJSh8siUoPm28uTP+cDP0pkkyIzxvobrcky3JZeBxPChDxXe8voDi58xg8V/+1vZsHIqfMwGHcyJq5k5u/yl+zniJ/7VdvX0J5aPgnuLnjE9Vv1utWi+96qgOKH7OLPSGCjIUHs6BEuvdsk6lsMqJsHlkLPg2aVAGvz/+7NBkUgaX+ONdW4+f4ufM0Da+Y8yW4ufM8FO2rVdpUPycGSqeJX6hDD6Ob793nuLnDE/ggELxoPiKVzU63yZ9RfiK32+T0488gXM9eN9QQfHXhU+JPz2tNqzqrwzPqn6/Crxxrj80mRS/Gyqyw/TD4DdUkDnh08Z/e0sOm4Q3TV4VAeIjhyaT4l3VRw9NJoVn7kCheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGh+EWyHhyB4hcJxc+M4UI81zN4RdOIv9T+uTgU3x36WsWvL5SyCOuZRPyl9s/FGT9h6/zn0BXNWfzw+mwc2jerEBK26QELL128z3arpQa6HyXndGxTnrC1n/p1/rGu/F2bLax7Lc8NYQrxjc2W07DumOfHOTVvVZikrz1qkTA/Mesyl0gr61j3gsW72y3vyUYpWGtCtiBA/LpcXbmStVRyO6qrmseOhdxIzv9KmqWtl8W7u8V3F00m3tmj69reqP6S74zwCqCzXDQKthO/mufqBbFXfO3r9WXW0tLrSo6p5oPm
uitFZ12mc73u2Loq04mvuqxNq2eFSi5xQ1Xzfa2ergauzVuvhQXXwjer6+zKTbUcXc0H5cRKFi4+1rWvdIh3AxRV1dqttcQNqzCi+DWZM+OJF/JUvXV3s/VayM5J43dpjlA9u1PcVArth1uw3dmN9egiVd06oYapTRAjJe07vij5jV1VbobTHMpRypQ7X6stetGqfi3tB2fHC1V/k6baUojHPllXa/1KW1HpYIj1e6Wn4WQbeZub+nrb32rW6hPfGiTf1PY2/5LizZa0N9LuYh1xpba3/lvXsbZPnmptmcv51Wwqb3SjGuoVX19N8y8P8fWKVOLS4uVc3JjkJV5cqKe+8KQpvi2/rov/HhkqeLvcyqLsu3nXGx0MFn+826rXEQqPtq6H9k5132a3916TWOIbEbqs9lYydl5vj6p1Myp9Ap+zE/1LxHjv3G6b+X/oe+9cQKJ7d9C643jVpyr02ITwRTu/0t+Pbo3daB1jpC/GmyZPzy+R3zTpk7C+fvGVICQzztjV4Ko+K+7qmeaH5hssproCZ66DemchpWUe4pOdftOk8OaSycRPtN6LMRPx54Ym50Lx5HxiiXc6dyvL78mciSRegCV+zrCqByXCmTu+P36JxDhzpzg0z9lS/JyJcOau8ukfmkwKSzwog9t49ZpptvHLg716UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB6UGC8cVG8ga752juJnTQTx+iWTx+9DQ5NJiSD++PDGV4wujsHiH29+/lGV+Ae+YnRRDO/cnZ5Wt8mBrxhdGOzVg0LxoMQS73TuVjnnbxUZHZZ4UCgelMHij3e6WmevfmEMFX962urPQ/OcLcXPmRjn6t1P/9BkUljiQRncxquhObbxy4O9elAoHhSKB4XiQaF4UCgelDHFkzkznng3D8QJM0Y0bpoIxU8YjeLHjMZNE6H4CaNR/JjRuGkiFD9hNIofMxo3TSSSeLI0KB4UigeF4kGheFAoHhSKB4XiQaF4UCgelBji3x+lh6GdhbpDcxsvor4NKFKw09Pq5iVWtCyd6g6VKMGO92Uk/4ARxKu9u78dHifj/fNLcvz0Ei3iPstFsYLttuo2sjjRVDr3kYIdVBaykQICRhCvbqfUuW44B7XRu22siMc/fN3G2jx702icaOoRYvYBgkOD7W5+yiLYSAEBI4jXqciycCSyUJEinp5/zkpApGDHh7+qqj5ONFvi4wRTpm2kgIARxKv7aOOJPz1tYkXcb1TVFynY8U7noUjRTFscJ5gSbyMFBJxbiX9/3MSKmEU5xSzxoWWqK9inF/XkwIWX+IhtvC5XsSLu9dXlm1ht/De9T+NEsyUzUodhqjZeVc6RevXGe7yIqsTHCrbbmjokRjRb4iMFu3/NExkQcF7H8aaQbmd5HJ+FiXbonR2DxTspMNlxPFkiFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoGCKV2/WEl6rhQSk+PfHbaJuYZp6O6YEUry5IPnzS37JtL05+7DKL6D+8KrvlVttp97S8YAUf3qypX2n7z/QN2eb21DU3Qg7ffPe8W4jvV3zaoAUn9/QoAt9cfuzvelMTVC3od3HuzFsjoCKT/RtTPrtqeqOlp2u5LMa/+bF3JH0TPHXiH7+gn0AQ2L6etZxVrmzxF8vulevanbbnCvHn/TN5eq/M5HirwxVx9ubFtXHfrX67uvWVvh5r57iyTVC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoPwHBuRsQnDgHN0AAAAASUVORK5C
YII=" /> <br />
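<p>The majority vote can also be computed directly from vote counts, and its long-run accuracy checked against the binomial distribution. The sketch below is not from the talk: the helper name <code>simulate_majority</code>, the seed and the default sizes are illustrative assumptions, and it assumes independent tipsters, as the simulation above does. It also preallocates the results rather than growing them with <code>rbind</code>.</p>
<pre><code class="r">
# Sketch only: simulate_majority() is a hypothetical helper, not code from the talk.
# Assumes an odd number of independent voters, each correct with probability 'strength'.
simulate_majority <- function(n_voters = 11, n_games = 2000, n_seasons = 100,
                              strength = 0.6) {
  stopifnot(n_voters %% 2 == 1)  # an odd count avoids tied votes
  # preallocate instead of growing the result with rbind()
  accuracy <- numeric(n_seasons)
  for (season in seq_len(n_seasons)) {
    tips <- matrix(rbinom(n_games * n_voters, 1, strength), n_games, n_voters)
    # the majority is right when more than half of the voters are right
    majority <- rowSums(tips) > n_voters / 2
    accuracy[season] <- mean(majority)
  }
  accuracy
}

set.seed(42)  # arbitrary seed, for repeatability only
simulated <- mean(simulate_majority())

# exact accuracy: P(at least 6 of 11 correct) for Binomial(11, 0.6)
exact <- 1 - pbinom(5, size = 11, prob = 0.6)

print(c(simulated = simulated, exact = exact))
</code></pre>
<p>The original code takes the vote over 11 of the 12 columns, an odd number, so the median there is always 0 or 1. With an even number of voters a tie would give a median of 0.5.</p>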
<h2>
Footy Tipping Parallel</h2>
<pre><code>Base R also ships a 'parallel' package (since R 2.14.0) that offers similar functionality.
library(help = "parallel")
</code></pre>
<pre><code class="r">
###############################
# Footy Tipping Simulation
# Ensembling of Tipsters
#
# Parallel Version
#
# Phil Brierley
# Oct 2013
#
###############################
#clean all objects from memory
rm(list=ls())
#set memory (Windows only; memory.limit() is a no-op stub from R 4.2.0 onwards)
memsize <- 3200
if (memory.limit() < memsize) memory.limit(size=memsize)
#----------------------
#parameters to set
#-----------------------
Number_of_Tipsters <- 12
Number_of_Games <- 20000
Number_ofSeasons <- 100
Tipster_Strength <- 0.6
threads <- 8 #depends on how many processors you have
#----------------------------------------------
# main function to simulate a season
#----------------------------------------------
simulateSeason <- function(f){
#generate random tips
Results = matrix(rbinom(Number_of_Games*Number_of_Tipsters,1,Tipster_Strength),Number_of_Games,Number_of_Tipsters)
#my tip is a majority vote - hence median score
Results[,Number_of_Tipsters] <- apply(Results[,1:(Number_of_Tipsters-1)],1,median)
#return the mean score per tipster over the season
apply(Results,2,mean)
}
#end of function
#---------------------------------------------
#--------------------------------
# the parallel stuff
#--------------------------------
#load library
library(snowfall)
#initiate clusters
sfStop()
sfInit(parallel = TRUE, cpus = threads, type = "SOCK")
sfExport(list = c("Number_of_Games","Number_of_Tipsters","Tipster_Strength"))
#start the clock
timeStart <- Sys.time()
#do the calculation in parallel
seasonSummary <- sfClusterApplyLB(1:Number_ofSeasons, simulateSeason)
#stack results together into a data frame
runningSummary <- do.call(rbind.data.frame, seasonSummary)
colnames(runningSummary) <- paste('punter',1:ncol(runningSummary))
colnames(runningSummary)[ncol(runningSummary)] <- paste('ensemble')
#record the time it took
totTime <- as.numeric(Sys.time() - timeStart, units = "secs")
myText <- paste('Avg calculation time per season = ', formatC(totTime/ Number_ofSeasons,digits=2,format='f') ,'seconds')
#stop clusters
sfStop()
#--------------------------------------
#plot the results
#--------------------------------------
bestPunter <- apply(runningSummary[,1:(Number_of_Tipsters-1)],1,max)
plot(runningSummary[,Number_of_Tipsters]
,type='l'
,col='red'
,ylim=c(Tipster_Strength - 0.1,1)
,xlab='Season'
,ylab='% of Games Correct'
,main=myText)
lines(bestPunter,col='blue')
abline(h=Tipster_Strength,col='green')
bp <- mean(bestPunter)
abline(h=bp,col='blue')
mv <- mean(runningSummary[,Number_of_Tipsters])
abline(h=mv,col='red')
legend("topright"
,inset=.05
,c(paste("Majority Vote (avg=",mv,")")
,paste("Best Tipster (avg=",bp,")")
,paste('expected (',Tipster_Strength,')'))
,fill=c('red','blue','green')
,horiz=FALSE)
</code></pre>
<img alt="plot of chunk unnamed-chunk-2" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAAwFBMVEX9/v0AAAAAADkAAGUAAP8AOTkAOWUAOY8AZo8AZrUA/wA5AAA5ADk5AGU5OTk5OWU5OY85Zo85ZrU5j485j7U5j9plAABlADllAGVlOQBlOTllOY9lZgBlZjllZmVlZrVltbVltf2POQCPOTmPOWWPZgCPZo+PjzmPj2WPtY+P27WP29qP2/21ZgC1Zjm1tWW124+1/rW1/tq1/v3ajznaj2Xa24/a/rXa/tra/v39tWX924/9/rX9/tr9/v3/AABN86vRAAAAQHRSTlP//////////////////////////////////////////////////////////////////////////////////wD/fF2juwAAAAlwSFlzAAALEgAACxIB0t1+/AAAFgFJREFUeJztnQt/m8i9hovtbuo67vpsj5yk6e4pdnrZrmiztU63Rhb6/t+qc4UBBjSIkQC97/NLbAvQH4aHucAA86uCQPKrqTeATAPFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUD8pZxO+ektvghd8er18OTMxS/1Jhs8+CSLKT5u2d/PvtMRmwH2q8Pb57jbNhlrOI394l4R4Oi8+StC/AgdnnIUuSynKm/pTHwrHmFyp+k/wgZWQq1fKn2AfXP9u0yB0iteZit1yttWMzbfckfshEq4l6AbX/bvWRsBF/rmSE638m5tiqZufJb8RqXsQyaaEXNQeEs7iZunu6+l6uu1AHaW2W3az6jMZK1bRqBXLh6xd7sIot+p1M+fbu3av8X3hXY/dCO7yc8ye5s8rvxOAc4kUavsqyLk/E1r89Gne/NvtAZwS5n+Rv7dhOq4k3C/ynEq+ylT6OyszkireUi66Kao1yY+xUNUVtjt2Kcla1WbUZ9ZW2xOdyeqYPpd3zOrcZPTeb0FqNTXE7fLm15XeicA7xooZTBkW+Wiv7Mj+IfaXToD6Y4ln7vX6x0+o53iygFtZLqQJCHUgrfVhJ7Ow8kYskqVxC5TQVrVDildC0nFo1QmyWrH1Br8zO8K60gRJc1TlWfGa8t1djU9wOX+6ssrSIwjnEb8QOkP/Fj5XaG2o/2GrLZgKdDbT4vMybNfF6AcfsrY6ujqiyGnRn6+Ljam1yv8mCT7qaWJVTVYByGxJTrZgv2M2yM3wrbed4r3jnUGqsxqa4Hb7cWfY7cTiD+KpMFJv/y6PdeZV4vVNkDVaJvzVfdcTbBXzizWLqS3Xx714Hideb4XyhXGs5w7fSvqLeftTTzDLN1dgUt8NXO8t8J46VM4jXu04dq9nV3+SR3S7qhQVZHNg2kZ0mjplULSgn2gV8Rf0h8bVisl7U6ym1/Zk5s8q1ljO8K22luWrcFUannuasyF2NTXE7fG1nxTtlOYP4jSrF1BaLY0Am3Nu4y5MqxzeaOlq8XUCdHTUad3XxtnFXiS8DVWusmlZmEYVTrOhZdq2NGc2VNjGnc0Z+bs5nyk1orcbTuDPhy60tvxOF04tXuVbqt+dyhariqtM5mVa538Uvke1Tva+yqn79k63jzQJyr+kqwzn1cRzY2TXxaoeW+VueKakPZmqjqFcf7BecteoZ3pV6Uq2LaEe8nGYXb63GprgdXn7t97aoj1XST3jJtqtBfHpsCwuaScTrM9KyNX9uKL6YKsfnppk/DRRfsHcOFooHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFDGiE/InDmh+BHfJaeG4kGheFAoHhSKB2W0eOfx9mGhyaSMFb970k/i5+1nWyl+zowV//bxpfY7PDSZFOZ4UEbX
8eZhdtbxC4OtelAoHpRY4p3GXWA3AJkU5nhQKB4UigeF4kEZfeXu0bTk2ifyyxd/7ptiToYvbf1JP5zjd09dLyK8APFTb0AkTiK+ePvQ8f7U5e+25adAcxrxg1a3LJafAg3FD2T5KdBQ/ECWnwINxQ9k+SnQUPxAlp8CDcUPZPkp0FD8QJafAg3FD2T5KdBQ/ECcFPRe/pw7FD8QV/ze4EzTo39nZS/F7tm9hrm9fxH/7Cx1T+r24bWc565nc2D4nbfHxB24MFW/9UCUatA5PdEsVY7NXa6N4odySPz7/3kt3v5w3zmCjeNXuS0F18VXhvzIo2ZTDtEj72fOUhvFTM/lMGpqKXkFfWMOE7s6ih/IIfH336+L7Q/3enzRVOV4OWSe+Pztp+uv9z+Lv/+Yaq/yRzm/+qHiZHqsxSQ1C4k53/zBKTzkbW3loSLFOkWLni4nmqVkDHMfXHk7HMUP5KD4f6bF//94/yL3u9jn0ke2krlve5eaoj4Xn2XGk/OkEzNf6BJ/6ewq/egI/9ILyWXsuLK2hii7wja3pi88LT/qn2YpJ8ebgoHiB3NQ/Nfvdl++3tvsJTOrlijFavFvn1//oZSJclf8q+bbTFqV9OKTWkhOrTUXZOFuxavf2/cm16uhZs1Eu1TZIKjKeoofyEHx//rLL/+nCttMNrOEC/nB/NLid89//6yHi334z7MqF8x8lWuVttwOIXr9ohaytUK1HifHV08s6excOrdLyYMiN3UIxR/JgdM5oe8ff1spiakp6ls5vtj8YMa6//LV1L4mx5f3oytjNsKPD68mx1dFvVvHZ+UJgCnH5S810SzlFg8UfyQHUiB389XaCBY5zanjK/EyByr0EVCr43X2tV7lkmpobV3HV+uRNzmZ1rsuCeT3dl9Kx3qiWcrN8azjj+SweJutk+SbT6nTqtfi5WClQpBZWh0Ber6cIf4ydqUfHUEvJOb8pnZJQFfbMqjJyxtdS5hfZqKp3PNyTGm26o9lYAq8N6Ftvzv4Pc95/Lb74kA4PI8/lmEp2PgGPt+EjOhev3InB4SPMQ48r9wdzfJToKH4gSw/BRqKHwh756Ktblm44m8Mtd65pLx06kyV7TJZTwve/bvZSPO02oL75mTUq3U5wXTRma4/M8/TN0fxgzkkXkosz9NrU4uuhnl76oC+OXHeV3XDlWfp4ghI7Tx5IajZN0fxgwkRX5286+7xstNNHxbyEt1fVV6s5lY9eHLB8L45p9PNdAnpzfj2U2rnOVd33VdV9KfNB8XbPztzvO5ykz3hH1X3eC3Hq+u5717z6xdnrl5c9uAVg/rmtg9/loeQ7YYzXXS755+e0sY8tQFliUDxQwmq40X5qjvazA5vi1c99dVcs7hZbkDfnOrsfXg1E2wX3WYly34zTxX1V/Ur9RQ/mKCi/im1HW3iQLhat8VLaVlazTWLm+UG9M3Z/rdarhaflHgzUazlf58pfiQh4sW+d2pT0wFTzjXijUk911bHVY4P7Jt7+6zLeLezLtV33K3MvKKoKneKP5rD3bL6Ar3uaJN51yf+8bYxt+qhK4b1zWUqZ5sJpouuMKdzep66/8bcnsc6/mgO9c7Z83jT0abKa9XzpuaWOf77xly9eJVpw/vm9BlDdR6vDxAlvjqzeFc2GvrSQfE9RElB54sjDCfvm6P4wZxF/Mn75ih+MMtPgYbiB7L8FGgofiDLT4GG4gcyslu23kazQzroKr3V5Xa4t+Z4KH4grvjfGsaK13pbXW7yr0P9s0dD8QM5JF6fO6s+F9sHV51qX/3deUJOdrl9UuKzsmum1uVW1P88WTp6pzlQvP3TJ972yv380fbBmUkyQ+fqLgzzhJy8HFdegC08XW46XvOmjvjp6J3mQPH2T494+/ibvNLq9MGVPW9VT5y6AO8+It/qclMBT1XWU/xADom3vXLv14Xpg7M9b8pkoydO5WdTnbe63FRAip8Jh8SbOjn74dY+ylTreat64lo5vtXlpgJR
/Ew4dDpnutlkN7rpgzOTVA/5/dfqCblmHd/qcrN/njodvdMcKL4P1c2mG3K/mD4401Fn5pRPyImT9Vqrvt3lVrBVPx8GpOBgV4yl5zINz+PnwinEd+vllbvZsPwUaCh+IMtPgYbiB7L8FGgofiDLT4GG4gey/BRoKH4gy0+B5jTis/LW/2GhF8DyU6A5iXh1W3haZLetOcvfbYHjOM4fX9r6k354iNGP+olRz+XG5Yu/ZMaKV0+F5bfOqzZDQ5NJGV3Uyx7nd6+55xkAip8zbNWDQvGgxBLvPpvZ05Ykc4E5HhSKB2W0eP12gOS6fdcQxc+Z0efx5okwnscvjChX7grvjYIUP2eY40GJcuWOdfzyYKseFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB6UIPH6RbVDx7Wn+DkTIN68tFSOORQzNJmUATk+cmgyKazjQQkTnydpsfEMLTcmNJmUsKL+g3S+vWfj7nIIEq9HI/CMRTAmNJmUsKJeNew9YxGMCU0mhY07UCgelDDxu6fk3S8f2Kq/IAIbd6vtwysbd5dE6JU7IZ7X6i+JATl+483x27vkdsPhxxZHcB3f0Ueze14Xm1vh/4EDDi6KGEOMblYcYnRxjO2dY45fKGF1/HP3mZyo41es45dHWI7vHD92RGgyKbxyB0qsO3CcRRLLyE0jp2R0HX90aDIprONBGV3Hi1Z9x0FB8XNmrHh9c4739hyKnzNjL9naRh2v3C2M0E4a8dPbScMcv1BGP0LV3fCj+DkzNscfH5pMytg6/vjQZFJ4yRYUigclSHx2KxtxadzQZFJCxOtW3dvjKmpoMikhL0YwN9R7brIZE5pMSoj47otzY0KTSQkQ33NxbkxoMikhdfz2vXo8/m5g647i50x4f/zQp6QpftbwPB4UigeF4kGheFAoHpQw8Zt3r5sk4encBRH6njvxj++5uyRCb70SeZ7iL4nAoj65WufQRf1+6g2IDRt3/eztr8r8/iQHwfigngg9QZcmPupO9yncNz7tzUQ7fb8fuIe9K/ZM6Q4RdqS1I+z7DtclvOeu2mY3KXtFa9HWtM69JhdsLt34vG/8L8rjoBa42qyOlTXX0l5u372hns3sWKo1pSqxWt8/33vuWk46d3p775uvunvdu6ebRtQH/dV9daDo3/vad4pyBftmpH31YV+f7G6DXon3AGgeYTVHVUTfweBsnxNu3z7oXcu1PNF12JzwPXd7MmdCxPe9564b3zHVKnmaW1D7uN978kG74Ny3FlW5sePbvo3yl++tesT9y1dNu9/2ZExPDt7Xgrlr3LemeOqxfWOpfXujuzj3AxW1IukwA1rQ9X006Nut+uJwjdrTFOurrN1D3vnRaLgczpC1DWkdLiGcu1XvrcxmRl8Lu1zkmHn1wAebhGEceXYZOiZNtDdihJa/E3KaE3UVuffjWRkwJk2c0AHZaXLmv4XjOf+4cwh7dQEEXqsf+BBNSGgyKaGdNHzr1YVx9jqezAOOLQsK63hQ+GZLUJbWH08iQfGg8HQOlNDTufxWDT0TMTSZlNDTOf0vZmgyKaEDFYh/vK/+kgir44XzPEn41qsLgq16UCgelCDxoj3vH1NwTGgyKaFvttyseDp3UYS+2VKIZ6v+kuCbLUHhmy1BCanj9ZPx+dVUD02SExD40CTfbHlp8DwelLHie4YvoPg5M1r8x792vRuH4ufMgNM5L3Jm5q//KX7OBIn/uVu9GYTy0eOe4udMSFGfJUnnrVc9xQHFzxk+UAEKT+dAifUqFKdQSCwRNo+citDXnRUcTfqyGFDHs3fukhid47d3XS1+ip8zY+v4nj5bip8z4y/Zdt6lQfFzZqx45viFMvo8vvvZeYqfM7yAAwrFgxIqXpboyGPS
XByh4jdpsfvCCziXQ/ADFRR/WYTk+N1TsmJRf2EEFvWbZOCDc4dDk0kJe6BCnKbncUaoIDMhpI7//Frkq4IPTV4UA8RHDk0mJbiojx6aTAqv3IFC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoIwWv71L5XCEnldbz1D8Te/Hy6MngTHGnctS4f9hAePO3dT2xM1N144ZfEDc9HzjXEeX3YZqW7oTWEQZaXL3vD5ipMkbSXPaobX17+EbE7dzgdoqbux3mhvSu7+6Vtyz0rBg7aV8CelMnE5I9d9G7Fz56KJeZHf5TvO8PYJFX2i7bcpF9aGeIdsHRv8etl/p2Du1HXFT/8NZV31/tY+manNv6h/9KfRtj7s2m0+bS5UGGwv7E9fI7U6uqqbUvzG+cZepkSY9I5fckDkzWnwnrWPqpiNz1KZ5t6o+yWYEN9N2hGwu2rH2+vZ1BHVn98So1bat5W7c/Ost6XxFXTPDNvQd04w48+lcb2F9PF2F+6hVHfnt2tET+JWjVlSu7aivxxLvNO4Sw2/JnIkk3sP8TudIBa/cgRLhyh3Hj18iMa7cSfL2NVuKnzMRrtzVfoeHJpPCHA/K6DpeDjPNOn55sFUPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYNC8aBQPCgUDwrFg0LxoFA8KBQPCsWDQvGgUDwoFA8KxYMSY8BBOQJZe9g5ip81EcSrQSa33w0NTSYlgvjtwyuHGF0co8U/Xv30Reb4Bw4xuijGN+52T8ltkXOI0YXBVj0oFA9KLPFO4y6xHL9V5OQwx4NC8aCMFr+9U8U6W/ULY6z43VOqfufta7YUP2diXKt3f4eHJpPCHA/K6Dpeds2xjl8ebNWDQvGgUDwoFA8KxYNC8aCcUjyZM6cT7x4DccKcIho3zQvFTxiN4k8ZjZvmheInjEbxp4zGTfNC8RNGo/hTRuOmeYkkniwNigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHJYb4t0ffy9COQj6hmcaLqB4DihRs95RcrWNFE+mUT6hECba9ryKFB4wgXu7dze34OIK3D+ti+34dLeJGHEWxgmWpfIwsTjSZzk2kYLk8hEykAQEjiJePU6qjbjy53OgsjRVx++2nNNbmmYdG40STrxAzLxAcGyy7+lFEMJEGBIwgXqVCHMKREKEiRdw9/yRyQKRg24c/y6I+TjST4+MEk6ZNpAEBI4iXz9HGE797WsWKuFnJoi9SsO2dOoYiRdN1cZxgUryJNCDg3HL82+MqVkQRZRczxw/NU33B3q/lmwMXnuMj1vEqX8WKuFF3l69i1fGf1T6NE83kzEgNhqnqeFk4R2rVa+/xIsocHytYluoyJEY0k+MjBbt/sYkcEHBe5/E6k6azPI8XYaKdeotzsHgXBSY7jydLhOJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhRM8XJkLc+wWkhAin97TAv5CNPU2zElkOL1Dckf1vaWafNwdp7YG6ivX9Szckk69ZaeDkjxuyeT2zP1/IF6OFs/hiKfRsjUw3vbu5VvdM2LAVK8faBBZfry8Wfz0JmcIB9Du4/3YNgcARVfqMeY1Oip8omWTBXyosS/Wusnkp4p/hJR718wL2AodFvPOBaFO3P85aJa9bJkN9W5dPxePVwu/zsTKf7CkGW8eWhR/tokyTefUlPg21Y9xZNLhOJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4
UCgeFIoHheJBoXhQKB4UigeF4kH5L5Avf6BaTy4CAAAAAElFTkSuQmCC" /> <br />
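<p>The same season-level parallelism can be sketched with the base <code>parallel</code> package instead of snowfall. This is a minimal sketch, not the code used in the talk: the worker count of 2, the seed and the smaller problem sizes are arbitrary illustrative choices, and the parameters are passed as function arguments (an assumption made here so that nothing needs exporting to the workers).</p>
<pre><code class="r">
# Sketch only: the footy tipping simulation using the base 'parallel' package.
library(parallel)

Number_of_Tipsters <- 12
Number_of_Games    <- 2000
Number_ofSeasons   <- 20
Tipster_Strength   <- 0.6

# simulate one season; 'season' is only the index supplied by parLapply
simulateSeason <- function(season, n_games, n_tipsters, strength) {
  Results <- matrix(rbinom(n_games * n_tipsters, 1, strength), n_games, n_tipsters)
  # last column = majority vote of the other tipsters
  Results[, n_tipsters] <- apply(Results[, 1:(n_tipsters - 1)], 1, median)
  colMeans(Results)
}

cl <- makeCluster(2)          # 2 workers, adjust to your machine
clusterSetRNGStream(cl, 123)  # reproducible, independent random streams per worker
seasonSummary <- parLapply(cl, 1:Number_ofSeasons, simulateSeason,
                           n_games = Number_of_Games,
                           n_tipsters = Number_of_Tipsters,
                           strength = Tipster_Strength)
stopCluster(cl)

# stack the seasons into a matrix, one row per season
runningSummary <- do.call(rbind, seasonSummary)
print(dim(runningSummary))
</code></pre>
<p>The <code>parallel</code> package also provides <code>clusterApplyLB</code>, a load-balancing counterpart that may be closer to the <code>sfClusterApplyLB</code> call used above.</p>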
<h2>
Model Synergy</h2>
<pre><code class="r">
###############################################
# If two models have the same RMSE error
# but are different, what do we get by
# averaging them?
#
# Phil Brierley
# Oct 2013
#
###############################################
#clean all objects from memory
rm(list=ls())
#number of cases
cases <- 100000
#how much worse is one model than the other
worseness_factor = 1.0
#generate random errors
errors1 <- rnorm(cases,0,1)
errors2 <- rnorm(cases,0,1) * worseness_factor
#average the 2
errorsAve <- (errors1 + errors2)/2
#calculate the RMSE
rmse1 <- sqrt(sum(errors1 * errors1)/cases)
rmse2 <- sqrt(sum(errors2 * errors2)/cases)
rmseAve <- sqrt(sum(errorsAve * errorsAve)/cases)
#-----------------------------
# plot the results
#-----------------------------
op <- par(mfrow=c(2,2))
#bar chart of the RMSE of each model
bp <- barplot(c(rmse1,rmse2,rmseAve)
,ylim =c(0,1.1*max(rmse1,rmse2,rmseAve))
,ylab='rmse')
axis(side = 1, at = bp, labels = c('model1','model2','modelAverage'))
abline(h=rmseAve,col='red',lwd=4)
#lineplot of errors
num <- 25
plot(errors1[1:num]
,col='red'
,type='l'
,ylim=c(min(errors1,errors2),max(errors1,errors2))
,ylab='error'
,xlab='case')
lines(errors2[1:num],col='blue',type='l')
lines(errorsAve[1:num],col='forestgreen',type='l',lwd=3)
abline(h=0)
#error distribution
br <- seq(from=min(errorsAve,errors2),to=max(errorsAve,errors2),length.out=100)
hist(errorsAve,breaks=br,col=rgb(0,0,1,1/4),xlab='error',main='error distribution')
hist(errors2,breaks=br,col=rgb(1,0,0,1/4),add=TRUE)
legend("topright", inset=.05, c('model2','averaged models'), fill=c(rgb(1,0,0,1/4),rgb(0,0,1,1/4)), horiz=FALSE)
#scatterplot of errors
plot(errors1,errors2,main=paste('correlation=',cor(errors1,errors2)),xlab='model1 errors',ylab='model2 errors')
</code></pre>
<img alt="plot of chunk unnamed-chunk-3" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAA81BMVEX9/v0AAAAAADkAAGUAAP8AK48AOTkAOY8ATLUAZrUiiyIra9s5AAA5ADk5AGU5OQA5OTk5OWU5OY85ZmU5ZrU5j9pAAABMAABMAGZMTDllAABlADllAGVlOTllOY9lZgBlZjllZmVlZrVltdpltf1rKzlrK2Zro/6HTACHvv6POQCPOTmPOWWPZgCPjzmPj2WPj7WPtY+P27WP29qP2/2jazm1ZgC1Zjm1ZmW1tWW124+1/rW1/tq1/v2+h2a+o4++vr6+vv7Ojr7ajznaj2XatWXa29ra/rXa/tra/v39tWX924/9/rX9/tr9/v3+vr7/AAD2Sw4pAAAAUXRSTlP///////////////////////////////////////////////////////////////////////////////////////////////////////8A///7dnOhAAAACXBIWXMAAAsSAAALEgHS3X78AAAgAElEQVR4nO2djZ/ktlnHq7uhywzbkF5uDtgeIcBC4ZJAoNu5NGWBuyFtb2bXmf//r8HWu2TZlmRZkud5fnefnRd75Ef+So+kR7L9swYFUj8rbQCqjBA8UCF4oELwQIXggQrBAxWCByoED1QIHqgQPFAheKBC8ECF4IEKwQMVggcqBA9UCB6oEDxQIXigQvBAheCBCsEDFYIHKgQPVAgeqBA8UCF4oELwQIXggQrBAxWCByoED1QIHqgQPFAheKBC8ECF4IEKwQMVggcqBA9UCB6oEDxQIXigQvBAheCBCsEDFYIHKgQPVAg+RGR1GswKgg/R8HmsVAg+jRA8UCF4oELwQAUF/Il2DO+Tm7NaAQH//PVD+/f82WNye1aig13mgYB/+vJT+/f5G4jgn+6ot3tp5h0K+F8BrvHHLdwaD7uNP2118FOBsEqFvfoInW6+hVrjqVQbn6bUJ4w3L6zzLWjw7rTi8ZH/DVRFZ7siU/y0APhofAg+oxB8GlVkip8ih3MjY1kEvwpF1vjzq09DaSH4VSjW1R8dY3gEvyJhG59GFZniJwSfRhWZ4icEn0YVmeInBJ9GFZniJwSfRhWZ4icEn0YVmeInBJ9GFZniJwSfRhWZ4icEn0YVmeInBJ9GFZniJwSfRhWZ4icEn0YVmeInBJ9GFZnip+hVttsjefHgSgvBr0KRV9J883jc8ssq7LQQ/CoUfSXNad+7kgbBr0hY49OoIlP8hG18GlVkip+wV59GFZnipzng3VfSXD34U3fRoLt/U1KbsN2xxoequzfA4R7B99K6dvC0S/v+N4mvG5yvPOCdl0nDAE/vBvL89S9B1nj3rVBggGfZdue9pEbAuzZFB3Ca/q1QgIB3qrgpmzHwjo2R4J23QkHwY7osa8BmpMpvHOUC2/g0Sg8+sM2eAN9Dj736NPIAH0g+HfiN9UqF4NNo0pRLYJUfq8GB+y8P/rI2BWdxOO9TO1wCnf0C4I09UoIvjTFcwVkcVGrwG6evHx2yIXh/BWdxUFnAjw7ZEHyAgrM4qCnwF/nHSy2iXY+k2wvIjeXAl+vcle8WJge/2+029ncj4EcKxcb5FsGnkRd4b/KU+25nfSf/OpJB8IG/TKXFwW/Ui6uJGnEH8usdgi8BftcEgKfcd726yr6KBN+VJLULgk+jiQQvDavAfuR5hdeqvODefeccH0yDt4oSgk8jH/A7b/B0Zw28xn2g/EyCt31I7CTNzZ/eVbTKtn7ww8j62jDmWpWnvHZS8eCVD4leiHHcV7SufiXgPav8hsOVnCR3/rKbBO/oHTAfIjdEL8Q4Dq00RfB9XZoA8KzCX2hhEd9I7hc3+Y38o33W3u5sHzIN/kS2vQewNM33j+fXvRvaXht426ONyQ+8y0n3
JCq85GRwF9ut38g/TDtzk/hFAPintz+8fnrbe+gQvX21dfPq6wPvyuSAXAletHeytzZNfiMLCOOkk6Pgm0nw9oBAfNaKTCR4p64NfIgmwIsmdrjKq6mDnQ5+x+L2GndeMkz1wesjAvVRvfNw9YNPm6rnSppKwUvIF9kxc4IX0OmfjbYXjbmo7jj/1uHrtXhuI8b7cpP+yR98gK4O/Eihd+fdkFpzIzro7g655houTWP4BW0Ep+/Wr/IO8KpjKN535UtuQPDD6pq4sGbOkIqwyZGZG/zFeKe7BYO7CX5wBqexwZu/9wZ/3B/IvrexslW2KwJvt86N0d+7WB0B3WnrRcYHPGs+VFfwIrcNGMwlOne/b//Zma/tSpqsrn6k0HMxAJdGTspJfM4qr39zsfoBan/jdztrtp4u3DCi8jSdi+4dtH6DbbAh2at/9cce+NqupMnZuRsr9EwbF3gZSOmBt4naHUCt42fspi3E2hhzbzserKE/Uv0B2X70DDbFXf2B3B96rr62K2lyBnDsQh/+gMU6NA3eFcSA0cY7Azhjhb6TiJqrpResid+IKq/9qreehnoEV2Ngf7cznP3GmHRVY0etkbFTmHb1/o8KvzbwAxpv42U3S3bsGPiNGUNTO1vcB0b6vW80126B36mxoxbvDwXffA8XfEysXq2QYmtuKHgBZWM18hutgFA5O/1uGVXeBk+/24lgYcPtcBrc16ird+rqwIfH6gWKjYzWdX0zUTs31oDOBu8e7Q3IAb4xWpOdsZgjFDy6+pC891rdxgCvh9HYx43NfXjhvC0H+J0NfiAKoAx2idf49w+De9hC8EpGhF21x8bM2kbtJ7tjQeAleZGKPpHHv1RdjXGDlXiND3Z3VwTevRZhJO9KG726KUTGjImErhQA/mKBbxR4PflY8CG6NvDhU9JKWkU262YzBj6Ae684ab15Lf2NNqwcNViJ1/jgni2Cb7SlTvxD4wBvO4XdrgkFr0/EGON38fWmiQYP29XHTsuy2uhquzcG+OZiog4Br3z9ZjcAXo4a+nEAdPVpZCaodbQaB3jV675YrOPBi6PtbPDanMGwwboQfHjehTqkA+BllRcrbaLBm/138Vlx53GC9OB7HV4EzyXba4HCuApuBHwQdxWPFUm6wbtvvxMHnrb8vbYfwTNpPlyg2Jibtb2GlsN7yAX+siz45rjFGu/IeyOXPbF10DJ6ZoHf6WsqZ4HXLpfoXyYlFwakA9+ctgi+n3d16jn4xgFedcLEqlm5JQy8AC4HChZ3mZ5rvi+6jT/dfIvg7byrKucDvrHBB3IXvl6NEHOAb863CN7Ku3bepRPfWX27RhYPOT+7OPhBg53C4Vx43q25MDlnYgHQ+gFyP/3VW1p3QSUcYrBLa3omTS3gpXzA6wsyzFdvXayYzcWbO9b4RHKClyM6d41X2sgFW4GS3QV+WATv+8tUGgavLXgXsio83XMTw90K1l1ar++bSCR4EKtsQzQC3kHU1RS7HiMxqYsN3rv0xIEHciVNgNzgmwEX3qvw7t08pMds6BByWfCQr6Rxy0zQ6q/3YTgvmo2R3l3IAR7ElTQhCgSf7P7JPfC+P8Q2Plgejxi1RmoO8DOOb8r09IuDH0nr2sH7PGJ0Enw6Ifi4X4bLfsSonnchrULHDdP8JcEHOhEEHyr7EaN61FI03RnBi8FcaK8BwQdr9BGjF3uhUx7w4Z1FBJ9GKsGLte4h9HFSwdpFjREQfIz6VxvoCV7ygo8bGiL4GI2DNxUTic0gBJ9GIwki+OXwIfhgIfg0Sp7g0kLwaYTgEfw6FD9Jg0+o6Od9RYpfiDE0UYHgV6EZCzEGJioQ/Co0Y+mVNlGhp4XgV6HINn5kogLBr0LYq08jUODxSholUODdaSH4VQjBpxEU8JBX2ToFBDxeSWMLCHi8ksYWFPB4JY0lIOCxjbcFBfxIWgh+FULwaYTgYYAfuMffioTgQzV2O9cVCcEHa+R2risSgg/X8O1cVyQE
HyHjdq5Tj2qtVAg+RkO3c12REHwaIXgQ4EMumqxUCD5GCN6RFgTwQ3lfkXCSJo2AgMeFGLaAgMeFGLaggMeFGJaAgMc23hYU8CNpIfhVaA74gStpVqEU505X6fyEazAr4TV+4tQU+GVNFXHalDx7TO6C4NPqOsA7O3fzjrfELxF8xC7hAZzZJiX/JYKP2CU8gDPbpOS/RPARu4QHcGablPyXCD5il9RtPGolSt2rR61ECB6oEDxQIXigQvBAheCBCsEDVXLw7x/Y69PbR/4n4pfP7wi5sdc2r0AHQl48jGyn5+Mwtg/fY3AXfmbG0pC7jNqyKPiTfZmx7y9PrktV6xeb3BgWPR+nm0+nwUJN9xhLhp2Z8TToLlO2zAN//uKO7A+8fHXHIy+/euje72m9/W44zj/+y26P4/rAP/3tbZeXIbHzcdwPO0K2x0Qy7ZkZS4PvMpXITPCfPT7d3XdFkBXEbRff7963B+4MGwE/8UtabtemU1tdR8srBd9/6Ie1x0Qy7ZkZT4PuMmXLTPCvqQGH++4Qp3339/3DsQvw76fAT/zysELuVKf9yMapGq/O2HAy3ZmZqPHi5I3Zkgi8rLdt08IO5wve+cvnd+vz8w070ZM1frR9ZnuMJMPOzGgafJcJWxKBly31i89pS00mXf3oLw+06s8yrYgOE1bT8zHaq+d7DCbDz8xYGnKXUVtwHA9UCB6oEDxQIXigQvBAheCBCsEDFYIHKgQPVAgeqBA8UCF4oELwQIXggQrBAxWCByoED1QIHqgQPFAh+HI6dle80CWHp+6+I1ML9tIKwRfT+dWn5vjF62558fuH5g/sooJsR0fwxcQWkx+7S9zOt2R/zLuyGMEXU1fjTy1qepXb+dWPeZeTI/hyauv4yx9uyYuv7g+8jc94hzEED1QIHqgQPFAheKBC8ECF4IEKwQMVggcqBA9UCB6oEDxQIXigQvBAheCBCsEDFYIHKgQPVAgeqBA8UCF4oELwQIXggQrBAxWCByoED1QIHqgQPFAheKBC8ECF4IEKwQMVggcqBA9UCB6oEDxQIXigKgv+8PLxpG74c37NXtuvTuaNv9otp4w3BgqQaf6Ykedb8SBg/k588fzu5WOXEP10ZC9Pd92tz+gd0NpXvu3E7o7EPzX0lll8T5HKUe6pJXYkjkcQFwevPjzdqWfGm+D1LZVJoZ4wstt8oI/+5u/EFy259izQTy8fz7fbjuDzu/aTLBVyW1sUXjzwT92m7l5pfE+WCv+5mRh/sexZEDwr0U93P7998Z/dnwftiy5PR/Ly71iN70r2fZsPcvPHbuO/dzX+i1ta1PdtDv6r2/Jjux9L4HxLt+VSe9DuaAO5EebvnyeM7HY80Xzzd/ylq7EUGaXadDxvPp1vRXE60hTFtu638tPx5+07vidPRf5cT0y96FoOfHf0Q1f6aHnmhZqWWGZhu717y7NyYEWZbqSu/ubTkXn8AyvB7Zc8gfPtzZ/6JXgpmcdWueHZ68z/xcOgkcJZc9zUPfB3/OXpS5oZWY+f7lpGpxd/ye55Rz/JbY1e48+/+I/WBL4nT8X+Af0kXwwtB/7I6gP1gPTPkVaEvXCJ3Ud+5m65zVu2I2/j231N8DwBfsoXM9wU9+W68TI3HB9rUieMPErw/J38gvlh5qs7Hbrysmeu/sjx820nshWfWh9PK7bYU3rzAy1GWmLai6YFwbPCbp6q9o0O/shcfddrYXgV+PvufR98+zYv+KMCL4xnuWHZ4+X2YIBXRk7XeI7MdOdiG6Mpt7Vu5JP4RL3KvjFT4T83E9NeNC1b4xt+lsZrPD+/66vxbDPPyISRQ208R6YcAiPHPrFzJbYxj80+/eM7WqS2Yk8DvJmY9qJp2TaeN0ksA23vUn7R6G38iXbjLPBtG0/LdLcPP6c8gbzgu2NTkFs7NzR77DQ/3ckaP2Qk78TzfGq9eqPG85JO++r8W3mo53e03Kjq3L0Te7JU+M/NxPiLla8Fe/Vs2KlO
lewY82GP6tWfmDs8kJe/l+BZp/hA9+lefqd1mHOCp736rWE8ywHL3okNR/6e3oR41Eg1Chgax3e18sh6520jzcbj9412KCJf71mKe7mnGsf3EzuKHr8ujNwBFYIHKgQPVAgeqBA8UCF4oELwQIXggQrBAxWCByoED1Sj4EV0GHV9GgNPn33ZnD/LNh+Cyqcx8E9f0vVa3yD4K9Qo+F9hjb9aYRsPVNirByqs8UCFvXqgwl59iMjqNJgV7NWHaPg8Vqo48NjG24IC3k5lwntcvxbPe+qzm7LGI/hF0097iDjw7l79Ypmvz50c7FZucfCkCvDuXv1SmSfkQ/tvocSD1V1wTnpXoCxsHuGuPl35T9mrn2lUDUMZHx23+Ws89/QJHX5FbfzAz7sDdTU+4amdmdRpq4NfpGPbS49X+OLgA9Oa8/OOeV3gm9PNt9E13isfg4CLu/ovP6lbK3ikNccUCT5Hpj2l7kQTmqRfnSXUkYRaFaRo8O8feBfPJ614U6ifZ+DT9e8WOKVpwTPsi5KfAz5xr94N/sN1gR9y1v1GPaAwxeQnEvzd9v8y1fhrAz/4c9prl8nYLdsIXBJFPrZzdyJbNqbzSivelGsBryN1fEu0/2wDA0/k58aVAGsRosjX3qu/EvAa0h5u9pZhFmEaDl7soA8YzR8J8gkNLgdeD7MY4FN1dMuCF5w4WsVUVnP5dgg8/yDqek5XH5ZW2M/JT1R98KrOv+etzNPbR/4no6ExSZquXlRvWgYoPiKKBJH13SwTqg+g9fhnVIUrAH9y3LlpUUNnJKlVUF5zhQcgjWiuFX2t8nOX3mgFxE42lcE1gf8r8heE/Bl19lt6p7OvHtq/ZN+Bf/4ueAnY8uCHBm1Wg6Yj5p/NKQVe6003oIoAUWVopsGaagL/ry168i/kv0k3PXLadpNEp+4ekvfUy9cHfoAE0WAp8vy9Ymu/McuD8Pui3AwfLsBgXVWAl17+nzvwv22/Ou2P952rp/eC3a8KvBx6ySlWq0UfQN3b0OilgyV1HeBl2bbAyxr//PXDid7yvVLwblcvG3HRbbP5DquxP8gfy5TnGaypinG8Cb59x9r4F5/TNp7U6uqHdpKunjQ95iHrEbR9xdGvFbz4M+8QI0fKlSRxMA+W4emFC0hlMIJPm6TRAWd1NaiWD5YBu8L7nSQEn0aTSQr/LoOs/eobSlyAJwZ4T7eP4NNoKEkVaSVWjCaV7HDOisHLLK0fvKQg6mU64CZ9/ZjE9dbT4KbAlTQK/AeqCPDve0/PM7/nUf0i4BttJJcevJq2cR3d3+CmaI2/HvCSBDHCLougF7E8w6h1gu8yxNdZ/jkhvyb/QwiN0B8/f/l46Ib0z+9U2J6H8Jvm/MUd2R/Izafm4Ajtn9rRf/7OnRmXW4q9NqHDD7u0q/fKvPfPFfgPsrL/U8f+txTk8f647R7X+vz1w3GvwvYshN/Q6zye7u5ZiK8f2m8r/x/ygu9145dgroH3691VCV7m5oOs8S8I+ev2H9W+5c2eh3vQwvYshN/QR/h2sbzDffdNL7R/viX7rOBVnG6h5l2O64ictiHTHbwawatvPug1/rfd3CyN0Lfg2/+qxtMvWQi/0cDLGm+G9s+v8oMXFXER9KIEEE6ZyKCwOHyQwTWBJ+RvSNvC/5q28eT+uKdP6PrqXrXx9CFfXQhfAy/beD20f8jexi/drdPI639VSG/F4D9oRUDf4bgPX3flPNJsjSdJlu/cadP52kRQs3pXr/7oO3RP9It4bPg8Q7txQejFJBzK0hW/0Sr8aJ9+7eCTHSlAXUficB8EPsNQzl7BM+McIHi36PVD738jwYvzPXo4A00OzTgHxcETUiV4OnR4/vqX3jVeX3iRjX3DR5AR56A8+B7z3OCP+4OjC0FvBOJxNxBtaJUXOuduT9uZjX5N4LOeFw89vf19+89v0CBD8vo39FDLd+lc+SMyomPYSGyDHcoOvpecefUUIclqvKee3v7w6o9B4PVT
qwPPzV4GDprVg/9I2v+8CKQ9zIgO5P7gOVp0gGcE8rp4wV29GkbOd/U5bnBohG0U+Hx13r7+f0w9Vy8itEsHagfR28MNq8LEgc9xg0O9S/exDPjujnae1+VZZ5UIL5tlPtYJXr6TBpJhg3X9jGfedafiHDc4lJWdlAIfItulMvTqfTGZ4A2PNCBR4+mSBkuL3ODQklnZ6R+yJvDsqwxR+gHetq/nhYC4DDakXP35tneD8gxtvAP8x8zgT/08Dsnh6pmnL1jfGxk8ahpZ3/3BdzXe7yKlawPfTfz5Tv4R+yPhFb6sGsvX+7t6dxsPo8bPAl9mIKcBJ9rSDGZTUOeuOe6f3/XY5+rVk5W6ehmqLYVe9S0Mm0YMVuI13lnqc/XqP/bA543d+atvVKERvIa+P3IPCOB0cB3gM/Xq++A/Ti+uT6aIAI7+Tam4nUTPrNBNDIncud1dljaelAYfHcBp1Li5nJQVwqQQV+8n7VDJ5GCeGXyIxKBd/6IkdWKP5pwGO1U2Vt8lXxp8qKvXR8lLXijnr6bf0usGu8XBd8uSe+5u+V69YyRXvas3gyWloavRnPN8+fTqHadk8V79MHiyAlfPwJdHb0Zpg6Zl3TG75Xv1w+A/1gte+8hb2PLciTVTPGCwJhG5Iy53t3wb/2EEfC70J7LtPW5qQHasnr8UlzUh5w8+RNnAZ4rhPL394XVUyJZI119cWrA+zNUP6ERuviUv0j6owEyLIPjZyOVbwzjSN9iQDOA43F3b8h9vPiV+NImZ1gDznOBjY/XiXBcibqg3npOFcjArolfvKvUt8tN+2V79GPgKu3dagJTI1+LELVfPDZ0Bno3jl6vxxB2tVX+SHSmVtACpujtBFXVej9aLaF7j5epd7q4byaV+GJGe0puKwQ+PaHhtIkTcxrA4+14U2TTYqZK9+jHmpcGPRS3V8L2GYD2R8/KqDbIMdgnBu2VHLcUpFpKevhpnL+O2IcO5WVOT0aoCvPMGCNNRS3Gua6AuO5h9OD41/ukf/M7UdYF33wChmYxasvNcwzyNmiR0TNJ5gY9bcDhDk+AzDOjsGyCMy2g8VYNaFDyRiy6t9t002Jbm6rd+pyoZjFHmucC7b4AwJKPG89fi2OU12hHgQ5QRfA5n77wBwpBcMyHs1JdVo3zPgME9QQdPfV14x1aBL09dDS963P1cvWf2rwt8mLQZMMm9tES4lr7vrRgYzAqv8d8/eq89SwSjs3MafGXkiXoV4CtAT8RFFXbE3idWH30ZUazIT7WAj1iIoXv64ug1Rx/s6rtRqz3vPpX5maoGfNR8vDjFpA70wo6oAI6vEHyjqrw2L1qMurrjWf90IfgRxSzEkJVLBMwKipsh3roN7mlsBc5E5ufJE3xd3Ts9gKP+FLodhqSuxnLBNd7P3RnHmitP8D9VCp5/4tWtlGSMflnwU2kFiJBKwEcGcMSnwuD1mXj2uTFbep9efezNAaLUAq0DPL9EPC6GISfBa+jYy6itNZc0oEKduwDwS7fys2IYhWfn1ESBZkwI+Hk3B4hQC/SNJ/jF6/wcb1d6waWJOtzVZ1+B0+KcZJ4LfID68RE1nCvEXg7ivQxWKufqrwO8DJ8Uwu6akhsxWKkMeFIT+NmunhMoBN4RqB0xWKoDH9LClwC/LPm5E1SFe/T9VTdTBjNx8AHs84J/Uz140bMuUwIaV4R+wmCqEuA7g33Bf1wc/NwYhqh4RSr+RBM/BX5G9CpGHcupq6cygg+QFbnT50eyI1cR9NV07jxjN7nAO2/8NGC5/p6fclbxiPaShTuBAH5R9APt+9QjRvVrqPIvxmhEaF4L1VOr7KDyYLZXAX7ZEd33LvCTjxglYlZWd/Y5B3XaEJ6Y/90GW8oPnvhOzPE/b5YG73T17itsLFcvOtX5o3fNsuAXueuV9/xMpik6t6u3r7CR1UzPiWjky3Xp9Q4mf2ObOKD8T6GKAr9kK//eucrU4xGjRi1TVTEb94HFF8MG6xoDv8yd
LaPAL1jnZ0xQyUaef8gHnZEnE336aPCL3NmyNvDD6ke1BiJ34lXgaHQ0aUGrD2JicDQLq2/jawbPN6kufr4u3iSMinr1ZD3g+xqdDRFxc7tupiQt/nAvP3lWKqrxseCrID8JvjHb+vQFoJEjx5kGZ+/VXyd4QmSLa9XNxPQbGbObZ3DuXn1n8xWCJyqCt2wDr2YH5hmcvVfPArB+6ywt8DWg9wLfNMu18TJIONPgzG08IQx8AHMFvob+3YirZ3+Jep+29qum3atTN2lw3qdQCYLXB15uJuq9wS0Ktv1JRejXVuMBgW8GCcaVAT1BfwiV9Opjwb9hPyvfyk/GS4j+kqBTr2KAGvjFXX3yXn0seD4pX77O+xmgTZrNjeNpK/q0yd8UBufs1XOf7b/cbrXgxbhLBdNnUBefRG8xjcE52/iYIfwqwctWWFGKq/BEtu5ar3H5Gh+a1sTvCBDwzkmzyQrt3Kw1FnrqKQzOCP6nGeDf1BG+CwGvG9sn3ev4mZ12FaIjfEGtTCgDePd6+yLgK4nbhrl64e19engy3ivHAuKAsoPXDDiTOINHa/z5leP6mqLgC6MPGD5z8K6Lqxyzdw2v2EYvXoYBpf8Iq+6jBo+7+qPj0qKi4As38/7jZ231IzHKq7NtN/1CI+fb1Y+kq09lcK42nhBQ4LWfyJkbDb7oqBvTOqKESG8vl3YsYXAm8JLbbPBFnX3MsTlCK5zbyAn7RhGX43SxKcK5+xqcG3zojGwPfNEeXuSh1ZDMHJZptV29abQI0BWBj2K+dvCN1lUXSej+XQ3f1Da535pdPSEJwL/JD753l9foQ4tuu9FhE0mSHt5keSwKXuvXzQEvw3dZmvm0MQzJth+BCx6ahx53QDnAKycdMz/TD99l6eEdtylrfL8Oy8q/aF5Kgk/QoXc09MFmB+u0TQm+ryxNVjnwWvO+MvDN6eZbBV5EWlamguB/SjCSc3XtM0A43y5W43MVoVLgk/Xr+uCLhHISdrbdSaXOUyHwOqh5/br+mC7DwC7ooskgKfBmHmaGa5wHGtCC4BM3772u/eL+fjnwRqjGTL8K8DOWXhHDy6do3of8fU6Hn/xYNuk6XH3k8mreAU7t5XV/b1X7bPTTH6ZKV28vr9bPcX9OuUYlO4HiDKROsJ9+FeCdy6sha3HHUoerd7fxkAU6gANZCB6oEDxQIXgt6Uo1N1+ryuywBrMyH3y6vRImtXDd9Ey95t0QfIxqJorgF1TNRBH8gqqZKIJfUDUTRfALqmaiucCj1ikED1QIHqgQPFAheKBC8ECF4IEKwQPVfPDH/eQuz+8IuZl6QP2BvHA++S88JU+r5sgneU9bvfLtn3HfnM8GfyLTBzq5Ljm297n5dJrOmU9KvlbNkFfyfrb65ds74945nwv+/OpHvwO57pxmbN8PPdA7NKUQq+Lkn/y0rf759sm4v2kzwbc2n7wO1JXYUR37T2uPTCnAqjj5J+9hq3e+fRILMG0O+APZHrvlPeNHavdq/0wa7Vvyp1Nq05q2Klp+mfbNdUCN96W0C38AAAGNSURBVEksIOfzO3ceJez53bST8mvrfFLytWqOfJL3s9WzjffOuG/Os4A/+JRDr96tV0qeVs2RT/Ketvr16r0zng88apVC8ECF4IEKwQMVggcqBA9UCB6oEDxQIXigQvBAheCBCsEDFYIHKgQPVAgeqBA8UCF4oELwhfSeL7uhC+5819kmFIIvJB38yX6+XQaVBn/oVpIdP3/5b5+/fGzfb7v313e37PMXd2R/6C6FoXls/7786oHlvQX//J1YYO06G79b5jbShcF3a8WP98dt93w/drHI0WcV8dp0/uzx6e6+zR3LY/v36VcPLO/Uy3PwzrPRuoY/LGBSYfBsHfhx3y0v7y4UOe0XvuitjM6vKdsWJMvjfefqWd518M6zcb5d5BKB0jWe5olllZfxqwYva/zz1w8s70aNd5+N8yuvqyXDVEEbT+5ZVln7d93gZRv/4vMH
lncdvPNsHK6yjUeVEoIHKgQPVAgeqBA8UCF4oELwQIXggQrBAxWCByoED1QIHqgQPFAheKBC8ED1/xVtQsDZYlKvAAAAAElFTkSuQmCC" /> <br />
<pre><code class="r">par(op)
</code></pre>
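<p>The simulated result can be checked against a closed-form value. Assuming the two error vectors have zero mean and are uncorrelated, with standard deviations <code>sigma1</code> and <code>sigma2</code>, the RMSE of their simple average should be close to <code>sqrt(sigma1^2 + sigma2^2) / 2</code>, which is <code>1/sqrt(2)</code> (about 0.71) when both are 1. The sketch below is only an illustration of that check; the names <code>sigma1</code>, <code>sigma2</code>, <code>rmse</code>, <code>simulated</code> and <code>theoretical</code> are not part of the original script.</p>
<pre><code class="r"># Sketch: compare the simulated RMSE of an averaged model with the
# closed-form value for uncorrelated, zero-mean errors.
# All names here are illustrative assumptions.
set.seed(1)
n      <- 100000
sigma1 <- 1.0   # assumed RMSE of model 1
sigma2 <- 1.0   # assumed RMSE of model 2
e1 <- rnorm(n, 0, sigma1)
e2 <- rnorm(n, 0, sigma2)
#root mean squared error of a vector of errors
rmse <- function(e) sqrt(mean(e^2))
simulated   <- rmse((e1 + e2) / 2)
theoretical <- sqrt(sigma1^2 + sigma2^2) / 2
print(c(simulated = simulated, theoretical = theoretical))
</code></pre>
<p>Changing <code>sigma2</code> plays the same role as <code>worseness_factor</code> in the script above.</p>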
<h2>
Model Synergy but varying correlation</h2>
<pre><code class="r">###############################################
# If two models have the same RMSE error
# but are different, what do we get by
# averaging them?
#
# What happens as the correlation between the
# two models changes
#
# Phil Brierley
# Oct 2013
#
###############################################
#clean all objects from memory
rm(list=ls())
#memsize <- 6400
#if (memory.limit() < memsize) memory.limit(size=memsize)
#settings
cases <- 1000
number_of_correlations_to_test <- 20
threads <- 8
worseness_factor = 1.3
#initiate
errors1 <- rnorm(cases,0,1)
correlations <- seq(from=0,to=1, length.out = number_of_correlations_to_test)
#-----------------------------------------------------------------
# function to generate a correlated variable
#-----------------------------------------------------------------
generateCorrelatedVariable <- function(x1,rho){
if (rho==1) return(x1)
#adapted from http://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable/15040#15040
n <- length(x1) # length of vector
#rho <- 0.8 # desired correlation = cos(angle)
theta <- acos(rho) # corresponding angle
#x1 <- rnorm(n, 1, 1) # fixed given data
x2 <- rnorm(n, 1, 1) # new random data
X <- cbind(x1, x2) # matrix
Xctr <- scale(X, center=TRUE, scale=FALSE) # centered columns (mean 0)
Id <- diag(n) # identity matrix
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE])) # QR-decomposition, just matrix Q
P <- tcrossprod(Q) # = Q Q' # projection onto space defined by x1
x2o <- (Id-P) %*% Xctr[ , 2] # x2ctr made orthogonal to x1ctr
Xc2 <- cbind(Xctr[ , 1], x2o) # bind to matrix
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2))) # scale columns to length 1
x <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1] # final new vector
return(x) # return explicitly; a bare assignment as the last line only returns invisibly
}
#---------------------------------------------------------------------
# function to calculate the errors
#---------------------------------------------------------------------
calcErrors <- function(myCorrelation){
errors2 <- worseness_factor * scale(generateCorrelatedVariable(errors1,myCorrelation))
#average the 2
errorsAve <- (errors1 + errors2)/2
#calculate the RMSE
cases <- length(errors1)
rmse1 <- sqrt(sum(errors1 * errors1)/cases)
rmse2 <- sqrt(sum(errors2 * errors2)/cases)
rmseAve <- sqrt(sum(errorsAve * errorsAve)/cases)
#return the results
c(myCorrelation,rmse1,rmse2,rmseAve)
}
#--------------------------------
# the parallel stuff
#--------------------------------
#load library
library(snowfall)
#initiate clusters
sfStop()
sfInit(parallel = TRUE, cpus = threads, type = "SOCK")
sfExport(list = c("errors1","generateCorrelatedVariable","calcErrors","worseness_factor"))
#do the calculation in parallel
allErrors <- sfClusterApplyLB(correlations,calcErrors)
#stack results together into a data frame
allErrors <- do.call(rbind.data.frame, allErrors)
colnames(allErrors) <- c('correlation','error1','error2','errorAve')
#stop clusters
sfStop()
#------------------------------------------
# plot the results
#------------------------------------------
#y axis range covering all three curves (there is no errorWeighted column in allErrors)
yrange <- range(allErrors$errorAve,allErrors$error1,allErrors$error2)
plot(allErrors$correlation,allErrors$errorAve
,type='b'
,col='red'
,lwd=3
,xlab='correlation between the two models'
,ylab='RMS Error'
,main='The less correlated the two models, the more synergy when averaged'
,ylim=yrange)
lines(allErrors$correlation,allErrors$error1,col='blue',type='b')
lines(allErrors$correlation,allErrors$error2,col='black',type='b')
legend("left", inset=.05, c('model1','model2','average'), fill=c('blue','black','red'), horiz=FALSE)
</code></pre>
<img alt="plot of chunk unnamed-chunk-4" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAAq1BMVEX9/v0AAAAAADkAAGUAAP8AOTkAOWUAOY8AZo8AZrU5AAA5ADk5AGU5OQA5OWU5OY85ZmU5ZrU5j485j9plAABlADllAGVlOQBlOTllOY9lZgBlZjllZmVlj49ltbVltf2POQCPOTmPZgCPj2WPtY+P29qP2/21ZgC1Zjm1tWW124+1/rW1/tq1/v3ajzna/rXa/tra/v39tWX924/929r9/rX9/tr9/v3/AABaINQiAAAAOXRSTlP/////////////////////////////////////////////////////////////////////////AP+Eine/AAAACXBIWXMAAAsSAAALEgHS3X78AAAUfUlEQVR4nO2dC5vithWG45lNJrDbTBvY2bZJw0x3A8mm0AwUPP//l1VXSwYBMpJtSed7n2cG8OVI9mtdbNnwTQ1I8s3YGQDjAPFEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJ4hZ/eK4E3z/fr0/n7ueuqZ1ph1ktTmY/vDazDs6MONbyS1hHPsmFFfdckkPSYx5SEb+qzorns87sgdO1/BK+Jv5CkoMyuHjGbsr2Dkv467QSiW/YcaD2stxXesJuWokl9KvMcGXWmvHPdz9Vd7+K/0u9pgizZe/vluJAm5iY/PMvUo+c5cyImCWOzQ1LROhU6XG2bBZbYyOX1jNMZDsXdt4vJumOemhvmNp/PKQwt61m7PX3qhVMrcPSuv9j/vCnWq5WstUG/XmSByuQyA3fe+w4ZftO/m9HV/NrlcqrtYeviRcFf8KjVs0+FftKT9jP+evDq35t9lzVLDSRE9hmiCXMmvdrtdafUryeo5JtiXdkRM7asA3h28x3m05P7xTJcUZEZEcudDVwIckLUc2GyX0kQ4rcqSO7FczaJd9Om+VqobjZoDOr6U3UiWx53vdzcQjY0f+r5h90Ktb8q+Jn4p0u/2stvpkg3ujlm3XFXlnspiwvPEMszkTs0ImpSta6klXH0aK2YoqDvlXVuzKi1prt5z+zQni3bNKTinigasE/NzOayK1c2Hm/kqQz6tGG2btjN+VVx1oE29rp6nWa/EzMqs0GneahCaSzyvPPyvfSFd3IOt7q3fSaeLFp/JgSyF3KYzUTxFHH4upXuXd08ZjII1jEkdFqa02+OF9PiW/miPWO2nhXRsSs/Xyye//l/XLFM6XS04nrvdLMaCK3cmHn/UqSzqhHG9YURhby8MzK3cQRTK+j8qOWk6s2G3RmNdNJUXtvI2o7aw/K6Hr+6VZvq6tt/PqKeNGY2a+NcB/xYiUv8a6MyM7d6v531oD9Mp/UDvF8l14Vb+f9SpLOqA7xTcjN3ZfpzBHsSLxaTqbebJBrNTWlSUOWOnbYqB6TFV3PDxDfqgxPa0fdu1avqmqZWXWhEd+sycNsRFVmV/XW+i7xjnS3dz+x9vWvUx6gVdUbRWeqep2L1jZcTtIZ9WjDWmHYzmczT4PpdfSWyuVkGs0GuVazxDd7j3UPPvNa9ii6nn+61derep2M7BXI1azOHY8lj6rm6KudnTsjvlnTVBxyyUkzx+7cieUn7ozItVh1NmObJbZfpXes6GznTuZC5906CM4leSZqa8Pk7lMbxlJ07ki9SyrV7ZLLqWNRb5BrNUt8s/eaw6YdXc8/7tzx+b7ixSqtQqIn8E3U9UpTW67Ue3O+Y8TrNfXxM9vIgqHyxWPynP44b47OSp7unGRErsUWXuj+Tut0zlLUOp370Y7TVPWi9Crx55N0Rj3aMB1Cu5jUjh2p1mGS1YnWSnfHarNBrtUs8c3eM2u3o+v5dipqD+OS
rc3u8fX6Qh3ZXL3IJA/a68uFYZ8K1LhW32L/cXl9oY5cvvgmz8RlrdHfhUKTigHi+2VTXS7IW9UpubZcGDoVC4gnCsQTBeKJAvFEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRAkRX4GU6VF8wLqgbyCeKBBPFIgnCsQTJVj8blrNVu0vjPALDUYlVPzheVGv+JPEpw+SQnzKhIrfP63r7Uy+dgsNRiVGieegxGdGcBu/n4uv8kMbnxmD9epdFwl9pwWtjESc04YSXznS8p0WtDIScU+LJd7q3DmHAZqErXkXp9VWjM4rI5HzC1pCLoASX2oiQ4kvpGUsJ5HBxIO0gHiihIpX35HmulgP8SkTXOIPz7MzcyA+ZcKr+rPf9wjxKYM2nigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPlGDxu2l1t6zr/RN+VDgrQsUfnhfid4UhPjOCf0ZcCF9NID4zYpR4xubbDxCfFcFt/H4ufj9+cw/xWYFePVEgniixxFudu0pze65A76DEEwXiiQLxRIF4ogRfuZurnhzO4/MiuMTzC/U3hQajEl7V7z8ubwoNRgVtPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE+Et7f2Z4inwRvHngDxJHh7OzYP8STg0iGeIqjq6SGEQzw1tHH06klx0qdrgPiCOe3LGyC+XC5oh/iCuaQd4otE+b6gHeJL5HJRV0B8cVzq0hkgvjhOL8+6gPjyQFVPDukb4onRCL+uHeILwqugN0B8IXh15S0gvgw6aof4QujsHeIzR+vuqB3iM6d7SddAfM74XZ11AvE543d11gnEZw2qenq4b571BuIz5czNs95AfJbcXtI1EJ8j4d4hPkcieIf4DImgHeJzorl5NoJ3iM+HTvdZXAXicyHg8qwLiM+FgMuzLiA+G6IWeIjPB4inR8ROnQbi0yduUVfg16RTJ25nvgG/Jp02kU/iDPg16aTpS/vgbXwVmYDs5UBv2j3F759OGvAbQ0c2Va54ny+1CMJL/OHlTG1usI6NS8UR4v3osagr/Er8/EzHvXNoiPeity6dYeg2PiBe/+GSIfJ1eRcQnySJVPXsZJ3V9A+v4aEh/iqBt0374te5ExdpNi7zHa/cQfw1Qm+b9qXD6Zz7pK7blTuIv0z/JV0TWuI7XrmD+Ev035c3oI1PhyG9j9irj3HdtSjxQ2of85Jt9U5x1d7heaHe7T6sm//nksgP8+10AyYa65KtZ+gY4rfmDKII8W9DdeRbjHfJ1iX+8PK5qmZb9icS5Qmyl+8+LdQnLn5196WoEj/A5VkXI7bxLvHPk3o3nUjBs3rL+pP8pVrwl82kyKp+gMuzLtJq43mbwv/YKSJPU72wqp6fM7K3JYofuFOnSauNt8Rzu+ytkLxaiMbmbgnx0RixjXeczp0v8bLOKU18D7dN+5LW6Jwl/rSNZ5/KEj9OUVekK1716g/Puld/tyyqxI/TmW/wEM+r2d3ja+cuHi7ZXmCkkzgDxI/C2NohfhxG1w7xQ9P7bdO+QPygJFDUFT7iz99d1Tn0TcOy7UGa3bSq9OfcxI/epTPkMB7fEs/P9Hbvl+eTSJmRrsu7SEu8x+jcdlLza7heuU+PZAp8auK9RufMbX4QfzOJifcanTM39mYkPpnuvCJZ8edH5/bzmSNc2qRT0jU+4vdzXgF37dT3NTq3my5c4ZImPe9e4ln7yjvWm0l4aH/x50bnbO+ZiE/nHM7C9wIOK2oDXMDx
GJ3biGMlp159itq9xfOnaMxo6O2hCV65S9O7X1W/EP3oVdyqPgZJi7/5RyAHwa9zVz288h5eeGhC4hMt6Zq07sCJkUQiJHRZ3gnE90RCl+WdpHUBJ0YSqZB2gfcRL56R7mFYVlWGbxA/Bj4lfiNGy6KEpiF+xNvlvfGs6vllk67ybxEv7rLgt/scXpZqJPaHT/drdfMFH6L9e3Np53ruxyHxoq7o0Mav+q/qxV0WH/7zsuT3eqmR2OlCTZYjtXdLNUTrkfsxSLwz3+ApfmUukwaF9qjq90/rDTM7a43EisliwIbVBHKGR+6HJ/WTOINnG99V+rnQV8WveB9y9/inqOnNc5Jqsm4C+AyP3A9ONtpH7dU7Tuf28wWv0w8vXx5frZFYNVmXeOv7kjtlqHfy0Z7aBRwx3v5+yeqYWW09J6knmzZ+K7+CKxnxqd1fc50u4v/X7fvObrlyx1oVPggrbqO1npNUk9mU71+WaoZPuKHIqagrfMTLu28Ozyl8bXl7bDgR8fl06QxebTw/fdqO/wWHvLNx1/pujnTEJ31d3oXnjRisdu3csSc0SJNfgfcWf+4LazuGhvhU8BV/w/de0RCfw3V5FxAfRn5FXeEjvp+nZWMwtvgMq3jNYBdw3r17d2ZafXXa2QWPc+8bsFMiZ6dZJ3H9JXLDNK8FhxL/Tv5VkTnaOPV3y7TuK9vee0vkhml+Cw4sXhx579TRd3FabaZ1X3mIRN7y3BKnnVPii79hWkLlxExrte4ZbsnAbXx907R0WkZrWqtXl9+WDDw6VwAZ9+RtIL4jOQ7IuID4juQ4IOMC4rtSRoGH+C4M9LOvgwDx/ijjJWiPIJ4/6+C+jl+Y+EJKuiZUvP7Wye3p/TlFiS+kL28IFa/Hax3jtiWJL847SrwXxWmP0Maf/4WqcsSXpx29+ivk96CELxB/iQKreE0s8fYDbeFfcZIIpVyXd4ESf4FSrsu7gPhLlFvgIf4iEN9D6MTJ9UEJX4Kv3J2/6T5r8eUWdUVwiTe/E9IxdMoUXMVrwqv6s89TZiu+4JM4A9r4E0h4h/gTSGiH+BNoaId4i3IHZFxAvIZIFa+BeEXJAzIuIF5R8oCMC4jX0CrwEN8A8TY0xJc+IOMC4qkVdQV58cQ68w3UxVP1Tlw8We3ExdPVTls8Ze9ExdMaj3FCUjzpoq6gKJ7aeIwTouJJjcc4oSgeVX0N8WShJp7ieIwTYuJR1DWkxKOKNxASj5M4Gzri4b0FGfHQ3oaKeGg/onzxGJBxUrx4VPFuShePAZkzEBCPARkXpYtHVX8GiCdKyeIxIHOBgsWjqF+iXPHwfpFSxaNpv0Kh4uH9GmWKh/erFCke2q9Tonh496As8RiJ86Yo8Wja/SlJPEbiOlCYeIzE+VKSeFT1HYB4opQiHiNxHSlEPIp6V8oQD++dKUE8mvYbKEA8vN9C/uLh/SayFw/tt5G7eHi/kXzFYyQuiHDxK/GbwovOoQNB0x5GsPgVU75Z1KtJ19BhYCQukOCfEX9a1/Xu8VW8dgodBkbiAgkVf3hZ1vV2Um8fXjuGDgQFPozgqn4/r6qH1+3d6W+JQ3zK5Nirx0hcBDIUj6Ieg1jirc5dpbk9VxdAFR+H3Eo8vEciL/E4eY9GsPjdVFTr9yen8T2Ih/Z4BJ/HP8uLtUOcx8N7RKJcuatbnTvP0J2B9pjkUeJx5h6dKFfu+m7jUcfHJ4dePUbieiAT8RiJi00O4lHV9wDEEyV18ejP90Ti4lHU+yJp8aji+yNl8fDeI+mKx8l7ryQrHtr7JVXx8N4ziYqH9r5JTjweiRuG1MSjih+IxMRjIG4o0hOPgbhBSEw8qvqhgHiipCYe3fmBSEk8nA9IQuJRyQ9JMuLRuA9LKuLhfWDSEI/LNoOThHhoH54UxMP7CCQgHtrHYFTxGIIdjzHFo4ofkRHFYwh2TMYVjyHY0UBVTxSIJ8pI4vEs5NiMIx5FfXRGEQ/v4zOGeHhPgBHEw3sKDC8e
3pNgcPHwngZDi4f3RBhYPLynwmDiMQSbFkOJx+XZxBhIPIZgU2M48RiCTQpU9USBeKIM3asHiZDA7dVgDCCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4ofYoHKdOf+OPjIF6ouMGSzdiYwSCeaDCIJxoM4okGg3iiwSCeaDCIJxosoniQExBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE+UCOL38+rh9ehdhGC7aVUtIsWq68NzWCwr2OG5ulvGCsa28n4dFowF+bA+CnudcPF8l24m7XcRgu0/Luvd+5Ad3MrOJvAgsoKtFvU27Phub+UmtLBs1bHTaf+Hi98/rdUhZ95FCLblW7AKsWVnZ/fDp8Dao7WVgZhgu8fX4ICruy9yKzvt/3DxIu8fl613EYJxwoJZsQ4vvwZW9fZW/ju0qjfBopR4bbvT/g8Xz6s9mZx5FyFYzauuWaRYm1loG2+C7aYLsYvj5CxCt6gR32n/J1zi9/Mg7+2MhYrvaSt5L2Yb3Lsbp8T31MaLkhUpYxtxn3nQYWRt5T+CxVs9mfBKsm7ED9zG8wpZ9+pn4b16HSLYezs7oSXeCrYKruqtrYxZ4jvt/2jn8TzxaOfxLJgspUG2TMaincerrQxVZYJtq+CLAlJ81/2PK3dEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKEWIN/cUs3fOG4xbz2L53ILsCtTl1nFzc2fgDed9UZ545xIQf0w+4vdzfiOyvLd598On+6/sb60+8tuL+VPV/OPX5g5o/pSbukF79bN4J6bzQH9ZyFvaH17VzdJHa8hA/zQrqQycrMefy/tcVbOteGDDBPvu06I2mWNzwx+Gjks24nkRYvt7NeP/xfNr/ImLlXiEgO1b8VS1KKV8R+ulZvqB5hX7JKfLxTcz8eA0e5GT9JxmDR5IfVg1Tz8crSeXPDzz9URUPZG/sKVM5p7WoU+aRCcb8eYpLea4eYKA6WaTzCNEUvzRUrWs6lcLs/ju8fW3z5PDy1JNsgKp55c/6A9qlprWrCeTqHmJX4o/9klN5C/sODUxwx+Sik8+4h9VQeR72Xp0hD+JIlyteHVqCr21FOO3pRDfLL5/+uPp64c/ntZqkhXoRLycJac166kk6pZ4K91WarwdCn9eJi75iHeWePmdAkLPwlT1zhLPy2CzeL36199Y4zzREaw5J+Kf7J6jvd61Em9i1nXol2hEJxvxvI1n+1C13lpQ81FMeb9st/GWeLsNlt0t2VrbzbK9hhFvWnPXei3xp228jMmnQfyttHr1WpCcyN9uKt6TPjzbvXpL/M+irpWLs4XWzWOqcpIVSD9kKQKp9kTW08fr6V69Ea8mHp51r17FXKFXDxIB4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJ8n+YflWV0FwDAAAAAABJRU5ErkJggg==" /> <br />
<h2>
Variable Importance</h2>
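<p>The script that follows scrambles one variable at a time and measures how much the model error grows. As a much smaller illustration of the same idea, the sketch below uses only base R and a single linear model. The data set, the column names (<code>x1</code>, <code>x2</code>, <code>noise</code>) and the helper <code>permImportance</code> are made up for this example, and for brevity it scores on the training data instead of using a train/test split.</p>
<pre><code class="r"># Minimal sketch of permutation importance using base R only.
# Data, column names and helper names are illustrative assumptions.
set.seed(42)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n), noise = runif(n))
d$target <- 3 * d$x1 + 1 * d$x2 + rnorm(n, sd = 0.1)
model <- lm(target ~ ., data = d)
rmse <- function(act, pred) sqrt(mean((act - pred)^2))
baseError <- rmse(d$target, predict(model, d))
permImportance <- function(variable, loops = 20) {
errs <- replicate(loops, {
shuffled <- d
#scramble one column to break its link with the target
shuffled[[variable]] <- sample(shuffled[[variable]])
rmse(d$target, predict(model, shuffled))
})
#ratio to the unscrambled error; values near 1 mean the variable is not used
mean(errs) / baseError
}
importance <- sapply(c("x1", "x2", "noise"), permImportance)
print(round(importance, 2))
</code></pre>
<p>A ratio well above 1 suggests the model relies on that variable, while a ratio close to 1 marks it as a candidate for removal.</p>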
<pre><code class="r">#####################################################
#
# A generic method to calculate the importance
# of variables in any model
#
# Phil Brierley
# Oct 2013
#
#####################################################
#clean all objects from memory
rm(list=ls())
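#set memory (memory.limit() only has an effect on Windows with R older than 4.2)
memsize <- 3200
if (.Platform$OS.type == "windows" && getRversion() < "4.2.0" && memory.limit() < memsize) memory.limit(size=memsize)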
#libraries
library(nnet)
library(randomForest)
library(gbm)
#----------------------
#parameters to set
#-----------------------
#what model are we building
modTypes <- vector()
modTypes[1] = 'linear_regression'
modTypes[2] = 'neural_net'
modTypes[3] = 'gbm'
modTypes[4] = 'random_forest'
#a number >= 0 and < 1
deletion_threshold <- 0.05
#for data set generation
Number_of_Useful_Variables <- 10
Number_of_Junk_Variables <- 10
Number_of_Records <- 1000
Number_of_Removed_Useful_Variables <- 0
Include_Junk_Variables <- TRUE
#importance testing loops
numloopsImportance <- 100
#train test split
Train_Percent <- 0.5
#multithreading
threads <- 8
#-----------------------------------------
# error function
#-----------------------------------------
calc_error <- function(act,pred){
aact <- as.matrix(act)
ppred <- as.matrix(pred)
return (sqrt(colSums(((ppred) - (aact)) ^ 2) / nrow(aact)))
}
#------------------------
#generate a data set
#------------------------
#set seed if you want to regenerate the same data set
set.seed(42)
useful <- matrix(runif(Number_of_Records*Number_of_Useful_Variables,0,1),Number_of_Records,Number_of_Useful_Variables)
junk <- matrix(runif(Number_of_Records*Number_of_Junk_Variables,0,1),Number_of_Records,Number_of_Junk_Variables)
colnames(useful) <- paste('useful',1:ncol(useful),sep="_")
colnames(junk) <- paste('junk',1:ncol(junk),sep="_")
#create the target
useful_weightings <- sort(runif(Number_of_Useful_Variables,0,1),decreasing=TRUE)
target <- useful %*% useful_weightings
#remove some useful variables
useful <- useful[,1:(Number_of_Useful_Variables-Number_of_Removed_Useful_Variables)]
#create a data set
if (Include_Junk_Variables){
myData <- data.frame(cbind(useful,junk,target))
} else {
myData <- data.frame(cbind(useful,target))
}
#target - what we are predicting
theTarget <- 'target'
targindex <- ncol(myData)
colnames(myData)[targindex] <- theTarget
#----------------------------------------------------
# divide data set into train and test
#----------------------------------------------------
trainrows <- runif(nrow(myData)) < Train_Percent
if(length(which(trainrows)) < 2) stop('not enough training cases')
testrows <- !trainrows
#-------------------------------------------------
# function for calculating variable importance
#--------------------------------------------------
varImporatnce <- function(variable){
#initialise the errors
errorTrain <- 0
errorTest <- 0
#copy this variable data
temp <- myData[,variable]
for(i in 1:numloopsImportance){
#scramble the values of this variable
myData[,variable] <- temp[order(runif(length(temp)))]
#calculate the predictions
if (modType == 'neural_net'){
predictions <- predict(model,newdata=myData[,-targindex],type='raw')
}
if (modType == 'linear_regression'){
predictions <- predict(model, myData)
}
if (modType == 'random_forest'){
predictions <- predict(model, myData,type="response")
}
if (modType == 'gbm'){
predictions <- predict.gbm(model, myData[,-targindex],type="response",n.trees = model$n.trees)
}
#calculate the error
errorTest <- errorTest + calc_error(myData[testrows,theTarget],predictions[testrows])
errorTrain <- errorTrain + calc_error(myData[trainrows,theTarget],predictions[trainrows])
}
#return average train and test error
c(errorTrain/numloopsImportance,errorTest/numloopsImportance)
}
#----------------------------------------
# set up multithreading
#---------------------------------------
library(snowfall) #for parallel processing
library(rlecuyer)
sfInit(parallel = TRUE, cpus = threads, type = "SOCK")
sfClusterSetupRNG()
sfExport(list = c('myData','trainrows','testrows','numloopsImportance','calc_error','theTarget','targindex'))
####################################
# LOOP THROUGH ALL MODEL TYPES
####################################
variables <- setdiff(colnames(myData),theTarget)
candidates_for_deletion <- NULL
for (modType in modTypes){
#-----------------------------
#build a model
#----------------------------
if (modType == 'linear_regression'){
model <- lm(as.formula(paste(theTarget, " ~ . "))
, data=myData[trainrows,])
basePredictions <- predict(model, myData)
}
if (modType == 'neural_net'){
model <- nnet(x=myData[trainrows,-targindex]
,y=myData[trainrows,targindex]
,size=5
,linout=TRUE)
basePredictions <- predict(model,newdata=myData[,-targindex],type='raw')
}
if (modType == 'random_forest'){
model <- randomForest(x= myData[trainrows,-targindex]
,y=myData[trainrows,targindex]
,ntree=1000)
basePredictions <- predict(model,myData,type="response")
}
if (modType == 'gbm'){
model <- gbm(as.formula(paste(theTarget, " ~ . ")), # formula
data=myData[trainrows,], # dataset
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
keep.data=FALSE, # keep a copy of the dataset with the object
verbose=FALSE, # don't print out progress
n.cores=1) # use a single core (auto-detecting the number of cores is error-prone, so avoided here)
basePredictions <- predict.gbm(object=model, newdata=myData[,-targindex],type="response",n.trees = model$n.trees)
}
#calculate the error
full_Train_Error <- calc_error(myData[trainrows,theTarget],basePredictions[trainrows])
full_Test_Error <- calc_error(myData[testrows,theTarget],basePredictions[testrows])
#Export model to threads
sfExport(list = c('modType','model'))
if (modType == 'neural_net') sfLibrary(nnet)
if (modType == 'random_forest') sfLibrary(randomForest)
if (modType == 'gbm') sfLibrary(gbm)
#-------------------------------------
# calculate variable importance
#-------------------------------------
s <- sfClusterApplyLB(variables,varImporatnce)
s <- do.call(rbind.data.frame,s)
colnames(s) <- c('Train','Test')
row.names(s) <- variables
#express each scrambled error as a ratio to the full model error
s$Train <- s$Train / full_Train_Error
s$Test <- s$Test / full_Test_Error
#scale to 0-1: variables whose scrambling did not increase the error get zero importance
#(logical indexing avoids the s[-which(...),] pitfall, which selects nothing when no row qualifies)
isWorse <- s$Train > 1
s$Train <- ifelse(isWorse, s$Train / max(s$Train), 0)
isWorse <- s$Test > 1
s$Test <- ifelse(isWorse, s$Test / max(s$Test), 0)
#pick candidates for deletion based on a threshold
my_candidates_for_deletion <- rownames(s[which(s$Test < deletion_threshold),])
candidates_for_deletion <- c(my_candidates_for_deletion,candidates_for_deletion)
#get the ranking of each variable
s1 <- s
s1 <- s1[order(s1$Test,decreasing = TRUE),]
s1$Rank <- 1:nrow(s1)
colnames(s1)[ncol(s1)] <- paste('Rank',modType,sep="_")
s1 <-s1[order(row.names(s1)),]
#combine the rankings
if (modType == modTypes[1]){
rankings <- s1[ncol(s1)]
} else {
rankings <- cbind(rankings,s1[ncol(s1)])
}
#---------------------------------------
# plot the chart
#---------------------------------------
s <- s[order(s$Test),]
x <- barplot(as.matrix(t(s))
,horiz=TRUE
,beside=TRUE
,main = paste(modType,'variable importance\nTrain RMSE =', formatC(full_Train_Error,digits = 5,format='f'),'\nTest RMSE =',formatC(full_Test_Error,digits = 5,format='f'))
,col=c("aliceblue","forestgreen")
,xlim=c(-0.2,1)
,axes = FALSE
,axisnames = FALSE)
text(-0.1,colSums(x)/2,row.names(s),col='blue')
legend('bottomright',inset=0.05,c('Train','Test'),fill=c("aliceblue","forestgreen"))
abline(v=0)
} #end looping through model types
</code></pre>
<img alt="plot of chunk unnamed-chunk-5" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAA1VBMVEX9/v0AAAAAADkAAGUAAP4AAP8AOTkAOWUAOY8AOf4AZo8AZrUAZv4iiyI5AAA5ADk5AGU5AP45AP85OQA5OWU5OY85Of45Of85ZmU5j9o5j/05j/5lAABlADllAGVlAP5lAP9lOQBlOf5lOf9lZjllZmVlj49ltf2POQCPOf6POf+PZgCPZv+Pj2WPtY+Ptf6P27WP2/21ZgC1Zv+1tWW124+1/rW1/v3ajznaj//a24/a/rXa/tra/v3w+P/9tWX9tf7924/92/79/rX9/tr9/v39/v7YV/E+AAAAR3RSTlP///////////////////////////////////////////////////////////////////////////////////////////8A/3m6RJAAAAAJcEhZcwAACxIAAAsSAdLdfvwAABI5SURBVHic7Z0Le+NGGYXrTU0DaQMEWEgayvYC3ppCa6c0CXSby67y/38SmhldZqSRLdvy9410zvu0K0eWZUVvZizNHM98lBFIPtI+AKIDxYNC8aBQPCgUDwrFg0LxoKQg/v31ya35X+Bt9n7Bh7fF49Ze+u92vdjp7Y9LKuK1j2Eb3eJ7s55RfEhZ4vOT++NsZs/r3WxmT9NDvny1yk/7q2/M0lH+VG704e3s5Kfr03fN9U/nbm/F0ikzT15akeV7GSVnxb/B+/1gXlCtMdufvvP2sqiP/mH2m/wYbu/MysZvcVkcr+Us3J3ZKD/48FcWIi3xxdnJHczsOXt/bZZGqV0U2xc/lRvZH391Xm3VfHG9zE/wuniL+r0y8+dljb5ahe/3P3tc3prGXi6ro3+YlVS/Rb7H4L0++dI8buyuPI7GLiVITPyllfCUW6wq1+KZs3p791O10dN5vuGdO5vhevenUi7Njp7O8+JmFFfv5XaYry1/qN+vrNWrY3N/G8We/eN7mJmVs0W151zwwn8vV6MssmB37le10hfBLiVIS7w9U6fvigJk6nZTRIpnKtxP1UYPTpKt6v31tnTZKn5WVfV22/yvZFG9l93jXa7BefHfz4kv15jHa+O2KuDumMrdlr+B2zLfY/heWfsN3K/qCnmwSwnSEm/1eOLNp/Pe4u1nu7dsiC/eqzgCW69n4fu5wl2u2S7eHcFG8cHu3K9aftpQfFHVG0xJNJVhTHy1kVfVB+stZf1aKPOqX198tn71L1P0wvczL6jXNKt6/+hD8e2qvhJf7676Ve3BX4ZHLUCS4osrnbLox8VXG3kXd8H6snSVy8bFXSA+36auLXzx9Zr2xV19TdAQ37q4K4/rrN5d8faNizs5+2mKt6ehWF7WlWVB+VO5UV68ytu5YL0x7uoAu2zezvniizu68P1Kx25NcDu3riVFqvofiyfr2zlzXPlh2FuOy/CTZl1U8GtR70mIH4IHyZO2EclL8wOYgHh3byx4C7wZih8a91kdufJ9cB+liUDxJGUoHhSKB4XiQaF4UEYuvmgma1zTR1IyRYuZfYG58bur2n5cq0xXS7nZ/qzxeNtyFExSfItCrGvP/4PtAPWEn20Qv/b2Xj7ethwFIxefVf3vNnxT5FsacZ66g9W2wn9jW0v/Yl5mXusajL12vzv3R7JwO6/jgOXjbUuNk7A7kxE/s5GZuiMliNgE4v99vsgeZv9w4p3fTvG213TtaoLy8bal/CnYh+mIL+rYVpzHrSw/EXLxP1xfZncn/6n+XsqsRqyPxOYk1uUfgXu8bSn5y+/PZMSXqYiyAzaI2BRiF66nbH3689vT/7oO8HXRY+uJD0s8xaeKJ97veQ/6XV01UHaR3p38dH3p5ThMRoNV/ejwxJf5lqh494FuNnw6/7v5J3/qzl2Mh+LDnfPiLlU88WW+JS4+
1+w+AUxww8Wn6irCK+UB6/J+rwzJ8HYuFfzP+CLf0iHeXAHaDU3cxlX1VTinS3zRKGN3wAYcMn4oHhSKB4XiQaF4UCgeFIoHZfTiN/TIr8uvJe+bwij3P8UkxoTF199H3zeF4fYyzSTG6MVn1fdjwxFQ7B+ElbB3CsPte6KN9dMR3xgB5ee4+F1SGIbJds9NRnxrBJRWVb9zCsMw2Q75yYhvjYDSFL97CsNA8QkTiq9GQPEHG9orhWFgVZ8wdVVfYaT74vdKYbh98+IuWfyLu3oEFDv0iHl6/xSGYapJjOmIj42AYtfvn8IwTDWJMQXxZA8oHhSKB4XiQaF4UCgeFHXxM+0DAIXiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB0VfPDkSm8+7vvjPyFFQFL+cX7kHz7/3subBDxR/NPTE14Z914/zjyleAhHxxqz5/3Futb68yRf5P/Pv3XpP/I1dR/HHR1L8y99us/tPc7tXZlEYD8WzqpdCVPwb+4VBs8j/BnqKJ0dCUHz2fDGfr0xNn7PqKf4XchRExec8/vqdqfHdSorXQ0S8qdzvP77NpRvx5jM+XxR1f76e4hWQuZ27n88/z8v5TXVVn9f4xrBbT/EKJN9yp32Cpkoa4s1Vn73iaz2jfe07XTYb0S/x2gcACsWDQvGgUDwoFA+KvngyLD3Pu7547V6siZGC+FgCx9zQBwPKUfywJCA+lsAxjffPF34zDsUPi6z43gmcx0/zf278Ik/xw6IivlcCJyviGhR/FHTE90ngZMsr/6XaF8GTQ0N8nwTOy5vAO3vnDmL/VhDpBM7zRWOQaIo/BG3xvRM4Le8UfxDa4nsncO5tr3xwVa997kaNuvj90b4WGjl7n3f9BI7MAZAG+iVe+wBAoXhQKB4UigdFXzyJcuzzri9eu1cjTUYtPhbEKDpuayg+ypjFx4IYtjXvU38rio8yEvH9h0LJmt+apPgo4xLfL4jBEt+DkYnvEcR4vgjbbbWvnlNlEC8bkB8KpRm90u7gSg6Zlg35oVCaYUvt85wcYxLfO4hhnmWJ38yYxPcfCuW+2TdL8U1GJX5/KL7JpMRzKJQdEDGiX+K1DwAUigeF4kGheFAoHhR98WNC+2QNiL547X6wHaD4fsQnI8qWox0KheJ7EZ+MKLsf7xg4FN9khwTO8+uvKT4BxBM4yxWr+hSQTuDcXzU/48fEICcrDYQTOM+v3zXFa3eG9WFKxguEEzhuYAR/FByK10F+MiKW+CSQn4yI4pNAv+VuFGifpeHRT+DIHABpoF/itQ8AFIoHheJBoXhQ9MWPAu2zNDz64rV7XvpA8TsRDWIsm1+h0pbaB4rfhWgQ42ZhG3VrKF4H4SBG0X/jQfE6CAcxnl9/x6o+CYSDGGaiAtMnX6N9vd6PQc5SUigEMUY2MMIEpRuEgxgvX1F8GkgHMW5aVb222G1Q/CZ6BzHsBb//SorXQb/lTlvsNij+EMY8FIrMGZJGv8RrHwAoFA8KxYNC8aBQPCj64seA9kk6AvritXveekDxuxFL4JhvTQZNdxSvg0ICJ4TidZAeCmXZbLujeB2EEzi2q350Q6FQfBc7JHBWjVKvfcHei0FOUlooTEY0vjlpKL6LiU9GRPFd7DQZUfZtcDunbbUHFN/JtCcjovgjoH3d1gvtk3QE9BM4MgdAGuiXeO0DAIXiQaF4UCgeFH3xqaN9go6EvnjtHpgtUPzuRIIYbvTqoMlW2+wWKH5nOuakGdlQKBS/iR3mpGl8S5rilRCfk8Zu4EHxOkjPSdMs8LyqV0I8iBF+wifdOzdV5xbxIMbNVfhSitdBfE6aZsyW4nWQHwqF4pNAv+VOW283FH844xwKRebc6KBf4rUPABSKB4XiQaF4UCgeFH3xKaF9MgTRF6/d/eZB8cMQGwolv6FvDGKsbduD4gchlsCxjffJJnAofld6J3DsDBXBhEQUr4P4UCgs8WkgncDJ64EwiaF9IR8wyMkY
B9KTEV2sssePUxwYAUm6QTiBY2IaaU5GRPH70DuBwxKfCtIJnMdUh0KheGG0r+cqtE+EMPoJHJkDIA30S7z2AYBC8aBQPCgUD4q+eG20T4AS+uLZMaMCfBCD4gcnHsRYJNYtS/GHsFsQI+ykoXgVhIMYFJ8KwkEMW9XPk5qTZpATMD6kh0LJt/h8mUx/PKr1TGVOmkbYkuJVEA5i2BXBeGcUr4NCEKMRtqR4FfRb7iheBf0gBi/pVdAv8doHAArFg0LxoFA8KBQPir54Xp+roC+enWwqHEt8LH1T/BB+U5ridTiS+PhERI+2QXe58JvrKV6Hw8X3n4joxq4znXPeSorXYTDxvSYisuvCEA7F6zCc+B4TEbmaIRwbgVf1OgxX1feYiCha4tnNpsJw4rMe6ZvnyGc8xatwuPgdJiJywpdXwVU9xaswwO1c7/RN/D6e4lXQb7njtZ0KAuK70zcGKtNBv8RrHwAoFA8KxYNC8aBQPCgUD8qQ4pszhIcrWoOgOCheh6OWeF98exAUB8XrMGyJLztmXn83ny+s+GXpuj0WhoPidTiO+AvbU5OvuLkqn6T4tDiO+KKndvnnyntkEBQHxetwXPGff1V/qLcGQXFQvA7HFb+6v/I3CAdBcVC8DsOK98IXTnx9Yd8eBMVB8ToMfB9fhy8K8d4EVM1BUBwUr8OA4mPX7NuheB2GE38fLdBbYhgUrwXb6kGheFAoHhSKB4XiQaF4UCgeFLkETraM3sxTvA5iCZzsZmH76JtQvA5iCZxYz5yB4nUQTOB8x6o+IcQSOM8XCxu/akLxOoglcJi5SwuxBM7LVxSfEmIJHHNVz6o+HeQSOHbkw/aLKF4HJnBAYQIHFLbVg0LxoFA8KBQPCsWDQvGgiAUx7u1N3aL1IorXQS6IkWUMYiSEWBAj62jao3gdxIIYObFvSVO8EmJBjK62fIrXQW4olOgnPMVrITgUSlDtV1C8DnJBjPbdnoXidZAMYlB8QjCIAQqDGKCwrR4UigeF4kGheFAoHhSKB4XiQeFkRKAITka04mRECSE7GREnKkgGsQQOS3xayCVwXt5wooKEkBsK5WKVPUau7iheB7EEjqn6ORRKOoglcFji00J0MiKOc5cOTOCAwgQOKGyrB4XiQaF4UCgeFIoHheJBGUz8lglpXEtOpKOG4nU4XokPxT/a+M1y0RodgeJ1GLDEbxoH5Wb+ff6sCWLUbbgOitfhCOLj46DYZyOzVFC8DkcQHx0Hxa6Kdc1SvA5HFB+Mg8ISnxhHFB+Mg2JX8TM+HQYUv3EcFCd8ecWr+kQY8j5+QwqD9/GpMZT4/VIYGcVrMZD4PVMYGcVrwbZ6UCgeFIoHheJBoXhQKB4UigdFbiiUzOYwWlC8DpKTEd1HJqGieCUEJyN6fv01xSeD4GREyxWr+nSQGwrl/oqf8QkhNxTK63cUnxBiQ6G4KUbb8xFRvA6CkxHxdi4l5IZCySg+JTgUCigcCgUUttWDQvGgUDwoFA8KxYNC8aDIBTEe55yTJiHEghimCa/5hUkDxesgGMTIWl+RNlC8DoJBjIwlPiHkghim8Zbj1SeDWBDDwKlJ0kEsiGG5affLUrwOYkEMTkaUFnJBjHvOSZMSDGKAwiAGKGyrB4XiQaF4UCg+DWaHsfsbUnwaHHYaKH60jFr85iCGua/jN2m6GLX4Fr5407xjphJvQvGWUYvfGMR4NH3x7KTpYiLiO4IY7KTpZCLi49PSmHkKWlC8ZWLigyDGy5uId4p3TEy8H8R4vohc01N8wcjFbwhidHineMfIxW8IYrgxcHhV34F3GvZoiNUVzyDGAfjifymo1n14a/8OTr0Lpqff3cZf3RcGMdJgo/isKXrDq/vCtvo06CX+6Y9fntw+nc9mC/Pz05/+aR41Xt0Xik+DfuLPF9n7v66s9fy/88vswVX/FD9a+okv6vv3X9w69+Uaih8tO4hfz2YnFD8Vtt3OVeLfXy/Kqp7iJ8C201CJt///
dkXxE6G3+OxuNvvky0VS4rfNSRMbF4HiHaNusm0Rin/kGDjdjFr85qFQbubfs8R3MhHx8QQOq/puJiI+nsCh+G7G3Du3dSgUiu/GF/9ZwabeuSzot0lOfDgUCsV3s1F8Fu2dS0n85jlpKL6bfuLfX89O8uWDabQtHjdf3RfJOWkovpt+4teX2d2Z6aMxi2RKPBM4B9BLvOmSza2bRZZOVc8EziH0E3+dX+K9WmVP5/bfRMTvCcVb+on/olL9cPqO4qdAr25Z8xmfGzexG4qfCL1650xdn9fxLoqR392nclW/FxRvGXWT7V5QvIXiQRm1+M1BDPMdqkiHPMVbRi2+RSA+MhqGgeItoxa/OYjRqg8cFG+ZiPhIEOPlTfTLshTvmIj4SBDDjngVKfUUb5mY+MacNBz1qpNRD2nKOWlGhOicNNm3vJ1LBc5JAwqDGKAwiAEK2+pBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQ1MUTHSgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UCgeFIoHheJBoXhQKB4UigeF4kGheFAoHhSKB4XiQaF4UP4Pj4L6YZU2QjoAAAAASUVORK5CYII=" /> <img alt="plot of chunk unnamed-chunk-5" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAA5FBMVEX9/v0AAAAAADkAAGUAAP4AAP8AOWUAOY8AOf4AZo8AZrUAZv4iiyI5AAA5ADk5AGU5AP45AP85OQA5OTk5OWU5OY85Of45Of85ZmU5j9o5j/05j/5lAABlADllAGVlAP5lAP9lOQBlOf5lOf9lZjllZmVlj49ltbVltf2POQCPOTmPOf6POf+PZgCPZv+PjzmPj2WPtY+Ptf6P27WP29qP2/21ZgC1Zv+1tWW124+1/rW1/tq1/v3ajznaj//a24/a/rXa/tra/v3w+P/9tWX9tf7924/92/79/rX9/tr9/v39/v6yHK1SAAAATHRSTlP//////////////////////////////////////////////////////////////////////////////////////////////////wD/IKv2MwAAAAlwSFlzAAALEgAACxIB0t1+/AAAEzNJREFUeJztnY1/4zYdxnG7rpBbgfJycL2y3V6AHAsDNpIOaALr2qbn/v//D5bkd8uOkjp6JP+e72fXpE7saP5WjiI9kX6SEpH8BF0AgoHihULxQqF4oVC8UCheKBQvlGmIf7o+vR16fDV33fvD+/x+55C7XsP11QJBhPhV4qyiX7wze7wakKDFZxa+TxItYJ0k6nxqL0/X53cf3p98lZws77PNJ8t2nW3u9OF9djNTD630jfqZ72eO8p3au9yi9j6/yw+Zv6xCbbhPfvY+O/TaHLbxOm9SczTNrHk49SRVivr/CZrAxeenMZOV6JNbE5/9fv6/a7X5/K4pvrVTJf7eGD1ZPuX7FUc5vW1sKQ9ZvKzCiC84vS2fmT9rZl7548/V/dbh8seS1iGRhC7+jbb1eJGdXyW9Ln5mnvSktTXEt3YqL75ZlVwa+/l+5ijF3vmWN/nfxultdYS0EK82JnP1hPyZyfzxQl9z9BZVpuLVysOZwphyNA6JJHDx+pSe3+U1TZ3b6lK/VE9RVastvr1T9a67zjSYX8x+5ihm72KLur9SbssKXryS2jDLfeZFUUe817rXyTwvU+cFTGFMJW8cEkng4nPNfeK
z2pd0xbd3qsRnm9V1vdyvEl9t2S3evPqg+MbhTGFmugQU70J52vQF0mzJzuo6KcSrGqwuolbx5U61dvbq5O+q6hX7VeKrLe1LfUFXfPdSX4qvDlcWJruTbW0cEkkc4vMmUe2OOcn3yUCNb+w0Mw9mNbG+XyW+2tJt3FVtgpb4TuMuF5/MqsPVG6NJ65BIIhGvz5e6VW+df67e47PNb9bmujywU6a7qvyztNqv9h5fbml8nFtVkiyX+u/zB6uPc6pM+tWqw9UKUxQ5AO9hiw+aEJrmL4DiD4Xiw8C0oz22lymexAjFC4XihULxQqF4oUQuPu8+KzrmcizpmbwnTe+ghkvWVX+Q7q3p+0Sgnj9r3a+2PV7Ub/UHi1ia+pMU3yEXa7p6f6UHRmvCZwPiV7WjF/fLbasy4KFv7xOK90g+wt0O5NQCMtXAq+6d/0r3ov5e7ab2NeP5tU7UtfkjmZuDVwMBxf3iNvvj+bUSXtyma/yYmzuTEZ+0AjlVECdtif/HxTyrnH8y4o3fXvF6NHVlhBb3i9sPfzF3y9v3EVX4CYmfmV+LQE6ZfTEbi3eETPx312/S9em/yr8XfQHoGTPT+YlV8Udg7te2VaPss+JFYjE/GfHtQE6ZfSk2J3la82S5Ov/h/fl/zcD4Kh/WrYlv1nh38aogOi8QBZMSXw/k1AZEi8tAMXS6Pv33dZWIMBm6Ay71aUt8vjM+R+nEpMTXAzkd8eYNXT3x8eKP6kf20No00pvimwfvb9ylLfH3CWu8T2ri64GcrvhMs3kHyOp+HquqLhG1Wt5gVXzeK8Izs/pHPMt7fAAZCycmJb4eyLGIVy1A/UQVwzGXemVetwX6xOedNfoA3Q6c5qU+7w2Kg/jFk4OgeKFQvFAoXigULxSKFwrFCyV68QMj8qvi68qHpjCK489a92vbdJ+9OsIsrW5jYMLiq++pH5rCMEcZSGKovsJ8yo3yQOyr90Y+3lLMMaJ64/Lx+Jl6+OAUhjn2UGe9eqm5PpD6d6+HfSMZl52O+GKOkbzL/Ae7+H1SGIrB4bnM8t/yp97r0f/yG7PhMxnx5Rwj5Xhr+1K/dwpDMTggn/1bm4dWalM16h8BkxFfTjWhT38+rYV++OAUhmJIvLpdl9PrmIHfk68o3htN8WastTYBysEpDMXApf6fRbPyPh/V1zuseKn3RnWpL8kvxPr+wSkMc+zext1/CvHmd3NUPeVJDExGfDnHSBGuKFPvh6cwFMNJDHOpXxWfFvlxziuP1Tt2LVxRToByeApDMZzEMOLzjgF9pEi8T0I8OQCKFwrFC4XihULxQqF4ocDFJ+gCCIXihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQ8OLJkRg+73jxn5CjABS/OLsyd7a/rGXNG79Q/NHAia8M110/nH1E8T7wIl6ZVf8ezrTW53fZTfbj7FuzvSb+Rm+j+OPjU/zzZ7fp5lVm90rd5Mab4nmp94VX8e/0FwbVTfY34CieHAmP4tPt5dnZUl3pM5aO4n8kR8Gr+IyHn96pK77ZSPE4vIhXF/fNR7eZdCVevcdnN/m1P9tO8QD8fJzbnJ29zer5Tdmqz674yrDZTvEAgu+5Q5+gqRKGeNXq0y2+ziPotu90GTaCr/HoAgiF4oVC8UKheKFQvFDw4snIuJ13vHj0KNbUCEC8LYGjPtA3JpSj+JHBi7clcFTn/fay3o1D8SPjVbxzAufhVfbjpl7lKX5kEOKdEjhpHteg+OMAEe+SwEkXV/Vd0W3g6QEQ75LAeX7X8M7RuZdxcDeI7wTO9rI1STTFvwiweOcETsc7xb8MsHjnBM5Gj8o3WvXoUxc3aPGHg24Kxc6h5x2fwPFTANICX+PRBRAKxQuF4oVC8ULBiyd9HPW848WjBzWCJV7xtiBGPnBbQfF9RCveFsTQvXmv6s+i+D5iEO8+FUra/tYkxfcRkXi3IAZrvBsxiXcIYmwvm/226KZzwIyipg//U6G
0o1fo8a3g8NOz4X8qlHbYEn2egyMm8c5BDPUoa/wwMYl3nwpl0x6bpfg2UYk/HIpvMynxnAplD7wYwdd4dAGEQvFCoXihULxQKF4oePExgT5ZI4IXjx4E2wOKd8O+GFG6iHYqFIp3wr4YUbqJdw4cim+zRwJn+/pLig8A7wmcxZKX+hDwncDZXLXf42NilJMVBp4TONvXd23x6MEwJ6ak3OA5gWMmRqjPgkPxGPwvRsQaHwT+FyOi+CDA99zFAfo0jQ4+geOnAKQFvsajCyAUihcKxQuF4oWCFx8F6LM0Pnjx6JEXFyh+L6xBjEX7K1RoqS5Q/D5Ygxg3c92pW0HxGDwHMfLxmxoUj8FzEGP7+hte6oPAcxBDLVSgxuQr0O11N0Y5S0EBCGLENjHCBK2n3oMYz19QfBj4DmLcdC71aK87ofgBnIMYusFf35PiMeB77tBed0LxLyDqqVD8nCLP4Gs8ugBCoXihULxQKF4oFC8UvPgoQJ+l8cGLR4+8uUDxe2FL4KhvTTa67igeAyCB04TiMfieCmXR7rujeAyeEzh6qD6+qVAovoc9EjjLVq1Ht9fdGOUsBQVgMaLo1qSZoHYuRuQCxfey12JE6deNj3Norbuh+H4mvRgRxR8DdLPNBfQ5Ogb4BI6fApAW+BqPLoBQKF4oFC8UihcKXnzwoM/QccCLRw/A7ILi98YSxDCzVze6bNFid0Hx+9KzJk1sU6FQ/AB7rEnT+pY0xYPwviaNfkINisfge02adoVnqx6E9yBG8x0+5NG5iRrP8R7EuLlq7krxGLyvSdOO2VI8Bv9ToVB8EOB77tB+e6H4EYhyKhQ/pwYEvsajCyAUihcKxQuF4oVC8ULBiw8K9NnwB148evStDsWPgm0qlOwDfWsSY7TsOhQ/BrYEju68DzeBQ/F74pzA0StUNBYkongM3qdCYY0PA98JnOw60ExioNvxTUY5G1HgezGiy2X68FGIEyMIcq7xnMBRMY0wFyOi+ENwTuCwxoeC7wTOQ6hToVC8Z9DNuRL0ifAMPoHjpwCkBb7GowsgFIoXCsULheKFghcPB30GMODFc2AGAoMYFD829iDGPLRhWYp/AfsFMZqDNBQPwXMQg+JDwXMQQ1/qz8Jak2aUMxAdvqdCyZ7xdhHMeLxQ6QrAmjStsCXFQ/AcxNAbGvOdUTwGQBCjFbakeAj4njuKh4APYrBFDwFf49EFEArFC4XihULxQqF4oeDFs3kOAS+eQ2wQjiXelr7Jf2l+U5riMRxJvH0hogfdobuY17vrKR7Dy8W7L0R0o7epwbnaRorHMJp4p4WI9LZmCIfiMYwn3mEhInNlaM6NwFY9hvEu9Q4LEVlrPIfZIIwnPnVI32wt7/EUD+Hl4vdYiMgIX1w1WvUUD2GEj3PO6Rv753iKh4DvuWPjDoIH8f3pGwWVYcDXeHQBhELxQqF4oVC8UPDi2SaHgBfPIRYIY4pvrxDe3JAP27aheAxHrfF18bov71X3ORSPYdwaXwzMvP7m7GyuxS/qX5Fsfl3aQPEYjiP+Uo/UZBturupPYI0Ph+OIz0dqF7+re99e2npt2arHcFzxb79ofBe+Oe2RgdYxHFf8ctO41Kc3885OFI9hXPG18IURXzXsu8sQGSgew8if46vwRS6+ashv7COzFI9hRPG2+rwbiscwnvhNa1ajnOEYBsWjwPfVowsgFIoXCsULheKFQvFCwYtnXz0EvHiOzkHwl8BJ9VwYHSgeg7cETqq6eCg+GDwmcLavv6T4YPCYwFkseakPB38JnM2V/T2erXoI3hI4agYUu/gRS0Cc8ZbA2egxuqvOThSPwVsCRz+BNT4Y/CVwUooPCSZwhIJP4IxWALIP+L56dAGEQvFCoXihULxQ8OLZZQsBL56DNBD8BTHU5zqOzgWDtyCG6t7ZXlq+NEnxELwFMR7UbBi2r0lTPASvU6FYvyZN8RA8ToWilyjowFY9Bo9ToTy/s3jnIA0If1OhbC8tbXq
KR+EtiNHjneJBeAtimOgVJz8KBQYxhIIPYlA8BHxfPboAQqF4oVC8UCheKBQvFLx49tVDwIvn6ByE0cTvmgfFspC0huIxHK/GN8U/6CWoFvPO8iQUj2HEGj80D8rN2bfZo8+f3XYWJKJ4DEcQb4/f6EdfdxcroHgMRxBvj9+oTbZVKtiqx3BE8c2ViHpr/FgFIHtxRPHNlYi2fe/xYxWA7MWI4ofnQdHCF1fdVv1YBSB7Mebn+MF5UHo/x49VALIXY4k/LH6TUjyKkcQfGL9JKR4Fvq8eXQChULxQKF4oFC8UvHh22ULAi+cgDQSfa9K0e2s1FI/B45o0JovRhuIx+FuTxmQxOlA8Bp9TofBSHxA+p0Kxi2erHoLHqVD6xI9YAuKMv6lQKD4ofK5JQ/EB4XNNGooPCE6FIhROhSIUfF89ugBCoXihULxQKF4oFC8UvHj21UPAi+foHARvCZzeSYwpHoLHBE6qx+jbUDwGfwmcnj5disfgM4HT/oa0huIxeEzg2Adx2KrH4DGBY3uH5yANCo8JnBvbIlQUD8JjAqfzaU9D8Rj8JXB6choUj4EJHKHgEzijFYDsA76vHl0AoVC8UCheKBQvFIoXCsULxd9UKNnnOuuMGCMWgLjjLYjx/G6ebmxBjGMWgPTiLYhhW6RAQfEYvAUxKD4svAUx9KXe0nHLvnoM/oIYWePurWVkluIx+JwKRa9J04biMXgLYuiHLGlLisfgL4jxcGYdt6V4DAxiCAUfxKB4COyrFwrFC4XihULxYZC8jP1fkOLD4GWngeKjJWrxDGIcTtTiOzSDGEsGMfqJWvzuIIZtkGbEAkTMRMRbghis8YNMRLxtRozndxyk6Wdi4mtBjO3lMn2wtO4oXjMx8bUghrr0WzN3FK+IXPxAEIM1fpDIxe8IYtjGZileUzsLB3TEYsUziPEC6uJ/zCm3fXiv/w7Oa23jx1/c2vd2hUGMMBgUn7ZFD+ztCvvqw8BJ/ONvPj+9fbxIkrn6/fG3f1X3Wnu7QvFh4Cb+Yp4+/WGprWf/XbxJ783ln+KjxU18fr1/+vTWuC+2UHy07CF+lSSnFD8Vdn2cK8U/Xc+LSz3FT4Bdp6EUr//9fEnxE8FZfLpOko8/nwclfjiBky7sPXcjFiBiou6y7dAQfzO3r0lzzALEQ9TiBxM4tvSNguI1ExFvnQrlG17q+5mIeEsCZ3s517m7Nuyr18Q8OrcjgdM3+RHFK+riP8kZGp1LG+M2wYmvJXCev6D4AQbFp9bRuZDED61Jc8NL/QBu4p+uk9Ps9l512ub323u74nNNGvs3aShe4SZ+9SZdz9QYjboJpsYzgfMCnMSrIdnMurpJw7nUM4HzEtzEX2dNvJNl+nihfwYi/kDoXeMm/tNS9f35HcVPAadhWfUenxlXsRuKnwhOo3PqWp9d400UI/t0H0qr/iAoXhN1l+1BULyG4oUStfjhIMYm+0xn++7ciAWImKjFd2gHMWxQvCZq8YNBjJ7l4yneMBHx1qlQzvQfQxuK10xEvDWIsbTWenbZaiYmvr4mjcLyPk/xmqinNN25Jg3Fh4O3IIaOVn/NqVBCwV8QY8OpUEKCQQyhMIghFPbVC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQKF4ocPEEA8ULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXyv8BpNeUclVPlqAAAAAASUVORK5CYII=" /> <img alt="plot of 
chunk unnamed-chunk-5" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAA1VBMVEX9/v0AAAAAADkAAGUAAP4AAP8AOTkAOWUAOY8AOf4AZo8AZrUAZv4iiyI5AAA5ADk5AGU5AP45AP85OQA5OWU5OY85Of45Of85ZmU5j9o5j/05j/5lAABlADllAGVlAP5lAP9lOQBlOf5lOf9lZjllZmVlj49ltf2POQCPOf6POf+PZgCPZv+Pj2WPtY+Ptf6P27WP2/21ZgC1Zv+1tWW124+1/rW1/v3ajznaj//a24/a/rXa/tra/v3w+P/9tWX9tf7924/92/79/rX9/tr9/v39/v7YV/E+AAAAR3RSTlP///////////////////////////////////////////////////////////////////////////////////////////8A/3m6RJAAAAAJcEhZcwAACxIAAAsSAdLdfvwAABNASURBVHic7Z0Le+NGGYWrpKaBLAECLE02wPYC3ppCa6c0MXRxLlX+/09Cc9FlpLEsO/KcGX3nfbrrrCwrql7PeDRzPPNJTkTyCfoECAaKFwrFC4XihULxQqF4oVC8UJIW//OH07t99n++qfevXtvc6N+wldV8n98eF6LEe1873HObVUbxYfn5Q3b6483Zx0LeD1l29jHfZL8qNt3dZ1bFKju3f2+yLDtZFq84+To7+V45rrbY1xrxxUtLjWpD84BmT/3uUHtd5eZomnP3cGqn4uTqneN9Z6QoXl3a7BcXSry++mcfN1mJKb0bY/Rk+XxjdtB7nv23UOpsUT9o8Sv9jyt9eCO+PmC1p93r3JzAZ1+on1uHs89lrUNGSIriHy+K63pvrvaV9rsprvTjRTZXP6s9iiK5NPZzI7LY87yu1e2WK/veUK8tdi0rfyO+PqDdM5s/XhRHtVtMjWIKdHU49Rv1ya3Uzo1DRkiK4jdGoq7q7/T1V1vM5Tfii7fFlfWiimT5hBFfbrGvbRRw82KzoT6g2bM4ov69xQ9z+2s6v0Cd08YWcueQETIl8WqDvdDFP1S9XlQOZXVdiq+37BZvDtgr3jmcEX9uT5Lix8ZX1bfE56uTf6qip0q+2r0WX29pV/UlXfHdqr4SXx/Oircnd+UcMkJSFO9r3LXFFyVR/bjJ2iW+3tJt3NVtgpb4TuPOis/O68NZ8a3GXbT2UxSvil/7dq4t3t7Rqct/VVXOpWOzxbmdW9WSPFX9D/bJ+nZO/ZrizXX2sT6cEa8PVb4v4vWepnjFJthFjbhp/gpSFG/unYPdIlN8NGzMR20gKJ5MCIoXCsULheKFQvFCSVy87VVrNfE9oZpyzFa9QN0H3pvOFfWE7sTZ1rGu9j9v/az2bj6W23W/fSp3AJMU38GK1d37J7/X46UN4ec94leNo6+aI+71Y7XPJqP4gNiBb5WwWdo4jBlQrXIz9Xis7rT/Wneufq5epl77rNMZjW7Ae/MmmZuDqyfNUcqf1YvcP3af+2iH4jxMRnymEzb1uEsViVE44v91MS8K59+NeON3q3g9yLoyQt2fy1H3q2q7/o2pFPgJibf1sSnAjUiM2Vh+IhTiv7+5yu9P/129X3QFsGUoTQtelW+C6ueV9a4fy+3mIKmYn4z4MkShLn0VxilLshU7NwNrq7OfPpz9x4yXrzLzgoZ4t8R7xVfduOqx3P5oBorjzVc6TEp8GYexn/F3DfFarh1RvT/98aYOSpho3R5VvXk0sZ/6sZH5SoJJiS/jMF7x5gNd7fh48Tf1V/HUvXXniHcP7mvcqUd1hOajKvks8SFpiC/jMH7xhWbzCaByHCZtVVcRjVLusCrv98pMTXn7VgZyysfynjDe6IXLpMSX6Zot4lULUO+o0jmmqlfmdVtgm3jbOaMPUHbU2E4A57Hs0EnF+wTEk4OgeKFQvFAoXigULxSKFw
rFCyV58T0j8qvyW8yHpjDK47eTGGXiwowM5Lr/oHlDnwITFl9/ff3QFIY5SieJUSYuqq46p0svEfPJi89t31099Ygqj3Y8Xks4OIVhjt3trC8TF+YA5vu2dSd+IuOy0xFfTj1iy+FPfvH7pDAUvuG5chS/fOfUqYzGMF3sTEZ8NfVINd7arur3TmEofAPyVeLi3rxHNtnnF5meVqneN3omI76agUJrsbNd6KcPTmEofOKrxMXK1DHloCDFB8YVb9rcegyunpvooBSGYlvorpwaRR3SPLYSGbEzGfHu1CMrXR5r8QelMMyxu427MnFRCn+8OLc1ARt3QWk27lTa2X4Cr5zG3SEpDIUnifFchXnr31Q/8nYuGI/1J3YjXGFmKslfk8JQ+JIYZeKi2YFTTmqZiPdJiCcHQPFCoXihULxQKF4oFC8UuPgMfQJCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxS8eHIk+q87XvyvyVEAil/Mrs0PT79rZM2df1D80cCJrw03XT/MPqX4EAQRr8yqPw8zrfXlffFQ/DX7zmxviL/V2yj++IQU//LXu3z9prB7rR6scVc8q/pQBBX/Xn9hUD0U74GB4smRCCg+f7qczZaqpi9YDhT/P3IUgoovePjlR1Xjm40UjyOIeFW5rz+9K6Qr8eozvniwdX+xneIBhLmdW89m74pyflu16osaXxk22ykeQPQ9d+gLNFXiEK9afbrF13kG3fadLv1G8CUefQJCoXihULxQKF4oFC8UvHgyJoOvO148ehRrUsQh3pfAUTf0zoRyFD8mUYj3JXBU5/3TZbMbh+LHJLT4wQmchzfFX7fNIk/xYwISPyiBk9u4BsUfAZT4IQmcfHHdfCm6GTwxMOKHJHBe3jveOTr3Cl7TBxI6gfN02ZokmuIPBy9+cAKn453iXwFe/OAEzlqPyjutevTVS5gIxB8OujWUNK+47vgETpgTIC3wJR59AkKheKFQvFAoXih48cTH0a87Xjx6XCNK0hbvC2LYgdsaiveRtHhfEEP35r1p7kXxPlIRP3wqlLz9rUmK95GY+GFBDJb43aQmfkAQ4+nS7bdFN58jZRQtfYSfCqUdvUIPcUVFuF6N8FOhtMOW6GsdFamJHxzEUM+yxG8nNfHDp0JZt8dmKb5JcuIPh+KbTE48p0IZSBgdeQwlHn0CQqF4oVC8UCheKBQvFLz41EBfsJHAi0cPhO0Jxe/GvxhRvkh6KhSK34l/MaJ8nfYcOBTfZI8EztPbryg+AoIncBZLVvUxEDqBs75uf8anxigXDE/gBM7T249t8egBsd1MxbVD4ASOmRihOQsOxWMIvxgRS3wUhF+MiOKjAN9zlwDoa3QM8AmcMCdAWuBLPPoEhELxQqF4oVC8UPDiUwF9oUYGLx496jIQih+OL4ih+mw/TXFiBIofjDeIcduZvRptdCAU72N4EGPR7sKheAyBgxh6xDbJIAbF+xgcxNBLjzmlHt1YH8woFyoeAFOhJDYjxtSMWzgVyi4ovoe9pkLJv3Fu59Bid0HxfUx4KhSKPw4UjwEfxIieMFcoNPgSjz4BoVC8UCheKBQvFIoXCl58KqAv1MjgxaOH3QZC8cPxJHA8q0mjjQ6E4gezZSoU019fQfEYwi9G5E5XT/Egwi9G5K5FRPEggi9G1CrwbNWDCJ7AcT/hox2dm5rnDsETOLfX7kspHkPwqVDa+WqKxxA6gdP+iKd4EPieu0hBX5djg0/ghDkB0gJf4tEnIBSKFwrFC4XihYIXnwToqzQ+ePHo0ZchUPxeeNekWbS/QoWWOgSK34dtU6EkGMSg+C0MDmLY8ZsGFI8hcBDj6e23rOqjIPhUKHO9PEkNur0+jFGuUlQA1qRxB+giHZ2b/PBc4CDGy5cUHwehgxi3naoebXgLFD+IPYIY7oymFA8C33OHNrwFih+FBKdCCXNhYOBLPPoEhELxQqF4oVC8UCheKHjxUYK+KscHLx
498uaD4l+FL4FT3NC3eu7Qkn1Q/GvwJXB05338CRyKH8jgBI4elnVSOBSPIfhiRCzxcRB6KpSiHnCnxEC33/2MclWiJnQC53KZP7grTaKH4SwCXDsETuComEacCRyKP4TBCRyW+FgIncB5iHUxIooPDLoZV4G+EIHBJ3DCnABpgS/x6BMQCsULheKFQvFCwYuPBfSFCAxePHpAxkLx45FUEIPiR8MfxJhHOixL8YewXxDDHaRBG7dQ/CHsMRUKxcdB4CCGruqdjlt0Y75ilAuRDqHXpCn2eLeIYzxemmqX4GvStKc8o3gMgYMYeoOz8BzFYwAEMVphS4qHgO+5o3gI+CAG2/EQ8CUefQJCoXihULxQKF4oFC8UvHg21yHgxXOgDcKxxHsXIjL/cL8pTfEYjiTeuxBR/qA7dBfzZnc9xWN4vfjB6Zv8Vm9Tg3ONjRSPYTTxu9M35b5uCIfiMYwnfvc0KGbf1twIbNVjGK+q352+8Zd4jrNBGE98PiB98+T5jKd4CK8XP3whItvEX1w7rXqKhzDC7dzg9I3/Pp7iIeB77tiagxBA/Pb0jYLSMOBLPPoEhELxQqF4oVC8UPDi2YyHgBfPYRgIY4pftG/YWhvc705aKB7DUUu8K/6hNfuNgeIxjFviy4GZt9/OZnMtflH1zpoYRgeKx3Ac8Zd6pKbYcHvdeJ5VfUQcR7wdqV38qel9i3i26iEcV/y7L5vfhfeL57gbhOOKX653V/UUD2Fc8Y3whRHvNOwpPiJGvo+vwxdWfFM2xUfEiOLdiQuHQvEYxhO/bs1qZOmPYexs1Y92esQF31ePPgGhULxQKF4oFC8UihcKXjz76iHgxXN0DkK4BM56NvMlMSgeQ7gEzu3cuw/FYwiXwOnUBwaKxxAsgaNnTPAUeorHECyB83S59JZ6tuoxhEzgeD/nOTqHIWQCh+IjIlgCR33m5994bucoHkK4BM7aPyhP8RjwCRy24iDgEzijnQDZB3xfPfoEhELxQqF4oVC8UPDi2WULAS+egzQQwgUxivs6zogRD8GCGC/v5/nac6dP8RiCBTFa6xNUUDyGYEEMio+LYEEMXdWPtn78iGctlHBBjKJx986XwOHoHISgQQy7YI0DxWMIFsTQT73pvojiMYQLYjzMvOO2FI8BH8SgeAj4IAab7xDwffXoExAKxQuF4oVC8UKheKHgxbOvHgJePEfnIARL4Kz13bznS5MUDyHgYkT2i5MtKB5DuKlQtvTpUjyGkIsR+QbnKB5EwMWI/IM4bNVjCDgViu8TnqNzKAImcG6vfS+ieAwBFyPyT3RH8RjCJXC25DQoHgM+gcNWHAR8Ame0EyD7gO+rR5+AUCheKBQvFIoXCl48u2wh4MVzkAZCuKlQHrxL0lA8iGBBDNWF5/3SJMVDCBnE8C4gT/EYQgYxWOIjImAQ4+nSO189W/UQgq5J4538iKNzELgmjVCCrknDEh8PXJNGKPggBsVDwAcx2HyHgO+rR5+AUCheKBQvFIoXCsULBS+effUQ8OI5OgchXAInX/h77igeQripUG7nnAolIoIlcHyLFCgoHkPAxYi+ZVUfEcESOE+Xc70SVRu26jEES+BsXX6Mo3MQgiVwXr6k+JgINxXK7ZaqnuIhhJwKxf9NGoqHgE/gsBUHAZ/AGe0EyD7g++rRJyAUihcKxQuF4oVC8UKheKGMJn5HCsP05Ly879zzUTyG45V4V/yD7rZbzDuTI1A8hhFLfN88KLez74pnVRijPR0KxWM4gnj/PCj6Wc/QLMVjOIJ47zwoepPvO/Lsq8dwRPHuPChbSzxH5yAcUbw7D8rTts94iocwovjeBWmM8MV1t1VP8RDGvI/vSWH03MdTPISxxB+WwsgpHsVI4g9MYeRs1aNgX71QKF4oFC8Uio+DPb4+Nk5TiOLj4HWXgeKTJWnx/VmM4tbO+02aEU8gYZIW36EpXnfo+mbEOOYJpEPS4nuzGHpkzjMrBsVrJiLek8Vgie
9lIuJ9WQzPAI2CXbaaiYlvzohxucwfPK07DtJoJia+kcXg0iS9JC6+J4vBEt9L4uL7shgPXJOmh8b/6AENHKx4rknzCpriO//zP3/Q74OzRtv48bd3/lcPBT8jBlv1il7xeVt0z6uHwr76OBgk/vEPX5zePV5k2Vz9+/GP/1A/tV49FIqPg2HiL+b585+X2nrx38VVvjHVP8UnyzDxtr5//sudcV9uofhk2UP8KstOKX4q7Lqdq8Q/38zLqp7iJ8Cuy1CJ139+s6T4iTBYfH6fZZ99MY9K/K7ZUNSEGB0oXpN0l22Hlvj1jOK3kbT4/tlQ8qe3X1H8NiYi3jsbymLJqn4rExHvS+Csr/2f8eyrV6Q8Ord7TRqveI7OKZriO2tteUbncmfcJjrxjQTOWo/RXXdeRPGaXvG5d3QuJvH9s6GwxG9nmPjnm+y0eNyoTlv7c/vVQwmXwMkpvodh4ldX+f25GqNRD9GUeCZwXsEg8WpItrCuHvJ4qnomcF7DMPE3xRU5WeaPF/rvSMQfiBi1/QwT/5dK9ebsI8VPgUHDsuozvjCuYjcUPxEGjc6pur6o400Uo7i7j6VVfxAUr0m6y/YgKF5D8UJJWvzOqVA4Hr+VpMV3aE+For442YbiNUmL7w1iPKjpym+7RZ7iNRMR71+Wxv/9+BFPIGEmIt6/LI1ap6ADu2w1ExPvLEvz8t7jnYM0hh3v/13s/wuDBTGKlp2nTU/xKAJOheL1TvEgggUxTPTK06qneAgMYgiFQQyhsK9eKBQvFIoXCsULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKBiycYKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQKF4oFC8UihcKxQuF4oVC8UKheKFQvFAoXigULxSKFwrFC4XihULxQqF4oVC8UCheKBQvFIoXCsULheKFQvFCoXihULxQ/g8aSXVZz/sAjQAAAABJRU5ErkJggg==" /> <img alt="plot of chunk unnamed-chunk-5" 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAA3lBMVEX9/v0AAAAAADkAAGUAAP4AAP8AOWUAOY8AOf4AZo8AZrUAZv4iiyI5AAA5ADk5AGU5AP45AP85OQA5OTk5OWU5OY85Of45Of85ZmU5j485j9o5j/05j/5lAABlADllAGVlAP5lAP9lOQBlOf5lOf9lZjllZmVlj49ltbVltf2POQCPOTmPOf6POf+PZgCPZv+Pj2WPtY+Ptf6P27WP2/21ZgC1Zv+1tWW124+1/rW1/v3ajznaj//a24/a/rXa/tra/v3w+P/9tWX9tf7924/92/79/rX9/tr9/v39/v5e9HN3AAAASnRSTlP///////////////////////////////////////////////////////////////////////////////////////////////8A/+QNreQAAAAJcEhZcwAACxIAAAsSAdLdfvwAABQOSURBVHic7Z0Le+Q2GYXxbtNAlgDhskCyod1egFk6FNpMCkmgYXJZ5///ISzJF9mWHc/E/o7s77xPu5l4PLajd+SRpTPyT1Kikp+gD4BgoHilULxSKF4pFK8UilcKxStlRuIfzw9v+9e4P05eXTy7nc2qfy+vr4rHHz/kj/2F4QV77g3HosRvkgGlPGglR7f4weywN1kiE//xw6uvs1q7TRJTd7OC/yFJTJl//JAkf7Xir7OnTtN0m/zsQ/ZM9tuqem32VLZKvka+KfurWSc7HWQvsCsdmdU39of5t9ybecH3xnFt/9kmnfhiS6l7J/iHUB1pbfeWozT455QrY94Z0Yk38v57nliH9jdTcvmDTMHGK0tHWRdz8Rv/NcWvp5kqu4UfS/FbZ/TVxaO/t2zfr68e6/s/vLXiiy0ZnPjqENoHaJd8+oV5/Bj+cxqbFCY+8UfukSna7LdTK+j+OCula/sgqznG1jYrt6wOr8zj8tXmvFqu4TZ1f5zJMKds+6BYye0qW9Har/Z2VJ3Vy/2790a+geLk78RXh5Cvmawauy/31v5zNmZlb5PCRCfeeTSVxJSULcXD261Tkj/I3gIr88AVZkN8uYZ7Jq+Web02RVx+6l5nGtwv3t5y8cWS/AVeBS+OzyyoDsGtmW2xsfv2Dtxf4Sp5bZPCRCnefByXBd8n3pw+B4p3G81+bqrPaXte9/ZWiK/t/xnx7hB6xYf/nJTiPVxxmbpoTodlSYVO9R3ivXOtew/5lwJmhaqdvXn1D1P1qr0V4qslzVN9QVt8+1Rfiu/6c07rBydLlOK3SaOKhBp3HeK91lVe8u6VRbUzC/JmRH7ZX+2tEF8taTfuqjZBQ3yrcVfs/qjjz2lsUpgoxZsCObWny7ykbDv4j/XLuS7x3vXUhVvoitaYzz9HipLeFBd2+d7Kz/hq//7l3KaSFDjV/5A/Wd+93VvrzzGbKt+WmFofmfi5AmqavwCKHwWKB+A+vUGN4xyKJzOB4pVC8UqheKVQvFJmLj7vMSv64nICgZm888y+wIyQXLuOE/OE7aDpui4w61cbvz8+qpZdu12v8iGY1HbHzKZxv0jxLXKxdkz81a/znsBS+FGP+I2/9by3N192Xfbn+r21czE/c/FpOeLuJ3fcYGmZianGWm2H/Ne24/QP5mXmtY82eeH1m5Y12W7cPOm2kr1ZfuOG+KtlWzvsZzZiRnJsn32kUasmixGfeMmdPPVQVdaa+H8em8HbvzjxzlOneDuAunFngo9/s2OC/rLH89O0fPfYA9kgu5F2YTniveROLe7iFhafCJn47zNb16//Vb5f7AmgY5jMRiaqgdxcfLlsUwbn8lW2oCGX3VmMeD+5U0Zzipqci125QbPN4Y8fDv/jxsI3iXuBJ75e4/vEP54Xo3t5cG4L7TfeiUWJL8bc88/4K0+8lZuPll6//vd5FYJwsbkBp/q0FF8su7YrbcvR9
+v5eF+W+CLqEhTvPtDNivfHfzb/2CT2UR6t6jhD+w25NHdeLstPM068TdrMpUmfLkx8EXUJi7dizIpZ3c+bYtUpwqvlNTbF9Z51uvUv54qF7sXlRtiqF8L/jM+jLh3iTQvQrmiSN+5Un8dyusXnnTU18UWnTvFpkXfgbLu2ESXzF0/2guKVQvFKoXilULxSKF4pFK+U2YvvGZHfFN9Q3jeFUWy/3LZLWpR9usVz5TozSmIsWHz11fR9UxhuK9XG86RFkcAon6v9nIn52YtPyx60rjlP9k5huG1XnfV+0sIlMNxzxc9ZJTGWI75zzpO9UxgGf3jOS1rYBEbxXPFzVkmMxYjvnvNk7xSGoTEgXyQtbAKjeM5fZzZJjMWI757zZO8UhqGVxMgr/FEaFD+fJMbyxLfmPNk7hWGoJzGKpEWVwPBP9bNKYixGfPecJ3unMNy2vSRGkbQoJz2oN+5mlcRYjPjuOU/2T2EYvCRG+ZlRvBMal3OzSmIsR3znnCf7pzAMXhKjTFoUZ5dGB86skhhLEE/2gOKVQvFKoXilULxSKF4pcPEJ+gCUQvFKoXilULxSKF4pFK8UilcKxSuF4pVC8UqheKVQvFIoXil48WQi+ssdL/7nZBKA4tcHZ+7Bw6+8uHntF4qfDJz4yrDv+u7gE4qXQES8MWv+vzuwWp/eZz+yfw6+c8s98Zd2GcVPj6T4p8+v0ps3md0z8yM3XhfPU70UouLf268Mmh/Ze2CgeDIRguLTh5ODgwtzps+4GCj+f2QSRMVn3P301pzx3UKKxyEi3pzcbz65yqQb8eYzPvuRn/uz5RQPQOZy7ubg4F1Wzy/LVn12xjeG3XKKBxB9zx26gJZKHOJNq8+2+FrPoNu+y6XfCL7Gow9AKRSvFIpXCsUrheKVghdPXsS+5Y4Xjx7FmjdRig8lcMwFfW0iOIp/ETGKDyVwTOf9w4nfjUPxLwIsfnAC5+5N9s+lX+Up/kXEIX5QAifN4xo5FP8iIhE/JIGTrs/8l6JbxXNnX2XiCZyn9zXvHJ0LM3n3hnQC5+GkMblzgi7iOJmJ+MEJnJZ3ig8zE/GDEzg3dlS+1qpHF3GczEX8/qAbR7EydbnjEzgyB0Aa4Gs8+gCUQvFKoXilULxS8OLJMEYud7x49DDHTJiT+FAQIx+4raD4YcxIfCiIYXvz3vhrUfww4hQ/fCqUtPmtSYofRtTihwUxWOP3IW7xA4IYDyf1flt0Y3k2jCKqQn4qlGb0Cj0OBgPbgyE/FUozbIkufxhLED84iGGeZY13LEH88KlQbppjswm6/GEsQvz+JOjyh6FCPKdCCSBT9B3gazz6AJRC8UqheKVQvFIoXil48cpAl3cBXjx62EsWFeLDNyNK15qnQtEgPnwzovRG9Rw4CxO/QwLn4e1XFB8B4gmc9QVP9TEgncC5OWt+xitjlPIeAeEEzsPb26Z49CBZB9EYmgjhBI6bGMGfBSdBG+6A4oewy82IWOOjQP5mRBQfBfieu1hBF8zE4BM4MgdAGuBrPPoAlELxSqF4pVC8UvDiVYEu7Qq8ePSwiSRKxIeCGKbP9hO1EyPoEB8MYly2Zq9Gy5BkceKHBzHWzS4ciscgHMSwI7Z6gxiLFf9sEMPeeqxW69HtbFlGKe1RAEyFMpMZMSKSNAWcCqULih/ATlOhpN/ULufQgrug+CEscCoUip+WBC24C4ofgzlOhSJTMijwNR59AEqheKVQvFIoXikUrxS8eE2gC9sDLx49YiaIFvGBBE7gbtJoG4IoEd8xFYrrry+heAzyNyOqT1dP8SDkb0ZUvxcRxYMQvxlRo8KzVQ9CPIFT/4SPcnQuJj+TIZ7AuTyrvzRBa25D8YPZYSqUZr46QWtuQ/HDGZzAaX7EUzwIfM9dhKDLRAJ8AkfmAEgDfI1HH4BSKF4pFK8UilcKXrwi0GXtgxePHjmRQ4344D1p1s2vUKF1yKFFfNdUKGqDGAsUPziIkY/feFA8BuEgxsPbb
3mqjwLxqVBW9vYkFeiWtiSjlPVIAO5JUx+gS9BjcXWikjMlwkGMpy8pPg6kgxiXrVM9WnUdit+NHYIY9RlNKR4EvucOrboOxY/LfKZCkSkPPPgajz4ApVC8UiheKRSvFIpXCl68GtAlXQcvHj1mJoUi8aEETnZB3+i5QwuRQo/4UALHdt7rTOAsUvzgBI4dlq2lcCgeg/jNiFjj40B6KpTsPFCfEgPd1pZjlJIeDekEzslFele/0yR6PK4iMjXTIpzAMTGNaBM4FL8zgxM4rPGxIJ3AuYv4ZkQULwm6yeWDLgtJ8AkcmQMgDfA1Hn0ASqF4pVC8UiheKXjxOkAXcwu8ePTgiQi6xDOIUaJKfDiIsVI5LLtU8bsFMeqDNGgnIixc/JCpUCg+DoSDGPZUX+u4RTe3hRilmMdE+p402Rrv1tGNx8fnZXLE70nTnPIsQUs3UPyeDA5i2AW1G88laOkGit+XXYIYjbAlWrqB4uVJ0NINFD8VsU+FIlMKMYGv8egDUArFK4XilULxSqF4peDFLxt08XaCF48eOJsUfeKDNyJyv9S/KU3xGCYSH7wRUXpnO3TXK7+7nuIxvFz84PRNemmXmcE5byHFYxhN/PPpm2LdegiH4jGMJ/75aVDcuo25EdDN7ol5cfFOxXin+ufTN+Eaz4E4COOJTwekbx4Cn/EUD+Hl4offiChv4q/Paq16iocwwuXc4PRN+Dqe4iHge+7Y3IIgIL47fWOgCAz4Go8+AKVQvFIoXikUrxS8+CWDLtwe8OLR4ygTokX8unnBVluQD9s2oXgMk9Z4X7zty3vTXofiMYxb44uBmbffHhysrPi1/xXJ+telHRSPYRrxJ3akJltweeavwBofD9OIz0dq17/3vT+chHpt0Q3vSRmxcMdmWvHvvqx9F74+7ZFDdnQuZhWyTCv+4qZ2qk8vV60XUTyGccV74QsnvmrYt29D5KB4DCNfx1fhi1x81ZC/CY/MUjyGEcWH6vPzUDyG8cTfNGY1yumPYYi36kf7c+cOvq8efQBKoXilULxSKF4pFK8UvPglgy7cHvDi0UNoE6JFfH8CJ12He+7QdiZEi/gWNfGXK9tf34TiMYglcOq3J6igeAxiCRz7ZuCpPhrEEjgPJys7GUYTdMN7UkYs3LERS+C0bzzmkB2d4/BcgVgC5+lLio8JsQSOadUHT/UUD0EugWNnPmy/iOIx4BM4bG9BwCdwRjsAsgv4vnr0ASiF4pVC8UqheKXgxS8ZdOH2gBePHkmZEC3ie4MYN/aiLvClSbSdCdEivkXzncAgRjxIToUS7NqjeAySU6GEZkKheBCCU6GE+/LRDe9JGbFwx0ZwKpTQJ7zw6FzMJoQRnAqlPgNWAcVjkAtitK/2LBSPQTKIQfERgQ9iUDwEfBCDzWwI+L569AEoheKVQvFKoXilULxS8OKXDLpwe8CLRw+hTYgW8f1ToZgITuA7VBSPQXQqlBAUj0EugRMeo6F4EGIJnKf3wawlxYMQnArlIljr0Q3vSRmxcMdG9GZE6HvSxCxCGlU3I6L4CtGbEaXfBC7nKB6CqpsRUXwFPoHDxhYEfAJntAMgu4Dvq0cfgFIoXikUrxSKVwpe/JJBF24PePHokZQJ0SK+P4hhrus4OhcNYkEM071jR+gaUDwGsSDGnZkOIzRIg7YzIfrEh6dCCd+oAG1nQvSJD06Fkq0QmBoB3fCelBELd2wEgxhP70NTYkiOznF4rkIuiGFuQxWA4jGIBTE6vFM8CLEgRueUphQPAR/EoHgI+CAGm9kQ8H316ANQCsUrheKVQvFKoXil4MUvGXTh9oAXjx5CmxAV4vvjN6nrwnt637rYp3gM09X4uvg7O//NetW6PwnFYxixxvfNg3J58F327NPnV963KB0Uj2EC8eH4jX327W2rR5/iMUwgPhy/MYvM+6ElfsmMVbgTMKH4+jwonTWeo3MQJhRfnwfloesznuIhjCi+905ETvj6rN2qp3gIY17H9
8yD0nMdT/EQxhK/X/wmZRADxUji94zfpBykQYHvq0cfgFIoXikUrxSKVwrFK4XilSI6FUrgljQUD0JwKpRVehO42Kd4DGJToYRG5gwUj0FsKhSKjwuxqVDsqT7Qd8u+egxyU6Fkjbt3oZsRcXQOgug9aUwQownFYxCbCsU+9ab9IorHIHdPmruD4NAtxWPgVChK4VQoSmFfvVIoXikUrxSKjwPxtgvFx8HLioHiZ8usxTOIsT+zFt+ieU8aBjE6mbX454MYoUGaEQ9gxixEfCCIwRrfy0LEhybFCHxT1jBKT+yElz1CLEy8F8Qw95y7C7TuOEhjWZh4L4gRmv/GQPGWmYvvvScNa3wPMxf/TBAjNDZL8RbvyPZokWDFM4jxAnzxraP9+MG+Dw69tvH9L6/Crx4Kgxhx0Cs+bYruefVQ2FcfB4PE3//2i9dX98dJsjK/3//u7+ZR49VDofg4GCb+eJU+/unCWs/+Oz5Nt+70T/GzZZj4/Hz/+NmVc18sofjZsoP4TZK8pvil8NzlXCn+8XxVnOopfgE8VwylePv/Ly4ofiEMFp9eJ8mnX6yiEv/cbWnMnUlaULxl1l22LRribwJ3Ead4x6zF99+WJn14+xXFd7EQ8cHb0qwveKrvZCHiQwmcm7PwZzy76g1zHp17JoHz9jYsfqKBuBAzEd+6iVVgdC6tjdtEJ95L4NzYMbqz1oso3tIrPg2OzsUkvv+2NKzx3QwT/3ievM5+bk2nbf64+eqhyCVwUorvYZj4zWl6fWTGaMyPaGp8/AmcuYs3Q7KZdfMjjedUP4cEzmh/7OgME3+e/Q2vLtL7Y/tvJOL3JGIZkgwT/1mpent4S/FLYNCwrPmMz4yb2A3FL4RBo3PmXJ+d410UI7u6j6VVvxcUb5l1l+1eULyF4pUya/HPBTGadxC3ULxl1uJb1MXfcQ6cbmYtvj+IcXnwHWt8JwsRHwxi8FTfw0LEh4IYXeLZZWtYmHj/njRd4jlIYxB/R0vek4biI0IyiEHxESEZxKD4iGAQQykMYiiFffVKoXilULxSKF4pFK8UilcKxSuF4pVC8UqheKVQvFIoXikUrxSKVwpcPMFA8UqheKVQvFIoXikUrxSKVwrFK4XilULxSqF4pVC8UiheKRSvFIpXCsUrheKVQvFKoXilULxSKF4pFK8UilcKxSuF4pVC8UqheKVQvFIoXikUrxSKVwrFK4XilULxSqF4pVC8UiheKRSvFIpXCsUrheKVQvFKoXilULxSKF4pFK8UilcKxSuF4pVC8UqheKVQvFIoXikUrxSKVwrFK4XilULxSqF4pVC8UiheKRSvFIpXCsUrheKVQvFKoXilULxSKF4pFK8UilcKxSuF4pXyf5yjU7f/CP9hAAAAAElFTkSuQmCC" /> <br />
<pre><code class="r">
#end multithreading
sfStop()
#plot the candidates for deletion
barplot(sort(table(candidates_for_deletion)),main='Unimportant Variables',ylab='Votes',xlab='Variable Name')
</code></pre>
<img alt="plot of chunk unnamed-chunk-5" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAAAllBMVEX9/v0AAAAAADkAAGUAOTkAOWUAOY8AZo8AZrU5AAA5ADk5AGU5OQA5OWU5OY85j485j9plAABlADllAGVlOQBlOY9lZjllZmVltf2POQCPOTmPOWWPZgCPj2WPtY+P27WP29qP2/21ZgC1Zjm1tWW124+1/v2+vr7ajzna24/a/rXa/tra/v39tWX924/9/rX9/tr9/v1u+zoPAAAAMnRSTlP/////////////////////////////////////////////////////////////////AA1QmO8AAAAJcEhZcwAACxIAAAsSAdLdfvwAAA21SURBVHic7dwNf9rmFYbxyl49O1lTyMsK2dYO1nqsNW/f/8tNj4QEmBfdinnK4Zzr+q1NjG/LlL8ROMP5bk0h++7aV4CuE/BBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAP2k3DL4f3z+v1apz+vXeR0nR0+tLVuEi/zouHlyOfsK79tPqntJQ3eLlpcQy+uXRWPFZvDU4fAPjrtYUv//m1KMq30kXz4q/j8vezolRs35EwiwS5Gt/9vah6LO/SRXE3aUflHb2oxNeLp82xtpPyo+5+2bskfVB5SqivRTr6qPrIzacznhv4DWUN39S+o/SZbhbVJd9/Tr9fDuv3NR+9hS+dJ6VhOt7O5OH38uivPujhpboW9dEH7f7aN01nfuAH1UNyDX//XN7xRsvh3aR6R6kxWjyVlJtL6tP4qD1G+9HbB4BZeclsf/K4Pau3H5QOmN5YPJUfnK5I9ZtbyA18xdrAP25oEnN9Hh7MK+7Sspo28OkOuvPRW/jyrT/qB/GdyeYTNpek30/T11d7kklfBQWn+uw1DhuFffh0Mu6AT4/HreE+/Hp69/PTYL0zaeD3PugAvv6A6jPY7qbhq7NtdY4+A394qm/h0/k8PY07Bj+v+baTBn57yetTfdvx7xhsddPw1VPp6jnVOfiDJ3cb+OJxXhze4+snd9X5PFluJw389pLDJ3cPL8354Ko3i9Jtw9cug/U5+M03XbvfziX4UujhpeQaVGf/Br66tD705pv4dtI+xm8/aPfbuenmyXySv4Ez/a3Dd/Ztf7oTIOCDBnzQvMPTiYAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPmgi/eM8LmHzVCV//FOBtvFac9Lrv8dXLxvfu8e0Po2a8XoEqOuszU1NO9cvhw/+OneqBv0jFfzsq+szUtMf46u8JOLzGvT4Tncgy/Ilr/IaPpTbggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAfNKvziqRhMi+L++fAa9/pMdCKj8KvxaD0t1RcfXg6uca/PRCcyCr/89LyeD+pfm2va1Pc/8XoVnV1xZhM+3eNTt32Pv+yNa3mm1v0Yvxwm+dltP8ZbprIKf+bWfMPH/slZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDPmGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDP
mGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDPmGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDPmGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDPmGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZCviMWaYCPmOWqYDPmGUq4DNmmQr4jFmmAj5jlqmAz5hlKuAzZpkK+IxZpgI+Y5apgM+YZSrgM2aZ6qrws4eXWVGMet6a/ebXzDLVNeGXHyfl/xbvn/vdmr3WV80y1VXhPz2X93ngHczUmlN9cTeZc6p3MFPjyd3VqYDPmGWqq8KvxsXD7x8nPW/NfvNrZpnqmvCr8WDx4WX+8NLv1uy1vmqWqa78rL6EL//d79bstb5qlqmuf4+fcY+//Zna9jG+KI67L57Kb/Wqk8LBrdnrM101y1RGn9WvxqN0QgD+RmZq7WP8+uidurlw+rjzzqLp1Y3b1WVnRa+ZYarrwS+Hm9vo2Lk+3ePLZn85/PPc1/DaFbvQ7AYMzMOfuK9vWg4Hlfw98Lcw6wf/bQFvcKbW/az+dMAbnKm138evq1dj9PpY4A3O1Dqf1Z8JeIMzNe7xzmZqPMY7m6nxrN7ZTA14ZzO1+k/u0v8L0z/gDc7Umid3xWOvD6uvyqs3tSsGfNaZWnuqnxeHfybbEfAGZ2o7j/HLYU964A3O1Fr43uzAm5yptY/xvdmBNzlT41m9s5ka38c7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ymBryzmRrwzmZqwDubqQHvbKYGvLOZGvDOZmrAO5upAe9spga8s5ka8M5masA7m6kB72ym1gm/HBZ198+tcBPw9mYXg1+vxoMT7wHe4ExNONUvP06OvwN4gzM1HuOdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTA14ZzM14J3N1IB3NlMD3tlMDXhnMzXgnc3UgHc2UwPe2UwNeGczNeCdzdSAdzZTA97ZTK0bfvFUpO6fD94DvMGZWif8ajyqfp0/vLx+F/AGZ2qd8MtPz3u/pqvQ9Aq+q8vOCmbHZheDP3OPpxuu+zF+Oay+nI48xtMN95Zn9XTDAR804IMGfNCADxrwQQM+aMAHDfigAR804IMGfNCADxrwQQM+aMAHDfigAR804IN2KfjV
18m5C+b9XrTXcbTVuLh7NXjD0dbrxfvLXbf0Ywgj/WjdV655teuFy3aP37v+6ZadPV7qaOvp6G0v+n11287f9krSvaMtP07Wi3c9virPH65s1uvrSO6C9/ikW/6z+PDP9DVfXrAa7+r0uVudP9rOS/wvcd2mdz/3vMefOdo8fXlPe1B13XCLHz7fCvzTIN0fywumg91Fn3v8+aOlW6jnqf78det7qu/4L033+ksdbvX137ZP9TvXv/5l9fVve7ft0zdSHTna4mlU4uun+q7r9s3wx4+2Gg9OfWz/w80Gxh/jD6//+McvezZ97gfnj5bQL3e09Zvh94+2HPZx7/5PvTn4r5PZ/i3Q45Hv/NGWX94K/+q6vfUev3u0dDrq0/nDzaqfYur1lSR2QfhkMbvfuf7b56fpOXhPqjNHS19CPU/1547WH/7M0fq6d185+/f49OX546ed679zg5bv6vud95mjLYe9/1TgzNG+4fv400er76L9ntWfv3LG4Xs9k+Vo2Q4ndyH4WXH0D1Q2f5tGnzt7rKNd+nB6/Fl90IAPGvBBAz5owAcN+KABHzTggwZ80IAPGvBBAz5owAcN+KABHzTggwZ80IAPmnv4+rWKey/K3XkpY/2S770XW9avk+31+stbzD38unqR+uzES9OPwqdX8AJ/86U7e/WS5ernlxc/fL7/Lb2Ouf5p5ulP1YXlBe1Lthfvf32s4evN6uu/imIwr36soefrui3nH75C//BS/fxy+snEynnz1nr68LL5GZbpYPNjneUb093NavxYfgk8rvdGt59/+HSa35zpl+1PLWzeqk7109FGuf7x6/TGl5ftJn3hpH/KxXZ0+wWAX3z4o/qRpGn6Wzla+Oqt9X8mDfyweRl7eu9ssN3swg/zvtb9zywA/Oofv6Ufrx2Odn4ycfNWuseXT/vTBe0dOV28+vpzu9mFd3JvTwWAX89+Gmyerb3b+8nUd5P19HG9fYyv/3qV6r6entpvNjvw29HtFwG+/jtpZkXx/edRc6qv30rP6u8mzbP6+iReP7rP7pvNLnw7uv0iwNORgA8a8EEDPmjABw34oAEfNOCDBnzQgA8a8EEDPmjABw34oAEfNOCDBnzQgA/a/wEHp+7yS4pqMAAAAABJRU5ErkJggg==" /> <br />
<pre><code class="r">
#plot average ranking
d <- sort(rowMeans(rankings),decreasing = TRUE)
x <- barplot(
d
,main='Average Variable Ranking'
,ylab='Variable'
,xlab='Average Rank'
,horiz=TRUE
,xlim=c(-3,max(d))
,axes = TRUE
,axisnames = FALSE
)
text(-1.5,x,names(d),col='blue')
text(0.5,x+0.05,formatC(d,digits = 1,format='f'),col='blue',cex=0.8)
</code></pre>
<img alt="plot of chunk unnamed-chunk-5" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfgAAAH4CAMAAACR9g9NAAABO1BMVEX9/v0AAAAAADkAAGUAAGYAALYAAOoAAPQAAP4AAP8AK98AOTkAOWUAOY8AOf4ATNQAZo8AZrUAZv4rAOorAP8rK98rTNQra8k5AAA5ADk5AGU5AP45AP85OQA5OWU5OY85Of45Of85j485j9o5j/05j/5MAOpMAPRMK99MK/RMTOpMh75lAABlADllAGVlAP5lAP9lOQBlOTllOY9lOf5lOf9lZjllZmVltf1rK+prK/RrK/9rTN9ra9Rro76HTPSHTP+Hvr6POQCPOTmPOWWPOf6POf+PZgCPZv+Pj2WPtY+Ptf6P27WP2/2ja/Sjh+qjvr61ZgC1Zv+1tWW124+1/rW1/v2+h+q+o9++vr6+vsm+vtTajznaj//a/rXa/tra/v39tWX9tf7924/92/79/rX9/tr9/v39/v5qSD4nAAAAaXRSTlP/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////AP8eEXTlAAAACXBIWXMAAAsSAAALEgHS3X78AAAWfElEQVR4nO2djX/bSJ2HUbIsK45uSJdeehCuvUv3DrbZbjkWMPResulhYBeHJgdxnCZL3pX//y+4eZM0IymyZGve/Ps+nzZxZDmO/XhGo5mvZr6TAZJ8x/cfAPwA8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE8UiCcKxBMF4okSn/i7/WRziceO2LfL5NGFsf12d/243Efd1jeqDQmnstXYt/aYYIlP/PXTxre+G2fiQ3OY7Dy8y1zx1U9N477BE5/4s+Q3vNgeKoWbfEvCt9ztr/06WZtcsp/WJqJ0r/9tl0tS93OunzJBwpLaTT7oz8aW9f8TcqVM7cFyw21tX/UreXUiv6uN2p8QHtGJZ2/rX59uqur6dpe9+4eiGO7wt5kJ+3ZXFkrx0z88ZTvl98tHswdcs8ff6ruxB60fG1v4DWFYf3ApvrLvpth4xj4g6gMgN5Z/QoBEJ55ZE5WxUMjtX/N3lm8qD/787Rdl+0y/X9x1xiSeqSKsLG2ah2i2ZUd8ouRv0R6cV/WbmbHvpfyQ/PXpTmVj8Sc4f5M6EJ14bk2Y4woP2Y1LaWNtIj4KmfSzfnwphT66KO7P5J2P/i49yt3kg/KyLLfw24ey2jYerMSPMvPR/EnYB+Wn/A4pXm0s/gRf71UbsYkv6lH+jn6bH2418aztl7SIzw7XvuZFM9+tFF9ueVi8qOSNR4vjOxefaMd6tRHiB0S+5aLpdLj2JylBvbPSIa8IeB1rVvUll1JjvlspvtxSrepzRL3A/oCR9uhCvPwFhnhU9QNyJhpavIrnCkRJPEzylhb/8TLJC6PZuMvfflY2+c1yt1y88cBK404+OG/m6/tq4vP/xUY07oZD9cCcledymTQv2mD552BHNrDzc6lD49xbncSr3bRjfLHFOJ3THiw3iOZguW8pnn2Ydgzx2p8QHpGJ70mti47kn9DIyopXTfCWLjoKf8LDrKx4eRRfuFd/Vf6EB1ld8aAViCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxRP4pMm/PwpRPElflYH4l0C8USBeKLYEj9OX8gbNz/RZvAvfoB431gSX+rWxV+lH2ni3//bV0L4Xvq9/4Z41ywvnpvl/69SofX+NfvGvqTfyO2a
+COxTcDET9MPhfjp99+xfxDvmMHE339xnJ0/ZnZf8G/KuCneqOpP/+v0t0L8wZYq+xDvkuHEvxZzuvNv7DMwX/xslot/qW5BvEuGq+pvttN0wmt6xqSPeJR4HwwnnnH18QWv8eXGruJxjPfC8uJ55X7+0TGTzsXzYzz7pup+tr1d/PufvUOr3gsDnM6dp+mnrJwfFa16VuNzw3I7zuPDBD13RHEgnrf6RItP2wbxvsGwLFEQxCAKxBMlpKoe1b9DAmrVNwHxtoB4orgMYvDzOrmAN8R7x2EQg/fh3mzLs/lkdvomTUUXfdFlC/EucRjEuHrMvhzJIp/MphtM+cuZNkgD8S5xG8TI1Ki9quoPuPhiWBbiXeI2iJGN1ZFfiOeFXgtiQLxLnAYx7l8r70L83oYs9ijxPnAZxLjZHuU3eePupXSLY7wXHAYxNO9M/B6vGbb0IAbEu8RhEONcDM7mrfpuQLwt0HNHlICCGBDvEozOEQXj8USBeKKEXtWj+rdE4K16NPhsAfFEcRnEUAO3HIj3jcMghujNeyxvJzx/kcre+rYuW4i3hcsZMbLyYyCCGHJQrnWQBuJt4TiIoZX4fBKc1mFZiLeF0yDGzXbebyvG49Mt7rE1iAHxtnA7I0Y1erWFEu8LtzNi6GHLLSUex3gvOAxi8Hu1Es9a9RuzeUEMiLeFyxkxzsuxWZzH+wY9d0QJPIgB8bbA6BxRMB5PFIgnSoxVPar/AYiwVY8G3xBAPFEgnihu16TJxj2nQoF4WzhdkyY7N+bAEWM0/RM4ED8EThM4N89/pYmfyvH43qNzED8EThM444lW1Z88+YsMYvQdj4f4IXCZwDl/oR3jmezpYgkciB8Chwmcm+cXmvgDMTECSrwvHCZw5MQIsrEvJz/CMd4fbtekMU/nmPhFEjgQPwRu16TBeXwwoOeOKBEmcCB+CDAsSxQEMYgC8URZlaoeB4KerEirHk2/vkA8UVwGMXif7Uf5xAgQ7xeXQYwjffZqftGk7KotbkC8Q1wGMcZlF04yO/2dkl3cgHiXOAxiiBHbcjz+5z9KxRIVxQ2Id4nDIIZYemycXx8//fAruRhRcQPiXeJ4KpRyRgyOHJDXb0C8K/xOhfJSSUeJd47jqVCy3xenc3s8eSWDGKndAg/xDUQ4FQrEDwF67oiyIkEMiO8LRueIgvF4okA8UWhU9aj0a5Bo1aOZVwfiieJ0Royxpw4ciK/jOIghum0ziPePwyCGGr8RMPEHqRqGH+KiSYjvi8Mgxs3zP2pV/ZRZ3xOjc0NcJg3xfXEaxBiJyRE4yezgM1XiB5kYAeL74nhGjHKFir0NNQw/yFQoEN8Xh0GM+y918Vz6dLCpUCC+Ly5nxDjSq/ppXuJxjPeCyyCGaPDLm7JVP9xUKBDfF/TcEYVEEAPi62B0jigYjycKxBOFWlVf4ueFBwOxVj2aeTkQTxSIJ4rDBI6cvVq/Whbi/eF2TRo9gaMFMRz22UK8wumaNMXgnB7EcDAVCsTXcbomjdhBoAUxHEyFAvF1XK5JUxZ4PYjhYCoUiK/jdCqU/AhvBDE4lqdCgfg6TqdCOXqR39KDGPanQoH4Ok7XpCknujODGLanQoH4Om4TOLp4z0C8FyDeN8QSOBCfg2FZoiCIQRSIJwrdqr4dP2+LQ8i26tuBeEtAvG9cToXCzuu0S6iCBuIXpCmIIfpwyyBGccmcy2vnugLx8+kcxBDXx3+RT1teXCTr9GrZrkD8fHqsSaOX+JMn76buZ8ToCsTPp3sQg9UD+YA8q+qnqSznTmfE6ArEz6f7VCjbk+yqWHDwYGN28gn3jRLvBYdBDD5ar02FkvvGMd4LDoMYZok/fZOmL53PiNEViO9A5yDGla81afoD8ZaAeN+QDWK0A/GW8D34Nhc/b4tDMB5PFIgnCqr67vh5qyyBVn1nIH4AIN43CGJ0BuI70BzEGOlBjOiAeJN+QYxikIb31ctrJUPsq28C4k16rEljjM55WZNm
GSDepHMQQ1T1qTYsO9tzvkLFMkC8SfcZMdgen471qv4zIT7EBE4TEG/SY0aMTAtbcg5Q4r3hMIghNqhpr8RUKO//nTfqcIz3ghJ/t588+vbzSfu+D9AniKGFLfdCTuA0sYri7/Z3rp9dXD66mLP3YOA83jdS/O2rYyaefbXxFDEGMZpYRfGyxJ+hxLexiuL5MT5J3HnHsKx3EMQgCoIYy+Hn7RsALv52V72KdSuNuyZiPMY3Ebd4D0C8byB+KaIXL1r1mwP9zqb0jfohv1Ia4n1TnMezrwOdxzevR3MlOnTHI9ldn4hxmaBnQ+lE7OJln13/nrvu69EciW18cE5s/G764VeBz4bSidjFZ4eixPeu6/usRyO25SGcD/gQfDEgG83IbI24xS9+OtdjPRpZM+RzIyRCfB7BiCaLUSNu8YvTYz0as8QnKPGeGUZ81iF9c6Mf44V4HOM9osSfLVbV91iPRgofv1CtelG1i8Z8TFmMGrGLv/18crm5QOOue/oG5/GBUZzOyX+unhbifaM6cN5O2L/rHw8uvil9w4F436hjPHN+mSQ7zp7WwwCqHZy9Y0ODIAZRIJ4ooufu1d9k353DIAbQcfW2a2A8PgD8iXd4IieBeB1/4tm53JK/Z1z9BcYGNWxbAvE6Hkv88Md4Xbzoy3us35sHMVK5cDy7EWmf7SDEfIwfT/KBmed/TNORED/+WEv0mJdLyyDGhhqUO/0dYeuzVRG/LUZq2IajF/oOZon/QI2/S/Hvf/6jdMPzm+8Tj+IXHJ0rKcWrkdrxv+jeb7bNXls5OseqeDH/0fRDkcogi8dj/KKjcwV18Z9+aUQ382mPJEmeuDnYUi9+uvXwG7PqeD2dW3J0ri5+cm5U9dnRSPtBBjG2lHhxAyXeKUONzo0nWvhCii8b9voyRJI8iMEO7TKIkRIu8D6P8UuPzjHJZfhCiS8b8ufVkVmcx+tE3Ko3y/N8IF4nXvHn6cdNF+E8FMOAeBNf4m9315btse2Lx5GwEHH87nPya+cGu2ISxEFR1V86HI0H/tGO8be7CGJ4x5WAUrxL7WjcPYRr8ewY77aih/hm3IofolXfGsQ4Fyd1Rpet73c4UDxU9UNT/SRcGWf6EN9MfOLnBTEqXXuJuiz6IK5laawTs/jmIIaZw5AJnNlsqsbjo71MemhiFt8YxKj25csEzsmTvwjx8U6MMDSrIN4IYlxV+vLFsCyTLeMX8U6FMjSrIN4IYpj5u3wqFN7Wj2pZGuvEKL4liFFv4+fRqymO8QYxim8LYtSG6zXxUU+FMjTRiUcQYxhiE48gxkDEJr43EN/M6osHzTgzgBkxiIISHyqWDeAYHygQTxSIJ0os4tunQsnGuISqJ7GIr2GIPxpVEzjVIAb6bCvEIr41gaMmstcwgxjUp0JpIj7xDQkc8WEwqnoziEF9KpQm4hPfkMC52R6JVUlKzCAG9alQmohXvJbAKZaiKTCDGBzKU6E0Ea94LYFz/2XzjBhFEIP4VChNxCO+LYFz1FjVz7QgBumpUJqIR3x7AgdTmvYkEvFI4AxNHOKRwBmcOMT3xveYZwRYNoAgBlEgniio6kPFsgGMxwcKxBMlFvFzghiVBSogfi6xiK9hir+q9twhiDGHWMS3T4VylH5jlngEMeYRn/jmqVAqVT2CGPOIT3zjVCi1YzyCGHOIV7y5Jk2TeAQxWohXvLkmTZP4GYIYDxOP+NapUFrEI4jRSDzi24IYOI/vTSTiEcQYmjjEI4gxOHGI743voa8IsGwA4/FEgXiioKqPhKENYDw+DiCeKMGKbw9i8PM6fUkaiO9LsOJr6OJ5987NtnHRpO83MjaCFd8axLji65IcGatQiQROHrw5fZOmmLa8lQjENwcxKp26IoFTBG+mG+xDgNG5NiIQ3xzEyMbGjyKBYwRvMCzbSkTijSDG/WvzYyCGZfXgzRTRq1YiEq8HMfgkOAb5eHwevNmD93YCFt8SxKh5VyU+D96cvkE9P4eA
xbcEMRrWlhUlXgRv3v/s3Z4evgNNhCoeQQzLBCoeQQzbBCq+NxDfk5URD3oytAEEMYiCEh8dwxjAMT42IJ4oEE+UwMS3J3B4150xJQbEL0xg4mtU16QxMYMYWEa8B4GJb58KpVYfVIIY3383RQKnI8GKb0jg3L+ujNFUghgnT95hPL4rwYpvXJNmUin1lSDGFJG7zgQv3pwKpRq2NIIYBxuzk0++8vt+RkPw4s2pUJrEF0GMg62ZmvcOzCU48S0JHLHK5O+N0zkjiMHj1cjgdCQ48W1ToZxXB+VxHr8wYYlHAscZQYlHAscdQYnvje+hzZgZxgCCGESBeKKgqo+TpQ1gPD5KIJ4o4YifOxUKFhUeknDE1zCnQhll58aZvgxiHKTq+ngEMXoSjvjWIAZfO75hRoxiIgwEMfoSoviGIEZdvAhiHHymSjxG5/oSoviGIIao6tNaEIPPhiCHZV/O8vF50ImQxetBDNa4+7SewOHSZRADJb4nIYuvBDHuv6iNx0/zEo9jfF9CEt8SxBB3PdZ3l0GMAxXEQKu+LyGJbwtiXKWVcVucxy9HMOIRxHBLKOIRxHBMKOJ743twK3qWNoDxeKJAPFFQ1cfOggYwHh85EE8U3+Lbgxj8Z6xJYwXf4mtUxJ9jMSI7+BbfPiNGdvP8V6b4agIHnfULEo74xjVpxhOzqjcTOMWcKKA34YhvWpPm/EXlGG8mcIzFaUAvwhOvBTF49KrSuDMSOPriNKAf4YnXghhyhQo9l2EmcDhTrFCxEP7FtwQxxA71El8mcLaw/Nii+BffFsTIGsXrCRwsSbMgnsUjiOELv+IRxPCG76q+JxA/FLGJB0OxoAEEMYgC8URBVb9idDWAxt1qAfFEcSt+XvxGdOHdv9ZO9iHeEp5LvCn+SkyAMx5pF06KIMZBvno4n70aV8sOgusS3xa/OUq/Yffyy6TLznsRxGCc/if3XSQywLL4Et8YvxHCzelQPlAzYPxPEb3B6Nwg+BLfFL+R4vnnoRSvliYplyDCYkTD4Fu8uSBNvcQr8UWB34P3YfAt3pwH5aZ2jJfi+VA85/QN6vmBcC2+PX4jhI9f6K16If7kH2fC/l7evAfL4vw8vjV+g/N4ZzgV3zd+k0G8NVyK7x2/ySDeGsH31QNLdDWA8XiiQDxRUNWvLO0GMB6/qkA8URyJx5o0oeGrxJtr0kya1qQpghiYvdoCzkr8vDVpjPnqK0EMzFc/PM7FN2Qx6iXeDGKcPHmH8fihcS6+KYthDNBwKkGMKSJ3g+NNvD4VyvYku/qotjRJEcQ42JidfII1aYbFm3gti2HGrkrxeRADixFZwJn4lizGQyU+D2LweDUyOAPj7jy+fU0ac3gW5/HWcSMeU6EEhxPxmAolPALtqwfWaTeA8XiiQDxRIJ4oEE8UiCcKxBMF4okC8USBeKJAPFEgnigQTxSIJwpG58LHigFf4vEcnp8D4ok+B8QTfQ6IJ/ocEE/0OSCe6HNAPNHnQAcOUSCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oPsTf7iaPmuZRGJKzJEnWj+fvtwTXPz62/lrEc1h5LR7E3+2PsrNNy09yOLL8BNkll2H5tYjnsPNaPIi/fXUsP8kWuXvbe32knhyufc1eg93XIp/DzmvxIP762UV2+7ldMawGThLLhZ4Lt/1a+HPYeS0exF8+si/++p8m1ks9l2L7tYgPl5XXsqIlXmD5OO+qxAsGfy0reowXOBBv+7WslPi7/R3rrXpeBd/9r/3TOduvJT+cDP9aVvg8fs3y0cTlefzwrwU9d0SBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnCsQTBeKJAvFEgXiiQDxRIJ4oRMT3yr5fP02aLlO0f1mWS4iIv/zNTvedZZj9rJqehfj4uHv7519e8KtemLzbXV6Yr//5F+vHvGiPRET6B/8xyeQdWS5eXCgjdrh+9gf+jT32bt96LtwVNMQz
54cjXmL5DXEJxPXTkaj/uV625XJtkh3m10YUJV7tcP10h1/VwB5/2KPeCBwa4s+YuU3+9WyHy7x9dZxfm8Ru8sugeE0g78jyY7wq22pf9v/u7U9XxzsN8Xf7orF2/ezvoqbnF6ZI8Ydyc34IUFesiOuW5LUrcgclfv9ff7kqFT0R8Vxtxuv6r59dyEKtLjwf8e95iX+VN+PldUvrxQ5FiZ+crU6RJyH+jF9qyuv6hIk7FAdsJVNcfF4e4y9F/S7E3+1v5juU4leoYU9BvLzUlB3DuWXeeF+bFFcj/uAXI77lh28n6o6saNWzJp3coRTv5vJuJ1AQ34HVEdoViBdNP9uXVIcHxBMF4okC8USBeKJAPFEgnigQTxSIJwrEEwXiiQLxRIF4okA8USCeKBBPFIgnyv8DGiGDw0+cF/UAAAAASUVORK5CYII=" /> </div>
Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com29tag:blogger.com,1999:blog-8236580214478682937.post-13058448339963388952012-04-08T00:48:00.001-07:002012-04-08T00:50:42.701-07:00Pretty R<div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-family: Arial, Helvetica, sans-serif;">I just stumbled across a tool for syntax highlighting R code on web pages. It also sticks in hyperlinks to the help pages for keywords. This will make blog posting code easier!</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<a href="http://www.inside-r.org/pretty-r/tool"><span style="font-family: Arial, Helvetica, sans-serif;">http://www.inside-r.org/pretty-r/tool</span></a><br />
<br />
<br />
<br />
<div style="overflow: auto;"><div class="geshifilter"><pre class="r geshifilter-R" style="font-family: monospace;">totalrecords <- <a href="http://inside-r.org/r-doc/base/nrow"><span style="color: #003399; font-weight: bold;">nrow</span></a><span style="color: #009900;">(</span>mydata<span style="color: #009900;">)</span>
trainfraction = <span style="color: #cc66cc;">0.7</span>
trainrecords = <a href="http://inside-r.org/r-doc/base/as.integer"><span style="color: #003399; font-weight: bold;">as.integer</span></a><span style="color: #009900;">(</span>totalrecords * trainfraction<span style="color: #009900;">)</span>
allrows <- <span style="color: #cc66cc;">1</span>:totalrecords
trainrows <- <a href="http://inside-r.org/r-doc/base/sample"><span style="color: #003399; font-weight: bold;">sample</span></a><span style="color: #009900;">(</span>totalrecords<span style="color: #339933;">,</span>trainrecords<span style="color: #009900;">)</span>
testrows <- allrows<span style="color: #009900;">[</span>-trainrows<span style="color: #009900;">]</span>
<span style="color: #666666; font-style: italic;">#check</span>
<a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>trainrows<span style="color: #009900;">)</span>
<a href="http://inside-r.org/r-doc/base/length"><span style="color: #003399; font-weight: bold;">length</span></a><span style="color: #009900;">(</span>testrows<span style="color: #009900;">)</span>
<span style="color: #666666; font-style: italic;">#then build model, something like...</span>
model <- <a href="http://inside-r.org/r-doc/stats/lm"><span style="color: #003399; font-weight: bold;">lm</span></a><span style="color: #009900;">(</span>theFormula<span style="color: #339933;">,</span> <a href="http://inside-r.org/r-doc/utils/data"><span style="color: #003399; font-weight: bold;">data</span></a>=mydata<span style="color: #009900;">[</span>trainrows<span style="color: #339933;">,</span><span style="color: #009900;">]</span><span style="color: #009900;">)</span></pre></div></div><a href="http://www.inside-r.org/pretty-r" title="Created by Pretty R at inside-R.org">Created by Pretty R at inside-R.org</a><br />
<br />
<br />
<br />
<br />
<br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com2tag:blogger.com,1999:blog-8236580214478682937.post-78325539192736122922012-01-01T19:04:00.000-08:002012-01-02T04:40:03.441-08:00Useful R Snippets<div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-family: Arial, Helvetica, sans-serif;">Every time I need to do something in R it nearly always means I have to do a Google search or trawl my previous code to see how I did it before. Here I am going to post some snippets of code - mainly for my own use so that I know where to find them. Much of this code will be 'borrowed' and probably not the most efficient (I like to write code the long way so I can follow what is going on) - but it seems to work. If anyone finds it doesn't work or there is a more efficient way then please let me know.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<b style="font-family: Arial, Helvetica, sans-serif;">1. Randomly sampling data into a train and test set</b><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">totalrecords <- nrow(mydata)</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">trainfraction = 0.7</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">trainrecords = as.integer(totalrecords * trainfraction)</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">allrows <- 1:</span><span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">totalrecords</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">trainrows <- sample(totalrecords,trainrecords)</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">testrows <- allrows[-trainrows]</span><br />
<span style="background-color: white; color: #38761d; font-family: Arial, Helvetica, sans-serif; font-size: x-small; white-space: pre;">#check</span><br />
<span class="Apple-tab-span" style="background-color: white; white-space: pre;"><span style="color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">length(trainrows) </span></span><br />
<span class="Apple-tab-span" style="background-color: white; white-space: pre;"><span style="color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">length(testrows)</span></span><br />
<span style="background-color: white; color: #38761d; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">#then build model, something like...</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;">model <- lm(theFormula, data=mydata[trainrows,])</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Helvetica, sans-serif; font-size: x-small;"><br />
</span><br />
<b style="font-family: Arial, Helvetica, sans-serif;">1a. Randomly sampling data into a train and test set</b><br />
<span style="font-family: Arial, Helvetica, sans-serif;">Thanks to Isamoor</span><br />
<span style="background-color: white; color: #cc0000; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;">trainrows <- runif(nrow(mydata)) < 0.7</span><br />
<span style="color: #cc0000;"><span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;">testrows <- !trainrows</span></span><br />
<span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"><br />
</span><br />
<span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 12px; line-height: 14px; text-align: -webkit-auto;"><br />
</span><br />
<br />
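<span style="font-family: Arial, Helvetica, sans-serif;">A minimal sketch of making either split repeatable, assuming iris as a stand-in for mydata and an arbitrary seed (both are illustrative choices, not part of the snippets above). Note that the runif approach only gives approximately the requested fraction, whereas sample gives it exactly.</span><br />
<pre><code class="r">
mydata <- iris          # stand-in data; replace with your own
set.seed(2012)          # arbitrary seed so the split can be reproduced
trainfraction <- 0.7

# exact split: sample row numbers without replacement
trainrows <- sample(nrow(mydata), as.integer(nrow(mydata) * trainfraction))
testrows <- setdiff(seq_len(nrow(mydata)), trainrows)

# approximate split: one uniform draw per row gives a logical mask
trainmask <- runif(nrow(mydata)) < trainfraction
testmask <- !trainmask

#check - no overlap, and every row is used exactly once
length(intersect(trainrows, testrows))
length(trainrows) + length(testrows) == nrow(mydata)
sum(trainmask) + sum(testmask) == nrow(mydata)
</code></pre>
<br />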
<span style="background-color: white; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; line-height: 14px; text-align: -webkit-auto;"><b>2. Repeated n fold cross validation</b></span><br />
<div style="text-align: -webkit-auto;"><span style="font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif;"><br />
<span style="line-height: 14px;">This generates a set of cross-validated predictions, which is useful for estimating the expected error or for fitting ensemble weightings.</span><br />
</span><br />
</div><div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><br />
<br />
<br />
<pre style="background: #ffffff; color: black;"><span style="color: #3f7f59;">######################################</span>
<span style="color: #3f7f59;"># the error function</span>
calc_RMSE <- function(act,pred){
aact <- as.matrix(act)
ppred <- as.matrix(pred)
<span style="color: #7f0055; font-weight: bold;">if</span>(nrow(aact) == nrow(ppred)){
<span style="color: #7f0055; font-weight: bold;">return</span> (<span style="color: #7f0055; font-weight: bold;">sqrt</span>(sum(((ppred) - (aact)) ^ 2) / nrow(aact)))
} <span style="color: #7f0055; font-weight: bold;">else</span> {
<span style="color: #7f0055; font-weight: bold;">return</span> (-99)
}
}
<span style="color: #3f7f59;">#####################################</span>
<span style="color: #3f7f59;">###########################</span>
<span style="color: #3f7f59;">#Load and prepare data</span>
<span style="color: #3f7f59;">###########################</span>
databuild <- iris
datascore <- iris <span style="color: #3f7f59;">#put real score set here</span>
<span style="color: #3f7f59;">#target - what we are predicting</span>
theTarget <- <span style="color: #2a00ff;">'Sepal.Length'</span>
<span style="color: #3f7f59;">#set the formula</span>
theFormula <- as.formula(paste(theTarget,<span style="color: #2a00ff;">" ~ . "</span>))
<span style="color: #3f7f59;">#find the position of the target</span>
targindex <- which(names(databuild)==theTarget)
<span style="color: #3f7f59;">#actuals</span>
build_actuals <- databuild[,targindex]
<span style="color: #3f7f59;">#######################################</span>
<span style="color: #3f7f59;">#vectors to store the model outputs</span>
<span style="color: #3f7f59;">#######################################</span>
buildcases <- nrow(databuild)
scorecases <- nrow(datascore)
pred_train <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=buildcases)
pred_test <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=buildcases)
pred_score <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=scorecases)
pred_trainLoop <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=buildcases)
pred_testLoop <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=buildcases)
pred_scoreLoop <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=scorecases)
<span style="color: #3f7f59;">#settings</span>
numloops <- 300
numfolds <- 10
test_errors <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=numloops)
train_errors <- vector(<span style="color: #7f0055; font-weight: bold;">length</span>=numloops)
pred_testLoop <- 0
pred_trainLoop <- 0
pred_scoreLoop <- 0
modtype = <span style="color: #2a00ff;">'linear regression'</span>
<span style="color: #3f7f59;">#####################################</span>
<span style="color: #3f7f59;"># now the work</span>
<span style="color: #3f7f59;">##################################### </span>
<span style="color: #7f0055; font-weight: bold;">for</span>(loop in 1:numloops){
<span style="color: #3f7f59;"># generate the indices for each fold </span>
id <- sample(rep(seq_len(numfolds), <span style="color: #7f0055; font-weight: bold;">length</span>.out=buildcases))
<span style="color: #3f7f59;"># lapply over them:</span>
indicies <- lapply(seq_len(numfolds), function(a) list(
test = which(id==a),
train = which(id!=a)
))
<span style="color: #3f7f59;">#reset the predictions for this loop</span>
pred_train <- 0
pred_test <- 0
pred_score <- 0
<span style="color: #7f0055; font-weight: bold;">for</span>(fold in 1:numfolds){
<span style="color: #3f7f59;">#set the cases for this fold</span>
rows_train <- indicies[[fold]]$train
rows_test <- indicies[[fold]]$test
<span style="color: #3f7f59;">#build the models - use any model</span>
model <- lm(theFormula, data=databuild[rows_train,])
<span style="color: #3f7f59;">#score up the model</span>
buildPred <- predict(model, databuild, type=<span style="color: #2a00ff;">"response"</span>)
scorepred <- predict(model, datascore, type=<span style="color: #2a00ff;">"response"</span>)
<span style="color: #3f7f59;">#now score the cv and scoring predictions</span>
z <- buildPred
z[rows_test] <- 0
pred_train <- pred_train + z
pred_test[rows_test] <- buildPred[rows_test]
pred_score <- pred_score + scorepred
} <span style="color: #3f7f59;">#next fold</span>
<span style="color: #3f7f59;">#average the predictions on the train set</span>
pred_train <- pred_train / (numfolds - 1)
pred_score <- pred_score / numfolds
<span style="color: #3f7f59;">#add to previous loop results</span>
pred_trainLoop <- pred_trainLoop + pred_train
pred_testLoop <- pred_testLoop + pred_test
pred_scoreLoop <- pred_scoreLoop + pred_score
<span style="color: #3f7f59;">#calculate the errors </span>
train_errors[loop] <- calc_RMSE(build_actuals,pred_trainLoop / loop)
test_errors[loop] <- calc_RMSE(build_actuals,pred_testLoop / loop)
<span style="color: #3f7f59;">#report</span>
cat(<span style="color: #2a00ff;">"</span><span style="color: #2a00ff;">\n</span><span style="color: #2a00ff;">loop = "</span>,loop,<span style="color: #2a00ff;">"train error = "</span>,train_errors[loop],<span style="color: #2a00ff;">"cv error = "</span>,test_errors[loop])
<span style="color: #3f7f59;">#plot a chart as we go</span>
<span style="color: #7f0055; font-weight: bold;">if</span>(loop>1){
plot(test_errors[1:loop],col=<span style="color: #2a00ff;">'blue'</span>,type=<span style="color: #2a00ff;">'l'</span>,main = paste(modtype,numloops,<span style="color: #2a00ff;">'by'</span>,numfolds,<span style="color: #2a00ff;">'-fold cross validation'</span>), xlab = <span style="color: #2a00ff;">'Repetitions'</span>, ylab = <span style="color: #2a00ff;">'RMSE'</span>,ylim = range(rbind(test_errors[1:loop],train_errors[1:loop])))
abline(h=test_errors[loop],col=<span style="color: #2a00ff;">'blue'</span>)
points(train_errors[1:loop],type=<span style="color: #2a00ff;">'l'</span>,col=<span style="color: #2a00ff;">'red'</span>)
abline(h=train_errors[loop],col=<span style="color: #2a00ff;">'red'</span>)
legend(<span style="color: #2a00ff;">'top'</span>,c(<span style="color: #2a00ff;">'test'</span>,<span style="color: #2a00ff;">'train'</span>),col=c(<span style="color: #2a00ff;">'blue'</span>,<span style="color: #2a00ff;">'red'</span>),lty=1)
}
} <span style="color: #3f7f59;">#loop</span>
<span style="color: #3f7f59;">########################</span>
<span style="color: #3f7f59;">#the cross validation predictions and scoring set predictions</span>
<span style="color: #3f7f59;">#this is what we are after</span>
cvPredictions <- pred_testLoop / numloops
scPredictions <- pred_scoreLoop / numloops
<span style="color: #3f7f59;">#plot should show decreasing test error with increasing train error</span>
plot(train_errors,test_errors,type=<span style="color: #2a00ff;">'p'</span>)
</pre><br />
</div><br />
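<span style="font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif;">As a rough illustration of the ensemble weighting use, here is a minimal sketch that collects out-of-fold predictions from two candidate models and regresses the actuals on them. The two model formulas, the seed and the names cvA, cvB and ensembleWeights are hypothetical choices for this example; a single pass of 10 folds is used to keep it short.</span><br />
<pre><code class="r">
set.seed(42)                      # arbitrary seed
dat <- iris
numfolds <- 10
id <- sample(rep(seq_len(numfolds), length.out = nrow(dat)))

# out-of-fold predictions for two hypothetical candidate models
cvA <- numeric(nrow(dat))
cvB <- numeric(nrow(dat))
for (fold in seq_len(numfolds)) {
  test <- which(id == fold)
  mA <- lm(Sepal.Length ~ ., data = dat[-test, ])
  mB <- lm(Sepal.Length ~ Petal.Length, data = dat[-test, ])
  cvA[test] <- predict(mA, dat[test, ])
  cvB[test] <- predict(mB, dat[test, ])
}

# regress the actuals on the cross-validation predictions;
# the coefficients are one simple choice of ensemble weights
blend <- lm(dat$Sepal.Length ~ cvA + cvB)
ensembleWeights <- coef(blend)
print(ensembleWeights)
</code></pre>
<span style="font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif;">Because every prediction in cvA and cvB comes from a model that never saw that row, the weights are not fitted to in-sample predictions.</span><br />
<br />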
<br />
<br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com3tag:blogger.com,1999:blog-8236580214478682937.post-88731154938421704142011-12-26T13:39:00.000-08:002011-12-26T13:39:35.289-08:00$3 million - hmmm...I've shown this plot a couple of times already, but here it is again, with another that shows the bar we have to limbo under to win $3 million.<br />
<br />
Anyone think this is possible?<br />
<br />
Any predictions of what the final winning error will be?<br />
<br />
I predict 0.453<br />
<br />
Place your bets here...<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIHFj-Stlh0rxVoaajP9fEstzgWX3DJ05mhVaccR3YW7gW26ZmddVH5DigJ7hGAb9iHKJkRglBMFy5s32HN5u5731nTy342uvfTMweF82XtJh8F6FDrjIy0Mdn4Hmfiw1QO0bDNozLWkIk/s1600/leaderboardthroughtimeDEC2011.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIHFj-Stlh0rxVoaajP9fEstzgWX3DJ05mhVaccR3YW7gW26ZmddVH5DigJ7hGAb9iHKJkRglBMFy5s32HN5u5731nTy342uvfTMweF82XtJh8F6FDrjIy0Mdn4Hmfiw1QO0bDNozLWkIk/s320/leaderboardthroughtimeDEC2011.png" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivHNpllypsN2KdN4yWH4PkXbnAFquTUMX5AmdOmP4hPN6b2pqN5VzGe7vFZEUM25Io9ey4r5R8hrpTH3UwVMyXOSsKo7BCivVq1RNCIj6o4Wf28Z-DJPYXbGoS7V_uStq-Fjyg0gIurGbn/s1600/leaderboardthroughtimeDEC2011+3million.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivHNpllypsN2KdN4yWH4PkXbnAFquTUMX5AmdOmP4hPN6b2pqN5VzGe7vFZEUM25Io9ey4r5R8hrpTH3UwVMyXOSsKo7BCivVq1RNCIj6o4Wf28Z-DJPYXbGoS7V_uStq-Fjyg0gIurGbn/s320/leaderboardthroughtimeDEC2011+3million.png" /></a></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com5tag:blogger.com,1999:blog-8236580214478682937.post-81923017881250198982011-12-17T13:54:00.000-08:002011-12-17T14:12:40.958-08:00What's Going On HereIn many of the analytics problems I have been involved in, the problem you end up dealing with is not the one you initially were briefed to solve.<br />
<br />
These new problems are always discovered by visualising the data in some way and spotting curious patterns.<br />
<br />
Here are three examples...<br />
<br />
<b>1. Algorithmic Trading Challenge</b><br />
<br />
The <a href="http://www.kaggle.com/c/AlgorithmicTradingChallenge">Algorithmic Trading Challenge</a> is based on data from the London Stock Exchange and is about things called 'Liquidity Shocks'. I know nothing about these but we had data, so the first thing I did was plot a few graphs to see if I could get a better understanding of things.<br />
<br />
The plot below shows the times these 'Liquidity Shocks' occur.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxDim19JzDfASGfZDzXFVrVbXku6We9AS_6xX6qs-iFkNnV8Wm0OZZ8hAbgmNWgWXA_7yNzZnm6NOeooY7L4hREYkFw9EkotrXyyyrZPEs61FFHKD3Euy_6MJzxV6CQmXdR5MVC0ekoJG_/s1600/shock+event+timings.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="202" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxDim19JzDfASGfZDzXFVrVbXku6We9AS_6xX6qs-iFkNnV8Wm0OZZ8hAbgmNWgWXA_7yNzZnm6NOeooY7L4hREYkFw9EkotrXyyyrZPEs61FFHKD3Euy_6MJzxV6CQmXdR5MVC0ekoJG_/s320/shock+event+timings.png" /></a></div><br />
Now it is quite clear there is something going on at 1pm, 2:30pm, after 3:30pm and at 4pm.<br />
<br />
Interestingly, these spikes are only evident when all commodities are looked at together; they are not as obvious in any individual commodity.<br />
<br />
My first question, if I were solving a business problem, would be to return to the business to get more insight into what was going on here. My initial thoughts were lunch breaks and the opening times of other Stock Exchanges around the world - as 3:30pm London time could be around opening time in New York.<br />
<br />
Understanding the cause of these peaks is important as you would expect the reaction to them (the problem to solve) to be a function of the cause. <br />
<br />
If we did discover it was the opening times of other exchanges, then I would ask for extra information like the specific dates, so I could calculate when these peaks would occur in the future when the clocks changed. We do not have this information at the current time, or even the day of the week (it can be inferred, but not accurately, as there will be public holidays when the exchanges are closed).<br />
<br />
As it stands, any models built could potentially fail on the leaderboard (or real-life) data, as our model might think 2:30pm is a special time, whereas really it is when another exchange opens, or when people come back from lunch. We need this causal information rather than just dealing with the effect - time differences change - lunch breaks may change.<br />
<br />
The current competition data is potentially lacking the full information required to build a model that is as robust as possible over time.<br />
<br />
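To make the clock-change point concrete, here is a minimal sketch that converts an assumed 09:30 New York opening time into London clock time on three dates in 2011. The dates and the 09:30 opening time are assumptions for illustration and are not taken from the competition data. The US and the UK move their clocks on different dates, so for a short period each spring the same event lands an hour earlier on the London clock.<br />
<pre><code class="r">
# assumed 09:30 local opening time in New York on three 2011 dates
nyOpen <- as.POSIXct(
  c("2011-03-01 09:30", "2011-03-21 09:30", "2011-04-04 09:30"),
  tz = "America/New_York"
)

# the same instants expressed as London clock time
londonClock <- format(nyOpen, "%H:%M", tz = "Europe/London")

data.frame(date = format(nyOpen, "%Y-%m-%d"), londonClock)
</code></pre>
A model that has only seen time of day would treat the shifted hour as an ordinary time, which is why the dates matter.<br />
<br />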
<b>2. Interesting Distributions</b><br />
<br />
One of the first things I do when receiving a data set is to scan the distributions of all variables to sanity check them for anything that looks out of place - but still things can sneak past you.<br />
<br />
The following is exam mark data in the range 0-100. If we bin it into 20 bins then things look reasonable, but if we zoom in then we get the 'what is going on here' question again. It is quite clear what is going on, but if the exam mark is the thing we are trying to predict, how do we deal with this phenomenon, and how would our algorithm cope looking at it blindly? And what if the pass mark or the rules changed - the algorithm would fail. Again, we need to be aware of the underlying root cause and not just the effect.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJV3qnuWpULDDwC7zwBzfskOMPrbk2EMZ2t0JxiiiGyu_68meXtD2NPryGog3i4fCwP-qaZ3OznbGqSiy4sJl14Qr7B34gcGAyHsFhSnOZfElTF9PrcnR8hhxCjIjVsReta9gQ5VseBqM/s1600/A1.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="162" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFJV3qnuWpULDDwC7zwBzfskOMPrbk2EMZ2t0JxiiiGyu_68meXtD2NPryGog3i4fCwP-qaZ3OznbGqSiy4sJl14Qr7B34gcGAyHsFhSnOZfElTF9PrcnR8hhxCjIjVsReta9gQ5VseBqM/s320/A1.PNG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKjbToc8fu8gOGFxBTNq1OjLn_Idmz6w1muLFUs92nrD7S56QUtS6iK1KvnJMWiBH574m3L0PDUq7E83IpA4b-NwQxE6oi7KunI-LhX73-uuEb40ZAgN2VelKL_zaXlzAGOXcC8rOGJ7gX/s1600/A2.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="165" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhKjbToc8fu8gOGFxBTNq1OjLn_Idmz6w1muLFUs92nrD7S56QUtS6iK1KvnJMWiBH574m3L0PDUq7E83IpA4b-NwQxE6oi7KunI-LhX73-uuEb40ZAgN2VelKL_zaXlzAGOXcC8rOGJ7gX/s320/A2.PNG" /></a></div><br />
<br />
<b>3. Don't Get Kicked</b><br />
<br />
This is another <a href="http://www.kaggle.com/c/DontGetKicked">Kaggle Competition...</a> <br />
<br />
<i>Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kick cars can be very costly to dealers after transportation cost, throw-away repair work, and market losses in reselling the vehicle.<br />
<br />
Modelers who can figure out which cars have a higher risk of being kick can provide real value to dealerships trying to provide the best inventory selection possible to their customers.<br />
<br />
The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy)</i><br />
<br />
This is a binary classification task, and a quick way to spot data issues with this type of problem is to throw it into a decision tree in order to spot what are called 'gimmees'. These are cases that are easily and perfectly predictable, and more often than not they are the result of being given predictor data that just shouldn't be there because it is not known at prediction time (future information) - an extraction issue that would result in a useless model. (It is common for people to think they have built really good predictive models using future information without really questioning why their models are so good!)<br />
<br />
Another reason 'gimmees' occur is a poorly defined target variable, that is, not excluding certain cases (an example in target marketing would be not excluding dead people from your mailing list and then predicting they won't respond to your offer!).<br />
<br />
After a bit of data prep I threw the Don't Get Kicked Data into a Tiberius Decision Tree - the visual below immediately tells me there are clear cut cases of cars that will be kicked - it is almost black and white.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQSCi-Jyisy1eoVu2zm63NcN4mmVd_FNvOkbe_JxjClIZgIlFznyPofhhnpfKXL1X-GQbaCKaKnpiW7_6c4BJTXNbM4NcghLo3TTeUdyaLb43QiEhTgPMCqA1hiU1ACk5UZVZ9mwnadEtc/s1600/tree.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="230" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQSCi-Jyisy1eoVu2zm63NcN4mmVd_FNvOkbe_JxjClIZgIlFznyPofhhnpfKXL1X-GQbaCKaKnpiW7_6c4BJTXNbM4NcghLo3TTeUdyaLb43QiEhTgPMCqA1hiU1ACk5UZVZ9mwnadEtc/s320/tree.PNG" /></a></div><br />
<br />
These 'gimmees' can be described by the rules...<br />
<br />
[WheelTypeID] = 'NULL' AND [Auction] <> 'MANHEIM'<br />
<br />
MANHEIM is an auctioneering company where cars are auctioned - there are 2 main auctioneers in the data set plus 'other'.<br />
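<br />
A rule like this is easy to check once it is written as a logical vector. The sketch below uses a handful of made-up rows, and the target column name IsBadBuy is an assumption about how the kick flag is stored; it is meant only to show the shape of the check, not the real rates.<br />

```r
# Toy rows for illustration only -- made-up values, not the competition data.
cars <- data.frame(
  WheelTypeID = c("NULL", "NULL", "1", "2", "NULL", "1", "NULL", "2"),
  Auction     = c("OTHER", "OTHER", "MANHEIM", "OTHER",
                  "MANHEIM", "OTHER", "OTHER", "MANHEIM"),
  IsBadBuy    = c(1, 1, 0, 0, 0, 0, 1, 1),  # assumed name of the kick flag
  stringsAsFactors = FALSE
)

# The rule from the text, written as a logical vector.
gimmee <- cars$WheelTypeID == "NULL" & cars$Auction != "MANHEIM"

# Kick rate and number of rows inside and outside the rule.
kick_rate <- tapply(cars$IsBadBuy, gimmee, mean)
n_rows    <- table(gimmee)

print(kick_rate)
print(n_rows)
```

If the rate inside the rule is close to 0 or 1 on a reasonable number of rows, that is the cue to go and ask why before any modelling starts.<br />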
<br />
Having worked extensively with car auction data before I know that there are certain auctions where only 'write off' cars are sold, that is those that are sold for scrap because they have been in accidents. I also know that different auction houses will record data differently.<br />
<br />
The above simple rule easily identifies cars that are more than likely going to be kicked - but this is probably because they are 'kicked' in the first place (are we saying that someone in a coma is more likely to die?). Is this useful? Is this a poor definition of what 'kicked' means? Why does a missing value for WheelTypeID make such a big difference between auction houses?<br />
<br />
A bit more digging reveals that location and the specific buyer narrow down these gimmees even more...<br />
<br />
[WheelTypeID] = 'NULL' AND [Auction] <> 'MANHEIM' AND [VNST] in ('NC','AZ') AND [BYRNO] NOT IN (99750,99761)<br />
<br />
and after excluding these 'gimmees' it becomes clear there are certain buyers that just don't buy kicked cars, especially 99750 and 99761...<br />
<br />
byrno = 99750 and VNST in ('SC','NC','UT','ID','PA','WV','MO','WA')<br />
byrno = 99761 and Auction = 'MANHEIM'<br />
byrno = 99761 and MAKE = 'SUZUKI'<br />
byrno = 99761 and SIZE = 'VAN'<br />
byrno = 99761 and VNST IN ('FL','VA') <br />
<br />
Now is this actually useful? <br />
<br />
<i>The challenge of this competition is to predict if the car purchased at the Auction is a Kick (bad buy)</i><br />
<br />
The model is going to focus on who bought the car rather than the characteristics of the car itself. What happens if buyers suddenly change their policy? Wouldn't we rather just go and speak to these buyers to understand what their policy is and hence get some business understanding? Why is the specific auction house location so important? Is it because of the specific auction house itself, or because specific cars are actually routed to specific places (this does happen)?<br />
<br />
Basically if this was a real client engagement I would be going back to them with a lot of questions to help me understand the data better so it can be used in a way that is going to be useful to them. <br />
<br />
<br />
<b>In Summary</b><br />
<br />
When doing predictive modelling, you can throw the latest hot algorithm at a problem such as a GBM, Neural Net or Random Forest and get impressive results, but unless you thoroughly understand and account for the real dynamics of what is going on then the models could disastrously fail when these dynamics change. I find visualisation the key to spotting and interpreting these dynamics - which is why I would rather have a good data miner who knows what he is doing using free software over a poor data miner with the most expensive software - see http://analystfirst.com/analyst-first-101/Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-70699551362942588232011-12-16T01:58:00.000-08:002011-12-16T17:35:51.288-08:00Two Become OneIn the previous post I looked at the HHP leaderboard and discovered some interesting patterns regarding certain teams.<br />
<br />
It looks as though the evidence has proved to be true, with SD_John and Lily now all of a sudden merging into a single team.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJKOo8tGnMIA_p-okXmJS4FFLNB0oQN3fNjpuB8CbqOf0-XVcmBoKqv5OssjQL_fka_d1xbXkiL0lzsTPHFt8iVjK1yPa3-K9Gh3B83o6TTo98srtEwSp1LRyXdaCOjQ_ZNt3k6gKtsYve/s1600/SDJOHNLILY_MERGE.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="118" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJKOo8tGnMIA_p-okXmJS4FFLNB0oQN3fNjpuB8CbqOf0-XVcmBoKqv5OssjQL_fka_d1xbXkiL0lzsTPHFt8iVjK1yPa3-K9Gh3B83o6TTo98srtEwSp1LRyXdaCOjQ_ZNt3k6gKtsYve/s320/SDJOHNLILY_MERGE.PNG" /></a></div><br />
Interestingly they have also been in other competitions with very similar results.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPaFcSaId8XZtU9WeEpizoDAN9wTchC2ln0OdFNWpUNys7ZTh-rKw900Gbas1pg6YBTWvBTr27z-1bPjYnMQAPdXiq5oTAHVihH8-h9LCF_a0S5SMLcGlsN92wneX7tFY1JjW52ttKoQKb/s1600/JOHN_TEAM.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="222" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPaFcSaId8XZtU9WeEpizoDAN9wTchC2ln0OdFNWpUNys7ZTh-rKw900Gbas1pg6YBTWvBTr27z-1bPjYnMQAPdXiq5oTAHVihH8-h9LCF_a0S5SMLcGlsN92wneX7tFY1JjW52ttKoQKb/s320/JOHN_TEAM.PNG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyG2CZuIOF_t7Y5UfLUIRhKlokCXxIlJFl6svX8NmiXXK3adZalkI5XNWQ6uZ9dWx7QPDmLDOwkl3oYk4YJZfx0zTVm4DrXp-D6z5Ievabc3N7ilaDtly7ox4NOQBtJdSPVp3ljJ6bRO5g/s1600/LILY_TEAM.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="225" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyG2CZuIOF_t7Y5UfLUIRhKlokCXxIlJFl6svX8NmiXXK3adZalkI5XNWQ6uZ9dWx7QPDmLDOwkl3oYk4YJZfx0zTVm4DrXp-D6z5Ievabc3N7ilaDtly7ox4NOQBtJdSPVp3ljJ6bRO5g/s320/LILY_TEAM.PNG" /></a></div><br />
This was the final standing in the Give Me Some Credit competition:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLs1w9gMklORcJb4-gQuwC_hk2ghfauax-tbVYk6L7SYdAgh5xWPVxP4GbfR6OERzaKNPzlhpn_Bjx8O3mFzb4g9hJ7JFxjiEKSZdzHoeONN3qNCFIjKHfvZGQf4tGa4EIBWOaU-7BHa0/s1600/GIVEMESOMECREDIT.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="136" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvLs1w9gMklORcJb4-gQuwC_hk2ghfauax-tbVYk6L7SYdAgh5xWPVxP4GbfR6OERzaKNPzlhpn_Bjx8O3mFzb4g9hJ7JFxjiEKSZdzHoeONN3qNCFIjKHfvZGQf4tGa4EIBWOaU-7BHa0/s320/GIVEMESOMECREDIT.PNG" /></a></div><br />
What is actually more interesting here is the demonstration of overfitting to the leaderboard. Opera Solutions & JYL are more than likely working together and we know Lily & SD_John are working together. If you look at the leaderboard just before the competition ended (on the 30%) you will see Opera near the top but the final position on the 70% was much worse. Similarly a few others found that relying on the leaderboard as an indication of the final position can be misplaced trust.<br />
<br />
If <a href="http://www.kaggle.com/c/GiveMeSomeCredit/forums/t/1153/magic-team-migration">you followed the competition forum</a>, you will see team VSU also had multiple accounts for the same person, and they seem to have also fallen into the same trap of overfitting to the leaderboard - they ended up 9th on the 70% when they were first on the 30%.<br />
<br />
The data mining lesson here is that you need to take all necessary steps to avoid overfitting, rather than just relying on the leaderboard feedback.<br />
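<br />
A small simulation shows why the leaderboard alone is a poor guide. It is a sketch under stated assumptions: the test set size, the number of submissions and the use of plain accuracy as the score are all invented for the example (the competition itself used a different metric), and every simulated submission is equally good by construction.<br />

```r
set.seed(42)
n_cases <- 10000   # size of the hidden test set (assumed)
n_subs  <- 200     # number of submissions tried (assumed)

# 30% of the test cases drive the public leaderboard, the other 70% the final result.
is_public <- seq_len(n_cases) %in% sample(n_cases, size = 0.3 * n_cases)

# Every submission is equally good: each case is right with probability 0.75,
# so any difference between submissions is pure noise.
correct <- matrix(runif(n_cases * n_subs) < 0.75, nrow = n_cases)

public_score  <- colMeans(correct[is_public, ])
private_score <- colMeans(correct[!is_public, ])

# Pick the submission that looks best on the public 30%.
best <- which.max(public_score)
cat("best public score :", public_score[best], "\n")
cat("its private score :", private_score[best], "\n")
cat("mean private score:", mean(private_score), "\n")
```

The submission chosen for its public score was selected partly for its luck on the 30%, so its score on the 70% will typically fall back towards the average. A holdout set of your own that you do not repeatedly peek at avoids this selection effect.<br />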
<br />
Congratulations to Nathaniel, Eu Jin (small world - I used to work with Nathaniel at the National Australia Bank and regularly see Eu Jin at the Melbourne R user group) and Alec, who clearly did not overfit. A Perfect Storm!<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxQqonHJjziAUZOhIJdBLLY2RwCQ-afjC9XGOo2OeVwyBG5K-s93DwgFuZBiVxoknmv5d78uE65UsTjsWW2Da7Mt1cB3wCsSnQ2YmlGK2GO2TV-JLcbNsTVsb3QtHr-jy4s9yIYWMXC1f/s1600/credit+risk+leaderboard+opera.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipxQqonHJjziAUZOhIJdBLLY2RwCQ-afjC9XGOo2OeVwyBG5K-s93DwgFuZBiVxoknmv5d78uE65UsTjsWW2Da7Mt1cB3wCsSnQ2YmlGK2GO2TV-JLcbNsTVsb3QtHr-jy4s9yIYWMXC1f/s320/credit+risk+leaderboard+opera.png" /></a></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-78692394105491893432011-12-14T13:01:00.000-08:002011-12-14T17:12:06.463-08:00Phantom of the OperaThere have been some <a href="http://www.heritagehealthprize.com/c/hhp/forums/t/1155/rules-and-terms-conditions/7112#post7112">recent announcements on Kaggle</a> reminding competitors about the rules regarding teams and that a single person can't have multiple accounts in order to get around the daily submission limit.<br />
<br />
I used the HHP leaderboard as an interesting data source to educate myself on the data manipulation capabilities in R and it became very evident that there was some curious behaviour going on.<br />
<br />
From a data scientist's viewpoint, this demonstrates the power of the human eye in picking up things that give you insight an algorithm won't. In most (probably all) of my professional projects the important data issues and findings have been a result of looking at visualisations of the data and asking the question "what's going on here?".<br />
<br />
The first curiosity on the leaderboard came from trying to discover whether the competition was attracting new entrants, by looking at the dates of entrants' first submissions. The two plots below show different ways of looking at the same data. What is obvious is that the 29th Nov had an unusual number of new entrants.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlgfXEbviczSG14dV17EiFgFRxJBjTry6f_f_R2kM0-4sX7u02TKwVcDJk572DnV8AnvIRMr9jFtrO0e6m0IE-OaHaSVFdeBPQnQ44ON3GsSRFhLtLDwKjOmlDk5a3KDqGtrbM4uIjGBwq/s1600/first+submission+dates.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="246" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlgfXEbviczSG14dV17EiFgFRxJBjTry6f_f_R2kM0-4sX7u02TKwVcDJk572DnV8AnvIRMr9jFtrO0e6m0IE-OaHaSVFdeBPQnQ44ON3GsSRFhLtLDwKjOmlDk5a3KDqGtrbM4uIjGBwq/s320/first+submission+dates.png" /></a></div><br />
and looked at in another way...<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCrotqchCok1wikB7m2kSXDO7DEAMF3gb0vSkAzKwJc5nwFOEVpE3F6M3Xy9DIgtcGAx-LRUIG-be1sZlCIjQuHJydKrefMVc0oBGGs8rMpI_oEcoWd24TVUa7qizVA5MihDWgdfoMNVT-/s1600/first+submission+dates+cumulative.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="246" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCrotqchCok1wikB7m2kSXDO7DEAMF3gb0vSkAzKwJc5nwFOEVpE3F6M3Xy9DIgtcGAx-LRUIG-be1sZlCIjQuHJydKrefMVc0oBGGs8rMpI_oEcoWd24TVUa7qizVA5MihDWgdfoMNVT-/s320/first+submission+dates+cumulative.png" /></a></div><br />
<br />
What's going on here?<br />
<br />
If you look at the team name of the entrants it is clear that all these accounts are somewhat connected - so no real mystery as to the cause of the blip for this date.<br />
<br />
"accnt002" "accnt003" "accnt004" "accnt005" "accnt006" <br />
"accnt007" "accnt008" "accnt009" "cyclops" "Faber" <br />
"Farbe" "Fortis" "glad5" "glad55" "gladiator" <br />
"gladiator1" "gladiator2" "gladiator3" "jackie" "Kaggleacctk"<br />
"KaggleK2" "sashik" <br />
<br />
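The counts behind plots like these can be derived from a submission log with a few lines of R. The sketch below assumes a data frame with one row per submission and columns named team and date; both names and all the rows are invented for illustration, and this is not necessarily how the plots above were produced.<br />

```r
# Made-up submission log: one row per submission (team names are invented).
subs <- data.frame(
  team = c("a", "a", "b", "c", "c", "d", "e", "f"),
  date = as.Date(c("2011-11-01", "2011-11-05", "2011-11-03", "2011-11-29",
                   "2011-11-30", "2011-11-29", "2011-11-29", "2011-11-29")),
  stringsAsFactors = FALSE
)

# Keep each team's earliest submission: sort by team and date,
# then keep the first row seen for each team.
subs      <- subs[order(subs$team, subs$date), ]
first_sub <- subs[!duplicated(subs$team), ]

# New entrants per day, and the running total of entrants.
new_per_day  <- table(format(first_sub$date, "%Y-%m-%d"))
cum_entrants <- cumsum(new_per_day)

print(new_per_day)
print(cum_entrants)
```

A day with an unusual number of new entrants shows up as a spike in new_per_day and as a step in cum_entrants, which are the two views plotted above.<br />
<br />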
The next two plots show the scores of the first submission of teams.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh87TpWX3rj3GtZZj3HLGMxUe6DwMnZtoL8DRr0TB04AVFtW1emAo_194X6-8LJBe0FbAEFcznH3ne1fDrpBu323nXDZwnNPWcOGNCnweUwMkkIyi8vyESZtKJY4Oek-SwHDziaBV6CH0D9/s1600/first+scores+1.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="246" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh87TpWX3rj3GtZZj3HLGMxUe6DwMnZtoL8DRr0TB04AVFtW1emAo_194X6-8LJBe0FbAEFcznH3ne1fDrpBu323nXDZwnNPWcOGNCnweUwMkkIyi8vyESZtKJY4Oek-SwHDziaBV6CH0D9/s320/first+scores+1.png" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPepPa8QQfXqXEqk5tuwCEHpq2ZaicKPJkRjsBs3R_qTGqxglFVq3AjrUb_CQthqZuoK_7nlENPC1Bdzwt9tX8tQCZXrUpj9-fq-E2ZMyq9GSoUXSsBQnOhoq2pqfrpYfd7hJiU9j9c42I/s1600/first+scores+2.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="246" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPepPa8QQfXqXEqk5tuwCEHpq2ZaicKPJkRjsBs3R_qTGqxglFVq3AjrUb_CQthqZuoK_7nlENPC1Bdzwt9tX8tQCZXrUpj9-fq-E2ZMyq9GSoUXSsBQnOhoq2pqfrpYfd7hJiU9j9c42I/s320/first+scores+2.png" /></a></div><br />
<br />
What's going on here?<br />
<br />
The common scores where the steps are seen are the all zeros benchmark, optimised constant benchmark and the code we posted in our writeup - so this is explained. There is another common first score which is another very simple model that many teams independently thought of.<br />
<br />
What does raise an eyebrow from the cumulative plot is that one team stands out as having a very impressive first score. This is team YARONG, who posted a very impressive score of 0.457698 on the first attempt, and it still remains their best score 22 attempts later. This is possible (you don't need to submit models to blend them if you have your own holdout set - see the IBM writeup in the KDD Cup Orange Challenge) but somewhat unlikely, as we know from the writeups that an individual model will get you nowhere near this score.<br />
<br />
If you look at the dates teams submit and look at some sort of correlation of entry dates, one team appears twice towards the top - SD_John, and they are also at the top of the leaderboard. <br />
<br />
<br />
<br />
<pre>td.row             td.col       pairs  correl
UCI-CS273A-RegAll  Alex_Tot        27  0.9979902
rutgers            HappyAcura      29  0.9978254
SD_John            lily            34  0.9974190
Roger99            Krakozjabra     21  0.9956643
SD_John            JYL             24  0.9950884
The_Cuckoo's_Nest  NumberNinja     23  0.9931073
NumberNinja        Chris_R         29  0.9924864
</pre>
<br />
<br />
What is going on here?<br />
<br />
If you plot the submissions and scores you will see SD_John and Lily seem to perfectly track each other in the days they submit, the times they submit and the scores they get.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHTsNX9i9vdfPk_f-LUHmP0gmtVIU52GqFt6OmrurwBPxrNe_CtdHt117m8f4xdUf7QhRG84VVVmL4S1cfs_baoRXVZowZUGlA2fKCHuj-Yc0jevxE-pzAQfQoI1PH2IA3VsIMYAMk0XLe/s1600/opera+submissions.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="246" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHTsNX9i9vdfPk_f-LUHmP0gmtVIU52GqFt6OmrurwBPxrNe_CtdHt117m8f4xdUf7QhRG84VVVmL4S1cfs_baoRXVZowZUGlA2fKCHuj-Yc0jevxE-pzAQfQoI1PH2IA3VsIMYAMk0XLe/s320/opera+submissions.png" /></a></div><br />
And on one particular day they get exactly the same score within 5 minutes of each other...<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMRTJw012T-8aEIfl6biBNPrh6wbetdicEfMz4mBPHdPojjS2_JOX2FJ-ykCCoZREax_ZHJiMYpDOjFw71lHd5ao6eKTJlAdreePdyXUTKBoDtm9cSFrORUcnek4bFa553-1YgyR8zweQo/s1600/lily+sd+john.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="231" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiMRTJw012T-8aEIfl6biBNPrh6wbetdicEfMz4mBPHdPojjS2_JOX2FJ-ykCCoZREax_ZHJiMYpDOjFw71lHd5ao6eKTJlAdreePdyXUTKBoDtm9cSFrORUcnek4bFa553-1YgyR8zweQo/s320/lily+sd+john.PNG" /></a></div><br />
SD_John and JYL seem to also track each other in submission dates. Interestingly JYL has a very similar profile to a member of Opera, and a little digging would suggest this is one and the same person. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0qRyIlNv9bb60U4bPZA1iHz0rE6aNyNQ9P7by16FTUgnMkr4ZinpjAUIwwehZ9396f569wwdX-KT3IBhoMPr9i0GUSJU-UxBU6NmZLqj9Qei8K5zK0_3eoVVctz2B48uq46ZtVSNIJBpG/s1600/Jeong3.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="282" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0qRyIlNv9bb60U4bPZA1iHz0rE6aNyNQ9P7by16FTUgnMkr4ZinpjAUIwwehZ9396f569wwdX-KT3IBhoMPr9i0GUSJU-UxBU6NmZLqj9Qei8K5zK0_3eoVVctz2B48uq46ZtVSNIJBpG/s320/Jeong3.PNG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBJ8ohAHY17drVqdQrnaFHLZUJGXz98xuk7EUw6RxTWS9VKZAW3jjP5agcHe-bieFNElpEjCtkxZOlUC_Q41Q35IYXy6nyPbyCo8D-qgg-S8KYtAVzBZ3cwolbYqCXIPw9ANm5xGMeZi8m/s1600/jeong1.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="215" width="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBJ8ohAHY17drVqdQrnaFHLZUJGXz98xuk7EUw6RxTWS9VKZAW3jjP5agcHe-bieFNElpEjCtkxZOlUC_Q41Q35IYXy6nyPbyCo8D-qgg-S8KYtAVzBZ3cwolbYqCXIPw9ANm5xGMeZi8m/s320/jeong1.PNG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj-PtvIy7o7I0IU5potXGkKDa7clcaKMjB3knUO2eAHbIoggd04NPQpNg8RITpT6555FguXBlLC36yreiOtv7UhiGb9ZUzBQBUV9jOrhyphenhyphenvhQk4bpHUOB9iT27Ng3ccdxDS20O2XWm_VHU5/s1600/Jeong2.PNG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="320" width="313" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhj-PtvIy7o7I0IU5potXGkKDa7clcaKMjB3knUO2eAHbIoggd04NPQpNg8RITpT6555FguXBlLC36yreiOtv7UhiGb9ZUzBQBUV9jOrhyphenhyphenvhQk4bpHUOB9iT27Ng3ccdxDS20O2XWm_VHU5/s320/Jeong2.PNG" /></a></div><br />
So here we can hypothesize that SD_john, lily, JYL and Opera (and evidence also suggests many more teams) are collaborating in some way.<br />
<br />
Interesting - all from following your nose, which is what good data mining is all about.<br />
<br />
In conclusion, the top of the leaderboard is not really what it appears to be - which I hope will encourage others to keep trying.<br />
<br />
The main reason for this investigation was to help me discover what R can do to manipulate data - and the answer is basically anything you want it to do. You first have to know what you want to achieve then do some Googling and you will find some code to help you somewhere.Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com5tag:blogger.com,1999:blog-8236580214478682937.post-16104243846684849422011-11-17T04:18:00.000-08:002011-11-17T04:27:50.227-08:00The Pack is Catching Up<div dir="ltr" style="text-align: left;" trbidi="on"><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">If you have been keeping an eye on the leaderboard, then you will notice there has been apparently little activity since the milestone 1 deadline. On some occasions there has been no change in the top 40 positions for over one week. This is quite an eerie silence and I suspect there may be a lack of submissions so teams can merge (the total submissions of teams merging has to be less than the number of days the comp has been running).<br />
</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">There have also been some very interesting movements if you look closer - more to come on this in a later post.</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">If you look further down though, the pack is catching up. In the past six weeks more teams are heading towards the 0.461 mark, which is the point the early leaders got to straight away and was the score to beat. Now it is only a top 50 place.<br />
</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">This score is a good single model. To improve dramatically from there though it is probably necessary to ensemble various models. What is pretty clear though is that the benchmark of 0.40 for the 3 million is impossible (hopefully this might be adjusted?).</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I tried to put some nice colours in the chart below, which is generated in R, but could not find an up-to-date listing of colour codes in R. This is one of the disadvantages of the open source movement - documentation is very low on the contributors' list of priorities (and what documentation there is leaves a lot to be desired if R is to be used by 'regular' types of people).<br />
</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I did find the following link though, which is where I got the colours for the plot,</span><br />
<br />
<a href="http://colorbrewer2.org/"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">http://colorbrewer2.org/</span></a><br />
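<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">For what it is worth, base R can list its own built-in colour names, and hex codes copied from a palette site can be passed straight to plotting functions. A minimal sketch follows; the three hex values are just example values chosen for illustration, not the ones used in the chart below.</span><br />

```r
# colours() (also spelled colors()) returns the names of R's built-in colours.
all_colours <- colours()
length(all_colours)
head(all_colours)

# Built-in colours can be picked by name or by position in that vector.
all_colours[1:3]

# Hex codes can be used anywhere a colour is expected, e.g. col = my_palette.
my_palette <- c("#1b9e77", "#d95f02", "#7570b3")  # example hex values
col2rgb(my_palette)   # red/green/blue components, one column per colour
```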
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"> </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">click on the image to enlarge</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRXh3VoMlgkjdeRBwaU94ju8nSfL3hYQNe3Hk2OrFGuKxi_bL2LnmAwGI0ZhQQdA9FMVLDW6jXGKC64cimm0-MRr8tUVO0oU6-9xK1eo32q8mrQcQKyfaZkGsS_0GR3bS-m_WSH7CIB-7W/s1600/leaderboard+through+time+16+Nov.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRXh3VoMlgkjdeRBwaU94ju8nSfL3hYQNe3Hk2OrFGuKxi_bL2LnmAwGI0ZhQQdA9FMVLDW6jXGKC64cimm0-MRr8tUVO0oU6-9xK1eo32q8mrQcQKyfaZkGsS_0GR3bS-m_WSH7CIB-7W/s320/leaderboard+through+time+16+Nov.png" width="320" /></a></div><br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com2tag:blogger.com,1999:blog-8236580214478682937.post-1828950747579148602011-10-07T15:46:00.000-07:002011-10-07T15:53:07.608-07:00If you can't measure it, you can't manage it!<div dir="ltr" style="text-align: left;" trbidi="on"><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">"If you can't measure it, you can't manage it!" - this is often quoted in the marketing and analytics world.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">Apparently </span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><i>"It comes from a remark attributed to GALILEO, who said 'Count what is<br />
countable, measure what is measurable. What is not measurable, make<br />
measurable'."</i></span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">see <a href="http://answers.google.com/answers/threadview?id=139473">here</a> for more details on the origins of the quote.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">Anyway, the point is that if you don't measure the before and after, you have no idea if the actions made a difference.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">Last week in my <a href="http://anotherdataminingblog.blogspot.com/2011/10/code-for-respectable-hhp-model.html">previous blog post</a>, I made available some code that would result in a score of around 0.4635 on the HHP leaderboard. The hope was that others would be able to get this code to work and re-ignite interest in the HHP and solving the world's health problems.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">When I looked at the leaderboard today, it looked on the face of it that this was happening, with a few high risers around the 0.4635 mark.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"><span style="font-size: small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPxblXvwSKCMXwVYrqB-J8rGDnuUxiTSpSDNXoSgyGf0EXdmCEyeTI95H_w3YO-JwK3T388tGDWuaiH3XLjbkmuFIiQ6Qpa_jNzBt7O6sNiZvsm-bdasINsxjeOb56omFIv8equOqdUun2/s1600/leaderboard+after+code1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPxblXvwSKCMXwVYrqB-J8rGDnuUxiTSpSDNXoSgyGf0EXdmCEyeTI95H_w3YO-JwK3T388tGDWuaiH3XLjbkmuFIiQ6Qpa_jNzBt7O6sNiZvsm-bdasINsxjeOb56omFIv8equOqdUun2/s320/leaderboard+after+code1.PNG" width="247" /></a></span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">This was the after, but the before is not shown, so on this evidence there is a bit of a hole in the claim that the code was responsible for this change - we are only showing the part of the leaderboard that we want to - such movement could be going on all over the place.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">Good old Jeff Moser at Kaggle has been hard at work making the leaderboard dynamic, so we can actually go back in time. Knowing this I modified a previous R script I posted to plot the leaderboard at 3 points in time, the day I released the code, and a week before and after (the new script can be found at the end of this post).</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">This resulted in the following 2 plots...</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"><span style="font-size: small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimLjuKeQC0W3RzdQ0I23OqHxAKE7x_CwBx2pnRFmBpJREfcErCncKgHHO-M0gUboghhWvwz7FKZZdrIzeAXa9msF898FNUE_4MZolNUgBnlqwcEOHikvZM1BYxec9deLz07yCbnfssTuUJ/s1600/leaderboardthroughtime1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimLjuKeQC0W3RzdQ0I23OqHxAKE7x_CwBx2pnRFmBpJREfcErCncKgHHO-M0gUboghhWvwz7FKZZdrIzeAXa9msF898FNUE_4MZolNUgBnlqwcEOHikvZM1BYxec9deLz07yCbnfssTuUJ/s320/leaderboardthroughtime1.png" width="320" /></a></span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div class="separator" style="clear: both; font-family: Arial,Helvetica,sans-serif; text-align: center;"><span style="font-size: small;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6q57U5a5r3BDaaR5LxAP-jqovkxBatdfPSOEX3JqhqcARg8iCfNK0CJC2Wu63Ge_VYIZxDunzGXQA1aSp73lWnxO9MTrLJjFZajHH7DsNYI3Dg527TtuTw98NbjNtAeVPKQNm62abbjgJ/s1600/leaderboardthroughtime2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6q57U5a5r3BDaaR5LxAP-jqovkxBatdfPSOEX3JqhqcARg8iCfNK0CJC2Wu63Ge_VYIZxDunzGXQA1aSp73lWnxO9MTrLJjFZajHH7DsNYI3Dg527TtuTw98NbjNtAeVPKQNm62abbjgJ/s320/leaderboardthroughtime2.png" width="320" /></a></span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">These are now a lot more convincing that the code did make a difference. If you follow Eu Jin Lok on the first plot, you will see he has improved his score but actually gone backwards in rank over the two week period. The second plot shows that the number of submissions to score 0.4635 has dropped.</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;"><br />
</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">Anyway, the point of this post is that data scientists have the power to make massive differences to the bottom lines of companies. I have been there and done it, but failed to get too excited as I am unassuming and knew exactly what the result would be because my models told me in advance. If analytics - and us analysts - are to be taken more seriously and given the recognition we deserve, we need to start banging our own drum more than we do. Those who get the recognition and rewards are not always those who do the best work, but those who shout the loudest!</span><br />
<br />
<span style="font-size: small;">And here is the R script to generate those plots... </span></div><br />
<br />
<div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><br />
<pre style="background: #ffffff; color: black;"><span style="color: #3f7f59;">##########################################</span>
<span style="color: #3f7f59;"># Some R function to plot the Kaggle </span>
<span style="color: #3f7f59;"># leaderboard at different points in time</span>
<span style="color: #3f7f59;">##########################################</span>
plotKaggleLeaderboard <- function(theURL
,theDates
,myTeam
,topX=100
,title
,plottype=1){
<span style="color: #3f7f59;">#this library does all the hard work</span>
<span style="color: #3f7f59;">#for windows users to install this package see</span>
<span style="color: #3f7f59;">#</span><span style="color: #3f3fbf;">http://cran.r-project.org/bin/windows/contrib/r-release/ReadMe</span>
<span style="color: #3f7f59;">#and </span><span style="color: #3f3fbf;">http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.13/</span>
library(XML)
theColours <- c(552,254,26)
<span style="color: #3f7f59;">#loop over the theDates argument (not the global compDates)</span>
<span style="color: #7f0055; font-weight: bold;">for</span> (i in 1:<span style="color: #7f0055; font-weight: bold;">length</span>(theDates))
{
<span style="color: #3f7f59;">#go and read the tables from the web page</span>
thisDate <- paste(theURL,<span style="color: #2a00ff;">"?asOf="</span>,theDates[i],sep=<span style="color: #2a00ff;">""</span>)
tables <- readHTMLTable(thisDate)
<span style="color: #3f7f59;">#get the table of interest</span>
<span style="color: #3f7f59;">#names(tables)</span>
lb <- tables[[<span style="color: #2a00ff;">'leaderboard-table'</span>]]
<span style="color: #3f7f59;">#see what the columns are</span>
<span style="color: #3f7f59;">#colnames(lb)</span>
<span style="color: #3f7f59;">#nrow(lb)</span>
<span style="color: #3f7f59;">#convert to numeric - see ?readHTMLTable</span>
<span style="color: #3f7f59;">#numeric_cols <- c('AUC','Entries')</span>
<span style="color: #3f7f59;">#numeric_cols <- c('RMSLE','Entries')</span>
numeric_cols <- c(4,5)
lb[numeric_cols] = lapply(lb[numeric_cols], function(x) as.numeric(gsub(<span style="color: #2a00ff;">".* "</span>, <span style="color: #2a00ff;">""</span>, as.character(x))))
<span style="color: #3f7f59;">#extract the team name from a messy field</span>
team_col <- c(<span style="color: #2a00ff;">'Team Name'</span>)
<span style="color: #3f7f59;">#lb[team_col]</span>
<span style="color: #3f7f59;">#split the field by "\r\n" that denotes the break between fields within a field</span>
lb[,team_col] <- sapply(strsplit(as.character(lb[,team_col]), <span style="color: #2a00ff;">"[</span><span style="color: #2a00ff;">\r</span><span style="color: #2a00ff;">\n</span><span style="color: #2a00ff;">]"</span>), function (x) x[1])
<span style="color: #3f7f59;">#####################</span>
<span style="color: #3f7f59;"># plot</span>
<span style="color: #3f7f59;">#####################</span>
myRank <- which(lb[team_col] == myTeam)
myText = paste(<span style="color: #2a00ff;">"following team"</span>,myTeam)
error_data <- lb[,4]
entries_data <- lb[,5]
<span style="color: #7f0055; font-weight: bold;">if</span>(plottype==1){
<span style="color: #7f0055; font-weight: bold;">if</span>(i==1) plot(error_data[1:topX],col = theColours[i],type=<span style="color: #2a00ff;">"l"</span>,xlab=<span style="color: #2a00ff;">'Rank'</span>,ylab=<span style="color: #2a00ff;">'Error'</span>,main = title, <span style="color: #7f0055; font-weight: bold;">sub</span>=myText)
<span style="color: #7f0055; font-weight: bold;">if</span>(i>1) points(error_data[1:topX],col = theColours[i],type=<span style="color: #2a00ff;">"l"</span>)
<span style="color: #3f7f59;">#mark position</span>
points(myRank,error_data[myRank],col=theColours[i],pch=19)
}
<span style="color: #7f0055; font-weight: bold;">if</span>(plottype==2){
<span style="color: #7f0055; font-weight: bold;">if</span>(i==1){ plot(error_data[1:topX],entries_data[1:topX],type = <span style="color: #2a00ff;">'p'</span>
,main = title
, xlab = <span style="color: #2a00ff;">'Error'</span>
, ylab = <span style="color: #2a00ff;">'Entries'</span>
, col= theColours[i]
)}
<span style="color: #7f0055; font-weight: bold;">if</span>(i>1) points(error_data[1:topX],entries_data[1:topX],type = <span style="color: #2a00ff;">'p'</span>, col= theColours[i])
lines(lowess(error_data[1:topX],entries_data[1:topX]), col=theColours[i], lwd=2.5) <span style="color: #3f7f59;"># lowess line (x,y)</span>
<span style="color: #3f7f59;">#mark my position</span>
points(error_data[myRank],entries_data[myRank],col=theColours[i],pch=19)
}
}
legend(<span style="color: #2a00ff;">"topleft"</span>,legend=theDates,text.col=theColours)
}
<span style="color: #3f7f59;">###########################</span>
<span style="color: #3f7f59;"># End of Function</span>
<span style="color: #3f7f59;"># plotKaggleLeaderboard</span>
<span style="color: #3f7f59;">###########################</span>
compURL <- <span style="color: #2a00ff;">"</span><span style="color: #3f3fbf;">http://www.heritagehealthprize.com/c/hhp/Leaderboard</span><span style="color: #2a00ff;">"</span>
compDates <- c(<span style="color: #2a00ff;">'2011-09-24'</span>,<span style="color: #2a00ff;">'2011-10-01'</span>,<span style="color: #2a00ff;">'2011-10-07'</span>)
compTeam <- <span style="color: #2a00ff;">'Eu Jin Lok'</span>
compTopX <- 150
compTitle <- <span style="color: #2a00ff;">'HHP Leaderboard Through Time'</span>
plotKaggleLeaderboard(theURL = compURL
,theDates = compDates
,myTeam = compTeam
,topX = compTopX
,title = compTitle
,plottype=1
)
</pre><br />
<br />
</div><br />
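<div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">The two clean-up steps inside the function are easy to try without hitting the web page. Below is a minimal sketch in base R on made-up strings; the exact text that appears in those leaderboard cells is an assumption here, so treat the sample values as hypothetical.</span></div>

```r
# Hypothetical raw cells, standing in for what readHTMLTable() returns.
raw_score <- c("0.4635", "new 0.4641", "up 3 0.4650")
raw_team  <- c("Eu Jin Lok\r\n   (extra text)", "Market Makers")

# Same idea as the script: drop everything up to the last space,
# then convert what is left to a number.
clean_score <- as.numeric(gsub(".* ", "", raw_score))

# Same idea as the script: keep only the text before the first line break.
clean_team <- sapply(strsplit(raw_team, "[\r\n]"), function(x) x[1])

print(clean_score)
print(clean_team)
```

<div style="font-family: Arial,Helvetica,sans-serif;"><span style="font-size: small;">If the page layout differs from this guess, the regular expressions are the only part that should need changing.</span></div>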
<br />
<br />
<br />
<br />
<br />
<br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com3tag:blogger.com,1999:blog-8236580214478682937.post-84862058154161249022011-10-01T06:43:00.000-07:002011-10-01T06:55:37.909-07:00Code for a Respectable HHP Model<div dir="ltr" style="text-align: left;" trbidi="on"><div style="font-family: Arial,Helvetica,sans-serif;">Our team Market Makers did well in the HHP first milestone prize, taking out top spot just above Willem Mestrom. Congratulations to all who took part.<br />
<br />
For those of you who haven't seen the presentation, there is a recording at <a href="http://www.youtube.com/watch?v=xziMY3nDB_E">http://www.youtube.com/watch?v=xziMY3nDB_E</a> <br />
<br />
Reports were written on our methods and are available here... <br />
<a href="http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1">http://www.heritagehealthprize.com/c/hhp/Leaderboard/milestone1</a> </div><div style="font-family: Arial,Helvetica,sans-serif;"><br />
Other competitors now have 30 days to read the reports and come back with any concerns.<br />
<br />
In our report we included some source code, which is a bit of a pain to copy and paste, so I have reproduced it here. There are two bits of code to run, and at the end a file ready for submission should appear.<br />
<br />
The first bit is the data preparation, which I did in SQL. I used SQL Server, but it should be easy to run it in any SQL database without much messing around. It took about 8 minutes to run and leaves a table ready for use by the second part, the modelling in R, which takes about 15 minutes.<br />
<br />
There are two modifications to the code to make - in the SQL you need to set the correct path to the source files - and in R you need to put in the correct server name so the data can be read directly from the database.<br />
<br />
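For the R side of that change, a minimal sketch of reading the table is below. It assumes the RODBC package and Windows authentication, neither of which is stated in this post; the server name is a placeholder and the two helper function names are made up for illustration.<br />

```r
# Build an ODBC connection string for SQL Server.
# 'server' is whatever your instance is called (the one below is a placeholder).
make_conn_string <- function(server, database = "HHP_comp") {
  paste0("driver=SQL Server;server=", server,
         ";database=", database, ";trusted_connection=true")
}

# Read the whole modelling_set table into a data frame.
# Assumes the RODBC package is installed; it is only touched when this runs.
read_modelling_set <- function(server) {
  channel <- RODBC::odbcDriverConnect(make_conn_string(server))
  if (!inherits(channel, "RODBC")) stop("could not connect to ", server)
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlQuery(channel, "SELECT * FROM modelling_set")
}

# Show the string that would be used, without connecting to anything.
print(make_conn_string("MYPC\\SQLEXPRESS"))
```

Calling read_modelling_set() with your own server name should then return the table as a data frame, provided the SQL script has already been run.<br />
<br />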
So hopefully in two mouse clicks there will be a file sitting there ready for submission.<br />
<br />
If you are not sure about using SQL Server, then the earlier blog posts should point you in the right direction...<br />
<br />
<a href="http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html">http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html</a><br />
<br />
Note that the SQL provided does not generate all the variables we used - others were created, all using SQL, but the code was a little more complicated. The extra variables are described in our write-up. Also the parameters used in the GBM modelling are not the most efficient. You will need more trees and a lower shrinkage factor to tease out a better model (see the R documentation) - but this one runs in a reasonable time.</div><div style="font-family: Arial,Helvetica,sans-serif;"><br />
I am keen to see if this code is useful to anybody - if you can't get it to work then please post a comment here. If you do use it then I would also like to hear from you!<br />
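To make the trees versus shrinkage trade-off concrete, here is a rough sketch on toy data. It assumes the gbm package is installed, and the parameter values are illustrative only, not the ones from our report.<br />

```r
# Toy data only; the real modelling_set columns are not reproduced here.
set.seed(1)
n <- 1000
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 2 * toy$x1 + sin(6 * toy$x2) + rnorm(n, sd = 0.3)

# Two hypothetical settings: a quick one, and a slower one with
# ten times the trees and a tenth of the shrinkage.
quick  <- list(n.trees = 200,  shrinkage = 0.1)
slower <- list(n.trees = 2000, shrinkage = 0.01)

if (requireNamespace("gbm", quietly = TRUE)) {
  library(gbm)
  for (p in list(quick, slower)) {
    fit <- gbm(y ~ x1 + x2, data = toy, distribution = "gaussian",
               n.trees = p$n.trees, shrinkage = p$shrinkage,
               interaction.depth = 3, bag.fraction = 0.5,
               train.fraction = 0.8, n.minobsinnode = 10)
    # valid.error holds the loss on the held-out 20% after each tree
    cat("shrinkage", p$shrinkage,
        "best held-out error", min(fit$valid.error), "\n")
  }
}
```

The slower setting takes roughly ten times as long to fit, which is the price of the smaller steps.<br />
<br />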
<br />
</div><div style="font-family: Arial,Helvetica,sans-serif;">SQL Code</div><br />
<div style="border-color: black; border-style: solid; border-width: 1px; height: 300px; overflow: auto; width: 500px;"><span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/**************************************************************</span><br />
<span style="font-size: xx-small;">* SQL Code to create an example data set for the HHP</span><br />
<span style="font-size: xx-small;">*</span><br />
<span style="font-size: xx-small;">* Edit the path in the 'bulk insert' commands to locate</span><br />
<span style="font-size: xx-small;">* the source data</span><br />
<span style="font-size: xx-small;">* The end result is a table called 'modelling_set' which can </span><br />
<span style="font-size: xx-small;">* then be used to build predictive models</span><br />
<span style="font-size: xx-small;">* </span><br />
<span style="font-size: xx-small;">* created in SQL server express</span><br />
<span style="font-size: xx-small;">* http://www.microsoft.com/sqlserver/en/us/editions/express.aspx</span><br />
<span style="font-size: xx-small;">*****************************************************************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/**************************</span><br />
<span style="font-size: xx-small;">create a new database</span><br />
<span style="font-size: xx-small;">**************************/</span><br />
<span style="font-size: xx-small;">CREATE DATABASE HHP_comp</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;">USE HHP_comp</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/**************************</span><br />
<span style="font-size: xx-small;">load in the raw data</span><br />
<span style="font-size: xx-small;">**************************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">--claims</span><br />
<span style="font-size: xx-small;">CREATE TABLE Claims</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID VARCHAR(8) --integers starting with 0, could be text!</span><br />
<span style="font-size: xx-small;">, ProviderID VARCHAR(7) --integers starting with 0, could be text!</span><br />
<span style="font-size: xx-small;">, Vendor VARCHAR(6) --integers starting with 0, could be text!</span><br />
<span style="font-size: xx-small;">, PCP VARCHAR(5) --integers starting with 0, could be text!</span><br />
<span style="font-size: xx-small;">, Year VARCHAR(2)</span><br />
<span style="font-size: xx-small;">, Specialty VARCHAR(25)</span><br />
<span style="font-size: xx-small;">, PlaceSvc VARCHAR(19)</span><br />
<span style="font-size: xx-small;">, PayDelay VARCHAR(4)</span><br />
<span style="font-size: xx-small;">, LengthOfStay VARCHAR(10)</span><br />
<span style="font-size: xx-small;">, DSFS VARCHAR(12)</span><br />
<span style="font-size: xx-small;">, PrimaryConditionGroup VARCHAR(8)</span><br />
<span style="font-size: xx-small;">, CharlsonIndex VARCHAR(3)</span><br />
<span style="font-size: xx-small;">, ProcedureGroup VARCHAR(4)</span><br />
<span style="font-size: xx-small;">, SupLOS TINYINT</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT Claims</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release2\Claims.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">--members</span><br />
<span style="font-size: xx-small;">CREATE TABLE Members</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID_M VARCHAR(8) --integers starting with 0, could be text!</span><br />
<span style="font-size: xx-small;">, AgeAtFirstClaim VARCHAR(5)</span><br />
<span style="font-size: xx-small;">, Sex VARCHAR(1)</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT Members</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release2\Members.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- drug count</span><br />
<span style="font-size: xx-small;">CREATE TABLE DrugCount</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID INT</span><br />
<span style="font-size: xx-small;">, Year VARCHAR(2)</span><br />
<span style="font-size: xx-small;">, DSFS VARCHAR(12)</span><br />
<span style="font-size: xx-small;">, DrugCount VARCHAR(2)</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT DrugCount</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release3\DrugCount.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- Lab Count</span><br />
<span style="font-size: xx-small;">CREATE TABLE LabCount</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID INT</span><br />
<span style="font-size: xx-small;">, Year VARCHAR(2)</span><br />
<span style="font-size: xx-small;">, DSFS VARCHAR(12)</span><br />
<span style="font-size: xx-small;">, LabCount VARCHAR(3)</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT LabCount</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release3\LabCount.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">--DaysInHospital_Y2</span><br />
<span style="font-size: xx-small;">CREATE TABLE DaysInHospital_Y2</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID INT</span><br />
<span style="font-size: xx-small;">, ClaimsTruncated TINYINT</span><br />
<span style="font-size: xx-small;">, DaysInHospital TINYINT</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT DaysInHospital_Y2</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y2.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- DaysInHospital_Y3 </span><br />
<span style="font-size: xx-small;">CREATE TABLE DaysInHospital_Y3</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID INT</span><br />
<span style="font-size: xx-small;">, ClaimsTruncated TINYINT</span><br />
<span style="font-size: xx-small;">, DaysInHospital TINYINT</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT DaysInHospital_Y3</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release2\DaysInHospital_Y3.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- Target</span><br />
<span style="font-size: xx-small;">CREATE TABLE Target</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MemberID INT</span><br />
<span style="font-size: xx-small;">, ClaimsTruncated TINYINT</span><br />
<span style="font-size: xx-small;">, DaysInHospital TINYINT</span><br />
<span style="font-size: xx-small;">)</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">BULK INSERT Target</span><br />
<span style="font-size: xx-small;">FROM 'F:\comps\hhp\raw data\HHP_release2\Target.csv'</span><br />
<span style="font-size: xx-small;">WITH</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">MAXERRORS = 0,</span><br />
<span style="font-size: xx-small;">FIRSTROW = 2,</span><br />
<span style="font-size: xx-small;">FIELDTERMINATOR = ',',</span><br />
<span style="font-size: xx-small;">ROWTERMINATOR = '\n'</span><br />
<span style="font-size: xx-small;">)</span><br />
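<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- optional check, added as a suggestion (not part of the report code):</span><br />
<span style="font-size: xx-small;">-- every table should show a non-zero row count after the bulk inserts</span><br />
<span style="font-size: xx-small;">SELECT 'Claims' AS table_name, COUNT(*) AS n_rows FROM Claims</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'Members', COUNT(*) FROM Members</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'DrugCount', COUNT(*) FROM DrugCount</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'LabCount', COUNT(*) FROM LabCount</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'DaysInHospital_Y2', COUNT(*) FROM DaysInHospital_Y2</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'DaysInHospital_Y3', COUNT(*) FROM DaysInHospital_Y3</span><br />
<span style="font-size: xx-small;">UNION ALL SELECT 'Target', COUNT(*) FROM Target</span><br />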
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/*************************</span><br />
<span style="font-size: xx-small;">adjust the claims data to</span><br />
<span style="font-size: xx-small;">convert text to integers</span><br />
<span style="font-size: xx-small;">**************************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- PayDelay</span><br />
<span style="font-size: xx-small;">ALTER TABLE Claims </span><br />
<span style="font-size: xx-small;">ADD PayDelayI integer</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Claims</span><br />
<span style="font-size: xx-small;">SET PayDelayI = CASE WHEN PayDelay = '162+' THEN 162 ELSE CAST(PayDelay AS integer) END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">--dsfs</span><br />
<span style="font-size: xx-small;">ALTER TABLE Claims </span><br />
<span style="font-size: xx-small;">ADD dsfsI integer</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Claims</span><br />
<span style="font-size: xx-small;">SET dsfsI =</span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '0- 1 month' THEN 1</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '1- 2 months' THEN 2</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '2- 3 months' THEN 3</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '3- 4 months' THEN 4</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '4- 5 months' THEN 5</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '5- 6 months' THEN 6</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '6- 7 months' THEN 7</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '7- 8 months' THEN 8</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '8- 9 months' THEN 9</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '9-10 months' THEN 10</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '10-11 months' THEN 11</span><br />
<span style="font-size: xx-small;">WHEN dsfs = '11-12 months' THEN 12</span><br />
<span style="font-size: xx-small;">WHEN dsfs IS NULL THEN NULL</span><br />
<span style="font-size: xx-small;">END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- CharlsonIndex</span><br />
<span style="font-size: xx-small;">ALTER TABLE Claims </span><br />
<span style="font-size: xx-small;">ADD CharlsonIndexI INTEGER</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Claims</span><br />
<span style="font-size: xx-small;">SET CharlsonIndexI =</span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN CharlsonIndex = '0' THEN 0 </span><br />
<span style="font-size: xx-small;">WHEN CharlsonIndex = '1-2' THEN 2</span><br />
<span style="font-size: xx-small;">WHEN CharlsonIndex = '3-4' THEN 4</span><br />
<span style="font-size: xx-small;">WHEN CharlsonIndex = '5+' THEN 6 </span><br />
<span style="font-size: xx-small;">END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- LengthOfStay</span><br />
<span style="font-size: xx-small;">ALTER TABLE Claims </span><br />
<span style="font-size: xx-small;">ADD LengthOfStayI INTEGER</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Claims</span><br />
<span style="font-size: xx-small;">SET LengthOfStayI =</span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '1 day' THEN 1</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '2 days' THEN 2</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '3 days' THEN 3</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '4 days' THEN 4</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '5 days' THEN 5</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '6 days' THEN 6 </span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '1- 2 weeks' THEN 11</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '2- 4 weeks' THEN 21</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '4- 8 weeks' THEN 42</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay = '26+ weeks' THEN 180</span><br />
<span style="font-size: xx-small;">WHEN LengthOfStay IS NULL THEN null</span><br />
<span style="font-size: xx-small;">END</span><br />
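<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- optional check, added as a suggestion (not part of the report code):</span><br />
<span style="font-size: xx-small;">-- list any LengthOfStay text values that the CASE above did not map;</span><br />
<span style="font-size: xx-small;">-- an empty result means every non-null value received a number</span><br />
<span style="font-size: xx-small;">SELECT LengthOfStay, COUNT(*) AS n_rows</span><br />
<span style="font-size: xx-small;">FROM Claims</span><br />
<span style="font-size: xx-small;">WHERE LengthOfStay IS NOT NULL AND LengthOfStayI IS NULL</span><br />
<span style="font-size: xx-small;">GROUP BY LengthOfStay</span><br />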
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/**************************</span><br />
<span style="font-size: xx-small;">create a summary table </span><br />
<span style="font-size: xx-small;">at the member/year level</span><br />
<span style="font-size: xx-small;">***************************/</span><br />
<span style="font-size: xx-small;">SELECT </span><br />
<span style="font-size: xx-small;">year</span><br />
<span style="font-size: xx-small;">,Memberid</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,COUNT(*) AS no_Claims</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT ProviderID) AS no_Providers</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT Vendor) AS no_Vendors</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT PCP) AS no_PCPs</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT PlaceSvc) AS no_PlaceSvcs</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT Specialty) AS no_Specialities</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT PrimaryConditionGroup) AS no_PrimaryConditionGroups</span><br />
<span style="font-size: xx-small;">,COUNT(DISTINCT ProcedureGroup) AS no_ProcedureGroups</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,MAX(PayDelayI) AS PayDelay_max</span><br />
<span style="font-size: xx-small;">,MIN(PayDelayI) AS PayDelay_min</span><br />
<span style="font-size: xx-small;">,AVG(PayDelayI) AS PayDelay_ave</span><br />
<span style="font-size: xx-small;">,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(PayDelayI) END) AS PayDelay_stdev</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,MAX(LengthOfStayI) AS LOS_max</span><br />
<span style="font-size: xx-small;">,MIN(LengthOfStayI) AS LOS_min</span><br />
<span style="font-size: xx-small;">,AVG(LengthOfStayI) AS LOS_ave</span><br />
<span style="font-size: xx-small;">,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(LengthOfStayI) END) AS LOS_stdev</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 0 THEN 1 ELSE 0 END) AS LOS_TOT_UNKNOWN</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN LENGTHOFSTAY IS NULL AND SUPLOS = 1 THEN 1 ELSE 0 END) AS LOS_TOT_SUPRESSED </span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN LENGTHOFSTAY IS NOT NULL THEN 1 ELSE 0 END) AS LOS_TOT_KNOWN </span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,MAX(dsfsI) AS dsfs_max</span><br />
<span style="font-size: xx-small;">,MIN(dsfsI) AS dsfs_min</span><br />
<span style="font-size: xx-small;">,MAX(dsfsI) - MIN(dsfsI) AS dsfs_range</span><br />
<span style="font-size: xx-small;">,AVG(dsfsI) AS dsfs_ave</span><br />
<span style="font-size: xx-small;">,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(dsfsI) END) AS dsfs_stdev</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,MAX(CharlsonIndexI) AS CharlsonIndexI_max</span><br />
<span style="font-size: xx-small;">,MIN(CharlsonIndexI) AS CharlsonIndexI_min</span><br />
<span style="font-size: xx-small;">,AVG(CharlsonIndexI) AS CharlsonIndexI_ave</span><br />
<span style="font-size: xx-small;">,MAX(CharlsonIndexI) - MIN(CharlsonIndexI) AS CharlsonIndexI_range</span><br />
<span style="font-size: xx-small;">,(CASE WHEN COUNT(*) = 1 THEN 0 ELSE STDEV(CharlsonIndexI) END) AS CharlsonIndexI_stdev</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'MSC2a3' THEN 1 ELSE 0 END) AS pcg1</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'METAB3' THEN 1 ELSE 0 END) AS pcg2</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'ARTHSPIN' THEN 1 ELSE 0 END) AS pcg3</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'NEUMENT' THEN 1 ELSE 0 END) AS pcg4</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'RESPR4' THEN 1 ELSE 0 END) AS pcg5</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'MISCHRT' THEN 1 ELSE 0 END) AS pcg6</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'SKNAUT' THEN 1 ELSE 0 END) AS pcg7</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'GIBLEED' THEN 1 ELSE 0 END) AS pcg8</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'INFEC4' THEN 1 ELSE 0 END) AS pcg9</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'TRAUMA' THEN 1 ELSE 0 END) AS pcg10</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'HEART2' THEN 1 ELSE 0 END) AS pcg11</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL3' THEN 1 ELSE 0 END) AS pcg12</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'ROAMI' THEN 1 ELSE 0 END) AS pcg13</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL5' THEN 1 ELSE 0 END) AS pcg14</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'ODaBNCA' THEN 1 ELSE 0 END) AS pcg15</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'UTI' THEN 1 ELSE 0 END) AS pcg16</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'COPD' THEN 1 ELSE 0 END) AS pcg17</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'GYNEC1' THEN 1 ELSE 0 END) AS pcg18</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRB' THEN 1 ELSE 0 END) AS pcg19</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'FXDISLC' THEN 1 ELSE 0 END) AS pcg20</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'AMI' THEN 1 ELSE 0 END) AS pcg21</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'PRGNCY' THEN 1 ELSE 0 END) AS pcg22</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'HEMTOL' THEN 1 ELSE 0 END) AS pcg23</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'HEART4' THEN 1 ELSE 0 END) AS pcg24</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'SEIZURE' THEN 1 ELSE 0 END) AS pcg25</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'APPCHOL' THEN 1 ELSE 0 END) AS pcg26</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'CHF' THEN 1 ELSE 0 END) AS pcg27</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'GYNECA' THEN 1 ELSE 0 END) AS pcg28</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup IS NULL THEN 1 ELSE 0 END) AS pcg29</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'PNEUM' THEN 1 ELSE 0 END) AS pcg30</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL2' THEN 1 ELSE 0 END) AS pcg31</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'GIOBSENT' THEN 1 ELSE 0 END) AS pcg32</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'STROKE' THEN 1 ELSE 0 END) AS pcg33</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRA' THEN 1 ELSE 0 END) AS pcg34</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'FLaELEC' THEN 1 ELSE 0 END) AS pcg35</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'MISCL1' THEN 1 ELSE 0 END) AS pcg36</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'HIPFX' THEN 1 ELSE 0 END) AS pcg37</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'METAB1' THEN 1 ELSE 0 END) AS pcg38</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'PERVALV' THEN 1 ELSE 0 END) AS pcg39</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'LIVERDZ' THEN 1 ELSE 0 END) AS pcg40</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'CATAST' THEN 1 ELSE 0 END) AS pcg41</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'CANCRM' THEN 1 ELSE 0 END) AS pcg42</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'PERINTL' THEN 1 ELSE 0 END) AS pcg43</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'PNCRDZ' THEN 1 ELSE 0 END) AS pcg44</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'RENAL1' THEN 1 ELSE 0 END) AS pcg45</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PrimaryConditionGroup = 'SEPSIS' THEN 1 ELSE 0 END) AS pcg46</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Internal' THEN 1 ELSE 0 END) AS sp1</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Laboratory' THEN 1 ELSE 0 END) AS sp2</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'General Practice' THEN 1 ELSE 0 END) AS sp3</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Surgery' THEN 1 ELSE 0 END) AS sp4</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Diagnostic Imaging' THEN 1 ELSE 0 END) AS sp5</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Emergency' THEN 1 ELSE 0 END) AS sp6</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Other' THEN 1 ELSE 0 END) AS sp7</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Pediatrics' THEN 1 ELSE 0 END) AS sp8</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Rehabilitation' THEN 1 ELSE 0 END) AS sp9</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Obstetrics and Gynecology' THEN 1 ELSE 0 END) AS sp10</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Anesthesiology' THEN 1 ELSE 0 END) AS sp11</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty = 'Pathology' THEN 1 ELSE 0 END) AS sp12</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN Specialty IS NULL THEN 1 ELSE 0 END) AS sp13</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'EM' THEN 1 ELSE 0 END ) AS pg1</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'PL' THEN 1 ELSE 0 END ) AS pg2</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'MED' THEN 1 ELSE 0 END ) AS pg3</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SCS' THEN 1 ELSE 0 END ) AS pg4</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'RAD' THEN 1 ELSE 0 END ) AS pg5</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SDS' THEN 1 ELSE 0 END ) AS pg6</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SIS' THEN 1 ELSE 0 END ) AS pg7</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SMS' THEN 1 ELSE 0 END ) AS pg8</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'ANES' THEN 1 ELSE 0 END ) AS pg9</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SGS' THEN 1 ELSE 0 END ) AS pg10</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SEOA' THEN 1 ELSE 0 END ) AS pg11</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SRS' THEN 1 ELSE 0 END ) AS pg12</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SNS' THEN 1 ELSE 0 END ) AS pg13</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SAS' THEN 1 ELSE 0 END ) AS pg14</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SUS' THEN 1 ELSE 0 END ) AS pg15</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup IS NULL THEN 1 ELSE 0 END ) AS pg16</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SMCD' THEN 1 ELSE 0 END ) AS pg17</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN ProcedureGroup = 'SO' THEN 1 ELSE 0 END ) AS pg18</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Office' THEN 1 ELSE 0 END) AS ps1</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Independent Lab' THEN 1 ELSE 0 END) AS ps2</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Urgent Care' THEN 1 ELSE 0 END) AS ps3</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Outpatient Hospital' THEN 1 ELSE 0 END) AS ps4</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Inpatient Hospital' THEN 1 ELSE 0 END) AS ps5</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Ambulance' THEN 1 ELSE 0 END) AS ps6</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Other' THEN 1 ELSE 0 END) AS ps7</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc = 'Home' THEN 1 ELSE 0 END) AS ps8</span><br />
<span style="font-size: xx-small;">,SUM(CASE WHEN PlaceSvc IS NULL THEN 1 ELSE 0 END) AS ps9</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">INTO claims_per_member</span><br />
<span style="font-size: xx-small;">FROM Claims</span><br />
<span style="font-size: xx-small;">GROUP BY year,Memberid</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- remove some nulls</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET LOS_max = 0 WHERE LOS_max IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET LOS_min = 0 WHERE LOS_min IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET LOS_ave = 0 WHERE LOS_ave IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET LOS_stdev = -1 WHERE LOS_stdev IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET dsfs_max = 0 WHERE dsfs_max IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET dsfs_min = 0 WHERE dsfs_min IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET dsfs_ave = 0 WHERE dsfs_ave IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET dsfs_stdev = -1 WHERE dsfs_stdev IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET dsfs_range = -1 WHERE dsfs_range IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE claims_per_member</span><br />
<span style="font-size: xx-small;">SET CharlsonIndexI_range = -1 WHERE CharlsonIndexI_range IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/***********************************</span><br />
<span style="font-size: xx-small;">Members</span><br />
<span style="font-size: xx-small;">***********************************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- create binary flags for age</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_05 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_15 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_25 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_35 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_45 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_55 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_65 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_75 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_85 INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members ADD age_MISS INT</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_05 = CASE WHEN ageATfirstclaim = '0-9' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_15 = CASE WHEN ageATfirstclaim = '10-19' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_25 = CASE WHEN ageATfirstclaim = '20-29' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_35 = CASE WHEN ageATfirstclaim = '30-39' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_45 = CASE WHEN ageATfirstclaim = '40-49' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_55 = CASE WHEN ageATfirstclaim = '50-59' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_65 = CASE WHEN ageATfirstclaim = '60-69' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_75 = CASE WHEN ageATfirstclaim = '70-79' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_85 = CASE WHEN ageATfirstclaim = '80+' THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;">UPDATE Members SET age_MISS = CASE WHEN ageATfirstclaim IS NULL THEN 1 ELSE 0 END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">--create binary flags for sex</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members</span><br />
<span style="font-size: xx-small;">ADD sexMALE INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Members</span><br />
<span style="font-size: xx-small;">SET SexMALE = </span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN Sex = 'M' THEN 1 ELSE 0</span><br />
<span style="font-size: xx-small;">END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members</span><br />
<span style="font-size: xx-small;">ADD sexFEMALE INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Members</span><br />
<span style="font-size: xx-small;">SET SexFEMALE = </span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN Sex = 'F' THEN 1 ELSE 0</span><br />
<span style="font-size: xx-small;">END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE Members</span><br />
<span style="font-size: xx-small;">ADD sexMISS INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE Members</span><br />
<span style="font-size: xx-small;">SET SexMISS = </span><br />
<span style="font-size: xx-small;">CASE</span><br />
<span style="font-size: xx-small;">WHEN Sex IS NULL THEN 1 ELSE 0</span><br />
<span style="font-size: xx-small;">END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/******************</span><br />
<span style="font-size: xx-small;">DRUG COUNTS</span><br />
<span style="font-size: xx-small;">******************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- convert to integers</span><br />
<span style="font-size: xx-small;">ALTER TABLE drugcount ADD DrugCountI INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;">UPDATE DRUGCOUNT</span><br />
<span style="font-size: xx-small;">SET DrugCountI = </span><br />
<span style="font-size: xx-small;">CASE WHEN DrugCount = '7+' THEN 7 ELSE CAST(DrugCount AS INT) END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT</span><br />
<span style="font-size: xx-small;">memberID AS memberID_dc</span><br />
<span style="font-size: xx-small;">,Year AS YEAR_dc</span><br />
<span style="font-size: xx-small;">,MAX(drugcountI) AS drugCount_max</span><br />
<span style="font-size: xx-small;">,MIN(drugcountI) AS drugCount_min</span><br />
<span style="font-size: xx-small;">,AVG(drugcountI * 1.0) AS drugCount_ave</span><br />
<span style="font-size: xx-small;">,COUNT(*) AS drugcount_months</span><br />
<span style="font-size: xx-small;">INTO DRUGCOUNT_SUMMARY</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">drugcount</span><br />
<span style="font-size: xx-small;">GROUP BY </span><br />
<span style="font-size: xx-small;">memberID</span><br />
<span style="font-size: xx-small;">,Year</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/******************</span><br />
<span style="font-size: xx-small;">LAB COUNTS</span><br />
<span style="font-size: xx-small;">******************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- convert to integers</span><br />
<span style="font-size: xx-small;">ALTER TABLE LabCount ADD LabCountI INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;">UPDATE LabCount</span><br />
<span style="font-size: xx-small;">SET LabCountI = </span><br />
<span style="font-size: xx-small;">CASE WHEN LabCount = '10+' THEN 10 ELSE CAST(LabCount AS INT) END</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT</span><br />
<span style="font-size: xx-small;">memberID AS memberID_lc</span><br />
<span style="font-size: xx-small;">,Year AS YEAR_lc</span><br />
<span style="font-size: xx-small;">,MAX(labcountI) AS labCount_max</span><br />
<span style="font-size: xx-small;">,MIN(labcountI) AS labCount_min</span><br />
<span style="font-size: xx-small;">,AVG(labcountI * 1.0) AS labCount_ave</span><br />
<span style="font-size: xx-small;">,COUNT(*) AS labcount_months</span><br />
<span style="font-size: xx-small;">INTO LABCOUNT_SUMMARY</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">labcount</span><br />
<span style="font-size: xx-small;">GROUP BY </span><br />
<span style="font-size: xx-small;">memberID</span><br />
<span style="font-size: xx-small;">,Year</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/********************************</span><br />
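<span style="font-size: xx-small;">Targets</span><br />
<span style="font-size: xx-small;">YEAR_t is the year of the claims used as inputs:</span><br />
<span style="font-size: xx-small;">Y1 claims are paired with DaysInHospital_Y2,</span><br />
<span style="font-size: xx-small;">Y2 claims with DaysInHospital_Y3, and</span><br />
<span style="font-size: xx-small;">Y3 claims with the unlabelled Target rows</span><br />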
<span style="font-size: xx-small;">********************************/</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT * </span><br />
<span style="font-size: xx-small;">INTO DIH</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">(</span><br />
<span style="font-size: xx-small;">SELECT </span><br />
<span style="font-size: xx-small;">MemberID AS MemberID_t</span><br />
<span style="font-size: xx-small;">,'Y1' AS YEAR_t</span><br />
<span style="font-size: xx-small;">,ClaimsTruncated</span><br />
<span style="font-size: xx-small;">,DaysInHospital </span><br />
<span style="font-size: xx-small;">,1 AS trainset</span><br />
<span style="font-size: xx-small;">FROM DaysInHospital_Y2</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UNION ALL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT </span><br />
<span style="font-size: xx-small;">MemberID AS MemberID_t</span><br />
<span style="font-size: xx-small;">,'Y2' AS YEAR_t</span><br />
<span style="font-size: xx-small;">,ClaimsTruncated</span><br />
<span style="font-size: xx-small;">,DaysInHospital </span><br />
<span style="font-size: xx-small;">,1 AS trainset</span><br />
<span style="font-size: xx-small;">FROM DaysInHospital_Y3</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UNION ALL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT</span><br />
<span style="font-size: xx-small;">MemberID AS MemberID_t</span><br />
<span style="font-size: xx-small;">,'Y3' AS YEAR_t</span><br />
<span style="font-size: xx-small;">,ClaimsTruncated</span><br />
<span style="font-size: xx-small;">,null AS DaysInHospital</span><br />
<span style="font-size: xx-small;">,0 AS trainset</span><br />
<span style="font-size: xx-small;">FROM Target</span><br />
<span style="font-size: xx-small;">) a</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">/*****************************</span><br />
<span style="font-size: xx-small;">Now merge them all together to</span><br />
<span style="font-size: xx-small;">create the modeling data set</span><br />
<span style="font-size: xx-small;">******************************/</span><br />
<span style="font-size: xx-small;">SELECT a.*,b.*</span><br />
<span style="font-size: xx-small;">INTO #temp1</span><br />
<span style="font-size: xx-small;">FROM </span><br />
<span style="font-size: xx-small;">DIH a</span><br />
<span style="font-size: xx-small;">LEFT OUTER JOIN </span><br />
<span style="font-size: xx-small;">members b</span><br />
<span style="font-size: xx-small;">on a.MemberID_t = B.Memberid_M</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp1 DROP COLUMN Memberid_M</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp1 DROP COLUMN AgeAtFirstClaim</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp1 DROP COLUMN Sex</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT a.*,b.*</span><br />
<span style="font-size: xx-small;">INTO #temp2</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">#temp1 a</span><br />
<span style="font-size: xx-small;">LEFT OUTER JOIN</span><br />
<span style="font-size: xx-small;">claims_per_member b</span><br />
<span style="font-size: xx-small;">on a.MemberID_t = B.Memberid</span><br />
<span style="font-size: xx-small;">AND a.YEAR_t = b.year</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp2 DROP COLUMN Memberid</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp2 DROP COLUMN year</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT a.*,b.*</span><br />
<span style="font-size: xx-small;">INTO #temp3</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">#temp2 a</span><br />
<span style="font-size: xx-small;">LEFT OUTER JOIN</span><br />
<span style="font-size: xx-small;">DRUGCOUNT_SUMMARY b</span><br />
<span style="font-size: xx-small;">on a.MemberID_t = B.Memberid_dc</span><br />
<span style="font-size: xx-small;">AND a.YEAR_t = b.YEAR_dc</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp3 DROP COLUMN Memberid_dc</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp3 DROP COLUMN YEAR_dc</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT a.*,b.*</span><br />
<span style="font-size: xx-small;">INTO #temp4</span><br />
<span style="font-size: xx-small;">FROM</span><br />
<span style="font-size: xx-small;">#temp3 a</span><br />
<span style="font-size: xx-small;">LEFT OUTER JOIN</span><br />
<span style="font-size: xx-small;">LABCOUNT_SUMMARY b</span><br />
<span style="font-size: xx-small;">on a.MemberID_t = B.Memberid_lc</span><br />
<span style="font-size: xx-small;">AND a.YEAR_t = b.YEAR_lc</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp4 DROP COLUMN Memberid_lc</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp4 DROP COLUMN YEAR_lc</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">-- remove nulls for those who had </span><br />
<span style="font-size: xx-small;">-- no lab or drug information</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp4 ADD labNull INT</span><br />
<span style="font-size: xx-small;">ALTER TABLE #temp4 ADD drugNull INT</span><br />
<span style="font-size: xx-small;">GO</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labNull = 0</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labNull = 1 WHERE labCount_max IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugNull = 0</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugNull = 1 WHERE drugCount_max IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labCount_max = 0 WHERE labCount_max IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labCount_min = 0 WHERE labCount_min IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labCount_ave = 0 WHERE labCount_ave IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET labcount_months = 0 WHERE labcount_months IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugCount_max = 0 WHERE drugCount_max IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugCount_min = 0 WHERE drugCount_min IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugCount_ave = 0 WHERE drugCount_ave IS NULL</span><br />
<span style="font-size: xx-small;">UPDATE #temp4 SET drugcount_months = 0 WHERE drugcount_months IS NULL</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;">SELECT * </span><br />
<span style="font-size: xx-small;">INTO modelling_set</span><br />
<span style="font-size: xx-small;">FROM #temp4</span><br />
<span style="font-size: xx-small;"><br />
</span><br />
<span style="font-size: xx-small;"><br />
</span></div><br />
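The SQL above leans on two patterns: conditional aggregation (SUM(CASE WHEN ... THEN 1 ELSE 0 END)) to turn a categorical column into per-member counts, and a missing-value flag (labNull, drugNull) that is set before the NULLs left by the outer join are replaced with 0. The following is a minimal sketch of both patterns, not the competition script itself. It assumes Python's built-in sqlite3 module in place of SQL Server so that it runs anywhere, and the toy rows, the build_features helper and the ps_office, ps_urgent and ps_null column names are hypothetical.<br />

```python
import sqlite3

# Hypothetical toy rows; the real tables come from the competition data.
CLAIMS = [
    (1, "Y1", "Office"),
    (1, "Y1", "Office"),
    (1, "Y1", "Urgent Care"),
    (2, "Y1", None),          # claim with a missing PlaceSvc
]
LABS = [(1, "Y1", 3), (1, "Y1", 10)]  # member 2 has no lab rows at all


def build_features(claims, labs):
    """Return {MemberID: (ps_office, ps_urgent, ps_null, labCount_max, labNull)}."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Claims (MemberID INT, Year TEXT, PlaceSvc TEXT)")
    con.execute("CREATE TABLE LabCount (MemberID INT, Year TEXT, LabCountI INT)")
    con.executemany("INSERT INTO Claims VALUES (?, ?, ?)", claims)
    con.executemany("INSERT INTO LabCount VALUES (?, ?, ?)", labs)
    rows = con.execute(
        """
        SELECT c.MemberID, c.ps_office, c.ps_urgent, c.ps_null,
               -- replace the NULL left by the outer join with 0 ...
               COALESCE(l.labCount_max, 0) AS labCount_max,
               -- ... but keep a flag recording that it was missing
               CASE WHEN l.labCount_max IS NULL THEN 1 ELSE 0 END AS labNull
        FROM (
            -- one row per member and year, one count column per category
            SELECT MemberID, Year,
                   SUM(CASE WHEN PlaceSvc = 'Office' THEN 1 ELSE 0 END) AS ps_office,
                   SUM(CASE WHEN PlaceSvc = 'Urgent Care' THEN 1 ELSE 0 END) AS ps_urgent,
                   SUM(CASE WHEN PlaceSvc IS NULL THEN 1 ELSE 0 END) AS ps_null
            FROM Claims
            GROUP BY Year, MemberID
        ) c
        LEFT OUTER JOIN (
            SELECT MemberID, Year, MAX(LabCountI) AS labCount_max
            FROM LabCount
            GROUP BY MemberID, Year
        ) l
          ON c.MemberID = l.MemberID AND c.Year = l.Year
        ORDER BY c.MemberID
        """
    ).fetchall()
    con.close()
    return {row[0]: row[1:] for row in rows}


features = build_features(CLAIMS, LABS)
for member_id, values in features.items():
    print(member_id, values)
```

Keeping the flag alongside the zero-filled value lets a model tell "no lab records" apart from "a lab count of zero", which is presumably why the script adds labNull and drugNull before overwriting the NULLs.<br />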
R code<br />
<br />
<div style="border-color: black; border-style: solid; border-width: 1px; height: 300px; overflow: auto; width: 500px;"><span style="font-size: xx-small;"><br />
</span><br />
<pre style="background: none repeat scroll 0% 0% rgb(255, 255, 255); color: black;"><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># Example GBM model for HHP</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># scores ~ 0.4635 on leaderboard</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># which would be 55th position of 510</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># as at 9th Sept 2011</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># Requires the data having been prepared</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># using the SQL supplied</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
starttime </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> proc</span><span style="color: #808030; font-size: xx-small;">.</span><span style="color: #400000; font-size: xx-small;">time</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#load the data</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
library</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">RODBC</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#set a connection to the database </span><span style="font-size: xx-small;">
conn </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> odbcDriverConnect</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #0000e6; font-size: xx-small;">"driver=SQL Server;database=HHP_comp;server=servername</span><span style="color: #0f69ff; font-size: xx-small;">\\</span><span style="color: #0000e6; font-size: xx-small;">SQLEXPRESS;"</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#or this method involves setting up a DSN (Data Source Name) called HHP_comp</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#conn <- odbcConnect("HHP_comp")</span><span style="font-size: xx-small;">
alldata </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> sqlQuery</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">conn</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: #0000e6; font-size: xx-small;">"select * from modelling_set"</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># arrange the data</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#identify train and leaderboard data</span><span style="font-size: xx-small;">
trainrows </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> which</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">alldata$trainset </span><span style="color: #808030; font-size: xx-small;">=</span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">1</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
scorerows </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> which</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">alldata$trainset </span><span style="color: #808030; font-size: xx-small;">=</span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">0</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#sanity check the size of each set</span><span style="font-size: xx-small;">
</span><span style="color: #400000; font-size: xx-small;">length</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">trainrows</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: #400000; font-size: xx-small;">length</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">scorerows</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#display the column names</span><span style="font-size: xx-small;">
colnames</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">alldata</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#memberid is required as key for submission set</span><span style="font-size: xx-small;">
memberid </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> alldata</span><span style="color: #808030; font-size: xx-small;">[</span><span style="font-size: xx-small;">scorerows</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: #0000e6; font-size: xx-small;">'MemberID_t'</span><span style="color: #808030; font-size: xx-small;">]</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#remove redundant fields</span><span style="font-size: xx-small;">
alldata$MemberID_t </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> NULL
alldata$YEAR_t </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> NULL
alldata$trainset </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> NULL
</span><span style="color: dimgrey; font-size: xx-small;">#target - what we are predicting</span><span style="font-size: xx-small;">
theTarget </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> </span><span style="color: #0000e6; font-size: xx-small;">'DaysInHospital'</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#put the target on the log scale</span><span style="font-size: xx-small;">
alldata</span><span style="color: #808030; font-size: xx-small;">[</span><span style="font-size: xx-small;">trainrows</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">theTarget</span><span style="color: #808030; font-size: xx-small;">]</span><span style="font-size: xx-small;"> </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> log1p</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">alldata</span><span style="color: #808030; font-size: xx-small;">[</span><span style="font-size: xx-small;">trainrows</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">theTarget</span><span style="color: #808030; font-size: xx-small;">]</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#find the position of the target</span><span style="font-size: xx-small;">
targindex </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> which</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">names</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">alldata</span><span style="color: #808030; font-size: xx-small;">)</span><span style="color: #808030; font-size: xx-small;">=</span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;">theTarget</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;"># build the model</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#GBM model settings, these can be varied</span><span style="font-size: xx-small;">
GBM_NTREES </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">500</span><span style="font-size: xx-small;">
GBM_SHRINKAGE </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">0.</span><span style="color: #008c00; font-size: xx-small;">05</span><span style="font-size: xx-small;">
GBM_DEPTH </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">4</span><span style="font-size: xx-small;">
GBM_MINOBS </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> </span><span style="color: #008c00; font-size: xx-small;">50</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#build the GBM model</span><span style="font-size: xx-small;">
library</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">gbm</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
GBM_model </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> gbm</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">fit</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">
</span><span style="color: #808030; font-size: xx-small;">x</span><span style="font-size: xx-small;"> </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> alldata</span><span style="color: #808030; font-size: xx-small;">[</span><span style="font-size: xx-small;">trainrows</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;">targindex</span><span style="color: #808030; font-size: xx-small;">]</span><span style="font-size: xx-small;">
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: maroon; font-size: xx-small; font-weight: bold;">y </span><span style="color: maroon; font-size: xx-small;">=</span><span style="color: #0000e6; font-size: xx-small;"> alldata[trainrows,targindex]</span><span style="font-size: xx-small;">
</span><span style="color: #0000e6; font-size: xx-small;"> ,distribution </span><span style="color: maroon; font-size: xx-small;">=</span><span style="color: #0000e6; font-size: xx-small;"> "gaussian"</span><span style="font-size: xx-small;">
</span><span style="color: #0000e6; font-size: xx-small;"> ,n.trees </span><span style="color: maroon; font-size: xx-small;">=</span><span style="font-size: xx-small;"> GBM_NTREES
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">shrinkage </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> GBM_SHRINKAGE
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">interaction</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">depth </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> GBM_DEPTH
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">n</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">minobsinnode </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> GBM_MINOBS
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">verbose </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> TRUE</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#list variable importance</span><span style="font-size: xx-small;">
summary</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">GBM_model</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">GBM_NTREES</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#predict for the leaderboard data</span><span style="font-size: xx-small;">
prediction </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> predict</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">gbm</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">object </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> GBM_model
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">newdata </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> alldata</span><span style="color: #808030; font-size: xx-small;">[</span><span style="font-size: xx-small;">scorerows</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;">targindex</span><span style="color: #808030; font-size: xx-small;">]</span><span style="font-size: xx-small;">
</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">GBM_NTREES</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#put on correct scale and cap</span><span style="font-size: xx-small;">
prediction </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> expm1</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">prediction</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
prediction </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> pmin</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #008c00; font-size: xx-small;">15</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">prediction</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
prediction </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> pmax</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #008c00; font-size: xx-small;">0</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">prediction</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#plot the submission distribution</span><span style="font-size: xx-small;">
hist</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">prediction</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;"> breaks</span><span style="color: #808030; font-size: xx-small;">=</span><span style="color: #008c00; font-size: xx-small;">500</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">#write the submission to file</span><span style="font-size: xx-small;">
</span><span style="color: dimgrey; font-size: xx-small;">########################################</span><span style="font-size: xx-small;">
submission </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> cbind</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">memberid</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">prediction</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
colnames</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">submission</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;"> </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> c</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #0000e6; font-size: xx-small;">"MemberID"</span><span style="color: #808030; font-size: xx-small;">,</span><span style="color: #0000e6; font-size: xx-small;">"DaysInHospital"</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
fnname </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> </span><span style="color: #0000e6; font-size: xx-small;">"C:</span><span style="color: #0f69ff; font-size: xx-small;">\\</span><span style="color: #0000e6; font-size: xx-small;">GBM_demo1.csv"</span><span style="font-size: xx-small;">
</span><span style="color: #400000; font-size: xx-small;">write</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">csv</span><span style="color: #808030; font-size: xx-small;">(</span><span style="font-size: xx-small;">submission</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;"> file</span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;">fnname</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;"> row</span><span style="color: #808030; font-size: xx-small;">.</span><span style="font-size: xx-small;">names </span><span style="color: #808030; font-size: xx-small;">=</span><span style="font-size: xx-small;"> FALSE</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
elapsedtime </span><span style="color: #808030; font-size: xx-small;"><</span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> proc</span><span style="color: #808030; font-size: xx-small;">.</span><span style="color: #400000; font-size: xx-small;">time</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;"> </span><span style="color: #808030; font-size: xx-small;">-</span><span style="font-size: xx-small;"> starttime
cat</span><span style="color: #808030; font-size: xx-small;">(</span><span style="color: #0000e6; font-size: xx-small;">"</span><span style="color: #0f69ff; font-size: xx-small;">\n</span><span style="color: #0000e6; font-size: xx-small;">Finished</span><span style="color: #0f69ff; font-size: xx-small;">\n</span><span style="color: #0000e6; font-size: xx-small;">"</span><span style="color: #808030; font-size: xx-small;">,</span><span style="font-size: xx-small;">elapsedtime</span><span style="color: #808030; font-size: xx-small;">)</span><span style="font-size: xx-small;">
</span></pre></div><br />
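<span style="font-family: Arial,Helvetica,sans-serif;">The script above trains on log1p(DaysInHospital) and converts predictions back with expm1 before capping them. A minimal sketch of why that helps, assuming the leaderboard is scored with a root mean squared logarithmic error: minimising squared error on the log1p scale is then the same as minimising the competition error. The rmsle helper and all the numbers below are hypothetical and for illustration only; they are not competition data.</span><br />

```r
# Root mean squared logarithmic error: an RMSE computed on the log1p scale.
# (Hypothetical helper, assumed to mirror the leaderboard metric.)
rmsle <- function(predicted, actual) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up values, not competition data.
actual <- c(0, 0, 1, 3, 15)
# What a model fitted to log1p(actual) might return (still on the log scale).
log_scale_pred <- c(0.05, 0.10, 0.60, 1.20, 2.70)

# Back-transform and cap to the valid range, as in the script above.
pred <- pmax(0, pmin(15, expm1(log_scale_pred)))

# While the cap does not bite, RMSLE on the original scale is just
# the plain RMSE measured on the log scale.
rmse_log <- sqrt(mean((log_scale_pred - log1p(actual))^2))
print(c(rmsle = rmsle(pred, actual), rmse_log = rmse_log))
```

<span style="font-family: Arial,Helvetica,sans-serif;">The two numbers agree here because none of the back-transformed predictions falls outside the 0 to 15 range; once a prediction is capped, the two measures can differ.</span><br />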
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com14tag:blogger.com,1999:blog-8236580214478682937.post-44233091381856322912011-08-31T23:30:00.000-07:002011-09-01T00:16:39.299-07:00First Milestone<div dir="ltr" style="text-align: left;" trbidi="on"><div style="font-family: Arial,Helvetica,sans-serif;">The time has now been and gone for submitting an entry for the first milestone, and the chance to win either $30,000 or $20,000 for the top two places.</div><div style="font-family: Arial,Helvetica,sans-serif;"><br />
</div><div style="font-family: Arial,Helvetica,sans-serif;">The images below show the state of play (on the 30% visible data!) just after the first milestone (I didn't capture the image in time, so some teams have made an extra submission since).</div><div style="font-family: Arial,Helvetica,sans-serif;"></div><div style="font-family: Arial,Helvetica,sans-serif;">There was a bit of movement in the last 2 days.</div><div style="font-family: Arial,Helvetica,sans-serif;"><br />
</div><ul style="text-align: left;"><li><span style="font-family: Arial,Helvetica,sans-serif;"> Tom & Jerry changed their name to reveal themselves (Petterson & Caetano). This was interesting for me as they are from a relatively new machine learning lab here in Australia that I had not been aware of. They jumped up to 3rd position before the last submission but were overhauled at the last minute</span></li>
<li><span style="font-family: Arial,Helvetica,sans-serif;"> Willem Mestrom made a big leap on his last submission to get into 3rd position. He was obviously keeping his best (or was he?) until the very last moment. This might be a good tactical play if he is looking towards the $3 million rather than the smaller short-term prizes.</span></li>
<li><span style="font-family: Arial,Helvetica,sans-serif;">Planet Melbourne, after making consistent daily submissions throughout the competition, did not take the opportunity to make one last submission.</span></li>
<li><span style="font-family: Arial,Helvetica,sans-serif;">It is good to see Chris R still in the top 10. Chris has been up there since the very start. Interestingly, Chris was cut off from submitting for a while when Hurricane Irene took his power away.</span></li>
</ul><span style="font-family: Arial,Helvetica,sans-serif;"> Interestingly, positions 2, 3 & 4 are now taken by members of <a href="http://www.the-ensemble.com/">The Ensemble</a> of Netflix fame. Position 1 contains Dave, who was in an original team called Ensemble Team and also entered Netflix as a member of <a href="http://www.findmorefives.com/">Ensemble Experts</a></span><span style="font-family: Arial,Helvetica,sans-serif;">, and I (also a member of Market Makers) ran a competition on <a href="http://www.tiberius.biz/ausdm09/index.html">Ensembling the Netflix Data</a> and have looked at <a href="http://www.tiberius.biz/pakdd07.html">ensembling other competitions</a>.</span><br />
<span style="font-family: Arial,Helvetica,sans-serif;">There seems to be a pattern here!</span><br />
<div><span style="font-family: Arial,Helvetica,sans-serif;"> </span><ul style="text-align: left;"></ul><div style="font-family: Arial,Helvetica,sans-serif;"></div><div style="font-family: Arial,Helvetica,sans-serif;">Note you can click on the images to enlarge them.</div><div style="font-family: Arial,Helvetica,sans-serif;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIFBRx7Oje-k2TW44CpFmfBVXrKRjNvCFuuIZ9z5GRVp5u75wL_XRCqfYwOJLBLnyjF5DZ4BNmVBvtDolb6wff9b7gCj7V-YiRgCVQ_abpMNHVaI6h_XazGlCRHm6XBNV3GyN1CM2M9b67/s1600/leaderboard_after_milestone.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIFBRx7Oje-k2TW44CpFmfBVXrKRjNvCFuuIZ9z5GRVp5u75wL_XRCqfYwOJLBLnyjF5DZ4BNmVBvtDolb6wff9b7gCj7V-YiRgCVQ_abpMNHVaI6h_XazGlCRHm6XBNV3GyN1CM2M9b67/s320/leaderboard_after_milestone.PNG" width="320" /></a></div><br />
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWNz2M1LspaiY5UCFNEP598TZ20dpl3wygi1GiV3yjlBDUkmXRqtgGBY_rJhs6q0Z-cfyDeWbKfcXBQFMpbJZikn3EyqyEP8CQ1OOePC9qUrsM7dJvliClMYXW9mkgggLFjokkLd7gkd1s/s1600/leaderboardrank.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWNz2M1LspaiY5UCFNEP598TZ20dpl3wygi1GiV3yjlBDUkmXRqtgGBY_rJhs6q0Z-cfyDeWbKfcXBQFMpbJZikn3EyqyEP8CQ1OOePC9qUrsM7dJvliClMYXW9mkgggLFjokkLd7gkd1s/s320/leaderboardrank.PNG" width="310" /></a></div><br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH8SBJkjmX-xPw08BTcBeXKgu1wvuHD-NI_Y63t2c1BH1u6WFfyhv9FX6Wi5fpyo1sAAcxzlYQBTUJpeNy-v2Qqmbrnh16sG8czWHYwJ9NpCYZCuCaaVLtg6czk-ISeofGtCwA_J1ykZNw/s1600/leaderboardranktop100.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH8SBJkjmX-xPw08BTcBeXKgu1wvuHD-NI_Y63t2c1BH1u6WFfyhv9FX6Wi5fpyo1sAAcxzlYQBTUJpeNy-v2Qqmbrnh16sG8czWHYwJ9NpCYZCuCaaVLtg6czk-ISeofGtCwA_J1ykZNw/s320/leaderboardranktop100.PNG" width="320" /></a></div><br />
<br />
<br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">The plot below shows the Lowess line fit of the error as a function of the number of entries submitted.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJocPGJDUV7uIYpQ_NVlpH2boSTtmFSWpoI5LZeAnlNiUPna9VkLnGraQUy12dOy5DS4X2kg5VOgA-xO9WSEbNeCb3p6MI4NKV3F2KFYSXQI__iflVuVQS_FVOhbd_R_emnQQE1NkNigh0/s1600/leaderboardloess.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJocPGJDUV7uIYpQ_NVlpH2boSTtmFSWpoI5LZeAnlNiUPna9VkLnGraQUy12dOy5DS4X2kg5VOgA-xO9WSEbNeCb3p6MI4NKV3F2KFYSXQI__iflVuVQS_FVOhbd_R_emnQQE1NkNigh0/s320/leaderboardloess.PNG" width="320" /></a></div><br />
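<span style="font-family: Arial,Helvetica,sans-serif;">A plot of this kind can be sketched with R's built-in lowess function. The leaderboard figures behind the image are not reproduced in this post, so the snippet below uses synthetic numbers purely as a stand-in; the variable names and the shape of the simulated trend are assumptions, not the real data.</span><br />

```r
# Synthetic stand-in for a leaderboard (NOT the real data):
# one row per team, with a score that tends to fall as entries rise.
set.seed(1)
entries <- sample(1:60, 200, replace = TRUE)
score <- 0.48 - 0.004 * log(entries) + rnorm(200, sd = 0.005)

# lowess() returns the smoothed curve as a list with components x and y,
# ordered by x; f is the smoother span (fraction of points in each local fit).
fit <- lowess(entries, score, f = 2/3)

# Scatter of the raw points with the lowess line drawn on top.
plot(entries, score, xlab = "Number of entries", ylab = "Leaderboard error")
lines(fit, col = "red", lwd = 2)
```

<span style="font-family: Arial,Helvetica,sans-serif;">A smaller f follows the points more closely, while a larger one gives a smoother line; 2/3 is the function's default.</span><br />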
</div></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com1tag:blogger.com,1999:blog-8236580214478682937.post-85997298058965524672011-08-27T22:29:00.000-07:002011-08-27T22:30:13.226-07:00First Milestone Just Around the Corner<div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-family: Arial,Helvetica,sans-serif;">There are about 4 days to go until the deadline for the first milestone, so I thought it would be interesting to capture the leaderboard now and again in 4 days, to see if things change wildly. At this stage it can be like a game of poker.</span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">All the top 10 teams are still in the hunt and making daily submissions. Tom & Jerry were late starters but have made astonishing progress to 3rd place. They currently split 3 others who were part of <a href="http://www.the-ensemble.com/">'The Ensemble'</a>, Edward, Willem and B.Yang.</span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">So many teams at the top all have a proven history in data mining competitions. I will be very happy if those ensemble dudes repeat their performance in the Netflix Prize ;-). </span><br />
<span style="font-family: Arial,Helvetica,sans-serif;"> </span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjurvzDlarTXrRO72NqEfHNbAwM5PrtNTfDbD7nEzyi46U1772X-dKUZrVxvQX_D4ww6EtSCb4f1gxUon4XlYtb8AUQs26npdldJFFx5sqLsvywhAbYamLSIKTXPtTaQ3zIVm2qNhZ_WVL8/s1600/leaderboard4daystogo.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjurvzDlarTXrRO72NqEfHNbAwM5PrtNTfDbD7nEzyi46U1772X-dKUZrVxvQX_D4ww6EtSCb4f1gxUon4XlYtb8AUQs26npdldJFFx5sqLsvywhAbYamLSIKTXPtTaQ3zIVm2qNhZ_WVL8/s320/leaderboard4daystogo.PNG" width="320" /></a></div><br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com2tag:blogger.com,1999:blog-8236580214478682937.post-79686711828346387892011-08-19T03:34:00.000-07:002011-08-19T03:38:11.227-07:001 minute of fame<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The HHP competition and Kaggle are catching the attention of the media over here in Australia.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I was on TV recently (and yes, it does add a few kg to you),</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<a href="http://www.abc.net.au/catalyst/stories/3296837.htm"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">http://www.abc.net.au/catalyst/stories/3296837.htm</span></a><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">and in the paper (2nd last paragraph interesting!),</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<a href="http://www.smh.com.au/business/miners-of-a-different-kind-in-line-for-riches-of-their-own-20110731-1i6bi.html"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">http://www.smh.com.au/business/miners-of-a-different-kind-in-line-for-riches-of-their-own-20110731-1i6bi.html</span></a><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Blog posts may be limited in the next 2 weeks as it's nose down to try and catch Edward.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com1tag:blogger.com,1999:blog-8236580214478682937.post-43861501244855333162011-08-03T09:02:00.000-07:002011-08-03T09:04:49.232-07:00Bath Mawr Coch<div dir="ltr" style="text-align: left;" trbidi="on"><br />
<span style="font-family: Arial,Helvetica,sans-serif;">In </span><a href="http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html" style="font-family: Arial,Helvetica,sans-serif;">one of my first posts</a><span style="font-family: Arial,Helvetica,sans-serif;"> you will see how my eagerness to gloat about my success unleashed Dave into the HHP. He has been at the top of the leaderboard since the beginning, whereas my progress was more gradual. Eventually we seemed to converge to a similar score and even held positions 1 & 2 for a while, which was very surprising to me given the caliber of contestants that this competition has attracted.</span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">We have now decided to join forces, hence the appearance of team Bath Mawr Coch. Hopefully there will be some synergy in our methods.</span><br />
<br />
<span style="font-family: Arial,Helvetica,sans-serif;">Dave and I are both competition junkies. We first came into contact as a result of the <a href="http://www.sigkdd.org/kddcup/index.php?section=2004&method=result">KDDCup 2004 contest</a>, which Dave won - and you will see my effort in a lowly 11th place. The leaderboard was left open after the conclusion of the contest, and <a href="http://osmot.cs.cornell.edu/cgi-bin/newtable.pl?prob=phy">I eventually got to top spot</a>, which is obviously not that impressive when the winners had already revealed what they did - but I had learnt things and been able to implement them. I then contacted Dave to gloat - and he immediately overtook me.</span><br />
<span style="font-family: Arial,Helvetica,sans-serif;"><br />
</span><br />
<span style="font-family: Arial,Helvetica,sans-serif;">We have since worked on a couple of other competitions together, and won the <a href="http://www.tiberius.biz/accuracy.html">NCDM Analytic challenge in 2008.</a></span><br />
<span style="font-family: Arial,Helvetica,sans-serif;"><br />
</span><br />
<span style="font-family: Arial,Helvetica,sans-serif;">Competitions like the HHP are a great way to learn and to also meet others with similar interests. I'm sure many collaborations and friendships will be made as a result, to the benefit of everyone.</span><br />
<span style="font-family: Arial,Helvetica,sans-serif;"><br />
</span><br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-34566670636144079312011-08-02T04:50:00.000-07:002011-08-02T05:15:57.906-07:00Gone Shopping<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b>Another New Comp</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Kaggle have just posted a new and very interesting competition,<a href="http://www.kaggle.com/c/dunnhumbychallenge"> Dunnhumby's Shopper Challenge</a>. The data is historical records of when customers visited a store and how much they spent. The goal is to predict when customers will next visit the store and how much they will spend.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">This is probably a novel real world data set and I expect the interest in this competition to be high. The reasons I think this are</span><br />
<br />
<ol style="text-align: left;"><li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The data is simple</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">There is not a massive amount of data, so processing power will not be an issue</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">It is a novel problem where creativity is needed and a new algorithm will probably have to be developed. It is not going to be a case of who can pre-process the data best and build the best ensemble using existing algorithms</span></li>
</ol><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b>So, where to start?</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The first thing I always do is run the data through my Nifty Tool <a href="http://anotherdataminingblog.blogspot.com/2011/05/progress-loading-hhp-data.html">(see here for example</a>) to check it and generate some SQL so I can load it into a database.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Here is what comes out...</span><br />
<br />
<div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><br />
<br />
<pre style="background: #ffffff; color: black;"><span style="color: #7f0055; font-weight: bold;">CREATE</span> <span style="color: #7f0055; font-weight: bold;">DATABASE</span> dunnhumby
<span style="color: #7f0055; font-weight: bold;">USE</span> dunnhumby
<span style="color: #7f0055; font-weight: bold;">CREATE</span> <span style="color: #7f0055; font-weight: bold;">TABLE</span> training
(
customer_id <span style="color: #7f0055; font-weight: bold;">int</span>
, visit_date <span style="color: #7f0055; font-weight: bold;">date</span>
, visit_spend <span style="color: #7f0055; font-weight: bold;">float</span>
)
<span style="color: #7f0055; font-weight: bold;">BULK</span> <span style="color: #7f0055; font-weight: bold;">INSERT</span> training
<span style="color: #7f0055; font-weight: bold;">FROM</span> <span style="color: #2a00ff;">'E:\comps\dunnhumby\training.csv'</span>
<span style="color: #7f0055; font-weight: bold;">WITH</span>
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = <span style="color: #2a00ff;">','</span>,
ROWTERMINATOR = <span style="color: #2a00ff;">'\n'</span>
)
<span style="color: #3f7f59;">--(12146637 row(s) affected)</span>
<span style="color: #7f0055; font-weight: bold;">CREATE</span> <span style="color: #7f0055; font-weight: bold;">TABLE</span> test
(
customer_id <span style="color: #7f0055; font-weight: bold;">int</span>
, visit_date <span style="color: #7f0055; font-weight: bold;">date</span>
, visit_spend <span style="color: #7f0055; font-weight: bold;">float</span>
)
<span style="color: #7f0055; font-weight: bold;">BULK</span> <span style="color: #7f0055; font-weight: bold;">INSERT</span> test
<span style="color: #7f0055; font-weight: bold;">FROM</span> <span style="color: #2a00ff;">'E:\comps\dunnhumby\test.csv'</span>
<span style="color: #7f0055; font-weight: bold;">WITH</span>
(
MAXERRORS = 0,
FIRSTROW = 2,
FIELDTERMINATOR = <span style="color: #2a00ff;">','</span>,
ROWTERMINATOR = <span style="color: #2a00ff;">'\n'</span>
)
</pre><br />
</div><br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The data set is relatively nice, just 3 columns:</span></div><ul style="text-align: left;"><li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">customer ID</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">date</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">amount spent</span></li>
</ul><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The dates are from 1st April 2010 to 18th July 2011.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b>A Naive Guess</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I always like to get in a submission without actually really looking at the data, just to make sure I am reading the data correctly and that the submission file is in the correct format. </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">In forecasting, a naive prediction is rarely far wrong. If you want to forecast the weather and have no other information to go on, saying tomorrow will be the same as today is going to serve you well in the long run. So for this problem, we will say the next spend will be the same as the last spend, and the gap in days will also be the same as the gap between the last two visits.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Achieving this in SQL is a bit tricky, but it can be done. The code below does the trick and when submitted gets 9.5% correct on the leaderboard, the same as the simple baseline benchmark. </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span></div><div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<pre style="background: #ffffff; color: black;"><span style="color: #3f7f59;">/*************************************************</span>
<span style="color: #3f7f59;">SQL to generate a naive prediction</span>
<span style="color: #3f7f59;">- spend will be the same as the last visit</span>
<span style="color: #3f7f59;">- the next visit will be in the same number</span>
<span style="color: #3f7f59;"> of days as the gap between the previous visit</span>
<span style="color: #3f7f59;">**************************************************/</span>
<span style="color: #3f7f59;">-- append a visit number to the data - visit 1 = most recent visit</span>
<span style="color: #7f0055; font-weight: bold;">select</span> <span style="background-attachment: initial; background-clip: initial; background-color: #dd0000; background-image: initial; background-origin: initial; color: white; font-style: italic; font-weight: bold;">*</span>, Rank() over (Partition <span style="color: #7f0055; font-weight: bold;">BY</span> customer_id <span style="color: #7f0055; font-weight: bold;">order</span> <span style="color: #7f0055; font-weight: bold;">by</span> visit_date <span style="color: #7f0055; font-weight: bold;">desc</span>) <span style="color: #7f0055; font-weight: bold;">as</span> visit_number
<span style="color: #7f0055; font-weight: bold;">into</span> <span style="color: #3f7f59;">#temp1</span>
<span style="color: #7f0055; font-weight: bold;">from</span> test
<span style="color: #3f7f59;">-- create field which is days since previous visit</span>
<span style="color: #7f0055; font-weight: bold;">select</span> a.*
,b.visit_date <span style="color: #7f0055; font-weight: bold;">as</span> previous_visit_date
,DATEDIFF(DD,b.visit_date,a.visit_date) <span style="color: #7f0055; font-weight: bold;">as</span> days_since_previous_visit
<span style="color: #7f0055; font-weight: bold;">into</span> <span style="color: #3f7f59;">#temp2</span>
<span style="color: #7f0055; font-weight: bold;">from</span> <span style="color: #3f7f59;">#temp1 a inner join #temp1 b</span>
<span style="color: #7f0055; font-weight: bold;">on</span> a.customer_id = b.customer_id
and a.visit_number = b.visit_number - 1
<span style="color: #7f0055; font-weight: bold;">where</span> a.visit_number = 1
<span style="color: #3f7f59;">-- generate the submission file, making sure 1st April 2011 is the earliest date</span>
<span style="color: #7f0055; font-weight: bold;">select</span></pre><pre style="background-attachment: initial; background-clip: initial; background-color: white; background-image: initial; background-origin: initial; background-position: initial initial; background-repeat: initial initial;">customer_id
,
(<span style="color: #7f0055; font-weight: bold;">case</span>
<span style="color: #7f0055; font-weight: bold;">when</span> dateadd(dd,days_since_previous_visit,visit_date) < <span style="color: #2a00ff;">'2011-04-01'</span>
<span style="color: #7f0055; font-weight: bold;">then</span> <span style="color: #2a00ff;">'2011-04-01'</span>
<span style="color: #7f0055; font-weight: bold;">else</span> dateadd(dd,days_since_previous_visit,visit_date)
<span style="color: #7f0055; font-weight: bold;">end</span>)
<span style="color: #7f0055; font-weight: bold;">as</span> visit_date
,visit_spend
<span style="color: #7f0055; font-weight: bold;">from</span> <span style="color: #3f7f59;">#temp2</span>
<span style="color: #7f0055; font-weight: bold;">order</span> <span style="color: #7f0055; font-weight: bold;">by</span> customer_id
</pre></div><br />
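<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">For anyone who prefers to check the logic outside the database, the same naive rule can be sketched in a few lines of Python. This is only an illustration and not what was submitted: the function name <code>naive_next_visit</code> and its arguments are made up for this sketch, and it assumes each customer's history is a list of (date, spend) pairs. Like the inner join in the SQL, it gives no prediction for a customer with only one visit.</span><br />

```python
from datetime import date

# Hypothetical helper (not from the original post): applies the naive rule
# to a single customer's visit history.
def naive_next_visit(visits, earliest=date(2011, 4, 1)):
    """visits: list of (visit_date, visit_spend) tuples for one customer.

    Returns (predicted_date, predicted_spend), or None when there are
    fewer than two visits and therefore no gap to repeat.
    """
    if len(visits) < 2:
        return None
    ordered = sorted(visits)                     # oldest visit first
    prev_date, _ = ordered[-2]
    last_date, last_spend = ordered[-1]
    gap = last_date - prev_date                  # days between the last two visits
    predicted = max(last_date + gap, earliest)   # never predict before the forecast window
    return predicted, last_spend
```

<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The <code>max</code> call plays the same role as the CASE expression in the SQL: any predicted date that falls before 1st April 2011 is moved forward to that date.</span><br />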
<br />
<b><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Now Let's Look at the Data</span></b><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">In order to get a feel for things, we will aggregate the data to a daily level and look at how the population as a whole goes shopping. We will create a count of the total number of visits per day and generate some time based fields that will help us understand the daily and seasonal patterns in shopping. The SQL below will generate that data, which is small enough to save in Excel or as a text file and then load in your favourite analysis package, or we could just create a table in the database and load directly from there (don't leave as a view as it takes a while to generate).</span><br />
<br />
<div style="border-bottom: black 1px solid; border-left: black 1px solid; border-right: black 1px solid; border-top: black 1px solid; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><br />
<br />
<pre style="background: #ffffff; color: black;"><span style="color: #3f7f59;">-- add a field which is the days since 1st April (1st April 2010)</span>
<span style="color: #7f0055; font-weight: bold;">alter</span> <span style="color: #7f0055; font-weight: bold;">table</span> training <span style="color: #7f0055; font-weight: bold;">add</span> daysSinceStart <span style="color: #7f0055; font-weight: bold;">int</span>
<span style="color: #7f0055; font-weight: bold;">update</span> training <span style="color: #7f0055; font-weight: bold;">set</span> daysSinceStart = <span style="color: #7f0055; font-weight: bold;">DATEDIFF</span>(dd,<span style="color: #2a00ff;">'2010-04-01'</span>,visit_date)
<span style="color: #3f7f59;">-- create a daily summary dataset</span>
<span style="color: #7f0055; font-weight: bold;">select</span> visit_date
<span style="color: #3f7f59;">-- fields of interest</span>
, <span style="color: #7f0055; font-weight: bold;">COUNT</span>(*) <span style="color: #7f0055; font-weight: bold;">as</span> visits
, <span style="color: #7f0055; font-weight: bold;">avg</span>(visit_spend) <span style="color: #7f0055; font-weight: bold;">as</span> avg_visit_spend
, stdev(visit_spend) <span style="color: #7f0055; font-weight: bold;">as</span> stdv_visit_spend
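<span style="color: #3f7f59;">-- time based predictor variables</span>
<span style="color: #3f7f59;">-- (the dow flags assume SET DATEFIRST 7, the US English default, so WEEKDAY 1 = Sunday)</span>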
, <span style="color: #7f0055; font-weight: bold;">min</span>(daysSinceStart) <span style="color: #7f0055; font-weight: bold;">as</span> daysSinceStart
, <span style="color: #7f0055; font-weight: bold;">sin</span>(2 * pi() * (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(DAYOFYEAR, visit_date) * 1.0 / 365.0)) <span style="color: #7f0055; font-weight: bold;">as</span> doySin
, <span style="color: #7f0055; font-weight: bold;">cos</span>(2 * pi() * (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(DAYOFYEAR, visit_date) * 1.0 / 365.0)) <span style="color: #7f0055; font-weight: bold;">as</span> doyCos
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 1 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowSun
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 2 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowMon
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 3 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowTue
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 4 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowWed
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 5 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowThu
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 6 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowFri
, (<span style="color: #7f0055; font-weight: bold;">case</span> <span style="color: #7f0055; font-weight: bold;">when</span> (<span style="color: #7f0055; font-weight: bold;">DATEPART</span>(WEEKDAY, visit_date)) = 7 <span style="color: #7f0055; font-weight: bold;">then</span> 1 <span style="color: #7f0055; font-weight: bold;">else</span> 0 <span style="color: #7f0055; font-weight: bold;">end</span>) <span style="color: #7f0055; font-weight: bold;">as</span> dowSat
<span style="color: #7f0055; font-weight: bold;">from</span> dbo.training
<span style="color: #7f0055; font-weight: bold;">group</span> <span style="color: #7f0055; font-weight: bold;">by</span> visit_date
<span style="color: #7f0055; font-weight: bold;">order</span> <span style="color: #7f0055; font-weight: bold;">by</span> visit_date <span style="color: #7f0055; font-weight: bold;">asc</span>
</pre></div><br />
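<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The sin/cos pair in the SQL above is worth a word. Day of year is a circular quantity, so 31st December and 1st January should look close together to a model, which a plain day number from 1 to 365 does not achieve. The Python sketch below reproduces the same fields for a single date to make that visible. The function name <code>time_features</code> is made up for illustration, and the day-of-week numbering assumes the SQL Server default where Sunday is day 1.</span><br />

```python
import math
from datetime import date

DOW_NAMES = ["dowSun", "dowMon", "dowTue", "dowWed", "dowThu", "dowFri", "dowSat"]

# Hypothetical helper mirroring the time-based fields of the daily summary SQL.
def time_features(d, start=date(2010, 4, 1)):
    day_of_year = d.timetuple().tm_yday
    angle = 2 * math.pi * day_of_year / 365.0
    # isoweekday() is Mon=1..Sun=7; remap to Sun=1..Sat=7 to match
    # DATEPART(WEEKDAY, ...) under the assumed DATEFIRST 7 setting.
    weekday = d.isoweekday() % 7 + 1
    features = {
        "daysSinceStart": (d - start).days,
        "doySin": math.sin(angle),
        "doyCos": math.cos(angle),
    }
    for number, name in enumerate(DOW_NAMES, start=1):
        features[name] = 1 if weekday == number else 0
    return features
```

<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Plotting doySin against doyCos puts every day of the year on a circle, so neighbouring days are always neighbouring points, even across the year end.</span><br />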
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><a href="http://www.tiberius.biz/">Tiberius</a> is an ideal tool for analysing this type of data, as I initially wrote it to look at a similar type of data set - electricity consumption - and the patterns I want to discover will be similar.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I have built a model to predict visits as a function of time. The model predictors are weekly, seasonal and trend components, so we are trying to model the number of visits based solely on time. The plot below shows the errors in the model, with the points in date order.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">(Click on the plots to enlarge)</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDDdPGgrrEOB8is3gYE3NHSwpUFTD0vBP2-IIwJ1HHQM99ghzBDwpKU7ktf5kr2xLzkAcXr_BUNw6IWQ4MTV70eWUPyezMVrkQ1qI9qeNAoEBtuAtX4-3o16nvJXSNKcnKu9X6uFq5Y2cO/s1600/model_errors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDDdPGgrrEOB8is3gYE3NHSwpUFTD0vBP2-IIwJ1HHQM99ghzBDwpKU7ktf5kr2xLzkAcXr_BUNw6IWQ4MTV70eWUPyezMVrkQ1qI9qeNAoEBtuAtX4-3o16nvJXSNKcnKu9X6uFq5Y2cO/s320/model_errors.png" width="320" /></a></div><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">What we see is that there are certain dates where the errors stand out. These are easily identified in Tiberius by rolling your mouse over the image, which displays the dates.</span><br />
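<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Without a tool like Tiberius to hand, dates like these could also be picked out programmatically. The sketch below is one possible approach and not what was done here: it flags any error that sits more than k robust standard deviations from the median, using the median absolute deviation so that the outliers themselves do not inflate the yardstick. The function name <code>flag_outliers</code> and the threshold of 5 are assumptions chosen for illustration.</span><br />

```python
from statistics import median

# Hypothetical helper: flags labels (e.g. dates) whose model error stands out.
def flag_outliers(errors, k=5.0):
    """errors: dict mapping a label to that day's model error.

    Returns the labels whose error is more than k robust standard
    deviations away from the median error.
    """
    values = list(errors.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)   # median absolute deviation
    scale = 1.4826 * mad    # rescales MAD to match a standard deviation for normal data
    if scale == 0:
        return []           # no spread at all, so nothing can stand out
    return [label for label, v in errors.items() if abs(v - centre) > k * scale]
```

<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Passing in a dictionary of date to model error would return the handful of dates worth investigating by hand.</span><br />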
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The two big anomalies are 25th Dec 2010 and 1st Jan 2011 - Christmas Day and New Year's Day. So if we are trying to deduce where this data set is from, China is probably crossed off the list.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Other dates with errors are</span><br />
<br />
<ol style="text-align: left;"><li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">4th April 2010</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">3rd May 2010</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">31st May 2010</span></li>
<li><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">30th August 2010</span></li>
</ol><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><a href="http://www.woodlands-junior.kent.sch.uk/customs/questions/bankholidays.html">A bit of research</a> will identify that these are all England & Wales public holidays, so I think we have figured out where the data set originates from (hello Tesco!).</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The model of visits can be decomposed into weekly, annual and trend components. </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The weekly component shows Sunday is the quietest day, and Friday the busiest day. Monday, Tuesday and Wednesday are all similar.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpclvngm6C_-tzf5Goue_cDq5WS2NcQGV08Hvdy4lCd3V2SVB_rkwOTpebc93qSgBvCPHf75fN8HntcRj2XWl_oRkOmPM-3AEwzsrp8emX6RpEUd7YPNM1VD2iFL53HhgZcKj6eX0Ckm2j/s1600/weekly_component.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpclvngm6C_-tzf5Goue_cDq5WS2NcQGV08Hvdy4lCd3V2SVB_rkwOTpebc93qSgBvCPHf75fN8HntcRj2XWl_oRkOmPM-3AEwzsrp8emX6RpEUd7YPNM1VD2iFL53HhgZcKj6eX0Ckm2j/s320/weekly_component.PNG" width="320" /></a></div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">The Seasonal component shows early April is the busiest time of year.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjezrxgbDxhEWbyoxvN7_NGFDMcNSdsl-Khb4591RXg3-206tFxp2i8tpJLSqnYcL2jbrTecGiiRVjUa1ax2wKKqj0XgUR7IJfMpyS07UMBdCvpkj-hTZKGfhTAVVvB9jvFlsHAZkJo59CL/s1600/seasonal_component.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="239" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjezrxgbDxhEWbyoxvN7_NGFDMcNSdsl-Khb4591RXg3-206tFxp2i8tpJLSqnYcL2jbrTecGiiRVjUa1ax2wKKqj0XgUR7IJfMpyS07UMBdCvpkj-hTZKGfhTAVVvB9jvFlsHAZkJo59CL/s320/seasonal_component.png" width="320" /></a></div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">And unfortunately the trend looks to be going downhill.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8E8iIB-3s9m6tD4sTVNacf5LG90RlYIY1RbDmjymwIi02jbqbeyfOVBdWYqS5giqUYrGo8zMe8C_oQA0teSG1z-68aZCF6HAoaQDnZCFvI8NsmikN2rjICPcz9kPwTnGzLgPln6niFh-5/s1600/trend_component.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="172" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8E8iIB-3s9m6tD4sTVNacf5LG90RlYIY1RbDmjymwIi02jbqbeyfOVBdWYqS5giqUYrGo8zMe8C_oQA0teSG1z-68aZCF6HAoaQDnZCFvI8NsmikN2rjICPcz9kPwTnGzLgPln6niFh-5/s320/trend_component.PNG" width="320" /></a></div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Finally, here is an interesting plot of the standard deviation of the spend:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnWUYj_vkDZM2pCScTH8g2ONBivGfNbnDpORCcfoIrvnXS-k0uCFgi-I4IcSL1klubfbdOZpg-ll7lg3VhfjPqv7lrIbQZEych9y9h5y4DeZLRKRcLSqU7-gtE-sxL0lzHBcJhip-qm2tB/s1600/spend_standard_deviation.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnWUYj_vkDZM2pCScTH8g2ONBivGfNbnDpORCcfoIrvnXS-k0uCFgi-I4IcSL1klubfbdOZpg-ll7lg3VhfjPqv7lrIbQZEych9y9h5y4DeZLRKRcLSqU7-gtE-sxL0lzHBcJhip-qm2tB/s320/spend_standard_deviation.png" width="320" /></a></div><br />
<br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">So we are on our way to understanding this problem. </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Good luck everyone, hope the challenge keeps you busy!</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com6tag:blogger.com,1999:blog-8236580214478682937.post-91629535772205229552011-07-30T17:24:00.000-07:002011-07-30T17:39:03.192-07:00No Change at the Top<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;">I just spotted something on the HHP leaderboard that I have not seen before. There has been no change in the top 12 positions for 1 week. </span><br />
<span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;">Does this mean we are all running out of ideas? Teams are submitting but the incremental improvement seems to be relatively minuscule.</span><br />
<span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial,Helvetica,sans-serif;">Dave (<a href="http://anotherdataminingblog.blogspot.com/2011/05/day-i-led-hhp-briefly.html">He is smarter than me!</a>) and Willem seem to have a big gap at the top, and Edward is close on my tail for 3rd place. The team I would look out for though are some Old Dogs With New Tricks.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJtg3ByhIM-K9FEl5cWpBhhcYKyp9m_lxgfHivulzg20x4LVq23gwBnluWZpZRb9g8QJ52YPHMXwgJLwazl8sOnWmM_Bw2-HHelVhZTZOfNwdX4Iqswm46i68rPlTNjFXmqcQ9RUOyP3QU/s1600/leaderboardnochange1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiJtg3ByhIM-K9FEl5cWpBhhcYKyp9m_lxgfHivulzg20x4LVq23gwBnluWZpZRb9g8QJ52YPHMXwgJLwazl8sOnWmM_Bw2-HHelVhZTZOfNwdX4Iqswm46i68rPlTNjFXmqcQ9RUOyP3QU/s320/leaderboardnochange1.PNG" width="320" /></a></div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUnG3o7xk54OHGJEUnjrxPmPNc0lx-Hu4jY7WIhcOtbvhR4CbWFKE3Y1NfVDhn7Z0FYiOOPdQ8Ag-ugw78C403GTjUbneBifoIg4p3PtMzYiQTwIhcqA_qOeETdxgRTAYbEuebWCOZUoqD/s1600/top50.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUnG3o7xk54OHGJEUnjrxPmPNc0lx-Hu4jY7WIhcOtbvhR4CbWFKE3Y1NfVDhn7Z0FYiOOPdQ8Ag-ugw78C403GTjUbneBifoIg4p3PtMzYiQTwIhcqA_qOeETdxgRTAYbEuebWCOZUoqD/s320/top50.PNG" width="320" /></a></div><br />
</div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-35805984761471073742011-07-29T01:52:00.000-07:002011-07-29T01:52:34.669-07:00A1 Melbourne and MelbURN: R vs SAS<div dir="ltr" style="text-align: left;" trbidi="on"><div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Last night was another double header in the Melbourne analytic scene. Someone else has already described the events I attended, so pointing to their post will save me some typing. </span></div><div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span></div><a href="http://analystfirst.com/2011/07/28/574/a1-melbourne-and-melburn-r-vs-sas/#.TjJ0D_pHtaA.blogger"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">A1 Melbourne and MelbURN: R vs SAS</span></a></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-83571564367871418332011-07-20T17:01:00.000-07:002011-07-20T17:06:19.711-07:00Damn Computers II<div dir="ltr" style="text-align: left;" trbidi="on"><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">In an earlier post I grumbled a bit about how computers can be very frustrating...</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><a href="http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html">http://anotherdataminingblog.blogspot.com/2011/05/damn-computers.html</a></span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Here is another one for the collection!</span><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b>Chrome v Internet Explorer 9</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">As described in the previous post I have just upgraded my PC. Now generally my experience with Windows is that when new, computers are quite quick but with time they seem to slow down and will eventually take 30 mins just to turn on. There is a school of thought that this is an inbuilt Microsoft conspiracy to ensure you keep upgrading your operating system when they release a new one - which seems to be every second week at the moment. The other theory is that in time you install more software that wants to do things for you without asking (why does Adobe think I need it to check for updates every 5 mins?). </span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Wanting to keep my new machine uncluttered, I am resisting installing anything that is not essential at the moment. I declined the 'trial' version of the latest Microsoft Office suite during the install of the o/s. This caused a problem when I quickly wanted to jot some numbers down in a spreadsheet. Rather than install my old licensed version of Office 2000, I downloaded the free <a href="http://www.openoffice.org/">Open Office</a> and was very impressed - it integrates seamlessly with Microsoft Office, so there are no issues reading files others send you, and it basically looks the same. I think I will be sticking with that from now on.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I also resisted downloading the Firefox browser and stuck with IE, until now that is.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I wanted to reply to a blog post. Having spent a while composing the reply, I tried to submit it and got stuck in an endless loop of being prompted to log into my Google account. Having experienced things like this before, my first option these days is to try a different browser. I went for Chrome this time, on the hunch that since Chrome is a Google offering, as is Blogger, it should work. And it did - first time.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Having developed a few web sites, I find these differences between browsers a bit of a pain. You really have to test everything out for all scenarios these days. On the Kaggle leaderboard I'm getting different views depending on what browser I am using. In IE my team name is just in bold; in Firefox and Chrome the whole cell is red, making it easier to spot me, which I think is the look that was intended.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Internet Explorer - no colouring of the cells</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv2sToe7FjMps9CtvQbUgCZWaR9Xa_my53pe_laxoL7vFskUsVBF8pWdCTMrezTFKm-SNro0Mq0iYrz-MHnIbooGC7PVRfrtxf4VDJ03V0Ud3PCtkGoi6OQS6ztELowl1InTBIzpMA6cbn/s1600/teamnameIE.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="95" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv2sToe7FjMps9CtvQbUgCZWaR9Xa_my53pe_laxoL7vFskUsVBF8pWdCTMrezTFKm-SNro0Mq0iYrz-MHnIbooGC7PVRfrtxf4VDJ03V0Ud3PCtkGoi6OQS6ztELowl1InTBIzpMA6cbn/s320/teamnameIE.PNG" width="320" /></a></div><br />
<br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Chrome & Firefox - I think this is the intended look</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"> </span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUdJbawwGPdVpCizzVuWJZJscTD8MtHy3l-k62sbfSllPn_kqwaCRzlR0orHRxu8Kz5iiaVd2gWnuth2mo9pZ3djRGmhsV5CS6AnRc-h4Ev6ttGdjH4H-Sm9n_3tUWRihwXZv_UsaL53tm/s1600/teamnameChrome.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUdJbawwGPdVpCizzVuWJZJscTD8MtHy3l-k62sbfSllPn_kqwaCRzlR0orHRxu8Kz5iiaVd2gWnuth2mo9pZ3djRGmhsV5CS6AnRc-h4Ev6ttGdjH4H-Sm9n_3tUWRihwXZv_UsaL53tm/s320/teamnameChrome.PNG" width="320" /></a></div><span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b>Keyboard Issues</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><b><br />
</b></span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">With this new PC, I am having trouble typing a single quote ( ' ), which is required quite a lot in programming. If I press the key once, nothing happens. If I then press it again, two of them appear and I have to delete one to get the single quote required.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">I'm sure there is a setting somewhere to prevent this, as I've never had this issue before. Rather than waste all day trying to figure it out myself, I'm hoping someone can point me in the right direction.</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span class="Apple-style-span" style="font-family: Arial, Helvetica, sans-serif;">Thanks in advance....</span></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com0tag:blogger.com,1999:blog-8236580214478682937.post-61801389044854745432011-07-20T05:24:00.000-07:002013-10-23T04:11:04.854-07:00Pump Up The Volume<div dir="ltr" style="text-align: left;" trbidi="on"><span style="font-family: Arial, Helvetica, sans-serif;">The HHP competition convinced me that it was time to upgrade my tired old computer. As data sets get bigger, having more RAM and a 64 bit operating system can make things a little quicker.</span><br />
<br />
<span style="font-family: Arial;">For those who aren't aware, if you have a 32-bit operating system then there is no point having more than 4GB of RAM, as it can't be used. So if you need more RAM, then you will also need a 64-bit operating system.</span><br />
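<br />
<span style="font-family: Arial;">As a rough check on that 4GB figure, here is a small R sketch of the arithmetic. It assumes one address per byte and ignores extensions such as PAE, so treat it as an illustration rather than a statement about any particular version of Windows. The variable names are just mine for the example.</span><br />

```r
# bytes reachable with a 32-bit address, assuming one address per byte
addressableBytes <- 2^32

# convert to gigabytes, taking 1GB as 2^30 bytes
addressableGB <- addressableBytes / 2^30

addressableGB
```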
<br />
<span style="font-family: Arial;">I had a chat with the guy in the computer shop and he convinced me that it was going to be worth my while just to buy a new PC rather than upgrade my old one. My specifications were 16GB RAM, Windows 7 64-bit and the latest i7 processor. Size was also a consideration - due to the impending takeover of my office by my daughter and my being relegated to the cupboard under the stairs.</span><br />
<br />
<span style="font-family: Arial;">I considered a laptop, but apparently the i7 processors that go in them are not as fast as the desktop versions. Instead I opted for a really small case, which increased the price as the case was more expensive and apparently you need a smaller motherboard to fit in it. A bog standard graphics card was added and Bob's your uncle.</span><br />
<br />
<span style="font-family: Arial;">The next day the shop rang me - with some bad news. The power supply that is required for the i7 processors won't fit into the nice little case I had opted for, so they recommended another for me that is kind of normal size (but still half the size and weight of my incumbent that comes with 5 fans and neon lights!).</span><br />
<br />
<span style="font-family: Arial;">The downgrade in case saved me some money, and the total cost was AU$1,300 - no monitor, keyboard or mouse included though. Incidentally, it would have still cost about $800 to upgrade my old PC (or so the salesman told me), and I now have 2 (actually about 8 as I never seem to get around to getting rid of my old ones).</span><br />
<br />
<span style="font-family: Arial;">Keen to see what it could do, I ran my old algorithms through the new beast, and it seemed to accomplish the same tasks as my Win 7 64-bit i5 laptop in 66% of the time. I then looked at the task manager to see what was going on under the hood.</span><br />
<br />
<span style="font-family: Arial;">The first thing I noticed is that with no applications running at all, just the operating system, it seems to be using 1.71GB, which seems a little excessive. </span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQY8Gcyql2ZBEWDLN60jLixDPRe6wDweRHRagM5AGSeeelPrumC2XzYrHzaCdhQJ3CSZqJum0ULOadpAYj6IfziD6xjEYWMW2hCXuXJ21RQ71L5c58QA7nO9DeLwvZLRiobn6BTYaiZy_J/s1600/taskmanagerA.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-family: Arial, Helvetica, sans-serif;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQY8Gcyql2ZBEWDLN60jLixDPRe6wDweRHRagM5AGSeeelPrumC2XzYrHzaCdhQJ3CSZqJum0ULOadpAYj6IfziD6xjEYWMW2hCXuXJ21RQ71L5c58QA7nO9DeLwvZLRiobn6BTYaiZy_J/s1600/taskmanagerA.PNG" /></span></a></div><br />
<span style="font-family: Arial, Helvetica, sans-serif;">I then kicked off my algorithms and was a bit dismayed to see that only 14% of the available CPU was being used - so essentially only 1 of the 8 cores.</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoDHOrX5vJrcnvxODRRgDafcvN3spQT9lAV81OANp61kD1LCg7oddgjF_Pe13JVFQS4JtTaBWfDrlt3sIBUo_tktOSuWu7rEDT1sW0crb09ncqDFVxhXrGC4CCMSxlpmtqOxMP9wb3T5fB/s1600/taskmanager.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoDHOrX5vJrcnvxODRRgDafcvN3spQT9lAV81OANp61kD1LCg7oddgjF_Pe13JVFQS4JtTaBWfDrlt3sIBUo_tktOSuWu7rEDT1sW0crb09ncqDFVxhXrGC4CCMSxlpmtqOxMP9wb3T5fB/s320/taskmanager.PNG" width="320" /></a></div><br />
<span style="font-family: Arial, Helvetica, sans-serif;">This is because the code I was running was doing things serially. It was doing a set of tasks that needed the prior task to be completed before the next one could begin. The extra processors would enable me to work on different tasks at the same time (such as browsing the Internet) - but the whole point of the investment was to speed up the number crunching.</span><br />
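<br />
<span style="font-family: Arial, Helvetica, sans-serif;">To put a number on "1 of the 8 cores", here is a minimal sketch. It assumes a version of R that ships the parallel package (for detectCores), and cpuShare is just a helper name made up for this example. One fully busy core out of 8 works out at 12.5% of the total, which is in the same region as the 14% shown in the task manager; my guess is that the remainder is the operating system and other background activity.</span><br />

```r
library(parallel)

# number of logical cores R can see on this machine
# (machine dependent, and can be NA on some platforms)
logicalCores <- detectCores()

# percentage of total CPU that one fully busy core represents
cpuShare <- function(cores) 100 / cores

# on an 8 core machine a serial job tops out at this percentage
cpuShare(8)
```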
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">One solution to utilise the extra processing power would have been to run several algorithms at the same time - but I wanted one algorithm to complete 8 times quicker.</span><br />
<br />
<span style="font-family: Arial;">A particularly popular algorithm is <strong>Random Forests</strong>. Now this is essentially an ensemble of lots of individual decision trees that are all built independently. There is no need to wait for one tree to be finished before we can start building the next - an ideal candidate for parallel processing.</span><br />
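<br />
<span style="font-family: Arial;">The independence is easy to see even without any parallel machinery. The sketch below, which assumes the randomForest package is installed and uses the built-in iris data purely as an example, builds two small forests one after the other and merges them with combine. Neither forest needs anything from the other, so they could just as well have been built at the same time.</span><br />

```r
library(randomForest)

set.seed(42)

# two forests built separately on the same data;
# neither one depends on the other having finished
forestA <- randomForest(Species ~ ., data = iris, ntree = 50)
forestB <- randomForest(Species ~ ., data = iris, ntree = 50)

# merge them into a single forest containing the trees of both
merged <- combine(forestA, forestB)

# total number of trees in the merged forest
merged$ntree
```

<span style="font-family: Arial;">One thing to be aware of is that the merged object does not keep the out-of-bag error summaries (such as the confusion matrix and error rates) of the individual forests, so those need to be estimated some other way if you want them.</span><br />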
<br />
<span style="font-family: Arial;"><span style="background-color: white;">It took me a while to hack together a solution, but eventually I got 100% CPU usage and my model built in 1/8th of the time<span style="background-color: white;">.</span></span></span><span style="font-family: Arial, Helvetica, sans-serif;"><br />
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"></span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuVOTyV_RhTJZN6QMsz2SCC-5y_qPu0uI7wLmrz0fnCdgCS-u0z2XRKHSnFbCfWIQnLOFIY6A-BOfTa_KRp6MDKLqKRM-tTGQfWpZsJo4GWxbLW-7FCWWOMtryshbmdv3zzlp_t6d0mdaH/s1600/taskmanager1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuVOTyV_RhTJZN6QMsz2SCC-5y_qPu0uI7wLmrz0fnCdgCS-u0z2XRKHSnFbCfWIQnLOFIY6A-BOfTa_KRp6MDKLqKRM-tTGQfWpZsJo4GWxbLW-7FCWWOMtryshbmdv3zzlp_t6d0mdaH/s320/taskmanager1.PNG" width="320" /></a></div><br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Below is the R code that will multithread the building of Random Forests (strictly speaking, snowfall starts several separate R processes rather than threads, but the cores are kept busy either way). The multithreading piece came from <a href="http://www.kaggle.com/c/overfitting/forums/t/443/how-to-parallelise-r-code-included">a generous Kaggle post in the Don't Overfit competition</a> by Chris Pardy, and with that bit of info, it was trivial to adapt for my purposes.</span><br />
<br />
<span style="font-family: Arial;">Incidentally, Chris posted this code completely unexpectedly in response to some other code I had already shared. This was a nice surprise and is why the Kaggle concept is such a great learning environment.</span><br />
<br />
<span style="font-family: Arial;">There is now a dilemma about sharing, though - the slight matter of $3 million.</span><br />
<br />
<span style="font-family: Arial;">From what I understand, in order to win one of the milestone cash prizes you essentially have to tell everyone what you did in a way that they can replicate your results. If you do this, you are essentially putting yourself at the bottom of the leaderboard, as everyone else has your solution plus their own.</span><br />
<br />
<span style="font-family: Arial;">The milestone prizes increase in value every 6 months, so would it be a good strategy for the leader not to take the $30,000 in the first one and to hold out for $50,000 6 months later or $60,000 12 months later, given that they will potentially be improving their model by learning from those who do take the money?</span><br />
<br />
<span style="font-family: Arial;">Any thoughts?</span><br />
<br />
<span style="font-family: Arial;"><strong>R code for parallel random forests</strong> </span><br />
<br />
<div style="border: 1px solid black; height: 500px; overflow-x: auto; overflow-y: auto; width: 500px;"><br />
<pre style="background: rgb(255, 255, 255); color: black;"><span style="color: #3f7f59;">######################################</span>
<span style="color: #3f7f59;"># some example code to multithread</span>
<span style="color: #3f7f59;"># the building of random forests</span>
<span style="color: #3f7f59;">######################################</span>
<span style="color: #3f7f59;">#load the required libraries</span>
library(randomForest)
library(snowfall)
library(rlecuyer)
<span style="color: #3f7f59;">#load the data</span>
theData <- iris
<span style="color: #3f7f59;">#set the formula</span>
theFormula <- as.formula(<span style="color: #2a00ff;">"Species ~ . "</span>)
<span style="color: #3f7f59;">#specify number of threads</span>
threads <- 8
<span style="color: #3f7f59;">#total number of trees in the random forest</span>
totalTrees <- 1000
<span style="color: #3f7f59;">#number of trees to build per thread</span>
trees <- as.<span style="color: #7f0055; font-weight: bold;">integer</span>(totalTrees / threads)
<span style="color: #3f7f59;">###########################</span>
<span style="color: #3f7f59;"># now the funky stuff</span>
<span style="color: #3f7f59;">###########################</span>
<span style="color: #3f7f59;">#the function each thread calls</span>
parallelRF <- function(i){
<span style="color: #7f0055; font-weight: bold;">return</span>(randomForest(theFormula, data=theData, ntree=trees))
}
<span style="color: #3f7f59;"># Initialise "cluster"</span>
sfInit(parallel = TRUE, cpus = threads, type = <span style="color: #2a00ff;">"SOCK"</span>)
<span style="color: #3f7f59;"># Make data available to other R instances / nodes</span>
sfExport(list = c(<span style="color: #2a00ff;">"theData"</span>, <span style="color: #2a00ff;">"theFormula"</span>,<span style="color: #2a00ff;">"trees"</span>))
<span style="color: #3f7f59;"># To load a library on each R instance / node</span>
sfClusterEval(library(randomForest))
<span style="color: #3f7f59;"># Use a parallel RNG to avoid correlated random numbers</span>
<span style="color: #3f7f59;"># Requires library(rlecuyer) installed on all nodes</span>
sfClusterSetupRNG()
<span style="color: #3f7f59;"># build the random forests</span>
allForests <- sfClusterApplyLB(1:threads, parallelRF)
sfStop()
<span style="color: #3f7f59;">#everything finished, so</span>
<span style="color: #3f7f59;">#merge all the forests into one</span>
myRandomForest <- allForests[[1]]
<span style="color: #7f0055; font-weight: bold;">for</span>(i in 2:threads){
myRandomForest <- combine(myRandomForest,allForests[[i]])
}
<span style="color: #3f7f59;">#convince ourselves we have them all combined</span>
myRandomForest$ntree
<span style="color: #3f7f59;">#what is important</span>
myRandomForest$importance
</pre><br />
</div></div>Sali Malihttp://www.blogger.com/profile/14741877332990090234noreply@blogger.com7